Who are we?

The Diversity Mining Laboratory at Nagasaki University, led by Prof. Tomonari Masada, is an active research group working on data mining.
Our aim is to explore and organize the diversity of content latent in massive data, so that people can gain better insight into that data.
We apply various probabilistic modeling techniques, especially topic modeling, to achieve this aim.

Our Research Topics

This work aims to extract latent topic transitions from linked documents as a single transition matrix between latent topics.
The paper was accepted as a short paper at CIKM 2012.
- The original long version of the paper is available here.
This is joint work with Prof. Atsuhiro Takasu at NII.
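
To make the idea of a single transition matrix between latent topics concrete, here is a minimal sketch. It is not the model from the CIKM 2012 paper: it assumes that per-document topic proportions are already available and simply accumulates, for every link, the product of the source and destination proportions. The function name `estimate_transition_matrix`, the `theta` and `links` inputs, and the `smoothing` constant are all hypothetical.

```python
def estimate_transition_matrix(theta, links, smoothing=1e-9):
    """Naive topic-to-topic transition estimate (illustrative assumption only).

    theta: list of per-document topic proportions, one list of K floats each.
    links: list of (src, dst) document index pairs, e.g. citation links.
    Returns a K x K matrix whose rows each sum to 1.
    """
    num_topics = len(theta[0])
    # Start from a tiny constant so that rows without any mass stay well defined.
    counts = [[smoothing] * num_topics for _ in range(num_topics)]
    for src, dst in links:
        for a in range(num_topics):
            for b in range(num_topics):
                # Soft count of moving from topic a (source) to topic b (destination).
                counts[a][b] += theta[src][a] * theta[dst][b]
    # Normalize each row into a probability distribution over destination topics.
    return [[c / sum(row) for c in row] for row in counts]


if __name__ == "__main__":
    theta = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    links = [(0, 1), (2, 1)]
    for row in estimate_transition_matrix(theta, links):
        print(row)
```

Row-normalizing means each row reads as "given the source topic, how likely is each destination topic". A model that infers the matrix jointly with the topics, as the paper describes, would replace this two-stage heuristic.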

Clustering of pixel columns and pixel rows of hand-written digit images

This is a visualization of our clustering method [Masada+ ICONIP2013]. The result given above was obtained on the MNIST dataset. Black pixels are those our method detected as irrelevant. The clustering results can be used to classify test images.

Visual topics extracted by our GPU-based implementation of collapsed variational Bayesian inference for LDA

We implemented collapsed variational Bayesian inference (CVB) for LDA in CUDA [Masada+ IEA/AIE2009] and ran inference over 32,000 images.
- These images are a subset of the Tiny Images dataset^[1].
Here we show topic extraction results for only a few dozen images. The full set of results can be browsed on this Web page.
Grayscale pixel values show the topic probabilities at each pixel of each image.
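
For readers unfamiliar with collapsed variational inference, the following is a minimal CPU sketch in plain Python. It is not our CUDA implementation and it is not full CVB: it uses the simpler zero-order variant commonly called CVB0, which drops the second-order correction terms. The function name `cvb0_lda`, its parameters, and the toy input are assumptions made for illustration.

```python
import random


def cvb0_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Zero-order collapsed variational inference (CVB0) for LDA; a sketch only.

    docs: list of documents, each a non-empty list of word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word probability estimates.
    """
    rng = random.Random(seed)
    n_dk = [[0.0] * num_topics for _ in docs]             # expected doc-topic counts
    n_kw = [[0.0] * vocab_size for _ in range(num_topics)]  # expected topic-word counts
    n_k = [0.0] * num_topics                               # expected topic totals
    gamma = []  # gamma[d][i]: topic distribution of token i in document d
    for d, doc in enumerate(docs):
        rows = []
        for w in doc:
            g = [rng.random() + 1e-3 for _ in range(num_topics)]
            total = sum(g)
            g = [x / total for x in g]
            rows.append(g)
            for k in range(num_topics):
                n_dk[d][k] += g[k]
                n_kw[k][w] += g[k]
                n_k[k] += g[k]
        gamma.append(rows)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = gamma[d][i]
                new = []
                for k in range(num_topics):
                    # Expected counts with this token's own contribution removed.
                    a = n_dk[d][k] - old[k]
                    b = n_kw[k][w] - old[k]
                    c = n_k[k] - old[k]
                    new.append((a + alpha) * (b + beta) / (c + vocab_size * beta))
                total = sum(new)
                new = [x / total for x in new]
                for k in range(num_topics):
                    delta = new[k] - old[k]
                    n_dk[d][k] += delta
                    n_kw[k][w] += delta
                    n_k[k] += delta
                gamma[d][i] = new

    theta = [[(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    return theta, phi


if __name__ == "__main__":
    toy_docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 0, 1]]
    theta, phi = cvb0_lda(toy_docs, num_topics=2, vocab_size=4, iters=20)
    print(theta)
```

For images, each "document" would be an image and each "word" a quantized visual feature; that mapping is also an assumption here. The inner loop over topics is independent per token, which is the kind of work a GPU implementation can spread across threads.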

We propose a new citation segmentation method that builds on the model proposed by Chen et al.^[2]
- The generalized Mallows model is used to extend LDA so that it can mine topic sequences.
We proposed an unsupervised method [Masada+ WISS2010] and its semi-supervised version [Masada+ ICADL2011].
- The figure above shows a segmentation example obtained with our semi-supervised method.
This is joint work with Prof. Atsuhiro Takasu at NII.

vanilla LDA

We propose two methods, LYNDA and BToT, for extracting topical trends from a document set.
The three figures above give the results obtained by LDA, LYNDA [Masada+ CIKM2009], and BToT [Masada+ ISNN2010].
- Each colored region corresponds to a different topic.
- The vertical axis represents document dates, ranging from Jan. 1, 2002 to Dec. 31, 2005.
- The horizontal axis represents topic popularity at each date.
- The analyzed data is a set of Yomiuri newspaper articles.
- This is joint research with Prof. Atsuhiro Takasu at NII.
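
As a rough illustration of what "topic popularity at each date" can mean, the sketch below sums per-document topic proportions by date and normalizes them into shares. This definition of popularity is an assumption made for the example; LYNDA and BToT model time inside the topic model rather than aggregating afterwards. The function name `topic_popularity_by_date` and its inputs are hypothetical.

```python
def topic_popularity_by_date(doc_dates, theta):
    """Share of each topic per date, from per-document topic proportions.

    doc_dates: list of ISO date strings (YYYY-MM-DD), one per document.
    theta: list of per-document topic proportions, aligned with doc_dates.
    Returns a dict mapping each date to a list of topic shares summing to 1.
    """
    num_topics = len(theta[0])
    totals = {}
    for date, proportions in zip(doc_dates, theta):
        row = totals.setdefault(date, [0.0] * num_topics)
        for k, p in enumerate(proportions):
            row[k] += p
    popularity = {}
    # ISO date strings sort chronologically when sorted as plain strings.
    for date in sorted(totals):
        total = sum(totals[date])
        popularity[date] = [v / total for v in totals[date]]
    return popularity


if __name__ == "__main__":
    dates = ["2002-01-01", "2002-01-01", "2002-01-02"]
    theta = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.75]]
    for date, shares in topic_popularity_by_date(dates, theta).items():
        print(date, shares)
```

Plotting these shares as stacked colored regions along a date axis gives a chart of the same general shape as the figures described above.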

We propose a new topic model [Masada+ PAKDD2011] for extracting temporal transitions of word probabilities for each topic.
We extended the model to parallel corpus analysis and applied it to Chinese-English abstracts of computer science papers.
- The abstracts date from 2000 to 2009.
- We show only five of the several dozen extracted topics.
- Each topic is represented by its three most probable Chinese words and three most probable English words in each year.
- No Chinese-English dictionaries are used.
- This is joint research with Prof. Atsuhiro Takasu at NII. The dataset was collected and cleaned up by Haipeng Zhang.
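
The per-year word lists described above can be read off from the word probabilities of one topic. The following sketch assumes those probabilities are stored per year as a list aligned with a vocabulary; the function name `top_words_per_year`, the data layout, and the toy numbers are hypothetical and do not come from our experiments.

```python
def top_words_per_year(word_probs, vocab, n=3):
    """Pick the n most probable words of one topic for each year.

    word_probs: dict mapping a year to a list of word probabilities for one
                topic in one language, aligned with vocab.
    vocab: list of words in that language.
    Returns a dict mapping each year to its n highest-probability words.
    """
    result = {}
    for year, probs in word_probs.items():
        # Rank word ids by probability, highest first.
        ranked = sorted(range(len(vocab)), key=lambda w: probs[w], reverse=True)
        result[year] = [vocab[w] for w in ranked[:n]]
    return result


if __name__ == "__main__":
    vocab = ["model", "data", "network", "query"]
    word_probs = {
        2000: [0.1, 0.4, 0.3, 0.2],
        2009: [0.5, 0.1, 0.3, 0.1],
    }
    print(top_words_per_year(word_probs, vocab))
```

For the parallel corpus, the same function would be called once with the Chinese vocabulary and once with the English vocabulary of a topic, so no dictionary between the two languages is needed.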

[1] A. Torralba, R. Fergus, and W. T. Freeman. Tiny Images. MIT-CSAIL-TR-2007-024, 2007.
[2] H. Chen, S. R. K. Branavan, R. Barzilay, and D. R. Karger. Global Models of Document Structure Using Latent Permutations. NAACL/HLT 2009.