Who are we?

  • The Diversity Mining Laboratory at Nagasaki University, led by Prof. Tomonari Masada, is an active research group working on data mining.
  • We aim to explore and organize the content diversity latent in massive data, so that people can gain better insight into the data.
  • To this end, we apply various probabilistic modeling techniques, especially topic modeling.

Our Research Topics


Topic Analysis of Minutes of the National Diet of Japan

gijiroku_bubble.png
  • We provide this visualization only in Japanese.


Extraction of Topic Evolutions from References in Scientific Articles and Its GPU Acceleration

TopicEvolveS.png
  • This work extracts latent topic transitions from linked documents as a single transition matrix between latent topics.
  • The paper was accepted as a short paper at CIKM 2012.
    • The original long version of the paper is available here.
  • This is joint work with Prof. Atsuhiro Takasu of NII.
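
As a rough illustration of what a single transition matrix between latent topics looks like, the following Python sketch assumes that per-document topic proportions have already been estimated (for example by LDA). It accumulates, over citation links, the mass passing from the topics of the cited document to the topics of the citing document. This is not the inference procedure of the CIKM 2012 paper; the function name, the `theta` and `links` inputs, and the toy values are all hypothetical.

```python
def topic_transition_matrix(theta, links, num_topics):
    """Estimate a row-stochastic topic transition matrix from citation links.

    theta: dict mapping a document id to its list of topic proportions.
    links: list of (citing_id, cited_id) pairs.
    Entry [j][k] is the share of topic j in cited documents that is paired
    with topic k in the documents citing them.
    """
    counts = [[0.0] * num_topics for _ in range(num_topics)]
    for citing, cited in links:
        for j in range(num_topics):
            for k in range(num_topics):
                counts[j][k] += theta[cited][j] * theta[citing][k]

    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([value / total for value in row])
        else:
            # A topic that never occurs in a cited document gets a uniform row.
            matrix.append([1.0 / num_topics] * num_topics)
    return matrix


# Toy data (hypothetical): three documents, two topics, two citation links.
theta = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
links = [("b", "a"), ("c", "b")]
T = topic_transition_matrix(theta, links, num_topics=2)
for row in T:
    print(row)
```

Each row of `T` sums to one, so row j can be read as a distribution over the topics that follow topic j along citation links.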


Clustering of the MNIST training images of the digit "6"

img.log.msirm3.100.128-1diffG6.0.txt.0.png to img.log.msirm3.100.128-1diffG6.0.txt.27.png (28 images)
  • The result above was obtained on the MNIST dataset.
  • The number of clusters is controlled by a Chinese restaurant process.
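
The way a Chinese restaurant process lets the number of clusters grow with the data can be sketched in a few lines of Python. The sketch samples from the CRP prior only and ignores the image likelihood, so it is not the clustering method itself; the concentration parameter `alpha` and the function name are assumptions made for illustration.

```python
import random


def sample_crp(num_items, alpha, rng):
    """Assign items to clusters ("tables") one at a time.

    An existing cluster k is chosen with weight equal to its current size,
    and a new cluster is opened with weight alpha.
    """
    assignments = []
    cluster_sizes = []
    for _ in range(num_items):
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes


rng = random.Random(0)
assignments, cluster_sizes = sample_crp(num_items=1000, alpha=2.0, rng=rng)
print("number of clusters:", len(cluster_sizes))
```

A larger `alpha` tends to produce more clusters, while the size-proportional weights make large clusters more likely to keep growing.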


Clustering of pixel columns and pixel rows of hand-written digit images

Img.log.msirm3.100.128-1diffG0.0.txt.png to Img.log.msirm3.100.128-1diffG9.0.txt.png (10 images, G0 through G9)
  • This is a visualization of the clustering in [Masada+ ICONIP2013].
  • The result above was obtained on the MNIST dataset.
  • Black pixels are those detected as irrelevant by our method.
  • The clustering results can be used to classify test images.

Extracting visual topics from tens of thousands of images with GPU

visual topics extracted by our GPU-based implementation of collapsed variational Bayesian inference for LDA
VisualTopics.jpg
  • We implemented collapsed variational Bayesian (CVB) inference for LDA in CUDA [Masada+ IEA/AIE2009] and ran inference over 32,000 images.
    • These images are a subset of the Tiny Images dataset[1].
  • Here we give the topic extraction results for only tens of images. The full set of results can be browsed at this Web page.
  • Grayscale pixel values show the topic probability at each pixel of each image.


Segmenting citation data with latent permutations

Seg.png
  • We propose a new citation segmentation method that builds on a proposal by Chen et al.[2]
    • The generalized Mallows model is used to extend LDA so that it can mine topic sequences.
  • We proposed an unsupervised method [Masada+ WISS2010] and its semi-supervised version [Masada+ ICADL2011].
    • The figure above shows a segmentation example obtained with our semi-supervised method.
  • This is joint work with Prof. Atsuhiro Takasu of NII.
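
To make the role of the generalized Mallows model more concrete, the Python sketch below draws a permutation using the standard inversion-count representation: for each item j, a count v_j of larger items that precede it is drawn with probability proportional to exp(-rho_j * v_j), and the permutation is rebuilt from these counts. It covers only the permutation prior, not the segmentation model or its inference; the dispersion values in `rho` and the function names are assumptions made for illustration.

```python
import math
import random


def sample_inversion_counts(rho, rng):
    """Draw inversion counts for items 0..n-1, where n = len(rho) + 1.

    counts[j] is the number of items larger than j that precede j.
    The largest item has no larger items, so it needs no count.
    """
    n = len(rho) + 1
    counts = []
    for j, dispersion in enumerate(rho):
        max_count = n - 1 - j  # only n-1-j items are larger than j
        weights = [math.exp(-dispersion * v) for v in range(max_count + 1)]
        counts.append(rng.choices(range(max_count + 1), weights=weights)[0])
    return counts


def permutation_from_inversions(counts):
    """Rebuild the permutation by inserting items from largest to smallest."""
    n = len(counts) + 1
    perm = [n - 1]
    for j in range(n - 2, -1, -1):
        # All items placed so far are larger than j, so inserting at index
        # counts[j] puts exactly counts[j] larger items in front of j.
        perm.insert(counts[j], j)
    return perm


rng = random.Random(1)
counts = sample_inversion_counts(rho=[1.0, 1.0, 1.0, 1.0], rng=rng)
perm = permutation_from_inversions(counts)
print(counts, perm)
```

All-zero counts give the identity ordering, and larger values of `rho` concentrate the distribution more tightly around it.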

Extracting topical trends

vanilla LDA
trends-lda.jpg
LYNDA [Masada+ CIKM2009]
trends-lynda.jpg
BToT [Masada+ ISNN2010]
trends-btot.jpg
  • We propose two methods, LYNDA and BToT, for extracting topical trends from a document set.
  • The three figures above give the results obtained by LDA, LYNDA [Masada+ CIKM2009], and BToT [Masada+ ISNN2010].
    • Each colored region corresponds to a different topic.
    • The vertical axis represents document dates, ranging from Jan. 1, 2002 to Dec. 31, 2005.
    • The horizontal axis represents topic popularity at each date.
    • The analyzed data is a set of Yomiuri newspaper articles.
    • This is joint research with Prof. Atsuhiro Takasu of NII.
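
One simple way to obtain per-date topic popularity of the kind plotted above is to sum the per-document topic proportions of all documents sharing a date and normalize the sums. The Python sketch below does this for the output of a time-unaware model such as vanilla LDA; it does not reproduce LYNDA or BToT, and the function name, input layout, and toy values are hypothetical.

```python
from collections import defaultdict


def topic_popularity_by_date(doc_dates, doc_topics, num_topics):
    """Aggregate per-document topic proportions into per-date popularity.

    doc_dates: list of date keys, one per document.
    doc_topics: list of topic-proportion lists, aligned with doc_dates.
    Returns a dict mapping each date to a normalized popularity vector.
    """
    totals = defaultdict(lambda: [0.0] * num_topics)
    for date, proportions in zip(doc_dates, doc_topics):
        for k, p in enumerate(proportions):
            totals[date][k] += p

    popularity = {}
    for date, sums in totals.items():
        norm = sum(sums)
        if norm > 0:
            popularity[date] = [s / norm for s in sums]
        else:
            popularity[date] = list(sums)
    return popularity


# Toy data (hypothetical): three documents, two topics, two dates.
doc_dates = ["2002-01-01", "2002-01-01", "2002-01-02"]
doc_topics = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
popularity = topic_popularity_by_date(doc_dates, doc_topics, num_topics=2)
for date in sorted(popularity):
    print(date, popularity[date])
```

Stacking the resulting vectors along the date axis gives one colored region per topic.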

Extracting per-topic temporal transitions of popular words from parallel corpora

WordRanking.jpg
  • We propose a new topic model [Masada+ PAKDD2011] for extracting temporal transitions of word probabilities for each topic.
  • The model is extended to parallel corpora and applied to Chinese-English abstracts of computer science papers.
    • The years of the abstracts range from 2000 to 2009.
    • We show only five of the tens of extracted topics.
    • Each topic is represented by the three most probable Chinese and English words in each year.
    • No Chinese-English dictionaries are used.
    • This is joint research with Prof. Atsuhiro Takasu of NII. The dataset was collected and cleaned up by Haipeng Zhang.
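
The per-year word ranking described above can be sketched as follows, assuming the per-topic word probabilities for each year have already been estimated. The function name, the input layout, and the toy vocabulary and probabilities are hypothetical and are not taken from the actual results.

```python
def top_words_per_year(word_probs, top_n=3):
    """Return the top_n most probable words of one topic for each year.

    word_probs: dict mapping a year to a dict of word -> probability.
    """
    ranking = {}
    for year, probs in word_probs.items():
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        ranking[year] = [word for word, _ in ranked[:top_n]]
    return ranking


# Toy probabilities (hypothetical) for a single topic in two years.
word_probs = {
    2000: {"network": 0.4, "protocol": 0.3, "routing": 0.2, "wireless": 0.1},
    2009: {"network": 0.3, "protocol": 0.1, "routing": 0.2, "wireless": 0.4},
}
ranking = top_words_per_year(word_probs)
for year in sorted(ranking):
    print(year, ranking[year])
```

For a parallel corpus, the same ranking would be computed separately over the Chinese and the English vocabulary of each topic.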

  1. A. Torralba, R. Fergus, and W. T. Freeman. Tiny Images. MIT-CSAIL-TR-2007-024, 2007.
  2. H. Chen, S. R. K. Branavan, R. Barzilay, and D. R. Karger. Global Models of Document Structure Using Latent Permutations. NAACL/HLT 2009.