dtmpy

A Python module for fast Dynamic Topic Modeling.

This module wraps the original C/C++ code by David M. Blei and Sean M. Gerrish.

I've refactored the original code to wrap the main function call in a DTM class with Python bindings. Other code changes are listed below.

Usage

Below is an example of how to use this package. Reference the dtmpy.DTM docstring for all possible keyword arguments.

import dtmpy

# Initialize DTM model flags
dtm = dtmpy.DTM(corpus_prefix="model_inputs/file_prefix",
                output_name="model_outputs",
                n_topics=10)

# Fit the model to a bag-of-words corpus and write outputs; the second
# argument lists the number of documents in each time slice
dtm.fit(my_corpus, n_docs_per_timestamp)


# The proportion of topic 5 in document 3
topic_mixtures = dtm.read_topic_mixtures()
print(topic_mixtures[3, 5])


# The probability of term 32 being in topic 5 at time 7
topic5 = dtm.read_topic_term_probabilities(5)
print(topic5[32, 7])

There is also a sample seq file and mult file in the example directory, which can be fit as a test case using the code below:

import dtmpy

dtm = dtmpy.DTM(corpus_prefix="example/test", output_name="model_outputs")

# Load corpus and time intervals from existing seq and mult files
corpus, time_slices = dtm.read_corpus()

# Fit the dynamic topic model
dtm.fit(corpus, time_slices)

Install

Install dependencies:

  • cmake
  • libgsl-dev
  • pybind11-dev

These can be installed on Ubuntu with:

apt-get install cmake libgsl-dev pybind11-dev

or on macOS using Homebrew with:

brew install cmake gsl pybind11

After installing the dependencies, dtmpy can be installed with:

pip install dtmpy

or directly from source:

git clone https://github.com/jeffmm/dtmpy
cd dtmpy
pip install .

Model inputs

Fitting a corpus

The dtmpy.DTM class can fit a DTM model to a bag-of-words corpus in the form of a list of lists (one list per document), where each document list contains a tuple for each unique word in the document along with its count. As a simple example:

# document 1 is the sentence "foo bar bar baz"
# document 2 is the sentence "bar baz bang baz"

dictionary = {0: "foo",
              1: "bar",
              2: "baz",
              3: "bang"}

bow_corpus = [[(0, 1), (1, 2), (2, 1)],
              [(1, 1), (2, 2), (3, 1)]]

The bow_corpus list is what gets passed to the DTM.fit method; the dictionary is only a convenience for post-processing. The documents in the corpus should be listed in chronological order.

DTM.fit also takes time_slices, a list giving the number of documents in each time interval (time slice) of the corpus. Alternatively, you can pass n_time_slices, the number of time slices to use, which spreads the documents evenly over the time intervals.
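If your documents carry timestamps rather than pre-computed slice counts, the time_slices list can be built by counting documents per period. A minimal sketch in plain Python (no dtmpy required; the variable names here are illustrative):

```python
from collections import Counter

# Publication years for six documents, already in chronological order
doc_years = [2001, 2001, 2002, 2002, 2002, 2003]

# Count how many documents fall into each time slice
counts = Counter(doc_years)
time_slices = [counts[year] for year in sorted(counts)]

print(time_slices)  # [2, 3, 1]
```

The resulting list can be passed to DTM.fit as the second argument, since its entries sum to the total number of documents.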

Mult and seq files

The original Dynamic Topic Model takes two files as inputs, which are automatically generated from the corpus and time slices when passed to the DTM.fit method:

  • foo-mult.dat (the mult file)
  • foo-seq.dat (the seq file)

If the above files are in the directory bar relative to the working directory, then the corpus_prefix argument will be bar/foo.

If the dataset consists of N documents, the mult file has N lines, where line i lists the number of unique words in document i followed by an index:count pair for each word in the document, like so:


unique_word_count_doc_1 index_11:count_11 index_12:count_12 ... index_1n:count_1n
unique_word_count_doc_2 index_21:count_21 index_22:count_22 ...
...
unique_word_count_doc_N index_N1:count_N1 index_N2:count_N2 ...

Each word index corresponds to a unique word in the vocabulary (i.e., listed in a dictionary somewhere).

The documents in the mult file should be in chronological order so that they line up with the per-slice document counts in the seq file.

The seq file gives the number of documents in each time slice of the DTM, and has the form


n_times
number_docs_time_1
number_docs_time_2
...
number_docs_time_n_times
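To make the format concrete, here is a sketch that produces the mult and seq lines for the two-document corpus from the earlier example (plain Python; DTM.fit generates these files for you, so this is only illustrative):

```python
# Bag-of-words corpus from the earlier example
bow_corpus = [[(0, 1), (1, 2), (2, 1)],
              [(1, 1), (2, 2), (3, 1)]]
time_slices = [1, 1]  # one document per time slice

# One mult line per document: unique word count, then index:count pairs
mult_lines = [
    " ".join([str(len(doc))] + [f"{idx}:{cnt}" for idx, cnt in doc])
    for doc in bow_corpus
]

# seq file: number of slices, then the document count for each slice
seq_lines = [str(len(time_slices))] + [str(n) for n in time_slices]

print("\n".join(mult_lines))  # 3 0:1 1:2 2:1
                              # 3 1:1 2:2 3:1
print("\n".join(seq_lines))   # 2
                              # 1
                              # 1
```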

Two additional files that are useful to keep alongside these are:

  • a file with all of the words in the vocabulary, arranged in the same order as the word indices
  • a file with information on each of the documents, arranged in the same order as the docs in the mult file

Loading previous inputs

Previously-generated mult and seq files can be read and used to generate corpus and time slices lists to be used in other model fits. This can be done by using the DTM.read_corpus method. Alternatively, the DTM.load method can be used to prepare the class to read previously-generated model outputs.
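For reference, parsing mult and seq lines back into the corpus and time_slices lists can be sketched in plain Python. This is only an illustration of the file format, not the actual DTM.read_corpus implementation:

```python
# In-memory stand-ins for the contents of the mult and seq files
mult_lines = ["3 0:1 1:2 2:1", "3 1:1 2:2 3:1"]
seq_lines = ["2", "1", "1"]

# Each mult line: unique word count followed by index:count pairs
corpus = []
for line in mult_lines:
    tokens = line.split()
    n_unique = int(tokens[0])
    doc = [tuple(int(x) for x in t.split(":")) for t in tokens[1:]]
    assert len(doc) == n_unique  # sanity check against the leading count
    corpus.append(doc)

# First seq line is the number of slices; the rest are per-slice counts
time_slices = [int(n) for n in seq_lines[1:]]
assert len(time_slices) == int(seq_lines[0])

print(corpus)       # [[(0, 1), (1, 2), (2, 1)], [(1, 1), (2, 2), (3, 1)]]
print(time_slices)  # [1, 1]
```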

Model outputs

Reading model outputs

The raw output files (discussed in the next section) can be read into numpy arrays. The DTM.read_topic_term_probabilities method is used to see the estimated probabilities that terms within the corpus appeared in the specified topic at each time interval. DTM.read_topic_mixtures is used to see the estimated proportions of each topic within each document. See the Usage section above for an example.
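Once read into numpy arrays, the outputs are easy to post-process, for example to pull out the dominant topic per document. The array below is synthetic, and the documents × topics orientation of the mixtures array is an assumption to check against your own output:

```python
import numpy as np

# Hypothetical topic mixtures for 3 documents over 4 topics (rows sum to 1)
topic_mixtures = np.array([[0.7, 0.1, 0.1, 0.1],
                           [0.2, 0.5, 0.2, 0.1],
                           [0.1, 0.1, 0.1, 0.7]])

# Dominant topic for each document
dominant = topic_mixtures.argmax(axis=1)
print(dominant)  # [0 1 3]
```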

Raw output files

The model outputs the following files to the directory specified by the corpus_prefix and output_name arguments.

  • topic-xxx-var-e-log-prob.dat: the e-betas (word distributions) for topic xxx for all times. This is in row-major form, e.g. (in R):
a = scan("topic-002-var-e-log-prob.dat")
b = matrix(a, ncol=10, byrow=TRUE)

# The probability of term 100 in topic 2 at time 3:
exp(b[100, 3])
  • gam.dat: The gammas associated with each document. Divide these by the sum for each document to get expected topic mixtures.
a = scan("gam.dat")
b = matrix(a, ncol=10, byrow=TRUE)
rs = rowSums(b)
e.theta = b / rs

# Proportion of topic 5 in document 3:
e.theta[3, 5]
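The same post-processing can be done with numpy instead of R; note that R is 1-indexed, so b[3, 5] in R corresponds to b[2, 4] in Python. A sketch using synthetic gamma values and 10 topics, as in the R example:

```python
import numpy as np
import os
import tempfile

n_topics = 10

# Synthetic flat gammas for two documents (a stand-in for the real gam.dat)
flat = np.random.default_rng(0).uniform(1, 5, size=2 * n_topics)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "gam.dat")
    np.savetxt(path, flat)
    # Equivalent of R's scan("gam.dat") + matrix(a, ncol=10, byrow=TRUE):
    gammas = np.loadtxt(path).reshape(-1, n_topics)

# Normalize each row to get expected topic mixtures
e_theta = gammas / gammas.sum(axis=1, keepdims=True)

# e_theta[d, k] is the proportion of topic k in document d; rows sum to 1
assert np.allclose(e_theta.sum(axis=1), 1.0)
```

The same reshape-then-transform pattern applies to the topic-xxx-var-e-log-prob.dat files, with np.exp in place of the row normalization.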

If you run this software in "dim" mode to find document influence, it will also create the following files:

  • influence_time-yyy : the influence of documents at time yyy for each topic, where the time index is determined by your seq file and the document index by the ordering of documents in the mult file.
a = scan("influence_time-010")
b = matrix(a, ncol=10, byrow=TRUE)

# The influence of the 2nd document on topic 5:
b[2, 5]

Changes to the original code

  • Refactored the original main function into the fit method of the DTM class
  • Removed the dependence on gflags and converted global flags to extern variables initialized by DTM
  • Added Python bindings to the DTM class using pybind11
  • Better error handling to prevent calls like exit from crashing the Python kernel
  • Better use of const to reduce compiler warnings
  • Redirected useful outputs for monitoring model-fitting convergence to a log.txt file in the output directory rather than stderr
  • Fixed a bug that made LDA initialization finish before converging or reaching its max iterations
  • Saner defaults:
    • changed corpus_prefix and outname to be required arguments
    • changed ntopics from -1.0 to 10
    • changed alpha from -10 to 0.01
    • changed initialize_lda flag to true

About the original Dynamic Topic Model code

Dynamic Topic Models and the Document Influence Model

This implements topics that change over time (Dynamic Topic Models) and a model of how individual documents predict that change.

This code is the result of work by David M. Blei and Sean M. Gerrish.

(C) Copyright 2006, David M. Blei

(C) Copyright 2011, Sean M. Gerrish

It includes software corresponding to models described in the following papers:

[1] D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

[2] S. Gerrish and D. Blei. A Language-based Approach to Measuring Scholarly Impact. In Proceedings of the 27th International Conference on Machine Learning, 2010.

These files are part of DIM.

DIM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

DIM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

