Topic modeling with MALLET
MALLET is a well-known library for topic modeling. It also supports document classification and sequence tagging. More information about MALLET is available at http://mallet.cs.umass.edu/index.php. To download MALLET, visit http://mallet.cs.umass.edu/download.php (the latest version at the time of writing is 2.0.6). Once downloaded, extract the archive into a directory. The distribution includes sample data in .txt format under the sample-data/web/en path of the MALLET directory.
The first step is to import the files into MALLET's internal format. To do this, open the Command Prompt or Terminal, move to the mallet directory, and execute the following command:
mallet-2.0.6$ bin/mallet import-dir --input sample-data/web/en --output tutorial.mallet --keep-sequence --remove-stopwords
This command will generate the tutorial.mallet file.
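The import step can also be wrapped in a small script that checks whether the output file was actually written. The following is only a sketch: the MALLET_DIR variable and the status messages are assumptions made for illustration, while the import-dir options are the ones used in the command above.

```shell
# Hypothetical variable: path to the extracted MALLET directory (defaults to the current directory).
MALLET_DIR="${MALLET_DIR:-.}"
INPUT_DIR="sample-data/web/en"
OUTPUT_FILE="tutorial.mallet"

if [ -x "$MALLET_DIR/bin/mallet" ]; then
  # Run from the MALLET directory so the relative paths resolve.
  cd "$MALLET_DIR" || exit 1
  bin/mallet import-dir --input "$INPUT_DIR" --output "$OUTPUT_FILE" \
    --keep-sequence --remove-stopwords
  # A non-empty output file suggests the import succeeded.
  if [ -s "$OUTPUT_FILE" ]; then
    status="created"
  else
    status="missing"
  fi
else
  # bin/mallet was not found, so nothing is run.
  status="skipped"
fi
echo "import status: $status"
```

If the script reports "skipped", set MALLET_DIR to the directory where MALLET was extracted and run it again.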
Training
The next step is to build a topic model with the train-topics command and save the output-state, topic-keys, and topics:
mallet-2.0.6$ bin/mallet train-topics -...