The algorithm
We use the list_words() function to get a list of unique words that are more than three characters long and converted to lower case:
def list_words(text):
    # collect unique, lowercased words longer than three characters
    words = []
    words_tmp = text.lower().split()
    for w in words_tmp:
        if w not in words and len(w) > 3:
            words.append(w)
    return words
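As a quick check, a call like the following (an illustrative example, not from the original text) shows what list_words() returns for a short sentence:

text = "The quick brown fox jumps over the lazy dog the fox"
print(list_words(text))
# ['quick', 'brown', 'jumps', 'over', 'lazy']
# 'the', 'fox', and 'dog' are dropped because they are not longer than three characters,
# and repeated words appear only once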
Tip
For a more advanced term-document matrix, we can use Python's textmining package from https://pypi.python.org/pypi/textmining/1.0.
The training() function creates the variables used to store the data needed for the classification. The c_words variable is a dictionary that maps each unique word to its number of occurrences (frequency) per category. The c_categories variable is a dictionary that maps each category to its number of texts. Finally, c_texts and c_total_words store the total counts of texts and words, respectively:
def training(texts):
    c_words = {}
    c_categories = {}
    c_texts = 0
    c_total_words = 0
    # add the classes to the categories
    for t in texts:
        c_texts = c_texts + 1
        ...
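To make the role of these counters concrete, the following standalone sketch (an assumption about the data layout, not the book's training() implementation) shows how the per-category word frequencies could be accumulated with plain dictionaries, assuming texts is a list of (text, category) pairs:

def count_words_by_category(texts):
    # texts is assumed to be a list of (text, category) tuples
    c_words = {}        # word -> {category: frequency}
    c_categories = {}   # category -> number of texts
    c_texts = 0         # total number of texts
    c_total_words = 0   # total number of counted words
    for text, category in texts:
        c_texts += 1
        c_categories[category] = c_categories.get(category, 0) + 1
        for word in list_words(text):
            c_total_words += 1
            c_words.setdefault(word, {})
            c_words[word][category] = c_words[word].get(category, 0) + 1
    return c_words, c_categories, c_texts, c_total_words

For example, count_words_by_category([("Free money now", "spam"), ("Meeting at noon", "ham")]) would record "free" and "money" under the spam category and "meeting" and "noon" under the ham category.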