The algorithm
We use the list_words() function to get a list of the unique lowercase words that are more than three characters long:
def list_words(text):
  # Lowercase the text and split it on whitespace
  words = []
  words_tmp = text.lower().split()
  for w in words_tmp:
    # Keep each word only once and skip words of three characters or fewer
    if w not in words and len(w) > 3:
      words.append(w)
  return words
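For example, with a short hypothetical sentence as input, the function returns each qualifying word exactly once (note that, because the text is only lowercased and split on whitespace, any punctuation stays attached to the words):

print(list_words("SPAM mail offers cheap cheap watches"))
# ['spam', 'mail', 'offers', 'cheap', 'watches']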
Tip
For a more advanced term-document matrix, we can use Python's textmining package from https://pypi.python.org/pypi/textmining/1.0.
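A minimal sketch of how that package is used, adapted from the usage example distributed with it (check the method names against the version you install):

import textmining

# Build a term-document matrix from a few hypothetical documents
tdm = textmining.TermDocumentMatrix()
tdm.add_doc('John and Bob are brothers')
tdm.add_doc('John went to the store')
tdm.add_doc('Bob went to the store too')
# cutoff=1 keeps every word that appears in at least one document
tdm.write_csv('matrix.csv', cutoff=1)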
The training() function creates the variables that store the data needed for the classification. The c_words variable is a dictionary that maps each unique word to its number of occurrences (frequency) in each category. The c_categories variable is a dictionary that stores each category and the number of texts it contains. Finally, c_texts and c_total_words store the total counts of texts and words, respectively:
def training(texts):
  c_words = {}
  c_categories = {}
  c_texts = 0
  c_total_words = 0
  #add the classes to the categories
  for t in texts:
    c_texts = c_texts + 1
  ...
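The listing is truncated here. A minimal sketch of how the remaining counting logic could look, based only on the variable descriptions above (assuming texts is a list of (text, category) pairs and reusing list_words() from the previous step), is:

# Hypothetical sketch, not the original listing: it fills c_categories,
# c_words, and c_total_words as described in the text.
def training(texts):
  c_words = {}
  c_categories = {}
  c_texts = 0
  c_total_words = 0
  #add the classes to the categories
  for t in texts:
    c_texts = c_texts + 1
    c_categories[t[1]] = c_categories.get(t[1], 0) + 1
  #count word frequencies per category
  for t in texts:
    for w in list_words(t[0]):
      c_total_words = c_total_words + 1
      if w not in c_words:
        c_words[w] = {}
      c_words[w][t[1]] = c_words[w].get(t[1], 0) + 1
  return (c_words, c_categories, c_texts, c_total_words)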