Preprocessing – similarity measured as similar number of common words
As we have seen previously, the bag-of-word approach is both fast and robust. However, it is not without challenges. Let's dive directly into them.
Converting raw text into a bag-of-words
We do not have to write a custom code for counting words and representing those counts as a vector. Scikit's CountVectorizer does the job very efficiently. It also has a very convenient interface. Scikit's functions and classes are imported via the sklearn package as follows:
>>> from sklearn.feature_extraction.text import CountVectorizer >>> vectorizer = CountVectorizer(min_df=1)
The parameter min_df determines how CountVectorizer treats words that are not used frequently (minimum document frequency). If it is set to an integer, all words occurring less than that value will be dropped. If it is a fraction, all words that occur less than that fraction of the overall dataset will be dropped. The parameter max_df works in...