Technical requirements
We will use the following packages in this chapter: spacy, matplotlib, wordcloud, and pyLDAvis. They are included in the Poetry environment and listed in the requirements.txt file.
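Before starting the recipes, it can help to confirm that these packages are actually available in the active environment. The following is a minimal sketch, not part of the book's code: the helper name missing_packages and the REQUIRED list are hypothetical, and it assumes the import names match the package names above (with pyLDAvis imported in mixed case).

```python
from importlib.util import find_spec

# Import names assumed for the packages listed above.
# pyLDAvis is imported with mixed case.
REQUIRED = ["spacy", "matplotlib", "wordcloud", "pyLDAvis"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

If anything is reported as missing, reinstalling from the Poetry environment or the requirements.txt file should resolve it.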
We will be using two datasets in this chapter. The first is the BBC news dataset, located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/bbc_train.json and https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/data/bbc_test.json.
Note
This dataset is used in this book with permission from the researchers. The original paper associated with this dataset is as follows:
Derek Greene and Pádraig Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering,” in Proc. 23rd International Conference on Machine Learning (ICML’06), 2006.
All rights, including copyright, in the text content of the original articles...