
You're reading from Python Artificial Intelligence Projects for Beginners

Product type: Book
Published in: Jul 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789539462
Edition: 1st Edition
Author (1)
Dr. Joshua Eckroth

Joshua Eckroth is an Assistant Professor of Computer Science at Stetson University, where he teaches AI, big data mining and analytics, and software engineering. He earned his PhD from The Ohio State University in AI and Cognitive Science. Dr. Eckroth also serves as Chief Architect at i2k Connect, which focuses on transforming documents into structured data using AI and enriched with subject matter expertise. Dr. Eckroth has previously published two video series with Packt, Python Artificial Intelligence Projects for Beginners and Advanced Artificial Intelligence Projects with Python. His academic publications can be found on Google Scholar.

Chapter 3. Applications for Comment Classification

In this chapter, we'll give an overview of the bag-of-words model for text classification. We'll predict YouTube comment spam using bag of words and a random forest. Then we'll look at Word2Vec models and predict whether reviews are positive or negative using Word2Vec and the k-nearest neighbor classifier.

In particular, we will focus on text and words: we will classify internet comments as spam or not spam, and identify internet reviews as positive or negative.

But, before we start, we'll answer the following question: what makes text classification an interesting problem?

Text classification


To answer this question, we will consider the famous iris flower dataset as an example. The following image shows the Iris versicolor species. To identify the species, we need more information than just an image; measurements such as the flower's petal length, petal width, sepal length, and sepal width help us identify it better:

The dataset contains examples not only of versicolor but also of setosa and virginica. Every example in the dataset contains these four measurements. The dataset contains 150 examples, with 50 examples of each species. Given the same four measurements for a new flower, we can use a decision tree or any other model to predict its species. We expect flowers of the same species to have similar measurements. Similarity can be defined in many ways; here, we treat each flower as a point on a graph and take similarity to mean closeness between points. The following...
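As a rough illustration of similarity as closeness, here is a minimal sketch that predicts a species by finding the closest known flower. It uses a nearest-neighbor rule rather than a decision tree, only a handful of example rows instead of all 150, and Euclidean distance as the assumed measure of closeness; the function name `nearest_species` is made up for this example.

```python
import math

# Each entry: (sepal length, sepal width, petal length, petal width) in cm,
# followed by the species. Only a few illustrative rows, not the full dataset.
FLOWERS = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def distance(a, b):
    """Euclidean distance between two equal-length tuples of measurements."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_species(measurements):
    """Return the species of the known flower closest to `measurements`."""
    closest = min(FLOWERS, key=lambda item: distance(item[0], measurements))
    return closest[1]


new_flower = (6.0, 2.9, 4.4, 1.4)
print(nearest_species(new_flower))
```

With so few known flowers this is only a toy, but it shows why four numeric measurements make the problem easy to set up: each flower is already a point in a four-dimensional space.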

Detecting YouTube comment spam


In this section, we're going to look at a technique for detecting YouTube comment spam using bags of words and random forests. We'll use a straightforward dataset of about 2,000 comments from popular YouTube videos (https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection). Each row has a comment followed by a value of 1 or 0, marking it as spam or not spam.

First, we will load a single file. The dataset is actually split into five different files, one per video. Our set of comments comes from the PSY-Gangnam Style video:

Then we will print a few comments as follows:
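The loading and printing steps might look like the following minimal sketch. It uses Python's standard `csv` module and a tiny in-memory stand-in for the file; the sample rows are invented for illustration, and the column names (`CONTENT`, `CLASS`, and the others) are assumptions about the file layout rather than something confirmed here.

```python
import csv
import io

# Hypothetical stand-in for one of the dataset's CSV files. These rows are
# invented for illustration and are not taken from the real dataset.
SAMPLE_CSV = """COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
c1,alice,2013-11-07,This song never gets old,0
c2,bob,2013-11-08,Great video and great dance,0
c3,carol,2013-11-09,Please subscribe to my channel for free gifts,1
"""


def load_comments(csv_file):
    """Read rows and keep only the comment text and its 0/1 spam label."""
    reader = csv.DictReader(csv_file)
    return [(row["CONTENT"], int(row["CLASS"])) for row in reader]


comments = load_comments(io.StringIO(SAMPLE_CSV))

# Print the first few comments with their labels (1 = spam, 0 = not spam).
for content, label in comments[:3]:
    print(label, content)
```

To read a real file instead, pass an open file object, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded CSV is stored.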

Here we can see that there are more than two columns, but we only need the content and class columns. The content column contains the comments, and the class column contains the value 1 or 0 for spam or not spam. For example, notice that the first two comments are marked as not spam, but then the comment subscribe to...

Word2Vec models


In this section, we'll learn about Word2Vec, a modern and popular technique for working with text. Usually, Word2Vec performs better than simple bag of words models. A bag of words model only counts how many times each word appears in each document. Given two such bag of words vectors, we can compare documents to see how similar they are. This is the same as comparing the words used in the documents. In other words, if the two documents have many similar words that appear a similar number of times, they will be considered similar.
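The comparison described above can be sketched with the standard library alone. This example counts words with `collections.Counter` and compares two documents with cosine similarity; cosine is one common choice and is an assumption here, as are the helper names and the lowercase, whitespace-based tokenization.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how many times each lowercase, whitespace-separated word appears."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("please subscribe to my channel")
doc2 = bag_of_words("subscribe to my channel please")
doc3 = bag_of_words("great song great dance")

print(cosine_similarity(doc1, doc2))  # same words, different order: close to 1
print(cosine_similarity(doc1, doc3))  # no shared words: 0.0
```

Because only counts are kept, word order is lost: the first two documents are treated as identical even though the words are arranged differently.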

But bag of words models have no information about how similar the words themselves are. So, if two documents do not use exactly the same words but do use synonyms, such as please and plz, the bag of words model does not regard them as similar. Word2Vec can figure out that some words are similar to each other, and we can exploit that fact to get better performance when doing machine learning with text.
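To make the synonym point concrete, here is a small sketch using hand-made word vectors. The three-dimensional vectors below are hypothetical values chosen only for illustration; real Word2Vec vectors are learned from text and are much longer. Averaging word vectors to represent a document is also just one simple approach, assumed here for the example.

```python
import math

# Hypothetical, hand-made vectors: "please" and "plz" are placed close together
# on purpose. A trained model would learn such positions from text.
TOY_VECTORS = {
    "please": (0.9, 0.1, 0.0),
    "plz": (0.8, 0.2, 0.0),
    "subscribe": (0.1, 0.9, 0.1),
    "song": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def document_vector(words):
    """Average the vectors of the given words, component by component."""
    vectors = [TOY_VECTORS[word] for word in words]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


# Word level: the synonyms are close, unrelated words are not.
print(cosine(TOY_VECTORS["please"], TOY_VECTORS["plz"]))
print(cosine(TOY_VECTORS["please"], TOY_VECTORS["song"]))

# Document level: different words, but similar averaged vectors.
doc_a = document_vector(["please", "subscribe"])
doc_b = document_vector(["plz", "subscribe"])
print(cosine(doc_a, doc_b))
```

A bag of words model would see please and plz as two unrelated columns, whereas here the two documents end up almost identical.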

 

In Word2Vec, each word itself is a vector, with perhaps...

Detecting positive or negative sentiments in user reviews


In this section, we're going to look at detecting positive and negative sentiments in user reviews. In other words, we are going to detect whether the user is writing a positive or a negative comment about a product or service. We're going to use Word2Vec and Doc2Vec, both provided by the gensim Python library. There are two categories, positive and negative, and we have over 3,000 different reviews to look at. These come from Yelp, IMDb, and Amazon. Let's begin the code by importing the gensim library, which provides Word2Vec and Doc2Vec, and the logging module, which reports status messages:
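A sketch of that setup is shown below. The log format string is an assumption, and `load_word_vectors` is a hypothetical helper name; it wraps gensim's `KeyedVectors.load_word2vec_format`, which reads vectors stored in the binary word2vec format. The `path` argument is a placeholder for wherever a model file is stored locally.

```python
import logging

# Show status messages (for example, loading progress) on the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def load_word_vectors(path):
    """Load pre-trained word vectors stored in the binary word2vec format.

    `path` is a placeholder for the location of a locally stored model file.
    """
    # Imported inside the function so that the logging setup above can run
    # even on a machine where gensim is not installed yet.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)
```

Once a model is loaded this way, looking up a word such as `vectors["cat"]` returns its vector as a NumPy array.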

First, we will see how to load a pre-built Word2Vec model, provided by Google, that has been trained on billions of pages of text and has ultimately produced 300-dimensional vectors for all the different words. Once the model is loaded, we will look at the vector for cat. This shows that the model is a 300-dimensional...

Summary


In this chapter, we introduced text processing and the bag of words technique. We then used this technique to build a spam detector for YouTube comments. Next, we learned about the sophisticated Word2Vec model and put it to task with a coding project that detects positive and negative product, restaurant, and movie reviews. That's the end of this chapter about text.

In the next chapter, we're going to look at deep learning, a popular technique built on neural networks with many layers.
