You're reading from Learning Predictive Analytics with R

Product type: Book · Published: Sep 2015 · Reading level: Intermediate · Publisher: Packt · ISBN-13: 9781782169352 · Edition: 1st
Author: Eric Mayor

Eric Mayor is a senior researcher and lecturer at the University of Neuchatel, Switzerland. He is an enthusiastic user of open source and proprietary predictive analytics software packages, such as R, Rapidminer, and Weka. He analyzes data on a daily basis and is keen to share his knowledge in a simple way.

Chapter 13. Text Analytics with R

In the previous chapter, we examined how to deal with nested data using multilevel analyses. In Chapter 11, Classification Trees, we discovered how to classify data using decision trees. Here, we will deal with textual data. This chapter will cover the following topics:

  • A brief introduction to text analytics

  • How to load and preprocess text

  • How to perform document classification

  • How to perform basic topic modeling to extract meaning

  • How to download news articles using R

An introduction to text analytics


It might come as a surprise (or not), but textual data represents the greatest part of the overall data accessible to companies and data analysts, and it is often available only in unstructured form. Imagine, for instance, an e-mail, a company memo, or a blog post. What these have in common is that the text is mostly presented as words arranged in sentences, which are in turn arranged in paragraphs. More complex documents are also composed of subsections, sections, and chapters. Humans derive meaning from this basic structure and from the relationships between these elements, but for machines to classify documents and extract meaning, text preprocessing is required.

There are several usual steps in the preprocessing of textual documents for classification. These include:

  1. Importing the corpus.

  2. Converting text to lowercase, so that, in the analyses, words that include capital letters are not distinguished from words that do not. For instance, the following words are the...
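The case-folding in step 2 can be illustrated on a toy character vector (the words below are an example of ours, not from the book's corpus):

```r
# Words that differ only in capitalization collapse to a single token
words <- c("Film", "film", "FILM", "Actor")
table(tolower(words))  # counts: "actor" once, "film" three times
```

Without this step, "Film" at the start of a sentence and "film" elsewhere would be counted as two distinct terms.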

Loading the corpus


Before we start, let's perform some preliminary steps by running the following code:

1  URL = "http://www.cs.cornell.edu/people/pabo/
2  movie-review-data/review_polarity.tar.gz"
3  download.file(URL,destfile = "reviews.tar.gz")
4  untar("reviews.tar.gz")

This downloads the data you will use as a compressed file. Lines 1 and 2 here should be typed as a single line in your console or script window, with no space between the two quoted fragments. Next, untar() uncompresses the file into a folder called txt_sentoken in your working directory. Change your working directory to point to this folder by using the following code line:

setwd("txt_sentoken")

The folder contains the subfolders pos and neg. The pos folder contains 1,000 positive film reviews, whereas the neg folder contains 1,000 negative film reviews. The reviews were collected by researchers at Cornell University. We will analyze these texts here. The first thing we will do is load both corpora into R.
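The loading code itself is truncated in this excerpt; a common way to do it with the tm package (the folder names pos and neg are as above, but the exact calls are our sketch, not necessarily the book's) is:

```r
library(tm)

# Each folder of plain-text reviews becomes a corpus of 1,000 documents
corpus_pos <- VCorpus(DirSource("pos"), readerControl = list(language = "en"))
corpus_neg <- VCorpus(DirSource("neg"), readerControl = list(language = "en"))

# Concatenate into a single corpus of 2,000 documents
corpus <- c(corpus_pos, corpus_neg)
length(corpus)  # 2000
```

VCorpus() is used here because concatenation with c() is defined for volatile corpora.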

For this purpose, and to accomplish...

Data preparation


In this section, we will start by preprocessing the corpus for analysis and then inspecting it. We will then build the training and testing data frames.

Preprocessing and inspecting the corpus

We can see that the joint corpus contains 2,000 documents as we requested. We can now perform the steps we discussed in the preceding section. We will build a function that performs them all at once for this purpose (we will use this function again later in the chapter):

install.packages("SnowballC")

preprocess = function(corpus, stopwrds = stopwords("english")){
  library(SnowballC)  # supplies the stemmer used by stemDocument()
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, content_transformer(removeNumbers))
  corpus = tm_map(corpus, removeWords, stopwrds)
  corpus = tm_map(corpus, stripWhitespace)
  corpus = tm_map(corpus, stemDocument)
  corpus
}

Let's run the function on our corpus:

processed...

Creating the training and testing data frames


We now need to create a data frame that includes the criterion attribute (quality), the length of the reviews, and the term-document matrix:

DF = as.data.frame(cbind(quality, lengths, SparseRemoved))
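The objects quality, lengths, and SparseRemoved are built in text omitted from this excerpt; one plausible construction with tm (the object names match the line above, but the sparsity threshold and the exact steps are our assumptions) is:

```r
library(tm)

# Term-document counts for the preprocessed corpus
dtm <- DocumentTermMatrix(processed)

# Drop very sparse terms so the resulting data frame stays tractable
SparseRemoved <- as.matrix(removeSparseTerms(dtm, 0.99))

# Criterion: the first 1,000 documents are the positive reviews
quality <- c(rep(1, 1000), rep(0, 1000))

# Length of each review, counted in retained terms
lengths <- rowSums(SparseRemoved)
```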

Let's now create our training and a testing dataset:

set.seed(123)
train = sample(1:2000, 1000)
TrainDF = DF[train,]
TestDF = DF[-train,]

Classification of the reviews


At the beginning of this section, we will try to classify the corpus using algorithms we have already discussed (Naïve Bayes and k-NN). We will then briefly discuss two new algorithms: logistic regression and support vector machines.

Document classification with k-NN

We know k-Nearest Neighbors, so we'll just jump into the classification. We will try with three neighbors and five neighbors:

library(class)  # knn() is in the class package
library(caret)  # confusionMatrix() is in the caret package
set.seed(975)
# Both the training and "test" arguments are TrainDF: we first assess
# performance on the training set itself
Class3n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 3)
Class5n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 5)
confusionMatrix(Class3n, as.factor(TrainDF$quality))

The confusion matrix and the following statistics (the output has been partially reproduced) show that classification with three neighbors doesn't seem too bad: the accuracy is 0.74. Yet the kappa value is not good (it should be at least 0.60):

Confusion Matrix and Statistics...
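To see why a decent accuracy can coexist with a mediocre kappa, Cohen's kappa can be computed by hand from a 2x2 confusion matrix. The counts below are hypothetical, chosen to reproduce an accuracy of 0.74; they are not the truncated output above:

```r
# Hypothetical confusion matrix: rows = predicted class, columns = actual class
cm <- matrix(c(400, 160,
               100, 340), nrow = 2, byrow = TRUE)
n <- sum(cm)

accuracy <- sum(diag(cm)) / n                     # observed agreement: 0.74
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance: 0.5
kappa <- (accuracy - expected) / (1 - expected)   # 0.48, below the 0.60 guideline
```

Kappa rescales accuracy so that chance-level agreement scores 0; with balanced classes, a 0.74 accuracy is only 0.48 above chance on this scale.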

Mining the news with R


In this section, we discuss news mining in R. We start with a successful document classification and then discuss how to collect news articles directly from R.

A successful document classification

In this section, we examine a particular dataset featuring a term-document matrix of 2,071 press articles that contain the word flu in their title. The articles were found on LexisNexis using this search term in two newspapers, The New York Times and The Guardian, between January 1980 and May 2013. For copyright reasons, we cannot include the original articles here. They were preprocessed in a way similar to what we have seen before, using different software (Rapidminer 5). In addition to the term-document matrix, the type of flu (seasonal versus other, that is, avian and swine flu) is included in the first column of the data frame (the SEASONAL.FLU attribute). When articles discussed both seasonal flu and other strains, they were coded as other (value 0). Terms were coded as present...

Summary


In this chapter, we discussed how to deal with text in R in order to perform classification. We examined how to load documents from several sources, how to preprocess them, and how to compute term frequencies. We compared the reliability of several classification algorithms: Naïve Bayes, k-Nearest Neighbors, logistic regression, and support vector machines. Additionally, we examined how to perform basic topic modeling in order to extract meaning. We then studied how to automatically download news articles from sources such as The New York Times Article Search API, and how to extract and visualize associations between terms.

In the next chapter, we will discuss cross-validation and how to export models using PMML.
