You're reading from Learning Predictive Analytics with R

Product type: Book · Published: Sep 2015 · Reading level: Intermediate · Publisher: Packt · ISBN-13: 9781782169352 · Edition: 1st
Author: Eric Mayor

Eric Mayor is a senior researcher and lecturer at the University of Neuchatel, Switzerland. He is an enthusiastic user of open source and proprietary predictive analytics software packages, such as R, Rapidminer, and Weka. He analyzes data on a daily basis and is keen to share his knowledge in a simple way.

Chapter 13. Text Analytics with R

In the previous chapter, we examined how to deal with nested data using multilevel analyses. In Chapter 11, Classification Trees, we discovered how to classify data using decision trees. Here, we will deal with textual data. This chapter will cover the following topics:

  • A brief introduction to text analytics

  • How to load and preprocess text

  • How to perform document classification

  • How to perform basic topic modeling to extract meaning

  • How to download news articles using R

An introduction to text analytics


It might come as a surprise (or not), but textual data represents the greatest part of the overall data accessible to companies and data analysts, and it is often available only in unstructured form. Imagine, for instance, an e-mail, a company memo, or a blog post. What these have in common is that the text is mostly presented as words arranged in sentences, which are in turn arranged in paragraphs. More complex documents are also composed of subsections, sections, and chapters. Humans derive meaning from this basic structure and from the relationships between these elements, but for machines to classify documents and extract meaning, text preprocessing is required.

There are several usual steps in the preprocessing of textual documents for classification. These include:

  1. Importing the corpus.

  2. Converting text to lowercase, so that, in the analyses, words that include capital letters are not distinguished from words that do not. For instance, the following words are the...
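The case-folding in step 2 can be illustrated on a toy character vector (the words below are an example of ours, not from the book's corpus):

```r
# Words that differ only in capitalization collapse to a single token
words <- c("Film", "film", "FILM", "Actor")
table(tolower(words))  # counts: "actor" once, "film" three times
```

Without this step, "Film" at the start of a sentence and "film" elsewhere would be counted as two distinct terms.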

Loading the corpus


Before we start, let's perform some preliminary steps by running the following code:

1  URL = "http://www.cs.cornell.edu/people/pabo/
2  movie-review-data/review_polarity.tar.gz"
3  download.file(URL,destfile = "reviews.tar.gz")
4  untar("reviews.tar.gz")

This downloads the data you will use as a compressed file. Lines 1 and 2 here should be typed as a single line in your console or script window, with no space between the two quoted fragments. Next, untar() uncompresses the file into a folder called txt_sentoken in your working directory. Change your working directory to point to this folder by using the following code line:

setwd("txt_sentoken")

The folder contains the subfolders pos and neg. The pos folder contains 1,000 positive film reviews, whereas the neg folder contains 1,000 negative film reviews. The reviews were collected by researchers at Cornell University. We will analyze these texts here. The first thing we will do is load both corpora into R.
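The loading code itself is truncated in this excerpt; a common way to do it with the tm package (the folder names pos and neg are as above, but the exact calls are our sketch, not necessarily the book's) is:

```r
library(tm)

# Each folder of plain-text reviews becomes a corpus of 1,000 documents
corpus_pos <- VCorpus(DirSource("pos"), readerControl = list(language = "en"))
corpus_neg <- VCorpus(DirSource("neg"), readerControl = list(language = "en"))

# Concatenate into a single corpus of 2,000 documents
corpus <- c(corpus_pos, corpus_neg)
length(corpus)  # 2000
```

VCorpus() is used here because concatenation with c() is defined for volatile corpora.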

For this purpose, and to accomplish...

Data preparation


In this section, we will start by preprocessing the corpus for analysis and then inspecting it. We will then build the training and testing data frames.

Preprocessing and inspecting the corpus

We can see that the joint corpus contains 2,000 documents as we requested. We can now perform the steps we discussed in the preceding section. We will build a function that performs them all at once for this purpose (we will use this function again later in the chapter):

install.packages("SnowballC")

preprocess = function(corpus, stopwrds = stopwords("english")){
  library(SnowballC)  # supplies the stemmer used by stemDocument()
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, content_transformer(removeNumbers))
  corpus = tm_map(corpus, removeWords, stopwrds)
  corpus = tm_map(corpus, stripWhitespace)
  corpus = tm_map(corpus, stemDocument)
  corpus
}

Let's run the function on our corpus:

processed...

Creating the training and testing data frames


We now need to create a data frame that includes the criterion attribute (quality), the length of the reviews, and the term-document matrix:

DF = as.data.frame(cbind(quality, lengths, SparseRemoved))
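The objects quality, lengths, and SparseRemoved are built in text omitted from this excerpt; one plausible construction with tm (the object names match the line above, but the sparsity threshold and the exact steps are our assumptions) is:

```r
library(tm)

# Term-document counts for the preprocessed corpus
dtm <- DocumentTermMatrix(processed)

# Drop very sparse terms so the resulting data frame stays tractable
SparseRemoved <- as.matrix(removeSparseTerms(dtm, 0.99))

# Criterion: the first 1,000 documents are the positive reviews
quality <- c(rep(1, 1000), rep(0, 1000))

# Length of each review, counted in retained terms
lengths <- rowSums(SparseRemoved)
```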

Let's now create our training and a testing dataset:

set.seed(123)
train = sample(1:2000, 1000)
TrainDF = DF[train,]
TestDF = DF[-train,]

Classification of the reviews


At the beginning of this section, we will try to classify the corpus using algorithms we have already discussed (Naïve Bayes and k-NN). We will then briefly discuss two new algorithms: logistic regression and support vector machines.

Document classification with k-NN

We know k-Nearest Neighbors, so we'll just jump into the classification. We will try with three neighbors and five neighbors:

library(class)  # knn() is in the class package
library(caret)  # confusionMatrix() is in the caret package
set.seed(975)
# Both the training and "test" arguments are TrainDF: we first assess
# performance on the training set itself
Class3n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 3)
Class5n = knn(TrainDF[,-1], TrainDF[,-1], TrainDF[,1], k = 5)
confusionMatrix(Class3n, as.factor(TrainDF$quality))

The confusion matrix and the following statistics (the output has been partially reproduced) show that classification with three neighbors doesn't seem too bad: the accuracy is 0.74. Yet the kappa value is not good (it should be at least 0.60):

Confusion Matrix and Statistics...
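To see why a decent accuracy can coexist with a mediocre kappa, Cohen's kappa can be computed by hand from a 2x2 confusion matrix. The counts below are hypothetical, chosen to reproduce an accuracy of 0.74; they are not the truncated output above:

```r
# Hypothetical confusion matrix: rows = predicted class, columns = actual class
cm <- matrix(c(400, 160,
               100, 340), nrow = 2, byrow = TRUE)
n <- sum(cm)

accuracy <- sum(diag(cm)) / n                     # observed agreement: 0.74
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance: 0.5
kappa <- (accuracy - expected) / (1 - expected)   # 0.48, below the 0.60 guideline
```

Kappa rescales accuracy so that chance-level agreement scores 0; with balanced classes, a 0.74 accuracy is only 0.48 above chance on this scale.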

Mining the news with R


In this section, we discuss news mining in R. We start with a successful document classification and then discuss how to collect news articles directly from R.

A successful document classification

In this section, we examine a particular dataset featuring a term-document matrix of 2,071 press articles that contain the word flu in their title. The articles were found on LexisNexis using this search term in two newspapers, The New York Times and The Guardian, between January 1980 and May 2013. For copyright reasons, we cannot include the original articles here. They were preprocessed in a way similar to what we have seen before, using different software (Rapidminer 5). In addition to the term-document matrix, the type of flu (seasonal versus other, that is, avian and swine flu) is included in the first column of the data frame (the SEASONAL.FLU attribute). When articles discussed both seasonal flu and other strains, they were coded as other (value 0). Terms were coded as present...

Summary


In this chapter, we discussed how to deal with text in R in order to perform classification. We examined how to load documents from several sources, how to preprocess them, and how to compute term frequencies. We compared the reliability of several classification algorithms: Naïve Bayes, k-Nearest Neighbors, logistic regression, and support vector machines. Additionally, we examined how to perform basic topic modeling in order to extract meaning. We then studied how to automatically download news articles from sources such as The New York Times Article Search API, and how to extract and visualize associations between terms.

In the next chapter, we will discuss cross-validation and how to export models using PMML.
