The Natural Language Processing Workshop

Product type: Book
Published in: Aug 2020
Publisher: Packt
ISBN-13: 9781800208421
Pages: 452
Edition: 1st
Authors (6):
Rohan Chopra
Aniruddha M. Godbole
Nipun Sadvilkar
Muzaffar Bashir Shah
Sohom Ghosh
Dwight Gunning

Table of Contents (10 chapters)

Preface
1. Introduction to Natural Language Processing
2. Feature Extraction Methods
3. Developing a Text Classifier
4. Collecting Text Data with Web Scraping and APIs
5. Topic Modeling
6. Vector Representation
7. Text Generation and Summarization
8. Sentiment Analysis
Appendix

1. Introduction to Natural Language Processing

Activity 1.01: Preprocessing of Raw Text

Solution

Let's perform preprocessing on a text corpus. To complete this activity, follow these steps:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    from nltk import download
    download('stopwords')
    download('wordnet')
    download('punkt')
    download('averaged_perceptron_tagger')
    from nltk import word_tokenize
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.corpus import stopwords
    from autocorrect import Speller
    from nltk.wsd import lesk
    from nltk.tokenize import sent_tokenize
    from nltk import stem, pos_tag
    import string
  3. Read the content of file.txt and store it in a variable named sentence. Insert a new cell and add the following code to implement this:
    # Load the text file into a variable called sentence
    sentence = open("../data/file.txt", 'r').read()
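The rest of this solution is not shown in this excerpt. As a rough guide only, here is a minimal sketch of how the remaining preprocessing might look with the libraries imported in step 2; the variable names and the step order are illustrative assumptions, not the book's exact solution:

    # Illustrative sketch (not the book's exact code): tokenize, spell-correct,
    # lemmatize, then remove stop words and punctuation from the sentence.
    words = word_tokenize(sentence)
    spell = Speller(lang='en')                  # autocorrect spell-checker
    corrected = [spell(word) for word in words]
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for word in corrected]
    stop_words = set(stopwords.words('english'))
    cleaned = [word for word in lemmas
               if word.lower() not in stop_words
               and word not in string.punctuation]
    print(cleaned[:20])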

2. Feature Extraction Methods

Activity 2.01: Extracting Top Keywords from the News Article

Solution

The following steps will help you complete this activity:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries and download the data:
    import operator
    from nltk.tokenize import WhitespaceTokenizer
    from nltk import download, stem
    # The statement below downloads the stop word list to
    # 'nltk_data/corpora/stopwords/' in your home directory
    download('stopwords')
    from nltk.corpus import stopwords

    The download statement saves the stop word list to nltk_data/corpora/stopwords/ in your system's home directory.

  3. Create the different types of methods to perform various NLP tasks:

    Activity 2.01.ipynb

    def load_file(file_path):
        news = ''.join\
                  ([line for line in open...
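The solution is truncated here. As a hedged sketch of the general technique the imports in step 2 suggest (frequency counting with operator.itemgetter), with the choice of stemmer being an illustrative assumption:

    # Sketch only: tokenize, drop stop words, stem, count frequencies,
    # then sort by count to surface the top keywords.
    def extract_top_keywords(text, n=10):
        tokenizer = WhitespaceTokenizer()
        stemmer = stem.PorterStemmer()          # illustrative choice of stemmer
        stop_words = set(stopwords.words('english'))
        frequency = {}
        for token in tokenizer.tokenize(text.lower()):
            if token in stop_words:
                continue
            token = stemmer.stem(token)
            frequency[token] = frequency.get(token, 0) + 1
        ranked = sorted(frequency.items(),
                        key=operator.itemgetter(1), reverse=True)
        return ranked[:n]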

3. Developing a Text Classifier

Activity 3.01: Developing End-to-End Text Classifiers

Solution

The following steps will help you implement this activity:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary packages:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')
    import warnings
    import string
    import re
    warnings.filterwarnings('ignore')
    from sklearn.metrics import accuracy_score, roc_curve, \
    classification_report, confusion_matrix, \
    precision_recall_curve, auc
  3. Read a data file. It has three columns: is_political, headline, and short_description...
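The excerpt ends here. A minimal sketch of the end-to-end flow the imports suggest (TF-IDF features, a train/test split, a classifier, and evaluation); the file path and the choice of LogisticRegression are illustrative assumptions, not the book's exact solution:

    from sklearn.linear_model import LogisticRegression

    # Hypothetical path; use the data file supplied with the book
    data = pd.read_csv('../data/news_headlines.csv')
    data['text'] = data['headline'] + ' ' + data['short_description']

    # Vectorize, split, train, and evaluate
    tfidf = TfidfVectorizer(stop_words='english', max_features=500)
    X = tfidf.fit_transform(data['text'])
    y = data['is_political']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))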

4. Collecting Text Data with Web Scraping and APIs

Activity 4.01: Extracting Information from an Online HTML Page

Solution

Let's extract the data from an online source and analyze it. Follow these steps to implement this activity:

  1. Open a Jupyter Notebook.
  2. Import the requests and BeautifulSoup libraries. Fetch the URL with requests using the following command, then parse the fetched content with the BeautifulSoup HTML parser. Add the following code to do this:
    import requests
    from bs4 import BeautifulSoup
    r = requests\
        .get('https://en.wikipedia.org/wiki/Rabindranath_Tagore')
    r.status_code
    soup = BeautifulSoup(r.text, 'html.parser')
  3. To extract the list of headings, inspect which HTML elements contain each bold headline in the Works section; they belong to h3 tags. We only need the first six headings here. Look for a span tag that has a class attribute, using the following set of commands...
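The commands themselves are cut off in this excerpt. A minimal sketch of the idea, assuming Wikipedia wraps section headings in span tags with the class mw-headline (that class name is an assumption about Wikipedia's markup and may change over time):

    # Sketch only: take the heading text of the first six h3 sections
    spans = soup.select('h3 span.mw-headline')[:6]
    headings = [span.get_text() for span in spans]
    print(headings)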

5. Topic Modeling

Activity 5.01: Topic-Modeling Jeopardy Questions

Solution

Let's perform topic modeling on the dataset of Jeopardy questions:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import pandas and other libraries:
    import numpy as np
    import spacy
    nlp = spacy.load('en_core_web_sm')
    import pandas as pd
    pd.set_option('display.max_colwidth', 800)
  3. After downloading the data, extract it and place it at the location shown below. Then load the Jeopardy CSV file into a pandas DataFrame. Insert a new cell and add the following code:
    JEOPARDY_CSV = '../data/jeopardy/Jeopardy.csv'
    questions = pd.read_csv(JEOPARDY_CSV)
    questions.columns = [x.strip() for x in questions.columns]
  4. The data in the DataFrame is not clean. In order to clean it, remove records that have missing values in the Question column. Add the following code to do this:
    questions = questions.dropna(subset=['Question'])
  5. Find...
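The remaining steps are truncated. One common way to topic-model the cleaned questions, not necessarily the book's exact approach, is to fit an LDA model over a bag-of-words matrix; the scikit-learn classes and parameter values below are illustrative assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Build a document-term matrix over the questions
    vectorizer = CountVectorizer(stop_words='english', max_features=1000)
    doc_term = vectorizer.fit_transform(questions['Question'])

    # Fit LDA and print the top words per topic
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    lda.fit(doc_term)
    terms = vectorizer.get_feature_names_out()   # scikit-learn >= 1.0
    for idx, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-8:][::-1]]
        print(f'Topic {idx}: {", ".join(top)}')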

6. Vector Representation

Activity 6.01: Finding Similar News Articles Using Document Vectors

Solution

Follow these steps to complete this activity:

  1. Open a Jupyter Notebook. Insert a new cell and add the following code to import all necessary libraries:
    import warnings
    warnings.filterwarnings("ignore")
    from gensim.models import Doc2Vec
    import pandas as pd
    from gensim.parsing.preprocessing import preprocess_string, \
    remove_stopwords 
  2. Now specify the path of the news data file:
    news_file = '../data/sample_news_data.txt'
  3. After that, you need to iterate over each headline in the file and split the columns, then create a DataFrame containing the headlines. Insert a new cell and add the following code to implement this:
    with open(news_file, encoding="utf8", errors='ignore') as f:
        news_lines = [line for line in f.readlines()]
    lines_df = pd.DataFrame()
    indices  = list(range(len(news_lines)))
    lines_df['news'] = news_lines...
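The excerpt stops here. A minimal sketch of the remaining steps implied by the imports (train Doc2Vec on the preprocessed headlines, then query for similar documents); the hyperparameter values are illustrative, and model.dv assumes gensim 4.x:

    from gensim.models.doc2vec import TaggedDocument

    # Preprocess each headline and tag it with its row index
    preprocessed = [preprocess_string(remove_stopwords(news))
                    for news in lines_df['news']]
    tagged_docs = [TaggedDocument(words=words, tags=[i])
                   for i, words in enumerate(preprocessed)]

    # Train a Doc2Vec model (illustrative hyperparameters)
    model = Doc2Vec(tagged_docs, vector_size=50, min_count=2, epochs=40)

    # Find the documents most similar to the first headline
    vector = model.infer_vector(preprocessed[0])
    print(model.dv.most_similar([vector], topn=5))   # model.dv needs gensim 4.x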

7. Text Generation and Summarization

Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset

Solution

Follow these steps to complete this activity:

  1. Open a Jupyter Notebook and insert a new cell. Add the following code to import the required libraries:
    import warnings
    warnings.filterwarnings('ignore')
    import os
    import csv
    import pandas as pd
    from gensim.summarization import summarize
  2. Insert a new cell and add the following code to fetch the Consumer Complaints dataset and consider the rows that have a complaint narrative. Drop all the columns other than Product, Sub-product, Issue, Sub-issue, and Consumer complaint narrative:
    complaints_pathname = '../data/consumercomplaints/'\
                          'Consumer_Complaints.csv'
    df_all_complaints = pd.read_csv(complaints_pathname)
    df_all_narr = df_all_complaints...
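The excerpt is cut off. A hedged sketch of how step 2 might finish and how summarize could then be applied, assuming the column names listed above; note that gensim.summarization was removed in gensim 4.0, so this requires gensim 3.x:

    # Keep only the listed columns and the rows with a complaint narrative
    columns = ['Product', 'Sub-product', 'Issue', 'Sub-issue',
               'Consumer complaint narrative']
    df_narratives = df_all_complaints[columns].dropna(
        subset=['Consumer complaint narrative'])

    # Summarize one narrative to roughly 10% of its length
    # (summarize needs a text of several sentences to work)
    text = df_narratives['Consumer complaint narrative'].iloc[0]
    print(summarize(text, ratio=0.1))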

8. Sentiment Analysis

Activity 8.01: Tweet Sentiment Analysis Using the TextBlob Library

Solution

To perform sentiment analysis on the given set of tweets related to airlines, follow these steps:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    import pandas as pd
    from textblob import TextBlob
    import re
  3. Since we are displaying the text in the notebook, we want to increase the display width for our DataFrame. Insert a new cell and add the following code to implement this:
    pd.set_option('display.max_colwidth', 240)
  4. Now, load the airline-tweets.csv dataset. We will read this CSV file using pandas' read_csv() function. Insert a new cell and add the following code to implement this:
    tweets = pd.read_csv('data/airline-tweets.csv')
  5. Insert a new cell and add the following code to view the first 10 records of the DataFrame:
    tweets.head(10)

    The code generates the following output:

    Figure 8...
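The remainder of the solution is locked. As a hedged sketch of how the sentiment scoring might proceed with TextBlob, assuming the tweet text lives in a column named text (an illustrative column name, not confirmed by the excerpt):

    def clean(text):
        # Strip handles and URLs before scoring (illustrative regex)
        return re.sub(r'@\w+|https?://\S+', '', str(text))

    # Polarity ranges from -1 (negative) to +1 (positive)
    tweets['polarity'] = tweets['text'].apply(
        lambda t: TextBlob(clean(t)).sentiment.polarity)
    print(tweets[['text', 'polarity']].head())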
