The Natural Language Processing Workshop

Product type: Book
Published in: Aug 2020
Publisher: Packt
ISBN-13: 9781800208421
Pages: 452
Edition: 1st
Authors (6):
Rohan Chopra
Aniruddha M. Godbole
Nipun Sadvilkar
Muzaffar Bashir Shah
Sohom Ghosh
Dwight Gunning

Table of Contents (10 chapters)

Preface
1. Introduction to Natural Language Processing
2. Feature Extraction Methods
3. Developing a Text Classifier
4. Collecting Text Data with Web Scraping and APIs
5. Topic Modeling
6. Vector Representation
7. Text Generation and Summarization
8. Sentiment Analysis
Appendix

1. Introduction to Natural Language Processing

Activity 1.01: Preprocessing of Raw Text

Solution

Let's perform preprocessing on a text corpus. To complete this activity, follow these steps:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    from nltk import download
    download('stopwords')
    download('wordnet')
    download('punkt')
    download('averaged_perceptron_tagger')
    from nltk import word_tokenize
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.corpus import stopwords
    from autocorrect import Speller
    from nltk.wsd import lesk
    from nltk.tokenize import sent_tokenize
    from nltk import stem, pos_tag
    import string
  3. Read the content of file.txt and store it in a variable named sentence. Insert a new cell and add the following code to implement this:
    # Load the text file into a variable called sentence
    sentence = open("../data/file.txt", 'r').read()
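The rest of this solution is not shown in this excerpt. As a rough guide only, here is a minimal sketch of how the remaining preprocessing might look with the libraries imported in step 2; the variable names and the step order are illustrative assumptions, not the book's exact solution:

    # Illustrative sketch (not the book's exact code): tokenize, spell-correct,
    # lemmatize, then remove stop words and punctuation from the sentence.
    words = word_tokenize(sentence)
    spell = Speller(lang='en')                  # autocorrect spell-checker
    corrected = [spell(word) for word in words]
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for word in corrected]
    stop_words = set(stopwords.words('english'))
    cleaned = [word for word in lemmas
               if word.lower() not in stop_words
               and word not in string.punctuation]
    print(cleaned[:20])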

2. Feature Extraction Methods

Activity 2.01: Extracting Top Keywords from the News Article

Solution

The following steps will help you complete this activity:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries and download the data:
    import operator
    from nltk.tokenize import WhitespaceTokenizer
    from nltk import download, stem
    # The statement below downloads the stop word list to
    # 'nltk_data/corpora/stopwords/' in your home directory
    download('stopwords')
    from nltk.corpus import stopwords

    The download statement saves the stop word list to nltk_data/corpora/stopwords/ in your system's home directory.

  3. Create the different types of methods to perform various NLP tasks:

    Activity 2.01.ipynb

    def load_file(file_path):
        news = ''.join\
                  ([line for line in open...
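The solution is truncated here. As a hedged sketch of the general technique the imports in step 2 suggest (frequency counting with operator.itemgetter), with the choice of stemmer being an illustrative assumption:

    # Sketch only: tokenize, drop stop words, stem, count frequencies,
    # then sort by count to surface the top keywords.
    def extract_top_keywords(text, n=10):
        tokenizer = WhitespaceTokenizer()
        stemmer = stem.PorterStemmer()          # illustrative choice of stemmer
        stop_words = set(stopwords.words('english'))
        frequency = {}
        for token in tokenizer.tokenize(text.lower()):
            if token in stop_words:
                continue
            token = stemmer.stem(token)
            frequency[token] = frequency.get(token, 0) + 1
        ranked = sorted(frequency.items(),
                        key=operator.itemgetter(1), reverse=True)
        return ranked[:n]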

3. Developing a Text Classifier

Activity 3.01: Developing End-to-End Text Classifiers

Solution

The following steps will help you implement this activity:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary packages:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')
    import warnings
    import string
    import re
    warnings.filterwarnings('ignore')
    from sklearn.metrics import accuracy_score, roc_curve, \
    classification_report, confusion_matrix, \
    precision_recall_curve, auc
  3. Read a data file. It has three columns: is_political, headline, and short_description...
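The excerpt ends here. A minimal sketch of the end-to-end flow the imports suggest (TF-IDF features, a train/test split, a classifier, and evaluation); the file path and the choice of LogisticRegression are illustrative assumptions, not the book's exact solution:

    from sklearn.linear_model import LogisticRegression

    # Hypothetical path; use the data file supplied with the book
    data = pd.read_csv('../data/news_headlines.csv')
    data['text'] = data['headline'] + ' ' + data['short_description']

    # Vectorize, split, train, and evaluate
    tfidf = TfidfVectorizer(stop_words='english', max_features=500)
    X = tfidf.fit_transform(data['text'])
    y = data['is_political']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))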

4. Collecting Text Data with Web Scraping and APIs

Activity 4.01: Extracting Information from an Online HTML Page

Solution

Let's extract the data from an online source and analyze it. Follow these steps to implement this activity:

  1. Open a Jupyter Notebook.
  2. Import the requests and BeautifulSoup libraries. Fetch the URL with requests using the following command, then parse the fetched content with the BeautifulSoup HTML parser. Add the following code to do this:
    import requests
    from bs4 import BeautifulSoup
    r = requests\
        .get('https://en.wikipedia.org/wiki/Rabindranath_Tagore')
    r.status_code
    soup = BeautifulSoup(r.text, 'html.parser')
  3. To extract the list of headings, inspect which HTML elements contain each bold headline in the Works section; they belong to h3 tags. We only need the first six headings here. Look for a span tag that has a class attribute, using the following set of commands...
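The commands themselves are cut off in this excerpt. A minimal sketch of the idea, assuming Wikipedia wraps section headings in span tags with the class mw-headline (that class name is an assumption about Wikipedia's markup and may change over time):

    # Sketch only: take the heading text of the first six h3 sections
    spans = soup.select('h3 span.mw-headline')[:6]
    headings = [span.get_text() for span in spans]
    print(headings)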

5. Topic Modeling

Activity 5.01: Topic-Modeling Jeopardy Questions

Solution

Let's perform topic modeling on the dataset of Jeopardy questions:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import pandas and other libraries:
    import numpy as np
    import spacy
    nlp = spacy.load('en_core_web_sm')
    import pandas as pd
    pd.set_option('display.max_colwidth', 800)
  3. After downloading the data, extract it and place it at the location shown below. Then load the Jeopardy CSV file into a pandas DataFrame. Insert a new cell and add the following code:
    JEOPARDY_CSV = '../data/jeopardy/Jeopardy.csv'
    questions = pd.read_csv(JEOPARDY_CSV)
    questions.columns = [x.strip() for x in questions.columns]
  4. The data in the DataFrame is not clean. In order to clean it, remove records that have missing values in the Question column. Add the following code to do this:
    questions = questions.dropna(subset=['Question'])
  5. Find...
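The remaining steps are truncated. One common way to topic-model the cleaned questions, not necessarily the book's exact approach, is to fit an LDA model over a bag-of-words matrix; the scikit-learn classes and parameter values below are illustrative assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Build a document-term matrix over the questions
    vectorizer = CountVectorizer(stop_words='english', max_features=1000)
    doc_term = vectorizer.fit_transform(questions['Question'])

    # Fit LDA and print the top words per topic
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    lda.fit(doc_term)
    terms = vectorizer.get_feature_names_out()   # scikit-learn >= 1.0
    for idx, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-8:][::-1]]
        print(f'Topic {idx}: {", ".join(top)}')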

6. Vector Representation

Activity 6.01: Finding Similar News Articles Using Document Vectors

Solution

Follow these steps to complete this activity:

  1. Open a Jupyter Notebook. Insert a new cell and add the following code to import all necessary libraries:
    import warnings
    warnings.filterwarnings("ignore")
    from gensim.models import Doc2Vec
    import pandas as pd
    from gensim.parsing.preprocessing import preprocess_string, \
    remove_stopwords 
  2. Now specify the path of the news data file:
    news_file = '../data/sample_news_data.txt'
  3. After that, you need to iterate over each headline in the file and split the columns, then create a DataFrame containing the headlines. Insert a new cell and add the following code to implement this:
    with open(news_file, encoding="utf8", errors='ignore') as f:
        news_lines = [line for line in f.readlines()]
    lines_df = pd.DataFrame()
    indices  = list(range(len(news_lines)))
    lines_df['news'] = news_lines...
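The excerpt stops here. A minimal sketch of the remaining steps implied by the imports (train Doc2Vec on the preprocessed headlines, then query for similar documents); the hyperparameter values are illustrative, and model.dv assumes gensim 4.x:

    from gensim.models.doc2vec import TaggedDocument

    # Preprocess each headline and tag it with its row index
    preprocessed = [preprocess_string(remove_stopwords(news))
                    for news in lines_df['news']]
    tagged_docs = [TaggedDocument(words=words, tags=[i])
                   for i, words in enumerate(preprocessed)]

    # Train a Doc2Vec model (illustrative hyperparameters)
    model = Doc2Vec(tagged_docs, vector_size=50, min_count=2, epochs=40)

    # Find the documents most similar to the first headline
    vector = model.infer_vector(preprocessed[0])
    print(model.dv.most_similar([vector], topn=5))   # model.dv needs gensim 4.x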

7. Text Generation and Summarization

Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset

Solution

Follow these steps to complete this activity:

  1. Open a Jupyter Notebook and insert a new cell. Add the following code to import the required libraries:
    import warnings
    warnings.filterwarnings('ignore')
    import os
    import csv
    import pandas as pd
    from gensim.summarization import summarize
  2. Insert a new cell and add the following code to fetch the Consumer Complaints dataset and consider the rows that have a complaint narrative. Drop all the columns other than Product, Sub-product, Issue, Sub-issue, and Consumer complaint narrative:
    complaints_pathname = '../data/consumercomplaints/'\
                          'Consumer_Complaints.csv'
    df_all_complaints = pd.read_csv(complaints_pathname)
    df_all_narr = df_all_complaints...
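The excerpt is cut off. A hedged sketch of how step 2 might finish and how summarize could then be applied, assuming the column names listed above; note that gensim.summarization was removed in gensim 4.0, so this requires gensim 3.x:

    # Keep only the listed columns and the rows with a complaint narrative
    columns = ['Product', 'Sub-product', 'Issue', 'Sub-issue',
               'Consumer complaint narrative']
    df_narratives = df_all_complaints[columns].dropna(
        subset=['Consumer complaint narrative'])

    # Summarize one narrative to roughly 10% of its length
    # (summarize needs a text of several sentences to work)
    text = df_narratives['Consumer complaint narrative'].iloc[0]
    print(summarize(text, ratio=0.1))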

8. Sentiment Analysis

Activity 8.01: Tweet Sentiment Analysis Using the TextBlob Library

Solution

To perform sentiment analysis on the given set of tweets related to airlines, follow these steps:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    import pandas as pd
    from textblob import TextBlob
    import re
  3. Since we are displaying the text in the notebook, we want to increase the display width for our DataFrame. Insert a new cell and add the following code to implement this:
    pd.set_option('display.max_colwidth', 240)
  4. Now, load the airline-tweets.csv dataset. We will read this CSV file using pandas' read_csv() function. Insert a new cell and add the following code to implement this:
    tweets = pd.read_csv('data/airline-tweets.csv')
  5. Insert a new cell and add the following code to view the first 10 records of the DataFrame:
    tweets.head(10)

    The code generates the following output:

    Figure 8...
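The remainder of the solution is locked. As a hedged sketch of how the sentiment scoring might proceed with TextBlob, assuming the tweet text lives in a column named text (an illustrative column name, not confirmed by the excerpt):

    def clean(text):
        # Strip handles and URLs before scoring (illustrative regex)
        return re.sub(r'@\w+|https?://\S+', '', str(text))

    # Polarity ranges from -1 (negative) to +1 (positive)
    tweets['polarity'] = tweets['text'].apply(
        lambda t: TextBlob(clean(t)).sentiment.polarity)
    print(tweets[['text', 'polarity']].head())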
