
You're reading from  The Natural Language Processing Workshop

Product type: Book
Published in: Aug 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800208421
Edition: 1st
Authors (6):
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelor's degree in computer science. He has more than two years' experience designing, implementing, and optimizing end-to-end deep neural network systems. His research centers on using deep learning to solve computer vision problems, and he has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.

Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with interdisciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and an MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion-pages contributor to Mint, a leading business newspaper in India. He has fifteen years of experience.

Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at a US healthcare company, where he leads a team of data scientists and subject matter experts in designing and building a clinical NLP engine to revamp medical coding workflows, enhance coder efficiency, and accelerate the revenue cycle. He has more than three years' experience building NLP solutions and web-based data science platforms in healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering, with a fair understanding of the business domain. He is a member of regional and national Python communities and the author of pySBD, an open-source Python NLP library for sentence segmentation recognized by the ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.

Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a master's degree in computer science from the University of Kashmir and currently works at Datoin, a Bangalore-based startup.

Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in data science, specializing in deep learning-based text analytics, NLP, and recommendation systems, and has published in several international conferences and journals.

Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools, such as NLTK, gensim, and spaCy.


1. Introduction to Natural Language Processing

Activity 1.01: Preprocessing of Raw Text

Solution

Let's perform preprocessing on a text corpus. To complete this activity, follow these steps:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    from nltk import download
    download('stopwords')
    download('wordnet')
    download('punkt')
    download('averaged_perceptron_tagger')
    from nltk import word_tokenize
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.corpus import stopwords
    from autocorrect import Speller
    from nltk.wsd import lesk
    from nltk.tokenize import sent_tokenize
    from nltk import stem, pos_tag
    import string
  3. Read the content of file.txt and store it in a variable named sentence. Insert a new cell and add the following code to implement this:
    # Load the text file into a variable called sentence
    sentence = open("../data/file.txt", 'r').read()
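The later steps of this activity tokenize the text, correct spelling, lemmatize, and remove stopwords using NLTK. As a rough sketch of what such a pipeline does, here is a simplified pure-Python stand-in (the regex tokenizer and the tiny stopword set are illustrative only, not NLTK's):

```python
import re

# Tiny illustrative stopword subset; NLTK's English list is much larger.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "over"}

def simple_preprocess(sentence):
    """Lowercase, tokenize on runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(simple_preprocess("The quick brown fox is jumping over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumping', 'lazy', 'dog']
```

The NLTK-based solution adds spelling correction, POS tagging, and lemmatization on top of this basic tokenize-and-filter skeleton.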

2. Feature Extraction Methods

Activity 2.01: Extracting Top Keywords from the News Article

Solution

The following steps will help you complete this activity:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries and download the data:
    import operator
    from nltk.tokenize import WhitespaceTokenizer
    from nltk import download, stem
    # The statement below downloads the stop word list to
    # 'nltk_data/corpora/stopwords/' in your home directory
    download('stopwords')
    from nltk.corpus import stopwords

    The download statement downloads the stop word list to nltk_data/corpora/stopwords/ in your system's home directory.

  3. Create methods to perform the various NLP tasks:

    Activity 2.01.ipynb

    def load_file(file_path):
        news = ''.join\
                  ([line for line in open...
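The methods defined in this step ultimately rank keywords by how often they occur once stopwords are removed. A minimal sketch of that ranking idea (the helper name and stopword set are invented for illustration; the book's solution uses NLTK's tokenizer, stemmer, and stopword list):

```python
from collections import Counter

# Tiny illustrative stopword set; the real solution uses NLTK's list.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "as", "on"}

def top_keywords(text, n=2):
    """Return the n most frequent non-stopword tokens."""
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOPWORDS]
    return [word for word, count in Counter(tokens).most_common(n)]

print(top_keywords("the economy grew as the economy added jobs and markets rose", n=1))
# ['economy']
```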

3. Developing a Text Classifier

Activity 3.01: Developing End-to-End Text Classifiers

Solution

The following steps will help you implement this activity:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary packages:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')
    import warnings
    import string
    import re
    warnings.filterwarnings('ignore')
    from sklearn.metrics import accuracy_score, roc_curve, \
    classification_report, confusion_matrix, \
    precision_recall_curve, auc
  3. Read a data file. It has three columns: is_political, headline, and short_description...
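The classifier in this activity is trained on TF-IDF features. To make those feature values concrete, here is a compact textbook-style TF-IDF computation in plain Python (scikit-learn's TfidfVectorizer uses a smoothed variant of this formula, so its exact numbers differ):

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t))."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (tf[t] / len(tokens)) * math.log(N / df[t])
                        for t in tf})
    return weights

w = tfidf(["rates rise again", "markets fall again"])
# "again" occurs in every document, so its idf (and hence weight) is zero,
# while terms unique to one document get a positive weight.
```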

4. Collecting Text Data with Web Scraping and APIs

Activity 4.01: Extracting Information from an Online HTML Page

Solution

Let's extract the data from an online source and analyze it. Follow these steps to implement this activity:

  1. Open a Jupyter Notebook.
  2. Import the requests and BeautifulSoup libraries. Fetch the URL with requests using the following command, then parse the fetched content with BeautifulSoup's HTML parser. Add the following code to do this:
    import requests
    from bs4 import BeautifulSoup
    r = requests\
        .get('https://en.wikipedia.org/wiki/Rabindranath_Tagore')
    r.status_code
    soup = BeautifulSoup(r.text, 'html.parser')
  3. To extract the list of headings, inspect which HTML element holds each bold headline in the Works section. You can see that they belong to the h3 tag. We only need the first six headings here. Look for a span tag with a class attribute using the following set of commands...
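With BeautifulSoup, collecting the headings amounts to calls such as soup.find_all('h3') and reading each tag's text. To show what such parsing does under the hood, here is a stdlib-only sketch that collects h3 text with html.parser (the sample markup string is invented for illustration):

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text content of every <h3> element."""
    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3 and data.strip():
            self.headings.append(data.strip())

# Invented sample markup loosely resembling a Wikipedia "Works" section
page = "<h2>Life</h2><h3>Poetry</h3><p>Some prose.</p><h3>Novels</h3>"
parser = HeadingCollector()
parser.feed(page)
print(parser.headings)  # ['Poetry', 'Novels']
```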

5. Topic Modeling

Activity 5.01: Topic-Modeling Jeopardy Questions

Solution

Let's perform topic modeling on the dataset of Jeopardy questions:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import pandas and other libraries:
    import numpy as np
    import spacy
    nlp = spacy.load('en_core_web_sm')
    import pandas as pd
    pd.set_option('display.max_colwidth', 800)
  3. After downloading the data, extract it and place it at the location shown in the following code. Then load the Jeopardy CSV file into a pandas DataFrame. Insert a new cell and add the following code:
    JEOPARDY_CSV =  '../data/jeopardy/Jeopardy.csv'
    questions = pd.read_csv(JEOPARDY_CSV)
    questions.columns = [x.strip() for x in questions.columns]
  4. The data in the DataFrame is not clean. In order to clean it, remove records that have missing values in the Question column. Add the following code to do this:
    questions = questions.dropna(subset=['Question'])
  5. Find...
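The dropna call keeps only the rows whose Question value is present. In plain Python terms, it behaves like the following filter (the sample records are invented for illustration):

```python
# Toy records standing in for DataFrame rows; None marks a missing value.
rows = [
    {"Question": "This French city is home to the Louvre.", "Category": "GEOGRAPHY"},
    {"Question": None, "Category": "HISTORY"},
    {"Question": "He wrote Gitanjali.", "Category": "LITERATURE"},
]

# Equivalent of questions.dropna(subset=['Question'])
cleaned = [row for row in rows if row["Question"] is not None]
print(len(cleaned))  # 2
```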

6. Vector Representation

Activity 6.01: Finding Similar News Article Using Document Vectors

Solution

Follow these steps to complete this activity:

  1. Open a Jupyter Notebook. Insert a new cell and add the following code to import all necessary libraries:
    import warnings
    warnings.filterwarnings("ignore")
    from gensim.models import Doc2Vec
    import pandas as pd
    from gensim.parsing.preprocessing import preprocess_string, \
    remove_stopwords 
  2. Now set the path of the news file:
    news_file = '../data/sample_news_data.txt'
  3. After that, you need to iterate over each headline in the file and split the columns, then create a DataFrame containing the headlines. Insert a new cell and add the following code to implement this:
    with open(news_file, encoding="utf8", errors='ignore') as f:
        news_lines = [line for line in f.readlines()]
    lines_df = pd.DataFrame()
    indices  = list(range(len(news_lines)))
    lines_df['news'] = news_lines...
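Doc2Vec will later map each headline to a dense vector, and similar articles are found by comparing those vectors, most commonly with cosine similarity. A small sketch of that comparison on hand-made word-count vectors (not gensim's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hand-made word-count vectors over a shared vocabulary
doc_a = [2, 1, 0, 1]
doc_b = [2, 1, 0, 1]   # same direction as doc_a: similarity close to 1
doc_c = [0, 0, 3, 0]   # no shared terms with doc_a: similarity 0
print(cosine(doc_a, doc_b), cosine(doc_a, doc_c))
```

Doc2Vec's learned vectors capture meaning beyond exact word overlap, but the similarity search itself works the same way.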

7. Text Generation and Summarization

Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset

Solution

Follow these steps to complete this activity:

  1. Open a Jupyter Notebook and insert a new cell. Add the following code to import the required libraries:
    import warnings
    warnings.filterwarnings('ignore')
    import os
    import csv
    import pandas as pd
    from gensim.summarization import summarize
  2. Insert a new cell and add the following code to fetch the Consumer Complaints dataset and consider the rows that have a complaint narrative. Drop all the columns other than Product, Sub-product, Issue, Sub-issue, and Consumer complaint narrative:
    complaints_pathname = '../data/consumercomplaints/'\
                          'Consumer_Complaints.csv'
    df_all_complaints = pd.read_csv(complaints_pathname)
    df_all_narr = df_all_complaints...
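gensim's summarize is extractive: it scores the sentences of a text and returns the highest-scoring ones verbatim. As a crude illustration of the idea, here is a naive frequency-based scorer (an invented stand-in, not gensim's TextRank implementation):

```python
import re
from collections import Counter

def naive_summarize(text, n=1):
    """Pick the n sentences whose words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freqs = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sentence):
        return sum(freqs[w] for w in re.findall(r"[a-z]+", sentence.lower()))
    return " ".join(sorted(sentences, key=score, reverse=True)[:n])

# Invented sample standing in for a consumer complaint narrative
complaint = ("The bank charged a fee. The bank charged the fee twice "
             "and the bank never refunded the fee. I called once.")
print(naive_summarize(complaint))  # picks the middle, most word-dense sentence
```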

8. Sentiment Analysis

Activity 8.01: Tweet Sentiment Analysis Using the textblob library

Solution

To perform sentiment analysis on the given set of tweets related to airlines, follow these steps:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    import pandas as pd
    from textblob import TextBlob
    import re
  3. Since we are displaying the text in the notebook, we want to increase the display width for our DataFrame. Insert a new cell and add the following code to implement this:
    pd.set_option('display.max_colwidth', 240)
  4. Now, load the airline-tweets.csv dataset. We will read this CSV file using pandas' read_csv() function. Insert a new cell and add the following code to implement this:
    tweets = pd.read_csv('data/airline-tweets.csv')
  5. Insert a new cell and add the following code to view the first 10 records of the DataFrame:
    tweets.head(10)

    The code generates the following output:

    Figure 8...
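TextBlob assigns each tweet a polarity score in the range [-1, 1] using a sentiment lexicon. The following is a minimal lexicon-based stand-in that shows the idea (the word sets are tiny invented samples, nothing like TextBlob's actual lexicon or its more nuanced scoring):

```python
POSITIVE = {"great", "good", "awesome", "thanks", "love"}
NEGATIVE = {"bad", "terrible", "delayed", "worst", "hate"}

def polarity(text):
    """Crude polarity in [-1, 1]: (positive - negative) / total words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / max(len(words), 1)

print(polarity("Great flight, thanks!"))           # positive score
print(polarity("Worst airline, flight delayed."))  # negative score
```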
