The Natural Language Processing Workshop

Product type Book
Published in Aug 2020
Publisher Packt
ISBN-13 9781800208421
Pages 452
Edition 1st Edition
Authors (6): Rohan Chopra, Aniruddha M. Godbole, Nipun Sadvilkar, Muzaffar Bashir Shah, Sohom Ghosh, Dwight Gunning

Table of Contents (10 Chapters)

Preface
1. Introduction to Natural Language Processing
2. Feature Extraction Methods
3. Developing a Text Classifier
4. Collecting Text Data with Web Scraping and APIs
5. Topic Modeling
6. Vector Representation
7. Text Generation and Summarization
8. Sentiment Analysis
Appendix

About the Book

Do you want to learn how to communicate with computer systems using Natural Language Processing (NLP) techniques, or make a machine understand human sentiments? Do you want to build applications like Siri, Alexa, or chatbots, even if you've never done it before?

With The Natural Language Processing Workshop, you can expect to make consistent progress as a beginner, and get up to speed in an interactive way, with the help of hands-on activities and fun exercises.

The book starts with an introduction to NLP. You'll study different approaches to NLP tasks, and perform exercises in Python to understand the process of preparing datasets for NLP models. Next, you'll use advanced NLP algorithms and visualization techniques to collect datasets from open websites, and to summarize and generate random text from a document. In the final chapters, you'll use NLP to create a chatbot that detects positive or negative sentiment in text documents such as movie reviews. 

By the end of this book, you'll be equipped with the essential NLP tools and techniques you need to solve common business problems that involve processing text.

Audience

This book is for beginner to mid-level data scientists, machine learning developers, and NLP enthusiasts. A basic understanding of machine learning and NLP is required to help you grasp the topics in this workshop more quickly.

About the Chapters

Chapter 1, Introduction to Natural Language Processing, starts by defining natural language processing and the different types of natural language processing tasks, using practical examples for each type. This chapter also covers the process of structuring and implementing a natural language processing project.

Chapter 2, Feature Extraction Methods, covers basic feature extraction methods from unstructured text. These include tokenization, stemming, lemmatization, and stopword removal. We also discuss observations we might see from these extraction methods and introduce Zipf's Law. Finally, we discuss the Bag of Words model and Term Frequency-Inverse Document Frequency (TF-IDF).
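The chapter's core ideas can be previewed in a few lines of plain Python. The sketch below uses only the standard library, with invented toy documents and a toy stopword list (the book's own exercises use dedicated NLP libraries); it shows tokenization by whitespace, stopword removal, a Bag of Words count, and a hand-rolled TF-IDF score:

```python
import math
from collections import Counter

# Toy corpus and stopword list, invented purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats are friends",
]
stopwords = {"the", "on", "and", "are"}

# Tokenization (split on whitespace) plus stopword removal
tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]

# Bag of Words: raw term counts per document
bags = [Counter(tokens) for tokens in tokenized]

def tf_idf(term, bag, bags):
    """Term frequency scaled by inverse document frequency."""
    tf = bag[term] / sum(bag.values())
    df = sum(1 for b in bags if term in b)
    idf = math.log(len(bags) / df)
    return tf * idf

# A term appearing in fewer documents gets a higher weight
print(tf_idf("cat", bags[0], bags))
print(tf_idf("sat", bags[0], bags))
```

Note how "cat" (in one document) outweighs "sat" (in two): IDF rewards terms that discriminate between documents.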

Chapter 3, Developing a Text Classifier, teaches you how to create a simple text classifier with feature extraction methods covered in the previous chapters.

Chapter 4, Collecting Text Data with Web Scraping and APIs, introduces you to web scraping and discusses various methods of collecting and processing text data from online sources, such as HTML and XML files and APIs.
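As a taste of the parsing step, here is a minimal sketch using Python's standard-library html.parser. The HTML string is a made-up stand-in for a page you would normally download, so the example runs offline; it is an illustration of the idea, not the book's own code:

```python
from html.parser import HTMLParser

# A stand-in for HTML you would normally fetch from a website
page = "<html><body><h1>NLP News</h1><p>Tokenization made easy.</p></body></html>"

class TextExtractor(HTMLParser):
    """Collect the text content found between tags, skipping whitespace-only runs."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(page)
print(parser.chunks)  # ['NLP News', 'Tokenization made easy.']
```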

Chapter 5, Topic Modeling, introduces topic modeling, an unsupervised natural language processing technique that groups documents according to topic. You will see how this is done using Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Hierarchical Dirichlet Processes (HDP).

Chapter 6, Vector Representation, discusses the importance of representing text as vectors, and various vector representations, such as Word2Vec and Doc2Vec.
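The payoff of vector representations is that geometric closeness can stand in for semantic similarity, usually measured with cosine similarity. A minimal sketch with hand-made three-dimensional vectors (real Word2Vec or Doc2Vec vectors are learned from large corpora; these numbers are invented purely for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors; a trained model would produce these automatically
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: similar words
print(cosine_similarity(vectors["king"], vectors["apple"]))  # low: unrelated words
```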

Chapter 7, Text Generation and Summarization, teaches you two simple natural language processing tasks: creating text summaries and generating random text with statistical assumptions and algorithms.

Chapter 8, Sentiment Analysis, teaches you how to detect sentiment in text, using simple techniques. Sentiment analysis is the use of computer algorithms to detect whether the sentiment of text is positive or negative.
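One of the simplest such techniques is lexicon counting. The sketch below uses an invented toy lexicon (the chapter covers more capable approaches) and classifies a text by the balance of positive and negative words it contains:

```python
# Invented toy lexicon; real systems use curated lexicons or learned models
POSITIVE = {"great", "good", "wonderful", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "boring", "hate", "awful"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("A wonderful movie, I love it"))  # positive
print(sentiment("Boring and terrible"))           # negative
```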

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We find that the summary for the Wikipedia article is much more coherent than the short story. We can also see that the summary with a ratio of 0.20 is a subset of a summary with a ratio of 0.25."

Words that you see on the screen, for example, in menus or dialog boxes, also appear in the text like this: "On this page, click on the Keys option to access the secret keys."

A block of code is set as follows:

import re

text_after_twenty = text_after_twenty.replace('\n', ' ')
text_after_twenty = re.sub(r"\s+", " ", text_after_twenty)

New terms and important words are shown like this: "A Markov chain consists of a state space and a specific type of successor function."
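Although the sentence above appears only to illustrate formatting, the concept it quotes is easy to make concrete. The following sketch (training sentence and seed invented for illustration) treats each word as a state and the words observed after it as the successor function:

```python
import random
from collections import defaultdict

# Invented training text; a real model would use a large corpus
corpus = "the cat sat on the mat and the cat ran"
words = corpus.split()

# Successor function: map each state (word) to the words seen after it
successors = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    successors[current].append(nxt)

def generate(start, length, seed=0):
    """Walk the chain: repeatedly sample a successor of the current state."""
    rng = random.Random(seed)
    state, out = start, [start]
    for _ in range(length - 1):
        if state not in successors:  # reached a word with no observed successor
            break
        state = rng.choice(successors[state])
        out.append(state)
    return " ".join(out)

print(generate("the", 6))
```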

Long code snippets are truncated and the corresponding names of the code files on GitHub are placed at the top of the truncated code. The permalinks to the entire code are placed below the code snippet. It should look as follows:

Exercise 7.01.ipynb

HANDLE = '@\w+\n'
LINK = 'https?://t\.co/\w+'
SPECIAL_CHARS = '&lt;|<|&amp;|#'
PARA = '\n+'
def clean(text):
    text = re.sub(HANDLE, ' ', text)
    text = re.sub(LINK, ' ', text)
    text = re.sub(SPECIAL_CHARS, ' ', text)
    text = re.sub(PARA, '\n', text)

Code Presentation

Long lines of code are split across multiple lines using a backslash ( \ ). When the code is executed, Python ignores the backslash and treats the code on the next line as a direct continuation of the current line.

For example:

history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)

Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])

Multi-line comments are enclosed by triple quotes, as shown below:

"""
Define a seed for the random number generator to ensure the 
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.seed(seed)

Setting up Your Environment

Before we explore the book in detail, we need to set up specific software and tools. The following section explains how to do that.

Installation and Setup

Jupyter notebooks are available once you install Anaconda on your system. Anaconda can be installed for Windows systems using the steps available at https://docs.anaconda.com/anaconda/install/windows/.

For other systems, navigate to the respective installation guide from https://docs.anaconda.com/anaconda/install/.

By default, these installations use the C drive of your system; you can choose to change the destination during setup.

Installing the Required Libraries

Open Anaconda Prompt and follow the steps given here to get your system ready. We will create a new environment on Anaconda where we will install all the required libraries and run our code:

  1. To create a new environment, run the following command:
    conda create --name nlp
  2. To activate the environment, type the following:
    conda activate nlp

    For this course, whenever you are asked to open a terminal, you need to open Anaconda Prompt, activate the environment, and then proceed.

  3. To install all the libraries, download the environment file from https://packt.live/30qfL9V and run the following command:
    pip install -r requirements.txt
  4. Jupyter notebooks allow us to run code and experiment with code blocks. To start Jupyter Notebook, run the following inside the nlp environment:
    jupyter notebook

    A new browser window will open up with the Jupyter interface. You can now navigate to the project location and run Jupyter Notebook.

Installing Libraries

pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install -r requirements.txt. You can find the requirements.txt file at https://packt.live/39RZuOh.

The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.

Accessing the Code Files

You can find the complete code files of this book at https://packt.live/3fJ4qap. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/3gwk4WQ.

We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.

If you have any issues or questions about installation, please email us at workshops@packt.com.
