The Natural Language Processing Workshop

Product type Book
Published in Aug 2020
Publisher Packt
ISBN-13 9781800208421
Pages 452
Edition 1st Edition
Authors (6): Rohan Chopra, Aniruddha M. Godbole, Nipun Sadvilkar, Muzaffar Bashir Shah, Sohom Ghosh, Dwight Gunning

Table of Contents (10 Chapters)

Preface
1. Introduction to Natural Language Processing
2. Feature Extraction Methods
3. Developing a Text Classifier
4. Collecting Text Data with Web Scraping and APIs
5. Topic Modeling
6. Vector Representation
7. Text Generation and Summarization
8. Sentiment Analysis
Appendix

About the Book

Do you want to learn how to communicate with computer systems using Natural Language Processing (NLP) techniques, or make a machine understand human sentiments? Do you want to build applications like Siri, Alexa, or chatbots, even if you've never done it before?

With The Natural Language Processing Workshop, you can expect to make consistent progress as a beginner, and get up to speed in an interactive way, with the help of hands-on activities and fun exercises.

The book starts with an introduction to NLP. You'll study different approaches to NLP tasks, and perform exercises in Python to understand the process of preparing datasets for NLP models. Next, you'll use advanced NLP algorithms and visualization techniques to collect datasets from open websites, and to summarize and generate random text from a document. In the final chapters, you'll use NLP to create a chatbot that detects positive or negative sentiment in text documents such as movie reviews. 

By the end of this book, you'll be equipped with the essential NLP tools and techniques you need to solve common business problems that involve processing text.

Audience

This book is for beginner to mid-level data scientists, machine learning developers, and NLP enthusiasts. A basic understanding of machine learning and NLP is required to help you grasp the topics in this workshop more quickly.

About the Chapters

Chapter 1, Introduction to Natural Language Processing, starts by defining natural language processing and the different types of natural language processing tasks, using practical examples for each type. This chapter also covers the process of structuring and implementing a natural language processing project.

Chapter 2, Feature Extraction Methods, covers basic feature extraction methods from unstructured text. These include tokenization, stemming, lemmatization, and stopword removal. We also discuss observations we might see from these extraction methods and introduce Zipf's Law. Finally, we discuss the Bag of Words model and Term Frequency-Inverse Document Frequency (TF-IDF).
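The chapter's core ideas can be previewed in a few lines of plain Python. The sketch below uses only the standard library, with invented toy documents and a toy stopword list (the book's own exercises use dedicated NLP libraries); it shows tokenization by whitespace, stopword removal, a Bag of Words count, and a hand-rolled TF-IDF score:

```python
import math
from collections import Counter

# Toy corpus and stopword list, invented purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats are friends",
]
stopwords = {"the", "on", "and", "are"}

# Tokenization (split on whitespace) plus stopword removal
tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]

# Bag of Words: raw term counts per document
bags = [Counter(tokens) for tokens in tokenized]

def tf_idf(term, bag, bags):
    """Term frequency scaled by inverse document frequency."""
    tf = bag[term] / sum(bag.values())
    df = sum(1 for b in bags if term in b)
    idf = math.log(len(bags) / df)
    return tf * idf

# A term appearing in fewer documents gets a higher weight
print(tf_idf("cat", bags[0], bags))
print(tf_idf("sat", bags[0], bags))
```

Note how "cat" (in one document) outweighs "sat" (in two): IDF rewards terms that discriminate between documents.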

Chapter 3, Developing a Text Classifier, teaches you how to create a simple text classifier with feature extraction methods covered in the previous chapters.

Chapter 4, Collecting Text Data with Web Scraping and APIs, introduces you to web scraping and discusses various methods of collecting and processing text data from online sources, such as HTML and XML files and APIs.
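As a taste of the parsing step, here is a minimal sketch using Python's standard-library html.parser. The HTML string is a made-up stand-in for a page you would normally download, so the example runs offline; it is an illustration of the idea, not the book's own code:

```python
from html.parser import HTMLParser

# A stand-in for HTML you would normally fetch from a website
page = "<html><body><h1>NLP News</h1><p>Tokenization made easy.</p></body></html>"

class TextExtractor(HTMLParser):
    """Collect the text content found between tags, skipping whitespace-only runs."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(page)
print(parser.chunks)  # ['NLP News', 'Tokenization made easy.']
```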

Chapter 5, Topic Modeling, introduces topic modeling, an unsupervised natural language processing technique that groups documents according to topic. You will see how this is done using Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Hierarchical Dirichlet Processes (HDP).

Chapter 6, Vector Representation, discusses the importance of representing text as vectors, and various vector representations, such as Word2Vec and Doc2Vec.
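The payoff of vector representations is that geometric closeness can stand in for semantic similarity, usually measured with cosine similarity. A minimal sketch with hand-made three-dimensional vectors (real Word2Vec or Doc2Vec vectors are learned from large corpora; these numbers are invented purely for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors; a trained model would produce these automatically
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high: similar words
print(cosine_similarity(vectors["king"], vectors["apple"]))  # low: unrelated words
```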

Chapter 7, Text Generation and Summarization, teaches you two simple natural language processing tasks: creating text summaries and generating random text with statistical assumptions and algorithms.

Chapter 8, Sentiment Analysis, teaches you how to detect sentiment in text, using simple techniques. Sentiment analysis is the use of computer algorithms to detect whether the sentiment of text is positive or negative.
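One of the simplest such techniques is lexicon counting. The sketch below uses an invented toy lexicon (the chapter covers more capable approaches) and classifies a text by the balance of positive and negative words it contains:

```python
# Invented toy lexicon; real systems use curated lexicons or learned models
POSITIVE = {"great", "good", "wonderful", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "boring", "hate", "awful"}

def sentiment(text):
    """Return 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("A wonderful movie, I love it"))  # positive
print(sentiment("Boring and terrible"))           # negative
```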

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We find that the summary for the Wikipedia article is much more coherent than the short story. We can also see that the summary with a ratio of 0.20 is a subset of a summary with a ratio of 0.25."

Words that you see on the screen, for example, in menus or dialog boxes, also appear in the text like this: "On this page, click on the Keys option to access the secret keys."

A block of code is set as follows:

import re

text_after_twenty = text_after_twenty.replace('\n', ' ')
text_after_twenty = re.sub(r"\s+", " ", text_after_twenty)

New terms and important words are shown like this: "A Markov chain consists of a state space and a specific type of successor function."
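Although the sentence above appears only to illustrate formatting, the concept it quotes is easy to make concrete. The following sketch (training sentence and seed invented for illustration) treats each word as a state and the words observed after it as the successor function:

```python
import random
from collections import defaultdict

# Invented training text; a real model would use a large corpus
corpus = "the cat sat on the mat and the cat ran"
words = corpus.split()

# Successor function: map each state (word) to the words seen after it
successors = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    successors[current].append(nxt)

def generate(start, length, seed=0):
    """Walk the chain: repeatedly sample a successor of the current state."""
    rng = random.Random(seed)
    state, out = start, [start]
    for _ in range(length - 1):
        if state not in successors:  # reached a word with no observed successor
            break
        state = rng.choice(successors[state])
        out.append(state)
    return " ".join(out)

print(generate("the", 6))
```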

Long code snippets are truncated and the corresponding names of the code files on GitHub are placed at the top of the truncated code. The permalinks to the entire code are placed below the code snippet. It should look as follows:

Exercise 7.01.ipynb

HANDLE = '@\w+\n'
LINK = 'https?://t\.co/\w+'
SPECIAL_CHARS = '&lt;|<|&amp;|#'
PARA = '\n+'
def clean(text):
    text = re.sub(HANDLE, ' ', text)
    text = re.sub(LINK, ' ', text)
    text = re.sub(SPECIAL_CHARS, ' ', text)
    text = re.sub(PARA, '\n', text)

Code Presentation

Long lines of code are split across multiple lines using a backslash ( \ ). When the code is executed, Python ignores the backslash and treats the code on the next line as a direct continuation of the current line.

For example:

history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)

Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])

Multi-line comments are enclosed by triple quotes, as shown below:

"""
Define a seed for the random number generator to ensure the 
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.seed(seed)

Setting up Your Environment

Before we explore the book in detail, we need to set up specific software and tools. The following section explains how to do that.

Installation and Setup

Jupyter notebooks are available once you install Anaconda on your system. Anaconda can be installed for Windows systems using the steps available at https://docs.anaconda.com/anaconda/install/windows/.

For other systems, navigate to the respective installation guide from https://docs.anaconda.com/anaconda/install/.

By default, these installations use the C drive of your system; you can choose to change the destination during setup.

Installing the Required Libraries

Open Anaconda Prompt and follow the steps given here to get your system ready. We will create a new environment on Anaconda where we will install all the required libraries and run our code:

  1. To create a new environment, run the following command:
    conda create --name nlp
  2. To activate the environment, type the following:
    conda activate nlp

    For this course, whenever you are asked to open a terminal, you need to open Anaconda Prompt, activate the environment, and then proceed.

  3. To install all the libraries, download the environment file from https://packt.live/30qfL9V and run the following command:
    pip install -r requirements.txt
  4. Jupyter notebooks allow us to run code and experiment with code blocks. To start Jupyter Notebook, run the following inside the nlp environment:
    jupyter notebook

    A new browser window will open up with the Jupyter interface. You can now navigate to the project location and run Jupyter Notebook.

Installing Libraries

pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install -r requirements.txt. You can find the requirements.txt file at https://packt.live/39RZuOh.

The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.

Accessing the Code Files

You can find the complete code files of this book at https://packt.live/3fJ4qap. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/3gwk4WQ.

We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.

If you have any issues or questions about installation, please email us at workshops@packt.com.
