Reader small image

You're reading from  The Natural Language Processing Workshop

Product typeBook
Published inAug 2020
Reading LevelIntermediate
PublisherPackt
ISBN-139781800208421
Edition1st Edition
Languages
Tools
Right arrow
Authors (6):
Rohan Chopra
Rohan Chopra
author image
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelors degree in computer science. Rohan has an experience of more than 2 years in designing, implementing, and optimizing end-to-end deep neural network systems. His research is centered around the use of deep learning to solve computer vision-related problems and has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.
Read more about Rohan Chopra

Aniruddha M. Godbole
Aniruddha M. Godbole
author image
Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with inter-disciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and has done MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion pages contributor to Mint, which is a leading business newspaper in India. He has fifteen years of experience.
Read more about Aniruddha M. Godbole

Nipun Sadvilkar
Nipun Sadvilkar
author image
Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at US healthcare company leading a team of data scientists and subject matter expertise to design and build the clinical NLP engine to revamp medical coding workflows, enhance coder efficiency, and accelerate revenue cycle. He has experience of more than 3 years in building NLP solutions and web-based data science platforms in the area of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering with a fair understanding of the business domain. He is a member of the regional and national python community. He is author of pySBD - an NLP open-source python library for sentence segmentation which is recognized by ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.
Read more about Nipun Sadvilkar

Muzaffar Bashir Shah
Muzaffar Bashir Shah
author image
Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a masters degree in computer science from the University of Kashmir and is currently working in a Bangalore based startup named Datoin.
Read more about Muzaffar Bashir Shah

Sohom Ghosh
Sohom Ghosh
author image
Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena with a specialization in deep learning-based text analytics, NLP, and recommendation systems. He has publications in several international conferences and journals.
Read more about Sohom Ghosh

Dwight Gunning
Dwight Gunning
author image
Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools such as NLTK, gensim, and spacy.
Read more about Dwight Gunning

View More author details
Right arrow

About the Book

Do you want to learn how to communicate with computer systems using Natural Language Processing (NLP) techniques, or make a machine understand human sentiments? Do you want to build applications like Siri, Alexa, or chatbots, even if you've never done it before?

With The Natural Language Processing Workshop, you can expect to make consistent progress as a beginner, and get up to speed in an interactive way, with the help of hands-on activities and fun exercises.

The book starts with an introduction to NLP. You'll study different approaches to NLP tasks, and perform exercises in Python to understand the process of preparing datasets for NLP models. Next, you'll use advanced NLP algorithms and visualization techniques to collect datasets from open websites, and to summarize and generate random text from a document. In the final chapters, you'll use NLP to create a chatbot that detects positive or negative sentiment in text documents such as movie reviews. 

By the end of this book, you'll be equipped with the essential NLP tools and techniques you need to solve common business problems that involve processing text.

Audience

This book is for beginner to mid-level data scientists, machine learning developers, and NLP enthusiasts. A basic understanding of machine learning and NLP is required to help you grasp the topics in this workshop more quickly.

About the Chapters

Chapter 1, Introduction to Natural Language Processing, starts by defining natural language processing and the different types of natural language processing tasks, using practical examples for each type. This chapter also covers the process of structuring and implementing a natural language processing project.

Chapter 2, Feature Extraction Methods, covers basic feature extraction methods from unstructured text. These include tokenization, stemming, lemmatization, and stopword removal. We also discuss observations we might see from these extraction methods and introduce Zipf's Law. Finally, we discuss the Bag of Words model and Term Frequency-Inverse Document Frequency (TF-IDF).

Chapter 3, Developing a Text Classifier, teaches you how to create a simple text classifier with feature extraction methods covered in the previous chapters.

Chapter 4, Collecting Text Data with Web Scraping and APIs, introduces you to web scraping and discusses various methods of collecting and processing text data from online sources, such as HTML and XML files and APIs.

Chapter 5, Topic Modeling, introduces topic modeling, an unsupervised natural language processing technique that groups documents according to topic. You will see how this is done using Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Hierarchical Dirichlet Processes (HDP).

Chapter 6, Vector Representation, discusses the importance of representing text as vectors, and various vector representations, such as Word2Vec and Doc2Vec.

Chapter 7, Text Generation and Summarization, teaches you two simple natural language processing tasks: creating text summaries and generating random text with statistical assumptions and algorithms.

Chapter 8, Sentiment Analysis, teaches you how to detect sentiment in text, using simple techniques. Sentiment analysis is the use of computer algorithms to detect whether the sentiment of text is positive or negative.

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We find that the summary for the Wikipedia article is much more coherent than the short story. We can also see that the summary with a ratio of 0.20 is a subset of a summary with a ratio of 0.25."

Words that you see on the screen, for example, in menus or dialog boxes, also appear in the text like this: "On this page, click on Keys option to access the secret keys."

A block of code is set as follows:

text_after_twenty=text_after_twenty.replace('\n',' ')
text_after_twenty=re.sub(r"\s+"," ",text_after_twenty)

New terms and important words are shown like this: "A Markov chain consists of a state space and a specific type of successor function."

Long code snippets are truncated and the corresponding names of the code files on GitHub are placed at the top of the truncated code. The permalinks to the entire code are placed below the code snippet. It should look as follows:

Exercise 7.01.ipynb

1  HANDLE = '@\w+\n'
2  LINK = 'https?://t\.co/\w+'
3  SPECIAL_CHARS = '<|<|&|#'
4  PARA='\n+'
5  def clean(text):
6      #text = re.sub(HANDLE, ' ', text)
7      text = re.sub(LINK, ' ', text)
8      text = re.sub(SPECIAL_CHARS, ' ', text)
9      text = re.sub(PARA, '\n', text)

Code Presentation

Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

For example:

history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)

Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])

Multi-line comments are enclosed by triple quotes, as shown below:

"""
Define a seed for the random number generator to ensure the 
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.set_seed(seed)

Setting up Your Environment

Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.

Installation and Setup

Jupyter notebooks are available once you install Anaconda on your system. Anaconda can be installed for Windows systems using the steps available at https://docs.anaconda.com/anaconda/install/windows/.

For other systems, navigate to the respective installation guide from https://docs.anaconda.com/anaconda/install/.

These installations will be executed in the C drive of your system. You can choose to change the destination.

Installing the Required Libraries

Open Anaconda Prompt and follow the steps given here to get your system ready. We will create a new environment on Anaconda where we will install all the required libraries and run our code:

  1. To create a new environment, run the following command:
    conda create --name nlp
  2. To activate the environment, type the following:
    conda activate nlp

    For this course, whenever you are asked to open a terminal, you need to open Anaconda Prompt, activate the environment, and then proceed.

  3. To install all the libraries, download the environment file from https://packt.live/30qfL9V and run the following command:
    pip install -f requirements.txt
  4. Jupyter notebooks allow us to run code and experiment with code blocks. To start Jupyter Notebook, run the following inside the nlp environment:
    jupyter notebook

    A new browser window will open up with the Jupyter interface. You can now navigate to the project location and run Jupyter Notebook.

Installing Libraries

pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install –r requirements.txt. You can find the requirements.txt file at https://packt.live/39RZuOh.

The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.

Accessing the Code Files

You can find the complete code files of this book at https://packt.live/3fJ4qap. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/3gwk4WQ.

We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.

If you have any issues or questions about installation, please email us at workshops@packt.com.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
The Natural Language Processing Workshop
Published in: Aug 2020Publisher: PacktISBN-13: 9781800208421
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Authors (6)

author image
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelors degree in computer science. Rohan has an experience of more than 2 years in designing, implementing, and optimizing end-to-end deep neural network systems. His research is centered around the use of deep learning to solve computer vision-related problems and has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.
Read more about Rohan Chopra

author image
Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with inter-disciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and has done MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion pages contributor to Mint, which is a leading business newspaper in India. He has fifteen years of experience.
Read more about Aniruddha M. Godbole

author image
Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at US healthcare company leading a team of data scientists and subject matter expertise to design and build the clinical NLP engine to revamp medical coding workflows, enhance coder efficiency, and accelerate revenue cycle. He has experience of more than 3 years in building NLP solutions and web-based data science platforms in the area of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering with a fair understanding of the business domain. He is a member of the regional and national python community. He is author of pySBD - an NLP open-source python library for sentence segmentation which is recognized by ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.
Read more about Nipun Sadvilkar

author image
Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a masters degree in computer science from the University of Kashmir and is currently working in a Bangalore based startup named Datoin.
Read more about Muzaffar Bashir Shah

author image
Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena with a specialization in deep learning-based text analytics, NLP, and recommendation systems. He has publications in several international conferences and journals.
Read more about Sohom Ghosh

author image
Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools such as NLTK, gensim, and spacy.
Read more about Dwight Gunning