Python Natural Language Processing Cookbook

By Zhenya Antić
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Chapter 2: Playing with Grammar

About this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization.

Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data.

By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.

Publication date:
March 2021
Publisher
Packt
Pages
284
ISBN
9781838987312

 

Chapter 2: Playing with Grammar

Grammar is one of the main building blocks of language. Each human language, and programming language for that matter, has a set of rules that every person speaking it has to follow because otherwise, they risk not being understood. These grammatical rules can be uncovered using NLP and are useful for extracting data from sentences. For example, using information about the grammatical structure of text, we can parse out subjects, objects, and relationships between different entities.

In this chapter, you will learn how to use different packages to reveal the grammatical structure of words and sentences, as well as extract certain parts of sentences. We will cover the following topics:

  • Counting nouns – plural and singular nouns
  • Getting the dependency parse
  • Splitting sentences into clauses
  • Extracting noun chunks
  • Extracting entities and relations
  • Extracting subjects and objects of the sentence
  • Finding references...
 

Technical requirements

Follow these steps to install the packages and models required for this chapter:

pip install inflect
python -m spacy download en_core_web_md
pip install textacy

For the Finding references: anaphora resolution recipe, we have to install the neuralcoref package. To install this package, use the following command:

pip install neuralcoref

In case, when running the code, you encounter errors that mention spacy.strings.StringStore size changed, you might need to install neuralcoref from the source:

pip uninstall neuralcoref
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e

For more information about installation and usage, see https://github.com/huggingface/neuralcoref.

 

Counting nouns – plural and singular nouns

In this recipe, we will do two things:

  • Determine whether a noun is plural or singular
  • Turn plural nouns into singular nouns and vice versa

You might need these two things in a variety of tasks: in making your chatbot speak in grammatically correct sentences, in coming up with text classification features, and so on.

Getting ready

We will be using nltk for this task, as well as the inflect module we described in Technical requirements section. The code for this chapter is located in the Chapter02 directory of this book's GitHub repository. We will be working with the first part of the Adventures of Sherlock Holmes text, available in the sherlock_holmes_1.txt file.

How to do it…

We will be using code from Chapter 1, Learning NLP Basics, to tokenize the text into words and tag them with parts of speech. Then, we will use one of two ways to determine if a noun is singular or plural, and then use...

 

Getting the dependency parse

A dependency parse is a tool that shows dependencies in a sentence. For example, in the sentence The cat wore a hat, the root of the sentence in the verb, wore, and both the subject, the cat, and the object, a hat, are dependents. The dependency parse can be very useful in many NLP tasks since it shows the grammatical structure of the sentence, along with the subject, the main verb, the object, and so on. It can be then used in downstream processing.

Getting ready

We will use spacy to create the dependency parse. If you already downloaded it while working on the previous chapter, you do not need to do anything more. Otherwise, please follow the instructions at the beginning of Chapter 1, Learning NLP Basics, to install the necessary packages.

How to do it…

We will take a few sentences from the sherlock_holmes1.txt file to illustrate the dependency parse. The steps are as follows:

  1. Import spacy:
    import spacy
  2. Load the sentence...
 

Splitting sentences into clauses

When we work with text, we frequently deal with compound (sentences with two parts that are equally important) and complex sentences (sentences with one part depending on another). It is sometimes useful to split these composite sentences into its component clauses for easier processing down the line. This recipe uses the dependency parse from the previous recipe.

Getting ready

You will only need the spacy package in this recipe.

How to do it…

We will work with two sentences, He eats cheese, but he won't eat ice cream and If it rains later, we won't be able to go to the park. Other sentences may turn out to be more complicated to deal with, and I leave it as an exercise for you to split such sentences. Follow these steps:

  1. Import the spacy package:
    import spacy
  2. Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
  3. Set the sentence to He eats cheese, but he won't eat ice cream:
    sentence = &quot...
 

Extracting noun chunks

Noun chunks are known in linguistics as noun phrases. They represent nouns and any words that depend on and accompany nouns. For example, in the sentence The big red apple fell on the scared cat, the noun chunks are the big red apple and the scared cat. Extracting these noun chunks is instrumental to many other downstream NLP tasks, such as named entity recognition and processing entities and relationships between them. In this recipe, we will explore how to extract named entities from a piece of text.

Getting ready

We will be using the spacy package, which has a function for extracting noun chunks and the text from the sherlock_holmes_1.txt file as an example.

In this section, we will use another spaCy language model, en_core_web_md. Follow the instructions in the Technical requirements section to learn how to download it.

How to do it…

Use the following steps to get the noun chunks from a piece of text:

  1. Import the spacy package...
 

Extracting entities and relations

It is possible to extract triplets of the subject entity-relation-object entity from documents, which are frequently used in knowledge graphs. These triplets can then be analyzed for further relations and inform other NLP tasks, such as searches.

Getting ready

For this recipe, we will need another Python package based on spacy, called textacy. The main advantage of this package is that it allows regular expression-like searching for tokens based on their part of speech tags. See the installation instructions in the Technical requirements section at the beginning of this chapter for more information.

How to do it…

We will find all verb phrases in the text, as well as all the noun phrases (see the previous section). Then, we will find the left noun phrase (subject) and the right noun phrase (object) that relate to a particular verb phrase. We will use two simple sentences, All living things are made of cells and Cells have organelles...

 

Extracting subjects and objects of the sentence

Sometimes, we might need to find the subject and direct objects of the sentence, and that can easily be accomplished with the spacy package.

Getting ready

We will be using the dependency tags from spacy to find subjects and objects.

How to do it…

We will use the subtree attribute of tokens to find the complete noun chunk that is the subject or direct object of the verb (see the Getting the dependency parse recipe for more information). Let's get started:

  1. Import spacy:
    import spacy
  2. Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
  3. We will get the list of sentences we will be processing:
    sentences=["The big black cat stared at the small dog.",
               "Jane watched her brother in the evenings."]
  4. We will use two functions to find the subject and the direct object of the sentence. These functions will loop...
 

Finding references – anaphora resolution

When we work on problems of extracting entities and relations from text (see the Extracting entities and relations recipe), we are faced with real text, and many of our entities might end up being extracted as pronouns, such as she or him. In order to tackle this issue, we need to perform anaphora resolution, or the process of substituting the pronouns with their referents.

Getting ready

For this task, we will be using a spaCy extension written by Hugging Face called neuralcoref (see https://github.com/huggingface/neuralcoref). As the name suggests, it uses neural networks to resolve pronouns. To install the package, use the following command:

pip install neuralcoref

How to do it…

Your steps should be formatted like so:

  1. Import spacy and neuralcoref:
    import spacy
    import neuralcoref
  2. Load the spaCy engine and add neuralcoref to its pipeline:
    nlp = spacy.load('en_core_web_sm')
    neuralcoref.add_to_pipe...

About the Author

  • Zhenya Antić

    Zhenya Antić is a Natural Language Processing (NLP) professional working at Practical Linguistics Inc. She helps businesses to improve processes and increase productivity by automating text processing. Zhenya holds a PhD in linguistics from University of California Berkeley and a BS in computer science from Massachusetts Institute of Technology.

    Browse publications by this author
Book Title
Unlock this book and the full library for only $5/m
Access now