Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Python Natural Language Processing Cookbook - Second Edition
Python Natural Language Processing Cookbook - Second Edition

Python Natural Language Processing Cookbook: Over 60 recipes for building powerful NLP solutions using Python and LLM libraries, Second Edition

Arrow left icon
Profile Icon Zhenya Antić Profile Icon Saurabh Chakravarty
Arrow right icon
$17.99 $35.99
Book Sep 2024 312 pages 2nd Edition
eBook
$17.99 $35.99
Print
$30.99 $44.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Zhenya Antić Profile Icon Saurabh Chakravarty
Arrow right icon
$17.99 $35.99
Book Sep 2024 312 pages 2nd Edition
eBook
$17.99 $35.99
Print
$30.99 $44.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$17.99 $35.99
Print
$30.99 $44.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon AI Assistant (beta) to help accelerate your learning
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Table of content icon View table of contents Preview book icon Preview Book

Python Natural Language Processing Cookbook - Second Edition

Playing with Grammar

Grammar is one of the main building blocks of language. Each human language, and programming language for that matter, has a set of rules that every person speaking it must follow, otherwise risking not being understood. These grammatical rules can be uncovered using NLP and are useful for extracting data from sentences. For example, using information about the grammatical structure of text, we can parse out subjects, objects, and relations between different entities.

In this chapter, you will learn how to use different packages to reveal the grammatical structure of words and sentences, as well as extract certain parts of sentences. These are the topics covered in this chapter:

  • Counting nouns – plural and singular nouns
  • Getting the dependency parse
  • Extracting noun chunks
  • Extracting the subjects and objects of the sentence
  • Finding patterns in text using grammatical information

Technical requirements

Please follow the installation requirements given in Chapter 1 to run the notebooks in this chapter.

Counting nouns – plural and singular nouns

In this recipe, we will do two things: determine whether a noun is plural or singular and turn plural nouns into singular, and vice versa.

You might need these two things for a variety of tasks. For example, you might want to count the word statistics, and for that, you most likely need to count the singular and plural nouns together. In order to count the plural nouns together with singular ones, you need a way to recognize that a word is plural or singular.

Getting ready

To determine whether a noun is singular or plural, we will use spaCy via two different methods: by looking at the difference between the lemma and the actual word and by looking at the morph attribute. To inflect these nouns, or turn singular nouns into plural or vice versa we will use the textblob package. We will also see how to determine the noun’s number using GPT-3 through the OpenAI API. The code for this section is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter02.

How to do it…

We will first use spaCy’s lemma information to infer whether a noun is singular or plural. Then, we will use the morph attribute of Token objects. We will then create a function that uses one of those methods. Finally, we will use GPT-3.5 to find out the number of nouns:

  1. Run the code in the file and language utility notebooks. If you run into an error saying that the small or large models do not exist, you need to open the lang_utils.ipynb file, uncomment, and run the statement that downloads the model:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Initialize the text variable and process it using the spaCy small model to get the resulting Doc object:
    text = "I have five birds"
    doc = small_model(text)
  3. In this step, we loop through the Doc object. For each token in the object, we check whether it’s a noun and whether the lemma is the same as the word itself. Since the lemma is the basic form of the word, if the lemma is different from the word, that token is plural:
    for token in doc:
        if (token.pos_ == "NOUN" and token.lemma_ != token.text):
            print(token.text, "plural")

    The result should be as follows:

    birds plural
  4. Now, we will check the number of a noun using a different method: the morph features of a Token object. The morph features are the morphological features of a word, such as number, case, and so on. Since we know that token 3 is a noun, we directly access the morph features and get the Number to get the same result as previously:
    doc = small_model("I have five birds.")
    print(doc[3].morph.get("Number"))

    Here is the result:

    ['Plur']
  5. In this step, we prepare to define a function that returns a tuple, (noun, number). In order to better encode the noun number, we use an Enum class that assigns numbers to different values. We assign 1 to singular and 2 to plural. Once we create the class, we can directly refer to the noun number variables as Noun_number.SINGULAR and Noun_number.PLURAL:
    class Noun_number(Enum):
        SINGULAR = 1
        PLURAL = 2
  6. In this step, we define the function. It takes as input the text, the spaCy model, and the method of determining the noun number. The two methods are lemma and morph, the same two methods we used in steps 3 and 4, respectively. The function outputs a list of tuples, each of the format (<noun text>, <noun number>), where the noun number is expressed using the Noun_number class defined in step 5:
    def get_nouns_number(text, model, method="lemma"):
        nouns = []
        doc = model(text)
        for token in doc:
            if (token.pos_ == "NOUN"):
                if method == "lemma":
                    if token.lemma_ != token.text:
                        nouns.append((token.text, 
                            Noun_number.PLURAL))
                    else:
                        nouns.append((token.text,
                            Noun_number.SINGULAR))
                elif method == "morph":
                    if token.morph.get("Number") == "Sing":
                        nouns.append((token.text,
                            Noun_number.PLURAL))
                    else:
                        nouns.append((token.text,
                            Noun_number.SINGULAR))
        return nouns
  7. We can use the preceding function and see its performance with different spaCy models. In this step, we use the small spaCy model with the function we just defined. Using both methods, we see that the spaCy model gets the number of the irregular noun geese incorrectly:
    text = "Three geese crossed the road"
    nouns = get_nouns_number(text, small_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, small_model)
    print(nouns)

    The result should be as follows:

    [('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
    [('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
  8. Now, let’s do the same using the large model. If you have not yet downloaded the large model, do so by running the first line. Otherwise, you can comment it out. Here, we see that although the morph method still incorrectly assigns singular to geese, the lemma method provides the correct answer:
    !python -m spacy download en_core_web_lg
    large_model = spacy.load("en_core_web_lg")
    nouns = get_nouns_number(text, large_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, large_model)
    print(nouns)

    The result should be as follows:

    [('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
    [('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]
  9. Let’s now use GPT-3.5 to get the noun number. In the results, we see that GPT-3.5 gives us an identical result and correctly identifies both the number for geese and the number for road:
    from openai import OpenAI
    client = OpenAI(api_key=OPEN_AI_KEY)
    prompt="""Decide whether each noun in the following text is singular or plural.
    Return the list in the format of a python tuple: (word, number). Do not provide any additional explanations.
    Sentence: Three geese crossed the road."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=256,
        top_p=1.0,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", "content": "You are a helpful 
                assistant."},
            {"role": "user", "content": prompt}
        ],
    )
    print(response.choices[0].message.content)

    The result should be as follows:

    ('geese', 'plural')
    ('road', 'singular')

There’s more…

We can also change the nouns from plural to singular, and vice versa. We will use the textblob package for that. The package should be installed automatically via the Poetry environment:

  1. Import the TextBlob class from the package:
    from textblob import TextBlob
  2. Initialize a list of text variables and process them using the TextBlob class via a list comprehension:
    texts = ["book", "goose", "pen", "point", "deer"]
    blob_objs = [TextBlob(text) for text in texts]
  3. Use the pluralize function of the object to get the plural. This function returns a list and we access its first element. Print the result:
    plurals = [blob_obj.words.pluralize()[0] 
        for blob_obj in blob_objs]
    print(plurals)

    The result should be as follows:

    ['books', 'geese', 'pens', 'points', 'deer']
  4. Now, we will do the reverse. We use the preceding plurals list to turn the plural nouns into TextBlob objects:
    blob_objs = [TextBlob(text) for text in plurals]
  5. Turn the nouns into singular using the singularize function and print:
    singulars = [blob_obj.words.singularize()[0] 
        for blob_obj in blob_objs]
    print(singulars)

    The result should be the same as the list we started with in step 2:

    ['book', 'goose', 'pen', 'point', 'deer']

Getting the dependency parse

A dependency parse is a tool that shows dependencies in a sentence. For example, in the sentence The cat wore a hat, the root of the sentence is the verb, wore, and both the subject, the cat, and the object, a hat, are dependents. The dependency parse can be very useful in many NLP tasks since it shows the grammatical structure of the sentence, with the subject, the main verb, the object, and so on. It can then be used in downstream processing.

The spaCy NLP engine does the dependency parse as part of its overall analysis. The dependency parse tags explain the role of each word in the sentence. ROOT is the main word that all other words depend on, usually the verb.

Getting ready

We will use spaCy to create the dependency parse. The required packages are part of the Poetry environment.

How to do it…

We will take a few sentences from the sherlock_holmes1.txt file to illustrate the dependency parse. The steps are as follows:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Define the sentence we will be parsing:
    sentence = 'I have seldom heard him mention her under any other name.'
  3. Define a function that will print the word, its grammatical function embedded in the dep_ attribute, and the explanation of that attribute. The dep_ attribute of the Token object shows the grammatical function of the word in the sentence:
    def print_dependencies(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text, "\t", token.dep_, "\t", 
                spacy.explain(token.dep_))
  4. Now, let’s use this function on the first sentence in our list. We can see that the verb heard is the ROOT word of the sentence, with all other words depending on it:
    print_dependencies(sentence, small_model)

    The result should be as follows:

    I    nsubj    nominal subject
    have    aux    auxiliary
    seldom    advmod    adverbial modifier
    heard    ROOT    root
    him    nsubj    nominal subject
    mention    ccomp    clausal complement
    her    dobj    direct object
    under    prep    prepositional modifier
    any    det    determiner
    other    amod    adjectival modifier
    name    pobj    object of preposition
    .    punct    punctuation
  5. To explore the dependency parse structure, we can use the attributes of the Token class. Using the ancestors and children attributes, we can get the tokens that this token depends on and the tokens that depend on it, respectively. The function to print the ancestors is as follows:
    def print_ancestors(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text, [t.text for t in token.ancestors])
  6. Now, let’s use this function on the first sentence in our list:
    print_ancestors(sentence, small_model)

    The output will be as follows. In the result, we see that heard has no ancestors since it is the main word in the sentence. All other words depend on it, and in fact, contain heard in their ancestor lists.

    The dependency chain can be seen by following the ancestor links for each word. For example, if we look at the word name, we see that its ancestors are under, mention, and heard. The immediate parent of name is under, the parent of under is mention, and the parent of mention is heard. A dependency chain will always lead to the root, or the main word, of the sentence:

    I ['heard']
    have ['heard']
    seldom ['heard']
    heard []
    him ['mention', 'heard']
    mention ['heard']
    her ['mention', 'heard']
    under ['mention', 'heard']
    any ['name', 'under', 'mention', 'heard']
    other ['name', 'under', 'mention', 'heard']
    name ['under', 'mention', 'heard']
    . ['heard']
  7. To see all the children, use the following function. This function prints out each word and the words that depend on it, its children:
    def print_children(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text,[t.text for t in token.children])
  8. Now, let’s use this function on the first sentence in our list:
    print_children(sentence, small_model)

    The result should be as follows. Now, the word heard has a list of words that depend on it since it is the main word in the sentence:

    I []
    have []
    seldom []
    heard ['I', 'have', 'seldom', 'mention', '.']
    him []
    mention ['him', 'her', 'under']
    her []
    under ['name']
    any []
    other []
    name ['any', 'other']
    . []
  9. We can also see left and right children in separate lists. In the following function, we print the children as two separate lists, left and right. This can be useful when doing grammatical transformations in the sentence:
    def print_lefts_and_rights(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text,
                [t.text for t in token.lefts],
                [t.text for t in token.rights])
  10. Let’s use this function on the first sentence in our list:
    print_lefts_and_rights(sentence, small_model)

    The result should be as follows:

    I [] []
    have [] []
    seldom [] []
    heard ['I', 'have', 'seldom'] ['mention', '.']
    him [] []
    mention ['him'] ['her', 'under']
    her [] []
    under [] ['name']
    any [] []
    other [] []
    name ['any', 'other'] []
    . [] []
  11. We can also see the subtree that the token is in by using this function:
    def print_subtree(sentence, model):
        doc = model(sentence)
        for token in doc:
            print(token.text, [t.text for t in token.subtree])
  12. Let’s use this function on the first sentence in our list:
    print_subtree(sentence, small_model)

    The result should be as follows. From the subtrees that each word is part of, we can see the grammatical phrases that appear in the sentence, such as the noun phrase, any other name, and the prepositional phrase, under any other name:

    I ['I']
    have ['have']
    seldom ['seldom']
    heard ['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
    him ['him']
    mention ['him', 'mention', 'her', 'under', 'any', 'other', 'name']
    her ['her']
    under ['under', 'any', 'other', 'name']
    any ['any']
    other ['other']
    name ['any', 'other', 'name']
    . ['.']

See also

The dependency parse can be visualized graphically using the displaCy package, which is part of spaCy. Please see Chapter 7, Visualizing Text Data, for a detailed recipe on how to do the visualization.

Extracting noun chunks

Noun chunks are known in linguistics as noun phrases. They represent nouns and any words that depend on and accompany nouns. For example, in the sentence The big red apple fell on the scared cat, the noun chunks are the big red apple and the scared cat. Extracting these noun chunks is instrumental to many other downstream NLP tasks, such as named entity recognition and processing entities and relations between them. In this recipe, we will explore how to extract named entities from a text.

Getting ready

We will use the spaCy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example.

How to do it…

Use the following steps to get the noun chunks from a text:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Define the function that will print out the noun chunks. The noun chunks are contained in the doc.noun_chunks class variable:
    def print_noun_chunks(text, model):
        doc = model(text)
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
  3. Read the text from the sherlock_holmes_1.txt file and use the function on the resulting text:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    print_noun_chunks(sherlock_holmes_part_of_text, small_model)

    This is the partial result. See the output of the notebook at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter02/noun_chunks_2.3.ipynb for the full printout. The function gets the pronouns, nouns, and noun phrases that are in the text correctly:

    Sherlock Holmes
    she
    the_ woman
    I
    him
    her
    any other name
    his eyes
    she
    the whole
    …

There’s more…

Noun chunks are spaCy Span objects and have all their properties. See the official documentation at https://spacy.io/api/token.

Let’s explore some properties of noun chunks:

  1. We will define a function that will print out the different properties of noun chunks. It will print the text of the noun chunk, its start and end indices within the Doc object, the sentence it belongs to (useful when there is more than one sentence), the root of the noun chunk (its main word), and the chunk’s similarity to the word emotions. Finally, it will print out the similarity of the whole input sentence to emotions:
    def explore_properties(sentence, model):
        doc = model(sentence)
        other_span = "emotions"
        other_doc = model(other_span)
        for noun_chunk in doc.noun_chunks:
            print(noun_chunk.text)
            print("Noun chunk start and end", "\t",
                noun_chunk.start, "\t", noun_chunk.end)
            print("Noun chunk sentence:", noun_chunk.sent)
            print("Noun chunk root:", noun_chunk.root.text)
            print(f"Noun chunk similarity to '{other_span}'",
                noun_chunk.similarity(other_doc))
        print(f"Similarity of the sentence '{sentence}' to 
            '{other_span}':",
            doc.similarity(other_doc))
  2. Set the sentence to All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
  3. Use the explore_properties function on the sentence using the small model:
    explore_properties(sentence, small_model)

    This is the result:

    All emotions
    Noun chunk start and end    0    2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.4026421588260174
    his cold, precise but admirably balanced mind
    Noun chunk start and end    11    19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' -0.036891259527462
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.03174900767577446

    You will also see a warning message similar to this one due to the fact that the small model does not ship with word vectors of its own:

    /tmp/ipykernel_1807/2430050149.py:10: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Span.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
      print(f"Noun chunk similarity to '{other_span}'", noun_chunk.similarity(other_doc))
  4. Now, let’s apply the same function to the same sentence with the large model:
    sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
    explore_properties(sentence, large_model)

    The large model does come with its own word vectors and does not result in a warning:

    All emotions
    Noun chunk start and end    0    2
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: emotions
    Noun chunk similarity to 'emotions' 0.6302678068015664
    his cold, precise but admirably balanced mind
    Noun chunk start and end    11    19
    Noun chunk sentence: All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
    Noun chunk root: mind
    Noun chunk similarity to 'emotions' 0.5744456705692561
    Similarity of the sentence 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.' to 'emotions': 0.640366414527618

    We see that the similarity of the All emotions noun chunk is high in relation to the word emotions, as compared to the similarity of the his cold, precise but admirably balanced mind noun chunk.

Important note

A larger spaCy model, such as en_core_web_lg, takes up more space but is more precise.

See also

The topic of semantic similarity will be explored in more detail in Chapter 3.

Extracting subjects and objects of the sentence

Sometimes, we might need to find the subject and direct objects of the sentence, and that is easily accomplished with the spaCy package.

Getting ready

We will be using the dependency tags from spaCy to find subjects and objects. The code uses the spaCy engine to parse the sentence. Then, the subject function loops through the tokens, and if the dependency tag contains subj, it returns that token’s subtree, a Span object. There are different subject tags, including nsubj for regular subjects and nsubjpass for subjects of passive sentences, thus we want to look for both.

How to do it…

We will use the subtree attribute of tokens to find the complete noun chunk that is the subject or direct object of the verb (see the Getting the dependency parse recipe). We will define functions to find the subject, direct object, dative phrase, and prepositional phrases:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. We will use two functions to find the subject and the direct object of the sentence. These functions will loop through the tokens and return the subtree that contains the token with subj or dobj in the dependency tag, respectively. Here is the subject function. It looks for the token that has a dependency tag that contains subj and then returns the subtree that contains that token. There are several subject dependency tags, including nsubj and nsubjpass (for the subject of a passive sentence), so we look for the most general pattern:
    def get_subject_phrase(doc):
        for token in doc:
            if ("subj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  3. Here is the direct object function. It works similarly to get_subject_phrase but looks for the dobj dependency tag instead of a tag that contains subj. If the sentence does not have a direct object, it will return None:
    def get_object_phrase(doc):
        for token in doc:
            if ("dobj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  4. Assign a list of sentences to a variable, loop through them, and use the preceding functions to print out their subjects and objects:
    sentences = [
        "The big black cat stared at the small dog.",
        "Jane watched her brother in the evenings.",
        "Laura gave Sam a very interesting book."
    ]
    for sentence in sentences:
        doc = small_model(sentence)
        subject_phrase = get_subject_phrase(doc)
        object_phrase = get_object_phrase(doc)
        print(sentence)
        print("\tSubject:", subject_phrase)
        print("\tDirect object:", object_phrase)

    The result will be as follows. Since the first sentence does not have a direct object, None is printed out. For the sentence The big black cat stared at the small dog, the subject is the big black cat and there is no direct object (the small dog is the object of the preposition at). For the sentence Jane watched her brother in the evenings, the subject is Jane and the direct object is her brother. In the sentence Laura gave Sam a very interesting book, the subject is Laura and the direct object is a very interesting book:

    The big black cat stared at the small dog.
      Subject: The big black cat
      Direct object: None
    Jane watched her brother in the evenings.
      Subject: Jane
      Direct object: her brother
    Laura gave Sam a very interesting book.
      Subject: Laura
      Direct object: a very interesting book

There’s more…

We can look for other objects, for example, the dative objects of verbs such as give and objects of prepositional phrases. The functions will look very similar, with the main difference being the dependency tags: dative for the dative object function, and pobj for the prepositional object function. The prepositional object function will return a list since there can be more than one prepositional phrase in a sentence:

  1. The dative object function checks the tokens for the dative tag. It returns None if there are no dative objects:
    def get_dative_phrase(doc):
        for token in doc:
            if ("dative" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  2. We can also combine the subject, object, and dative functions into one with an argument that specifies which object to look for:
    def get_phrase(doc, phrase):
        # phrase is one of "subj", "obj", "dative"
        for token in doc:
            if (phrase in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  3. Let us now define a sentence with a dative object and run the function for all three types of phrases:
    sentence = "Laura gave Sam a very interesting book."
    doc = small_model(sentence)
    subject_phrase = get_phrase(doc, "subj")
    object_phrase = get_phrase(doc, "obj")
    dative_phrase = get_phrase(doc, "dative")
    print(sentence)
    print("\tSubject:", subject_phrase)
    print("\tDirect object:", object_phrase)
    print("\tDative object:", dative_phrase)

    The result will be as follows. The dative object is Sam:

    Laura gave Sam a very interesting book.
      Subject: Laura
      Direct object: a very interesting book
      Dative object: Sam
  4. Here is the prepositional object function. It returns a list of objects of prepositions, which will be empty if there are none:
    def get_prepositional_phrase_objs(doc):
        prep_spans = []
        for token in doc:
            if ("pobj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                prep_spans.append(doc[start:end])
        return prep_spans
  5. Let’s define a list of sentences and run the two functions on them:
    sentences = [
        "The big black cat stared at the small dog.",
        "Jane watched her brother in the evenings."
    ]
    for sentence in sentences:
        doc = small_model(sentence)
        subject_phrase = get_phrase(doc, "subj")
        object_phrase = get_phrase(doc, "obj")
        dative_phrase = get_phrase(doc, "dative")
        prepositional_phrase_objs = \
            get_prepositional_phrase_objs(doc)
        print(sentence)
        print("\tSubject:", subject_phrase)
        print("\tDirect object:", object_phrase)
        print("\tPrepositional phrases:", prepositional_phrase_objs)

    The result will be as follows:

    The big black cat stared at the small dog.
      Subject: The big black cat
      Direct object: the small dog
      Prepositional phrases: [the small dog]
    Jane watched her brother in the evenings.
      Subject: Jane
      Direct object: her brother
      Prepositional phrases: [the evenings]

    There is one prepositional phrase in each sentence. In the sentence The big black cat stared at the small dog, it is at the small dog, and in the sentence Jane watched her brother in the evenings, it is in the evenings.

It is left as an exercise for you to find the actual prepositional phrases with prepositions intact instead of just the noun phrases that are dependent on these prepositions.

Finding patterns in text using grammatical information

In this section, we will use the spaCy Matcher object to find patterns in the text. We will use the grammatical properties of the words to create these patterns. For example, we might be looking for verb phrases instead of noun phrases. We can specify grammatical patterns to match verb phrases.

Getting ready

We will be using the spaCy Matcher object to specify and find patterns. It can match different properties, not just grammatical. You can find out more in the documentation at https://spacy.io/usage/rule-based-matching/.

How to do it…

Your steps should be formatted like so:

  1. Run the file and language utility notebooks:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
  2. Import the Matcher object and initialize it. We need to put in the vocabulary object, which is the same as the vocabulary of the model we will be using to process the text:
    from spacy.matcher import Matcher
    matcher = Matcher(small_model.vocab)
  3. Create a list of patterns and add them to the matcher. Each pattern is a list of dictionaries, where each dictionary describes a token. In our patterns, we only specify the part of speech for each token. We then add these patterns to the Matcher object. The patterns we will be using are a verb by itself (for example, paints), an auxiliary followed by a verb (for example, was observing), an auxiliary followed by an adjective (for example, were late), and an auxiliary followed by a verb and a preposition (for example, were staring at). This is not an exhaustive list; feel free to come up with other examples:
    patterns = [
        [{"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "VERB"}],
        [{"POS": "AUX"}, {"POS": "ADJ"}],
        [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}]
    ]
    matcher.add("Verb", patterns)
  4. Read in the small part of the Sherlock Holmes text and process it using the small model:
    sherlock_holmes_part_of_text = read_text_file("../data/sherlock_holmes_1.txt")
    doc = small_model(sherlock_holmes_part_of_text)
  5. Now, we find the matches using the Matcher object and the processed text. We then loop through the matches and print out the match ID, the string ID (the identifier of the pattern), the start and end of the match, and the text of the match:
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = small_model.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)

    The result will be as follows:

    14677086776663181681 Verb 14 15 heard
    14677086776663181681 Verb 17 18 mention
    14677086776663181681 Verb 28 29 eclipses
    14677086776663181681 Verb 31 32 predominates
    14677086776663181681 Verb 43 44 felt
    14677086776663181681 Verb 49 50 love
    14677086776663181681 Verb 63 65 were abhorrent
    14677086776663181681 Verb 80 81 take
    14677086776663181681 Verb 88 89 observing
    14677086776663181681 Verb 94 96 has seen
    14677086776663181681 Verb 95 96 seen
    14677086776663181681 Verb 103 105 have placed
    14677086776663181681 Verb 104 105 placed
    14677086776663181681 Verb 114 115 spoke
    14677086776663181681 Verb 120 121 save
    14677086776663181681 Verb 130 132 were admirable
    14677086776663181681 Verb 140 141 drawing
    14677086776663181681 Verb 153 154 trained
    14677086776663181681 Verb 157 158 admit
    14677086776663181681 Verb 167 168 adjusted
    14677086776663181681 Verb 171 172 introduce
    14677086776663181681 Verb 173 174 distracting
    14677086776663181681 Verb 178 179 throw
    14677086776663181681 Verb 228 229 was

The code finds some of the verb phrases in the text. Sometimes, it finds a partial match that is part of another match. Weeding out these partial matches is left as an exercise.

See also

We can use other attributes apart from parts of speech. It is possible to match on the text itself, its length, whether it is alphanumeric, the punctuation, the word’s case, the dep_ and morph attributes, lemma, entity type, and others. It is also possible to use regular expressions on the patterns. For more information, see the spaCy documentation: https://spacy.io/usage/rule-based-matching.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Leverage ready-to-use recipes with the latest LLMs, including Mistral, Llama, and OpenAI models
  • Use LLM-powered agents for custom tasks and real-world interactions
  • Gain practical, in-depth knowledge of transformers and their role in implementing various NLP tasks with open-source and advanced LLMs
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Harness the power of Natural Language Processing to overcome real-world text analysis challenges with this recipe-based roadmap written by two seasoned NLP experts with vast experience transforming various industries with their NLP prowess. You’ll be able to make the most of the latest NLP advancements, including large language models (LLMs), and leverage their capabilities through Hugging Face transformers. Through a series of hands-on recipes, you’ll master essential techniques such as extracting entities and visualizing text data. The authors will expertly guide you through building pipelines for sentiment analysis, topic modeling, and question-answering using popular libraries like spaCy, Gensim, and NLTK. You’ll also learn to implement RAG pipelines to draw out precise answers from a text corpus using LLMs. This second edition expands your skillset with new chapters on cutting-edge LLMs like GPT-4, Natural Language Understanding (NLU), and Explainable AI (XAI)—fostering trust and transparency in your NLP models. By the end of this book, you'll be equipped with the skills to apply advanced text processing techniques, use pre-trained transformer models, build custom NLP pipelines to extract valuable insights from text data to drive informed decision-making.

What you will learn

  • Understand fundamental NLP concepts along with their applications using examples in Python
  • Classify text quickly and accurately with rule-based and supervised methods
  • Train NER models and perform sentiment analysis to identify entities and emotions in text
  • Explore topic modeling and text visualization to reveal themes and relationships within text
  • Leverage Hugging Face and OpenAI LLMs to perform advanced NLP tasks
  • Use question-answering techniques to handle both open and closed domains
  • Apply XAI techniques to better understand your model predictions

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Sep 13, 2024
Length 312 pages
Edition : 2nd Edition
Language : English
ISBN-13 : 9781803245744
Vendor :
Google
Category :
Languages :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon AI Assistant (beta) to help accelerate your learning
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want

Product Details

Publication date : Sep 13, 2024
Length 312 pages
Edition : 2nd Edition
Language : English
ISBN-13 : 9781803245744
Vendor :
Google
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Table of Contents

13 Chapters
Preface Chevron down icon Chevron up icon
1. Chapter 1: Learning NLP Basics Chevron down icon Chevron up icon
2. Chapter 2: Playing with Grammar Chevron down icon Chevron up icon
3. Chapter 3: Representing Text – Capturing Semantics Chevron down icon Chevron up icon
4. Chapter 4: Classifying Texts Chevron down icon Chevron up icon
5. Chapter 5: Getting Started with Information Extraction Chevron down icon Chevron up icon
6. Chapter 6: Topic Modeling Chevron down icon Chevron up icon
7. Chapter 7: Visualizing Text Data Chevron down icon Chevron up icon
8. Chapter 8: Transformers and Their Applications Chevron down icon Chevron up icon
9. Chapter 9: Natural Language Understanding Chevron down icon Chevron up icon
10. Chapter 10: Generative AI and Large Language Models Chevron down icon Chevron up icon
11. Index Chevron down icon Chevron up icon
12. Other Books You May Enjoy Chevron down icon Chevron up icon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.