Counting nouns – plural and singular nouns
In this recipe, we will do two things: determine whether a noun is plural or singular and turn plural nouns into singular, and vice versa.
You might need these two things for a variety of tasks. For example, you might want to compute word statistics, which usually means counting the singular and plural forms of a noun together. To do that, you need a way to recognize whether a word is singular or plural.
Getting ready
To determine whether a noun is singular or plural, we will use spaCy via two different methods: by looking at the difference between the lemma and the actual word and by looking at the morph attribute. To inflect these nouns, that is, to turn singular nouns into plural or vice versa, we will use the textblob package. We will also see how to determine the noun’s number using GPT-3.5 through the OpenAI API. The code for this section is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/tree/main/Chapter02.
How to do it…
We will first use spaCy’s lemma information to infer whether a noun is singular or plural. Then, we will use the morph attribute of Token objects. We will then create a function that uses one of those methods. Finally, we will use GPT-3.5 to determine the grammatical number of the nouns:
- Run the code in the file and language utility notebooks. If you run into an error saying that the small or large models do not exist, you need to open the lang_utils.ipynb file, uncomment, and run the statement that downloads the model:
    %run -i "../util/file_utils.ipynb"
    %run -i "../util/lang_utils.ipynb"
- Initialize the text variable and process it using the spaCy small model to get the resulting Doc object:
    text = "I have five birds"
    doc = small_model(text)
- In this step, we loop through the Doc object. For each token in the object, we check whether it’s a noun and whether its lemma differs from the word itself. Since the lemma is the basic form of the word, a noun whose lemma is different from its text is plural:
    for token in doc:
        if token.pos_ == "NOUN" and token.lemma_ != token.text:
            print(token.text, "plural")
The result should be as follows:
birds plural
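The same lemma comparison supports the counting use case mentioned at the start of the recipe: if nouns are keyed by lemma, singular and plural forms fall into one bucket. The sketch below is a minimal illustration that works on hand-written (text, lemma) pairs rather than a real Doc, so the sample data and the count_by_lemma name are assumptions. With spaCy, the pairs would come from token.text and token.lemma_ for tokens whose pos_ is "NOUN".

```python
from collections import Counter

def count_by_lemma(pairs):
    """Count nouns by lemma so singular and plural forms share one entry."""
    counts = Counter()
    for _text, lemma in pairs:
        counts[lemma.lower()] += 1
    return counts

# Hand-written sample data (not model output): (token text, lemma)
sample_pairs = [("bird", "bird"), ("birds", "bird"), ("road", "road")]
lemma_counts = count_by_lemma(sample_pairs)
print(lemma_counts)
```

Here, "bird" and "birds" are counted under the single key "bird".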
- Now, we will check the number of a noun using a different method: the morph features of a Token object. The morph features are the morphological features of a word, such as number, case, and so on. Since we know that token 3 is a noun, we directly access the morph features and get the Number to get the same result as previously:
    doc = small_model("I have five birds.")
    print(doc[3].morph.get("Number"))
Here is the result:
['Plur']
- In this step, we prepare to define a function that returns a list of tuples of the form (noun, number). In order to better encode the noun number, we use an Enum class that assigns a numeric value to each name. We assign 1 to singular and 2 to plural. Once we create the class, we can directly refer to the noun number variables as Noun_number.SINGULAR and Noun_number.PLURAL:
    from enum import Enum

    class Noun_number(Enum):
        SINGULAR = 1
        PLURAL = 2
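As a small illustration of why an Enum is convenient here, the sketch below redefines the same Noun_number class so that it runs on its own, and shows access by name and by value. The number_from_morph helper is hypothetical and not part of the recipe's code. It assumes it receives the list that token.morph.get("Number") returns, such as ['Plur'], and falls back to None when the list carries neither value.

```python
from enum import Enum

class Noun_number(Enum):
    SINGULAR = 1
    PLURAL = 2

def number_from_morph(values):
    """Map a list such as ['Plur'] or ['Sing'] to a Noun_number member."""
    if "Plur" in values:
        return Noun_number.PLURAL
    if "Sing" in values:
        return Noun_number.SINGULAR
    return None  # number is not marked on this token

# Members can be reached by name, by value, or through the helper
print(Noun_number.PLURAL.name, Noun_number.PLURAL.value)
print(Noun_number(1))
print(number_from_morph(["Plur"]))
```

Because the morph value arrives as a list, a membership test is used rather than a direct comparison with a string.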
- In this step, we define the function. It takes as input the text, the spaCy model, and the method of determining the noun number. The two methods are lemma and morph, the same two methods we used in steps 3 and 4, respectively. The function outputs a list of tuples, each of the format (<noun text>, <noun number>), where the noun number is expressed using the Noun_number class defined in step 5:
    def get_nouns_number(text, model, method="lemma"):
        nouns = []
        doc = model(text)
        for token in doc:
            if token.pos_ == "NOUN":
                if method == "lemma":
                    # A noun whose lemma differs from its text is treated as plural
                    if token.lemma_ != token.text:
                        nouns.append((token.text, Noun_number.PLURAL))
                    else:
                        nouns.append((token.text, Noun_number.SINGULAR))
                elif method == "morph":
                    # morph.get() returns a list of values, such as ["Plur"]
                    if "Plur" in token.morph.get("Number"):
                        nouns.append((token.text, Noun_number.PLURAL))
                    else:
                        nouns.append((token.text, Noun_number.SINGULAR))
        return nouns
- We can use the preceding function and see its performance with different spaCy models. In this step, we use the small spaCy model with the function we just defined. With both methods, the small model gets the number of the irregular noun geese wrong:
    text = "Three geese crossed the road"
    nouns = get_nouns_number(text, small_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, small_model)
    print(nouns)
The result should be as follows:
[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
- Now, let’s do the same using the large model. If you have not yet downloaded the large model, do so by running the first line. Otherwise, you can comment it out. Here, we see that although the morph method still incorrectly assigns singular to geese, the lemma method provides the correct answer:
    !python -m spacy download en_core_web_lg
    large_model = spacy.load("en_core_web_lg")
    nouns = get_nouns_number(text, large_model, "morph")
    print(nouns)
    nouns = get_nouns_number(text, large_model)
    print(nouns)
The result should be as follows:
[('geese', <Noun_number.SINGULAR: 1>), ('road', <Noun_number.SINGULAR: 1>)]
[('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]
- Let’s now use GPT-3.5 to get the noun number. In the results, we see that GPT-3.5 matches the lemma method with the large model and correctly identifies both the number for geese and the number for road:
    from openai import OpenAI
    client = OpenAI(api_key=OPEN_AI_KEY)
    prompt = """Decide whether each noun in the following text is singular or plural.
    Return the list in the format of a python tuple: (word, number). Do not provide any additional explanations.
    Sentence: Three geese crossed the road."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        max_tokens=256,
        top_p=1.0,
        frequency_penalty=0,
        presence_penalty=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
    )
    print(response.choices[0].message.content)
The result should be as follows:
('geese', 'plural') ('road', 'singular')
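Because the model replies with plain text, the tuples have to be parsed before they can be compared with the spaCy output. The following is a minimal sketch, assuming the reply looks like the output above. parse_noun_tuples is a hypothetical helper, and real replies may deviate from the requested format, in which case this parser simply returns fewer pairs.

```python
import ast
import re

def parse_noun_tuples(reply):
    """Extract (word, number) pairs from the model's text reply."""
    pairs = []
    # Find each parenthesised group and let ast.literal_eval parse it safely
    for chunk in re.findall(r"\([^()]*\)", reply):
        try:
            value = ast.literal_eval(chunk)
        except (ValueError, SyntaxError):
            continue  # skip anything that is not a valid Python literal
        if (
            isinstance(value, tuple)
            and len(value) == 2
            and all(isinstance(item, str) for item in value)
        ):
            pairs.append((value[0], value[1].lower()))
    return pairs

# Sample reply copied from the output shown above
reply = "('geese', 'plural') ('road', 'singular')"
parsed = parse_noun_tuples(reply)
print(parsed)
```

Using ast.literal_eval rather than eval keeps the parser from executing arbitrary text returned by the model.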
There’s more…
We can also change the nouns from plural to singular, and vice versa. We will use the textblob package for that. The package should be installed automatically via the Poetry environment:
- Import the TextBlob class from the package:
    from textblob import TextBlob
- Initialize a list of words and wrap each one in a TextBlob object using a list comprehension:
    texts = ["book", "goose", "pen", "point", "deer"]
    blob_objs = [TextBlob(text) for text in texts]
- Use the pluralize function of the object to get the plural. This function returns a list and we access its first element. Print the result:
    plurals = [blob_obj.words.pluralize()[0] for blob_obj in blob_objs]
    print(plurals)
The result should be as follows:
['books', 'geese', 'pens', 'points', 'deer']
- Now, we will do the reverse. We use the preceding plurals list to turn the plural nouns into TextBlob objects:
    blob_objs = [TextBlob(text) for text in plurals]
- Turn the nouns into singular using the singularize function and print:
    singulars = [blob_obj.words.singularize()[0] for blob_obj in blob_objs]
    print(singulars)
The result should be the same as the list we started with in step 2:
['book', 'goose', 'pen', 'point', 'deer']
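A quick way to spot inflection errors on a larger word list is a round-trip check: pluralize each word, singularize the result, and compare it with the original. The sketch below is written against two plain callables so that it runs without any package installed. The round_trip_failures name and the toy dictionaries are assumptions for illustration only and stand in for an inflection library.

```python
def round_trip_failures(words, pluralize, singularize):
    """Return (word, plural, back) triples where the round trip changes the word."""
    failures = []
    for word in words:
        plural = pluralize(word)
        back = singularize(plural)
        if back != word:
            failures.append((word, plural, back))
    return failures

# Toy stand-in for an inflection library (illustrative data only)
toy_plurals = {"book": "books", "goose": "geese", "deer": "deer"}
toy_singulars = {plural: singular for singular, plural in toy_plurals.items()}

failures = round_trip_failures(
    ["book", "goose", "deer"],
    pluralize=lambda word: toy_plurals[word],
    singularize=lambda word: toy_singulars[word],
)
print(failures)
```

With textblob, the two callables could be lambda word: TextBlob(word).words.pluralize()[0] and lambda word: TextBlob(word).words.singularize()[0], mirroring the calls used in the steps above.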