You're reading from The Natural Language Processing Workshop

Product type: Book
Published in: Aug 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800208421
Edition: 1st Edition
Authors (6):
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelor's degree in computer science. He has more than 2 years of experience in designing, implementing, and optimizing end-to-end deep neural network systems. His research is centered around the use of deep learning to solve computer vision-related problems, and he has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.

Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with interdisciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and an MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion-page contributor to Mint, a leading business newspaper in India. He has fifteen years of experience.

Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at a US healthcare company, leading a team of data scientists and subject matter experts to design and build a clinical NLP engine that revamps medical coding workflows, enhances coder efficiency, and accelerates the revenue cycle. He has more than 3 years of experience in building NLP solutions and web-based data science platforms in the areas of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering, with a fair understanding of the business domain. He is a member of the regional and national Python community. He is the author of pySBD, an open-source Python NLP library for sentence segmentation that is recognized by the ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.

Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a master's degree in computer science from the University of Kashmir and is currently working at a Bangalore-based startup named Datoin.

Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena with a specialization in deep learning-based text analytics, NLP, and recommendation systems. He has publications in several international conferences and journals.

Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools, such as NLTK, gensim, and spaCy.


Text Analytics and NLP

Text analytics is the process of extracting meaningful insights from text data and answering questions about it, such as how long the sentences and words are, how many words the text contains, and whether or where particular words occur in it. Let's understand this with an example.
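As a quick warm-up before the survey example, the surface-level measurements listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the sample text is made up, and splitting sentences on end punctuation followed by whitespace is a simplifying assumption that real text will often break.

```python
import re

# Made-up sample text, used only for illustration.
text = "The quick brown fox jumps over the lazy dog. It barely noticed."

# Naive sentence split: break after '.', '!' or '?' when whitespace follows.
# This is an assumption; abbreviations such as "Dr." would be split wrongly.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Treat every run of letters, digits, or underscores as one word.
words = re.findall(r"\w+", text)

word_count = len(words)
word_lengths = [len(w) for w in words]
sentence_lengths = [len(re.findall(r"\w+", s)) for s in sentences]

print(word_count)         # total number of words
print(max(word_lengths))  # length of the longest word
print(sentence_lengths)   # number of words in each sentence
```

None of these measurements requires any knowledge of what the words mean, which is the point of the definition above.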

Suppose we are conducting a survey based on news articles, and we have to find the five countries that contributed the most to space technology in the past 5 years. First, we will collect all the space technology-related news from the past 5 years using the Google News API. Then, we must extract the names of the countries mentioned in these news articles. We can perform this task using a file containing a list of all the countries in the world.

Next, we will create a dictionary in which the keys are the country names and the values are the number of times each country name is found in the news articles. To search for a country in the news articles, we can use a simple regular expression (regex) that matches the name as a whole word. After we have searched all the news articles, we can sort the country names by the counts associated with them. In this way, we will come up with the top five countries that contributed the most to space technology in the last 5 years.

This is a typical example of text analytics, in which we are generating insights from text without getting into the semantics of the language.
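The counting step of this survey can be sketched as follows. This is a minimal sketch under stated assumptions, not the book's implementation: the countries list and the articles list are hypothetical stand-ins for the country file and the collected news, and count_country_mentions is a helper name invented here. Fetching the articles is left out entirely.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the country list file and the collected articles.
countries = ["India", "China", "Japan", "France", "Russia", "Niger", "Nigeria"]
articles = [
    "India and Japan announced a joint lunar mission.",
    "China launched a new space station module; India tracked the launch.",
    "Nigeria signed a satellite agreement with China.",
]

def count_country_mentions(articles, countries):
    """Count whole-word mentions of each country name across all articles."""
    counts = Counter()
    for name in countries:
        # \b word boundaries keep "Niger" from matching inside "Nigeria".
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        for article in articles:
            counts[name] += len(pattern.findall(article))
    return counts

counts = count_country_mentions(articles, countries)

# most_common(5) sorts by count, highest first, and keeps the top five.
top_five = counts.most_common(5)
print(top_five)
```

A Counter is a dictionary subclass, so it plays the role of the dictionary described above while also providing the sort-by-value step through most_common.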

It is important here to note the difference between text analytics and NLP. Text analytics refers to extracting useful insights from any given text data. NLP, on the other hand, helps us understand the semantics and the underlying meaning of text, such as the sentiment of a sentence, the top keywords in a text, and the parts of speech of different words. NLP is not restricted to text data; voice (speech) recognition and analysis also come under its domain. NLP can be broadly categorized into two types: Natural Language Understanding (NLU) and Natural Language Generation (NLG). These terms are explained here:

  • NLU: NLU refers to a process by which an inanimate object with computing power is able to comprehend spoken language. As mentioned earlier, Siri and Alexa use techniques such as Speech to Text to answer different questions, including inquiries about the weather, the latest news updates, live match scores, and more.
  • NLG: NLG refers to a process by which an inanimate object with computing power is able to communicate with humans in a language that they can understand or is able to generate human-understandable text from a dataset. Continuing with the example of Siri or Alexa, ask one of them about the chances of rainfall in your city. It will reply with something along the lines of, "Currently, there is no chance of rainfall in your city." It gets the answer to your query from different sources using a search engine and then summarizes the results. Then, it uses Text to Speech to relay the results in verbally spoken words.

So, when a human speaks to a machine, the machine interprets the language with the help of the NLU process. By using the NLG process, the machine generates an appropriate response and shares it with the human, thus making it easier for humans to understand the machine. These tasks, which are part of NLP, are not part of text analytics. Let's walk through the basics of text analytics and see how we can execute it in Python.

Before starting the exercises, let's go over some prerequisites for running them. Whether you are using Windows, macOS, or Linux, you need to run your Jupyter Notebook in a virtual environment. You will also need to ensure that you have installed the requirements listed in the requirements.txt file at https://packt.live/3fJ4qap.

Exercise 1.01: Basic Text Analytics

In this exercise, we will perform some basic text analytics on some given text data, including searching for a particular word, finding the index of a word, and finding a word at a given position. Follow these steps to implement this exercise using the following sentence:

"The quick brown fox jumps over the lazy dog"

  1. Open a Jupyter Notebook.
  2. Assign a sentence variable the value 'The quick brown fox jumps over the lazy dog'. Insert a new cell and add the following code to implement this:
    sentence = 'The quick brown fox jumps over the lazy dog'
    sentence
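  3. Check whether the word 'quick' occurs in that text using the following code. Note that the in operator tests for a substring, so it would also return True for a fragment such as 'qui':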
    def find_word(word, sentence):
        return word in sentence
    find_word('quick', sentence)

    The preceding code will return the output True.

  4. Find the character index at which the word 'fox' starts using the following code:
    def get_index(word, text):
        return text.index(word)
    get_index('fox', sentence)

    The code will return the output 16.

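  5. To find the rank of the word 'lazy', that is, its zero-based position among the words of the sentence, use the following code:
    get_index('lazy', sentence.split())

    This code generates the output 7, which means that 'lazy' is the eighth word.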

  6. To get the third word of the given text (index 2, because positions are zero-based), use the following code:
    def get_word(text, rank):
        return text.split()[rank]
    get_word(sentence, 2)

    This will return the output brown.

  7. To get the third word of the given sentence with its letters in reverse order, use the following code:
    get_word(sentence, 2)[::-1]

    This will return the output nworb.

  8. To concatenate the first and last words of the given sentence, use the following code:
    def concat_words(text):
        """
        Concatenate the first and last
        words of the given text.
        """
        words = text.split()
        first_word = words[0]
        last_word = words[-1]
        return first_word + last_word
    concat_words(sentence)

    Note

    The triple quotes ( """ ) shown in the code snippet above delimit a docstring: a string literal placed as the first statement of a function to document what it does. Unlike a # comment, a docstring is stored on the function and can be viewed with help(concat_words).

    The code will generate the output Thedog.

  9. To get the words at even positions (zero-based indices 0, 2, 4, and so on), use the following code:
    def get_even_position_words(text):
        words = text.split()
        # A slice with a step of 2 keeps every second word, starting at index 0
        return words[::2]
    get_even_position_words(sentence)

    This code generates the following output:

    ['The', 'brown', 'jumps', 'the', 'dog']
  10. To print the last three letters of the text, use the following code:
    def get_last_n_letters(text, n):
        return text[-n:]
    get_last_n_letters(sentence,3)

    This will generate the output dog.

  11. To print the text in reverse order, use the following code:
    def get_reverse(text):
        return text[::-1]
    get_reverse(sentence)

    This code generates the following output:

    'god yzal eht revo spmuj xof nworb kciuq ehT'
  12. To print each word of the given text in reverse order, maintaining their sequence, use the following code:
    def get_word_reverse(text):
        words = text.split()
        return ' '.join([word[::-1] for word in words])
    get_word_reverse(sentence)

    This code generates the following output:

    'ehT kciuq nworb xof spmuj revo eht yzal god'

We are now well acquainted with basic text analytics techniques.
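Two details from the exercise are easy to trip over: the in operator matches substrings rather than whole words, and the index of a word means something different for a string than for a list of words. The following sketch illustrates both. The find_whole_word helper is a hypothetical addition that is not part of the exercise, and using a regex with word boundaries is just one way to do this.

```python
import re

sentence = 'The quick brown fox jumps over the lazy dog'

def find_whole_word(word, text):
    """Return True only if `word` occurs in `text` as a complete word."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

# Substring check versus whole-word check
print('qui' in sentence)                   # True: 'qui' is part of 'quick'
print(find_whole_word('qui', sentence))    # False: 'qui' is not a full word
print(find_whole_word('quick', sentence))  # True

# Character offset in the string versus word position in the split list
print(sentence.index('lazy'))              # 35
print(sentence.split().index('lazy'))      # 7
```

Both index calls raise a ValueError if the word is absent, so real code would usually check for membership first or catch the exception.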

Note

To access the source code for this specific section, please refer to https://packt.live/38Yrf77.

You can also run this example online at https://packt.live/2ZsCvpf.

In the next section, let's dive deeper into the various steps and subtasks in NLP.
