Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
The Natural Language Processing Workshop

You're reading from  The Natural Language Processing Workshop

Product type Book
Published in Aug 2020
Publisher Packt
ISBN-13 9781800208421
Pages 452 pages
Edition 1st Edition
Languages
Authors (6):
Rohan Chopra Rohan Chopra
Profile icon Rohan Chopra
Aniruddha M. Godbole Aniruddha M. Godbole
Profile icon Aniruddha M. Godbole
Nipun Sadvilkar Nipun Sadvilkar
Profile icon Nipun Sadvilkar
Muzaffar Bashir Shah Muzaffar Bashir Shah
Profile icon Muzaffar Bashir Shah
Sohom Ghosh Sohom Ghosh
Profile icon Sohom Ghosh
Dwight Gunning Dwight Gunning
Profile icon Dwight Gunning
View More author details

Table of Contents (10) Chapters

Preface
1. Introduction to Natural Language Processing 2. Feature Extraction Methods 3. Developing a Text Classifier 4. Collecting Text Data with Web Scraping and APIs 5. Topic Modeling 6. Vector Representation 7. Text Generation and Summarization 8. Sentiment Analysis Appendix

Sentence Boundary Detection

Sentence boundary detection is the method of detecting where one sentence ends and another begins. If you are thinking that this sounds pretty easy, as a period (.) or a question mark (?) denotes the end of a sentence and the beginning of another sentence, then you are wrong. There can also be instances where the letters of acronyms are separated by full stops, for instance. Various analyses need to be performed at a sentence level; detecting the boundaries of sentences is essential.

An exercise will provide us with a better understanding of this process.

Exercise 1.11: Sentence Boundary Detection

In this exercise, we will extract sentences from a paragraph. To do so, we'll be using the sent_tokenize() method, which is used to detect sentence boundaries. The following steps need to be performed:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    import nltk
    from nltk.tokenize import sent_tokenize
  3. Use the sent_tokenize() method to detect sentences in some given text. Insert a new cell and add the following code to implement this:
    def get_sentences(text):
        return sent_tokenize(text)
    get_sentences("We are reading a book. Do you know who is "\
                  "the publisher? It is Packt. Packt is based "\
                  "out of Birmingham.")

    This code generates the following output:

    ['We are reading a book.'
     'Do you know who is the publisher?'
     'It is Packt.',
     'Packt is based out of Birmingham.']
  4. Use the sent_tokenize() method for text that contains periods (.) other than those found at the ends of sentences:
    get_sentences("Mr. Donald John Trump is the current "\
                  "president of the USA. Before joining "\
                  "politics, he was a businessman.")

    The code will generate the following output:

    ['Mr. Donald John Trump is the current president of the USA.',
     'Before joining politics, he was a businessman.']

As you can see in the code, the sent_tokenize method is able to differentiate between the period (.) after "Mr" and the one used to end the sentence. We have covered all the preprocessing steps that are involved in NLP.

Note

To access the source code for this specific section, please refer to https://packt.live/2ZseU86.

You can also run this example online at https://packt.live/2CC8Ukp.

Now, using the knowledge we've gained, let's perform an activity.

Activity 1.01: Preprocessing of Raw Text

We have a text corpus that is in an improper format. In this activity, we will perform all the preprocessing steps that were discussed earlier to get some meaning out of the text.

Note

The text corpus, file.txt, can be found at this location: https://packt.live/30cu54z

After downloading the file, place it in the same directory as the notebook.

Follow these steps to implement this activity:

  1. Import the necessary libraries.
  2. Load the text corpus to a variable.
  3. Apply the tokenization process to the text corpus and print the first 20 tokens.
  4. Apply spelling correction on each token and print the initial 20 corrected tokens as well as the corrected text corpus.
  5. Apply PoS tags to each of the corrected tokens and print them.
  6. Remove stop words from the corrected token list and print the initial 20 tokens.
  7. Apply stemming and lemmatization to the corrected token list and then print the initial 20 tokens.
  8. Detect the sentence boundaries in the given text corpus and print the total number of sentences.

    Note

    The solution for this activity can be found via this link.

We have learned about and achieved the preprocessing of given data. By now, you should be familiar with what NLP is and what basic preprocessing steps are needed to carry out any NLP project. In the next section, we will focus on the different phases of an NLP project.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime}