Natural Language Processing

In this chapter, we will learn about the exciting topic of natural language processing (NLP). As we have discussed in previous chapters, having computers that are able to understand human language is one of the breakthroughs that will truly make computers even more useful. NLP provides the foundation to begin to understand how this might be possible.

We will discuss and use various concepts, such as tokenization, stemming, and lemmatization, to process text. We will then discuss the Bag of Words model and how to use it to classify text. We will see how to use machine learning to analyze the sentiment of a sentence. We will then discuss topic modeling and implement a system to identify topics in a given document.

By the end of this chapter, you will be familiar with the following topics:

  • Installing relevant NLP packages
  • Tokenizing text data
  • Converting words to their base forms using stemming
  • Converting words to their base forms...

Introduction and installation of packages

Natural Language Processing (NLP) has become an important part of modern systems. It is used extensively in search engines, conversational interfaces, document processors, and so on. Machines handle structured data well, but they have a hard time working with free-form text. The goal of NLP is to develop algorithms that enable computers to understand free-form text and, ultimately, human language.

One of the most challenging things about processing free-form natural language is the sheer amount of variation. Context plays a very important role in how a sentence is understood. Humans are innately great at understanding language. It is not clear yet how humans understand language so easily and intuitively. We use our past knowledge and experiences to understand conversations, and we can quickly get the gist of what other people are talking about even with little explicit context.

To address this issue, NLP...

Tokenizing text data

When we deal with text, we need to break it down into smaller pieces for analysis. To do this, tokenization can be applied. Tokenization is the process of dividing text into a set of pieces, such as words or sentences. These pieces are called tokens. Depending on what we want to do, we can define our own methods to divide the text into many tokens. Let's look at how to tokenize the input text using NLTK.

Create a new Python file and import the following packages:

from nltk.tokenize import sent_tokenize, \
        word_tokenize, WordPunctTokenizer

Define the input text that will be used for tokenization:

# Define input text
input_text = "Do you know how tokenization works? It's actually \ 
   quite interesting! Let's analyze a couple of sentences and \
   figure it out."

Divide the input text into sentence tokens:

# Sentence tokenizer 
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))
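
The word-level tokenizers imported above can be exercised in the same way. Below is a minimal sketch; sent_tokenize and word_tokenize rely on NLTK's punkt models, which can be fetched with nltk.download('punkt'):

# Word tokenizer
print("\nWord tokenizer:")
print(word_tokenize(input_text))

# WordPunct tokenizer: splits punctuation (such as the apostrophe in "It's")
# into separate tokens
print("\nWord punct tokenizer:")
print(WordPunctTokenizer().tokenize(input_text))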

Converting words to their base forms using stemming

Working with text means working with a lot of variation. We must deal with different forms of the same word and enable the computer to understand that these different words share the same base form. For example, the word sing can appear in many forms, such as singer, singing, song, sung, and so on, and this set of words shares a similar meaning. The process of reducing these different forms to a common base form is known as stemming. Stemming produces the root/base form of a word from its morphological variants. Humans can easily identify these base forms and derive context.

When analyzing text, it's useful to extract these base forms. Doing so enables the extraction of useful statistics derived from the input text. Stemming is one way to achieve this. The goal of a stemmer is to reduce words from their different forms into a common base form. It is basically a heuristic process that cuts off the ends of words to extract their base forms. Let's see how to do it using NLTK...
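
As a minimal sketch of this process, the following uses NLTK's Porter, Lancaster, and Snowball stemmers (the three stemmers that the next section refers to); the word list is an illustrative choice, not necessarily the book's exact input:

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

# Sample words to stem (illustrative list)
input_words = ['writing', 'calves', 'branded', 'randomize', 'possibly']

# Create the three stemmer objects
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

# Print each word alongside its three stemmed forms
for word in input_words:
    print(word, '=>', porter.stem(word), lancaster.stem(word), snowball.stem(word))

The Porter stemmer is the least strict of the three and the Lancaster stemmer is the most aggressive, which is why their outputs sometimes differ for the same word.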

Converting words to their base forms using lemmatization

Lemmatization is another method of reducing words to their base forms. In the previous section, we saw that some of the base forms that were obtained from those stemmers didn't make sense. Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is like stemming, but it brings context to the words. So, it links words with similar meanings to one word. For example, all three stemmers said that the base form of calves is calv, which is not a real word. Lemmatization takes a more structured approach to solve this problem. Here are some more examples of lemmatization:

  • rocks : rock
  • corpora : corpus
  • worse : bad

The lemmatization process uses the lexical and morphological analysis of words. It obtains the base forms by removing the inflectional word endings such as ing or ed. This base form of any word is known...
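
A minimal sketch using NLTK's WordNetLemmatizer reproduces the examples above; the part-of-speech argument matters, and the wordnet corpus must be downloaded first via nltk.download('wordnet'):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The pos argument tells the lemmatizer how to treat the word:
# 'n' = noun, 'v' = verb, 'a' = adjective
print(lemmatizer.lemmatize('calves', pos='n'))   # calf
print(lemmatizer.lemmatize('corpora', pos='n'))  # corpus
print(lemmatizer.lemmatize('worse', pos='a'))    # bad
print(lemmatizer.lemmatize('singing', pos='v'))  # sing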

Dividing text data into chunks

Text data usually needs to be divided into pieces for further analysis. This process is known as chunking, and it is used frequently in text analysis. The conditions used to divide the text into chunks can vary based on the problem at hand. Chunking is not the same as tokenization, even though both divide text into pieces: during chunking, we do not adhere to any fixed constraints, except that the output chunks need to be meaningful.

When we deal with large text documents, it becomes important to divide the text into chunks to extract meaningful information. In this section, we will see how to divide input text into several pieces.

Create a new Python file and import the following packages:

import numpy as np
from nltk.corpus import brown

Define a function to divide the input text into chunks. The first parameter is the text, and the second parameter is the number of words in each chunk:

# Split the input text into chunks...
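
The rest of the listing is not reproduced here, but a minimal chunker consistent with the description above might look like the following; the Brown corpus slice of 12,000 words and the chunk size of 700 are illustrative values:

def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []

    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []

    # Keep any remaining words as a final, shorter chunk
    if cur_chunk:
        output.append(' '.join(cur_chunk))

    return output

if __name__=='__main__':
    # A slice of the Brown corpus as sample text (requires nltk.download('brown'))
    input_text = ' '.join(brown.words()[:12000])

    # Divide into chunks of 700 words each and report the result
    chunks = chunker(input_text, 700)
    print('\nNumber of text chunks =', len(chunks))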

Extracting the frequency of terms using the Bag of Words model

One of the main goals of text analysis with the Bag of Words model is to convert text into a numerical form so that we can use machine learning on it. Consider text documents that contain many millions of words. In order to analyze them, we need to extract the text and convert it into a numerical representation.

Machine learning algorithms need numerical data to work with so that they can analyze the data and extract meaningful information. This is where the Bag of Words model comes in. This model extracts vocabulary from all the words in the documents and builds a model using a document-term matrix. This allows us to represent every document as a bag of words. We just keep track of word counts and disregard the grammatical details and the word order.

Let's see what a document-term matrix is all about. A document-term matrix is basically a table that gives us counts...
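
To make this concrete, here is a minimal sketch that builds a document-term matrix with scikit-learn's CountVectorizer; the three documents are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer

# A few illustrative documents
documents = [
    'The brown dog is running in the room',
    'The black dog is in the black room',
    'Running in the room is forbidden',
]

# Build the document-term matrix: one row per document, one column per word
count_vectorizer = CountVectorizer()
doc_term_matrix = count_vectorizer.fit_transform(documents)

# get_feature_names_out() requires scikit-learn 1.0+; older versions
# provide get_feature_names() instead
print('Vocabulary:', count_vectorizer.get_feature_names_out())
print(doc_term_matrix.toarray())

Each row of the printed array is the bag-of-words representation of one document: the count of every vocabulary word, with grammar and word order discarded.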

Building a category predictor

A category predictor is used to predict the category to which a given piece of text belongs. This is frequently used in text classification to categorize text documents. Search engines frequently use this tool to order search results by relevance. For example, let's say that we want to predict whether a given sentence belongs to sports, politics, or science. To do this, we build a corpus of data and train an algorithm. This algorithm can then be used for inference on unknown data.

In order to build this predictor, we will use a metric called Term Frequency – Inverse Document Frequency (tf-idf). In a set of documents, we need to understand the importance of each word. The tf-idf metric helps us to understand how important a given word is to a document in a set of documents.

Let's consider the first part of this metric. The Term Frequency (tf) is basically a measure of how frequently each word appears in a given document. Since different...
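
To show how the pieces fit together, here is a minimal sketch of a category predictor built with scikit-learn; the use of the 20 Newsgroups dataset, the particular categories, and the test sentences are illustrative assumptions rather than the book's exact listing:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Train on a small, illustrative subset of the 20 Newsgroups categories
categories = ['rec.sport.baseball', 'talk.politics.misc', 'sci.space']
training_data = fetch_20newsgroups(subset='train', categories=categories,
        shuffle=True, random_state=5)

# Convert the documents into term counts, then into tf-idf weights
count_vectorizer = CountVectorizer()
train_counts = count_vectorizer.fit_transform(training_data.data)
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)

# Train a Multinomial Naive Bayes classifier on the tf-idf features
classifier = MultinomialNB().fit(train_tfidf, training_data.target)

# Classify a couple of unseen sentences
input_data = ['The pitcher threw a fastball in the ninth inning',
        'The senate will vote on the new bill next week']
input_tfidf = tfidf_transformer.transform(count_vectorizer.transform(input_data))
for sentence, category in zip(input_data, classifier.predict(input_tfidf)):
    print(sentence, '==>', training_data.target_names[category])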

Constructing a gender identifier

Gender identification is an interesting problem and far from being an exact science. We can quickly think of names that can be used for both males and females:

  • Dana
  • Angel
  • Lindsey
  • Morgan
  • Jessie
  • Chris
  • Payton
  • Tracy
  • Stacy
  • Jordan
  • Robin
  • Sydney

In addition, in a heterogeneous society such as the United States, there are going to be many ethnic names that will not follow English rules. In general, we can take an educated guess for a wide range of names. In this simple example, we will use a heuristic to construct a feature vector and use it to train a classifier. The heuristic that will be used here is the last N letters of a given name. For example, if the name ends with ia, it's most likely a female name, such as Amelia or Genelia. On the other hand, if the name ends with rk, it's likely a male name, such as Mark or Clark. Since we are not sure of the exact number of letters to use...
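
A minimal sketch of this idea, using the labeled names that ship with NLTK and a Naive Bayes classifier; the feature encoding, the 80/20 split, and the range of N values tried are illustrative choices:

import random

from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

def extract_features(word, N=2):
    # Use the last N letters of the name as the only feature
    return {'feature': word[-N:].lower()}

if __name__=='__main__':
    # Labeled names from NLTK's names corpus (requires nltk.download('names'))
    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = male_list + female_list

    random.seed(5)
    random.shuffle(data)

    num_train = int(0.8 * len(data))

    # Try different numbers of trailing letters and check the accuracy
    for N in range(1, 5):
        features = [(extract_features(n, N), gender) for n, gender in data]
        train_data, test_data = features[:num_train], features[num_train:]
        classifier = NaiveBayesClassifier.train(train_data)
        accuracy = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Number of ending letters:', N, '- Accuracy =', accuracy, '%')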

Building a sentiment analyzer

Sentiment analysis is the process of determining the sentiment of a piece of text. For example, it can be used to determine whether a movie review is positive or negative. This is one of the most popular applications of natural language processing. We can add more categories as well, depending on the problem at hand. This technique can be used to get a sense of how people feel about a product, brand, or topic. It is frequently used to analyze marketing campaigns, opinion polls, social media presence, product reviews on e-commerce sites, and so on. Let's see how to determine the sentiment of a movie review.

We will use a Naive Bayes classifier to build this sentiment analyzer. First, extract all the unique words from the text. The NLTK classifier needs this data to be arranged in the form of a dictionary so that it can ingest it. Once the text data is divided into training and testing datasets, the Naive Bayes classifier will be trained...
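
A minimal sketch of such an analyzer, built on the movie_reviews corpus that ships with NLTK; the 80/20 split and the word-presence feature encoding are illustrative choices:

from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

def extract_features(words):
    # NLTK classifiers expect the features as a dictionary
    return dict([(word, True) for word in words])

if __name__=='__main__':
    # Positive and negative reviews (requires nltk.download('movie_reviews'))
    fileids_pos = movie_reviews.fileids('pos')
    fileids_neg = movie_reviews.fileids('neg')

    features_pos = [(extract_features(movie_reviews.words(fileids=[f])),
            'Positive') for f in fileids_pos]
    features_neg = [(extract_features(movie_reviews.words(fileids=[f])),
            'Negative') for f in fileids_neg]

    # 80% of each class for training, 20% for testing
    num_pos = int(0.8 * len(features_pos))
    num_neg = int(0.8 * len(features_neg))
    train_data = features_pos[:num_pos] + features_neg[:num_neg]
    test_data = features_pos[num_pos:] + features_neg[num_neg:]

    # Train the Naive Bayes classifier and evaluate it
    classifier = NaiveBayesClassifier.train(train_data)
    print('Accuracy:', nltk_accuracy(classifier, test_data))
    classifier.show_most_informative_features(10)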

Topic modeling using Latent Dirichlet Allocation

Topic modeling is the process of identifying patterns in text data that correspond to a topic. If the text contains multiple topics, then this technique can be used to identify and separate those themes within the input text. This technique can be used to uncover hidden thematic structure in a given set of documents.

Topic modeling helps us to organize documents in a way that makes them easier to analyze. One thing to note about topic modeling algorithms is that they don't need labeled data: topic modeling is a form of unsupervised learning, so it identifies the patterns on its own. Given the enormous volumes of text data generated on the internet, topic modeling is important because it enables the summarization of vast amounts of data that could not otherwise be processed.

Latent Dirichlet Allocation is a topic modeling technique, the underlying concept of which is that a given piece of text is a combination of multiple...
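
One common way to implement LDA in Python is with the gensim library; the following minimal sketch uses invented documents, and a real pipeline would also remove stop words and apply the stemming covered earlier in the chapter:

from gensim import corpora, models

# A few illustrative documents
documents = [
    'the stock market fell sharply as investors sold their shares',
    'the team won the championship after a dramatic final game',
    'investors are watching interest rates and the bond market',
    'the coach praised the players after the game',
]

# Tokenize the documents into lists of words
texts = [doc.lower().split() for doc in documents]

# Build a word <-> id mapping and a bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train an LDA model that assumes two underlying topics
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=25)

# Print the most significant words contributing to each topic
for topic_id, topic in lda.print_topics(num_topics=2, num_words=5):
    print('Topic', topic_id, ':', topic)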

Summary

In this chapter, we learned about various underlying concepts in natural language processing. We discussed tokenization and how to separate input text into multiple tokens. We learned how to reduce words to their base forms using stemming and lemmatization. We implemented a text chunker to divide input text into chunks based on predefined conditions.

We discussed the Bag of Words model and built a document-term matrix for input text. We then learned how to categorize text using machine learning. We constructed a gender identifier using a heuristic. We also used machine learning to analyze the sentiments of movie reviews. Finally, we discussed topic modeling and implemented a system to identify topics in a given document.

In the next chapter, we will learn how to model sequential data using Hidden Markov Models and then use them to analyze stock market data.
