2. Feature Extraction Methods
Overview
In this chapter, you will be able to categorize data based on its content and structure. You will be able to describe preprocessing steps in detail and implement them to clean up text data. You will learn about feature engineering and calculate the similarity between texts. Once you understand these concepts, you will be able to use word clouds and some other techniques to visualize text.
Introduction
In the previous chapter, we learned about the concepts of Natural Language Processing (NLP) and text analytics. We also took a quick look at various preprocessing steps. In this chapter, we will learn how to make text understandable to machine learning algorithms.
As we know, to use a machine learning algorithm on textual data, we need a numerical or vector representation of text data since most of these algorithms are unable to work directly with plain text or strings. But before converting the text data into numerical form, we will need to pass it through some preprocessing steps such as tokenization, stemming, lemmatization, and stop-word removal.
So, in this chapter, we will learn a little bit more about these preprocessing steps and how to extract features from the preprocessed text and convert them into vectors. We will also explore two popular methods for feature extraction (Bag of Words and Term Frequency-Inverse Document Frequency), as well as various methods...
Types of Data
To deal with data effectively, we need to understand the various forms in which it exists. First, let's explore the types of data that exist. There are two main ways to categorize data (by structure and by content), as explained in the upcoming sections.
Categorizing Data Based on Structure
Data can be divided on the basis of structure into three categories, namely, structured, semi-structured, and unstructured data, as shown in the following diagram:
Figure 2.1: Categorization based on content
These three categories are as follows:
- Structured data: This is the most organized form of data. It is represented in tabular formats such as Excel files and Comma-Separated Value (CSV) files. The following image shows what structured data usually looks like:
Figure 2.2: Structured data
The preceding table contains information about five people, with each row representing a person and each column representing one of their attributes...
Cleaning Text Data
The text data that we are going to discuss here is unstructured text data, which consists of written sentences. Most of the time, this text data cannot be used as it is for analysis because it contains some noisy elements, that is, elements that do not really contribute much to the meaning of the sentence at all. These noisy elements need to be removed because they do not contribute to the meaning and semantics of the text. If they're not removed, they can not only waste system memory and processing time, but also negatively impact the accuracy of the results. Data cleaning is the art of extracting meaningful portions from data by eliminating unnecessary details. Consider the sentence, "He tweeted, 'Live coverage of General Elections available at this.tv/show/ge2019. _/\_ Please tune in :) '. "
In this example, to perform NLP tasks on the sentence, we will need to remove the emojis, punctuation, and stop words, and then change the words...
Feature Extraction from Texts
As we already know, machine learning algorithms do not understand textual data directly. We need to represent the text data in numerical form or vectors. To convert each textual sentence into a vector, we need to represent it as a set of features. This set of features should uniquely represent the text, though, individually, some of the features may be common across many textual sentences. Features can be classified into two different categories:
- General features: These features are statistical calculations and do not depend on the content of the text. Some examples of general features could be the number of tokens in the text, the number of characters in the text, and so on.
- Specific features: These features are dependent on the inherent meaning of the text and represent the semantics of the text. For example, the frequency of unique words in the text is a specific feature.
Let's explore these in detail.
Extracting General Features...
Finding Text Similarity – Application of Feature Extraction
So far in this chapter, we have learned how to generate vectors from text. These vectors are then fed to machine learning algorithms to perform various tasks. Other than using them in machine learning applications, we can also perform simple NLP tasks using these vectors. Finding the string similarity is one of them. This is a technique in which we find the similarity between two strings by converting them into vectors. The technique is mainly used in full-text searching.
There are different techniques for finding the similarity between two strings or texts. They are explained one by one here:
- Cosine similarity: The cosine similarity is a technique to find the similarity between the two vectors by calculating the cosine of the angle between them. As we know, the cosine of a zero-degree angle is 1 (meaning the cosine similarity of two identical vectors is 1), while the cosine of 180 degrees is -1 (meaning...
Summary
In this chapter, you have learned about various types of data and ways to deal with unstructured text data. Text data is usually extremely noisy and needs to be cleaned and preprocessed, which mainly consists of tokenization, stemming, lemmatization, and stop-word removal. After preprocessing, features are extracted from texts using various methods, such as BoW and TFIDF. These methods convert unstructured text data into structured numeric data. New features are created from existing features using a technique called feature engineering. In the last part of this chapter, we explored various ways of visualizing text data, such as word clouds.
In the next chapter, you will learn how to develop machine learning models to classify texts using the feature extraction methods you have learned about in this chapter. Moreover, different sampling techniques and model evaluation parameters will be introduced.