
You're reading from The Natural Language Processing Workshop

Product type: Book
Published in: Aug 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800208421
Edition: 1st
Authors (6):
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelor's degree in computer science. He has more than two years' experience in designing, implementing, and optimizing end-to-end deep neural network systems. His research centers on using deep learning to solve computer vision problems, and he has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.

Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with interdisciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and an MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion-pages contributor to Mint, a leading business newspaper in India. He has fifteen years of experience.

Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at a US healthcare company, where he leads a team of data scientists and subject matter experts to design and build a clinical NLP engine that revamps medical coding workflows, enhances coder efficiency, and accelerates the revenue cycle. He has more than three years' experience building NLP solutions and web-based data science platforms in healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering, with a fair understanding of the business domain. He is a member of the regional and national Python communities and the author of pySBD, an open-source Python library for sentence segmentation recognized by the ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.

Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with extensive experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a master's degree in computer science from the University of Kashmir and is currently working at Datoin, a Bangalore-based startup.

Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena, specializing in deep learning-based text analytics, NLP, and recommendation systems. He has published in several international conferences and journals.

Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools, such as NLTK, gensim, and spaCy.


3. Developing a Text Classifier

Overview

This chapter starts with an introduction to the two main types of machine learning methods: supervised and unsupervised. You will learn about hierarchical clustering and k-means clustering and implement them on various datasets. Next, you will explore tree-based methods such as random forest and XGBoost. Finally, you will implement an end-to-end text classifier that categorizes text based on its content.

Introduction

In the previous chapters, you learned about various text preprocessing methods, such as tokenization, stemming, lemmatization, and stop-word removal, which are used to prepare unstructured text for feature extraction. We also discussed Bag of Words and Term Frequency-Inverse Document Frequency (TFIDF).

In this chapter, you will learn how to use these extracted features to develop machine learning models. These models can solve real-world problems, such as detecting whether the sentiment carried by a text is positive or negative, or predicting whether an email is spam. We will also cover concepts such as supervised and unsupervised learning, classification and regression, and sampling and splitting data, along with evaluating the performance of a model in depth. This chapter also discusses how to save these models and load them back for future use.

Machine Learning

Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

Machine learning algorithms are fed large amounts of data from which they build a model. Businesses later use this model to generate solutions that help them analyze data and build strategies for the future. For example, a beverage production company can use multiple datasets to better understand the trends of its product's consumption over the course of a year. This helps it reduce wastage and better predict its consumers' requirements. Machine learning is broadly categorized into unsupervised and supervised learning. Let's explore these two terms in detail.

Unsupervised Learning

Unsupervised learning is the method by which algorithms learn patterns within data that is not labeled. Since labels...
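As a quick illustration of the clustering algorithms covered later in this chapter, the following sketch groups unlabeled points with scikit-learn's KMeans; the toy coordinates are invented for the example:

```python
from sklearn.cluster import KMeans
import numpy as np

# Unlabeled 2D points forming two visually separable groups (toy values).
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [8.1, 7.9], [7.8, 8.3]])

# Ask k-means to find two clusters in the unlabeled data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Points in the same group end up with the same cluster label.
print(labels)
```

Note that the algorithm was never told which point belongs to which group; it inferred the grouping purely from the distances between points.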

Supervised Learning

Unlike unsupervised learning, supervised learning algorithms need labeled data. They learn how to automatically generate labels or predict values by analyzing the various features of the data provided. For example, say you have already starred the important text messages on your phone and want to automate the daily task of sorting through new messages. This is a use case for supervised learning: the previously starred messages serve as labeled data. Using this data, you can create two types of models, capable of the following:

  • Classifying whether new messages are important
  • Predicting the probability of new messages being important

The first type is called classification, while the second type is called regression. Let's learn about classification first.
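To make the distinction concrete, here is a small sketch using scikit-learn's LogisticRegression on an invented "starred messages" dataset: predict returns a hard class label (the classification-style output), while predict_proba returns a probability estimate. The keyword-count feature is hypothetical, chosen only for illustration:

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical feature: a count of "important" keywords per message.
X = np.array([[0], [1], [5], [6], [7], [8]])
y = np.array([0, 0, 1, 1, 1, 1])  # 1 = starred/important, 0 = not

model = LogisticRegression().fit(X, y)

# Classification-style output: a hard label for a new message.
print(model.predict([[7]]))
# Probability-style output: how likely the message is important.
print(model.predict_proba([[7]]))
```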

Classification

Say you have two types of food, of which type 1 tastes sweet and type 2 tastes salty...

Developing a Text Classifier

A text classifier is a machine learning model that is capable of labeling texts based on their content. For instance, a text classifier can help you determine whether a random text statement is sarcastic. Text classifiers are gaining importance because manually classifying huge amounts of text data is impractical. In the next few sections, we will learn about the different parts of a text classifier and implement them in Python.

Feature Extraction

When dealing with text data, features denote its different attributes. Generally, they are numeric representations of the text. As we discussed in Chapter 2, Feature Extraction Methods, TFIDF representations of texts are one of the most popular ways of extracting features from them.

Feature Engineering

Feature engineering is the art of extracting new features from existing ones. Extracting novel features, which tend to capture variation in data better, requires sound domain expertise.
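As a small illustration, the following sketch derives a few hand-crafted features from raw text with pandas; the messages and feature names are invented for the example:

```python
import pandas as pd

# Two invented messages to engineer features from.
df = pd.DataFrame({"text": ["Win a FREE prize now!!!",
                            "Are we still meeting for lunch today?"]})

# Hand-crafted features derived from the raw text.
df["char_count"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()
df["exclamations"] = df["text"].str.count("!")
df["upper_ratio"] = df["text"].apply(
    lambda t: sum(c.isupper() for c in t) / len(t))

print(df[["char_count", "word_count", "exclamations", "upper_ratio"]])
```

Features like these often complement a TFIDF representation; for instance, a high exclamation count or uppercase ratio can be a useful signal in spam detection.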

...

Building Pipelines for NLP Projects

In general, a pipeline is a structure that allows a streamlined flow of water, gas, or something similar. Here, the term has an analogous meaning: a pipeline streamlines the various stages of an NLP project.

An NLP project is done in various stages, such as tokenization, stemming, feature extraction (TFIDF matrix generation), and model building. Instead of carrying out each stage separately, we create an ordered list of all these stages. This list is known as a pipeline. The Pipeline class of sklearn helps us combine these stages into one object that we can use to perform these stages one after the other in a sequence. We will solve a text classification problem using a pipeline in the next section to understand the working of a pipeline better.
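A minimal sketch of such a pipeline, chaining a TFIDF step and a classifier with sklearn's Pipeline class (the tiny spam dataset below is invented, not the book's data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny invented dataset (the exercise itself uses fetch_20newsgroups).
texts = ["free money win prize",
         "claim your free reward now",
         "meeting at noon tomorrow",
         "project status update attached"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Each stage is a (name, estimator) pair; stages run in the listed order.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # text -> TFIDF matrix
    ("clf", MultinomialNB()),      # TFIDF matrix -> class label
])
pipeline.fit(texts, labels)

print(pipeline.predict(["win a free prize"]))
```

A single fit call runs every stage in sequence, and predict on new raw text reuses the fitted vectorizer automatically, which is exactly the streamlining the pipeline provides.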

Exercise 3.14: Building the Pipeline for an NLP Project

In this exercise, we will develop a pipeline that will allow us to create a TFIDF matrix representation from sklearn's fetch_20newsgroups...

Saving and Loading Models

After a model has been built and its performance matches our expectations, we may want to save it for future use. This eliminates the time needed for rebuilding it. Models can be saved on the hard disk using the joblib and pickle libraries.

The pickle module uses binary protocols to save and load Python objects. joblib builds on pickle's protocols but is optimized for objects that carry large data arrays, making it an efficient choice for saving machine learning models. Both libraries provide two main functions that we will use to save and load our models:

  • dump: This function saves a Python object to a file on the disk.
  • load: This function loads a saved Python object from a file on the disk.

To deploy saved models, we need to load them from the hard disk to the memory. In the next section, we will perform an exercise based on this to get a better understanding of this process.
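A minimal sketch of the save-and-load round trip with joblib (the small model and filename below are invented for illustration):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# A small invented model to save and restore.
X = np.array([[0], [1], [2], [3]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# dump writes the fitted model to disk; load reads it back into memory.
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")

# The restored model makes the same predictions as the original.
print((restored.predict(X) == model.predict(X)).all())
```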

Exercise 3.15: Saving and Loading Models

...

Summary

In this chapter, you learned about the different types of machine learning techniques: supervised and unsupervised learning. We explored unsupervised algorithms such as hierarchical clustering and k-means clustering, as well as supervised algorithms such as k-nearest neighbors, the Naive Bayes classifier, and tree-based methods such as random forest and XGBoost, which can perform both regression and classification. We discussed the need for sampling and went over different sampling techniques for splitting a dataset into training and validation sets. Finally, we covered the process of saving a model to the hard disk and loading it back into memory for future use.

In the next chapter, you will learn about several techniques that you can use to collect data from various sources.
