Mastering NLP from Foundations to LLMs

Product type: Book
Published in: Apr 2024
Publisher: Packt
ISBN-13: 9781804619186
Pages: 340
Edition: 1st
Authors (2): Lior Gazit, Meysam Ghaffari

Table of Contents (14 chapters)

Preface
Chapter 1: Navigating the NLP Landscape: A Comprehensive Introduction
Chapter 2: Mastering Linear Algebra, Probability, and Statistics for Machine Learning and NLP
Chapter 3: Unleashing Machine Learning Potentials in Natural Language Processing
Chapter 4: Streamlining Text Preprocessing Techniques for Optimal NLP Performance
Chapter 5: Empowering Text Classification: Leveraging Traditional Machine Learning Techniques
Chapter 6: Text Classification Reimagined: Delving Deep into Deep Learning Language Models
Chapter 7: Demystifying Large Language Models: Theory, Design, and Langchain Implementation
Chapter 8: Accessing the Power of Large Language Models: Advanced Setup and Integration with RAG
Chapter 9: Exploring the Frontiers: Advanced Applications and Innovations Driven by LLMs
Chapter 10: Riding the Wave: Analyzing Past, Present, and Future Trends Shaped by LLMs and AI
Chapter 11: Exclusive Industry Insights: Perspectives and Predictions from World Class Experts
Index
Other Books You May Enjoy

Empowering Text Classification: Leveraging Traditional Machine Learning Techniques

In this chapter, we’ll delve into the fascinating world of text classification, a foundational task in natural language processing (NLP) and machine learning (ML) that deals with categorizing text documents into predefined classes. As the volume of digital text data continues to grow exponentially, the ability to accurately and efficiently classify text has become increasingly important for a wide range of applications, such as sentiment analysis, spam detection, and document organization. This chapter provides a comprehensive overview of the key concepts, methodologies, and techniques that are employed in text classification, catering to readers from diverse backgrounds and skill levels.

We’ll begin by exploring the various types of text classification tasks and their unique characteristics, offering insights into the challenges and opportunities each type presents. Next, we’...

Technical requirements

To get the most out of this chapter, it is essential to have a solid foundation in several technical areas. A strong grasp of fundamental concepts in NLP, ML, and linear algebra is crucial, and familiarity with text preprocessing techniques, such as tokenization, stop word removal, and stemming or lemmatization, is necessary to understand the data preparation stage.

Additionally, understanding basic ML algorithms, such as logistic regression and support vector machines (SVMs), is important for implementing text classification models. Finally, being comfortable with evaluation metrics such as accuracy, precision, recall, and F1 score, along with concepts such as overfitting, underfitting, and hyperparameter tuning, will enable a deeper appreciation of the challenges and best practices in text classification.
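As a quick refresher, the evaluation metrics listed above can be computed directly from the counts of true/false positives and negatives. The toy predictions below are invented for illustration:

```python
# Toy example: computing accuracy, precision, recall, and F1 by hand.
# The labels and predictions here are invented for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

With 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, all four metrics come out to 0.75 here; in real evaluations they usually diverge, which is exactly why all four are worth tracking.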

Types of text classification

Text classification is an NLP task where ML algorithms assign predefined categories or labels to text based on its content. It involves training a model on a labeled dataset to enable it to accurately predict the category of unseen or new text inputs. Text classification methods can be categorized into three main types – supervised learning, unsupervised learning, and semi-supervised learning:

  • Supervised learning: This type of text classification involves training a model on labeled data, where each data point is associated with a target label or category. The model then uses this labeled data to learn the patterns and relationships between the input text and the target labels. Examples of supervised learning algorithms for text classification include naive Bayes, SVMs, and neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
  • Unsupervised learning: This type of text classification involves...
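A minimal supervised-learning sketch, using the naive Bayes algorithm mentioned above (assuming scikit-learn is available; the tiny review dataset is invented for demonstration):

```python
# Supervised text classification sketch: bag-of-words counts fed to a
# multinomial naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "great product, works perfectly",
    "terrible quality, broke in a day",
    "excellent value and fast shipping",
    "awful experience, do not buy",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict the category of unseen text
X_new = vectorizer.transform(["great value, excellent quality"])
prediction = model.predict(X_new)[0]
print(prediction)  # positive
```

Note that the vectorizer fitted on the training data must be reused (via `transform`, not `fit_transform`) on new text, so that word indices stay consistent.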

Text classification using TF-IDF

A one-hot encoded representation is a workable approach for classification. However, one of its weaknesses is that it does not account for how important a word is relative to the documents it appears in. TF-IDF helps solve this issue.

TF-IDF is a numerical statistic that is used to measure the importance of a word in a document within a document collection. It helps reflect the relevance of words in a document, considering not only their frequency within the document but also their rarity across the entire document collection. The TF-IDF value of a word increases proportionally to its frequency in a document but is offset by the frequency of the word in the entire document collection.

Here’s a detailed explanation of the mathematical equations involved in calculating TF-IDF:

  • Term frequency (TF): The TF of a word, t, in a document, d, represents the number of times the word occurs in the document, normalized by the total...
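The TF-IDF computation described above can be sketched in a few lines. This uses the classic formulation tf(t, d) · log(N / df(t)); libraries such as scikit-learn apply smoothing variants on top of it, and the toy corpus is invented:

```python
# Hand-rolled TF-IDF following the definitions above.
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
N = len(docs)

def tf(term, doc):
    # Term frequency: occurrences of the term, normalized by document length
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log of total docs over docs containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" occurs twice in the first document but appears in 2 of 3 documents,
# so it is down-weighted; "cat" occurs once but only in this document.
print(tf_idf("cat", docs[0]) > tf_idf("the", docs[0]))  # True
```

This is the behavior the section describes: a word's score grows with its in-document frequency but shrinks as the word becomes common across the collection.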

Text classification using Word2Vec

One of the methods to perform text classification is to convert the words into embedding vectors so that you can use those vectors for classification. Word2Vec is a well-known method to perform this task.

Word2Vec

Word2Vec is a group of neural network-based models used to create word embeddings, which are dense vector representations of words in a continuous vector space. These embeddings capture the semantic meaning of words and the relationships between them, based on the contexts in which they appear in the text. Word2Vec has two main architectures, continuous bag-of-words (CBOW) and skip-gram, both of which learn word embeddings by predicting words from their surrounding context:

  • CBOW: The CBOW architecture aims to predict the target word given its surrounding context words. It takes the average of the context word embeddings as input and...
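A common way to use such embeddings for classification is to represent each document as the average of its word vectors. The sketch below illustrates this with a tiny, hand-made 4-dimensional embedding table standing in for vectors a trained Word2Vec model would produce:

```python
# Document representation via embedding averaging. The toy embedding table
# below is a stand-in for trained Word2Vec vectors; its first dimension is
# deliberately constructed to encode sentiment polarity.
import numpy as np

embeddings = {
    "good":  np.array([ 0.9,  0.1,  0.0,  0.2]),
    "great": np.array([ 0.8,  0.2,  0.1,  0.1]),
    "bad":   np.array([-0.9,  0.0,  0.1,  0.2]),
    "awful": np.array([-0.8, -0.1,  0.0,  0.1]),
    "movie": np.array([ 0.0,  0.5,  0.5,  0.0]),
}

def doc_vector(tokens):
    # Average the embeddings of known words; out-of-vocabulary words are skipped
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

v_pos = doc_vector("good great movie".split())
v_neg = doc_vector("bad awful movie".split())

print(v_pos[0] > 0, v_neg[0] < 0)  # True True
```

The resulting document vectors can then be fed to any classifier (logistic regression, an SVM, and so on) exactly as TF-IDF vectors would be.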

Topic modeling – a particular use case of unsupervised text classification

Topic modeling is an unsupervised ML technique that’s used to discover abstract topics or themes within a large collection of documents. It assumes that each document can be represented as a mixture of topics, and each topic is represented as a distribution over words. The goal of topic modeling is to find the underlying topics and their word distributions, as well as the topic proportions for each document.

There are several topic modeling algorithms, but one of the most popular and widely used is LDA. We will discuss LDA in detail, including its mathematical formulation.

LDA

LDA is a generative probabilistic model that assumes the following generative process for each document:

  1. Choose the number of words in the document (in the original formulation, drawn from a Poisson distribution).
  2. Choose a topic distribution (θ) for the document from a Dirichlet distribution with parameter α.
  3. For each word in the document, do the following...
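In practice, fitting an LDA model inverts this generative process: given the documents, we infer the topic-word distributions and per-document topic mixtures. A minimal sketch with scikit-learn, on an invented toy corpus:

```python
# Fitting LDA on a toy corpus. The corpus, topic count, and hyperparameters
# are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "stocks market trading shares profit",
    "market shares investors trading stocks",
    "soccer match goal team players",
    "team players coach match season",
]

# LDA operates on raw word counts, not TF-IDF weights
counts = CountVectorizer().fit_transform(corpus)

# n_components is the number of topics; doc_topic_prior corresponds to alpha
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.5,
                                random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a document's topic mixture (theta) and sums to 1
print(doc_topics.shape)  # (4, 2)
```

On this corpus we would expect the finance documents to concentrate on one topic and the sports documents on the other, with `lda.components_` holding the per-topic word weights.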

Reviewing our use case – ML system design for NLP classification in a Jupyter Notebook

In this section, we will walk through a hands-on example. We will follow the steps we presented previously for articulating the problem, designing the solution, and evaluating the results. This section portrays the process that an ML developer goes through when working on a typical project in the industry. Refer to the notebook at https://colab.research.google.com/drive/1ZG4xN665le7X_HPcs52XSFbcd1OVaI9R?usp=sharing for more information.

The business objective

In this scenario, we are working for a financial news agency. Our objective is to publish news about companies and products in real time.

The technical objective

The CTO derives several technical objectives from the business objective. One objective is for the ML team: given a stream of financial tweets in real time, detect those tweets that discuss information about companies or products.

The pipeline

Let’s review...
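The kind of pipeline such a project assembles can be sketched as follows. This is an illustrative outline under stated assumptions, not the linked notebook's actual code; the tweet texts and labels are invented:

```python
# Illustrative end-to-end sketch for the tweet-detection objective:
# TF-IDF features feeding a logistic regression classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = [
    "AAPL just announced a new iPhone lineup",
    "TSLA deliveries beat expectations this quarter",
    "what a beautiful morning for a run",
    "traffic is terrible downtown today",
    "MSFT cloud revenue keeps growing fast",
    "thinking about lunch options near the office",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = discusses a company or product

# Chaining vectorizer and classifier keeps train/inference preprocessing in sync
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(tweets, labels)

pred = pipeline.predict(["GOOG launches a new product line"])[0]
```

In a real-time setting, the fitted pipeline object is what gets deployed: each incoming tweet passes through the same vectorizer and classifier in one `predict` call.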

Summary

In this chapter, we embarked on a comprehensive exploration of text classification, an indispensable aspect of NLP and ML. We delved into various types of text classification tasks, each presenting unique challenges and opportunities. This foundational understanding sets the stage for effectively tackling a broad range of applications, from sentiment analysis to spam detection.

We walked through the role of N-grams in capturing local context and word sequences within text, thereby enhancing the feature set used for classification tasks. We also illuminated the power of the TF-IDF method, the role of Word2Vec in text classification, and popular architectures such as CBOW and skip-gram, giving you a deep understanding of their mechanics.

Then, we introduced topic modeling and examined how popular algorithms such as LDA can be applied to text classification.

Lastly, we introduced a professional paradigm for leading an NLP-ML project in a business or research setting....
