You're reading from Data Labeling in Machine Learning with Python

Product type: Book
Published in: Jan 2024
Publisher: Packt
ISBN-13: 9781804610541
Edition: 1st Edition
Author: Vijaya Kumar Suda

Vijaya Kumar Suda is a seasoned data and AI professional boasting over two decades of expertise collaborating with global clients. Having resided and worked in diverse locations such as Switzerland, Belgium, Mexico, Bahrain, India, Canada, and the USA, Vijaya has successfully assisted customers spanning various industries. Currently serving as a senior data and AI consultant at Microsoft, he is instrumental in guiding industry partners through their digital transformation endeavors using cutting-edge cloud technologies and AI capabilities. His proficiency encompasses architecture, data engineering, machine learning, generative AI, and cloud solutions.

Labeling Data for Classification

In this chapter, we are going to learn how to label tabular data by applying business rules programmatically with Python libraries. In real-world use cases, not all of our data will have labels, but we need labeled data to train machine learning models and fine-tune foundation models. Manually labeling large sets of data or documents is cumbersome and expensive, because individual labels are created one by one. In addition, sharing private data with a crowdsourcing team outside the organization is not always secure.

Programmatic labeling is therefore needed to automate data labeling and to label large-scale datasets quickly. There are three main approaches to programmatic labeling. In the first approach, users create labeling functions and apply them to vast amounts of unlabeled data to automatically label large training datasets. In the second approach, users apply semi-supervised learning to create...

Technical requirements

We need to install the Snorkel library using the following command:

%pip install snorkel

You can download the dataset and Python notebook from the following link:

https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python/code/Ch02

The OpenAI setup requirements are the same as those described in Chapter 1.

Predicting labels with LLMs for tabular data

We will explore the process of predicting labels for tabular data classification tasks using large language models (LLMs) and few-shot learning.

In the case of few-shot learning, we provide a few training data examples in the form of text along with a prompt for the model. The model adapts to the context and responds to new questions from the user.

First, let’s examine how to predict labels using LLMs for tabular data.

For tabular data, the initial step involves converting the data into serialized text data using LangChain’s templates. LangChain templates allow converting rows of data into fluent sentences or paragraphs by mapping columns to text snippets with variables that are filled based on cell values. Once we have the text data, we can utilize it as few-shot examples, comprising pairs of questions along with their corresponding labels (answers). Subsequently, we will send this few-shot data to the model.
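
The serialization step described above can be sketched in a few lines. This is a minimal illustration that uses plain Python string formatting as a stand-in for LangChain's templates, so it is not the LangChain API. The column names (`age`, `hours_per_week`, `education`), the example rows, and the question text are all made up for illustration, and the call to the model itself is left out.

```python
# Plain-Python stand-in for a prompt template (not LangChain's API).
# The column names below are hypothetical.
ROW_TEMPLATE = (
    "The person is {age} years old, works {hours_per_week} hours per week, "
    "and has the education level {education}."
)


def serialize_row(row: dict) -> str:
    """Turn one table row into a sentence by filling the template."""
    return ROW_TEMPLATE.format(**row)


def build_few_shot_prompt(examples, query_row, question):
    """Combine labeled example rows and one unlabeled row into a prompt."""
    parts = []
    for row, label in examples:
        # Each few-shot example is a serialized row, the question, and its answer.
        parts.append(f"{serialize_row(row)}\n{question}\nAnswer: {label}")
    # The final block leaves the answer open for the model to complete.
    parts.append(f"{serialize_row(query_row)}\n{question}\nAnswer:")
    return "\n\n".join(parts)


# Made-up labeled rows used as few-shot examples.
examples = [
    ({"age": 45, "hours_per_week": 50, "education": "Masters"}, "yes"),
    ({"age": 22, "hours_per_week": 20, "education": "HS-grad"}, "no"),
]
query = {"age": 38, "hours_per_week": 40, "education": "Bachelors"}
question = "Does this person earn more than 50K?"

prompt = build_few_shot_prompt(examples, query, question)
print(prompt)
```

The resulting string is what would be sent to the model; the model's completion after the final "Answer:" is taken as the predicted label for the unlabeled row.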

...

Data labeling using Snorkel

In this section, we are going to learn what Snorkel is and how we can use it to label data in Python programmatically.

Labeling data is an important step of a data science project and critical for training models to solve specific business problems.

In many real-world cases, training data has no labels, or only a small amount of labeled data is available. For example, in a housing dataset, historical prices may be missing for most of the houses in some neighborhoods. Similarly, in finance, not every transaction has an associated invoice number. Historical data with labels is critical for businesses to train models and automate their business processes using machine learning (ML) and artificial intelligence. However, obtaining it requires either outsourcing the data labeling to expensive domain experts or waiting a long time for new labeled training data.

This is where Snorkel comes into the...
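
The idea behind labeling functions can be sketched in plain Python before turning to the library. The snippet below does not use the Snorkel API: it only illustrates several rules voting on each row, with a simple majority vote standing in for whatever aggregation the library performs. The rules, thresholds, and column names are hypothetical.

```python
from collections import Counter

# Label constants; ABSTAIN means "this rule has no opinion on the row".
ABSTAIN, LOW, HIGH = -1, 0, 1


# Hypothetical business rules over hypothetical columns.
def lf_age(row):
    return HIGH if 28 <= row["age"] <= 58 else LOW


def lf_education(row):
    if row["education"] in {"Bachelors", "Masters", "Doctorate"}:
        return HIGH
    return ABSTAIN


def lf_hours(row):
    return HIGH if row["hours_per_week"] > 40 else LOW


LABELING_FUNCTIONS = [lf_age, lf_education, lf_hours]


def majority_label(row):
    """Apply every rule and return the majority vote, or ABSTAIN on a tie."""
    votes = [lf(row) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    # If two labels share the highest count, the rules disagree evenly.
    if list(counts.values()).count(top_count) > 1:
        return ABSTAIN
    return top_label


# Made-up unlabeled rows.
rows = [
    {"age": 45, "education": "Masters", "hours_per_week": 50},
    {"age": 22, "education": "HS-grad", "hours_per_week": 20},
    {"age": 30, "education": "HS-grad", "hours_per_week": 35},
]
labels = [majority_label(r) for r in rows]
print(labels)
```

The third row receives one HIGH vote and one LOW vote, so the sketch abstains instead of guessing. Rows like this are the ones where a more careful way of combining the rules matters.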

Labeling data using the Compose library

Compose is an open source Python library developed to generate the labels for supervised machine learning. Compose creates labels from historical data using LabelMaker.

Subject matter experts or end users write labeling functions for the outcome of interest. For example, if the outcome of interest is the amount spent by a customer over a five-day period, the labeling function takes that five-day window of transaction data as input and returns the total amount spent. Let us walk through this example.
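
As a rough illustration of what such a labeling function computes, here is a minimal sketch in plain Python. It does not use the Compose API: the transaction tuples, the `label_customer` helper, and the window handling are assumptions made only to show the idea of summing the amounts that fall inside a five-day window.

```python
from datetime import datetime, timedelta

# Made-up transactions: (customer_id, timestamp, amount).
transactions = [
    (1, datetime(2024, 1, 1), 20.0),
    (1, datetime(2024, 1, 3), 15.5),
    (1, datetime(2024, 1, 7), 40.0),
    (2, datetime(2024, 1, 2), 5.0),
]


def total_spent(window):
    """Labeling function: total amount over one window of transactions."""
    return sum(amount for _, _, amount in window)


def label_customer(customer_id, window_start, days=5):
    """Collect one customer's transactions in the window and label them."""
    window_end = window_start + timedelta(days=days)
    window = [
        t for t in transactions
        if t[0] == customer_id and window_start <= t[1] < window_end
    ]
    return total_spent(window)


start = datetime(2024, 1, 1)
labels = {cid: label_customer(cid, start) for cid in (1, 2)}
print(labels)
```

Customer 1's purchase on January 7 falls outside the five-day window that starts on January 1, so it does not contribute to that label.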

Let us first install the composeml Python package. It is an open source Python library for prediction engineering:

pip install composeml

We will create a label for the total amount a customer spends in the next five days, based on the customer’s transaction history.

For this, let us first import composeml:

import composeml as cp

Then, load the sample data:

from demo.next_purchase import load_sample
df = load_sample...

Labeling data using semi-supervised learning

In this section, let us see how to generate labels using semi-supervised learning.

What is semi-supervised learning?

Semi-supervised learning falls in between supervised learning and unsupervised learning:

  • In supervised learning, the entire training dataset is labeled
  • In unsupervised learning, the entire training dataset is unlabeled
  • In semi-supervised learning, a very small portion of the data is labeled and the majority of the dataset is unlabeled

In this case, we first generate pseudo-labels by training a supervised model on the small labeled portion of the dataset:

  1. In the first step, we use the small labeled dataset to train a supervised model, and then use that model to generate an additional pseudo-labeled dataset:

    Training dataset = small labeled dataset

  2. In the second step, we combine the small labeled dataset with the pseudo-labeled dataset generated in the first step:

    Training...
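
The two steps above can be sketched with a deliberately simple model. The snippet below is a minimal illustration, assuming a one-dimensional feature and a nearest-centroid classifier as the supervised model. Both are stand-ins chosen to keep the example self-contained, and all data values are made up.

```python
def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean value per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


# Made-up data: a small labeled set and a larger pool without labels.
labeled_x = [1.0, 2.0, 9.0, 10.0]
labeled_y = [0, 0, 1, 1]
unlabeled_x = [1.5, 3.0, 8.0, 11.0]

# Step 1: train on the small labeled set, then predict pseudo-labels.
model = fit_centroids(labeled_x, labeled_y)
pseudo_y = [predict(model, x) for x in unlabeled_x]

# Step 2: retrain on the labeled data plus the pseudo-labeled data.
final_model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)

print(pseudo_y)
print(final_model)
```

In practice, pseudo-labels are often kept only when the model is confident about them. This sketch keeps all of them for brevity.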

Labeling data using K-means clustering

In this section, we are going to learn what the K-means clustering algorithm is and how it can be used to predict labels. Let us first understand what unsupervised learning is.

What is unsupervised learning?

Unsupervised learning is a category of machine learning where the algorithm is tasked with discovering patterns, structures, or relationships within a dataset without explicit guidance or labeled outputs. In other words, the algorithm explores the inherent structure of the data on its own. The primary goal of unsupervised learning is often to uncover hidden patterns, group similar data points, or reduce the dimensionality of the data.

In unsupervised learning, the data has no target variable. We group observations into clusters based on their similarity. Once we have a limited number of clusters, we can assign a label to each cluster.
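
The cluster-then-label idea can be sketched on a single numeric feature. This is a minimal, dependency-free illustration that assumes one-dimensional data, a deterministic initialization, and label names assigned by centroid order. The income values (in thousands) and the label names are made up.

```python
def kmeans_1d(values, k, iterations=20):
    """Simple 1-D K-means; assumes k >= 2 and at least k values."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly over the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    assignments = [
        min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
    ]
    return centroids, assignments


# Made-up incomes in thousands.
incomes = [20, 22, 25, 60, 62, 65, 120, 125, 130]
centroids, assignments = kmeans_1d(incomes, k=3)

# Business knowledge step: name the clusters by the order of their centroids.
names = ["low", "medium", "high"]
rank = sorted(range(3), key=lambda c: centroids[c])
cluster_name = {c: names[r] for r, c in enumerate(rank)}
labels = [cluster_name[a] for a in assignments]
print(labels)
```

The clustering itself only produces anonymous groups. The mapping from cluster to label in the last step is where domain knowledge enters.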

For example, in the case of customer segmentation...

Summary

In this chapter, we implemented business rules as Snorkel labeling functions to predict the income range, and we wrote labeling functions with the Compose library to predict the total amount spent by a customer during a given period. We learned how semi-supervised learning can be used to generate pseudo-labels and augment the training data. We also learned how K-means clustering can be used to cluster the income features and then predict the income for each cluster based on business knowledge.

In the next chapter, we are going to learn how to label data for regression using the Snorkel Python library, semi-supervised learning, and K-means clustering.
