You're reading from Data Labeling in Machine Learning with Python

Product type: Book

Published in: Jan 2024

Publisher: Packt

ISBN-13: 9781804610541

Edition: 1st Edition

Concepts

Machine Learning

Author (1)

Vijaya Kumar Suda

Labeling Data for Classification

In this chapter, we will learn how to label tabular data by applying business rules programmatically with Python libraries. In real-world use cases, not all of our data will have labels, yet we need labeled data to train machine learning models and fine-tune foundation models. Manually labeling large sets of data or documents is cumbersome and expensive because each label is created one by one. In addition, sharing private data with a crowdsourcing team outside the organization may not be secure.

Programmatic labeling is therefore needed to automate the process and label large-scale datasets quickly. There are three main approaches to programmatic labeling. In the first approach, users create labeling functions and apply them to vast amounts of unlabeled data to automatically label large training datasets. In the second approach, users apply semi-supervised learning to create...

Technical requirements

We need to install the Snorkel library using the following command:

%pip install snorkel

You can download the dataset and Python notebook from the following link:

https://github.com/PacktPublishing/Data-Labeling-in-Machine-Learning-with-Python/code/Ch02

The OpenAI setup requirements are the same as those described in Chapter 1.

Predicting labels with LLMs for tabular data

We will explore the process of predicting labels for tabular data classification tasks using large language models (LLMs) and few-shot learning.

In the case of few-shot learning, we provide a few training data examples in the form of text along with a prompt for the model. The model adapts to the context and responds to new questions from the user.

First, let’s examine how to predict labels using LLMs for tabular data.

For tabular data, the initial step involves converting the data into serialized text data using LangChain’s templates. LangChain templates allow converting rows of data into fluent sentences or paragraphs by mapping columns to text snippets with variables that are filled based on cell values. Once we have the text data, we can utilize it as few-shot examples, comprising pairs of questions along with their corresponding labels (answers). Subsequently, we will send this few-shot data to the model.
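The serialization and few-shot steps described above can be sketched in plain Python. This is a minimal illustration, not the chapter's LangChain code: it uses `str.format` as a stand-in for a LangChain template, and the column names, example rows, labels, and prompt wording are all hypothetical. It only builds the prompt text; sending it to a model is left out.

```python
# Hypothetical template: each {placeholder} is filled from the matching column.
ROW_TEMPLATE = (
    "The person is {age} years old, works {hours_per_week} hours per week, "
    "and has {education} education."
)

# Hypothetical labeled rows used as few-shot examples.
labeled_rows = [
    {"age": 45, "hours_per_week": 50, "education": "Masters", "label": ">50K"},
    {"age": 22, "hours_per_week": 20, "education": "HS-grad", "label": "<=50K"},
]
# Hypothetical unlabeled row whose label we want the model to predict.
new_row = {"age": 38, "hours_per_week": 45, "education": "Bachelors"}


def serialize_row(row):
    """Turn one table row into a sentence; extra keys such as 'label' are ignored."""
    return ROW_TEMPLATE.format(**row)


def build_few_shot_prompt(examples, query_row):
    """Concatenate serialized question/answer pairs, then the unlabeled question."""
    parts = ["Classify each person's income as >50K or <=50K.", ""]
    for row in examples:
        parts.append(f"Question: {serialize_row(row)}")
        parts.append(f"Answer: {row['label']}")
        parts.append("")
    parts.append(f"Question: {serialize_row(query_row)}")
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(labeled_rows, new_row)
print(prompt)
```

The prompt ends with an open "Answer:" so that the model's completion can be read back as the predicted label for the new row.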

...

Data labeling using Snorkel

In this section, we are going to learn what Snorkel is and how we can use it to label data in Python programmatically.

Labeling data is an important step of a data science project and critical for training models to solve specific business problems.

In many real-world cases, training data has no labels, or only a small amount of labeled data is available. For example, in a housing dataset, historical prices may be unavailable for most of the houses in some neighborhoods. Similarly, in finance, not every transaction may have an associated invoice number. Historical data with labels is critical for businesses that want to train models and automate their processes using machine learning (ML) and artificial intelligence. Obtaining it, however, requires either outsourcing the labeling to expensive domain experts or waiting a long time for new labeled training data.

This is where Snorkel comes into the...
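The idea of rule-based labeling functions can be sketched without any library. The following is a plain-Python illustration only and does not use Snorkel's API (Snorkel provides its own tooling for defining, applying, and combining labeling functions). The rules, thresholds, column names, and the simple majority vote are all assumptions made for this sketch.

```python
from collections import Counter

# Label values chosen for this sketch; ABSTAIN means "this rule has no opinion".
ABSTAIN, LOW, HIGH = -1, 0, 1


def lf_long_hours(row):
    """Hypothetical rule: long working weeks suggest a high income range."""
    return HIGH if row["hours_per_week"] >= 50 else ABSTAIN


def lf_young(row):
    """Hypothetical rule: very young workers suggest a low income range."""
    return LOW if row["age"] < 25 else ABSTAIN


def lf_advanced_degree(row):
    """Hypothetical rule: an advanced degree suggests a high income range."""
    return HIGH if row["education"] in {"Masters", "Doctorate"} else ABSTAIN


LABELING_FUNCTIONS = [lf_long_hours, lf_young, lf_advanced_degree]


def majority_label(row):
    """Combine the rules by majority vote; abstain when no rule fires or on a tie."""
    votes = [lf(row) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


rows = [
    {"age": 45, "hours_per_week": 55, "education": "Masters"},    # two HIGH votes
    {"age": 22, "hours_per_week": 20, "education": "HS-grad"},    # one LOW vote
    {"age": 35, "hours_per_week": 40, "education": "Bachelors"},  # every rule abstains
    {"age": 23, "hours_per_week": 60, "education": "HS-grad"},    # LOW and HIGH tie
]
labels = [majority_label(row) for row in rows]
print(labels)
```

Rows that end up with ABSTAIN stay unlabeled, which is one reason several overlapping rules are usually written for the same outcome.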

Labeling data using the Compose library

Compose is an open source Python library developed to generate the labels for supervised machine learning. Compose creates labels from historical data using LabelMaker.

Subject matter experts or end users write labeling functions for the outcome of interest. For example, if the outcome of interest is the amount a customer spends over a five-day window, then the labeling function takes that window of transaction data as input and returns the total amount spent. We will work through this example next.

Let us first install the composeml Python package. It is an open source Python library for prediction engineering:

pip install composeml

We will create a label for the total amount a customer spends in the next five days, based on the customer's transaction history.

For this, let us first import composeml:

import composeml as cp

Then, load the sample data:

from demo.next_purchase import load_sample
df = load_sample...
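The windowed labeling idea described in this section can be sketched in plain Python. This is not Compose's LabelMaker and does not reproduce its API; it only illustrates a labeling function that receives a five-day window of transactions and returns the total spent. The transaction tuples, the cutoff date, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical transactions: (customer_id, transaction_time, amount).
transactions = [
    (1, datetime(2024, 1, 1), 20.0),
    (1, datetime(2024, 1, 3), 35.0),
    (1, datetime(2024, 1, 7), 10.0),
    (2, datetime(2024, 1, 2), 50.0),
    (2, datetime(2024, 1, 9), 5.0),
]


def total_spent(window_rows):
    """Labeling function: total amount across the transactions in one window."""
    return sum(amount for _, _, amount in window_rows)


def make_labels(rows, cutoff, window=timedelta(days=5)):
    """Label each customer with total_spent over the window [cutoff, cutoff + window)."""
    by_customer = {}
    for row in rows:
        customer_id, when, _ = row
        window_rows = by_customer.setdefault(customer_id, [])
        if cutoff <= when < cutoff + window:
            window_rows.append(row)
    return {cid: total_spent(win) for cid, win in by_customer.items()}


labels = make_labels(transactions, cutoff=datetime(2024, 1, 1))
print(labels)
```

Moving the cutoff forward produces a new label per customer for the next window, which is how a single transaction history can yield many training examples.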

Labeling data using semi-supervised learning

In this section, let us see how to generate labels using semi-supervised learning.

What is semi-supervised learning?

Semi-supervised learning falls in between supervised learning and unsupervised learning:

In supervised learning, the entire training dataset is labeled
In unsupervised learning, the entire training dataset is unlabeled
In semi-supervised learning, a very small part of the dataset is labeled and the majority is unlabeled

Here, we first generate pseudo-labels by training a supervised model on the small labeled portion of the dataset:

In the first step, we train the supervised model on this training dataset and use it to generate an additional pseudo-labeled dataset:
Training dataset = small labeled dataset
In the second step, we use the small labeled dataset along with the pseudo-labeled dataset generated in the first step:
Training...
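The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: a nearest-centroid classifier written in plain Python stands in for whatever supervised model is actually used, and the feature values (age, hours per week) and class names are hypothetical.

```python
def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return min(centroids, key=squared_distance)


# Hypothetical data: (age, hours_per_week) with only a small labeled set.
labeled_x = [(25, 20), (30, 25), (50, 55), (55, 60)]
labeled_y = ["low", "low", "high", "high"]
unlabeled_x = [(28, 22), (52, 58), (24, 30), (60, 50)]

# Step 1: train on the small labeled set and pseudo-label the unlabeled rows.
model = fit_centroids(labeled_x, labeled_y)
pseudo_y = [predict(model, x) for x in unlabeled_x]

# Step 2: retrain on the labeled data together with the pseudo-labeled data.
final_model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
print(pseudo_y)
```

In practice, pseudo-labels are often kept only when the model is confident about them; this sketch keeps all of them for brevity.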

Labeling data using K-means clustering

In this section, we are going to learn what the K-means clustering algorithm is and how it can be used to predict labels. Let us first review what unsupervised learning is.

What is unsupervised learning?

Unsupervised learning is a category of machine learning where the algorithm is tasked with discovering patterns, structures, or relationships within a dataset without explicit guidance or labeled outputs. In other words, the algorithm explores the inherent structure of the data on its own. The primary goal of unsupervised learning is often to uncover hidden patterns, group similar data points, or reduce the dimensionality of the data.

In the case of unsupervised learning, we have data with no target variable. We group observations into clusters based on similarity. Once the data is reduced to a small number of clusters, we can assign a label to each cluster.

For example, in the case of customer segmentation...
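Clustering first and labeling the clusters afterward can be sketched in plain Python. This is a simplified one-dimensional K-means written for illustration, not a library implementation; the spend values, the choice of three clusters, the way initial centers are picked, and the segment names are all assumptions.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain K-means on scalar values; requires k >= 2."""
    ordered = sorted(values)
    # Spread the initial centers across the sorted values (a simple deterministic choice).
    centers = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical annual spend per customer, with no target variable.
spend = [120, 150, 130, 900, 950, 1000, 5000, 5200]
centers = kmeans_1d(spend, k=3)

# Business knowledge supplies the names: lowest center -> "low", and so on.
segment_names = dict(zip(sorted(centers), ["low", "medium", "high"]))


def label_for(value):
    """Assign the name of the nearest cluster center."""
    nearest = min(centers, key=lambda c: abs(value - c))
    return segment_names[nearest]


labels = [label_for(v) for v in spend]
print(labels)
```

The algorithm only finds the groups; the meaning attached to each group comes from the person who inspects the cluster centers.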

Summary

In this chapter, we implemented rules as Snorkel labeling functions to predict the income range, and wrote labeling functions with the Compose library to predict the total amount spent by a customer during a given period. We learned how semi-supervised learning can be used to generate pseudo-labels and augment the training data. We also learned how K-means clustering can be used to cluster the income features and then predict the income for each cluster based on business knowledge.

In the next chapter, we are going to learn how to label data for regression using the Snorkel Python library, semi-supervised learning, and K-means clustering.

Author (1)

Vijaya Kumar Suda

Vijaya Kumar Suda is a seasoned data and AI professional boasting over two decades of expertise collaborating with global clients. Having resided and worked in diverse locations such as Switzerland, Belgium, Mexico, Bahrain, India, Canada, and the USA, Vijaya has successfully assisted customers spanning various industries. Currently serving as a senior data and AI consultant at Microsoft, he is instrumental in guiding industry partners through their digital transformation endeavors using cutting-edge cloud technologies and AI capabilities. His proficiency encompasses architecture, data engineering, machine learning, generative AI, and cloud solutions.