Feature Engineering for Natural Language Data

In the previous chapter, we explored how to extract features from numerical data and images, along with a few algorithms used for that purpose. In this chapter, we'll continue with algorithms that extract features from natural language data.

Natural language is a special kind of data source in software engineering. With the introduction of GitHub Copilot and ChatGPT, it became evident that machine learning and artificial intelligence tools for software engineering tasks are no longer science fiction. Therefore, in this chapter, we’ll explore the first steps that made these technologies so powerful – feature extraction from natural language data.

In this chapter, we’ll cover the following topics:

  • Tokenizers and their role in feature extraction
  • Bag-of-words as a simple technique for processing natural language data
  • Word embeddings as more advanced methods that can capture contexts...

Natural language data in software engineering and the rise of GitHub Copilot

Programming has always been a mixture of science, engineering, and creativity. Creating new programs and instructing computers to do something has always been considered worth paying for – that's how programmers make their living. There have been attempts to automate programming and to support smaller tasks – for example, providing programmers with suggestions on how to use a specific function or library method.

Good programmers, however, can make programs that last and that are readable for others. They can also make reliable programs that work without maintenance for a long period. The best programmers are the ones who can solve very difficult tasks and follow the principles and best practices of software engineering.

In 2021, something happened – GitHub Copilot entered the stage and showed that automated tools, based on large language models...

What a tokenizer is and what it does

The first step in feature engineering for text data is to decide on the tokenization of the text. Tokenization is the process of extracting the parts of words that capture the meaning of the text without too many extra details.

There are different ways to extract tokens, which we’ll explore in this chapter, but to illustrate the problem of extracting tokens, let’s look at one word that can take different forms – print. The word by itself can be a token, but it can be in different forms, such as printing, printed, printer, prints, imprinted, and many others. If we use a simple tokenizer, each of these words will be one token – which means quite a few tokens. However, all these tokens capture some sort of meaning related to printing, so maybe we do not need so many of them.
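
To see the problem in practice, consider the following minimal sketch (not one of the book's own examples), where the simplest possible tokenizer – splitting the text on whitespace – turns every form of the word into a separate token:

# the simplest possible tokenizer: splitting the text on whitespace
text = "print printing printed printer prints imprinted"
naive_tokens = set(text.lower().split())
print(naive_tokens)
# {'print', 'printing', 'printed', 'printer', 'prints', 'imprinted'}
# six distinct tokens, even though they all relate to printing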

This is where tokenizers come in. Here, we can decide how to treat these different forms of the word. We could take the main part only –...

Bag-of-words and simple tokenizers

In Chapters 3 and 5, we saw the use of the bag-of-words feature extraction technique. This technique takes the text and counts the occurrences of tokens – in those chapters, the tokens were simply words. It is simple and computationally efficient, but it has a few problems.

When instantiating the bag-of-words vectorizer, we can use several parameters that strongly impact the results, as we did in the following fragment of code in the previous chapters:

from sklearn.feature_extraction.text import CountVectorizer
# create the feature extractor, i.e., BOW vectorizer
# please note the argument - max_features
# this argument says that we only want three features
# this will illustrate that we can get problems - e.g. noise
# when using too few features
vectorizer = CountVectorizer(max_features=3)

The max_features parameter is a cut-off value that reduces the number of features, but it can also introduce noise, where two (or more) distinct sentences end up with the same feature vector (we saw an example of such a sentence in Chapter...
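
To illustrate that effect, consider the following small, self-contained sketch – the two sentences are made up for this illustration – in which two clearly different sentences end up with identical feature vectors because max_features is set too low:

from sklearn.feature_extraction.text import CountVectorizer

# two sentences that differ only in words that are rare in the corpus
docs = ["the printer prints the file",
        "the printer prints the report"]

# keep only the three most frequent tokens across the corpus
vectorizer = CountVectorizer(max_features=3)
features = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['printer' 'prints' 'the']
print(features.toarray())
# both rows are [1, 1, 2] - the distinguishing words 'file' and 'report'
# fell outside the three most frequent features, so the two sentences
# become indistinguishable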

WordPiece tokenizer

A better way to tokenize and extract features from text documents is to use a WordPiece tokenizer. This tokenizer works by finding the pieces of text (subword units) that are the most common in the training data and that still allow it to discriminate between words. This kind of tokenizer needs to be trained – that is, we need to provide a set of representative texts to get the right vocabulary (tokens).

Let’s look at an example where we use a simple program, a module from an open source project, to train such a tokenizer and then apply this tokenizer to the famous “Hello World” program in C. Let’s start by creating the tokenizer:

from tokenizers import BertWordPieceTokenizer
# initialize the actual tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,              # remove control characters, normalize whitespace
    handle_chinese_chars=False,   # no special handling of CJK characters
    strip_accents=False,          # keep accented characters as they are
    lowercase=True                # lowercase all input text
)

In this example,...
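
Continuing with the tokenizer instance created above, one possible way to train it and apply it is shown in the following sketch – the file name, vocabulary size, and the "Hello World" string are assumptions made for illustration:

# train the tokenizer on a representative source file
# (the file name and vocab_size are illustrative assumptions)
tokenizer.train(files=["nx_icmp_checksum_compute.c"],
                vocab_size=500,
                min_frequency=2)

# apply the trained tokenizer to a "Hello World" program in C
hello_world = '#include <stdio.h>\nint main() { printf("Hello World"); return 0; }'
encoded = tokenizer.encode(hello_world)
print(encoded.tokens)   # subword pieces, e.g. ['#', 'in', '##clude', ...]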

BPE

A more advanced method for tokenizing text is the byte-pair encoding (BPE) algorithm. It is based on the same premises as the compression algorithm created in the 1990s by Gage. The original algorithm compresses data by repeatedly replacing the most frequent pair of bytes with a single byte that does not occur in the data. The BPE tokenizer does a similar thing, except that it merges the most frequent pairs of tokens into new tokens that do not yet occur in the text. In this way, the algorithm can create a much larger vocabulary than CountVectorizer and the WordPiece tokenizer. BPE is very popular both for its ability to handle large vocabularies and for its efficient implementation through the fastBPE library.

Let’s explore how to apply this tokenizer to the same data and check the difference between the previous two. The following code fragment shows how to instantiate this tokenizer from the Hugging Face library:

# in this example we use the tokenizers
# from the HuggingFace library
from tokenizers import Tokenizer
from tokenizers.models...
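
A minimal, self-contained sketch of how such a BPE tokenizer can be trained and applied with the Hugging Face tokenizers library could look like this – the file name and vocabulary size are assumptions made for illustration:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# create a BPE tokenizer with a placeholder for unknown tokens
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# train it on a representative source file (the file name is an assumption)
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train(files=["nx_icmp_checksum_compute.c"], trainer=trainer)

# tokenize a line of the "Hello World" program
encoded = tokenizer.encode('printf("Hello World");')
print(encoded.tokens)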

The SentencePiece tokenizer

SentencePiece is a more general option than BPE for one more reason: it allows us to treat whitespace as a regular token. This allows us to find more complex dependencies and therefore train models that understand more than just pieces of words – hence the name, SentencePiece. This tokenizer was originally introduced to enable the tokenization of languages such as Japanese, which do not use whitespace in the same way as, for example, English. The tokenizer can be installed by running the pip install -q sentencepiece command.

In the following code example, we’re instantiating and training the SentencePiece tokenizer:

import sentencepiece as spm
# this statement trains the tokenizer
spm.SentencePieceTrainer.train('--input="/content/drive/MyDrive/ds/cs_dos/nx_icmp_checksum_compute.c" --model_prefix=m --vocab_size=200')
# makes segmenter instance and
# loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp...
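
Once the model is trained, loading and using it could look like the following sketch – the sample line of code being tokenized is just an illustration:

# load the model file produced by the training step (m.model)
sp.load('m.model')

# segment a line of code into sentence pieces
pieces = sp.encode_as_pieces('unsigned long checksum = 0;')
print(pieces)   # e.g. ['▁unsigned', '▁long', '▁check', 'sum', ...]

# the same text encoded as vocabulary indices
ids = sp.encode_as_ids('unsigned long checksum = 0;')
print(ids)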

Word embeddings

Tokenizers are one way of extracting features from text. They are powerful and can be trained to create complex tokens and capture statistical dependencies of words. However, they are limited by the fact that they are completely unsupervised and do not capture any meaning or relationship between words. This means that tokenizers are great at providing input to neural network models, such as BERT, but sometimes we would like features that are more aligned with a certain task.

This is where word embeddings come to the rescue. The following code shows how to instantiate the word embedding model, which is imported from the gensim library. First, we need to prepare the dataset:

from gensim.models import word2vec
# now, we need to prepare a dataset
# in our case, let's just read a dataset that is a code of a program
# in this example, I use the file from an open source component - Azure NetX
# the actual part is not that important, as long as we have...

FastText

Luckily for us, there is an extension of the word2vec model that can approximate vectors for unknown (out-of-vocabulary) tokens – FastText. We can use it in a very similar way as we use word2vec:

from gensim.models import FastText
# create the instance of the model
model = FastText(vector_size=4,
                 window=3,
                 min_count=1)
# build a vocabulary
model.build_vocab(corpus_iterable=tokenized_sentences)
# and train the model
model.train(corpus_iterable=tokenized_sentences,
            total_examples=len(tokenized_sentences),
            epochs=10)

In the preceding code fragment, the model is trained on the same set of data as word2vec. model = FastText(vector_size=4, window=3, min_count...
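
What distinguishes FastText from word2vec is that it composes vectors from character n-grams, so it can build a vector for a token it has never seen. The following short sketch – continuing with the model trained above and using a token that is deliberately assumed to be unseen – illustrates this:

# FastText composes vectors from character n-grams, so it can
# approximate a vector even for a token it never saw during training
print(model.wv['checksummed'])

# word2vec, by contrast, would raise a KeyError for such a token;
# FastText can even look up its nearest neighbours
print(model.wv.most_similar('checksummed', topn=3))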

From feature extraction to models

The feature extraction methods presented in this chapter are not the only ones we can use. Quite a few more exist (to say the least). However, they all work similarly. Unfortunately, no silver bullet exists, and all models have advantages and disadvantages. For the same task, but a different dataset, simpler models may be better than complex ones.

Now that we have seen how to extract features from text, images, and numerical data, it’s time we start training the models. This is what we’ll do in the next chapter.

References

  • Al-Sabbagh, K.W., et al. Selective regression testing based on big data: comparing feature extraction techniques. In 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2020.
  • Staron, M., et al. Improving Quality of Code Review Datasets – Token-Based Feature Extraction Method. In Software Quality: Future Perspectives on Software Engineering Quality: 13th International Conference, SWQD 2021, Vienna, Austria, January 19–21, 2021, Proceedings. Springer, 2021.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Gage, P. A new algorithm for data compression. C Users Journal, 12(2): 23–38, 1994.
  • Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.