Feature Engineering for Natural Language Data

In the previous chapter, we explored how to extract features from numerical data and images, along with a few algorithms used for that purpose. In this chapter, we'll continue with algorithms that extract features from natural language data.

Natural language is a special kind of data source in software engineering. With the introduction of GitHub Copilot and ChatGPT, it became evident that machine learning and artificial intelligence tools for software engineering tasks are no longer science fiction. Therefore, in this chapter, we’ll explore the first steps that made these technologies so powerful – feature extraction from natural language data.

In this chapter, we’ll cover the following topics:

  • Tokenizers and their role in feature extraction
  • Bag-of-words as a simple technique for processing natural language data
  • Word embeddings as more advanced methods that can capture contexts...

Natural language data in software engineering and the rise of GitHub Copilot

Programming has always been a mixture of science, engineering, and creativity. Creating new programs and instructing computers to do something has always been considered worth paying for – that's how programmers make their living. There have been attempts to automate programming and to support smaller tasks – for example, providing programmers with suggestions on how to use a specific function or library method.

Good programmers, however, can make programs that last and that are readable for others. They can also make reliable programs that work without maintenance for a long period. The best programmers are the ones who can solve very difficult tasks and follow the principles and best practices of software engineering.

In 2021, something happened – GitHub Copilot entered the stage and showed that automated tools, based on large language models...

What a tokenizer is and what it does

The first step in feature engineering for text data is to decide on the tokenization of the text. Tokenization is the process of extracting the parts of words that capture the meaning of the text without too many extra details.

There are different ways to extract tokens, which we’ll explore in this chapter, but to illustrate the problem of extracting tokens, let’s look at one word that can take different forms – print. The word by itself can be a token, but it can be in different forms, such as printing, printed, printer, prints, imprinted, and many others. If we use a simple tokenizer, each of these words will be one token – which means quite a few tokens. However, all these tokens capture some sort of meaning related to printing, so maybe we do not need so many of them.
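
To see the problem in practice, consider the following minimal sketch (not one of the book's own examples), where the simplest possible tokenizer – splitting the text on whitespace – turns every form of the word into a separate token:

# the simplest possible tokenizer: splitting the text on whitespace
text = "print printing printed printer prints imprinted"
naive_tokens = set(text.lower().split())
print(naive_tokens)
# {'print', 'printing', 'printed', 'printer', 'prints', 'imprinted'}
# six distinct tokens, even though they all relate to printing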

This is where tokenizers come in. Here, we can decide how to treat these different forms of the word. We could take the main part only –...

Bag-of-words and simple tokenizers

In Chapters 3 and 5, we saw the use of the bag-of-words feature extraction technique. This technique takes the text and counts the occurrences of tokens – in those chapters, the tokens were simply words. It is simple and computationally efficient, but it has a few problems.

When instantiating the bag-of-words vectorizer, we can use several parameters that strongly impact the results, as we did in the following fragment of code in the previous chapters:

from sklearn.feature_extraction.text import CountVectorizer
# create the feature extractor, i.e., BOW vectorizer
# please note the argument - max_features
# this argument says that we only want three features
# this will illustrate that we can get problems - e.g. noise
# when using too few features
vectorizer = CountVectorizer(max_features=3)

The max_features parameter is a cut-off value that reduces the number of features, but it can also introduce noise, where two (or more) distinct sentences end up with the same feature vector (we saw an example of such a sentence in Chapter...
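
To illustrate that effect, consider the following small, self-contained sketch – the two sentences are made up for this illustration – in which two clearly different sentences end up with identical feature vectors because max_features is set too low:

from sklearn.feature_extraction.text import CountVectorizer

# two sentences that differ only in words that are rare in the corpus
docs = ["the printer prints the file",
        "the printer prints the report"]

# keep only the three most frequent tokens across the corpus
vectorizer = CountVectorizer(max_features=3)
features = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['printer' 'prints' 'the']
print(features.toarray())
# both rows are [1, 1, 2] - the distinguishing words 'file' and 'report'
# fell outside the three most frequent features, so the two sentences
# become indistinguishable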

WordPiece tokenizer

A better way to tokenize and extract features from text documents is to use a WordPiece tokenizer. This tokenizer works by finding the pieces of text (subword units) that are the most common in the training data and that still allow it to discriminate between words. This kind of tokenizer needs to be trained – that is, we need to provide a set of representative texts to get the right vocabulary (tokens).

Let’s look at an example where we use a simple program, a module from an open source project, to train such a tokenizer and then apply this tokenizer to the famous “Hello World” program in C. Let’s start by creating the tokenizer:

from tokenizers import BertWordPieceTokenizer
# initialize the actual tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,              # remove control characters, normalize whitespace
    handle_chinese_chars=False,   # no special handling of CJK characters
    strip_accents=False,          # keep accented characters as they are
    lowercase=True                # lowercase all input text
)

In this example,...
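
Continuing with the tokenizer instance created above, one possible way to train it and apply it is shown in the following sketch – the file name, vocabulary size, and the "Hello World" string are assumptions made for illustration:

# train the tokenizer on a representative source file
# (the file name and vocab_size are illustrative assumptions)
tokenizer.train(files=["nx_icmp_checksum_compute.c"],
                vocab_size=500,
                min_frequency=2)

# apply the trained tokenizer to a "Hello World" program in C
hello_world = '#include <stdio.h>\nint main() { printf("Hello World"); return 0; }'
encoded = tokenizer.encode(hello_world)
print(encoded.tokens)   # subword pieces, e.g. ['#', 'in', '##clude', ...]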

BPE

A more advanced method for tokenizing text is the byte-pair encoding (BPE) algorithm. It is based on the same premises as the compression algorithm created in the 1990s by Gage. The original algorithm compresses data by repeatedly replacing the most frequent pair of bytes with a single byte that does not occur in the data. The BPE tokenizer does a similar thing, except that it merges the most frequent pairs of tokens into new tokens that do not yet occur in the text. In this way, the algorithm can create a much larger vocabulary than CountVectorizer and the WordPiece tokenizer. BPE is very popular both for its ability to handle large vocabularies and for its efficient implementation through the fastBPE library.

Let’s explore how to apply this tokenizer to the same data and check the difference between the previous two. The following code fragment shows how to instantiate this tokenizer from the Hugging Face library:

# in this example we use the tokenizers
# from the HuggingFace library
from tokenizers import Tokenizer
from tokenizers.models...
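
A minimal, self-contained sketch of how such a BPE tokenizer can be trained and applied with the Hugging Face tokenizers library could look like this – the file name and vocabulary size are assumptions made for illustration:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# create a BPE tokenizer with a placeholder for unknown tokens
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# train it on a representative source file (the file name is an assumption)
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train(files=["nx_icmp_checksum_compute.c"], trainer=trainer)

# tokenize a line of the "Hello World" program
encoded = tokenizer.encode('printf("Hello World");')
print(encoded.tokens)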

The SentencePiece tokenizer

SentencePiece is a more general option than BPE for one more reason: it allows us to treat whitespace as a regular token. This allows us to find more complex dependencies and therefore train models that understand more than just pieces of words – hence the name, SentencePiece. This tokenizer was originally introduced to enable the tokenization of languages such as Japanese, which do not use whitespace in the same way as, for example, English. The tokenizer can be installed by running the pip install -q sentencepiece command.

In the following code example, we’re instantiating and training the SentencePiece tokenizer:

import sentencepiece as spm
# this statement trains the tokenizer
spm.SentencePieceTrainer.train('--input="/content/drive/MyDrive/ds/cs_dos/nx_icmp_checksum_compute.c" --model_prefix=m --vocab_size=200')
# makes segmenter instance and
# loads the model file (m.model)
sp = spm.SentencePieceProcessor()
sp...
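
Once the model is trained, loading and using it could look like the following sketch – the sample line of code being tokenized is just an illustration:

# load the model file produced by the training step (m.model)
sp.load('m.model')

# segment a line of code into sentence pieces
pieces = sp.encode_as_pieces('unsigned long checksum = 0;')
print(pieces)   # e.g. ['▁unsigned', '▁long', '▁check', 'sum', ...]

# the same text encoded as vocabulary indices
ids = sp.encode_as_ids('unsigned long checksum = 0;')
print(ids)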

Word embeddings

Tokenizers are one way of extracting features from text. They are powerful and can be trained to create complex tokens and capture statistical dependencies of words. However, they are limited by the fact that they are completely unsupervised and do not capture any meaning or relationship between words. This means that tokenizers are great at providing input to neural network models, such as BERT, but sometimes we would like features that are more aligned with a certain task.

This is where word embeddings come to the rescue. The following code shows how to instantiate the word embedding model, which is imported from the gensim library. First, we need to prepare the dataset:

from gensim.models import word2vec
# now, we need to prepare a dataset
# in our case, let's just read a dataset that is a code of a program
# in this example, I use the file from an open source component - Azure NetX
# the actual part is not that important, as long as we have...

FastText

Luckily for us, there is an extension of the word2vec model that can approximate vectors for unknown (out-of-vocabulary) tokens – FastText. We can use it in a very similar way as we use word2vec:

from gensim.models import FastText
# create the instance of the model
model = FastText(vector_size=4,
                 window=3,
                 min_count=1)
# build a vocabulary
model.build_vocab(corpus_iterable=tokenized_sentences)
# and train the model
model.train(corpus_iterable=tokenized_sentences,
            total_examples=len(tokenized_sentences),
            epochs=10)

In the preceding code fragment, the model is trained on the same set of data as word2vec. model = FastText(vector_size=4, window=3, min_count...
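
What distinguishes FastText from word2vec is that it composes vectors from character n-grams, so it can build a vector for a token it has never seen. The following short sketch – continuing with the model trained above and using a token that is deliberately assumed to be unseen – illustrates this:

# FastText composes vectors from character n-grams, so it can
# approximate a vector even for a token it never saw during training
print(model.wv['checksummed'])

# word2vec, by contrast, would raise a KeyError for such a token;
# FastText can even look up its nearest neighbours
print(model.wv.most_similar('checksummed', topn=3))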

From feature extraction to models

The feature extraction methods presented in this chapter are not the only ones we can use. Quite a few more exist (to say the least). However, they all work similarly. Unfortunately, no silver bullet exists, and all models have advantages and disadvantages. For the same task, but a different dataset, simpler models may be better than complex ones.

Now that we have seen how to extract features from text, images, and numerical data, it’s time we start training the models. This is what we’ll do in the next chapter.

References

  • Al-Sabbagh, K.W., et al. Selective regression testing based on big data: comparing feature extraction techniques. In 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2020.
  • Staron, M., et al. Improving Quality of Code Review Datasets – Token-Based Feature Extraction Method. In Software Quality: Future Perspectives on Software Engineering Quality: 13th International Conference, SWQD 2021, Vienna, Austria, January 19–21, 2021, Proceedings. Springer, 2021.
  • Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Gage, P. A new algorithm for data compression. C Users Journal, 12(2): 23–38, 1994.
  • Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.