You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition
Author: Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.

Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders

Classical machine learning (ML) and neural networks (NNs) are very good at classical problems – prediction, classification, and recognition. As we learned in the previous chapter, training them requires a moderate amount of data, and we train them for specific tasks. However, the breakthroughs in ML and artificial intelligence (AI) in the late 2010s and early 2020s came from completely different types of models – deep learning (DL), Generative Pre-trained Transformers (GPTs), and generative AI (GenAI).

GenAI models provide two advantages – they can generate new data, and they can provide us with an internal representation that captures the context of the data and, to some extent, its semantics. In the previous chapters, we saw how we can use existing models for inference and for generating simple pieces of text.

In this chapter, we explore how GenAI models work...

From classical ML to GenAI

Classical AI, also known as symbolic AI or rule-based AI, emerged as one of the earliest schools of thought in the field. It is rooted in the concept of explicitly encoding knowledge and using logical rules to manipulate symbols and derive intelligent behavior. Classical AI systems are designed to follow predefined rules and algorithms, enabling them to solve well-defined problems with precision and determinism. We delve into the underlying principles of classical AI, exploring its reliance on rule-based systems, expert systems, and logical reasoning.

In contrast, GenAI represents a paradigm shift in AI development, capitalizing on the power of ML and NNs to create intelligent systems that can generate new content, recognize patterns, and make informed decisions. Rather than relying on explicit rules and handcrafted knowledge, GenAI leverages data-driven approaches to learn from vast amounts of information and infer patterns and relationships. We examine...

The theory behind advanced models – AEs and transformers

One of the major limitations of classical ML models is access to annotated data. Large NNs contain millions (if not billions) of parameters, which means that they require correspondingly large amounts of labeled data points to be trained correctly. Data labeling, also known as annotation, is the most expensive activity in ML, and it is therefore the labeling process that becomes the de facto limit of ML models. In the early 2010s, the solution to that problem was crowdsourcing.

Crowdsourcing, which is (among other things) a process of collective data collection, means that we use the users of our services to label the data. A CAPTCHA is one of the most prominent examples: a user must recognize images in order to log in to a service. When we introduce new images, each user who recognizes them labels data for us, so we can label a lot of data in a relatively short time.
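A key step in crowdsourced labeling is aggregating the answers of several users into one label per item. The following is a minimal sketch of that aggregation by majority vote; the function name and the CAPTCHA-style votes are made up for illustration:

```python
from collections import Counter

def aggregate_crowd_labels(votes_per_item):
    """Combine several crowd workers' labels per item by majority vote."""
    consensus = {}
    for item, votes in votes_per_item.items():
        # most_common(1) returns the label with the highest vote count
        label, _count = Counter(votes).most_common(1)[0]
        consensus[item] = label
    return consensus

# Hypothetical CAPTCHA-style responses: three users labeled each image
votes = {
    "img_001": ["traffic light", "traffic light", "crosswalk"],
    "img_002": ["bus", "bus", "bus"],
}
print(aggregate_crowd_labels(votes))
```

In practice, aggregation schemes also weight workers by their reliability, but majority voting captures the core idea: many cheap, noisy labels are combined into one trustworthy label.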

There is, nevertheless, an inherent problem...

Training and evaluation of a RoBERTa model

In general, the training process for large language models such as GPT-3 involves exposing the model to a massive amount of text data from diverse sources, such as books, articles, and websites. By analyzing the patterns, relationships, and language structures within this data, the model learns to predict the likelihood of a word or phrase appearing based on the surrounding context. GPT models learn this by predicting the next token in a sequence; BERT-family models such as RoBERTa instead achieve it through a process known as masked language modeling (MLM), where certain words in the input are randomly masked and the model is tasked with predicting them from the context.
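The masking step of MLM can be sketched as follows. The 15% masking rate follows the original BERT recipe; the token list and function name are toy examples, standing in for a real tokenizer's output:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return (inputs, labels).

    Labels keep the original token at masked positions and None elsewhere,
    so the training loss is computed only on the masked positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

tokens = "the model learns to predict the masked word".split()
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

The model sees `inputs` and is scored on how well it recovers the tokens stored in `labels` – no human annotation is needed, which is what makes this objective scale to huge corpora.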

In this chapter, we train the RoBERTa model, which is a variation of the now-classical BERT model. Instead of using generic sources such as books and Wikipedia articles, we use programs. To make our training task a bit more specific, let us train a model that is capable of “understanding” code from a networking domain – WolfSSL, which is...

Training and evaluation of an AE

We mentioned AEs in Chapter 7, when we discussed feature engineering for images. AEs, however, can do much more than image feature extraction. One of their major capabilities is recreating images: given a point in the latent space, the decoder can generate the corresponding image.
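Before training a full neural AE, the core idea – compress each input into a small latent vector, then reconstruct the input from it – can be illustrated with a tiny linear autoencoder in NumPy. This is only a sketch (the data, dimensions, and SVD-based solution are illustrative, not the PyTorch model used in this chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 8-dimensional vectors that actually lie on a 2-D subspace
basis = rng.normal(size=(2, 8))
data = rng.normal(size=(100, 2)) @ basis           # shape (100, 8)

# Linear encoder/decoder: project into a 2-D latent space and back.
# For the linear case, the SVD gives the optimal autoencoder in closed form.
_, _, vt = np.linalg.svd(data, full_matrices=False)
encoder = vt[:2].T                                 # (8, 2): input -> latent
decoder = vt[:2]                                   # (2, 8): latent -> input

latent = data @ encoder                            # compressed representation
reconstruction = latent @ decoder

error = np.mean((data - reconstruction) ** 2)
print(f"mean reconstruction error: {error:.2e}")
```

Because the toy data truly lives in two dimensions, the reconstruction error is essentially zero; a neural AE generalizes this by replacing the linear maps with nonlinear encoder and decoder networks.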

So, let us train the AE model on a dataset that is standard in ML – Fashion MNIST. We saw what this dataset looks like in the previous chapters. We start our training by preparing the data in the following code fragment:

# Transforms images to a PyTorch tensor
from torchvision import datasets, transforms

tensor_transform = transforms.ToTensor()

# Download the Fashion MNIST dataset (the download and transform
# arguments are assumed here; the original fragment is truncated)
dataset = datasets.FashionMNIST(root="./data",
                                train=True,
                                download=True,
                                transform=tensor_transform)

Developing safety cages to prevent models from breaking the entire system

As GenAI systems such as masked language models (MLMs) and AEs create new content, there is a risk that they generate content that either breaks the entire software system or is unethical.

Therefore, software engineers often use the concept of a safety cage to guard the model from inappropriate input and the surrounding system from inappropriate output. For an MLM such as RoBERTa, this can be a simple filter that checks whether the generated content is problematic. Conceptually, this is illustrated in Figure 11.8:

Figure 11.8 – Safety-cage concept for MLMs

In the example of the wolfBERTa model, this can mean that we check that the generated code does not contain cybersecurity vulnerabilities, which could potentially allow hackers to take over our system. This means that all programs generated by the wolfBERTa model should be scanned for such vulnerabilities using tools such as SonarQube or CodeSonar...
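A minimal version of such a safety cage can be sketched as a wrapper around the model. The pattern list below is a toy placeholder standing in for a real static-analysis scanner such as SonarQube, and `fake_model` is a hypothetical stand-in for a model such as wolfBERTa:

```python
import re

# Placeholder patterns standing in for a real vulnerability scanner
UNSAFE_PATTERNS = [
    r"\bgets\s*\(",        # gets() is inherently unsafe in C
    r"\bstrcpy\s*\(",      # unbounded copy, classic overflow risk
    r"\bsystem\s*\(",      # shell-injection risk
]

def safety_cage(generate, prompt):
    """Wrap a code-generating model: reject bad input, filter bad output."""
    if not prompt.strip():
        return None                      # guard the model from bad input
    code = generate(prompt)
    for pattern in UNSAFE_PATTERNS:
        if re.search(pattern, code):
            return None                  # guard the system from bad output
    return code

# Hypothetical stand-in for a code-generating model
def fake_model(prompt):
    return 'strcpy(buf, input);' if "copy" in prompt else 'puts("hello");'

print(safety_cage(fake_model, "copy a string"))   # blocked, prints None
print(safety_cage(fake_model, "greet"))           # passes the cage
```

The important design point is that the cage sits outside the model: the model itself is unchanged, and the checks can be swapped or tightened without retraining.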

Summary

In this chapter, we learned how to train advanced models and saw that their training is not much more difficult than training classical ML models, which were described in Chapter 10. Even though the models that we trained are much more complex than the models in Chapter 10, we can use the same principles and expand this kind of activity to train even more complex models.

We focused on GenAI in the form of BERT models (foundational transformer models) and AEs. Training these models is not very difficult, and we do not need huge computing power to train them. Our wolfBERTa model has ca. 80 million parameters, which seems like a lot, but the really large models have billions of parameters – GPT-3 has 175 billion parameters, NVIDIA's Megatron-Turing NLG has 530 billion, and GPT-4 is reported to be even larger. The training process is the same, but we need a supercomputing architecture in order to train these models.

We have also learned that these models...
