Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published: Jan 2024
Publisher: Packt
ISBN-13: 9781837634064
Pages: 346
Edition: 1st
Author: Miroslaw Staron

Table of Contents (24 chapters)

Preface
1. Part 1: Machine Learning Landscape in Software Engineering
2. Machine Learning Compared to Traditional Software
3. Elements of a Machine Learning System
4. Data in Software Systems – Text, Images, Code, and Their Annotations
5. Data Acquisition, Data Quality, and Noise
6. Quantifying and Improving Data Properties
7. Part 2: Data Acquisition and Management
8. Processing Data in Machine Learning Systems
9. Feature Engineering for Numerical and Image Data
10. Feature Engineering for Natural Language Data
11. Part 3: Design and Development of ML Systems
12. Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)
13. Training and Evaluating Classical Machine Learning Systems and Neural Networks
14. Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders
15. Designing Machine Learning Pipelines (MLOps) and Their Testing
16. Designing and Implementing Large-Scale, Robust ML Software
17. Part 4: Ethical Aspects of Data Management and ML System Development
18. Ethics in Data Acquisition and Management
19. Ethics in Machine Learning Systems
20. Integrating ML Systems in Ecosystems
21. Summary and Where to Go Next
22. Index
23. Other Books You May Enjoy

Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders

Classical machine learning (ML) and neural networks (NNs) are very good at classical problems – prediction, classification, and recognition. As we learned in the previous chapter, training them requires a moderate amount of data, and we train them for specific tasks. However, the breakthroughs in ML and artificial intelligence (AI) in the late 2010s and the beginning of the 2020s came from completely different types of models – deep learning (DL), Generative Pre-trained Transformers (GPTs), and generative AI (GenAI).

GenAI models provide two advantages – they can generate new data, and they can provide us with an internal representation of the data that captures its context and, to some extent, its semantics. In the previous chapters, we saw how we can use existing models for inference and for generating simple pieces of text.

In this chapter, we explore how GenAI models work...

From classical ML to GenAI

Classical AI, also known as symbolic AI or rule-based AI, emerged as one of the earliest schools of thought in the field. It is rooted in the concept of explicitly encoding knowledge and using logical rules to manipulate symbols and derive intelligent behavior. Classical AI systems are designed to follow predefined rules and algorithms, enabling them to solve well-defined problems with precision and determinism. We delve into the underlying principles of classical AI, exploring its reliance on rule-based systems, expert systems, and logical reasoning.

In contrast, GenAI represents a paradigm shift in AI development, capitalizing on the power of ML and NNs to create intelligent systems that can generate new content, recognize patterns, and make informed decisions. Rather than relying on explicit rules and handcrafted knowledge, GenAI leverages data-driven approaches to learn from vast amounts of information and infer patterns and relationships. We examine...

The theory behind advanced models – AEs and transformers

One of the major limitations of classical ML models is access to annotated data. Large NNs contain millions (if not billions) of parameters, which means that they require comparably large numbers of labeled data points to be trained correctly. Data labeling, also known as annotation, is the most expensive activity in ML, and therefore it is the labeling process that becomes the de facto limit of ML models. In the early 2010s, the solution to that problem was crowdsourcing.

Crowdsourcing, which is (among other things) a process of collective data collection, means that we use the users of our services to label the data. A CAPTCHA is one of the most prominent examples: to log in to a service, the user has to recognize images. When we introduce new images, every user who recognizes them effectively labels them, so we can label a lot of data in a relatively short time.

There is, nevertheless, an inherent problem...

Training and evaluation of a RoBERTa model

In general, the training process for large language models such as GPT-3 involves exposing the model to a massive amount of text data from diverse sources, such as books, articles, and websites. By analyzing the patterns, relationships, and language structures within this data, the model learns to predict the likelihood of a word or phrase appearing based on the surrounding context. For BERT-style models such as RoBERTa, this learning objective is achieved through a process known as masked language modeling (MLM), where certain words in the input are randomly masked and the model is tasked with predicting the correct words based on the context.
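As a minimal illustration of the MLM objective (this snippet is not from the book – the prompt text and the roberta-base checkpoint are just example choices), we can ask a pre-trained RoBERTa model to fill in a masked token using the Hugging Face transformers library:

from transformers import pipeline

# Load a pre-trained RoBERTa checkpoint for the fill-mask task
fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa marks the position to predict with the <mask> token;
# the pipeline returns the most likely fillers with their probabilities
for prediction in fill_mask("The TLS handshake negotiates the <mask> key."):
    print(prediction["token_str"], round(prediction["score"], 3))

The returned probabilities reflect what the model has learned about the surrounding context – exactly the behavior that the MLM training objective optimizes.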

In this chapter, we train the RoBERTa model, which is a variation of the now-classical BERT model. Instead of using generic sources such as books and Wikipedia articles, we use programs. To make our training task a bit more specific, let us train a model that is capable of “understanding” code from a networking domain – WolfSSL, which is...
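A minimal sketch of this kind of domain-specific MLM pre-training is shown below, using the Hugging Face Trainer API. The corpus file name, the reuse of the roberta-base tokenizer, and the hyperparameters are illustrative assumptions, not the exact setup used for wolfBERTa:

from datasets import load_dataset
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical corpus: one WolfSSL source snippet per line in a plain-text file
dataset = load_dataset("text", data_files={"train": "wolfssl_corpus.txt"})["train"]

# For simplicity, reuse the roberta-base tokenizer (a tokenizer trained on code would fit better)
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# A small RoBERTa architecture trained from scratch on the code corpus
config = RobertaConfig(vocab_size=tokenizer.vocab_size, num_hidden_layers=6,
                       num_attention_heads=8, hidden_size=512)
model = RobertaForMaskedLM(config)

# The collator implements the MLM objective: 15% of the tokens are masked at random
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./wolfberta", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

After pre-training, such a model can be probed with the same fill-mask pipeline shown earlier, this time on masked fragments of C code.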

Training and evaluation of an AE

We mentioned AEs in Chapter 7 when we discussed the process of feature engineering for images. AEs, however, are used for much more than just image feature extraction. One of their major capabilities is the ability to recreate images. This means that we can generate images based on where a point lies in the latent space.

So, let us train an AE model for a dataset that is pretty standard in ML – Fashion MNIST. We saw what the dataset looks like in the previous chapters. We start our training by preparing the data in the following code fragment:

from torchvision import datasets, transforms

# Transforms images to a PyTorch Tensor
tensor_transform = transforms.ToTensor()

# Download the Fashion MNIST Dataset
dataset = datasets.FashionMNIST(root="./data",
                                train=True,
                                download=True,
                                transform=tensor_transform)
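To make the example end-to-end, the following is a minimal sketch of the AE itself and its training loop; the architecture and hyperparameters are illustrative choices rather than the book's exact design, and it reuses the dataset object prepared above:

import torch
from torch import nn

# A simple fully connected autoencoder: the encoder compresses a 28x28 image
# into a small latent vector, and the decoder reconstructs the image from it
class AutoEncoder(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z).view(-1, 1, 28, 28)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# The AE is trained to reconstruct its own input; the class labels are ignored
for epoch in range(5):
    for images, _ in loader:
        reconstruction = model(images)
        loss = loss_fn(reconstruction, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# After training, decoding an arbitrary point in the latent space produces a new image
with torch.no_grad():
    generated = model.decoder(torch.randn(1, 16)).view(28, 28)

The last two lines show the property discussed above: once the AE is trained, any point in the latent space can be decoded into an image, which is what makes AEs generative.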

Developing safety cages to prevent models from breaking the entire system

As GenAI systems such as MLMs and AEs create new content, there is a risk that they generate content that can either break the entire software system or be unethical.

Therefore, software engineers often use the concept of a safety cage to guard the model itself from inappropriate input and output. For an MLM such as RoBERTa, this can be a simple pre- and post-processor that checks whether the input and the generated content are problematic. Conceptually, this is illustrated in Figure 11.8:

Figure 11.8 – Safety-cage concept for MLMs

In the example of the wolfBERTa model, this can mean checking that the generated code does not contain cybersecurity vulnerabilities, which could potentially allow hackers to take over our system. In practice, all programs generated by the wolfBERTa model should be scanned with tools such as SonarQube or CodeSonar for cybersecurity vulnerabilities...
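Conceptually, a safety cage can be as simple as a wrapper around the generation call that rejects problematic output before it reaches the rest of the system. The sketch below is purely illustrative – the deny-list of patterns is a stand-in for a real static analyzer such as SonarQube or CodeSonar, and generate_fn is a hypothetical interface to the model:

import re

# Hypothetical deny-list of patterns that indicate risky generated C code;
# a production safety cage would delegate this check to a real code analyzer
UNSAFE_PATTERNS = [r"\bstrcpy\s*\(", r"\bgets\s*\(", r"\bsystem\s*\("]

def safety_cage(generate_fn, prompt):
    """Run the model and block any output that matches an unsafe pattern."""
    candidate = generate_fn(prompt)
    for pattern in UNSAFE_PATTERNS:
        if re.search(pattern, candidate):
            raise ValueError("Generated code rejected by the safety cage")
    return candidate

The same wrapper can also validate the prompt before it reaches the model, so that both the input side and the output side of the model are guarded.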

Summary

In this chapter, we learned how to train advanced models and saw that their training is not much more difficult than training classical ML models, which were described in Chapter 10. Even though the models that we trained are much more complex than the models in Chapter 10, we can use the same principles and expand this kind of activity to train even more complex models.

We focused on GenAI in the form of BERT models (foundational transformer models) and AEs. Training these models is not very difficult, and we do not need huge computing power to train them. Our wolfBERTa model has ca. 80 million parameters, which seems like a lot, but the really large models have billions of parameters – GPT-3 has 175 billion parameters, the Megatron-Turing NLG model has 530 billion parameters, and GPT-4 is reported to be larger still, although its exact size has not been disclosed. The training process is the same, but we need a supercomputing architecture in order to train models at that scale.

We have also learned that these models...
