You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition

Author: Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.
Training and Evaluating Classical Machine Learning Systems and Neural Networks

Modern machine learning frameworks are designed to be user-friendly for programmers. The popularity of the Python programming environment (and R) has shown that designing, developing, and testing machine learning models can focus on the machine learning task rather than on programming tasks. The developers of machine learning models can concentrate on building the entire system rather than on programming the internals of the algorithms. However, this convenience has a darker side – a lack of understanding of the internals of the models and of how they are trained, evaluated, and validated.

In this chapter, I’ll dive a bit deeper into the process of training and evaluation. We’ll start with the basic theory behind different algorithms before learning how they are trained. We’ll start with the classical machine learning models, exemplified by decision trees. Then, we’ll gradually...

Training and testing processes

Machine learning has revolutionized the way we solve complex problems by enabling computers to learn from data and make predictions or decisions without being explicitly programmed. One crucial aspect of machine learning is training models, which involves teaching algorithms to recognize patterns and relationships in data. Two fundamental methods in this workflow are model.fit(), which trains the model, and model.predict(), which uses the trained model to make predictions.

The model.fit() function lies at the heart of training a machine learning model. It is the process by which a model learns from a labeled dataset to make accurate predictions. During training, the model adjusts its internal parameters to minimize the discrepancy between its predictions and the true labels in the training data. This iterative optimization process, often referred to as “learning,” allows the model to generalize its knowledge and perform well on unseen data.
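To make the two calls concrete, here is a minimal, self-contained sketch; the synthetic dataset and the LogisticRegression model are illustrative assumptions, not the book's defect data:

# a minimal sketch of the fit/predict workflow on a synthetic
# dataset (an assumption for illustration, not the book's data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          random_state=42)
model = LogisticRegression()
model.fit(X_tr, y_tr)         # learn parameters from labeled data
y_pred = model.predict(X_te)  # predict labels for unseen data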

In addition to the training data and labels,...

Training classical machine learning models

We’ll start by training a model that lets us look inside it. We’ll use the CART decision tree classifier, where we can visualize the actual decision tree that is trained. We’ll use the same numerical data we used in the previous chapter. First, let’s read the data and create the train/test split:

# read the file with data using openpyxl
import pandas as pd
# we read the data from the excel file,
# which is the defect data from the ant 1.3 system
dfDataAnt13 = pd.read_excel('./chapter_6_dataset_numerical.xlsx',
                            sheet_name='ant_1_3',
                            index_col=0)
# prepare the dataset...
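The preparation step is elided above; a plausible continuation is sketched below. The label column name ('Defect') and the 80/20 split ratio are assumptions about the dataset, not the book's exact code:

# a hedged sketch of the elided preparation step; the label column
# name 'Defect' is an assumption about the sheet's schema
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = dfDataAnt13.drop(columns=['Defect'])  # features: the code metrics
y = dfDataAnt13['Defect']                 # label: defective or not
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

# train the CART decision tree and predict on the held-out test set
decisionTreeModel = DecisionTreeClassifier(random_state=42)
decisionTreeModel.fit(X_train, y_train)
y_pred = decisionTreeModel.predict(X_test)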

Understanding the training process

From the software engineer’s perspective, the training process is rather simple – we fit the model, validate it, and use it. We check how good the model is in terms of the performance metrics. If the model is good enough, and we can explain it, then we develop the entire product around it, or we use it in a larger software product.

When the model does not learn anything useful, we need to understand why this is the case and whether another model could do better. We can use the visualization techniques we learned about in Chapter 6 to explore the data, and the techniques from Chapter 4 to clean it of noise.

Now, let’s explore the process of how the decision tree model learns from the data. The DecisionTree classifier learns from the provided data by recursively partitioning the feature space based on the values of the features in the training dataset. It constructs a binary tree where each internal node represents...
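One way to look inside the trained tree is scikit-learn's plot_tree; a minimal sketch, assuming the decisionTreeModel from the hedged sketch in the previous section:

# visualize the trained tree; decisionTreeModel and X come from
# the hedged sketch in the previous section
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(16, 8))
tree.plot_tree(decisionTreeModel,
               feature_names=list(X.columns),
               class_names=['no defect', 'defect'],
               filled=True)
plt.show()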

Random forest and opaque models

Let’s train the random forest classifier on the same data as in the counter-example and check whether it performs better and whether it uses features similar to those of the DecisionTree classifier in the original counter-example.

Let’s instantiate, train, and validate the model on the same data using the following fragment of code:

from sklearn.ensemble import RandomForestClassifier

# instantiate and train the random forest on the same train/test split
randomForestModel = RandomForestClassifier()
randomForestModel.fit(X_train, y_train)

# predict the labels of the held-out test set
y_pred_rf = randomForestModel.predict(X_test)
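The metric calculation itself is not shown in this fragment; a minimal sketch using scikit-learn's metric helpers (the weighted averaging is an assumption, since this step is not shown):

# compute the performance metrics; weighted averaging for
# precision/recall is an assumption
from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy = accuracy_score(y_test, y_pred_rf)
precision = precision_score(y_test, y_pred_rf, average='weighted')
recall = recall_score(y_test, y_pred_rf, average='weighted')
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')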

After evaluating the model, we obtain the following performance metrics:

Accuracy: 0.62
Precision: 0.63, Recall: 0.62

Admittedly, these metrics are different from the metrics of the decision tree, but the overall performance is not that much different. The difference in accuracy of 0.03 is negligible. First, we can extract the important features by reusing the same techniques that were presented in Chapter...
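For a random forest, the importances are available directly on the fitted model; a minimal sketch:

# rank features by the forest's impurity-based importances;
# X_train is the feature DataFrame from the earlier split
import pandas as pd

importances = pd.Series(randomForestModel.feature_importances_,
                        index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))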

Training deep learning models

Training a dense neural network involves various steps. First, we prepare the data. This typically involves tasks such as feature scaling, handling missing values, encoding categorical variables, and splitting the data into training and validation sets.

Then, we define the architecture of the dense neural network. This includes specifying the number of layers, the number of neurons in each layer, the activation functions to be used, and any regularization techniques, such as dropout or batch normalization.

Once the model has been defined, we need to initialize it. We create an instance of the neural network model based on the defined architecture. This involves creating an instance of the neural network class or using a predefined model architecture available in a deep learning library. We also need to define a loss function that quantifies the error between the predicted output of the model and the actual target values. The choice of loss function...
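A hedged sketch of these steps in PyTorch follows; the layer sizes, dropout rate, optimizer, and the BCEWithLogitsLoss loss function are illustrative assumptions, not the book's exact architecture:

# a minimal dense network in PyTorch; the layer sizes, dropout rate,
# optimizer, and loss function are assumptions for illustration
import torch
import torch.nn as nn

class DenseNet(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))  # one logit for the binary defect label

    def forward(self, x):
        return self.layers(x)

# initialize the model, the loss function, and the optimizer
model = DenseNet(n_features=X_train.shape[1])
loss_fn = nn.BCEWithLogitsLoss()  # error between predictions and targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# a basic training loop over the whole training set
X_t = torch.tensor(X_train.values, dtype=torch.float32)
y_t = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), y_t)
    loss.backward()
    optimizer.step()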

Misleading results – data leaking

In the training process, we use one set of data points and, in testing, we use another. The training process works best when these two datasets are completely separate. If they are not, we run into what is called the data leakage problem – the same data points appear in both the training and test sets. Let’s illustrate this with an example.

First, we need to create a new split where some data points appear in both sets. We can do that by using the split function and assigning 20% of the data points to the test set. This means that at least 10% of the data points end up in both sets:

import sklearn.model_selection  # needed for train_test_split below
X_trainL, X_testL, y_trainL, y_testL = \
        sklearn.model_selection.train_test_split(X, y, random_state=42,
                                                 train_size=0.8)

Now, we can use the same code to make predictions on this data and then calculate the performance metrics:

# now, let's evaluate the model on this new data
with torch...
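The evaluation code is cut off above; a hedged sketch of how it could continue, reusing the model from the PyTorch sketch earlier (the 0.5 decision threshold is an assumption):

# a hedged continuation of the truncated evaluation; model comes from
# the earlier PyTorch sketch, and 0.5 is an assumed decision threshold
with torch.no_grad():
    X_eval = torch.tensor(X_testL.values, dtype=torch.float32)
    probs = torch.sigmoid(model(X_eval))
    y_pred_leak = (probs > 0.5).int().squeeze(1).numpy()

# metrics computed on a leaked split tend to look overly optimistic
print(f'Accuracy: {(y_pred_leak == y_testL.values).mean():.2f}')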

Summary

In this chapter, we discussed various topics related to machine learning and neural networks. We explained how to read data from an Excel file using the pandas library and prepare the dataset for training a machine learning model. We explored the use of decision tree classifiers and demonstrated how to train a decision tree model using scikit-learn. We also showed how to make predictions using the trained model.

Then, we discussed how to switch from a decision tree classifier to a random forest classifier, which is an ensemble of decision trees. We explained the necessary code modifications and provided an example. Next, we shifted our focus to using a dense neural network in PyTorch. We described the process of creating the neural network architecture, training the model, and making predictions using the trained model.

Lastly, we explained the steps involved in training a dense neural network, including data preparation, model architecture, initializing the model, defining...
