Feature Engineering for Numerical and Image Data

In most cases, when we design large-scale machine learning systems, the data we receive requires more processing than visualization alone. Visualization supports the design and development of machine learning systems; during deployment, we can monitor the data, as we discussed in the previous chapters, but we also need to make sure that the data we use for inference is prepared in an optimized form.

Therefore, in this chapter, we’ll focus on feature engineering – finding features that describe our data closer to the problem domain rather than closer to the data itself. Feature engineering is the process of extracting and transforming variables from raw data so that we can use them for machine learning tasks such as prediction and classification.

In this chapter, we’ll...

Feature engineering

Feature engineering is the process of transforming raw data into numerical values that can be used by machine learning algorithms. For example, we can transform raw data about software defects (such as their descriptions and the characteristics of the modules they come from) into a table of numerical values that we can use for machine learning. The raw numerical values, as we saw in the previous chapter, are the result of quantifying the entities that we use as sources of data – the result of applying measurement instruments to those entities. Therefore, by definition, they are closer to the problem domain than to the solution domain.

The features, on the other hand, summarize the raw data and retain only the information that is important for the machine learning task at hand. We use these features to ensure that the patterns we find in the data during training can also be used during deployment. If we look at this process from the perspective...
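To make the distinction between raw data and features concrete, here is a toy sketch (not the book's example) that turns raw defect records into a small feature table; the field names and the encoding choices are hypothetical:

# a toy illustration: raw defect records, as they might come from an
# issue tracker (all field names here are hypothetical)
import pandas as pd

defects = pd.DataFrame({
    "description": ["null pointer in parser", "UI freeze on login", "memory leak"],
    "module_loc": [1200, 450, 3100],          # size of the affected module
    "severity": ["high", "medium", "high"],   # categorical attribute
})

# feature engineering: keep only the information useful for the ML task
features = pd.DataFrame({
    "desc_length": defects["description"].str.len(),  # length of the description
    "module_loc": defects["module_loc"],              # numeric value kept as-is
}).join(pd.get_dummies(defects["severity"], prefix="severity"))

print(features)

Each row of the resulting table is a purely numerical vector that a machine learning algorithm can consume directly.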

Feature engineering for numerical data

We’ll introduce feature engineering for numerical data by using a technique that we previously used for visualizing data – principal component analysis (PCA).

PCA

PCA transforms a set of variables into components that are meant to be independent of one another. The first component should explain most of the variability in the data, that is, it should be correlated with most of the variables. Figure 7.3 illustrates such a transformation:

Figure 7.3 – Graphical illustration of the PCA transformation from two dimensions to two dimensions

This figure contains two pairs of axes – the blue ones, which are the original coordinates, and the orange ones, which are the imaginary axes that provide the coordinates for the principal components. The transformation does not change the data points themselves; instead, it finds a rotation of the axes such that they align with the spread of the data points. Here, we can see that the transformed Y axis...
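The same transformation can be sketched in a few lines with scikit-learn's PCA; the two correlated toy variables below are made up for illustration:

# a minimal sketch of PCA as a feature engineering step; the
# two correlated toy variables are made up for illustration
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# the second variable is strongly correlated with the first
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

# standardize first, so that no variable dominates purely by its scale
X_scaled = StandardScaler().fit_transform(X)

# rotate the data onto the principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# the first component should explain most of the variance
print(pca.explained_variance_ratio_)

The printed ratios show how much of the total variance each component captures; keeping only the first component(s) is what turns PCA into a feature reduction technique rather than just a visualization tool.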

Feature engineering for image data

One of the most prominent feature extraction methods for image data is to use convolutional neural networks (CNNs) and extract embeddings from them. A related type of neural network is the autoencoder. Although we can use autoencoders for all kinds of data, they are particularly well suited to images. So, let’s construct an autoencoder for the MNIST dataset and extract the bottleneck values from it.

First, we need to download the MNIST dataset using the following code fragment:

# first, let's read the image data from the Keras library
from tensorflow.keras.datasets import mnist

# and load it with the pre-defined train/test splits
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# scale the pixel values from [0, 255] to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0

Now, we can construct the encoder part by using the following code. Please note that there is one extra layer in the encoder part. The goal of...
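As an illustrative sketch of such an autoencoder (the layer sizes, including the 32-unit bottleneck, are assumptions for this sketch, not the book's exact values), we can build the model with the Keras functional API:

# an illustrative dense autoencoder for MNIST; the layer sizes,
# including the 32-unit bottleneck, are assumed for this sketch
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Flatten, Reshape

inputs = Input(shape=(28, 28))
x = Flatten()(inputs)                            # 784 pixel values per image
x = Dense(128, activation="relu")(x)             # encoder layer
bottleneck = Dense(32, activation="relu", name="bottleneck")(x)
x = Dense(128, activation="relu")(bottleneck)    # decoder layer
x = Dense(784, activation="sigmoid")(x)          # reconstruct pixels in [0, 1]
outputs = Reshape((28, 28))(x)

# the autoencoder learns to reproduce its own input
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X_train, X_train, epochs=5, batch_size=256,
                validation_data=(X_test, X_test))

# the encoder alone maps each image to its 32 bottleneck features
encoder = Model(inputs, bottleneck)
features = encoder.predict(X_test)
print(features.shape)   # (10000, 32)

After training, the bottleneck activations serve as compact features for each image – exactly the kind of embedding we want to extract.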

Summary

In this chapter, our focus was on feature extraction techniques. We explored how we can use dimensionality reduction techniques and autoencoders to reduce the number of features in order to make machine learning models more effective.

However, numerical and image data are only two examples of data. In the next chapter, we’ll continue with feature engineering methods, but for textual data, which is more common in contemporary software engineering.
