Feature Engineering for Numerical and Image Data

In most cases, when we design large-scale machine learning systems, the data we receive requires more processing than visualization alone. Visualization supports the design and development of machine learning systems; during deployment, we can monitor the data, as we discussed in the previous chapters, but we also need to make sure that the data we use for inference is prepared in an optimized form.

Therefore, in this chapter, we’ll focus on feature engineering – finding features that describe our data closer to the problem domain rather than closer to the data itself. Feature engineering is the process of extracting and transforming variables from raw data so that we can use them for machine learning tasks such as prediction and classification.

In this chapter, we’ll...

Feature engineering

Feature engineering is the process of transforming raw data into numerical values that can be used by machine learning algorithms. For example, we can transform raw data about software defects (such as their descriptions and the characteristics of the modules they come from) into a table of numerical values that we can use for machine learning. The raw numerical values, as we saw in the previous chapter, are the result of quantifying the entities that we use as sources of data – the result of applying measurement instruments to those entities. Therefore, by definition, they are closer to the problem domain than to the solution domain.

The features, on the other hand, summarize the raw data and retain only the information that is important for the machine learning task at hand. We use these features to ensure that the patterns we find in the data during training can also be used during deployment. If we look at this process from the perspective...
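To make the distinction between raw data and features concrete, here is a toy sketch (not the book's example) that turns raw defect records into a small feature table; the field names and the encoding choices are hypothetical:

# a toy illustration: raw defect records, as they might come from an
# issue tracker (all field names here are hypothetical)
import pandas as pd

defects = pd.DataFrame({
    "description": ["null pointer in parser", "UI freeze on login", "memory leak"],
    "module_loc": [1200, 450, 3100],          # size of the affected module
    "severity": ["high", "medium", "high"],   # categorical attribute
})

# feature engineering: keep only the information useful for the ML task
features = pd.DataFrame({
    "desc_length": defects["description"].str.len(),  # length of the description
    "module_loc": defects["module_loc"],              # numeric value kept as-is
}).join(pd.get_dummies(defects["severity"], prefix="severity"))

print(features)

Each row of the resulting table is a purely numerical vector that a machine learning algorithm can consume directly.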

Feature engineering for numerical data

We’ll introduce feature engineering for numerical data by using a technique that we previously used for visualizing data – principal component analysis (PCA).

PCA

PCA transforms a set of variables into components that are meant to be independent of one another. The first component should explain most of the variability in the data, that is, it should be correlated with most of the variables. Figure 7.3 illustrates such a transformation:

Figure 7.3 – Graphical illustration of the PCA transformation from two dimensions to two dimensions

This figure contains two pairs of axes – the blue ones, which are the original coordinates, and the orange ones, which are the imaginary axes that provide the coordinates for the principal components. The transformation does not change the data points themselves; instead, it finds a rotation of the axes such that they align with the spread of the data points. Here, we can see that the transformed Y axis...
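The same transformation can be sketched in a few lines with scikit-learn's PCA; the two correlated toy variables below are made up for illustration:

# a minimal sketch of PCA as a feature engineering step; the
# two correlated toy variables are made up for illustration
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# the second variable is strongly correlated with the first
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

# standardize first, so that no variable dominates purely by its scale
X_scaled = StandardScaler().fit_transform(X)

# rotate the data onto the principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# the first component should explain most of the variance
print(pca.explained_variance_ratio_)

The printed ratios show how much of the total variance each component captures; keeping only the first component(s) is what turns PCA into a feature reduction technique rather than just a visualization tool.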

Feature engineering for image data

One of the most prominent feature extraction methods for image data is to use convolutional neural networks (CNNs) and extract embeddings from them. A related type of neural network is the autoencoder. Although we can use autoencoders for all kinds of data, they are particularly well suited to images. So, let’s construct an autoencoder for the MNIST dataset and extract the bottleneck values from it.

First, we need to download the MNIST dataset using the following code fragment:

# first, let's read the image data from the Keras library
from tensorflow.keras.datasets import mnist

# and load it with the pre-defined train/test splits
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# scale the pixel values from [0, 255] to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0

Now, we can construct the encoder part by using the following code. Please note that there is one extra layer in the encoder part. The goal of...
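As an illustrative sketch of such an autoencoder (the layer sizes, including the 32-unit bottleneck, are assumptions for this sketch, not the book's exact values), we can build the model with the Keras functional API:

# an illustrative dense autoencoder for MNIST; the layer sizes,
# including the 32-unit bottleneck, are assumed for this sketch
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Dense, Flatten, Reshape

inputs = Input(shape=(28, 28))
x = Flatten()(inputs)                            # 784 pixel values per image
x = Dense(128, activation="relu")(x)             # encoder layer
bottleneck = Dense(32, activation="relu", name="bottleneck")(x)
x = Dense(128, activation="relu")(bottleneck)    # decoder layer
x = Dense(784, activation="sigmoid")(x)          # reconstruct pixels in [0, 1]
outputs = Reshape((28, 28))(x)

# the autoencoder learns to reproduce its own input
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X_train, X_train, epochs=5, batch_size=256,
                validation_data=(X_test, X_test))

# the encoder alone maps each image to its 32 bottleneck features
encoder = Model(inputs, bottleneck)
features = encoder.predict(X_test)
print(features.shape)   # (10000, 32)

After training, the bottleneck activations serve as compact features for each image – exactly the kind of embedding we want to extract.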

Summary

In this chapter, our focus was on feature extraction techniques. We explored how we can use dimensionality reduction techniques and autoencoders to reduce the number of features in order to make machine learning models more effective.

However, numerical and image data are only two examples of data. In the next chapter, we’ll continue with feature engineering methods, but for textual data, which is more common in contemporary software engineering.
