
You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition
Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.

Processing Data in Machine Learning Systems

In Chapter 3, we introduced the types of data that are used in machine learning systems. In this chapter, we’ll dive deeper into the ways in which data and algorithms are entangled. We’ll still talk about data in generic terms, but we’ll explain what kind of data is needed in machine learning systems. I’ll explain that all kinds of data are used in numerical form – either as a feature vector or as more complex feature matrices. Then, I’ll explain the need to transform unstructured data (for example, text) into structured data. This chapter lays the foundations for diving deeper into each type of data, which is the content of the next few chapters.

In this chapter, we will do the following:

  • Discuss the process of measurement (obtaining numerical data) and the measurement instruments that are used in that process
  • Visualize numerical data...

Numerical data

Numerical data usually comes in the form of tables of numbers, much like database tables. One of the most common kinds of data in this form is metrics data – for example, the standard object-oriented metrics that have been used since the 1980s.
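To make the idea of a table of numbers concrete, here is a minimal sketch that turns a small metrics table into feature vectors. The class names, metric names (loc, methods, coupling), and values are invented for illustration and are not taken from the book:

```python
# Hypothetical metrics table: one row per class, one column per metric.
# The class names, metric names, and values are invented for illustration.
metrics_table = [
    {"class": "Parser", "loc": 420, "methods": 18, "coupling": 7},
    {"class": "Lexer", "loc": 150, "methods": 9, "coupling": 3},
    {"class": "Report", "loc": 80, "methods": 4, "coupling": 2},
]

# Fixed column order, so that position i means the same metric in every vector.
METRIC_COLUMNS = ["loc", "methods", "coupling"]


def to_feature_vectors(rows, columns):
    """Turn each table row into a list of numbers, in a fixed column order."""
    return [[row[column] for column in columns] for row in rows]


feature_vectors = to_feature_vectors(metrics_table, METRIC_COLUMNS)

for row, vector in zip(metrics_table, feature_vectors):
    print(row["class"], vector)
```

Each row of the table becomes one feature vector, and the fixed column order is what lets an algorithm compare the same metric across rows.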

Numerical data is often the result of a measurement process. In this process, we quantify the empirical properties of an entity by mapping them to numbers using measurement instruments. The process must guarantee that important empirical properties are preserved in the mathematical domain – that is, in the numbers. Figure 6.1 shows an example of this process:

Figure 6.1 – The measurement process with an example of quality measurement using defects


The important part of this process consists of three elements. First is the measurement instrument, which needs to map the empirical properties to numbers faithfully. Then, there are the measurement standards, such as...
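The idea of a measurement instrument that maps an empirical property to a number can be sketched in a few lines. This is a toy illustration, assuming that quality is measured by counting defect reports. The module names and reports are hypothetical:

```python
def count_defects(defect_reports):
    """A toy measurement instrument: map a module's defect reports to a number."""
    return len(defect_reports)


# Hypothetical defect reports per module (invented for illustration).
reports = {
    "login": ["crash on empty password", "timeout not handled"],
    "search": ["wrong ordering of results"],
    "export": [],
}

# Apply the instrument to every entity to obtain the numbers.
measurements = {module: count_defects(items) for module, items in reports.items()}

# Ordering modules by the measured number should match the empirical ordering
# "has more reported defects than".
ranking = sorted(measurements, key=measurements.get, reverse=True)

print(measurements)
print(ranking)
```

The point of the sketch is the last step: if one module is observed to have more defects than another, the numbers must show the same ordering. Otherwise, the instrument does not preserve the empirical property.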

Other types of data – images

In Chapter 3, we looked at image data, mostly from the perspective of what kind of image data exists. Now, we will take a more pragmatic approach and introduce a better way of working with images than just using files.

Let’s look at how image data is stored in a popular repository – Hugging Face. Hugging Face provides a dedicated library for working with datasets, conveniently called datasets. It can be installed using the pip install -q datasets command. So, let’s load a dataset and visualize one of its images using the following code fragment:

# importing the datasets library
from datasets import load_dataset, Image
# loading the "food101" dataset or, more concretely, its split for training
dataset = load_dataset("food101", split="train")

Now, the variable dataset contains all the images. Well, not all of them – just the part that the designer of the dataset specified...
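Regardless of how the images are stored, an algorithm eventually receives them as numbers. The following sketch does not use the food101 data. It assumes a hypothetical 3x3 grayscale image stored as plain Python lists, and shows how such an image can be treated as a feature matrix or flattened into a feature vector:

```python
# Hypothetical 3x3 grayscale image; each value is a pixel intensity in 0..255.
image = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]


def normalize(pixels, max_value=255):
    """Scale every pixel to the range 0.0..1.0, keeping the matrix shape."""
    return [[value / max_value for value in row] for row in pixels]


def flatten(pixels):
    """Concatenate the rows of the matrix into a single feature vector."""
    return [value for row in pixels for value in row]


feature_matrix = normalize(image)
feature_vector = flatten(feature_matrix)

print(len(feature_matrix), "x", len(feature_matrix[0]))
print(len(feature_vector))
```

Real images usually have three color channels and far more pixels, but the principle is the same: a matrix of numbers that can be reshaped to fit the algorithm.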

Text data

For the text data, we’ll use the same Hugging Face hub to obtain two kinds of data – unstructured text, as we did in Chapter 3, and structured data – programming language code:

# import Hugging Face Dataset
from datasets import load_dataset
# load the dataset with text classification labels
dataset = load_dataset('imdb')

The preceding code fragment loads the dataset of movie reviews from the Internet Movie Database (IMDb). We can get an example of the data by using an interface that’s similar to what we used for images:

# show the first example
dataset['train'][0]

We can also visualize the dataset in a way that’s similar to what we did for images:

# import the seaborn plotting library
import seaborn as sns
# plot the distribution of the labels
sns.histplot(dataset['train']['label'], bins=2)

The preceding code fragment creates the following diagram, showing that the positive and negative reviews are perfectly balanced:

Figure 6.13 – Balanced classes in the IMDb movie database reviews
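The balance shown in the diagram can also be checked numerically. The following is a minimal sketch that uses a short, invented list of labels as a stand-in for dataset['train']['label'], assuming that 0 denotes a negative review and 1 a positive one:

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of each label as a fraction of all examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical stand-in for dataset['train']['label']:
# 0 = negative review, 1 = positive review.
labels = [0, 1, 1, 0, 0, 1, 1, 0]

distribution = class_distribution(labels)
print(distribution)
```

With the real dataset, the same function could be called on the label column. Equal shares for both labels would confirm what the histogram shows.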


Toward feature engineering

In this chapter, we explored methods for visualizing data. We learned how to create diagrams and identify dependencies in the data. We also learned how to use dimensionality reduction techniques to plot multidimensional data on a two-dimensional diagram.

In the next few chapters, we’ll dive into feature engineering for different types of data. It is easy to confuse feature engineering with data extraction. In practice, however, it is not that difficult to tell one from the other.

Extracted data is data that has been collected by applying some sort of measurement instrument. Raw text or images are good examples of this kind of data. Extracted data is close to the domain where the data comes from – or how it is measured.

Features describe the data based on the analysis that we want to perform – they are closer to what we want to do with the data. They are closer to what we want to achieve and which form of machine learning analysis...
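The difference between extracted data and features can be illustrated with a short sketch. Here, the raw review text plays the role of extracted data, and the features are numbers chosen with a sentiment analysis task in mind. The review, the vocabulary, and the choice of features are assumptions made for illustration only:

```python
from collections import Counter
import re


def extract_features(text, vocabulary):
    """Describe a raw text with numbers: token count, then counts of vocabulary words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [len(tokens)] + [counts[word] for word in vocabulary]


# Extracted data: a raw review text (invented for illustration).
review = "A great movie with a great cast, but a boring ending."

# Hypothetical vocabulary chosen for a sentiment analysis task.
vocabulary = ["great", "boring", "awful"]

features = extract_features(review, vocabulary)
print(features)
```

The same review would yield different features for a different analysis, such as topic detection, while the extracted data would stay the same.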
