You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book

Published in: Jan 2024

Reading Level: Intermediate

Publisher: Packt

ISBN-13: 9781837634064

Edition: 1st Edition

Languages

Python

Concepts

Machine Learning

Author (1)

Miroslaw Staron

Processing Data in Machine Learning Systems

We talked about data in Chapter 3, where we introduced the types of data that are used in machine learning systems. In this chapter, we’ll dive deeper into the ways in which data and algorithms are entangled. We’ll still talk about data in generic terms, but we’ll also explain what kind of data is needed in machine learning systems. I’ll explain that all kinds of data are used in numerical form, either as a feature vector or as more complex feature matrices. Then, I’ll explain the need to transform unstructured data (for example, text) into structured data. This chapter lays the foundations for diving deeper into each type of data, which is the content of the next few chapters.

In this chapter, we will do the following:

Discuss the process of measurement (obtaining numerical data) and the measurement instruments that are used in that process
Visualize numerical data...

Numerical data

Numerical data usually comes in the form of tables of numbers, much like database tables. One of the most common kinds of data in this form is metrics data, for example, the standard object-oriented metrics that have been used since the 1980s.
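As a minimal sketch of what such a table looks like in code, the following fragment stores a few rows of metrics data and turns them into a feature matrix. The class names, the metric names (wmc, dit, and cbo), and all the values are hypothetical and serve only as an illustration:

```python
# hypothetical metrics table: one row per class, one column per metric
metrics_table = [
    {"class": "Parser", "wmc": 12, "dit": 2, "cbo": 5},
    {"class": "Lexer", "wmc": 7, "dit": 1, "cbo": 3},
    {"class": "Emitter", "wmc": 20, "dit": 3, "cbo": 9},
]
# the numerical columns that make up the feature vector of each row
metric_names = ["wmc", "dit", "cbo"]
# feature matrix: a list of feature vectors, one per class
feature_matrix = [[row[name] for name in metric_names] for row in metrics_table]
print(feature_matrix)
```

Each inner list is the feature vector of one class, and the column order is fixed by metric_names so that every row is comparable.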

Numerical data is often the result of a measurement process. Measurement is the process in which we use measurement instruments to map the empirical properties of an entity to numbers. The process must guarantee that the important empirical properties are preserved in the mathematical domain, that is, in the numbers. Figure 6.1 shows an example of this process:

Figure 6.1 – The measurement process with an example of quality measurement using defects

This process relies on three important elements. The first is the measurement instrument, which needs to map the empirical properties to numbers faithfully. Then, there are the measurement standards, such as...
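To make the idea of a measurement instrument more concrete, here is a small sketch under stated assumptions. The Component class, the number_of_defects function, and the defect reports are all hypothetical. The point is only that the instrument maps an empirical property (how many defects were reported) to a number and that the empirical relation between two entities is preserved in the numbers:

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    # hypothetical entity whose quality we want to measure
    name: str
    defect_reports: list = field(default_factory=list)


def number_of_defects(component: Component) -> int:
    # the measurement instrument: maps the empirical property
    # "reported defects of the component" to a number
    return len(component.defect_reports)


engine = Component("engine", ["crash on start", "memory leak", "wrong rpm"])
brakes = Component("brakes", ["late warning"])

# empirically, the engine has more reported defects than the brakes;
# the numbers must preserve that relation
print(number_of_defects(engine), number_of_defects(brakes))
```

If the instrument counted duplicate reports twice, for example, the numbers would no longer reflect the empirical property, which is why the mapping has to be checked and not just assumed.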

Other types of data – images

In Chapter 3, we looked at image data, mostly from the perspective of what kind of image data exists. Now, we will take a more pragmatic approach and introduce a better way of working with images than just using files.

Let’s look at how image data is stored in a popular repository, Hugging Face. Hugging Face provides a dedicated library for working with datasets, conveniently called datasets. It can be installed using the pip install -q datasets command. So, let’s load a dataset and visualize one of the images using the following code fragment:

# import the dataset loader and the Image feature type from the datasets library
from datasets import load_dataset, Image
# load the "food101" dataset or, more concretely, its training split
dataset = load_dataset("food101", split="train")

Now, the variable dataset contains all the images. Well, not all of them – just the part that the designer of the dataset specified...
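However the images are loaded, a machine learning model eventually sees them as numbers. The following sketch does not use the Hugging Face library. It represents a tiny, hypothetical 2x3 grayscale image as a matrix of pixel intensities (assumed to be in the range 0 to 255) and flattens it into a feature vector:

```python
# hypothetical 2x3 grayscale image: each number is a pixel intensity (0-255)
image = [
    [0, 128, 255],
    [64, 192, 32],
]
# flatten the matrix row by row into a single feature vector
feature_vector = [pixel for row in image for pixel in row]
# scale the intensities to the range [0, 1], a common preprocessing step
scaled_vector = [pixel / 255 for pixel in feature_vector]
print(feature_vector)
```

Real images have many more pixels and usually three color channels, but the principle is the same.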

Text data

For the text data, we’ll use the same Hugging Face hub to obtain two kinds of data – unstructured text, as we did in Chapter 3, and structured data – programming language code:

# import Hugging Face Dataset
from datasets import load_dataset
# load the dataset with text classification labels
dataset = load_dataset('imdb')

The preceding code fragment loads the dataset of movie reviews from the Internet Movie Database (IMDb). We can get an example of the data by using an interface that’s similar to what we used for images:

# show the first example
dataset['train'][0]

We can also visualize the data in a similar way to the images, for example, by plotting the distribution of the labels:

# plot the distribution of the labels
import seaborn as sns
sns.histplot(dataset['train']['label'], bins=2)

The preceding code fragment creates the following diagram, showing that the positive and negative reviews are perfectly balanced:

Figure 6.13 – Balanced classes in the...
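The same balance check can be made numerically, without a plot. The following sketch uses a short, hypothetical list of labels as a stand-in for dataset['train']['label'] and assumes that 0 denotes a negative review and 1 a positive one:

```python
from collections import Counter

# hypothetical labels standing in for dataset['train']['label']
# (assumption: 0 = negative, 1 = positive)
labels = [0, 1, 1, 0, 1, 0]
# count how often each label occurs
label_counts = Counter(labels)
# the classes are balanced when every label occurs equally often
is_balanced = len(set(label_counts.values())) == 1
print(label_counts, is_balanced)
```

A numerical check like this is easy to automate, so it can complement the diagram when the dataset changes over time.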

Toward feature engineering

In this chapter, we explored methods for visualizing data. We learned how to create diagrams and identify dependencies in the data. We also learned how to use dimensionality reduction techniques to plot multidimensional data on a two-dimensional diagram.

In the next few chapters, we’ll dive into feature engineering for different types of data. It is easy to confuse feature engineering with data extraction. In practice, however, it is not that difficult to tell one from the other.

Extracted data is data that has been collected by applying some sort of measurement instrument. Raw text or images are good examples of this kind of data. Extracted data is close to the domain where the data comes from – or how it is measured.

Features describe the data based on the analysis that we want to perform, so they are closer to what we want to do with the data. They are closer to what we want to achieve and which form of machine learning analysis...
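The difference between extracted data and features can be sketched in a few lines. The review text, the extract_features function, and the chosen features are all hypothetical. They are picked for a made-up sentiment analysis task and are not taken from the book:

```python
# extracted data: raw text, close to the domain it comes from
review = "A great movie with a great cast"


def extract_features(text: str) -> dict:
    # features: numbers chosen with a specific analysis in mind
    words = text.lower().split()
    return {
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        "contains_great": int("great" in words),
    }


features = extract_features(review)
print(features)
```

The raw review stays the same whatever we do with it, whereas a different analysis, such as detecting the language of the review, would call for a different set of features.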



Author (1)

Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.