
You're reading from  Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition
Author (1)
Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.

Data in Software Systems – Text, Images, Code, and Their Annotations

Machine learning (ML) systems are data-hungry applications, and they like their data well prepared for training and inference. Although it may sound obvious, it is more important to scrutinize the properties of data than to select an algorithm to process the data. The data, however, can come in many different formats and can be from different sources. We can consider data in its raw format – for example, a text document or an image file. We can also consider data in a format that is specific to a task at hand – for example, tokenized text (where words are divided into tokens) or an image with bounding boxes (where objects are identified and enclosed in rectangles).
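
To make the two task-specific formats mentioned above more concrete, here is a minimal sketch in Python. The regular-expression tokenizer and the `BoundingBox` class are illustrative assumptions, not the book's implementation; real systems typically rely on dedicated tokenizers and annotation tools.

```python
import re
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Split raw text into lowercase word tokens (a deliberately simple scheme)."""
    return re.findall(r"[a-z0-9']+", text.lower())


@dataclass
class BoundingBox:
    """Hypothetical image annotation: a labeled rectangle in pixel coordinates."""
    label: str
    x: int        # left edge of the rectangle
    y: int        # top edge of the rectangle
    width: int
    height: int

    def area(self) -> int:
        """Return the area of the rectangle in pixels."""
        return self.width * self.height


# Raw text becomes a list of tokens.
tokens = tokenize("Good Quality Dog Food")

# An object in an image is described by a label and its enclosing rectangle.
box = BoundingBox(label="dog", x=10, y=20, width=100, height=50)
```

In both cases the raw data (a string, an image file) stays untouched; the task-specific format is an additional, derived representation.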

When considering the end-user system, what we can do with the data and how we handle it become crucial. However, identifying important elements in the data and transforming it into a format that is useful for ML algorithms...

Raw data and features – what are the differences?

ML systems are data-hungry. They rely on data both to be trained and to make inferences. However, not all data is equally important. Before the era of deep learning (DL), data had to be processed extensively before it could be used in ML. The algorithms of that time were limited in the amount of data they could use for training. Storage and memory were also limited, and therefore ML engineers had to prepare the data much more than they do for DL. For example, they needed to spend more effort to find a small but still representative sample of data for training. Since the introduction of DL, ML models can find complex patterns in much larger datasets. Therefore, the work of ML engineers is now focused on finding sufficiently large and representative datasets.

Classical ML systems – that is, non-DL systems – require data in a tabular form in order to make inferences, and therefore it is important...
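
As a rough illustration of what "tabular form" can mean for raw text, the sketch below turns documents into rows of word counts. The bag-of-words encoding and the function name `to_feature_rows` are assumptions chosen for brevity; the chapter does not prescribe a specific feature extraction method at this point.

```python
from collections import Counter


def to_feature_rows(documents: list[str]) -> tuple[list[str], list[list[int]]]:
    """Convert raw documents into a table: one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    # The column names are the sorted set of all words seen in the documents.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Each cell holds how often the column's word occurs in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


columns, table = to_feature_rows(["good dog food", "bad food"])
```

Each row now has the same fixed set of columns, which is the shape that classical, non-DL algorithms expect as input.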

Every data has its purpose – annotations and tasks

Data in raw format is important, but it is only the first step in the development and operation of ML software. The most important part, and the costliest one, is the annotation of the data. To train an ML model and then use it to make inferences, we need to define a task. Defining a task is both conceptual and operational. The conceptual definition states what we want the software to do, whereas the operational definition states how we want to achieve that goal. The operational definition boils down to defining what we see in the data and what we want the ML model to identify or replicate.

Annotations are the mechanisms by which we direct the ML algorithms. Every piece of data that we use requires some sort of label to denote what it is. In the raw format of the data, this annotation can be a label of what the data point contains. For example, such a label can be that the image contains the number 1 (from the MNIST dataset...

Annotating text for intent recognition

Sentiment analysis (SA), which we mentioned before, is only one type of annotation of textual data. It is useful for assessing whether the text is positive or negative. However, instead of annotating text with a sentiment, we can annotate it with – for example – the intent and train an ML model to recognize intent in other text passages. The table in Figure 3.16 provides such an annotation, based on the same review data as before.

Where different types of data can be used together – an outlook on multi-modal data models

This chapter introduced three types of data – images, text, and structured text. These three types are examples of data in numerical form, such as matrices of numbers, or in the form of time series. Regardless of the form, however, working with data and ML systems is very similar. We need to extract the data from a source system, then transform it into a format that we can annotate, and then use this as input to an ML model.
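
The extract, transform, and annotate steps described above can be sketched as three small functions. Everything here is hypothetical: the dictionary standing in for a source system, the whitespace tokenization, and the labels are assumptions made only to show how the steps chain together.

```python
def extract(source: dict) -> list[str]:
    """Pull the raw data points out of a (hypothetical) source system."""
    return list(source["reviews"])


def transform(raw: list[str]) -> list[list[str]]:
    """Turn each raw text into a format that can be annotated (here: token lists)."""
    return [text.lower().split() for text in raw]


def annotate(items: list[list[str]], labels: list[str]) -> list[tuple[list[str], str]]:
    """Attach one label to each transformed data point."""
    if len(items) != len(labels):
        raise ValueError("every data point needs exactly one label")
    return list(zip(items, labels))


# A stand-in for a real source system such as a database or a file store.
source_system = {"reviews": ["Good quality dog food", "Arrived late"]}

# The resulting pairs of (data point, label) are what an ML model would train on.
dataset = annotate(transform(extract(source_system)), ["positive", "negative"])
```

The same three-step shape applies whether the data points are texts, images, or structured text; only the body of `transform` and the kind of label change.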

When we consider different types of data, we can start to ask whether two types of data can be used in the same system. There are a few ways of doing that. The first is to use different ML systems in different pipelines and connect the pipelines. GitHub Copilot is such a system. It uses a pipeline for processing natural language to find similar programs and to transform them so that they fit the context of the program...

Id | Score | Summary | Text | Intent
1 | 5 | Good Quality Dog Food | I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky... |
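
To illustrate how an Intent column like the one above could be used for training, here is a minimal sketch of a word-count classifier. The example texts, the intent labels `recommend` and `complain`, and the scoring scheme are all invented for illustration; they are not taken from the book's dataset, and a real system would use a proper ML library.

```python
from collections import Counter, defaultdict

# Hypothetical annotated examples: (text, intent). Both are invented here.
annotated = [
    ("I would buy this dog food again", "recommend"),
    ("Great quality, will buy again", "recommend"),
    ("The package arrived damaged, I want a refund", "complain"),
    ("Wrong item arrived and nobody answered my refund request", "complain"),
]


def train(examples):
    """Count how often each word occurs in the texts of each intent."""
    word_counts = defaultdict(Counter)
    for text, intent in examples:
        word_counts[intent].update(text.lower().split())
    return word_counts


def predict(model, text):
    """Pick the intent whose training texts share the most word occurrences."""
    tokens = text.lower().split()
    scores = {
        intent: sum(counts[token] for token in tokens)
        for intent, counts in model.items()
    }
    return max(scores, key=scores.get)


model = train(annotated)
prediction = predict(model, "I want a refund")
```

The point of the sketch is the data flow rather than the model: the annotation (the intent label) is what tells the algorithm which patterns in the text matter for the task.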