You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product typeBook

Published inJan 2024

Reading LevelIntermediate

PublisherPackt

ISBN-139781837634064

Edition1st Edition

Languages

Python

Concepts

Machine Learning

Author (1)

Miroslaw Staron

Data in Software Systems – Text, Images, Code, and Their Annotations

Machine learning (ML) systems are data-hungry applications, and they like their data well prepared for training and inference. Although it may sound obvious, it is more important to scrutinize the properties of data than to select an algorithm to process the data. The data, however, can come in many different formats and can be from different sources. We can consider data in its raw format – for example, a text document or an image file. We can also consider data in a format that is specific to a task at hand – for example, tokenized text (where words are divided into tokens) or an image with bounding boxes (where objects are identified and enclosed in rectangles).

When considering the end user system, what we can do with the data and how we handle the data becomes crucial. However, identifying important elements in the data and transforming it into a format that is useful for ML algorithms...

Raw data and features – what are the differences?

ML systems are data-hungry. They rely on the data to be trained and to make inferences. However, not all data is equally important. Before the era of deep learning (DL), the data was supposed to be processed in order to be used in ML. Before DL, the algorithms were limited in the amount of data that could be used for training. The storage and memory limitations were also limited, and therefore, ML engineers had to prepare the data much more than for DL. For example, ML engineers needed to spend more effort to find a small but still representative sample of data for training. After the introduction of DL, ML models can find complex patterns in much larger datasets. Therefore, the work of ML engineers is now focused on finding sufficiently large, and representative, datasets.

Classical ML systems – that is, non-DL systems – require data in a tabular form in order to make inferences, and therefore it is important...

Every data has its purpose – annotations and tasks

Data in raw format is important, but only the first step in the development and operations of ML software. The most important part, and the costliest one, is the annotation of the data. To train an ML model and then use it to make inferences, we need to define a task. Defining a task is both conceptual and operational. The conceptual definition is to define what we want the software to do, but the operational definition is how we want to achieve that goal. The operational definition boils down to a definition of what we see in the data and what we want the ML model to identify/replicate.

Annotations are the mechanisms by which we direct the ML algorithms. Every piece of data that we use requires some sort of label to denote what it is. In the raw format of the data, this annotation can be a label of what the data point contains. For example, such a label can be that the image contains the number 1 (from the MNIST dataset...

Annotating text for intent recognition

SA, which we mentioned before, is only one type of annotation of textual data. It is useful for assessing whether the text is positive or negative. However, instead of annotating text with a sentiment, we can annotate the text with – for example – the intent and train an ML model to recognize intent from other text passages. The table in Figure 3.16 provides such an annotation, based on the same review data as before:

References

Tao, J. et al., An object detection system based on YOLO in traffic scene. In 2017 6th International Conference on Computer Science and Network Technology (ICCSNT). 2017. IEEE.
Artan, C.T. and T. Kaya, Car Damage Analysis for Insurance Market Using Convolutional Neural Networks. In International Conference on Intelligent and Fuzzy Systems. 2019. Springer.
Nakaura, T. et al., A primer for understanding radiology articles about machine learning and deep learning. Diagnostic and Interventional Imaging, 2020. 101(12): p. 765-770.
Bradski, G., The OpenCV Library. Dr. Dobb’s Journal: Software Tools for the Professional Programmer, 2000. 25(11): p. 120-123.
Memon, J. et al., Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access, 2020. 8: p. 142642-142668.
Mosin, V. et al., Comparing autoencoder-based approaches for anomaly detection in highway driving scenario images. SN Applied Sciences...

The rest of the chapter is locked

You have been reading a chapter from

Machine Learning Infrastructure and Best Practices for Software Engineers

Published in: Jan 2024Publisher: PacktISBN-13: 9781837634064

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.
Read more about Miroslaw Staron

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages

Id	Score	Summary	Text	Intent
1	5	Good Quality Dog Food	I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky...

You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Data in Software Systems – Text, Images, Code, and Their Annotations

Raw data and features – what are the differences?

Every data has its purpose – annotations and tasks

Annotating text for intent recognition

Where different types of data can be used together – an outlook on multi-modal data models

References

Unlock this book and the full library FREE for 7 days

Author (1)

Et al.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

Mastering Tableau 2023

Building AI Applications with ChatGPT APIs

Building AI Applications with ChatGPT APIs

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

Modern Data Architecture on AWS

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

TinyML Cookbook