You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition

Author: Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.
Data Acquisition, Data Quality, and Noise

Data for machine learning systems can come directly from humans and from software systems – usually called source systems. Where the data comes from has implications for what it looks like, what quality it has, and how it needs to be processed.

The data that originates from humans is usually noisier than data that originates from software systems. We humans are known for small inconsistencies, and we also interpret things inconsistently. For example, the same defect reported by two different people can have very different descriptions; the same is true for requirements, designs, and source code.

The data that originates from software systems is often more consistent and contains less noise, or the noise in it is more regular than in human-generated data, since it is produced automatically by source systems. Therefore, controlling and monitoring the quality of automatically generated data is different...

Sources of data and what we can do with them

Machine learning software has become increasingly important in all fields today. Anything from telecommunication networks, self-driving vehicles, computer games, smart navigation systems, and facial recognition to websites, news production, cinematography, and experimental music creation can be powered by machine learning. Some applications are very successful, for example, using machine learning models such as BERT for search queries. Some applications are not so successful, such as using machine learning in hiring processes. Often, this depends on the programmers, data scientists, or models that are used in these applications. In most cases, however, the success of a machine learning application lies in the data that is used to train and operate it: in the quality of that data and in the features that are extracted from it. For example, Amazon's machine learning recommender was taken out of operation because it was biased...

Extracting data from software engineering tools – Gerrit and Jira

To illustrate how to work with data extraction, let’s extract data from a popular software engineering tool for code reviews – Gerrit. This tool is used for reviewing and discussing fragments of code developed by individual programmers, just before they are integrated into the main code base of the product.

The following program code shows how to access the database of Gerrit – that is, through the JSON API – and how to extract the list of all changes for a specific project. This program uses the Python pygerrit2 package (https://pypi.org/project/pygerrit2/). This module helps us use the JSON API as it provides Python functions instead of just JSON strings:

# importing libraries
from pygerrit2 import GerritRestAPI
# A bit of config - repo
gerrit_url = "https://gerrit.onap.org/r"
# since we use a public OSS repository, no authentication is needed
auth = None
# this line sets the parameters for...
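The snippet above stops short of the actual request. As a sketch of how the pygerrit2 API is typically used (the project name, query parameters, and result handling below are my assumptions, not the book's exact code), we can build the REST endpoint for listing changes and issue a GET request:

```python
def changes_endpoint(project, limit=100):
    """Build the Gerrit REST endpoint that lists changes for a project.

    Gerrit's /changes/ endpoint accepts a query string `q` and a
    result limit `n`.
    """
    return f"/changes/?q=project:{project}&n={limit}"


def fetch_changes(gerrit_url, project, limit=100, auth=None):
    """Fetch the list of changes for one project from a Gerrit server.

    Requires the third-party pygerrit2 package (pip install pygerrit2).
    """
    from pygerrit2 import GerritRestAPI

    rest = GerritRestAPI(url=gerrit_url, auth=auth)
    # .get() returns the JSON response already parsed into Python objects
    return rest.get(changes_endpoint(project, limit))
```

A call such as `fetch_changes("https://gerrit.onap.org/r", "aai/aai-common", limit=10)` (the project name is a hypothetical example) would return a list of dictionaries, one per change, with fields such as `_number` and `subject`.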

Extracting data from product databases – GitHub and Git

JIRA and Gerrit are, to some extent, additional tools to the main product development tools. However, every software development organization uses a source code repository to store the main asset – the source code of the company’s software product. Today, the tools that are used the most are Git version control and its close relative, GitHub. Source code repositories can be a very useful source of data for machine learning systems – we can extract the source code of the product and analyze it.

GitHub is a great source of data for machine learning if we use it responsibly. Please remember that the source code provided as open source by the community is not there to be profited from indiscriminately. We need to follow the licenses, and we need to acknowledge the contributions made by the authors, contributors, and maintainers of the open source community. Regardless of the license, we are always able to analyze...
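As a small illustration of mining a local Git clone (my own sketch, not code from the chapter), we can run `git log` with a fixed output format and parse each commit into a Python dictionary; the separator character and field names are assumptions:

```python
import subprocess

# Field separator unlikely to appear in commit metadata (an assumption)
SEP = "\x1f"
# %H = commit hash, %an = author name, %ad = author date, %s = subject
LOG_FORMAT = SEP.join(["%H", "%an", "%ad", "%s"])


def parse_log(log_text):
    """Parse `git log --pretty=format:<LOG_FORMAT>` output into dicts."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        sha, author, date, subject = line.split(SEP, 3)
        commits.append({"sha": sha, "author": author,
                        "date": date, "subject": subject})
    return commits


def git_log(repo_path):
    """Run git log in a local clone and return the parsed commits."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True)
    return parse_log(out.stdout)
```

The parsed commits can then feed feature extraction, for example counting commits per author or per file as inputs to a defect prediction model.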

Data quality

When designing and developing machine learning systems, we consider data quality at a relatively low level. We look for missing values, outliers, and similar issues. These checks are important because such problems can derail the training of machine learning models. Nevertheless, they are not nearly enough from a software engineering perspective.

When engineering reliable software systems, we need to know more about the data we use than whether or not it contains missing values. We need to know whether we can trust the data (whether it is believable), whether it is representative, and whether it is up to date. So, we need a quality model for our data.

There are several quality models for data in software engineering, and the one I often use, and recommend, is the AIMQ model – a methodology for assessing information quality.

The quality dimensions of the AIMQ model are as follows (cited from Lee, Y.W., et al., AIMQ: a methodology for information quality assessment...
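Some of these dimensions are hard to automate, but others, such as completeness and timeliness, are easy to quantify in a monitoring pipeline. The following is a minimal sketch of such checks; the thresholds and field names are illustrative assumptions, not taken from the AIMQ paper:

```python
from datetime import datetime, timezone


def completeness(records, required_fields):
    """Fraction of records in which every required field is present."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) is not None for f in required_fields)
             for r in records)
    return ok / len(records)


def timeliness(records, max_age_days, now=None):
    """Fraction of records whose 'timestamp' is at most max_age_days old."""
    now = now or datetime.now(timezone.utc)
    if not records:
        return 0.0
    fresh = sum((now - r["timestamp"]).days <= max_age_days
                for r in records)
    return fresh / len(records)
```

Scores like these can be reported per dataset, so that a drop in completeness or timeliness triggers an alarm before the data reaches model training.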

Noise

Data quality in machine learning systems has one additional and crucial attribute – noise. Noise can be defined as data points that negatively affect the ability of machine learning systems to identify patterns in the data. Such data points can be outliers that skew the dataset toward one or several classes in classification problems. Outliers can also cause prediction systems to over- or under-predict because they emphasize patterns that do not actually exist in the data.
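One common way to flag such outliers – a sketch of my own, not the book's code – is a simple z-score filter that marks values lying far from the mean:

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` standard
    deviations from the mean - a simple flag for potential noise."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]
```

Whether a flagged point is genuine noise or a rare-but-valid observation still needs a human decision; the filter only tells us where to look.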

Another type of noise is contradictory entries, where two (or more) identical data points are assigned different labels. We can illustrate this with the example of product reviews on Amazon, which we saw in Chapter 3. Let's import them into a new Python script with dfData = pd.read_csv('./book_chapter_4_embedded_1k_reviews.csv'). This dataset contains a summary of each review and its score. We focus on these two columns and define noise as different scores...
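Continuing that idea as a sketch (treating each entry as a (summary, score) pair is my assumption about the dataset's two columns), contradictory entries can be found by grouping identical texts and checking whether more than one distinct label occurs:

```python
from collections import defaultdict


def contradictory_entries(rows):
    """Given (text, label) pairs, return the texts that appear with
    more than one distinct label - our working definition of noise."""
    labels = defaultdict(set)
    for text, label in rows:
        labels[text].add(label)
    return {text: sorted(seen)
            for text, seen in labels.items() if len(seen) > 1}
```

For example, if the same summary appears once with score 5 and once with score 1, it is returned with both labels, and we can decide whether to drop, relabel, or keep those rows.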

Summary

Data for machine learning systems is crucial – without data, there can be no machine learning systems. In most machine learning literature, the process of training models usually starts with the data in tabular form. In software engineering, however, this is an intermediate step. The data is collected from source systems and needs to be processed.

In this chapter, we learned how to access data from modern software engineering systems such as Gerrit, GitHub, JIRA, and Git. The code included in this chapter illustrates how to collect data that can be used for further steps in the machine learning pipeline – feature extraction. We’ll focus on this in the next chapter.

Collecting data is not the only preprocessing step that is required to design and develop a reliable software system. Quantifying and monitoring information (and data) quality is equally important. We need to check that the data is fresh (timely) and that there are no problems in preprocessing...

References

  • Vaswani, A., et al., Attention is all you need. Advances in Neural Information Processing Systems, 2017. 30.
  • Dastin, J., Amazon scraps secret AI recruiting tool that showed bias against women. In Ethics of Data and Analytics. Auerbach Publications, 2018. p. 296-299.
  • Staron, M., Durisic, D., and Rana, R., Improving measurement certainty by using calibration to find systematic measurement error - a case of lines-of-code measure. In Software Engineering: Challenges and Solutions. Springer, 2017. p. 119-132.
  • Staron, M. and Meding, W., Software Development Measurement Programs. Springer, 2018. https://doi.org/10.1007/978-3-319-91836-5.
  • Fenton, N. and Bieman, J., Software Metrics: A Rigorous and Practical Approach. CRC Press, 2014.
  • Li, N., Shepperd, M., and Guo, Y., A systematic review of unsupervised learning techniques for software defect prediction. Information and Software Technology, 2020. 122: p. 106287.
  • Staron, M. et al. Robust...