You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition

Author: Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.
Data Acquisition, Data Quality, and Noise

Data for machine learning systems can come directly from humans and from software systems – usually called source systems. Where the data comes from has implications for what it looks like, what quality it has, and how it needs to be processed.

The data that originates from humans is usually noisier than data that originates from software systems. We humans are known for small inconsistencies, and we also interpret things inconsistently. For example, the same defect reported by two different people can have very different descriptions; the same is true for requirements, designs, and source code.

The data that originates from software systems is often more consistent and contains less noise, or the noise in it is more regular than in human-generated data, since it is produced automatically by source systems. Therefore, controlling and monitoring the quality of automatically generated data is different...

Sources of data and what we can do with them

Machine learning software has become increasingly important in all fields today. Anything from telecommunication networks, self-driving vehicles, computer games, smart navigation systems, and facial recognition to websites, news production, cinematography, and experimental music creation can be powered by machine learning. Some applications are very successful, for example, using machine learning models such as BERT for search queries. Some applications are not so successful, such as using machine learning in hiring processes. Often, this depends on the programmers, data scientists, or models that are used in these applications. In most cases, however, the success of a machine learning application lies in the data that is used to train and operate it: in the quality of that data and in the features that are extracted from it. For example, Amazon's machine learning recommender was taken out of operation because it was biased...

Extracting data from software engineering tools – Gerrit and Jira

To illustrate how to work with data extraction, let’s extract data from a popular software engineering tool for code reviews – Gerrit. This tool is used for reviewing and discussing fragments of code developed by individual programmers, just before they are integrated into the main code base of the product.

The following program code shows how to access the database of Gerrit – that is, through the JSON API – and how to extract the list of all changes for a specific project. This program uses the Python pygerrit2 package (https://pypi.org/project/pygerrit2/). This module helps us use the JSON API as it provides Python functions instead of just JSON strings:

# importing libraries
from pygerrit2 import GerritRestAPI
# A bit of config - repo
gerrit_url = "https://gerrit.onap.org/r"
# since we use a public OSS repository, no authentication is needed
auth = None
# this line sets the parameters for...
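The snippet above stops short of the actual request. As a sketch of how the pygerrit2 API is typically used (the project name, query parameters, and result handling below are my assumptions, not the book's exact code), we can build the REST endpoint for listing changes and issue a GET request:

```python
def changes_endpoint(project, limit=100):
    """Build the Gerrit REST endpoint that lists changes for a project.

    Gerrit's /changes/ endpoint accepts a query string `q` and a
    result limit `n`.
    """
    return f"/changes/?q=project:{project}&n={limit}"


def fetch_changes(gerrit_url, project, limit=100, auth=None):
    """Fetch the list of changes for one project from a Gerrit server.

    Requires the third-party pygerrit2 package (pip install pygerrit2).
    """
    from pygerrit2 import GerritRestAPI

    rest = GerritRestAPI(url=gerrit_url, auth=auth)
    # .get() returns the JSON response already parsed into Python objects
    return rest.get(changes_endpoint(project, limit))
```

A call such as `fetch_changes("https://gerrit.onap.org/r", "aai/aai-common", limit=10)` (the project name is a hypothetical example) would return a list of dictionaries, one per change, with fields such as `_number` and `subject`.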

Extracting data from product databases – GitHub and Git

JIRA and Gerrit are, to some extent, additional tools to the main product development tools. However, every software development organization uses a source code repository to store the main asset – the source code of the company’s software product. Today, the tools that are used the most are Git version control and its close relative, GitHub. Source code repositories can be a very useful source of data for machine learning systems – we can extract the source code of the product and analyze it.

GitHub is a great source of data for machine learning if we use it responsibly. Please remember that the source code provided as open source by the community is not there to be profited from indiscriminately. We need to follow the licenses, and we need to acknowledge the contributions made by the authors, contributors, and maintainers of the open source community. Regardless of the license, we are always able to analyze...
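As a small illustration of mining a local Git clone (my own sketch, not code from the chapter), we can run `git log` with a fixed output format and parse each commit into a Python dictionary; the separator character and field names are assumptions:

```python
import subprocess

# Field separator unlikely to appear in commit metadata (an assumption)
SEP = "\x1f"
# %H = commit hash, %an = author name, %ad = author date, %s = subject
LOG_FORMAT = SEP.join(["%H", "%an", "%ad", "%s"])


def parse_log(log_text):
    """Parse `git log --pretty=format:<LOG_FORMAT>` output into dicts."""
    commits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        sha, author, date, subject = line.split(SEP, 3)
        commits.append({"sha": sha, "author": author,
                        "date": date, "subject": subject})
    return commits


def git_log(repo_path):
    """Run git log in a local clone and return the parsed commits."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True)
    return parse_log(out.stdout)
```

The parsed commits can then feed feature extraction, for example counting commits per author or per file as inputs to a defect prediction model.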

Data quality

When designing and developing machine learning systems, we consider data quality at a relatively low level. We look for missing values, outliers, and similar issues. These checks are important because such problems can derail the training of machine learning models. Nevertheless, they are not nearly enough from a software engineering perspective.

When engineering reliable software systems, we need to know more about the data we use than whether or not it contains missing values. We need to know whether we can trust the data (whether it is believable), whether it is representative, and whether it is up to date. So, we need a quality model for our data.

There are several quality models for data in software engineering, and the one I often use, and recommend, is the AIMQ model – a methodology for assessing information quality.

The quality dimensions of the AIMQ model are as follows (cited from Lee, Y.W., et al., AIMQ: a methodology for information quality assessment...
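Some of these dimensions are hard to automate, but others, such as completeness and timeliness, are easy to quantify in a monitoring pipeline. The following is a minimal sketch of such checks; the thresholds and field names are illustrative assumptions, not taken from the AIMQ paper:

```python
from datetime import datetime, timezone


def completeness(records, required_fields):
    """Fraction of records in which every required field is present."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) is not None for f in required_fields)
             for r in records)
    return ok / len(records)


def timeliness(records, max_age_days, now=None):
    """Fraction of records whose 'timestamp' is at most max_age_days old."""
    now = now or datetime.now(timezone.utc)
    if not records:
        return 0.0
    fresh = sum((now - r["timestamp"]).days <= max_age_days
                for r in records)
    return fresh / len(records)
```

Scores like these can be reported per dataset, so that a drop in completeness or timeliness triggers an alarm before the data reaches model training.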

Noise

Data quality in machine learning systems has one additional and crucial attribute – noise. Noise can be defined as data points that negatively affect the ability of machine learning systems to identify patterns in the data. Such data points can be outliers that skew the dataset toward one or several classes in classification problems. Outliers can also cause prediction systems to over- or under-predict because they emphasize patterns that do not actually exist in the data.
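One common way to flag such outliers – a sketch of my own, not the book's code – is a simple z-score filter that marks values lying far from the mean:

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` standard
    deviations from the mean - a simple flag for potential noise."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]
```

Whether a flagged point is genuine noise or a rare-but-valid observation still needs a human decision; the filter only tells us where to look.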

Another type of noise is contradictory entries, where two (or more) identical data points are assigned different labels. We can illustrate this with the example of product reviews on Amazon, which we saw in Chapter 3. Let's import them into a new Python script with dfData = pd.read_csv('./book_chapter_4_embedded_1k_reviews.csv'). This dataset contains a summary of each review and its score. We focus on these two columns and define noise as different scores...
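Continuing that idea as a sketch (treating each entry as a (summary, score) pair is my assumption about the dataset's two columns), contradictory entries can be found by grouping identical texts and checking whether more than one distinct label occurs:

```python
from collections import defaultdict


def contradictory_entries(rows):
    """Given (text, label) pairs, return the texts that appear with
    more than one distinct label - our working definition of noise."""
    labels = defaultdict(set)
    for text, label in rows:
        labels[text].add(label)
    return {text: sorted(seen)
            for text, seen in labels.items() if len(seen) > 1}
```

For example, if the same summary appears once with score 5 and once with score 1, it is returned with both labels, and we can decide whether to drop, relabel, or keep those rows.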

Summary

Data for machine learning systems is crucial – without data, there can be no machine learning systems. In most machine learning literature, the process of training models usually starts with the data in tabular form. In software engineering, however, this is an intermediate step. The data is collected from source systems and needs to be processed.

In this chapter, we learned how to access data from modern software engineering systems such as Gerrit, GitHub, JIRA, and Git. The code included in this chapter illustrates how to collect data that can be used for further steps in the machine learning pipeline – feature extraction. We’ll focus on this in the next chapter.

Collecting data is not the only preprocessing step that is required to design and develop a reliable software system. Quantifying and monitoring information (and data) quality is equally important. We need to check that the data is fresh (timely) and that there are no problems in preprocessing...

References

  • Vaswani, A., et al., Attention is all you need. Advances in Neural Information Processing Systems, 2017. 30.
  • Dastin, J., Amazon scraps secret AI recruiting tool that showed bias against women. In Ethics of Data and Analytics. Auerbach Publications, 2018. p. 296-299.
  • Staron, M., Durisic, D., and Rana, R., Improving measurement certainty by using calibration to find systematic measurement error - a case of lines-of-code measure. In Software Engineering: Challenges and Solutions. Springer, 2017. p. 119-132.
  • Staron, M. and Meding, W., Software Development Measurement Programs. Springer, 2018. https://doi.org/10.1007/978-3-319-91836-5.
  • Fenton, N. and Bieman, J., Software Metrics: A Rigorous and Practical Approach. CRC Press, 2014.
  • Li, N., Shepperd, M., and Guo, Y., A systematic review of unsupervised learning techniques for software defect prediction. Information and Software Technology, 2020. 122: p. 106287.
  • Staron, M. et al. Robust...