You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837634064
Edition: 1st Edition
Author (1)
Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.

Ethics in Data Acquisition and Management

Machine learning (ML) requires a lot of data, which can come from a variety of sources, but not all sources are equally easy to use. In software engineering, we can design and develop systems that use data from other systems. We can also use data that does not originate from people; for example, data about defects or the complexity of systems. However, to provide more value to society, we often need to use data that contains information about people or their belongings; for example, when we train machines to recognize faces or license plates. Regardless of the use case, we need to follow ethical guidelines and, above all, hold to the guiding principle that our software should not cause any harm.

We start this chapter by exploring a few examples of unethical systems that show bias; for example, credit ranking systems that penalize certain minorities. I will also explain problems with using open source data and revealing the identities...

Ethics in computer science and software engineering

The modern view of ethics has its roots in the Nuremberg Code, which was developed after the Second World War. The code is based on several principles, the most important being that every study involving human subjects requires their consent. This is essential, as it prevents the abuse of humans during experimentation. Every participant in a study should also be able to withdraw their consent at any time. Let us look at all 10 principles:

  1. The voluntary consent of the human subject is absolutely essential.
  2. The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature.
  3. The experiment should be so designed and based on the results of animal experimentation and a knowledge of the natural history of the disease or other problem under study that the anticipated results will justify...

Data is all around us, but can we really use it?

One of the ways in which we protect subjects and data is to use appropriate licenses for the data. A license is a kind of contract in which the licensor grants the licensee permission to use the data in a specific way. Licenses are used for both software products (algorithms, components) and data. The following license models are the most commonly used in contemporary software:

  • Proprietary license: It is a model whereby the licensor owns the data and grants permission to use the data for certain purposes, often for profit. In such a contract, the parties usually regulate what the data can be used for, how, and for how long. These licenses also regulate liabilities for both parties.
  • Permissive open licenses: These are licenses that provide the licensee almost unrestricted access to the data, at the same time limiting the liability of the licensor. The licensee is often not required to provide access to the licensee...
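One way to make license terms actionable in a data pipeline is to check a dataset's declared license before the data is used. The following is a minimal sketch, not a complete licensing policy: the `license` metadata field, the `can_use` helper, and the contents of the allow-list are assumptions chosen for illustration, and the identifiers follow the SPDX naming scheme.

```python
# Hypothetical allow-list of SPDX license identifiers approved for a project.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT"}


def can_use(dataset_meta: dict, allowed=ALLOWED_LICENSES) -> bool:
    """Return True only if the dataset declares a license on the allow-list.

    A dataset without a declared license is rejected, since no permission
    to use it has been granted.
    """
    license_id = dataset_meta.get("license")
    return license_id in allowed


# Example metadata records (made up for illustration).
open_dataset = {"name": "defect-reports", "license": "CC-BY-4.0"}
unlicensed_dataset = {"name": "scraped-faces"}

print(can_use(open_dataset))
print(can_use(unlicensed_dataset))
```

Rejecting data with no declared license is a deliberately conservative choice: the absence of a license does not mean the data is free to use.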

Ethics behind data from open source systems

Proprietary systems often have licenses that regulate who owns the data and for what purposes it can be used. For example, code review data from a company often belongs to the company. By working for the company, employees usually sign over their rights to the data that they generate for it. This transfer makes sense legally because the employees are compensated for that work, usually in the form of a salary.

However, what employees do not transfer to the company is the right to use their personal data freely. This means that when we work with source systems, such as the Gerrit review system, we should not extract personal information without the permission of the people involved. If we execute a query in which this data cannot be masked, we must ensure that the personal data is anonymized as soon as possible and does not leak into the analysis. We must ensure that such personal data is not made publicly available...
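Replacing personal fields right after extraction could look like the sketch below. It assumes each review record is a dictionary and that the personal data sits in fields named `name`, `email`, and `username`; these names, and the helpers `pseudonymize` and `scrub_review`, are hypothetical and not part of Gerrit's API. Note that a keyed hash is pseudonymization rather than full anonymization: anyone holding the key can re-link the values, so the key must be stored separately from the data or discarded.

```python
import hashlib
import hmac

# Hypothetical field names; adjust to the records your query returns.
PERSONAL_FIELDS = ("name", "email", "username")


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a personal value with a truncated keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def scrub_review(review: dict, key: bytes) -> dict:
    """Return a copy of a review record with personal fields pseudonymized."""
    clean = dict(review)  # shallow copy; the original record is left untouched
    for field in PERSONAL_FIELDS:
        if clean.get(field) is not None:
            clean[field] = pseudonymize(str(clean[field]), key)
    return clean


# Example record (made up for illustration).
secret_key = b"replace-with-a-secret-kept-outside-the-dataset"
raw = {"change_id": 42, "name": "Jane Doe", "email": "jane@example.com", "comments": 3}
print(scrub_review(raw, secret_key))
```

Because the same input and key always give the same token, analyses such as "reviews per reviewer" still work on the scrubbed data without exposing who the reviewer is.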

Ethics behind data collected from humans

In Europe, one of the main legal frameworks that regulate how we can use data is the General Data Protection Regulation (GDPR) (https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679). It regulates the scope of handling personal data and requires organizations to obtain permission to collect, process, and use personal data, as well as to provide individuals with the ability to revoke that permission. The regulation is one of the most restrictive international regulations meant to protect individuals (us) from being abused by companies that have the means and abilities to collect and process data about us.
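The requirement that permission can be both granted and revoked can be illustrated with a small sketch. It assumes that every record carries a `subject_id` and that consent is tracked per subject; the `ConsentRegistry` class and the `usable_records` helper are hypothetical names, and the sketch covers only the grant/revoke mechanics, not the other obligations the regulation imposes.

```python
class ConsentRegistry:
    """Tracks which subjects currently permit the use of their data."""

    def __init__(self):
        self._granted = set()

    def grant(self, subject_id: str) -> None:
        self._granted.add(subject_id)

    def revoke(self, subject_id: str) -> None:
        # discard() does nothing if the subject never granted consent
        self._granted.discard(subject_id)

    def allows(self, subject_id: str) -> bool:
        return subject_id in self._granted


def usable_records(records, registry):
    """Keep only the records whose subject has an active consent."""
    return [r for r in records if registry.allows(r["subject_id"])]


# Example data (made up for illustration).
registry = ConsentRegistry()
registry.grant("s1")
registry.grant("s2")
registry.revoke("s2")  # the subject withdraws permission

records = [
    {"subject_id": "s1", "value": 10},
    {"subject_id": "s2", "value": 20},
    {"subject_id": "s3", "value": 30},  # never gave consent
]
print(usable_records(records, registry))
```

Filtering at the point of use, rather than once at collection time, means that a revocation takes effect the next time the data is read.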

Although we use a lot of data from GitHub and similar repositories, there are also repositories where we store data ourselves. One of them is Zenodo, which is increasingly used to store datasets. Its terms of use require us to obtain the right permissions. Here are its terms of use...

References

  • Code, N., The Nuremberg Code. Trials of war criminals before the Nuremberg military tribunals under control council law, 1949. 10(1949): p. 181-2.
  • Wohlin, C. et al., Experimentation in software engineering. 2012: Springer Science & Business Media.
  • Gold, N.E. and J. Krinke, Ethics in the mining of software repositories. Empirical Software Engineering, 2022. 27(1): p. 17.
  • Kenneally, E. and D. Dittrich, The Menlo Report: Ethical principles guiding information and communication technology research. Available at SSRN 2445102, 2012.