You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type: Book

Published in: Jan 2024

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781837634064

Edition: 1st Edition

Languages: Python

Concepts: Machine Learning

Author: Miroslaw Staron

Ethics in Data Acquisition and Management

Machine learning (ML) requires a lot of data that can come from a variety of sources, but not all sources are equally easy to use. In software engineering, we can design and develop systems that use data from other systems. We can also use data that does not really originate from people; for example, we can use data about defects or complexity of systems. However, to provide more value to society, we need to use data that contains information about people or their belongings; for example, when we train machines to recognize faces or license plates. Regardless of our use case, however, we need to follow ethical guidelines and, above all, have the guiding principle that our software should not cause any harm.

We start this chapter by exploring a few examples of unethical systems that show bias; for example, credit ranking systems that penalize certain minorities. I will also explain problems with using open source data and revealing the identities...

Ethics in computer science and software engineering

The modern view on ethics has its roots in the Nuremberg Code, which was developed after the Second World War. The code is based on several principles, but the most important one is that every study involving human subjects requires their consent. This is essential, as it prevents the abuse of humans during experimentation. Every participant in a study should also be able to withdraw that consent at any time. Let us look at all 10 principles:

1. The voluntary consent of the human subject is absolutely essential.
2. The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature.
3. The experiment should be so designed and based on the results of animal experimentation and a knowledge of the natural history of the disease or other problem under study that the anticipated results will justify...

Data is all around us, but can we really use it?

One of the ways in which we protect subjects and data is to use appropriate licenses for the data. A license is a kind of contract in which the licensor grants a licensee permission to use the data in a specific way. Licenses are used for both software products (algorithms, components) and data. The following license models are the most commonly used in contemporary software:

Proprietary license: It is a model whereby the licensor owns the data and grants permission to use the data for certain purposes, often for profit. In such a contract, the parties usually regulate what the data can be used for, how, and for how long. These licenses also regulate liabilities for both parties.
Permissive open licenses: These are licenses that provide the licensee almost unrestricted access to the data, at the same time limiting the liability of the licensor. The licensee is often not required to provide access to the licensee...
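
To make the idea of license terms concrete, here is a minimal sketch of how a project might record what a dataset can be used for and for how long, and check an intended use against those terms. The class `DataLicense`, its fields, and the function `is_use_permitted` are hypothetical names chosen for illustration; real license terms are usually richer and need to be read by a person, not only by code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataLicense:
    """Hypothetical record of the terms agreed for one dataset."""
    licensor: str
    allowed_purposes: frozenset  # what the data can be used for
    valid_until: date            # for how long the data can be used


def is_use_permitted(lic: DataLicense, purpose: str, on: date) -> bool:
    """Return True only if the purpose is covered and the license has not expired."""
    return purpose in lic.allowed_purposes and on <= lic.valid_until


# Illustrative values only; they do not describe any real contract.
lic = DataLicense(
    licensor="Example Corp",
    allowed_purposes=frozenset({"research", "model-training"}),
    valid_until=date(2025, 12, 31),
)

print(is_use_permitted(lic, "model-training", date(2025, 6, 1)))
print(is_use_permitted(lic, "resale", date(2025, 6, 1)))
```

A check like this is only a guard in a data pipeline. It does not replace reading the actual license text.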

Ethics behind data from open source systems

Proprietary systems often have licenses that regulate who owns the data and for what purpose it may be used. For example, code review data from a company often belongs to the company. By working for the company, the employees usually sign over their rights to the data that they generate for the company. This transfer is needed in the legal sense because the employees are compensated for that work, usually in the form of salaries.

However, what the employees do not transfer to the company is the right to use their personal data freely. This means that when we work with source systems, such as the Gerrit review system, we should not extract personal information without the permission of the people involved. If we execute a query for which masking this data is not possible, we must ensure that the personal data is anonymized (as soon as possible) and does not leak into the analysis. We must ensure that such personal data is not made publicly available...
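
As an illustration of masking personal data as early as possible, the following sketch replaces personal fields in a review record with keyed hashes before the record reaches any analysis code. The field names (`author_name`, `author_email`) and the record layout are assumptions made for this example and are not taken from the Gerrit API. Note that keyed hashing is pseudonymization rather than full anonymization: whoever holds the key can recompute the mapping, and free-text fields such as comments may still contain names.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names; adjust them to the actual export format.
PERSONAL_FIELDS = ("author_name", "author_email")


def pseudonymize(record: dict, key: bytes) -> dict:
    """Return a copy of the record with personal fields replaced by keyed hashes."""
    cleaned = dict(record)  # shallow copy, so the input record is not modified
    for field in PERSONAL_FIELDS:
        if field in cleaned:
            value = str(cleaned[field]).encode("utf-8")
            digest = hmac.new(key, value, hashlib.sha256).hexdigest()
            cleaned[field] = digest[:16]  # shortened token, stable for a given key
    return cleaned


# The key is generated once per study and kept out of the dataset.
key = secrets.token_bytes(32)

review = {
    "change_id": "I0001",
    "author_name": "Jane Doe",
    "author_email": "jane@example.com",
    "comment": "Please rename this variable.",
}

safe = pseudonymize(review, key)
print(safe["change_id"], safe["author_email"])
```

Because the same key always maps the same person to the same token, analyses such as counting reviews per reviewer still work. Discarding the key after the extraction makes re-identification harder, although it does not by itself guarantee anonymity.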

Ethics behind data collected from humans

In Europe, one of the main legal frameworks that regulates how we can use data is the General Data Protection Regulation (GDPR) (https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679). It regulates the handling of personal data and requires organizations to obtain permission to collect, process, and use personal data, and to give individuals the ability to revoke that permission. The regulation is among the most restrictive international regulations meant to protect individuals (us) from being abused by companies that have the means and abilities to collect and process data about us.
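
The permission-and-revocation idea described above can be sketched in code as a filter that runs before any processing. The class `ConsentRegistry`, the function `usable_records`, and the `subject_id` field are hypothetical names introduced for this example. This is a simplified illustration of the principle, not a statement of what the regulation requires in any particular case.

```python
class ConsentRegistry:
    """Hypothetical in-memory record of who currently permits use of their data."""

    def __init__(self):
        self._consented = set()

    def grant(self, subject_id: str) -> None:
        self._consented.add(subject_id)

    def revoke(self, subject_id: str) -> None:
        # discard() does nothing if the subject never granted permission
        self._consented.discard(subject_id)

    def has_consent(self, subject_id: str) -> bool:
        return subject_id in self._consented


def usable_records(records: list, registry: ConsentRegistry) -> list:
    """Keep only records whose subject has permission on file right now."""
    return [r for r in records if registry.has_consent(r["subject_id"])]


registry = ConsentRegistry()
registry.grant("s1")
registry.grant("s2")

records = [
    {"subject_id": "s1", "value": 10},
    {"subject_id": "s2", "value": 20},
    {"subject_id": "s3", "value": 30},  # never granted permission
]

before = usable_records(records, registry)
registry.revoke("s2")  # the individual withdraws permission
after = usable_records(records, registry)
print(len(before), len(after))
```

Checking permission at the moment of use, rather than once at collection time, is what lets a revocation take effect in later processing runs. Data that was already copied or used for training would need to be handled separately.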

Although we use a lot of data from GitHub and similar repositories, there are also repositories where we store data ourselves. One of them is Zenodo, which is increasingly used to store datasets. Its terms of use require us to obtain the right permissions. Here are its terms of use...

Contracts and legal obligations

To finish the chapter, I would like to take up one last topic. Although there is a lot of data available, we must make sure that we do our due diligence and find out which contracts and obligations apply to us.

Licenses are one type of contract, but not the only one. Almost all universities impose contracts and obligations on researchers. These may include the need to ask for permission from ethical review boards or the need to make data available for scrutiny by other researchers.

Professional codes of conduct are another type of obligation; for example, the one from ACM (https://www.acm.org/code-of-ethics). These codes of conduct often stem from the Nuremberg Code and require us to ensure that our work is for the good of society.

Finally, when working with commercial organizations, we may need to sign a so-called non-disclosure agreement (NDA). Such agreements are often required to ensure that we do not disclose information without prior permission...

References

Code, N., The Nuremberg Code. Trials of war criminals before the Nuremberg military tribunals under control council law, 1949. 10(1949): p. 181-2.
Wohlin, C. et al., Experimentation in software engineering. 2012: Springer Science & Business Media.
Gold, N.E. and J. Krinke, Ethics in the mining of software repositories. Empirical Software Engineering, 2022. 27(1): p. 17.
Kenneally, E. and D. Dittrich, The Menlo Report: Ethical principles guiding information and communication technology research. Available at SSRN 2445102, 2012.


Author (1)

Miroslaw Staron

Miroslaw Staron is a professor of Applied IT at the University of Gothenburg in Sweden with a focus on empirical software engineering, measurement, and machine learning. He is currently editor-in-chief of Information and Software Technology and co-editor of the regular Practitioner's Digest column of IEEE Software. He has authored books on automotive software architectures, software measurement, and action research. He also leads several projects in AI for software engineering and leads an AI and digitalization theme at Software Center. He has written over 200 journal and conference articles.