You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type Book

Published in Jan 2024

Publisher Packt

ISBN-13 9781837634064

Pages 346 pages

Edition 1st Edition

Languages

Python

Concepts

Machine Learning

Author (1):

Miroslaw Staron

Table of Contents (24) Chapters

Preface

1. Part 1: Machine Learning Landscape in Software Engineering

2. Machine Learning Compared to Traditional Software

3. Elements of a Machine Learning System

4. Data in Software Systems – Text, Images, Code, and Their Annotations

5. Data Acquisition, Data Quality, and Noise

6. Quantifying and Improving Data Properties

7. Part 2: Data Acquisition and Management

8. Processing Data in Machine Learning Systems

9. Feature Engineering for Numerical and Image Data

10. Feature Engineering for Natural Language Data

11. Part 3: Design and Development of ML Systems

12. Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)

13. Training and Evaluating Classical Machine Learning Systems and Neural Networks

14. Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders

15. Designing Machine Learning Pipelines (MLOps) and Their Testing

16. Designing and Implementing Large-Scale, Robust ML Software

17. Part 4: Ethical Aspects of Data Management and ML System Development

18. Ethics in Data Acquisition and Management

19. Ethics in Machine Learning Systems

20. Integrating ML Systems in Ecosystems

21. Summary and Where to Go Next

22. Index

23. Other Books You May Enjoy

Ethics in Data Acquisition and Management

Machine learning (ML) requires a lot of data, which can come from a variety of sources, but not all sources are equally easy to use. In software engineering, we can design and develop systems that use data from other systems. We can also use data that does not originate from people, such as data about defects or the complexity of systems. However, to provide more value to society, we need to use data that contains information about people or their belongings, for example, when we train machines to recognize faces or license plates. Regardless of the use case, we need to follow ethical guidelines and, above all, hold to the guiding principle that our software should not cause any harm.

We start this chapter by exploring a few examples of unethical systems that show bias; for example, credit ranking systems that penalize certain minorities. I will also explain problems with using open source data and revealing the identities...

Ethics in computer science and software engineering

The modern view of ethics in research has its roots in the Nuremberg Code, which was developed after the Second World War. The code is based on several principles, the most important of which is that every study involving human subjects requires their consent. This is essential, as it prevents the abuse of humans during experimentation. Every participant in a study should also be able to withdraw their consent at any time. Let us look at all 10 principles:

The voluntary consent of the human subject is absolutely essential.
The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature.
The experiment should be so designed and based on the results of animal experimentation and a knowledge of the natural history of the disease or other problem under study that the anticipated results will justify...

Data is all around us, but can we really use it?

One way to protect subjects and data is to use appropriate licenses. A license is a kind of contract in which the licensor grants the licensee permission to use the data in a specific way. Licenses are used for both software products (algorithms, components) and data. The following license models are the most commonly used ones in contemporary software:

Proprietary license: This is a model in which the licensor owns the data and grants permission to use it for certain purposes, often for profit. In such a contract, the parties usually regulate what the data can be used for, how, and for how long. These licenses also regulate the liabilities of both parties.
Permissive open licenses: These are licenses that provide the licensee almost unrestricted access to the data, at the same time limiting the liability of the licensor. The licensee is often not required to provide access to the licensee...
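The terms that a license regulates (what the data can be used for and for how long) can be recorded next to the dataset and checked before the data is used. The following is a minimal sketch of that idea, not a legal tool: the `DataLicense` class, its fields, and the purpose labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DataLicense:
    """Hypothetical record of the terms under which a dataset was licensed."""
    name: str
    allowed_purposes: FrozenSet[str]      # illustrative labels, e.g. "research"
    valid_until: Optional[date] = None    # None means no end date was agreed


def use_is_permitted(lic: DataLicense, purpose: str, on: date) -> bool:
    """Return True only if the purpose is covered and the license has not expired."""
    if purpose not in lic.allowed_purposes:
        return False
    return lic.valid_until is None or on <= lic.valid_until


# An invented proprietary agreement: research use only, until the end of 2025.
proprietary = DataLicense(
    name="example proprietary agreement",
    allowed_purposes=frozenset({"research"}),
    valid_until=date(2025, 12, 31),
)

print(use_is_permitted(proprietary, "research", date(2025, 6, 1)))    # True
print(use_is_permitted(proprietary, "commercial", date(2025, 6, 1)))  # False
print(use_is_permitted(proprietary, "research", date(2026, 1, 1)))    # False
```

A check like this only encodes what the parties have already agreed on; the license text itself remains the authority.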

Ethics behind data from open source systems

Proprietary systems often have licenses that regulate who owns the data and for what purpose it can be used. For example, code review data from a company often belongs to the company. By working for the company, the employees usually sign over their rights to the data that they generate for the company. This transfer is legally sound because the employees are compensated for it, usually in the form of salaries.

However, what the employees do not transfer to the company is the right to use their personal data freely. This means that when we work with source systems, such as the Gerrit review system, we should not extract personal information without the permission of the people involved. If we execute a query in which this data cannot be masked, we must ensure that the personal data is anonymized as soon as possible and does not leak into the analysis. We must ensure that such personal data is not made publicly available...
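As a rough illustration of masking personal data right after extraction, the sketch below replaces the name and email of a review's owner with keyed hashes before the record reaches any analysis code. The record layout, the field names, and the `SECRET_KEY` value are assumptions made for this example and are not taken from any specific Gerrit query. Note also that keyed hashing is pseudonymization rather than full anonymization: anyone who holds the key can recompute the mapping, so the key must be stored separately from the data.

```python
import hashlib
import hmac

# Assumption: a secret key that is stored outside the dataset and never published.
SECRET_KEY = b"replace-with-a-key-kept-outside-the-dataset"

# Hypothetical names of the fields that carry personal data in a review record.
PERSONAL_FIELDS = ("name", "email")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a personal identifier to a stable pseudonym using HMAC-SHA256."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()[:16]


def scrub_review(review: dict) -> dict:
    """Return a copy of the review with the owner's personal fields replaced."""
    cleaned = dict(review)
    owner = dict(cleaned.get("owner", {}))
    for field in PERSONAL_FIELDS:
        if field in owner:
            owner[field] = pseudonymize(owner[field])
    cleaned["owner"] = owner
    return cleaned


# Invented example record; real review data will have a different shape.
raw = {
    "subject": "Fix null check in parser",
    "owner": {"name": "Jane Doe", "email": "jane.doe@example.com"},
}
safe = scrub_review(raw)
print(safe["subject"])          # Fix null check in parser
print(safe["owner"]["email"])   # a 16-character pseudonym instead of the address
```

Because the same input always maps to the same pseudonym, analyses that count reviews per author still work without the author's identity being visible.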

Ethics behind data collected from humans

In Europe, one of the main legal frameworks that regulates how we can use data is the General Data Protection Regulation (GDPR) (https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679). It regulates the handling of personal data: it requires organizations to obtain permission to collect, process, and use personal data, and to give individuals the ability to revoke that permission. It is one of the most restrictive international regulations meant to protect individuals (us) from being abused by companies that have the means and abilities to collect and process data about us.
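The requirement to obtain permission and to honor its revocation can be reflected in how a data pipeline selects records. The sketch below is one possible way to do this and is not a statement of what the GDPR prescribes technically: the `ConsentRegistry` class, the `subject_id` field, and the purpose labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ConsentRegistry:
    """Hypothetical in-memory record of which subject permitted which purpose."""
    _granted: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, subject_id: str, purpose: str) -> None:
        self._granted.setdefault(subject_id, set()).add(purpose)

    def revoke(self, subject_id: str, purpose: str) -> None:
        # Revoking a permission that was never granted is a no-op.
        self._granted.get(subject_id, set()).discard(purpose)

    def allows(self, subject_id: str, purpose: str) -> bool:
        return purpose in self._granted.get(subject_id, set())


def records_usable_for(records: List[dict], registry: ConsentRegistry,
                       purpose: str) -> List[dict]:
    """Keep only records whose subject currently permits the given purpose."""
    return [r for r in records if registry.allows(r["subject_id"], purpose)]


registry = ConsentRegistry()
registry.grant("s1", "model_training")
registry.grant("s2", "model_training")

records = [{"subject_id": "s1", "value": 10},
           {"subject_id": "s2", "value": 20},
           {"subject_id": "s3", "value": 30}]  # s3 never gave permission

print(len(records_usable_for(records, registry, "model_training")))  # 2
registry.revoke("s2", "model_training")
print(len(records_usable_for(records, registry, "model_training")))  # 1
```

Filtering at selection time means that a revocation takes effect the next time the data is read; data already copied elsewhere or used to train a model would need separate handling.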

Although we use a lot of data from GitHub and similar repositories, there are also repositories where we store data ourselves. One of them is Zenodo, which is used increasingly often to store datasets. Its terms of use require us to obtain the right permissions. Here are its terms of use...

Contracts and legal obligations

To finish the chapter, I would like to take up one last topic. Although there is a lot of data available, we must make sure that we do our due diligence and find out which contracts and obligations apply to us.

Licenses are one type of contract, but not the only one. Almost all universities place contracts and obligations on researchers. These may include the need to ask for permission from ethics review boards or the need to make data available for scrutiny by other researchers.

Professional codes of conduct are another type of obligation; for example, the one from ACM (https://www.acm.org/code-of-ethics). These codes of conduct often stem from the Nuremberg Code and require us to ensure that our work is for the good of society.

Finally, when working with commercial organizations, we may need to sign a so-called non-disclosure agreement (NDA). Such agreements are often required to ensure that we do not disclose information without prior permission...

References

The Nuremberg Code. In Trials of War Criminals before the Nuremberg Military Tribunals under Control Council Law No. 10, 1949: p. 181-182.
Wohlin, C. et al., Experimentation in software engineering. 2012: Springer Science & Business Media.
Gold, N.E. and J. Krinke, Ethics in the mining of software repositories. Empirical Software Engineering, 2022. 27(1): p. 17.
Kenneally, E. and D. Dittrich, The Menlo Report: Ethical principles guiding information and communication technology research. Available at SSRN 2445102, 2012.