You're reading from Machine Learning Infrastructure and Best Practices for Software Engineers

Product type Book

Published in Jan 2024

Publisher Packt

ISBN-13 9781837634064

Pages 346 pages

Edition 1st Edition

Languages

Python

Concepts

Machine Learning

Author (1):

Miroslaw Staron

Table of Contents (24) Chapters

Preface

1. Part 1: Machine Learning Landscape in Software Engineering

2. Machine Learning Compared to Traditional Software

3. Elements of a Machine Learning System

4. Data in Software Systems – Text, Images, Code, and Their Annotations

5. Data Acquisition, Data Quality, and Noise

6. Quantifying and Improving Data Properties

7. Part 2: Data Acquisition and Management

8. Processing Data in Machine Learning Systems

9. Feature Engineering for Numerical and Image Data

10. Feature Engineering for Natural Language Data

11. Part 3: Design and Development of ML Systems

12. Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)

13. Training and Evaluating Classical Machine Learning Systems and Neural Networks

14. Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders

15. Designing Machine Learning Pipelines (MLOps) and Their Testing

16. Designing and Implementing Large-Scale, Robust ML Software

17. Part 4: Ethical Aspects of Data Management and ML System Development

18. Ethics in Data Acquisition and Management

19. Ethics in Machine Learning Systems

20. Integrating ML Systems in Ecosystems

21. Summary and Where to Go Next

22. Index

23. Other Books You May Enjoy

Summary

Data is crucial for machine learning systems – without data, there can be no machine learning systems. Most machine learning literature starts the process of training models with data that is already in tabular form. In software engineering, however, tabular data is an intermediate step: the data must first be collected from source systems and then processed.

In this chapter, we learned how to access data from modern software engineering systems such as Gerrit, GitHub, JIRA, and Git. The code included in this chapter illustrates how to collect data that can be used in the next step of the machine learning pipeline – feature extraction. We’ll focus on feature extraction in the next chapter.
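As a reminder of what such collection can look like, here is a minimal sketch (not the chapter's own code) that turns raw Git history into tabular rows. It assumes the log was exported with `git log --pretty=format:"%H|%an|%aI|%s"`; the `SAMPLE_LOG` text, the `parse_git_log` helper, and the column names are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical sample of `git log --pretty=format:"%H|%an|%aI|%s"` output.
# The hashes, authors, and messages below are made up for illustration.
SAMPLE_LOG = """\
a1b2c3d|Alice|2023-05-01T10:15:00+02:00|Fix null check in parser
e4f5a6b|Bob|2023-05-02T08:30:00+00:00|Add retry | backoff to client
"""


def parse_git_log(text):
    """Convert pipe-separated git log lines into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # maxsplit=3 keeps any "|" inside the commit subject intact
        commit, author, date, subject = line.split("|", 3)
        rows.append({
            "commit": commit,
            "author": author,
            "date": datetime.fromisoformat(date),  # %aI is strict ISO 8601
            "subject": subject,
        })
    return rows


rows = parse_git_log(SAMPLE_LOG)
for row in rows:
    print(row["commit"], row["author"], row["date"].date(), row["subject"])
```

The resulting list of dictionaries is one possible tabular form that a feature extraction step could consume. Note that this sketch would misparse an author name containing `|`; a separator that cannot occur in the data would be a safer choice in practice.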

Collecting data is not the only preprocessing step that is required to design and develop a reliable software system. Quantifying and monitoring information (and data) quality is equally important. We need to check that the data is fresh (timely) and that there are no problems in preprocessing...
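A freshness check of this kind can be sketched in a few lines. This is an illustrative example under stated assumptions, not the book's implementation: the `is_fresh` function, the one-day `max_age` threshold, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_record_time, now, max_age=timedelta(days=1)):
    """Return True if the newest record is no older than max_age."""
    return now - latest_record_time <= max_age


# Hypothetical timestamps; in a real pipeline, `now` would come from
# datetime.now(timezone.utc) and the record time from the collected data.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
recent = datetime(2024, 1, 10, 6, 0, tzinfo=timezone.utc)
stale = datetime(2024, 1, 7, 6, 0, tzinfo=timezone.utc)

print(is_fresh(recent, now))
print(is_fresh(stale, now))
```

Passing `now` as a parameter, rather than reading the clock inside the function, keeps the check deterministic and easy to test.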