Chapter 3: Code Meets Data

In this chapter, we'll get started with hands-on MLOps implementation as we learn by solving a business problem using the MLOps workflow discussed in the previous chapter. We'll also discuss effective methods of source code management for machine learning (ML), explore data quality characteristics, and analyze and shape data for an ML solution.

We begin this chapter by categorizing the business problem in order to curate a best-fit MLOps solution for it. Following this, we'll set up the resources and tools required to implement the solution. We then discuss 10 guiding principles of source code management for ML that promote clean code practices, and look at what constitutes good-quality data for ML, followed by processing a dataset related to the business problem and ingesting and versioning it to the ML workspace. Most of the chapter is hands-on and designed to equip you with a solid understanding of and practical experience with MLOps. For this...

Business problem analysis and categorizing the problem

In the previous chapter, we looked at the following business problem statement. In this section, we will demystify the problem statement by categorizing it to curate an implementation roadmap. We will glance at the dataset given to us, decide which type of ML model will address the business problem efficiently, and lastly categorize the MLOps approach for implementing robust and scalable ML operations and decide on the tools for implementation.

Here is the problem statement:

You work as a data scientist with a small team of data scientists for a cargo shipping company based in Finland. 90% of goods are imported into Finland via cargo shipping. You are tasked with saving 20% of the costs for cargo operations at the port of Turku, Finland. This can be achieved by developing an ML solution that predicts weather conditions at the port 4 hours in advance. You need to...

Setting up the resources and tools

If you have these tools already installed and set up on your PC, feel free to skip this section; otherwise, follow the detailed instructions to get them up and running. 

Installing MLflow

We get started by installing MLflow, which is an open source platform for managing the ML life cycle, including experimentation, reproducibility, deployment, and a central model registry.

To install MLflow, go to your terminal and execute the following command:

pip3 install mlflow

After a successful installation, test it by executing the following command to start the MLflow tracking UI:

mlflow ui

When you run the MLflow tracking UI, a server starts listening on port 5000 on your machine and prints output like the following:

[2021-03-11 14:34:23 +0200] [43819] [INFO] Starting gunicorn 20.0.4
[2021-03-11 14:34:23 +0200] [43819] [INFO] Listening at: http://127.0.0.1:5000 (43819)
[2021-03-11 14:34:23 +0200...
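
To confirm that tracking works end to end, here is a minimal, hypothetical sketch that logs a parameter and a metric to the local tracking server; the experiment name and the logged values are illustrative, not from the book:

import mlflow

# Point the client at the local tracking server started by `mlflow ui`
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("smoke-test")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)   # illustrative hyperparameter
    mlflow.log_metric("rmse", 0.87)  # illustrative metric value

The run should then appear in the tracking UI at http://127.0.0.1:5000.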

10 principles of source code management for ML

Here are 10 principles that you can apply to your code to ensure its quality, robustness, and scalability:

  • Modularity: It is better to have modular code than to have one big chunk. Modularity encourages reusability and facilitates upgrading by replacing the required components. To avoid needless complexity and repetition, follow this golden rule:

    Two or more ML components should be paired only when one of them uses the other. If none of them uses each other, then pairing should be avoided.

    An ML component that is not tightly paired with its environment can be more easily modified or replaced than a tightly paired component.

  • Single-task dedicated functions: Functions are important building blocks of pipelines and the system; they are small sections of code used to perform particular tasks. The purpose of functions is to avoid repetition of commands and enable reusable code (as sketched below). They can easily become a complex...
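
To illustrate the modularity and single-task principles, here is a hypothetical sketch; the function names, file name, and column name are illustrative, not from the book:

import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    """Load raw data from a CSV file; does nothing else."""
    return pd.read_csv(path)

def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with missing values; one small, replaceable step."""
    return df.dropna()

def encode_category(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Encode a single text column as integer codes."""
    df = df.copy()
    df[column] = df[column].astype("category").cat.codes
    return df

# Each function performs one task, so any step can be swapped out
# without touching the others (hypothetical file and column names)
df = encode_category(drop_missing_rows(load_data("weather.csv")), "weather_condition")

Because none of these functions depends on another's internals, each component can be modified or replaced independently.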

What is good data for ML?

Good ML models are the result of training on good-quality data. A prerequisite before proceeding to ML training is to have good-quality data, so we need to process the data to increase its quality. Determining the quality of data is therefore essential. Five characteristics will enable us to discern the quality of data, as follows:

  • Accuracy: Accuracy is a crucial characteristic of data quality, as inaccurate data can lead to poor ML model performance and real-life consequences. To check the accuracy of the data, confirm whether the information represents a real-life situation.
  • Completeness: In most cases, incomplete information is unusable and can lead to incorrect outcomes if an ML model is trained on it. It is vital to check the comprehensiveness of the data.
  • Reliability: Contradictions or duplications in data make it unreliable. Reliability is a vital characteristic; trusting the data is essential...

Data preprocessing

Raw data cannot be passed directly to the ML model for training. We have to refine or preprocess it before training the ML model. To further analyze the imported data, we will perform a series of preprocessing steps to shape it for ML training. We start by assessing the quality of the data, checking for accuracy, completeness, reliability, relevance, and timeliness. After this, we calibrate the required data and encode text into numerical data, which is ideal for ML training. Lastly, we will analyze correlations and time series, and filter out data irrelevant to training ML models.
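
As a preview of the correlation analysis step, here is a minimal sketch using standard pandas; the feature columns and values are hypothetical, not from the book's dataset:

import pandas as pd

# Hypothetical numeric weather features
df = pd.DataFrame({
    "temperature": [3.1, 2.4, 5.0, 4.2],
    "humidity": [88, 92, 75, 80],
    "wind_speed": [5.5, 7.1, 3.2, 4.0],
})

# Pairwise Pearson correlations; highly correlated features are
# candidates for removal to avoid redundancy
print(df.corr())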

Data quality assessment

To assess the quality of the data, we look for accuracy, completeness, reliability, relevance, and timeliness. First, let's check whether the data is complete and reliable by assessing its formats, cumulative statistics, and anomalies such as missing data. We use pandas functions as follows:

df.describe(...
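
Here is a minimal sketch of such checks, assuming the data has already been read into a pandas DataFrame named df; these are standard pandas calls, not necessarily the book's exact sequence:

# Cumulative statistics (count, mean, std, min/max, quartiles) per column
print(df.describe())

# Column data types, to confirm the expected formats
print(df.dtypes)

# Missing values per column, to check completeness
print(df.isnull().sum())

# Duplicate rows, which can make the data unreliable
print(df.duplicated().sum())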

Data registration and versioning

It is vital to register and version the data in the workspace before starting ML training, as this enables us to trace our experiments or ML models back to the source data used to train them. Versioning the data lets us backtrack at any point, replicate a model's training, or explain the model's workings with respect to the inference or testing data. For these reasons, we will register and version the processed data to the Azure Machine Learning workspace using the Azure Machine Learning SDK as follows:

from azureml.core import Workspace

subscription_id = '---insert your subscription ID here----'
resource_group = 'Learn_MLOps'
workspace_name = 'MLOps_WS'
workspace = Workspace(subscription_id, resource_group, workspace_name)

Fetch your subscription ID, resource_group and workspace_name from the Azure...
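
After connecting to the workspace, the processed data can be registered as a versioned dataset. Here is a minimal sketch using the Azure Machine Learning SDK's register_pandas_dataframe method, assuming the processed data sits in a pandas DataFrame named processed_df; the dataset name is hypothetical:

from azureml.core import Dataset

# Upload the dataframe to the workspace's default datastore and
# register it as a versioned tabular dataset
datastore = workspace.get_default_datastore()
dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=processed_df,
    target=datastore,
    name='processed_weather_data',  # hypothetical dataset name
    show_progress=True
)

Re-registering under the same name later creates a new version of the dataset, which is what enables the backtracking described above.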

Toward the ML Pipeline

So far, we have processed the data by handling irregularities such as missing data, selected features by observing correlations, created new features, and finally ingested and versioned the processed data to the Machine Learning workspace. There are two ways to fuel data ingestion for ML model training in the ML pipeline: one is from central storage (where all your raw data is stored), and the other is using a feature store. As knowledge is power, let's get to know the feature store before we move on to the ML pipeline.

Feature Store

A feature store complements the central storage by storing important features and making them available for training or inference. A feature store is where you transform raw data into useful features that ML models can use directly to train and to make predictions at inference time. Raw data typically comes from various data sources, which are structured, unstructured, streaming, batch, and real-time...

Summary

In this chapter, we learned how to identify a suitable ML solution for a business problem and categorize its operations to implement suitable MLOps. We set up our tools, resources, and development environment, discussed 10 principles of source code management, and covered data quality characteristics. Congrats! So far, you have implemented a critical building block of the MLOps workflow: processing data and registering the processed data to the workspace. Lastly, we had a glimpse into the essentials of the ML pipeline.

In the next chapter, you will do the most exciting part of MLOps: building the ML pipeline. Let's press on!
