Chapter 3: Code Meets Data

In this chapter, we'll get started with hands-on MLOps implementation as we learn by solving a business problem using the MLOps workflow discussed in the previous chapter. We'll also discuss effective methods of source code management for machine learning (ML), explore data quality characteristics, and analyze and shape data for an ML solution.

We begin this chapter by categorizing the business problem in order to curate a best-fit MLOps solution for it. Following this, we'll set up the resources and tools required to implement the solution. We then discuss 10 guiding principles of source code management for ML that promote clean code practices, and look at what constitutes good-quality data for ML, followed by processing a dataset related to the business problem and ingesting and versioning it to the ML workspace. Most of the chapter is hands-on and designed to equip you with a solid understanding of and practical experience with MLOps. For this...

Business problem analysis and categorizing the problem

In the previous chapter, we looked at the following business problem statement. In this section, we will demystify the problem statement by categorizing it to curate an implementation roadmap. We will glance at the dataset given to us, decide which type of ML model will address the business problem efficiently, and lastly categorize the MLOps approach for implementing robust and scalable ML operations and decide on the tools for implementation.

Here is the problem statement:

You work as a data scientist with a small team of data scientists for a cargo shipping company based in Finland. 90% of goods are imported into Finland via cargo shipping. You are tasked with saving 20% of the costs for cargo operations at the port of Turku, Finland. This can be achieved by developing an ML solution that predicts weather conditions at the port 4 hours in advance. You need to...

Setting up the resources and tools

If you have these tools already installed and set up on your PC, feel free to skip this section; otherwise, follow the detailed instructions to get them up and running. 

Installing MLflow

We get started by installing MLflow, which is an open source platform for managing the ML life cycle, including experimentation, reproducibility, deployment, and a central model registry.

To install MLflow, go to your terminal and execute the following command:

pip3 install mlflow

After a successful installation, test it by executing the following command to start the MLflow tracking UI:

mlflow ui

When you run the MLflow tracking UI, a server starts listening on port 5000 on your machine and prints output like the following:

[2021-03-11 14:34:23 +0200] [43819] [INFO] Starting gunicorn 20.0.4
[2021-03-11 14:34:23 +0200] [43819] [INFO] Listening at: http://127.0.0.1:5000 (43819)
[2021-03-11 14:34:23 +0200...
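
To confirm that tracking works end to end, here is a minimal, hypothetical sketch that logs a parameter and a metric to the local tracking server; the experiment name and the logged values are illustrative, not from the book:

import mlflow

# Point the client at the local tracking server started by `mlflow ui`
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("smoke-test")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)   # illustrative hyperparameter
    mlflow.log_metric("rmse", 0.87)  # illustrative metric value

The run should then appear in the tracking UI at http://127.0.0.1:5000.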

10 principles of source code management for ML

Here are 10 principles that you can apply to your code to ensure its quality, robustness, and scalability:

  • Modularity: It is better to have modular code than to have one big chunk. Modularity encourages reusability and facilitates upgrading by replacing the required components. To avoid needless complexity and repetition, follow this golden rule:

    Two or more ML components should be paired only when one of them uses the other. If none of them uses each other, then pairing should be avoided.

    An ML component that is not tightly paired with its environment can be more easily modified or replaced than a tightly paired component.

  • Single-task dedicated functions: Functions are important building blocks of pipelines and the system; they are small sections of code used to perform particular tasks. The purpose of functions is to avoid repetition of commands and enable reusable code (as sketched below). They can easily become a complex...
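
To illustrate the modularity and single-task principles, here is a hypothetical sketch; the function names, file name, and column name are illustrative, not from the book:

import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    """Load raw data from a CSV file; does nothing else."""
    return pd.read_csv(path)

def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with missing values; one small, replaceable step."""
    return df.dropna()

def encode_category(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Encode a single text column as integer codes."""
    df = df.copy()
    df[column] = df[column].astype("category").cat.codes
    return df

# Each function performs one task, so any step can be swapped out
# without touching the others (hypothetical file and column names)
df = encode_category(drop_missing_rows(load_data("weather.csv")), "weather_condition")

Because none of these functions depends on another's internals, each component can be modified or replaced independently.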

What is good data for ML?

Good ML models are the result of training on good-quality data. A prerequisite before proceeding to ML training is to have good-quality data, so we need to process the data to increase its quality. Determining the quality of data is therefore essential. Five characteristics will enable us to discern the quality of data, as follows:

  • Accuracy: Accuracy is a crucial characteristic of data quality, as inaccurate data can lead to poor ML model performance and real-life consequences. To check the accuracy of the data, confirm whether the information represents a real-life situation.
  • Completeness: In most cases, incomplete information is unusable and can lead to incorrect outcomes if an ML model is trained on it. It is vital to check the comprehensiveness of the data.
  • Reliability: Contradictions or duplications in data make it unreliable. Reliability is a vital characteristic; trusting the data is essential...

Data preprocessing

Raw data cannot be passed directly to the ML model for training. We have to refine or preprocess it before training the ML model. To further analyze the imported data, we will perform a series of preprocessing steps to shape it for ML training. We start by assessing the quality of the data, checking for accuracy, completeness, reliability, relevance, and timeliness. After this, we calibrate the required data and encode text into numerical data, which is ideal for ML training. Lastly, we will analyze correlations and time series, and filter out data irrelevant to training ML models.
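
As a preview of the correlation analysis step, here is a minimal sketch using standard pandas; the feature columns and values are hypothetical, not from the book's dataset:

import pandas as pd

# Hypothetical numeric weather features
df = pd.DataFrame({
    "temperature": [3.1, 2.4, 5.0, 4.2],
    "humidity": [88, 92, 75, 80],
    "wind_speed": [5.5, 7.1, 3.2, 4.0],
})

# Pairwise Pearson correlations; highly correlated features are
# candidates for removal to avoid redundancy
print(df.corr())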

Data quality assessment

To assess the quality of the data, we look for accuracy, completeness, reliability, relevance, and timeliness. First, let's check whether the data is complete and reliable by assessing its formats, cumulative statistics, and anomalies such as missing data. We use pandas functions as follows:

df.describe(...
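
Here is a minimal sketch of such checks, assuming the data has already been read into a pandas DataFrame named df; these are standard pandas calls, not necessarily the book's exact sequence:

# Cumulative statistics (count, mean, std, min/max, quartiles) per column
print(df.describe())

# Column data types, to confirm the expected formats
print(df.dtypes)

# Missing values per column, to check completeness
print(df.isnull().sum())

# Duplicate rows, which can make the data unreliable
print(df.duplicated().sum())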

Data registration and versioning

It is vital to register and version the data in the workspace before starting ML training, as this enables us to trace our experiments or ML models back to the source data used to train them. Versioning the data lets us backtrack at any point, replicate a model's training, or explain the model's workings with respect to the inference or testing data. For these reasons, we will register and version the processed data to the Azure Machine Learning workspace using the Azure Machine Learning SDK as follows:

from azureml.core import Workspace

subscription_id = '---insert your subscription ID here----'
resource_group = 'Learn_MLOps'
workspace_name = 'MLOps_WS'
workspace = Workspace(subscription_id, resource_group, workspace_name)

Fetch your subscription ID, resource_group and workspace_name from the Azure...
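
After connecting to the workspace, the processed data can be registered as a versioned dataset. Here is a minimal sketch using the Azure Machine Learning SDK's register_pandas_dataframe method, assuming the processed data sits in a pandas DataFrame named processed_df; the dataset name is hypothetical:

from azureml.core import Dataset

# Upload the dataframe to the workspace's default datastore and
# register it as a versioned tabular dataset
datastore = workspace.get_default_datastore()
dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=processed_df,
    target=datastore,
    name='processed_weather_data',  # hypothetical dataset name
    show_progress=True
)

Re-registering under the same name later creates a new version of the dataset, which is what enables the backtracking described above.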

Toward the ML Pipeline

So far, we have processed the data by handling irregularities such as missing data, selected features by observing correlations, created new features, and finally ingested and versioned the processed data to the Machine Learning workspace. There are two ways to fuel data ingestion for ML model training in the ML pipeline: one is from central storage (where all your raw data is stored), and the other is using a feature store. As knowledge is power, let's get to know the feature store before we move on to the ML pipeline.

Feature Store

A feature store complements the central storage by storing important features and making them available for training or inference. A feature store is where you transform raw data into useful features that ML models can use directly to train and to make predictions at inference time. Raw data typically comes from various data sources, which are structured, unstructured, streaming, batch, and real-time...

Summary

In this chapter, we learned how to identify a suitable ML solution for a business problem and categorize its operations to implement suitable MLOps. We set up our tools, resources, and development environment, discussed 10 principles of source code management, and covered data quality characteristics. Congrats! So far, you have implemented a critical building block of the MLOps workflow: processing data and registering the processed data to the workspace. Lastly, we had a glimpse into the essentials of the ML pipeline.

In the next chapter, you will do the most exciting part of MLOps: building the ML pipeline. Let's press on!
