Chapter 7: Data and Feature Management

In this chapter, we will add a feature management data layer to the machine learning platform being built. We will leverage the features of the MLflow Projects module to structure our data pipeline.

Specifically, we will look at the following sections in this chapter:

  • Structuring your data pipeline project
  • Acquiring stock data
  • Checking data quality
  • Managing features

In this chapter, we will acquire relevant data to provide datasets for training. Our primary resource will be Yahoo Finance data for BTC. Alongside that, we will acquire additional supporting datasets.

Leveraging the productionization architecture introduced in Chapter 6, Introducing ML Systems Architecture, and represented in Figure 7.1, the feature and data component is responsible for acquiring data from sources and making it available in a format consumable by the other components of the platform:

...

Technical requirements

For this chapter, you will need the following prerequisites:

In the next section, we will describe the structure of our data pipeline, the data sources, and the different steps that we will execute to implement our practical example leveraging MLflow Projects features...

Structuring your data pipeline project

At a high level, our data pipeline will run weekly, collecting data for the preceding 7 days and storing it in a form that upstream machine learning jobs can consume to generate models. We will structure our data folders into three types of data (a sketch of the corresponding MLproject file follows this list):

  • Raw data: A dataset generated by retrieving the last 90 days of data from the Yahoo Finance API. We will store the data in CSV format – the same format in which it was received from the API. We will log the run in MLflow and record the number of rows collected.
  • Staged data: Over the raw data, we will run quality checks and schema verification, and confirm that the data can be used in production. This data quality information will be logged in MLflow Tracking.
  • Training data: The training data is the final product of the data pipeline. It must be generated from data deemed clean and suitable for running models. It contains the data processed into features that...
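As a hedged sketch of how the MLflow Projects module might tie these three stages together, the following MLproject file (modeled on MLflow's multistep workflow example listed under Further reading) declares one entry point per stage. The entry-point and file names other than load_raw_data.py and feature_set_generation.py, and the conda.yaml file, are illustrative assumptions:

    name: pystock_data_features
    conda_env: conda.yaml
    entry_points:
      load_raw_data:
        command: "python load_raw_data.py"
      check_verify_data:
        command: "python check_verify_data.py"
      feature_set_generation:
        command: "python feature_set_generation.py"
      main:
        command: "python main.py"

A main entry point can then invoke each stage in order (for example, via mlflow.run), so that every stage of the pipeline is tracked as its own MLflow run.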

Acquiring stock data

Our script to acquire the data will be based on the pandas-datareader Python package. It provides a simple abstraction over remote financial APIs that we can leverage later in the pipeline: given a data source such as Yahoo Finance, you provide the stock ticker/pair and a date range, and the data is returned as a DataFrame.

We will now create the load_raw_data.py file, which will be responsible for loading the data and saving it in the raw folder. You can look at the contents of the file in the repository at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter07/psystock-data-features-main/load_raw_data.py. Execute the following steps to implement the file:

  1. We will start by importing the relevant packages:
    import mlflow
    from datetime import date
    from dateutil.relativedelta import relativedelta
    import pprint
    import pandas
    import pandas_datareader.data as web
  2. Next, you should...
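The remaining steps of the script are elided in this excerpt, but as a minimal, self-contained sketch of where it is heading (the BTC-USD ticker spelling, the ./data/raw output path, and the run name are assumptions not confirmed by the excerpt), the body of such a script might look like this:

    import mlflow
    from datetime import date
    from dateutil.relativedelta import relativedelta
    import pandas_datareader.data as web

    if __name__ == "__main__":
        with mlflow.start_run(run_name="load_raw_data"):
            end = date.today()
            # Raw data covers the last 90 days, per the pipeline description
            start = end - relativedelta(days=90)
            # Yahoo Finance returns a DataFrame of daily quotes for the ticker
            btc_df = web.DataReader("BTC-USD", "yahoo", start, end)
            # Persist in CSV, the same format the data arrives in
            btc_df.to_csv("./data/raw/data.csv")
            # Log the number of rows collected for this run in MLflow
            mlflow.log_metric("rows_collected", btc_df.shape[0])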

Checking data quality

Checking data quality as part of your machine learning system is critical to ensuring the integrity and correctness of model training and inference. The principles of software testing and quality assurance should be borrowed and applied to the data layer of machine learning platforms.

From a data quality perspective, there are a few critical dimensions along which to assess and profile a dataset, namely:

  • Schema compliance: Ensuring the data is of the expected types; for example, making sure that numeric columns don't contain other kinds of data
  • Valid data: Assessing whether the values are valid from a business perspective
  • Missing data: Assessing whether all the data needed to run analytics and algorithms is available

For data validation, we will use the Great Expectations Python package (available at https://github.com/great-expectations/great_expectations). It allows making assertions on data with many...
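Although the rest of the explanation is elided here, a hedged sketch of how such assertions might look with the classic pandas-backed API (the ge.from_pandas wrapper) covers the three dimensions above; the Close column name assumes the Yahoo Finance schema:

    import great_expectations as ge
    import pandas as pd

    # Wrap the raw CSV so the DataFrame gains expect_* assertion methods
    df = ge.from_pandas(pd.read_csv("./data/raw/data.csv"))

    # Schema compliance: the closing price must be numeric
    df.expect_column_values_to_be_of_type("Close", "float64")
    # Missing data: every row needs a closing price
    df.expect_column_values_to_not_be_null("Close")
    # Valid data: a traded price can never be negative
    df.expect_column_values_to_be_between("Close", min_value=0, max_value=None)

    # Run all registered expectations and inspect the overall outcome
    results = df.validate()
    print(results["success"])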

Generating a feature set and training data

We will refactor a bit of the code previously developed in our local environment to generate features for training, and add it to the data pipeline of our MLflow project.

We will now create the feature_set_generation.py file, which will be responsible for generating our features and saving them in the training folder, where all the data is valid and ready to be used for ML training. You can look at the contents of the file in the repository at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter07/psystock-data-features-main/feature_set_generation.py:

  1. We need to import the following dependencies:
    import mlflow
    from datetime import date
    from dateutil.relativedelta import relativedelta
    import pprint
    import pandas as pd
    import pandas_datareader
    import pandas_datareader.data as web
    import numpy as np
  2. Before delving into the main component of the code, we'll now proceed to...
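The rest of the step is elided in this excerpt, but as a hedged sketch of one plausible featurization (a 14-day window of day-over-day changes in the closing price; the file paths, column names, and target definition are all assumptions), the core of the script might look like this:

    import pandas as pd

    # Load the staged (quality-checked) data; the path is an assumption
    btc_df = pd.read_csv("./data/staged/data.csv")

    # Features: the previous 14 days' day-over-day percentage changes
    delta = btc_df["Close"].pct_change()
    features = pd.DataFrame({f"delta_t_{i}": delta.shift(i) for i in range(1, 15)})

    # Target: whether the closing price rises on the following day
    features["target"] = (btc_df["Close"].shift(-1) > btc_df["Close"]).astype(int)

    # Only complete rows are valid training examples
    features.dropna().to_csv("./data/training/data.csv", index=False)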

Using a feature store

A feature store is a software layer on top of your data that abstracts the production and management processes for data, providing systems with an interface to retrieve feature sets that can be used for inference or training.

In this section, we will illustrate the concept of a feature store by using Feast, an operational data system for managing and serving machine learning features to models in production:

Figure 7.8 – Feast Architecture (retrieved from https://docs.feast.dev/)

In order to understand how Feast works and how it can fit into your data layer component (code available at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter07/psystock_feature_store), execute the following steps:

  1. Install feast:
    pip install feast==0.10
  2. Initialize a feature repository:
    feast init
  3. Create your feature definitions by replacing the yaml file generated...
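As a hedged sketch of what such a feature definition might look like against the Feast 0.10 API pinned above (the input parameter was renamed batch_source in later releases; the entity, feature names, and Parquet path are all illustrative assumptions):

    from google.protobuf.duration_pb2 import Duration
    from feast import Entity, Feature, FeatureView, ValueType
    from feast.data_source import FileSource

    # Offline source holding our generated features; the path is an assumption
    token_features_source = FileSource(
        path="./data/training/features.parquet",
        event_timestamp_column="event_timestamp",
    )

    # The entity our features are keyed on (e.g. a ticker/pair such as BTC-USD)
    token = Entity(name="token", value_type=ValueType.STRING,
                   description="Ticker/pair identifier")

    # A view grouping the features served for training and inference
    token_features_view = FeatureView(
        name="token_features",
        entities=["token"],
        ttl=Duration(seconds=7 * 86400),  # keep features for one week
        features=[
            Feature(name="delta_t_1", dtype=ValueType.FLOAT),
            Feature(name="delta_t_2", dtype=ValueType.FLOAT),
        ],
        input=token_features_source,  # renamed batch_source in Feast >= 0.11
    )

Running feast apply in the repository then registers these definitions with the feature store.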

Summary

In this chapter, we covered MLflow and its integration with the feature management data layer of our reference architecture. We leveraged the features of the MLflow Projects module to structure our data pipeline.

We introduced the important layer of data and feature management, clarified the need for feature generation, and covered the concepts of data quality, validation, and data preparation.

We applied the different stages of producing a data pipeline to our own project. We then formalized data acquisition and quality checks. In the last section, we introduced the concept of a feature store and how to create and use one.

In the following chapters and the next section of the book, we will focus on applying the data pipeline and features to the process of training models and deploying them in production.

Further reading

In order to further your knowledge, you can consult the documentation at the following link:

https://github.com/mlflow/mlflow/blob/master/examples/multistep_workflow/MLproject
