Chapter 7: Data and Feature Management

In this chapter, we will add a feature management data layer to the machine learning platform being built. We will leverage the features of the MLflow Projects module to structure our data pipeline.

Specifically, we will look at the following sections in this chapter:

  • Structuring your data pipeline project
  • Acquiring stock data
  • Checking data quality
  • Managing features

In this chapter, we will acquire relevant data to provide datasets for training. Our primary resource will be Yahoo Finance data for BTC. Alongside that, we will acquire additional supporting datasets.

Leveraging the productionization architecture introduced in Chapter 6, Introducing ML Systems Architecture, and represented in Figure 7.1, the feature and data component is responsible for acquiring data from sources and making it available in a format consumable by the other components of the platform:

...

Technical requirements

For this chapter, you will need the following prerequisites:

In the next section, we will describe the structure of our data pipeline, the data sources, and the different steps that we will execute to implement our practical example leveraging MLflow Projects features...

Structuring your data pipeline project

At a high level, our data pipeline will run weekly, collecting data for the preceding 7 days and storing it in a form that upstream machine learning jobs can consume to generate models. We will structure our data folders into three types of data (a sketch of the corresponding MLproject file follows this list):

  • Raw data: A dataset generated by retrieving the last 90 days of data from the Yahoo Finance API. We will store the data in CSV format – the same format in which it was received from the API. We will log the run in MLflow and record the number of rows collected.
  • Staged data: Over the raw data, we will run quality checks and schema verification, and confirm that the data can be used in production. This data quality information will be logged in MLflow Tracking.
  • Training data: The training data is the final product of the data pipeline. It must be generated from data deemed clean and suitable for running models. It contains the data processed into features that...
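As a hedged sketch of how the MLflow Projects module might tie these three stages together, the following MLproject file (modeled on MLflow's multistep workflow example listed under Further reading) declares one entry point per stage. The entry-point and file names other than load_raw_data.py and feature_set_generation.py, and the conda.yaml file, are illustrative assumptions:

    name: pystock_data_features
    conda_env: conda.yaml
    entry_points:
      load_raw_data:
        command: "python load_raw_data.py"
      check_verify_data:
        command: "python check_verify_data.py"
      feature_set_generation:
        command: "python feature_set_generation.py"
      main:
        command: "python main.py"

A main entry point can then invoke each stage in order (for example, via mlflow.run), so that every stage of the pipeline is tracked as its own MLflow run.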

Acquiring stock data

Our script to acquire the data will be based on the pandas-datareader Python package. It provides a simple abstraction over remote financial APIs that we can leverage later in the pipeline: given a data source such as Yahoo Finance, you provide the stock ticker/pair and a date range, and the data is returned as a DataFrame.

We will now create the load_raw_data.py file, which will be responsible for loading the data and saving it in the raw folder. You can look at the contents of the file in the repository at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter07/psystock-data-features-main/load_raw_data.py. Execute the following steps to implement the file:

  1. We will start by importing the relevant packages:
    import mlflow
    from datetime import date
    from dateutil.relativedelta import relativedelta
    import pprint
    import pandas
    import pandas_datareader.data as web
  2. Next, you should...
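The remaining steps of the script are elided in this excerpt, but as a minimal, self-contained sketch of where it is heading (the BTC-USD ticker spelling, the ./data/raw output path, and the run name are assumptions not confirmed by the excerpt), the body of such a script might look like this:

    import mlflow
    from datetime import date
    from dateutil.relativedelta import relativedelta
    import pandas_datareader.data as web

    if __name__ == "__main__":
        with mlflow.start_run(run_name="load_raw_data"):
            end = date.today()
            # Raw data covers the last 90 days, per the pipeline description
            start = end - relativedelta(days=90)
            # Yahoo Finance returns a DataFrame of daily quotes for the ticker
            btc_df = web.DataReader("BTC-USD", "yahoo", start, end)
            # Persist in CSV, the same format the data arrives in
            btc_df.to_csv("./data/raw/data.csv")
            # Log the number of rows collected for this run in MLflow
            mlflow.log_metric("rows_collected", btc_df.shape[0])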

Checking data quality

Checking data quality as part of your machine learning system is critical to ensuring the integrity and correctness of model training and inference. The principles of software testing and quality assurance should be borrowed and applied to the data layer of machine learning platforms.

From a data quality perspective, there are a few critical dimensions along which to assess and profile a dataset, namely:

  • Schema compliance: Ensuring the data is of the expected types; for example, making sure that numeric columns don't contain other kinds of data
  • Valid data: Assessing whether the values are valid from a business perspective
  • Missing data: Assessing whether all the data needed to run analytics and algorithms is available

For data validation, we will use the Great Expectations Python package (available at https://github.com/great-expectations/great_expectations). It allows making assertions on data with many...
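Although the rest of the explanation is elided here, a hedged sketch of how such assertions might look with the classic pandas-backed API (the ge.from_pandas wrapper) covers the three dimensions above; the Close column name assumes the Yahoo Finance schema:

    import great_expectations as ge
    import pandas as pd

    # Wrap the raw CSV so the DataFrame gains expect_* assertion methods
    df = ge.from_pandas(pd.read_csv("./data/raw/data.csv"))

    # Schema compliance: the closing price must be numeric
    df.expect_column_values_to_be_of_type("Close", "float64")
    # Missing data: every row needs a closing price
    df.expect_column_values_to_not_be_null("Close")
    # Valid data: a traded price can never be negative
    df.expect_column_values_to_be_between("Close", min_value=0, max_value=None)

    # Run all registered expectations and inspect the overall outcome
    results = df.validate()
    print(results["success"])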

Generating a feature set and training data

We will refactor a bit of the code previously developed in our local environment to generate features for training, and add it to the data pipeline of our MLflow project.

We will now create the feature_set_generation.py file, which will be responsible for generating our features and saving them in the training folder, where all the data is valid and ready to be used for ML training. You can look at the contents of the file in the repository at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/blob/master/Chapter07/psystock-data-features-main/feature_set_generation.py:

  1. We need to import the following dependencies:
    import mlflow
    from datetime import date
    from dateutil.relativedelta import relativedelta
    import pprint
    import pandas as pd
    import pandas_datareader
    import pandas_datareader.data as web
    import numpy as np
  2. Before delving into the main component of the code, we'll now proceed to...
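The rest of the step is elided in this excerpt, but as a hedged sketch of one plausible featurization (a 14-day window of day-over-day changes in the closing price; the file paths, column names, and target definition are all assumptions), the core of the script might look like this:

    import pandas as pd

    # Load the staged (quality-checked) data; the path is an assumption
    btc_df = pd.read_csv("./data/staged/data.csv")

    # Features: the previous 14 days' day-over-day percentage changes
    delta = btc_df["Close"].pct_change()
    features = pd.DataFrame({f"delta_t_{i}": delta.shift(i) for i in range(1, 15)})

    # Target: whether the closing price rises on the following day
    features["target"] = (btc_df["Close"].shift(-1) > btc_df["Close"]).astype(int)

    # Only complete rows are valid training examples
    features.dropna().to_csv("./data/training/data.csv", index=False)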

Using a feature store

A feature store is a software layer on top of your data that abstracts the production and management processes for data, providing systems with an interface to retrieve feature sets that can be used for inference or training.

In this section, we will illustrate the concept of a feature store by using Feast, an operational data system for managing and serving machine learning features to models in production:

Figure 7.8 – Feast Architecture (retrieved from https://docs.feast.dev/)

In order to understand how Feast works and how it can fit into your data layer component (code available at https://github.com/PacktPublishing/Machine-Learning-Engineering-with-MLflow/tree/master/Chapter07/psystock_feature_store), execute the following steps:

  1. Install feast:
    pip install feast==0.10
  2. Initialize a feature repository:
    feast init
  3. Create your feature definitions by replacing the yaml file generated...
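As a hedged sketch of what such a feature definition might look like against the Feast 0.10 API pinned above (the input parameter was renamed batch_source in later releases; the entity, feature names, and Parquet path are all illustrative assumptions):

    from google.protobuf.duration_pb2 import Duration
    from feast import Entity, Feature, FeatureView, ValueType
    from feast.data_source import FileSource

    # Offline source holding our generated features; the path is an assumption
    token_features_source = FileSource(
        path="./data/training/features.parquet",
        event_timestamp_column="event_timestamp",
    )

    # The entity our features are keyed on (e.g. a ticker/pair such as BTC-USD)
    token = Entity(name="token", value_type=ValueType.STRING,
                   description="Ticker/pair identifier")

    # A view grouping the features served for training and inference
    token_features_view = FeatureView(
        name="token_features",
        entities=["token"],
        ttl=Duration(seconds=7 * 86400),  # keep features for one week
        features=[
            Feature(name="delta_t_1", dtype=ValueType.FLOAT),
            Feature(name="delta_t_2", dtype=ValueType.FLOAT),
        ],
        input=token_features_source,  # renamed batch_source in Feast >= 0.11
    )

Running feast apply in the repository then registers these definitions with the feature store.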

Summary

In this chapter, we covered MLflow and its integration with the feature management data layer of our reference architecture. We leveraged the features of the MLflow Projects module to structure our data pipeline.

We introduced the important layer of data and feature management, clarified the need for feature generation, and covered the concepts of data quality, validation, and data preparation.

We applied the different stages of producing a data pipeline to our own project. We then formalized data acquisition and quality checks. In the last section, we introduced the concept of a feature store and how to create and use one.

In the following chapters and the next section of the book, we will focus on applying the data pipeline and features to the process of training models and deploying them in production.

Further reading

In order to further your knowledge, you can consult the documentation at the following link:

https://github.com/mlflow/mlflow/blob/master/examples/multistep_workflow/MLproject
