You're reading from Modern Data Architectures with Python (Packt, September 2023, 1st edition, ISBN-13: 9781801070492).

Author: Brian Lipp

Brian Lipp is a technology polyglot, engineer, and solution architect with a wide skill set spanning many technology domains. His programming background ranges from R, Python, and Scala to Go and Rust development. He has worked on big data systems, data lakes, data warehouses, and backend software engineering. Brian earned a Master of Science in CSIS from Pace University in 2009. He is currently a senior data engineer working with large tech firms to build data ecosystems.

MLOps

ML and AI have been among the most important topics of the past several years. MLOps is the practice of productionizing the ML products that come out of data science research. MLOps matters not only for reusability but also for clear and accurate data science. In this chapter, we will go through the ins and outs of MLflow, a popular MLOps tool that manages every stage of your data science projects and experiments. We will also cover AutoML, an automated way to get reasonable ML models, and feature stores, which are data management systems that version data for historical and reproducibility purposes.

In this chapter, we’re going to cover the following main topics:

  • The basics of machine learning
  • MLflow
  • Hyperopt
  • AutoML
  • Feature stores

Technical requirements

The tooling that will be used in this chapter is tied to the tech stack that’s been chosen for this book. All vendors should offer a free trial account.

I will be using Databricks in this chapter.

Setting up your environment

Before we begin this chapter, let’s take some time to set up our working environment.

Python, AWS, and Databricks

As in the previous chapter, this chapter assumes you have a working version of Python 3.6+ installed in your development environment. It also assumes you have an AWS account and have set up Databricks with that account.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything was installed correctly. If the following command prints the tool’s version, everything is working:

databricks -v

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will ask for the host (workspace URL) of your Databricks instance, as well as the token you just created:

databricks configure --token
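
Once configured, the CLI stores its credentials in a profile file. As a hedged illustration (the host and token values here are placeholders, not real credentials), the resulting ~/.databrickscfg typically looks something like this:

[DEFAULT]
host = https://your-workspace.cloud.databricks.com
token = dapiXXXXXXXXXXXXXXXXXXXX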

Introduction to machine learning

ML is a discipline that draws heavily on statistics. We will go through the basics of ML at a high level so that we can appreciate the tooling covered later in this chapter.

Understanding data

ML is the process of applying a learning algorithm to a set of historical data to predict things that are unknown, such as recognizing images or forecasting future events, to name a few examples. When you’re feeding data into your ML model, you will use features. A feature is simply another term for an input variable in your data. Data is the oil that runs ML, so we will talk about it first.

Types of data

Data can come in two forms:

  • Quantitative data: Quantitative data is data that can be measured numerically. Age and height are good examples of quantitative data. Quantitative data comes in two flavors: discrete and continuous. Discrete data is countable and finite, or limited to a fixed range of values. An example...

Understanding hyperparameters and parameters

When we start training our ML models, we will generally have two types of “knobs” to tinker with. The first knob is normally handled by the modeling software – these are the parameters of the model. Each modeling technique has parameters that are learned from the data during training. It’s useful to understand what the parameters are when you train your model so that you can see how it compares to other models.

On the other hand, every time you train your model, you can set varying hyperparameters for the software. A simple example is that, in random forest model training, you can set things such as the number of trees and the number of features considered at each split. You would need to search through all the varying combinations across your cross-validation folds to find the best combination for your validation dataset. Many software packages will do most of that heavy lifting for you now, using a grid search or a random...
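
To make this concrete, here is a minimal sketch (not code from the chapter; the toy data and grid values are illustrative) of a cross-validated grid search over random forest hyperparameters with scikit-learn:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy regression data standing in for your real feature matrix and target
X, y = make_regression(n_samples=200, n_features=10, random_state=42)

# Hyperparameters chosen before training: number of trees and features per split
param_grid = {"n_estimators": [50, 100, 200], "max_features": ["sqrt", 1.0]}

# Evaluate every combination with 5-fold cross-validation and keep the best one
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)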

AutoML

AutoML is the process of giving an AutoML program your data and having it try to find the best algorithm and hyperparameters for you. This can often give you good general results, but even a well-run AutoML job typically requires further fine-tuning. AutoML can be expensive, so it’s important to watch your bill and plan accordingly. Databricks offers a very useful AutoML feature. Databricks AutoML performs data preparation, trains several models, and then finds the best-trained models. Databricks AutoML uses a variety of the most popular ML libraries in its evaluation. So, will AutoML give you the best model possible? No – it’s not going to replace the need for further feature engineering and model tuning. Instead, it’s going to take a chunk of the work off your plate and give you a reasonable model to start with. In some cases, that model will be good enough for what you need.
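
As a hedged sketch of the Python API (the DataFrame, target column, and timeout here are assumptions, not values from the chapter), kicking off a Databricks AutoML regression run looks roughly like this:

from databricks import automl

# train_df is a hypothetical Spark DataFrame with a numeric "price" column
summary = automl.regress(train_df, target_col="price", timeout_minutes=30)

# The returned summary points at the best trial and its MLflow run
print(summary.best_trial.mlflow_run_id)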

Note

You can learn more about AutoML by going to https://learn...

MLflow

MLflow is an open source MLOps framework that aims to be compatible with whatever other tools you might have. It has libraries in many popular languages and can also be accessed via a REST API. MLflow has very few opinions and will not dictate anything about your tech stack. You don’t even need to use the provided libraries; you could choose to interact purely via the REST API. This allows teams to customize and pick whatever other tooling they wish. If you want a hosted MLflow, you can use Databricks’ managed MLflow.
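
As a minimal, hedged sketch of the tracking API (the run name, parameter, and metric values are illustrative, not the chapter’s), logging an experiment run looks roughly like this:

import mlflow

# Start a run and record hyperparameters and metrics for later comparison
with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("n_estimators", 100)   # a hyperparameter of interest
    mlflow.log_metric("rmse", 0.42)         # an evaluation metric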

MLOps benefits

There are many useful benefits to MLOps, and they often center on models and features. MLOps tooling will often record the details of a series of experiments. This documentation can include metrics around model training and performance. MLOps tooling often stores your model and, in some cases, has mechanisms for users to interact with that model. Another useful area of MLOps is around features, which often need to be...

Feature stores

A feature store is a repository of features that have been created and versioned and are ready for model training. Recording features and storing them is critical for reproducibility. I have seen many cases where data scientists have created models with no documentation on how to retrain them other than a mess of complex code. A feature store is a catalog of features, similar to a model store. Feature stores are normally organized into databases and feature tables.

Let’s jump right in and go through Databricks feature store’s APIs:

  1. First, let’s import the necessary libraries:
    from databricks import feature_store
    from databricks.feature_store import FeatureLookup
    import random
  2. Now, let’s define the feature table’s name, recording its database and schema. We are using the users DataFrame. We will also set our lookup_key, which in this case is user_id. A lookup key is simply the value that identifies a record in the feature table when we’re searching for...
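
To give a feel for where these steps lead, here is a hedged sketch (the table name, database, and DataFrame are assumptions, and the exact API can vary with your Databricks runtime) of registering and looking up a feature table:

fs = feature_store.FeatureStoreClient()

# Register the users DataFrame as a feature table keyed by user_id
fs.create_table(
    name="ml_features.users",        # hypothetical database.table name
    primary_keys=["user_id"],
    df=users,
    description="User features for model training",
)

# Later, individual features can be joined into training sets by key
lookup = FeatureLookup(table_name="ml_features.users", lookup_key="user_id")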

Practical lab

So, the first problem is to create a REST API with fake data that we can use for predictions.

For this, I have used mockaroo.com.

Here is the schema I created with Mockaroo:

Figure 6.2: Setting fake data

A sample of the data looks like this:

Figure 6.3: Fake data output

Mockaroo allows you to create a free API – all you need to do is hit Create API at the bottom of the schema window.

Next, we will use Python to pull the data and prepare it for modeling.

First, we will import the necessary libraries:

import requests
import pandas as pd
import io
import mlflow
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import numpy as np

Next, we will use the requests package to send a GET request to our new Mockaroo API:

url = "https://my.api.mockaroo.com/chapter_6.json"

Note that you must put...
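
As a hedged sketch of how the retrieval step might look (the query parameter name and placeholder key are my assumptions, not the chapter’s code), pulling the response into a pandas DataFrame could be done like this:

# Hypothetical: pass your Mockaroo API key with the request
response = requests.get(url, params={"key": "YOUR_MOCKAROO_API_KEY"})
response.raise_for_status()

# The endpoint returns JSON records, which pandas can load directly
df = pd.read_json(io.StringIO(response.text))
print(df.head())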

Summary

Phew – that was a ton of information! It’s been a long chapter, so let’s recap what we covered. First, we looked at the basics of ML and how it can be used. Then, we looked at the importance and usage of data in ML. We explored AutoML and managing ML projects with MLflow. Lastly, we looked at how we can better manage our training data with a feature store. In the next chapter, we will look at managing our data workflows.
