You're reading from Modern Data Architectures with Python (Packt, September 2023, 1st edition, ISBN-13: 9781801070492).

Author: Brian Lipp

Brian Lipp is a technology polyglot, engineer, and solution architect with a wide skill set spanning many technology domains. His programming background ranges from R, Python, and Scala to Go and Rust development. He has worked on big data systems, data lakes, data warehouses, and backend software engineering. Brian earned a Master of Science in CSIS from Pace University in 2009. He is currently a senior data engineer working with large tech firms to build data ecosystems.

MLOps

ML and AI have been among the most important topics of the past several years. MLOps is the practice of productionizing the ML products that come out of data science research. MLOps matters not only for reusability but also for clear and accurate data science. In this chapter, we will go through the ins and outs of MLflow, a popular MLOps tool that manages every stage of your data science projects and experiments. We will also cover AutoML, an automated way to get reasonable ML models, and feature stores, which are data management systems that version data for historical and reproducibility purposes.

In this chapter, we’re going to cover the following main topics:

  • The basics of machine learning
  • MLflow
  • Hyperopt
  • AutoML
  • Feature stores

Technical requirements

The tooling that will be used in this chapter is tied to the tech stack that’s been chosen for this book. All vendors should offer a free trial account.

I will be using Databricks in this chapter.

Setting up your environment

Before we begin this chapter, let’s take some time to set up our working environment.

Python, AWS, and Databricks

As in the previous chapter, this chapter assumes you have a working version of Python 3.6+ installed in your development environment. It also assumes you have an AWS account and have set up Databricks with that account.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything was installed correctly. If the following command prints the tool’s version, everything is working:

databricks -v

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will ask for the host (workspace URL) of your Databricks instance, as well as the token you just created:

databricks configure --token
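
Once configured, the CLI stores its credentials in a profile file. As a hedged illustration (the host and token values here are placeholders, not real credentials), the resulting ~/.databrickscfg typically looks something like this:

[DEFAULT]
host = https://your-workspace.cloud.databricks.com
token = dapiXXXXXXXXXXXXXXXXXXXX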

Introduction to machine learning

ML is a discipline that draws heavily on statistics. We will go through the basics of ML at a high level so that we can appreciate the tooling covered later in this chapter.

Understanding data

ML is the process of applying a learning algorithm to a set of historical data to predict things that are unknown, such as recognizing images or forecasting future events, to name a few examples. When you’re feeding data into your ML model, you will use features. A feature is simply another term for an input variable in your data. Data is the oil that runs ML, so we will talk about it first.

Types of data

Data can come in two forms:

  • Quantitative data: Quantitative data is data that can be measured numerically. Age and height are good examples of quantitative data. Quantitative data comes in two flavors: discrete and continuous. Discrete data is countable and finite, or limited to a fixed range of values. An example...

Understanding hyperparameters and parameters

When we start training our ML models, we will generally have two types of “knobs” to tinker with. The first knob is normally handled by the modeling software – these are the parameters of the model. Each modeling technique has parameters that are learned from the data during training. It’s useful to understand what the parameters are when you train your model so that you can see how it compares to other models.

On the other hand, every time you train your model, you can set varying hyperparameters for the software. A simple example is that, in random forest model training, you can set things such as the number of trees and the number of features considered at each split. You would need to search through all the varying combinations across your cross-validation folds to find the best combination for your validation dataset. Many software packages will do most of that heavy lifting for you now, using a grid search or a random...
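
To make this concrete, here is a minimal sketch (not code from the chapter; the toy data and grid values are illustrative) of a cross-validated grid search over random forest hyperparameters with scikit-learn:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy regression data standing in for your real feature matrix and target
X, y = make_regression(n_samples=200, n_features=10, random_state=42)

# Hyperparameters chosen before training: number of trees and features per split
param_grid = {"n_estimators": [50, 100, 200], "max_features": ["sqrt", 1.0]}

# Evaluate every combination with 5-fold cross-validation and keep the best one
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)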

AutoML

AutoML is the process of giving an AutoML program your data and having it try to find the best algorithm and hyperparameters for you. This can often give you good general results, but even a well-run AutoML job typically requires further fine-tuning. AutoML can be expensive, so it’s important to watch your bill and plan accordingly. Databricks offers a very useful AutoML feature. Databricks AutoML performs data preparation, trains several models, and then finds the best-trained models. Databricks AutoML uses a variety of the most popular ML libraries in its evaluation. So, will AutoML give you the best model possible? No – it’s not going to replace the need for further feature engineering and model tuning. Instead, it’s going to take a chunk of the work off your plate and give you a reasonable model to start with. In some cases, that model will be good enough for what you need.
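
As a hedged sketch of the Python API (the DataFrame, target column, and timeout here are assumptions, not values from the chapter), kicking off a Databricks AutoML regression run looks roughly like this:

from databricks import automl

# train_df is a hypothetical Spark DataFrame with a numeric "price" column
summary = automl.regress(train_df, target_col="price", timeout_minutes=30)

# The returned summary points at the best trial and its MLflow run
print(summary.best_trial.mlflow_run_id)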

Note

You can learn more about AutoML by going to https://learn...

MLflow

MLflow is an open source MLOps framework that aims to be compatible with whatever other tools you might have. It has libraries in many popular languages and can also be accessed via a REST API. MLflow has very few opinions and will not dictate anything about your tech stack. You don’t even need to use the provided libraries; you could choose to interact purely via the REST API. This allows teams to customize and pick whatever other tooling they wish. If you want a hosted MLflow, you can use Databricks’ managed MLflow.
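
As a minimal, hedged sketch of the tracking API (the run name, parameter, and metric values are illustrative, not the chapter’s), logging an experiment run looks roughly like this:

import mlflow

# Start a run and record hyperparameters and metrics for later comparison
with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("n_estimators", 100)   # a hyperparameter of interest
    mlflow.log_metric("rmse", 0.42)         # an evaluation metric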

MLOps benefits

There are many useful benefits to MLOps, and they often center on models and features. MLOps tooling will often record the details of a series of experiments. This documentation can include metrics around model training and performance. MLOps tooling often stores your model and, in some cases, has mechanisms for users to interact with that model. Another useful area of MLOps is around features, which often need to be...

Feature stores

A feature store is a repository of features that have been created and versioned and are ready for model training. Recording features and storing them is critical for reproducibility. I have seen many cases where data scientists have created models with no documentation on how to retrain them other than a mess of complex code. A feature store is a catalog of features, similar to a model store. Feature stores are normally organized into databases and feature tables.

Let’s jump right in and go through Databricks feature store’s APIs:

  1. First, let’s import the necessary libraries:
    from databricks import feature_store
    from databricks.feature_store import FeatureLookup
    import random
  2. Now, let’s define the feature table’s name, recording its database and schema. We are using the users DataFrame. We will also set our lookup_key, which in this case is user_id. A lookup key is simply the value that identifies a record in the feature table when we’re searching for...
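
To give a feel for where these steps lead, here is a hedged sketch (the table name, database, and DataFrame are assumptions, and the exact API can vary with your Databricks runtime) of registering and looking up a feature table:

fs = feature_store.FeatureStoreClient()

# Register the users DataFrame as a feature table keyed by user_id
fs.create_table(
    name="ml_features.users",        # hypothetical database.table name
    primary_keys=["user_id"],
    df=users,
    description="User features for model training",
)

# Later, individual features can be joined into training sets by key
lookup = FeatureLookup(table_name="ml_features.users", lookup_key="user_id")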

Practical lab

So, the first problem is to create a REST API with fake data that we can use for predictions.

For this, I have used mockaroo.com.

Here is the schema I created with Mockaroo:

Figure 6.2: Setting fake data

A sample of the data looks like this:

Figure 6.3: Fake data output

Mockaroo allows you to create a free API – all you need to do is hit Create API at the bottom of the schema window.

Next, we will use Python to pull the data and prepare it for modeling.

First, we will import the necessary libraries:

import requests
import pandas as pd
import io
import mlflow
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import numpy as np

Next, we will use the requests package to send a GET request to our new Mockaroo API:

url = "https://my.api.mockaroo.com/chapter_6.json"

Note that you must put...
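
As a hedged sketch of how the retrieval step might look (the query parameter name and placeholder key are my assumptions, not the chapter’s code), pulling the response into a pandas DataFrame could be done like this:

# Hypothetical: pass your Mockaroo API key with the request
response = requests.get(url, params={"key": "YOUR_MOCKAROO_API_KEY"})
response.raise_for_status()

# The endpoint returns JSON records, which pandas can load directly
df = pd.read_json(io.StringIO(response.text))
print(df.head())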

Summary

Phew – that was a ton of information! It’s been a long chapter, so let’s recap what we covered. First, we looked at the basics of ML and how it can be used. Then, we looked at the importance and usage of data in ML. We explored AutoML and managing ML projects with MLflow. Lastly, we looked at how we can better manage our training data with a feature store. In the next chapter, we will look at managing our data workflows.
