Building an Extract, Transform, Machine Learning Use Case

Similar to Chapter 8, Building an Example ML Microservice, the aim of this chapter is to crystallize a lot of the tools and techniques we have learned about throughout this book and apply them to a realistic scenario. This will be based on another use case introduced in Chapter 1, Introduction to ML Engineering, where we imagined the need to cluster taxi ride data on a scheduled basis. So that we can explore some of the other concepts introduced throughout the book, we will also assume that for each taxi ride there is a series of textual data from a range of sources, such as traffic news sites and transcripts of calls between the taxi driver and the base, joined to the core ride information. We will then pass this data to a Large Language Model (LLM) for summarization. The result of this summarization can then be saved in the target data location alongside the basic ride data to provide important context...
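
To give a flavor of what that summarization step could look like, here is a minimal sketch, assuming the openai Python package (v1+ client interface) and an OPENAI_API_KEY environment variable; the model name, prompt, and summarize_ride_context helper are illustrative assumptions rather than the exact setup used later in the chapter:

import os
from openai import OpenAI

# Hypothetical helper: summarize the free-text context joined to a single taxi ride.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def summarize_ride_context(ride_context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Summarize this taxi ride context in two sentences."},
            {"role": "user", "content": ride_context},
        ],
    )
    return response.choices[0].message.content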

Technical requirements

As in the other chapters, you can create the environment needed to run the code examples in this chapter with:

conda env create -f mlewp-chapter09.yml

This will include installs of Airflow, PySpark, and some supporting packages. For the Airflow examples, we will just work locally; if you want to deploy to the cloud, you can follow the details given in Chapter 5, Deployment Patterns and Tools. If you have run the above conda command, then you will have installed Airflow locally, along with PySpark and the Airflow PySpark connector package, so you can run Airflow in standalone mode with the following command in the terminal:

airflow standalone

This will then instantiate a local database and all relevant Airflow components. There will be a lot of output to the terminal, but near the end of the first phase of output, you should be able to spot details about the local server that is running, including a generated user ID and password...
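
If those generated credentials scroll past you, recent Airflow 2.x releases also write the standalone admin password to a file in your Airflow home directory, so a command like the following should recover it (assuming the default AIRFLOW_HOME of ~/airflow):

cat ~/airflow/standalone_admin_password.txt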

Understanding the batch processing problem

In Chapter 1, Introduction to ML Engineering, we saw the scenario of a taxi firm that wanted to analyze anomalous rides at the end of every day. The customer had the following requirements:

  • Rides should be clustered based on ride distance and time, and anomalies/outliers identified (a minimal sketch of this clustering step follows these requirement lists).
  • Speed (distance/time) was not to be used, as the analysts wanted to understand rides that were long in distance and rides that were long in duration separately.
  • The analysis should be carried out on a daily schedule.
  • The data for inference should be consumed from the company’s data lake.
  • The results should be made available for consumption by other company systems.

Based on the description in the introduction to this chapter, we can now add some extra requirements:

  • The system’s results should contain information on each ride’s classification as well as a summary of the relevant textual data.
  • Only anomalous rides need to have...
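
To make the clustering requirement concrete before we get into the design, the following is a minimal sketch of the kind of model that could satisfy it; the DataFrame column names (ride_dist, ride_time) and the DBSCAN parameters are illustrative assumptions rather than the exact values used in the final build:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def label_anomalous_rides(rides: pd.DataFrame) -> pd.DataFrame:
    # Cluster on ride distance and duration only (not speed), per the requirements.
    features = StandardScaler().fit_transform(rides[["ride_dist", "ride_time"]])
    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(features)
    rides["cluster"] = labels
    # DBSCAN assigns the label -1 to points it cannot place in any cluster,
    # which we treat here as the anomalous/outlier rides.
    rides["is_anomaly"] = labels == -1
    return rides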

Designing an ETML solution

The requirements clearly point us to a solution that takes in some data and augments it with ML inference before outputting the data to a target location. Any design we come up with must encapsulate these steps. This is the description of any ETML solution, and it is one of the most widely used patterns in the ML world. In my opinion, it will remain important for a long time to come, as it is particularly suited to ML applications where:

  • Latency is not critical: If you can afford to run on a schedule and there are no high-throughput or low-latency response time requirements, then running as an ETML batch is perfectly acceptable.
  • You need to batch the data for algorithmic reasons: A great example of this is the clustering approach we will use here. There are ways to perform clustering in an online setting, where the model is continually updated as new data comes in, but some approaches are simpler if you have all the relevant data taken together...

Selecting the tools

For this example, and pretty much whenever we have an ETML problem, our main considerations boil down to a few simple things: the interfaces we need to build, the tools we need to perform the transformation and modeling at the scale we require, and how we orchestrate all of the pieces together. The next few sections will cover each of these in turn.

Interfaces and storage

When we execute the extract and load parts of ETML, we need to consider how to interface with the systems that store our data. It is important that, whichever database or data technology we extract from, we use the appropriate tools to extract at whatever scale and pace we need. In this example, we will use S3 on AWS for our storage; our interfacing can be taken care of by the AWS boto3 library and the AWS CLI. Note that we could have selected a few other approaches, some of which are listed in Table 9.2 along with their pros and cons.
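
As a minimal sketch of what those extract and load interfaces can look like with boto3 (the bucket and key names here are placeholders, not the ones used in the build later in the chapter):

import json
import boto3

s3 = boto3.client("s3")

def extract_rides(bucket: str, key: str) -> list[dict]:
    # Pull the day's ride data from the data lake (here, a JSON object in S3).
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())

def load_results(results: list[dict], bucket: str, key: str) -> None:
    # Write the enriched results back to the target location for other systems to consume.
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(results).encode("utf-8"))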

...

Executing the build

Execution of the build, in this case, will be very much about how we take the proof-of-concept code shown in Chapter 1, Introduction to ML Engineering, and split it out into components that can be called by a scheduling tool such as Apache Airflow.

This will provide a showcase of how we can apply some of the ML engineering skills we learned throughout the book. In the next few sections, we will focus on how to build out an Airflow pipeline that leverages a series of different ML capabilities, creating a relatively complex solution in just a few lines of code.

Building an ETML pipeline with advanced Airflow features

We already discussed Airflow in detail in Chapter 5, Deployment Patterns and Tools, but there we covered more of the details around how to deploy your DAGs on the cloud. Here we will focus on building more advanced capabilities and control flows into your DAGs. We will work locally here on the understanding that when you...
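
As a preview of the overall shape such a pipeline can take, here is a minimal sketch of a daily DAG using Airflow's TaskFlow API; the task bodies, DAG name, and schedule are placeholders, and the pipeline we build in this section adds more sophisticated control flow on top of this skeleton:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def etml_taxi_rides():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull the previous day's rides from the data lake.
        return []

    @task
    def cluster_and_summarize(rides: list[dict]) -> list[dict]:
        # Placeholder: run the clustering model and LLM summarization on anomalous rides.
        return rides

    @task
    def load(results: list[dict]) -> None:
        # Placeholder: write the enriched results to the target location.
        pass

    load(cluster_and_summarize(extract()))

etml_taxi_rides()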

Summary

This chapter has covered how to apply a lot of the techniques learned in this book, in particular from Chapter 2, The Machine Learning Development Process, Chapter 3, From Model to Model Factory, Chapter 4, Packaging Up, and Chapter 5, Deployment Patterns and Tools, to a realistic application scenario. The problem, in this case, concerned clustering taxi rides to find anomalous rides and then performing NLP on some contextual text data to try and help explain those anomalies automatically. This problem was tackled using the ETML pattern, which I offered up, and explained in detail, as a way to rationalize typical batch ML engineering solutions. We then covered a design for a potential solution, as well as a discussion of some of the tooling choices any ML engineering team would have to work through. Finally, we took a deep dive into some of the key pieces of work that would be required to make this solution production-ready. In particular, we showed how you can use good...
