Building an Extract, Transform, Machine Learning Use Case

Similar to Chapter 8, Building an Example ML Microservice, the aim of this chapter is to crystallize a lot of the tools and techniques we have learned about throughout this book and apply them to a realistic scenario. This will be based on another use case introduced in Chapter 1, Introduction to ML Engineering, where we imagined the need to cluster taxi ride data on a scheduled basis. So that we can explore some of the other concepts introduced throughout the book, we will also assume that for each taxi ride there is a series of textual data from a range of sources, such as traffic news sites and transcripts of calls between the taxi driver and the base, joined to the core ride information. We will then pass this data to a Large Language Model (LLM) for summarization. The result of this summarization can then be saved in the target data location alongside the basic ride data to provide important context...
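
To give a flavor of what that summarization step could look like, here is a minimal sketch, assuming the openai Python package (v1+ client interface) and an OPENAI_API_KEY environment variable; the model name, prompt, and summarize_ride_context helper are illustrative assumptions rather than the exact setup used later in the chapter:

import os
from openai import OpenAI

# Hypothetical helper: summarize the free-text context joined to a single taxi ride.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def summarize_ride_context(ride_context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Summarize this taxi ride context in two sentences."},
            {"role": "user", "content": ride_context},
        ],
    )
    return response.choices[0].message.content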

Technical requirements

As in the other chapters, you can create the environment needed to run the code examples in this chapter with:

conda env create -f mlewp-chapter09.yml

This will include installs of Airflow, PySpark, and some supporting packages. For the Airflow examples, we will just work locally; if you want to deploy to the cloud, you can follow the details given in Chapter 5, Deployment Patterns and Tools. If you have run the above conda command, then you will have installed Airflow locally, along with PySpark and the Airflow PySpark connector package, so you can run Airflow in standalone mode with the following command in the terminal:

airflow standalone

This will then instantiate a local database and all relevant Airflow components. There will be a lot of output to the terminal, but near the end of the first phase of output, you should be able to spot details about the local server that is running, including a generated user ID and password...
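
If those generated credentials scroll past you, recent Airflow 2.x releases also write the standalone admin password to a file in your Airflow home directory, so a command like the following should recover it (assuming the default AIRFLOW_HOME of ~/airflow):

cat ~/airflow/standalone_admin_password.txt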

Understanding the batch processing problem

In Chapter 1, Introduction to ML Engineering, we saw the scenario of a taxi firm that wanted to analyze anomalous rides at the end of every day. The customer had the following requirements:

  • Rides should be clustered based on ride distance and time, and anomalies/outliers identified (a minimal sketch of this clustering step follows these requirement lists).
  • Speed (distance/time) was not to be used, as the analysts wanted to understand rides that were long in distance and rides that were long in duration separately.
  • The analysis should be carried out on a daily schedule.
  • The data for inference should be consumed from the company’s data lake.
  • The results should be made available for consumption by other company systems.

Based on the description in the introduction to this chapter, we can now add some extra requirements:

  • The system’s results should contain information on each ride’s classification as well as a summary of the relevant textual data.
  • Only anomalous rides need to have...
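
To make the clustering requirement concrete before we get into the design, the following is a minimal sketch of the kind of model that could satisfy it; the DataFrame column names (ride_dist, ride_time) and the DBSCAN parameters are illustrative assumptions rather than the exact values used in the final build:

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def label_anomalous_rides(rides: pd.DataFrame) -> pd.DataFrame:
    # Cluster on ride distance and duration only (not speed), per the requirements.
    features = StandardScaler().fit_transform(rides[["ride_dist", "ride_time"]])
    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(features)
    rides["cluster"] = labels
    # DBSCAN assigns the label -1 to points it cannot place in any cluster,
    # which we treat here as the anomalous/outlier rides.
    rides["is_anomaly"] = labels == -1
    return rides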

Designing an ETML solution

The requirements clearly point us to a solution that takes in some data and augments it with ML inference before outputting the data to a target location. Any design we come up with must encapsulate these steps. This is the description of any ETML solution, and it is one of the most widely used patterns in the ML world. In my opinion, it will remain important for a long time to come, as it is particularly suited to ML applications where:

  • Latency is not critical: If you can afford to run on a schedule and there are no high-throughput or low-latency response time requirements, then running as an ETML batch is perfectly acceptable.
  • You need to batch the data for algorithmic reasons: A great example of this is the clustering approach we will use here. There are ways to perform clustering in an online setting, where the model is continually updated as new data comes in, but some approaches are simpler if you have all the relevant data taken together...

Selecting the tools

For this example, and pretty much whenever we have an ETML problem, our main considerations boil down to a few simple things: the interfaces we need to build, the tools we need to perform the transformation and modeling at the scale we require, and how we orchestrate all of the pieces together. The next few sections will cover each of these in turn.

Interfaces and storage

When we execute the extract and load parts of ETML, we need to consider how to interface with the systems that store our data. It is important that, whichever database or data technology we extract from, we use the appropriate tools to extract at whatever scale and pace we need. In this example, we will use S3 on AWS for our storage; our interfacing can be taken care of by the AWS boto3 library and the AWS CLI. Note that we could have selected a few other approaches, some of which are listed in Table 9.2 along with their pros and cons.
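
As a minimal sketch of what those extract and load interfaces can look like with boto3 (the bucket and key names here are placeholders, not the ones used in the build later in the chapter):

import json
import boto3

s3 = boto3.client("s3")

def extract_rides(bucket: str, key: str) -> list[dict]:
    # Pull the day's ride data from the data lake (here, a JSON object in S3).
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())

def load_results(results: list[dict], bucket: str, key: str) -> None:
    # Write the enriched results back to the target location for other systems to consume.
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(results).encode("utf-8"))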

...

Executing the build

Execution of the build, in this case, will be very much about how we take the proof-of-concept code shown in Chapter 1, Introduction to ML Engineering, and split it out into components that can be called by a scheduling tool such as Apache Airflow.

This will provide a showcase of how we can apply some of the ML engineering skills we learned throughout the book. In the next few sections, we will focus on how to build out an Airflow pipeline that leverages a series of different ML capabilities, creating a relatively complex solution in just a few lines of code.

Building an ETML pipeline with advanced Airflow features

We already discussed Airflow in detail in Chapter 5, Deployment Patterns and Tools, but there we covered more of the details around how to deploy your DAGs on the cloud. Here we will focus on building more advanced capabilities and control flows into your DAGs. We will work locally here on the understanding that when you...
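
As a preview of the overall shape such a pipeline can take, here is a minimal sketch of a daily DAG using Airflow's TaskFlow API; the task bodies, DAG name, and schedule are placeholders, and the pipeline we build in this section adds more sophisticated control flow on top of this skeleton:

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def etml_taxi_rides():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull the previous day's rides from the data lake.
        return []

    @task
    def cluster_and_summarize(rides: list[dict]) -> list[dict]:
        # Placeholder: run the clustering model and LLM summarization on anomalous rides.
        return rides

    @task
    def load(results: list[dict]) -> None:
        # Placeholder: write the enriched results to the target location.
        pass

    load(cluster_and_summarize(extract()))

etml_taxi_rides()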

Summary

This chapter has covered how to apply a lot of the techniques learned in this book, in particular from Chapter 2, The Machine Learning Development Process, Chapter 3, From Model to Model Factory, Chapter 4, Packaging Up, and Chapter 5, Deployment Patterns and Tools, to a realistic application scenario. The problem, in this case, concerned clustering taxi rides to find anomalous rides and then performing NLP on some contextual text data to try and help explain those anomalies automatically. This problem was tackled using the ETML pattern, which I offered up, and explained in detail, as a way to rationalize typical batch ML engineering solutions. We then covered a design for a potential solution, as well as a discussion of some of the tooling choices any ML engineering team would have to work through. Finally, we took a deep dive into some of the key pieces of work that would be required to make this solution production-ready. In particular, we showed how you can use good...
