Machine Learning Engineering with Python

By Andrew P. McMahon

About this book

Machine learning engineering is a thriving discipline at the interface of software development and machine learning. This book will help developers working with machine learning and Python to put their knowledge to work and create high-quality machine learning products and services.

Machine Learning Engineering with Python takes a hands-on approach to help you get to grips with essential technical concepts, implementation patterns, and development methodologies to have you up and running in no time. You'll begin by understanding key steps of the machine learning development life cycle before moving on to practical illustrations and getting to grips with building and deploying robust machine learning solutions. As you advance, you'll explore how to create your own toolsets for training and deployment across all your projects in a consistent way. The book will also help you get hands-on with deployment architectures and discover methods for scaling up your solutions while building a solid understanding of how to use cloud-based tools effectively. Finally, you'll work through examples to help you solve typical business problems.

By the end of this book, you'll be able to build end-to-end machine learning services using a variety of techniques and design your own processes for consistently performant machine learning engineering.

Publication date:
November 2021
Publisher
Packt
Pages
276
ISBN
9781801079259

 

Chapter 1: Introduction to ML Engineering

Welcome to Machine Learning Engineering with Python, a book that aims to introduce you to the exciting world of making Machine Learning (ML) systems production-ready.

This book will take you through a series of chapters covering training systems, scaling up solutions, system design, model tracking, and a host of other topics, to prepare you for your own work in ML engineering or to work with others in this space. No book can be exhaustive on this topic, so this one will focus on concepts and examples that I think cover the foundational principles of this increasingly important discipline.

You will get a lot from this book even if you do not run the technical examples, or even if you try to apply the main points in other programming languages or with different tools. In covering the key principles, the aim is that you come away from this book feeling more confident in tackling your own ML engineering challenges, whatever your chosen toolset.

In this first chapter, you will learn about the different types of data role relevant to ML engineering and how to distinguish them; how to use this knowledge to build and work within appropriate teams; some of the key points to remember when building working ML products in the real world; how to start to isolate appropriate problems for engineered ML solutions; and how to create your own high-level ML system designs for a variety of typical business problems.

We will cover all of these aspects in the following sections:

  • Defining a taxonomy of data disciplines
  • Assembling your team
  • ML engineering in the real world
  • What does an ML solution look like?
  • High-level ML system design

Now that we have explained what we are going after in this first chapter, let's get started!

 

Technical requirements

Throughout the book, we will assume that Python 3 is installed and working. The following Python packages are used in this chapter:

  • Scikit-learn 0.23.2
  • NumPy
  • pandas
  • imblearn
  • Prophet 0.7.1
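
If you want to confirm what is installed before running the examples, a check along the following lines may help. This is a minimal sketch rather than part of the book's code: the installed_versions helper is hypothetical, and the distribution names are assumptions about how these packages are published (imblearn is installed as imbalanced-learn, and Prophet 0.7.1 was distributed as fbprophet), so adjust them for your environment.

```python
from importlib import metadata

# Distribution names, which can differ from the import names used in code.
# This list is an assumption about your environment; edit it as needed.
REQUIRED = ["scikit-learn", "numpy", "pandas", "imbalanced-learn", "fbprophet"]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'not installed'}")
```

A value of None simply means the distribution was not found under that name, which may also happen if it was installed under a different one.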
 

Defining a taxonomy of data disciplines

The explosion of data and the potential applications of that data over the past few years have led to a proliferation of job roles and responsibilities. The debate that once raged over how a data scientist was different from a statistician has now become extremely complex. I would argue, however, that it does not have to be so complicated. The activities that have to be undertaken to get value from data are pretty consistent, no matter what business vertical you are in, so it should be reasonable to expect that the skills and roles you need to perform these steps will also be relatively consistent. In this chapter, we will explore some of the main data disciplines that I think you will always need in any data project. As you can guess, given the name of this book, I will be particularly keen to explore the notion of ML engineering and how this fits into the mix.

Let's now look at some of the roles involved in using data in the modern landscape.

Data scientist

Since the Harvard Business Review declared that being a data scientist was The Sexiest Job of the 21st Century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century), this title has become one of the most sought after, but also hyped, in the mix. A data scientist can cover an entire spectrum of duties, skills, and responsibilities depending on the business vertical, the organization, or even just personal preference. No matter how this role is defined, however, there are some key areas of focus that should always be part of the data scientist's job profile:

  • Analysis: A data scientist should be able to wrangle, munge, manipulate, and consolidate datasets before performing calculations on that data that help us to understand it. Analysis is a broad term, but it's clear that the end result is knowledge of your dataset that you didn't have before you started, no matter how basic or complex.
  • Modeling: The thing that gets everyone excited (potentially including you, dear reader) is the idea of modeling data. A data scientist usually has to be able to apply statistical, mathematical, and machine learning models to data in order to explain it or perform some sort of prediction.
  • Working with the customer or user: The data science role usually has some more business-directed elements so that the results of the analysis and modeling work can support decision making in the organization. This could be done by presenting the results of analysis in PowerPoints or Jupyter notebooks or even sending an email with a summary of the key results. It involves communication and business acumen in a way that goes beyond classic tech roles.

ML engineer

A newer kid on the block, and indeed the subject of this book, is the ML engineer. This role has risen to fill the perceived gap between the analysis and modeling of data science and the world of software products and robust systems engineering.

You can articulate the need for this type of role quite nicely by considering a classic voice assistant. In this case, a data scientist would usually focus on translating the business requirements into a working speech-to-text model, potentially a very complex neural network, and showing that it can perform the desired voice transcription task in principle. ML engineering is then all about how you take that speech-to-text model and build it into a product, service, or tool that can be used in production. Here, it may mean building some software to train, retrain, deploy, and track the performance of the model as more transcription data is accumulated, or user preferences are understood. It may also involve understanding how to interface with other systems and how to provide results from the model in the appropriate formats, for example, interacting with an online store.

Data scientists and ML engineers have many overlapping skills and competencies, but different areas of focus and different strengths (more on this later). They will usually be part of the same project team and may hold either title, but what they do in that project will make it clear which hat they are wearing.

Similar to the data scientist, we can define the key areas of focus for the ML engineer:

  • Translation: Taking models and research code in a variety of formats and translating this into slicker, more robust pieces of code. This could be done using OO programming, functional programming, a mix, or something else, but basically helps to take the Proof-Of-Concept work of the data scientist and turn it into something that is far closer to being trusted in a production environment.
  • Architecture: Deployments of any piece of software do not occur in a vacuum and will always involve lots of integrated parts. This is true of machine learning solutions as well. The ML engineer has to understand how the appropriate tools and processes link together so that the models built with the data scientist can do their job and do it at scale.
  • Productionization: The ML engineer is focused on delivering a solution and so should understand the customer's requirements inside out, as well as be able to understand what that means for the project development. The end goal of the ML engineer is not to provide a good model (though that is part of it), nor is it to provide something that basically works. Their job is to make sure that the hard work on the data science side of things generates the maximum potential value in a real-world setting.

Data engineer

The most important people in any data team (in my opinion) are the people who are responsible for getting the commodity that everything else in the preceding sections is based on from A to B with high fidelity, appropriate latency, and with as little effort on the part of the other team members as possible. You cannot create any type of software product, never mind a machine learning product, without data.

The key areas of focus for a data engineer are as follows:

  • Quality: Getting data from A to B is a pointless exercise if the data is garbled, fields are missing, or IDs are screwed up. The data engineer cares about avoiding this and uses a variety of techniques and tools, generally to ensure that the data that left the source system is what lands in your data storage layer.
  • Stability: Similar to the previous point on quality, if the data comes from A to B but it only does it every second Wednesday if it's not a rainy day, then what's the point? Data engineers spend a lot of time and effort and use their considerable skills to ensure that data pipelines are robust, reliable, and can be trusted to deliver when promised.
  • Access: Finally, the aim of getting the data from A to B is for it to be used by applications, analyses, and machine learning models, so the nature of the B is important. The data engineer will have a variety of technologies to hand for surfacing data and should work with the data consumers (our data scientists and machine learning engineers, among others) to define and create appropriate data models within these solutions:
Figure 1.1 – A diagram showing the relationships between data science, ML engineering, and data engineering


As mentioned previously, this book focuses on the work of the ML engineer and how you can learn some of the skills useful for that role, but it is always important to remember that you will not be working in a vacuum. Always keep in mind the profiles of the other roles (and many more not covered here that will exist in your project team) so that you work most effectively together. Data is a team sport after all!

 

Assembling your team

There are no set rules about how you should pull together a team for your machine learning project, but there are some good general principles to follow, and gotchas to avoid.

First, always bear in mind that unicorns do not exist. You can find some very talented people out there, but do not ever think one person can do everything you will need to the level you require. This is not just a bit unrealistic; it is bad practice and will negatively impact the quality of your products. Even when you are severely resource-constrained, the key is for your team members to have a laser-like focus to succeed.

Secondly, blended is best. We all know the benefits of diversity for organizations and teams in general and this should, of course, apply to your machine learning team as well. Within a project, you will need the mathematics, the code, the engineering, the project management, the communication, and a variety of other skills to succeed. So, given the previous point, make sure you cover this in at least some sense across your team.

Third, tie your team structure to your projects in a dynamic way. If you are working on a project that is mostly about getting the data in the right place and the actual machine learning models are really simple, focus your team profile on the engineering and data modeling aspects. If the project requires a detailed understanding of the model, and it is quite complex, then reposition your team to make sure this is covered. This is just sensible and frees up team members who would otherwise have been underutilized to work on other projects.

As an example, suppose that you have been tasked with building a system that classifies customer data as it comes into your shiny new data lake, and the decision has been taken that this should be done at the point of ingestion via a streaming application. The classification has already been built for another project. It is already clear that this solution will heavily involve the skills of the data engineer and the ML engineer, but not so much the data scientist since that portion of work has been completed in another project.

In the next section, we will look at some important points to consider when deploying your team on a real-world business problem.

 

ML engineering in the real world

The majority of us who work in machine learning, analytics, and related disciplines do so for for-profit companies. It is important therefore that we consider some of the important aspects of doing this type of work in the real world.

First of all, the ultimate goal of your work is to generate value. This can be calculated and defined in a variety of ways, but fundamentally your work has to improve something for the company or their customers in a way that justifies the investment put in. This is why most companies will not be happy for you to take a year to play with new tools and then generate nothing concrete to show for it (not that you would do this anyway, it is probably quite boring) or to spend your days reading the latest papers and only reading the latest papers. Yes, these things are part of any job in technology, and especially any job in the world of machine learning, but you have to be strategic about how you spend your time and always be aware of your value proposition.

Secondly, to be a successful ML engineer in the real world, you cannot just understand the technology; you must understand the business. You will have to understand how the company works day to day, you will have to understand how the different pieces of the company fit together, and you will have to understand the people of the company and their roles. Most importantly, you have to understand the customer, both of the business and of your work. If you do not know the motivations, pains, and needs of the people you are building for, then how can you be expected to build the right thing?

Finally, and this may be controversial, the most important skill for you being a successful ML engineer in the real world is one that this book will not teach you, and that is the ability to communicate effectively. You will have to work in a team, with a manager, with the wider community and business, and, of course, with your customers, as mentioned above. If you can do this and you know the technology and techniques (many of which are discussed in this book), then what can stop you?

But what kind of problems can you solve with machine learning when you work in the real world? Well, let's start with another potentially controversial statement: a lot of the time, machine learning is not the answer. This may seem strange given the title of this book, but it is just as important to know when not to apply machine learning as when to apply it. This will save you tons of expensive development time and resources.

Machine learning is ideal for cases when you want to do a semi-routine task faster, with more accuracy, or at a far larger scale than is possible with other solutions. Some typical examples are given in the following table, along with some discussion as to whether or not ML would be an appropriate tool for solving the problem:

Figure 1.2 – Potential use cases for ML


As this table of simple examples hopefully starts to make clear, the cases where machine learning is the answer are ones that can usually be framed very well as a mathematical or statistical problem. After all, this is what machine learning really is: a series of algorithms rooted in mathematics that can iterate some internal parameters based on data. Where the lines start to blur in the modern world is in areas such as deep learning or reinforcement learning, where advances mean that problems we previously thought would be very hard to phrase appropriately for standard ML algorithms can now be tackled.
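
To make the idea of iterating internal parameters based on data concrete, here is a minimal sketch that fits a straight line by gradient descent using only the standard library. The fit_line function, the toy data, the learning rate, and the number of steps are all illustrative assumptions and do not come from the book's repository.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # internal parameters, updated from the data
    n = len(xs)
    for _ in range(n_steps):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step each parameter against its gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The same loop of "predict, measure the error, adjust the parameters" sits underneath far more complex models; only the parameter count and the form of the prediction change.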

The other tendency to watch out for in the real world (to go along with "let's use ML for everything") is the worry that people have that ML is coming for their job and should not be trusted. This is understandable: a report by PwC in 2018 suggested that 30% of UK jobs will be impacted by automation by the 2030s (Will Robots Really Steal Our Jobs?: https://www.pwc.co.uk/economic-services/assets/international-impact-of-automation-feb-2018.pdf). What you have to try and make clear when working with your colleagues and customers is that what you are building is there to supplement and augment their capabilities, not to replace them.

Let's conclude this section by revisiting an important point: the fact that you are working for a company means, of course, that the aim of the game is to create value appropriate to the investment. In other words, you need to show a good Return On Investment (ROI). This means a couple of things for you practically:

  • You have to understand how different designs require different levels of investment. If you can solve your problem by training a deep neural net on a million images with a GPU running 24/7 for a month, or you know you can solve the same problem with some basic clustering and a bit of statistics on some standard hardware in a few hours, which should you choose?
  • You have to be clear about the value you will generate. This means you need to work with experts and try to translate the results of your algorithm into actual dollar values. This is so much more difficult than it sounds, so you should take the time you need to get it right. And never, ever over-promise. You should always under-promise and over-deliver.

  • Adoption is not guaranteed. Even when building products for your colleagues within a company, it is important to understand that your solution will be tested every time someone uses it post-deployment. If you build shoddy solutions, then people will not use them, and the value proposition of what you have done will start to disappear.
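
The comparison between design options can be put into rough numbers. The sketch below is purely illustrative: the roi helper is hypothetical, and every value and cost figure is invented to show the arithmetic, not taken from any real project.

```python
def roi(value_generated, total_cost):
    """Return on investment as a fraction: (value - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (value_generated - total_cost) / total_cost


# Hypothetical figures: a month of GPU training versus a few hours of
# clustering on standard hardware, each solving the same problem.
designs = {
    "deep_net_gpu_month": {"value": 120_000, "cost": 60_000},
    "clustering_standard_hw": {"value": 100_000, "cost": 10_000},
}

for name, figures in designs.items():
    print(name, roi(figures["value"], figures["cost"]))
```

With these made-up figures, the simpler design generates slightly less value but a far higher return, which is the kind of trade-off the first point above asks you to weigh.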

Now that you understand some of the important points when using ML to solve business problems, let's explore what these solutions can look like.

 

What does an ML solution look like?

When you think of ML engineering, you would be forgiven for defaulting to imagining working on voice assistants and visual recognition apps (I fell into this trap in previous pages, did you notice?). The power of ML, however, lies in the fact that wherever there is data and an appropriate problem, it can help and be integral to the solution.

Some examples might help make this clearer. When you type a text message and your phone suggests the next words, it can very often be using a natural language model under the hood. When you scroll any social media feed or watch a streaming service, recommendation algorithms are working double time. If you take a car journey and an app forecasts when you are likely to arrive at your destination, there is going to be some kind of regression at work. Your loan application often results in your characteristics and application details being passed through a classifier. These applications are not the ones shouted about on the news (perhaps with the exception of when they go horribly wrong), but they are all examples of brilliantly put-together ML engineering.

In this book, the examples we work through will be more like these: typical scenarios for machine learning encountered in products and businesses every day. These are solutions that, if you can build them confidently, will make you an asset to any organization.

We should start by considering the broad elements that should constitute any ML solution, as indicated in the following diagram:

Figure 1.3 – Schematic of the general components or layers of any ML solution and what they are responsible for


Your storage layer constitutes the endpoint of the data engineering process and the beginning of the ML one. It includes your data for training, your results from running your models, your artifacts, and important metadata. We can also consider this as including your stored code.

The compute layer is where the magic happens and where most of the focus of this book will be. It is where training, testing, prediction, and transformation all (mostly) happen. This book is all about making this layer as well-engineered as possible and interfacing with the other layers. You can blow this layer up to incorporate these pieces as in the following workflow:

Figure 1.4 – The key elements of the compute layer


Important note

The details are discussed later in the book, but this highlights the fact that at a fundamental level, your compute processes for any ML solution are really just about taking some data in and pushing some data out.

The surfacing layer is where you share your ML solution's results with other systems. This could be through anything from application database insertion to API endpoints, to message queues, to visualization tools. This is the layer through which your customer eventually gets to use the results, so you must engineer your system to provide clean and understandable outputs, something we will discuss later.

And that is it in a nutshell. We will go into detail about all of these layers and points later, but for now, just remember these broad concepts and you will start to understand how all the detailed technical pieces fit together.
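
As a rough illustration of how the three layers relate, the following sketch wires a storage read, a compute step, and a surfacing step together using only the standard library. The function names, the JSON files, and the two-standard-deviation rule are all assumptions made for this example; a real solution would swap in its own storage, model, and delivery mechanism.

```python
import json
import statistics
import tempfile
from pathlib import Path


def read_from_storage(path):
    """Storage layer: load input records (here, a JSON file of numbers)."""
    return json.loads(Path(path).read_text())


def compute(values):
    """Compute layer: data in, data out. A trivial stand-in 'model' that flags
    values more than two standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [
        {"value": v, "is_outlier": std > 0 and abs(v - mean) > 2 * std}
        for v in values
    ]


def surface(results, path):
    """Surfacing layer: write results where a downstream consumer can read them."""
    Path(path).write_text(json.dumps(results, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp) / "input.json"
    out_path = Path(tmp) / "output.json"
    in_path.write_text(json.dumps([10, 11, 9, 10, 12, 10, 11, 50]))

    surface(compute(read_from_storage(in_path)), out_path)
    surfaced = json.loads(out_path.read_text())

print([row["value"] for row in surfaced if row["is_outlier"]])
```

Keeping the three steps as separate functions means each layer can be replaced or tested on its own, which is the main point of thinking in layers at this stage.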

Why Python?

Before moving on to more detailed topics, it is important to discuss why Python has been selected as the programming language for this book. Everything that follows that pertains to higher-level topics such as architecture and system design can be applied to solutions using any or multiple languages, but Python has been singled out here for a few reasons.

Python is colloquially known as the lingua franca of data. It is an interpreted, dynamically typed, multi-paradigm programming language that has clear and simple syntax. Its tooling ecosystem is also extensive, especially in the analytics and machine learning space. Packages such as scikit-learn, numpy, scipy, and a host of others form the backbone of a huge amount of technical and scientific development across the world. Almost every major new software library for use in the data world has a Python API. It is the third most popular programming language in the world, according to the TIOBE index (https://www.tiobe.com/tiobe-index/) at the time of writing (January 2021).

Given this, being able to build your systems using Python means you will be able to leverage all of the excellent machine learning and data science tools available in this ecosystem, while also ensuring that you build applications that can play nicely with other software.

 

High-level ML system design

When you get down to the nuts and bolts of building your solution, there are so many options for tools, tech, and approaches that it can be very easy to be overwhelmed. However, as alluded to in the previous sections, a lot of this complexity can be abstracted to understand the bigger picture via some back-of-the-envelope architecture and designs. This is always a useful exercise once you know what problem you are going to try and solve, and something I recommend doing before you make any detailed choices about implementation.

To give you an idea of how this works in practice, what follows are a few worked-through examples where a team has to create a high-level ML systems design for some typical business problems. These problems are similar to ones I have encountered before and will likely be similar to ones you will encounter in your own work.

Example 1: Batch anomaly detection service

You work for a tech-savvy taxi ride company with a fleet of thousands of cars. The organization wants to start making ride times more consistent and to understand longer journeys in order to improve customer experience and thereby increase retention and return business. Your ML team is employed to create an anomaly detection service to find rides that have unusual ride time or ride length behaviors. You all get to work, and your data scientists find that if you perform clustering on sets of rides using the features of ride distance and time, you can clearly identify outliers worth investigating by the operations team. The data scientists present the findings to the CTO and other stakeholders before getting the go-ahead to develop this into a service that will provide an outlier flag as a new field in one of the main tables of the company's internal analysis tool.

In this example, we will simulate some data to show how the taxi company's data scientists could proceed. All the code is contained in the Chapter1/batch-anomaly folder in the repository for this book: https://github.com/PacktPublishing/Machine-Learning-Engineering-with-Python/tree/main/Chapter01. This will be true of all code snippets shown in this book:

  1. First, let's define a function that simulates some ride distances by drawing from numpy's uniform random number generator and returns a numpy array containing the results. The reason for the repeated lines is so that we can create some base behavior and anomalies in the data, and so that you can clearly compare against the speeds we will generate for each set of taxis in the next step:
    import numpy as np

    def simulate_ride_distances():
        # Four groups of 370, 10, 10, and 10 rides, matching the speed groups
        ride_dists = np.concatenate(
            (
                10 * np.random.random(size=370),  # base behavior
                30 * np.random.random(size=10),   # longer distances
                10 * np.random.random(size=10),   # base distances (paired with fast speeds)
                10 * np.random.random(size=10)    # base distances (paired with slow speeds)
            )
        )
        return ride_dists
  2. We can now do the exact same thing for speeds, and again we have split the taxis into sets of 370, 10, 10, and 10 so that we can create some data with 'typical' behavior and some sets of anomalies, while allowing for clear matching of the values with the distances function:
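    def simulate_ride_speeds():
        # Group sizes line up one-to-one with simulate_ride_distances()
        ride_speeds = np.concatenate(
            (
                np.random.normal(loc=30, scale=5, size=370),  # base behavior
                np.random.normal(loc=30, scale=5, size=10),   # base speeds (paired with longer distances)
                np.random.normal(loc=50, scale=10, size=10),  # fast rides
                np.random.normal(loc=15, scale=4, size=10)    # slow rides
            )
        )
        return ride_speeds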
  3. We can now use both of these helper functions inside a function that will call these and bring them together to create a simulated dataset containing ride IDs, speeds, distances, and times. The result is returned as a pandas DataFrame for use in modeling:
    import datetime
    import pandas as pd

    def simulate_ride_data():
        ride_dists = simulate_ride_distances()
        ride_speeds = simulate_ride_speeds()
        ride_times = ride_dists/ride_speeds

        # Assemble into a DataFrame
        df = pd.DataFrame(
            {
                'ride_dist': ride_dists,
                'ride_time': ride_times,
                'ride_speed': ride_speeds
            }
        )
        # Build ride IDs from today's date followed by the row index
        ride_ids = datetime.datetime.now().strftime("%Y%m%d")+df.index.astype(str)
        df['ride_id'] = ride_ids
        return df

    We can then run the simulation in lieu of getting the data from the taxi firm's system:

    df = simulate_ride_data()
  4. Now, we get to the core of what data scientists produce in their projects, which is a simple function that wraps some sklearn code for returning a dictionary with the clustering run metadata and results. We include the relevant imports here for ease:
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN
    from sklearn import metrics

    def cluster_and_label(data, create_and_show_plot=True):
        # Note: create_and_show_plot is not used in this snippet
        # Standardize the features before clustering
        data = StandardScaler().fit_transform(data)
        db = DBSCAN(eps=0.3, min_samples=10).fit(data)

        # Boolean mask of core samples (not used further in this snippet)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_

        # DBSCAN labels noise points as -1, so exclude that label from the count
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise_ = list(labels).count(-1)

        run_metadata = {
            'nClusters': n_clusters_,
            'nNoise': n_noise_,
            'silhouetteCoefficient': metrics.silhouette_score(data, labels),
            'labels': labels,
        }
        return run_metadata

    Finally, if we use the results of the simulation from Step 3 and apply the machine learning code, we can get the original taxi dataset with a set of labels telling us whether the taxi ride was anomalous ('-1') or not ('0' or higher, the index of the cluster the ride belongs to):

    X = df[['ride_dist', 'ride_time']]
    results = cluster_and_label(X, create_and_show_plot=False)
    df['label'] = results['labels']

    If you then plot the results, with outliers labeled as black triangles, you get something like Figure 1.5:

Figure 1.5 – An example set of results from performing clustering on some taxi ride data
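
Since the service is meant to deliver an outlier flag as a new field, it can help to see how DBSCAN-style labels map to such a flag. The following is a minimal, standard-library sketch: summarize_labels is a hypothetical helper and the example labels are made up, not output from the taxi simulation. It relies on scikit-learn's convention that -1 marks noise points and non-negative integers identify clusters.

```python
from collections import Counter


def summarize_labels(labels):
    """Turn DBSCAN-style labels into outlier flags and simple counts."""
    # -1 marks noise (an outlier); 0, 1, 2, ... are cluster indices
    is_outlier = [label == -1 for label in labels]
    cluster_sizes = Counter(label for label in labels if label != -1)
    return {
        "is_outlier": is_outlier,
        "n_noise": sum(is_outlier),
        "n_clusters": len(cluster_sizes),
        "cluster_sizes": dict(cluster_sizes),
    }


# Made-up labels for illustration only
example_labels = [0, 0, -1, 0, 1, 1, -1, 0]
summary = summarize_labels(example_labels)
print(summary["n_clusters"], summary["n_noise"])
```

A boolean flag like this is usually easier for a downstream analysis tool to consume than raw cluster labels, whose numbering can change from one run to the next.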


Now that you have a basic model that works, you have to start thinking about how to pull this into an engineered solution – how could you do it?

Well, since the solution here is going to support longer-running investigations by another team, there is no need for a very low-latency solution. The stakeholders agree that the insights from clustering can be delivered at the end of each day. Working with the data-science part of the team, the ML engineers (led by you) understand that if clustering is run daily, this provides enough data to give appropriate clusters, but doing the runs any more frequently could lead to poorer results due to smaller amounts of data. So, a daily batch process is agreed upon.

What do you do next? Well, you know the frequency of runs is daily, but the volume of data is still very high, so it makes sense to leverage a distributed computing paradigm. Therefore, you decide to use Apache Spark. You know that the end consumer of the data is a table in a SQL database, so you need to work with the database team to design an appropriate handover of the results. Due to security and reliability concerns, it is not a good idea to write to the production database directly. You therefore agree that another database in the cloud will be used as an intermediate staging area for the data, which the main database can query against on its daily builds.

It might not seem like we have done anything technical here, but actually, you have already performed the high-level system design for your project. The rest of this book tells you how to fill in the gaps in the following diagram!

Figure 1.6 – Example 1 workflow
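
To make the handover to the staging area more concrete, here is a minimal sketch of the last step of such a daily batch job. It is not the book's implementation: Python's built-in sqlite3 stands in for the cloud staging database, the Spark clustering step is assumed to have already produced the labels, and the table name ride_clusters_staging, its columns, the function write_daily_results, and the use of -1 to mark outliers are all hypothetical choices made for illustration.

```python
import sqlite3
from datetime import date


def write_daily_results(conn, run_date, labelled_rides):
    """Replace the staging rows for run_date with the latest cluster labels.

    labelled_rides is an iterable of (ride_id, label) pairs.
    Returns the number of rows now stored for run_date.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ride_clusters_staging (
               run_date TEXT, ride_id TEXT, label INTEGER)"""
    )
    day = run_date.isoformat()
    # Delete-then-insert inside one transaction keeps the job idempotent:
    # re-running the batch for the same day does not duplicate rows.
    with conn:
        conn.execute(
            "DELETE FROM ride_clusters_staging WHERE run_date = ?", (day,)
        )
        conn.executemany(
            "INSERT INTO ride_clusters_staging VALUES (?, ?, ?)",
            [(day, ride_id, label) for ride_id, label in labelled_rides],
        )
    row = conn.execute(
        "SELECT COUNT(*) FROM ride_clusters_staging WHERE run_date = ?", (day,)
    ).fetchone()
    return row[0]


# In-memory database standing in for the staging area (assumption).
conn = sqlite3.connect(":memory:")
rides = [("ride-001", 0), ("ride-002", 0), ("ride-003", -1)]  # -1: outlier
n_rows = write_daily_results(conn, date(2021, 11, 1), rides)
n_rows_rerun = write_daily_results(conn, date(2021, 11, 1), rides)
```

Writing one day's results as a single replaceable unit means a failed or repeated run can simply be executed again, which suits a process that the main database only reads from on its daily builds.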


Let's now move on to the next example!

Example 2: Forecasting API

In this example, you are working for the logistics arm of a large retail chain. To maximize the flow of goods, the company would like to help regional logistics planners get ahead of particularly busy periods and to avoid product sell-outs. After discussions with stakeholders and subject matter experts across the business, it is agreed that the ability for planners to dynamically request and explore forecasts for particular warehouse items through a web-hosted dashboard is optimal. This allows the planners to understand likely future demand profiles before they make orders.

The data scientists come good again and find that the data has very predictable behavior at the level of any individual store. They decide to use the Facebook Prophet library for their modeling to help speed up the process of training many different models.

This example will use the open Rossmann stores dataset from Kaggle, which can be found here: https://www.kaggle.com/pratyushakar/rossmann-store-sales

  1. First, we read in the data from the folder where we have extracted the data. We will perform all the following steps on the train dataset provided in the download but treat this as an entire dataset that we wish to split into training and test sets anyway:
    import pandas as pd
    df = pd.read_csv('./data/rossman/train.csv')
  2. Secondly, the data scientists prepped an initial subset of the data to work with first, so we will do the same. We do some basic tidy up, but the key points are that we select data for store number four in the dataset and only for when it is open:
    df['Date'] = pd.to_datetime(df['Date'])
    df.rename(columns= {'Date': 'ds', 'Sales': 'y'}, inplace=True)
    df_store = df[
        (df['Store']==4) &
        (df['Open']==1)
    ].reset_index(drop=True)
    df_store = df_store.sort_values('ds', ascending=True)
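  3. The data scientists then developed a little function that takes some supplied data, an index delineating the size of the training set, and some seasonality parameters. It trains a Prophet model on the training set and returns the model's predictions for the test set, along with the two splits:
    # Note: from version 1.0 the package is published as prophet,
    # in which case the import is `from prophet import Prophet`
    from fbprophet import Prophet
    def train_predict(df, train_index, seasonality):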
        # grab split data
        df_train = df.copy().iloc[0:train_index]
        df_test = df.copy().iloc[train_index:]
        
        model=Prophet(
            yearly_seasonality=seasonality['yearly'],
            weekly_seasonality=seasonality['weekly'],
            daily_seasonality=seasonality['daily'],
            interval_width = 0.95
        )
     
        # train and predict
        model.fit(df_train)
        predicted = model.predict(df_test)
        return predicted, df_train, df_test
  4. Before applying this function, we can define the relevant seasonality settings in a dictionary:
    seasonality = {
        'yearly': True,
        'weekly': True,
        'daily': False
    }
  5. Finally, we can apply the function as the data scientists envisaged:
    train_index = int(0.8*df_store.shape[0])
    predicted, df_train, df_test = train_predict(
        df = df_store,
        train_index = train_index,
        seasonality = seasonality
    )

    Running this model and plotting the predicted values against the ground truth gives a plot like that in Figure 1.7:

Figure 1.7 – Forecasting store sales
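
A plot gives a quick visual check, but it can help to put numbers next to it. The following is a minimal sketch, not taken from the book, that computes the mean absolute error and the fraction of actual values that fall inside the prediction interval. It assumes you have pulled the y column from df_test and the yhat, yhat_lower, and yhat_upper columns from the Prophet output into plain lists. The function names and the toy numbers are hypothetical.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute difference between actual and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def interval_coverage(actual, lower, upper):
    """Fraction of actual values that lie inside [lower, upper]."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Toy values standing in for df_test['y'] and the predicted columns.
y_true = [100.0, 120.0, 90.0, 110.0]
y_hat = [105.0, 115.0, 100.0, 110.0]
y_lower = [95.0, 105.0, 92.0, 100.0]
y_upper = [115.0, 125.0, 108.0, 120.0]

mae = mean_absolute_error(y_true, y_hat)
coverage = interval_coverage(y_true, y_lower, y_upper)
```

With interval_width set to 0.95 in the model, a coverage far below 0.95 on the test set would be one hint that the model's uncertainty estimates deserve a closer look.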


One issue here is that implementing a forecasting model like the one above for every store can quickly lead to hundreds or even thousands of models if the chain gathers enough data. Another issue is that not all stores are on the resource planning system used at the company yet, so some planners would like to retrieve forecasts for other stores they know are similar to their own. It is agreed that if users like this can explore regional profiles they believe are similar with their own data, then they can still make the optimal decisions.

Given this and the customer requirements for dynamic, ad hoc requests, you quickly rule out a full batch process. It wouldn't cover the use case for regions not on the core system, and it wouldn't allow for dynamic retrieval of up-to-date forecasts via the website. Serving forecasts on request, by contrast, would allow you to deploy models that forecast at a variety of time horizons in the future. It also means you could save on compute, as you don't need to manage the storage and updating of thousands of forecasts every day, and your resources can be focused on model training.

Therefore, you decide that actually, a web-hosted API with an endpoint that can return forecasts as needed by the user makes the most sense. To give efficient responses, you have to consider what happens in a typical user session. By workshopping with the potential users of the dashboard, you quickly realize that although the requests are dynamic, most planners will focus on particular items of interest in any one session. They will also not look at many regions. This helps you to design a data, forecast, and model caching strategy that means that after the user makes their first selections, results can be returned more quickly for a better user experience. This leads to the rough system sketch in Figure 1.8:

Figure 1.8 – Example 2 workflow
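
The caching idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class ForecastCache, the key of (store_id, item_id, horizon_days), the time-to-live value, and the placeholder slow_forecast function are all hypothetical, and a real service might use an external cache rather than an in-process dictionary.

```python
import time


class ForecastCache:
    """Keep recently requested forecasts so repeat requests return quickly."""

    def __init__(self, compute_forecast, ttl_seconds=3600, clock=time.monotonic):
        self._compute = compute_forecast
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expiry_time, forecast)

    def get(self, store_id, item_id, horizon_days):
        key = (store_id, item_id, horizon_days)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the expensive call
        # Cache miss or expired entry: compute and remember the result.
        forecast = self._compute(store_id, item_id, horizon_days)
        self._entries[key] = (now + self._ttl, forecast)
        return forecast


calls = []


def slow_forecast(store_id, item_id, horizon_days):
    """Placeholder for loading a model and producing a forecast."""
    calls.append((store_id, item_id, horizon_days))
    return [0.0] * horizon_days


cache = ForecastCache(slow_forecast, ttl_seconds=60)
first = cache.get(4, "item-123", 7)    # computed
second = cache.get(4, "item-123", 7)   # served from the cache
third = cache.get(4, "item-123", 14)   # different horizon, computed
```

Because planners tend to stay with a few items and regions in a session, even a simple cache like this one avoids repeating most of the expensive work after the first selection. The time-to-live keeps stale forecasts from being served indefinitely.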


Next, let's look at the final example.

Example 3: Streamed classification

In this final example, you are working for a web-based company that wants to classify users based on their usage patterns as targets for different types of advertising, in order to more effectively target marketing spend. For example, if the user uses the site less frequently, we may want to entice them with more aggressive discounts. One of the key requirements from the business is that the end results become part of the data landed in a data store used by other applications.

Based on these requirements, your team determines that a streaming application is the simplest solution that ticks all the boxes. The data engineers focus their efforts on building the streaming and data store infrastructure, while the ML engineer works to wrap up the classification model the data science team has trained on historical data. The base algorithm that the data scientists settle on is implemented in sklearn, which we will work through below by applying it to a marketing dataset that would be similar to that produced in this use case.
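
As a rough picture of what wrapping the model for a stream means, consider the following sketch. It makes several assumptions that do not come from the book: the stream is simulated with a Python generator, the data store is a plain list, the event fields are invented, and ThresholdModel is a hypothetical stand-in that only mimics the predict method of an sklearn classifier.

```python
class ThresholdModel:
    """Hypothetical stand-in for the trained classifier (sklearn-style predict)."""

    def predict(self, rows):
        # One label per row: fewer than 5 visits counts as an infrequent user.
        return ["infrequent" if row[0] < 5 else "frequent" for row in rows]


def usage_events():
    """Simulated stream; a real source would be a message queue or similar."""
    yield {"user_id": "u1", "visits_last_30d": 2, "avg_session_minutes": 3.5}
    yield {"user_id": "u2", "visits_last_30d": 21, "avg_session_minutes": 12.0}
    yield {"user_id": "u3", "visits_last_30d": 4, "avg_session_minutes": 8.0}


def classify_stream(events, model, data_store):
    """Classify each event as it arrives and land the result in the store."""
    for event in events:
        features = [event["visits_last_30d"], event["avg_session_minutes"]]
        segment = model.predict([features])[0]
        data_store.append({"user_id": event["user_id"], "segment": segment})


data_store = []
classify_stream(usage_events(), ThresholdModel(), data_store)
segments = [row["segment"] for row in data_store]
```

The point of the shape is that the model sits behind a narrow predict interface, so the data engineers can change the stream and the store without touching the classification logic.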

This hypothetical example aligns with a lot of classic datasets, including the Bank Marketing dataset from the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#. The following example code uses this dataset. Remember that all of the following code is available in the book's GitHub repository as in the other examples:

  1. First, we will read in the data, which is stored in a folder labeled data in the same directory as the script we are building:
    import pandas as pd
    df = pd.read_csv('./data/bank/bank.csv', delimiter=';', decimal=',')
  2. Next, we define the features we would like to use in our model and define our feature matrix, X, and target variable vector, y. The target variable will be translated to a numerical value, 1, if the customer went with the proposed product, and 0 if they did not. Note that we assume the features have been selected in this case via robust exploratory data analysis, which is not covered here:
    cat_feature_cols = ["marital", "education", "contact", "default", "housing", "loan", "poutcome"]
    num_feature_cols = ["age", "pdays", "previous", "emp.var.rate", "euribor3m", "nr.employed"]
    feature_cols = cat_feature_cols + num_feature_cols
    X = df[feature_cols].copy()
    y = df['y'].apply(lambda x: 1 if x == 'yes' else 0).copy()
  3. Before moving on to modeling, we split the data into an 80/20 training and test split:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  4. We then perform some very basic feature engineering and preparation by one-hot encoding all of the categorical variables, being careful to only train the transformer on the training set:
    from sklearn.preprocessing import OneHotEncoder
    enc = OneHotEncoder(handle_unknown='ignore')
    X_train_cat_encoded = enc.fit_transform(X_train[cat_feature_cols])
    X_test_cat_encoded = enc.transform(X_test[cat_feature_cols])
  5. We then standardize the numerical variables in a similar way:
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_num_scaled = scaler.fit_transform(X_train[num_feature_cols])
    X_test_num_scaled = scaler.transform(X_test[num_feature_cols])
  6. We then have to bring the numerical and categorical data together into one set:
    import numpy as np
    X_train = np.concatenate((X_train_cat_encoded.toarray(), X_train_num_scaled), axis=1)
    X_test = np.concatenate((X_test_cat_encoded.toarray(), X_test_num_scaled), axis=1)
  7. Now we are ready for modeling. The dataset has imbalanced classes, so the data scientists have suggested that we use the SMOTE algorithm, which is contained within the imblearn package, to perform oversampling of the minority class. This creates a balanced classification dataset:
    from imblearn.over_sampling import SMOTE
    sm = SMOTE()
    # fit_resample replaces the older fit_sample method, which has been removed from imblearn
    X_balanced, y_balanced = sm.fit_resample(X_train, y_train)
  8. The core code that the data scientists created can now be applied. They come up with a series of different variants of code based around a simple random forest classification model:
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    # Define classifier
    rfc = RandomForestClassifier(n_estimators=1000)
    rfc.fit(X_balanced, y_balanced)

    When you run this code, you will find that the model performance could be improved. This, along with the need to streamline the preceding code, improve model scalability, and build a solution that can interact with the streaming pipeline, will be the focus of the ML engineer's work for this project. There will also be some subtleties around how often you want to retrain your algorithm to make sure that the classifier does not go stale. We will discuss all of these topics later in this book. Taken together, the outline of the processing steps needed in the solution gives a high-level system design like that in Figure 1.9:

Figure 1.9 – Example 3 workflow
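
The code above imports f1_score but stops at fitting the model, so it is worth spelling out what that metric measures before judging performance. The sketch below is a pure-Python illustration with made-up labels and a hypothetical function name. In practice, you would more likely call sklearn's f1_score on y_test and the predictions of rfc for X_test.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels: 1 means the customer took the product, 0 means they did not.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

On an imbalanced dataset, a metric like F1 for the minority class is usually more informative than plain accuracy, since a model that always predicts the majority class can still score a high accuracy. Note that the evaluation should use the untouched test set and not the SMOTE-balanced training data.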


We have now explored three high-level ML system designs and discussed the rationale behind our workflow choices. We have also explored in detail the sort of code that would often be produced by data scientists working on modeling, but which would act as input to future ML engineering work. This section should therefore have given us an appreciation of where our engineering work begins in a typical project and what types of problems we will be aiming to solve. And there you go. You are already on your way to becoming an ML engineer!

 

Summary

In this chapter, we have introduced the idea of ML engineering and how that fits within a modern team building valuable solutions based on data. There was a discussion of how the focus of ML engineering is complementary to the strengths of data science and data engineering and where these disciplines overlap. Some comments were made about how to use this information to assemble an appropriately resourced team for your projects.

The challenges of building machine learning products in modern real-world organizations were then discussed, along with pointers to help you overcome some of these challenges. In particular, the importance of reasonably estimating value and of communicating effectively with your stakeholders was emphasized.

This chapter then rounded off with a taster of the technical content to come in later chapters, in particular, through a discussion of what typical ML solutions look like and how they should be designed (at a high level) for some common use cases.

The next chapter will focus on how to set up and implement your development processes to build the ML solutions you want and provide some insight as to how this is different from standard software development processes. Then there will be a discussion of some of the tools you can use to start managing the tasks and artifacts from your projects without creating major headaches. This will set you up for the technical details of how to build the key elements of your ML solutions in later chapters.

About the Author

  • Andrew P. McMahon

    Andrew Peter (Andy) McMahon is a machine learning engineer and data scientist with experience of working in, and leading, successful analytics and software teams. His expertise centers on building production-grade ML systems that can deliver value at scale. He is currently ML Engineering Lead at NatWest Group and was previously Analytics Team Lead at Aggreko.

    He has an undergraduate degree in theoretical physics from the University of Glasgow, as well as master's and Ph.D. degrees in condensed matter physics from Imperial College London. In 2019, Andy was named Data Scientist of the Year at the International Data Science Awards. He currently co-hosts the AI Right podcast, discussing hot topics in AI with other members of the Scottish tech scene.

