The AI Product Manager's Handbook: Develop a product that takes advantage of machine learning to solve AI problems
By Irene Bratsis
Published February 28, 2023 · 250 pages · 1st Edition · ISBN-13: 9781804612934

The AI Product Manager's Handbook

Understanding the Infrastructure and Tools for Building AI Products

Laying a solid foundation is an essential part of understanding anything, and the frontier of artificial intelligence (AI) products seems a lot like our universe: ever-expanding. That rate of expansion increases with every passing year as we go deeper into a new way of conceptualizing products, organizations, and the industries we’re all a part of. Virtually every aspect of our lives will be impacted in some way by AI, and we hope readers will come out of this experience more confident about what AI adoption will look like for the products they support or hope to build someday.

Part 1 of this book will serve as an overview of the lay of the land. We will cover terms, infrastructure, types of AI algorithms, and products done well, and by the end of this section, you will understand the various considerations when attempting to build an AI strategy, whether you’re looking to create a native-AI product or add AI features to an existing product.

Managing AI products is a highly iterative process, and the work of a product manager is to help your organization discover what the best combination of infrastructure, training, and deployment workflow is to maximize success in your target market. The performance and success of AI products lie in understanding the infrastructure needed for managing AI pipelines, the outputs of which will then be integrated into a product. In this chapter, we will cover everything from databases to workbenches to deployment strategies to tools you can use to manage your AI projects, as well as how to gauge your product’s efficacy.

This chapter will serve as a high-level overview of the subsequent chapters in Part 1 but it will foremost allow for a definition of terms, which are quite hard to come by in today’s marketing-heavy AI competitive landscape. These days, it feels like every product is an AI product, and marketing departments are trigger-happy with sprinkling that term around, rendering it almost useless as a descriptor. We suspect this won’t be changing anytime soon, but the more fluency consumers and customers alike have with the capabilities and specifics of AI, machine learning (ML), and data science, the more we should see clarity about how products are built and optimized. Understanding the context of AI is important for anyone considering building or supporting an AI product.

In this chapter, we will cover the following topics:

  • Definitions – what is and is not AI
  • ML versus DL – understanding the difference
  • Learning types in ML
  • The order – what is the optimal flow and where does every part of the process live?
  • DB 101 – databases, warehouses, data lakes, and lakehouses
  • Managing projects – IaaS
  • Deployment strategies – what do we do with these outputs?
  • Succeeding in AI – how well-managed AI companies do infrastructure right
  • The promise of AI – where is AI taking us?

Definitions – what is and is not AI

In 1950, the mathematician and World War II hero Alan Turing asked a simple question in his paper Computing Machinery and Intelligence: can machines think? Today, we’re still grappling with that same question. Depending on who you ask, AI can be many things. Many maps of the field exist on the internet, ranging from expert systems used in healthcare and finance to facial recognition, natural language processing, and regression models. As we continue with this chapter, we will cover many of the facets of AI that apply to products emerging in the market.

For the purposes of applied AI in products across industries, in this book, we will focus primarily on ML and deep learning (DL) models used in various capacities because these are often used in production anywhere AI is referenced in any marketing capacity. We will use AI/ML as a blanket term covering a span of ML applications and we will cover the major areas most people would consider ML, such as DL, computer vision, natural language processing, and facial recognition. These are the methods of applied AI that most people will come across in the industry, and familiarity with these applications will serve any product manager looking to break into AI. If anything, we’d like to help anyone who’s looking to expand into the field from another product management background to choose which area of AI appeals to them most.

We’d also like to cover what is and what isn’t ML. The best way for us to express it as simply as we can is: if a machine is learning from some past behavior and if its success rate is improving as a result of this learning, it is ML! Learning is the active element. No models are perfect but we do learn a lot from employing models. Every model will have some element of hyperparameter tuning, and the use of each model will yield certain results in performance. Data scientists and ML engineers working with these models will be able to benchmark performance and see how performance is improving. If there are fixed, hardcoded rules that don’t change, it’s not ML.
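
The contrast can be sketched in a few lines of plain Python. This is a deliberately toy example (the length-based “feature” and the sample data are invented for illustration): the first function is fixed, hardcoded logic, while the second derives its decision threshold from past labeled examples, so retraining on fresh data changes its future behavior.

```python
# A hardcoded rule: the logic never changes, so this is NOT ML.
def rule_based_spam_check(subject: str) -> bool:
    return "free money" in subject.lower()

# A toy "learning" approach: the threshold is derived from historical
# labeled examples, so re-fitting on new data changes future predictions.
def fit_length_threshold(examples):
    """examples: list of (subject_length, is_spam) pairs.
    Learns a length threshold from past data."""
    spam = [n for n, is_spam in examples if is_spam]
    ham = [n for n, is_spam in examples if not is_spam]
    # Midpoint between the average spam and ham subject lengths
    return (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2

def learned_spam_check(subject: str, threshold: float) -> bool:
    return len(subject) > threshold

# Hypothetical history: spam subjects tended to be longer
history = [(60, True), (70, True), (20, False), (30, False)]
threshold = fit_length_threshold(history)  # midpoint of 65 and 25 -> 45.0
```

The rule-based check will behave identically forever; the learned threshold shifts whenever the historical data changes, which is the learning from past behavior that makes something ML.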

AI is a subset of computer science, and all programmers are effectively doing just that with computers: giving them a set of instructions to fire away on. If your current program doesn’t learn from the past in any way, if it simply executes on directives it was hardcoded with, we can’t call it ML. You may have heard the terms rules-based engine or expert system thrown around. These are considered forms of AI, but they’re not ML, because the rules effectively replicate the work of a person, and the system itself is not learning or changing on its own.

We find ourselves in a tricky time in AI adoption where it can be very difficult to find information online about what makes a product AI. Marketing is eager to add the AI label to their products but there still isn’t a baseline of explainability with what that means out in the market. This further confuses the term AI for consumers and technologists alike. If you’re confused by the terms, particularly when they’re applied to products you see promoted online, you’re very much not alone.

Another area of confusion is the general term AI itself. For most people, the concept of AI brings to mind the Terminator franchise from the 1980s and other futurist depictions of inescapable technological destruction. While there certainly can be a lot of harm to come from AI, this depiction represents what’s referred to as strong AI or artificial general intelligence (AGI). We still have a ways to go before anything like AGI, but we’ve got plenty of what’s referred to as artificial narrow intelligence or narrow AI (ANI).

ANI is also commonly called weak AI and is what’s generally meant when you see AI plastered all over products you find online. ANI is exactly what it sounds like: a narrow application of AI. Maybe it’s good at talking to you, at predicting some future value, or at organizing things; maybe it’s an expert at that, but its expertise won’t bleed into other areas. If it could, it would stop being ANI. These major areas of AI are referred to as strong and weak in comparison to human intelligence. Even the most convincing conversational AIs out there, and they are quite convincing, are demonstrating an illusory intelligence. Effectively, all AI that exists at the moment is weak AI, or ANI. Our Terminator days are still firmly in our future, perhaps never to be realized.

For every person out there that’s come across Reddit threads about AI being sentient or somehow having ill will toward us, we want to make the following statement very clear. AGI does not exist and there is no such thing as sentient AI. This does not mean AI doesn’t actively and routinely cause humans harm, even in its current form. The major caveat here is that unethical, haphazard applications of AI already actively cause us both minor inconveniences and major upsets. Building AI ethically and responsibly is still a work in progress. While AI systems may not be sentiently plotting the downfall of humanity, when they’re left untested, improperly managed, and inadequately vetted for bias, the applications of ANI that are deployed already have the capacity to do real damage in our lives.

For now, can machines think like us? No, they don’t think like us. Will they someday? We hope not. It’s our personal opinion that the insufferable aspects of the human condition should end with us. But we do very much believe that some of our greatest ailments, as well as our wildest curiosities, will be impacted considerably by the benevolence of AI and ML.

ML versus DL – understanding the difference

As a product manager, you’re going to need to build a lot of trust with your technical counterparts so that, together, you can build an amazing product that works as well as it technically can. If you’re reading this book, you’ve likely come across the terms ML and DL. We will use the following sections, titled ML and DL, to go over some of the basics, but keep in mind that we will elaborate on these concepts further in Chapter 3.

ML

In its basic form, ML is made up of two essential components: the models used and the training data they learn from. That data consists of historical data points that effectively teach machines a baseline foundation from which to learn, and every time you retrain the models, they are theoretically improving. How the models are chosen, built, tuned, and maintained for optimized performance is the work of data scientists and ML engineers. Using this knowledge of performance toward the optimization of the product experience itself is the work of product managers. If you’re working in the field of AI product management, you’re working incredibly closely with your data science and ML teams.

We’d like to also make a distinction about the folks you’ll be working with as an AI product manager. Depending on your organization, you’re either working with data scientists and developers to deploy ML or you’re working with ML engineers who can both train and upkeep the models as well as deploy them into production. We highly suggest maintaining strong relationships with any and all of these impacted teams, along with DevOps.

All ML models can be grouped into the following four major learning categories:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

These are the four major areas of ML and each area is going to have its particular models and algorithms that are used in each specialization. The learning type has to do with whether or not you’re labeling the data and the method you’re using to reward the models you’ve used for good performance. These learning types are relevant whether your product is using a DL model or not, so they’re inclusive of all ML models. We will be covering the learning types in more depth in the following section titled Learning types in ML.

DL

DL is a subset of ML, but the terms are often used colloquially as almost separate expressions. The reason for this is DL is based on neural network algorithms and ML can be thought of as… the rest of the algorithms. In the preceding section covering ML, we looked at the process of taking data, using it to train our models, and using that trained model to predict new future data points. Every time you use the model, you see how off it was from the correct answer by getting some understanding of the rate of error so you can iterate back and forth until you have a model that works well enough. Every time, you are creating a model based on data that has certain patterns or features.
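
The train-measure-iterate loop described above can be sketched in plain Python with the simplest possible model: a line through the origin fit by gradient descent. The data and learning rate here are invented for illustration:

```python
# Fit y = w * x to toy data by repeatedly measuring the rate of error
# and nudging the weight to reduce it.
def train(xs, ys, lr=0.01, steps=500):
    w = 0.0  # start with an untrained model
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # iterate: adjust the model to shrink the error
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # the underlying pattern is y = 2x
w = train(xs, ys)  # converges toward w = 2.0
```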

This process is the same in DL, but one of the key differences is that patterns or features in your data are largely picked up by the DL algorithm itself, through what’s referred to as feature learning (as opposed to manual feature engineering), via a hierarchical, layered system. We will go into the various algorithms that are used in the following section because there are a few nuances between them, but as you continue developing your understanding of the types of ML out there, you’ll also start to group the various models that make up these major areas of AI (ML and DL). For marketing purposes, you will for the most part see terms such as ML, DL/neural networks, or just the general umbrella term of AI referenced where DL algorithms are used.

It’s important to know the difference between what these terms mean in practice and at the model level and how they’re communicated by non-technical stakeholders. As product managers, we are toeing the line between the two worlds: what engineering is building and what marketing is communicating. Anytime you’ve heard the term black box model, it’s referring to a neural network model, which is DL. The reason for this is that DL engineers often can’t determine how their models are arriving at certain conclusions, which creates an opaque view of what the model is doing. This opacity is double-sided, affecting both the engineers and technologists themselves and the customers and users downstream, who experience the effects of these models without knowing how they make certain determinations. DL neural networks mimic the structure of the way humans think using a variety of layers of artificial neurons.

For product managers, DL poses a concern for explainability because there’s very little we can understand about how and why a model is arriving at conclusions, and, depending on the context of your product, the importance of explainability could vary. Another inherent challenge is that these models essentially learn autonomously: rather than waiting for an engineer to choose the features that are most relevant in the data, the neural networks themselves do the feature selection, with very little input from an engineer. Think of the models as the what and the following section on learning types as the how. A quick reminder that as we move on to cover the learning styles (whether a model is used in a supervised, unsupervised, semi-supervised, or reinforcement learning capacity), these learning styles apply to both DL and traditional ML models.

Let’s look at the different learning types in ML.

Learning types in ML

In this section, we will cover the differences between supervised, unsupervised, semi-supervised, and reinforcement learning and how all these learning types can be applied. Again, the learning type has to do with whether or not you’re labeling the data and the method you’re using to reward the models you’ve used for good performance. The ultimate objective is to understand what kind of learning model gets you the kind of performance and explainability you’re going to need when considering whether or not to use it in your product.

Supervised learning

If humans are labeling the data and the machine is looking to also correctly label current or future data points, it’s supervised learning. Because we humans know the answer the machines are trying to arrive at, we can see how off they are from finding the correct answer, and we continue this process of training the models and retraining them until we find a level of accuracy that we’re happy with.

Applications of supervised learning models include classification models that are looking to categorize data in the way spam filters do or regression models that are looking for relationships between variables in order to predict future events and find trends. Keep in mind that all models will only work to a certain point, which is why they require constant training and updating and AI teams are often using ensemble modeling or will try various models and choose the best-performing one. It won’t be perfect either way, but with enough hand-holding, it will take you closer and closer to the truth.

The following is a list of common supervised learning models/algorithms you’ll likely use in production for various products:

  • Naive Bayes classifier: This algorithm naively considers every feature in your dataset as its own independent variable. So, it’s essentially trying to find associations probabilistically without having any assumptions about the data. It’s one of the simpler algorithms out there and its simplicity actually is what makes it so successful with classification. It’s commonly used for binary values such as trying to decipher whether or not something is spam.
  • Support vector machine (SVM): This algorithm is also largely used for classification problems and will essentially try to split your dataset into two classes so that you can use it to group your data and try to predict where future data points will land along these major splits. If you’re not seeing compelling groups between the data, SVMs allow you to add more dimensions to be able to see groupings easier.
  • Linear regression models: These are among the oldest and simplest models we have for regression problems such as predicting future data points. They essentially use one or more variables in your dataset to predict your dependent variable. The linear part of this model is trying to find the best line to fit your data, and this line is what dictates how it predicts. Here, we once again see a relatively simple model being heavily used because of how versatile and dependable it is.
  • Logistic regression: This model works a lot like linear regression in that you have independent and dependent variables, but it’s not predicting a numerical value; it’s predicting a future binary categorical state such as whether or not someone might default on a loan in the future, for instance.
  • Decision trees: This algorithm works well for predicting both categorical and numerical values, so it’s used for both kinds of ML problems, such as predicting a future state or a future price. Models that handle both kinds of problems are less common, and this versatility has contributed to the decision tree’s popularity. Its comparison to a tree comes from the nodes and branches that effectively function like a flow chart. The model learns from the flow of past data to predict future values.
  • Random forest: This algorithm builds on decision trees and is also used for both categorical and numerical problems. The way it works is it splits the data into different random samples, creates decision trees for each sample, and then takes an average or majority vote for its predictions (depending on whether you’re using it for categorical or numerical predictions). It’s hard to understand how a random forest comes to its conclusions, so you can use it if interpretability isn’t high on your list of concerns.
  • K-nearest neighbors (KNN): This algorithm works on both categorical and numerical predictions, so it’s looking for a future state, and it offers results based on groups. The number of neighbors considered (the k) is set by the engineer/data scientist, and the model works by finding the data points closest to a new example, determining the characteristics that example shares with its neighbors, and giving its best guess for future values based on those neighbors.

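KNN, for example, is simple enough to write from scratch. The following is an illustrative sketch rather than production code, with a toy two-feature dataset invented for the example:

```python
from collections import Counter

# Classify a query point by majority vote among its k nearest neighbors.
def knn_predict(train_points, train_labels, query, k=3):
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy dataset: two obvious groups in a 2-D feature space
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["low", "low", "low", "high", "high", "high"]
```

A query near (2, 2) would be voted "low" by its three nearest neighbors, while one near (8, 7) would be voted "high".
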
Now that we’ve covered supervised learning, let’s discuss unsupervised learning next.

Unsupervised learning

If the data is unlabeled and we’re using machines to label the data and find patterns we don’t yet know of, it’s unsupervised. Effectively, we humans either know the right answer or we don’t, and that’s how we decipher which camp the ML algorithms belong to. As you might imagine, we take the results of unsupervised learning models with some hesitancy because a model may find structure that isn’t actually helpful or accurate. Unsupervised learning models also require large amounts of data to train on because the results can be wildly inaccurate if the model is trying to find patterns in a small data sample. As it ingests more and more data, its performance will improve and become more refined over time, but once again, there is no correct answer.

Applications of unsupervised learning models include clustering and dimensionality reduction. Clustering models segment or group data into certain areas. These can be used for things such as looking for patterns in medical trials or drug discovery, for instance, because you’re looking for connections and groups of data where there might not already be obvious answers. Dimensionality reduction essentially removes the features in your dataset that contribute less to the performance you’re looking for and will simplify your data so that your most important features will best improve your performance to separate real signals from the noise.

The following is a list of common unsupervised learning models/algorithms you’ll likely use in production for various products:

  • K-means clustering: This algorithm will group data points together to better see patterns (or clusters), but it’s looking for some optimal number of clusters as well. This is unsupervised learning, so the model is looking to find patterns that it can learn from because it’s not given any information (or supervision) to go off from the engineer that’s using it. Also, the number of clusters assigned is a hyperparameter and you will need to choose what number of clusters is optimal.
  • Principal component analysis (PCA): Often, the largest problem with using unsupervised ML on very large datasets is there’s actually too much uncorrelated data to find meaningful patterns. This is why PCA is used so often because it’s a great way to reduce dimension without actually losing or discarding information. This is especially useful for massive datasets such as finding patterns in genome sequencing or drug discovery trials.
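
K-means is likewise compact enough to sketch from scratch. This toy 1-D version (the data is invented) alternates the two steps that define the algorithm: assign each point to its nearest center, then move each center to the mean of its cluster:

```python
import random

def kmeans_1d(values, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(values, k)  # random initial centers
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]  # two well-separated groups
centers = kmeans_1d(data)  # settles near 1.0 and 9.07
```

Note the hyperparameter k mentioned above: the caller must choose the number of clusters, because the model receives no supervision telling it how many groups exist.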

Next, let’s jump into semi-supervised learning.

Semi-supervised learning

In a perfect world, we’d have massive well-labeled datasets with which to create optimal models that don’t overfit. Overfitting is when you create and tune a model to the dataset you have but it fits a bit too well, which means it’s optimized for that particular dataset and doesn’t work well with more diverse data. This is a common problem in data science. We live in an imperfect world and we can find ourselves in situations where we don’t have enough labeled data or enough data at all. This is where semi-supervised learning comes in handy. We give some labeled datasets and also include a dataset that is unlabeled to essentially give the model nudges in the right direction as it tries to come up with its own semblance of finding patterns.

It doesn’t quite have the same level of absolute truth associated with supervised learning, but it does offer the models some helpful clues with which to organize its results so that it can find an easier path to the right answer.

For instance, let’s say you’re looking for a model that works well with detecting patterns in photos or speech. You might label a few of them and then see how the performance improves over time with the examples you don’t label. You can use multiple models in semi-supervised learning. The process is a lot like supervised learning, which learns from labeled datasets so that it knows exactly how off it is from being correct. The main difference is that in semi-supervised learning, you’re predicting labels for a portion of the new unlabeled data, checking those predictions against the labeled data, and then adding the new data points the model got right into the training set so that it trains on them.
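
The pseudo-labeling loop just described can be sketched as follows. The "model" here is a deliberately simple nearest-class-mean classifier on 1-D toy data (all values invented); a real pipeline would use a proper model and confidence measure:

```python
# Predict the class whose mean is closest to x
def nearest_mean_label(means, x):
    return min(means, key=lambda label: abs(means[label] - x))

def self_train(labeled, unlabeled, confidence=2.0, rounds=3):
    """labeled: dict mapping value -> label; unlabeled: list of values.
    Repeatedly fit on the labeled set, then absorb only the unlabeled
    points the model is confident about."""
    labeled = dict(labeled)
    for _ in range(rounds):
        # Fit: one mean per class from the currently labeled data
        means = {}
        for lab in set(labeled.values()):
            pts = [x for x, l in labeled.items() if l == lab]
            means[lab] = sum(pts) / len(pts)
        # Pseudo-label points close to a class mean (high confidence)
        for x in list(unlabeled):
            lab = nearest_mean_label(means, x)
            if abs(means[lab] - x) < confidence:
                labeled[x] = lab
                unlabeled.remove(x)
    return labeled

result = self_train({1.0: "a", 9.0: "b"}, [1.5, 2.0, 8.5, 8.0, 5.2])
```

The ambiguous point 5.2 never gets close enough to either class mean, so it is left unlabeled rather than being absorbed on a low-confidence guess.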

Finally, to wrap up this section, let’s take a brief look at reinforcement learning.

Reinforcement learning

This area of ML effectively learns with trial and error, so it’s learning from past behavior and adapting its approach to finding the best performance by itself. There’s a sequence to reinforcement learning and it’s really a system based on weights and rewards to reinforce correct results. Eventually, the model tries to optimize for these rewards and gets better with time. We see reinforcement learning used a lot with robotics, for instance, where robots are trained to understand how to operate and adjust to the parameters of the real world with all its unpredictability.
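
The trial-and-error idea can be sketched with the classic two-armed bandit: an agent picks an arm, receives a reward, and updates its value estimates, mostly exploiting what it has learned while occasionally exploring. The reward probabilities here are invented for illustration:

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=42):
    random.seed(seed)
    values = [0.0] * len(reward_probs)  # estimated value of each arm
    counts = [0] * len(reward_probs)    # times each arm was pulled
    for _ in range(steps):
        if random.random() < epsilon:   # explore: try a random arm
            arm = random.randrange(len(reward_probs))
        else:                           # exploit: use the best arm so far
            arm = max(range(len(reward_probs)), key=lambda a: values[a])
        reward = 1.0 if random.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental average: nudge the estimate toward the new reward
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = run_bandit([0.2, 0.8])  # arm 1 pays off far more often
```

Over time the agent's estimates converge toward the true payout rates and it pulls the better arm far more often, which is the reward-driven optimization described above.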

Now that we’ve discussed and understood the different ML types, let’s move on and understand the optimal flow of the ML process.

The order – what is the optimal flow and where does every part of the process live?

Companies interested in creating value with AI/ML have a lot to gain compared to their more hesitant competitors. According to McKinsey Global Institute, “Companies that fully absorb AI in their value-producing workflows by 2025 will dominate the 2030 world economy with +120% cash flow growth.” The undertaking of embracing AI and productionizing it – whether in your product or for internal purposes – is complex, technical debt-heavy, and expensive. Once your models and use cases are chosen, making that happen in production becomes a difficult program to manage and this is a process many companies will struggle with as we see companies in industries other than tech starting to take on the challenge of embracing AI. Operationalizing the process, updating the models, keeping the data fresh and clean, and organizing experiments, as well as validating, testing, and the storage associated with it, are the complicated parts.

In an effort to make this entire process more digestible, we’re going to present this as a step-by-step process because there are varying layers of complexity but the basic components will be the same. Once you have gotten through the easy bit and you’ve settled on the models and algorithms you feel are optimal for your use case, you can begin to refine your process for managing your AI system.

Step 1 – Data availability and centralization

Essentially, you’ll need a central place to store the data that your AI/ML models and algorithms will be learning from. Depending on the databases you invest in or legacy systems you’re using, you might have a need for an ETL pipeline and data engineering to make the layers of data and metadata available for your productionized AI/ML models to ingest and offer insights from. Think of this as creating the pipeline needed to feed your AI/ML system.

AI feeds on data, and if your system of delivering data is clunky or slow, you’ll run into issues in production later. Choosing your preferred way of storing data is tricky in and of itself. You don’t know how your tech stack will evolve as you scale, so choosing a cost-effective and reliable solution is a mission in and of itself. For example, as we started to add more and more customers at a cybersecurity company we previously worked for, we noticed the load time for certain customer-facing dashboards was lagging. Part of the issue was that the number of customers, and the volume of their metadata, had grown too large for the pipelines we already had in place.
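
The pipeline idea in this step can be sketched end to end with the standard library alone. The CSV content, table name, and schema below are invented for illustration:

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (a file or API in practice)
raw_csv = io.StringIO(
    "user_id,signup_date,country\n"
    "1,2023-01-05,US\n"
    "2,2023-01-06,de\n"
)
rows = list(csv.DictReader(raw_csv))

# Transform: normalize values so downstream models see clean data
for row in rows:
    row["country"] = row["country"].upper()

# Load: centralize into a store the AI/ML system can query
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER, signup_date TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (:user_id, :signup_date, :country)", rows
)
countries = [c for (c,) in conn.execute(
    "SELECT country FROM users ORDER BY user_id")]
```

Real pipelines swap the in-memory pieces for production systems, but the extract-transform-load shape stays the same.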

Step 2 – Continuous maintenance

At this point, you have your models and algorithms and you’ve chosen a system for delivering data to them. Now, you’re going to be in the flow of constantly maintaining this system. In DevOps, this is referred to as continuous integration (CI)/continuous delivery (CD). In the later chapters, we will cover the concept of AI Operations (AIOps) but for now, the following is a list of the stages tailored for the continuous maintenance of AI pipelines. The following are the four major components of the continuous maintenance process:

  • CI: Testing and validating code and components, along with data, data schemas, and models
  • CD: Code changes or updates to your model are delivered continuously, so that once you’ve made changes, they are slated to appear in the testing environment before going to production without pauses
  • CT (continuous training): We’ve mentioned the idea of continuous learning being important for ML, and continuous training productionizes this process so that as your data feeds are refreshed, your models are consistently training and learning from that new data
  • CM (continuous monitoring): We can’t have ML/AI models continuously running without also continuously monitoring them to make sure something isn’t going horribly wrong

You can’t responsibly manage an AI program if you aren’t iterating your process constantly. Your models and hyperparameters will become stale. Your data will become stale and when an iterative process like this stagnates, it will stop being effective. Performance is something you’ll constantly be staying up to date on because the lack of performance will be self-evident, whether it is client-facing or not. With that said, things can also go wrong. For example, lags in performance or in the frequency of the model updating can lead to people losing their jobs, not getting a competitive rate on a mortgage, or getting an unfair prison sentence. Major consequences can arise from downstream effects due to improper model maintenance. We recommend exploring the Additional resources section at the end of this chapter for more examples and information on how stagnant AI systems can wreak havoc on environments and people.
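
As a flavor of what continuous monitoring can look like in code, here is a minimal drift check: compare a live feature's mean against the baseline the model was trained on and raise a flag when it moves past a tolerance. The metric and threshold are illustrative; production systems use richer statistics:

```python
def drift_alert(baseline_mean, live_values, tolerance=0.25):
    """Flag when the live feature mean drifts more than `tolerance`
    (as a fraction) from the training-time baseline, suggesting the
    model may need retraining."""
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - baseline_mean) / abs(baseline_mean) > tolerance
```

A check like this, run on a schedule against each model input, is one small piece of the CM stage described above.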

DB 101 – databases, warehouses, data lakes, and lakehouses

AI/ML products run on data. Where and how you store your data is a big consideration that impacts your AI/ML performance, and in this section, we will go through some of the most popular storage vehicles for your data. Figuring out the optimal way to store, access, and train on your data is a specialization in and of itself, but if you’re in the business of AI product management, eventually you’re going to need to understand the basic building blocks of what makes your AI product work. In a few words, data does.

Because AI requires big data, this is going to be a significant strategic decision for your product and business. If you don’t have a well-oiled machine, pun intended, you’re going to run into snags that will impair the performance of your models and, by extension, your product itself. Having a good grasp of the most cost-effective and performance-driven solution for your particular product, and finding the balance within these various facets, is going to help your success as a product manager. Yes, you will depend on your technical executives for a lot of these decisions, but you’ll be at the table helping make these decisions, so some familiarity is needed here.

Let’s look at some of the different options to store data for AI/ML products.

Database

Depending on your organization’s goals and budget, you’ll be centralizing your data somehow between a data lake, a database, and a data warehouse, and you might even be considering a new option: the data lakehouse. If you’re just getting your feet wet, you’re likely storing your data in a relational database so that you can access it and query it easily. Databases are a great way to do this if you have a relatively simple setup. With a relational database, you’re operating under a particular schema; if you wanted to combine this data with data that’s in another database, you would run into problems aligning the two schemas later.

If your primary use of the database is querying to access data and use only a certain subset of your company’s data for general trends, a relational database might be enough. If you’re looking to combine various datasets from disparate areas of your business and you’re looking to accomplish more advanced analytics, dashboards, or AI/ML functions, you’ll need to read on.
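To make this concrete, here’s a minimal sketch of the kind of simple querying a relational database handles well. It uses Python’s built-in sqlite3 module as a stand-in for any relational store; the table and column names are purely illustrative.

```python
import sqlite3

# In-memory SQLite database as a stand-in for any relational store;
# the "events" table and its columns are purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "login"), (1, "purchase"), (2, "login")],
)

# A simple aggregate query - the kind of "general trends" question
# a relational database answers well.
rows = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
print(rows)  # [('login', 2), ('purchase', 1)]
```

Queries like this stay simple precisely because everything lives under one schema; the alignment problems mentioned above only appear once you try to join data governed by a different schema.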

Data warehouse

If you’re looking to combine data into a location where you can centralize it somewhere and you’ve got lots of structured data coming in, you’re more likely going to use a data warehouse. This is really the first step toward maturity because it will allow you to leverage insights and trends across your various business units quickly. If you’re looking to leverage AI/ML in various ways, rather than one specific specialized way, this will serve you well.

Let’s say, for example, that you want to add AI features to your existing product as well as within your HR function. You’d be leveraging your customer data to offer trends or predictions to your customers based on the performance of others in their peer group, as well as using AI/ML to make predictions or optimizations for your internal employees. Both these use cases would be well served with a data warehouse.

Data warehouses do, however, require some upfront investment to create a plan and design your data structures. They also require a significant ongoing investment because they make data available for analysis on demand, so you’re paying a premium for keeping that data readily available. Depending on how advanced your internal users are, you could opt for cheaper options, but a warehouse is optimal for organizations where most business users are looking for easily digestible ways to analyze data. Either way, a data warehouse will allow you to create dashboards for your internal users and stakeholder teams.

Data lake (and lakehouse)

If you’re sitting on lots of raw, unstructured data and you want a more cost-effective place to store it, you’d be looking at a data lake. Here, you can store unstructured, semi-structured, and structured data that can be easily accessed by your more tech-savvy internal users. Data scientists and ML engineers, for instance, can work with this data because they can create their own data models to transform and analyze it on the fly, but most companies don’t have many users with those skills.

Keeping your data in a data lake is cheap if you’ve got lots of data your business users don’t need immediately, but it won’t ever really replace a warehouse or a database. It’s more of a “nice to have.” If you’re sitting on a massive data lake of historical data you want to use for analytics in the future, you’ll need to consider another way to store it to get those insights.

You might also come across the term lakehouse. There are many databases, data warehouses, and data lakes out there; however, the best-known lakehouse has been popularized by a company called Databricks, which offers something like a data lake but with some of the capabilities you get with data warehouses, namely the ability to showcase data, make it available and ingestible for non-technical internal users, and create dashboards with it. The biggest advantage here is that you pay to store the data upfront while retaining the ability to access and manipulate it downstream.

Data pipelines

Regardless of the tech you use to maintain and store your data, you’re still going to need to set up pipelines to make sure your data is moving, that your dashboards are refreshing as readily as your business requires, and that data is flowing the way it needs to. There are also multiple ways of processing and passing data. You might be doing it in batches (batch processing) for large amounts of data moved at various intervals, or in real-time pipelines that deliver data as soon as it’s generated. If you’re looking to leverage predictive analytics, enable reporting, or have a system in place to move, process, and store data, a data pipeline will likely be enough. However, depending on what your data is doing and how much transformation is required, you’ll likely be using both data pipelines and, more specifically, ETL pipelines.
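The batch side of this can be sketched in a few lines. The following is a minimal, hypothetical illustration of grouping a stream of records into fixed-size batches, which is the core move of batch processing; a real pipeline would add scheduling, retries, and actual I/O on top of it.

```python
from typing import Iterable, Iterator, List


def batch(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group an incoming stream of records into fixed-size batches."""
    bucket: List[dict] = []
    for record in records:
        bucket.append(record)
        if len(bucket) == size:
            yield bucket
            bucket = []
    if bucket:  # flush the final partial batch
        yield bucket


# Hypothetical event records moving through a pipeline.
events = [{"id": i} for i in range(7)]
batches = list(batch(events, size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```

A real-time pipeline, by contrast, would process each record as it arrives instead of accumulating a bucket.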

ETL stands for extract, transform, and load. Your data engineers will create ETL pipelines for more advanced needs such as centralizing all your data in one place, adding data or enriching it, connecting your data with CRM (customer relationship management) tools, or transforming data and adding structure to it as it moves between systems. This transformation step is necessary when loading data into a database or data warehouse, which expect structured data; if you’re exclusively using a data lake, you’ll have all the metadata you need to analyze it and get your insights as you like. In most cases, if you’re working with an AI/ML product, you’re going to be working with a data engineer who will power the data flow needed to make your product a success, because you’re likely using a relational database as well as a data warehouse. The analytics required to enable AI/ML features will most likely be powered by a data engineer who focuses on the ETL pipeline.
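As a toy illustration of the extract, transform, and load steps just described, the sketch below reads rows from an in-memory CSV source, casts and enriches them, and loads them into a dict standing in for a warehouse table. All names and the "tier" enrichment rule are hypothetical.

```python
import csv
import io

# Extract: read raw rows from a CSV source (a string here, so the
# example is self-contained; normally this would be a file or API).
raw = "name,revenue\nAcme,1200\nGlobex,950\n"


def extract(source: str) -> list:
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows: list) -> list:
    """Cast types and enrich each row with a derived 'tier' field."""
    out = []
    for row in rows:
        revenue = int(row["revenue"])
        out.append({
            "name": row["name"],
            "revenue": revenue,
            "tier": "high" if revenue > 1000 else "standard",
        })
    return out


def load(rows: list) -> dict:
    """Load into the destination store (a dict keyed by name here)."""
    return {row["name"]: row for row in rows}


warehouse = load(transform(extract(raw)))
print(warehouse["Acme"]["tier"])  # high
```

Production ETL tools differ enormously in scale and tooling, but they follow this same three-stage shape.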

Managing and maintaining this system will also be the work of your data engineer, and we encourage every product manager to have a close relationship with the data engineer(s) supporting their products. One key difference between the two pipeline types is that ETL pipelines are generally updated in batches rather than in real time. If you’re using an ETL pipeline, for instance, to update daily historical information about how your customers use your product in order to offer client-facing insights in your platform, it might be optimal to keep this batch updating twice daily. However, if you need insights to come in real time for a dashboard used by internal business users who rely on that data to make daily decisions, you’ll likely need a data pipeline that’s updated continuously.

Now that we understand the different available options for storing data and how to choose the right option for the business, let’s discuss how to manage our projects.

Managing projects – IaaS

If you’re looking to create an AI/ML system in your organization, you’ll have to think about it as its own ecosystem that you’ll need to constantly maintain. This is why you see MLOps and AIOps teams working in conjunction with DevOps teams. We will also increasingly see managed services and infrastructure-as-a-service (IaaS) offerings. There has been a shift in the industry toward companies such as Determined AI and Google’s AI platform pipeline tools to meet the needs of the market. At the heart of this need is the desire to ease some of the burden on companies left scratching their heads as they begin the mammoth task of getting started with an AI system.

Just as DevOps teams became popular with at-scale software development, the result of decades of mistakes, we will see something similar with MLOps and AIOps. Developing a solution and putting it into operation are two different key areas that need to work together, and this is doubly true for AI/ML systems. The trend now is toward IaaS. This is an important concept to understand because companies just approaching AI often don’t appreciate the cost, storage, compute power, and investment required to do AI properly, particularly for deep learning (DL) projects that require massive amounts of data to train on.

At this point, most companies haven’t been running AI/ML programs for decades and don’t have dedicated teams. The MAANG tech companies (Meta, Amazon, Apple, Netflix, and Google) are setting the cultural norms for managing AI/ML, but most companies that will need to embrace AI are not in tech and are largely unprepared for the technical debt AI adoption will pose for their engineering teams to manage.

Shortcuts taken to get AI initiatives off the ground will require code refactoring or changes to how your data is stored and managed, which is why strategizing and planning for AI adoption is so crucial. This is also why so many IaaS services are popping up: they help keep engineering teams nimble should they require changes in the future. The infrastructure needed to keep AI teams up and running will change as time goes on, and the advantage of using an IaaS provider is that you can run all your projects and only pay for the time your AI developers are actually using data to train models.

Deployment strategies – what do we do with these outputs?

Once you’re happy with the models you’ve chosen (including their performance and error rate) and you’ve got a good level of infrastructure to support your product and your chosen AI model’s use case, you’re ready for the last step of the process: deploying this code into production. Keeping up with a deployment strategy that works for your product and organization will be part of the continuous maintenance we outlined in the previous section. You’ll need to think about things such as how often you’ll need to retrain your models and refresh your training data to prevent model decay and data drift. You’ll also need a system for continuously monitoring your model’s performance, so this process will be very specific to your product and business, particularly because these periods of retraining will require some downtime for your system.
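As one hedged illustration of monitoring for data drift, the sketch below compares a live feature’s mean against the training distribution. The scoring rule and the retraining threshold are illustrative policy choices, not a standard; production systems typically use richer tests such as the population stability index or a KS statistic.

```python
import statistics


def drift_score(train: list, live: list) -> float:
    """Crude drift signal: shift in the live mean, scaled by the
    training standard deviation. Purely illustrative."""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    return abs(statistics.mean(live) - mu) / sigma if sigma else 0.0


# Hypothetical feature values captured at training time vs. in production.
train_feature = [10.0, 11.0, 9.0, 10.5, 9.5]
live_feature = [14.0, 15.0, 13.5, 14.5, 14.0]

score = drift_score(train_feature, live_feature)
needs_retraining = score > 2.0  # the threshold is a hypothetical policy choice
print(needs_retraining)  # True
```

Whatever the exact statistic, the point is the same: drift checks give you an objective trigger for the retraining cadence discussed above, rather than waiting for users to notice degraded predictions.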

Deployment is going to be a dynamic process because your models are, for the most part, trying to make predictions about real-world data, so depending on what’s going on in the world of your data, you might have to give deployment more or less of your attention. For instance, when we were working for an ML property-tech company, we were updating, retraining, and redeploying our models almost daily because we worked with real estate data that was experiencing a huge skew due to rapid pandemic-driven changes in migration and housing price data. If those models had been left unchecked, and if there hadn’t been engineers and business leaders on both sides of this product, on the client’s end and internally, we might not have caught some of the egregious liberties the models were taking on the basis of unrepresentative data.

There are also a number of well-known deployment strategies you should be aware of. We will discuss them in the following subsections.

Shadow deployment strategy

In this deployment strategy (often referred to as shadow mode), you deploy a new model with new features alongside a model that already exists, so that the new model runs only as a shadow of the model currently in production. The new model handles all the same requests as the existing model, but its results are never shown to users. This strategy allows you to see whether the shadow model performs better on the same real-world data without interrupting the model that’s actually live in production. Once it’s confirmed that the new model performs better and runs without issues, it becomes the predominant model fully deployed in production and the original model is retired.
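A minimal sketch of shadow mode might look like the following, where a hypothetical shadow model scores every request but only the live model’s answer is ever returned to the caller.

```python
import logging

logging.basicConfig(level=logging.INFO)


# Two stand-in "models": the live one serves users, the shadow one is
# evaluated silently on the same traffic. Both are hypothetical.
def live_model(x: float) -> float:
    return x * 2.0


def shadow_model(x: float) -> float:
    return x * 2.1


def serve(x: float) -> float:
    live_pred = live_model(x)
    shadow_pred = shadow_model(x)  # computed but never returned
    # Log the disagreement so the shadow model's quality can be
    # evaluated offline against the same real-world traffic.
    logging.info("shadow diff: %.3f", abs(shadow_pred - live_pred))
    return live_pred  # users only ever see the live model's answer


result = serve(10.0)
print(result)  # 20.0
```

In a real system, the logged shadow predictions would be joined against ground-truth outcomes to decide whether the shadow model is ready to take over.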

A/B testing model deployment strategy

With this strategy, we deploy two slightly different models with different features concurrently to get a sense of how each performs in the live environment. The two models are set up at the same time and performance is optimized to reward conversion. This is effectively an experiment: you start with a hypothesis or expectation about which model will perform better, and then you test that hypothesis to see whether you were right. The differences between your models do, however, have to be slight, because if there’s too much variety between the features of the two, you won’t understand what’s creating the most success for you.
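A simple way to implement the split is deterministic routing by user ID, so each user consistently sees the same variant. The two "models" below are hypothetical stand-ins that differ by one small change, as the text recommends.

```python
# Two hypothetical model variants differing by one small change.
def model_a(x: float) -> float:
    return x + 1.0


def model_b(x: float) -> float:
    return x + 1.1


def route(user_id: int, x: float) -> tuple:
    """Deterministically split users 50/50 between the two variants,
    so a given user always sees the same model."""
    variant = "A" if user_id % 2 == 0 else "B"
    pred = model_a(x) if variant == "A" else model_b(x)
    return variant, pred


variant, pred = route(user_id=4, x=2.0)
print(variant, pred)  # A 3.0
```

Conversion (or whatever success metric you chose) would then be tracked per variant, and the hypothesis tested with a standard significance test before promoting a winner.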

Canary deployment strategy

Here, we see a more gradual approach to deployment in which you create subsets of users who will experience your new model, with the number of users exposed to the new model gradually increasing over time. This means you can have buffer time between groups of users to understand how they’re reacting to and interacting with the new model. Essentially, you’re using successive groups of your own users as testers before you release to a new batch, so you can catch bugs more gradually as well. It’s a slow but rewarding process if you have the patience and courage.
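Canary rollouts are often implemented with stable hash-based bucketing, so that raising the rollout percentage widens the exposed group without reshuffling users who already saw the new model. A sketch, with all names hypothetical:

```python
import hashlib


def in_canary(user_id: str, rollout_pct: int) -> bool:
    """Stable hash-based bucketing: a user stays in the same bucket
    as the rollout percentage grows from, say, 10% toward 100%."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct


# At a 10% rollout, only a small, stable slice of users sees the new
# model; raising rollout_pct widens the slice without ejecting anyone.
users = [f"user-{i}" for i in range(1000)]
canary_users = [u for u in users if in_canary(u, rollout_pct=10)]
print(len(canary_users))  # roughly 100 of the 1000 users
```

The buffer time between rollout increments is where you watch metrics and bug reports before widening the slice again.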

There are more strategies to choose from, but keep in mind that the selection of a strategy will depend on the nature of your product, what’s most important to your customers and users, your budget, your metrics and performance monitoring, your technical capacity and knowledge, and the timeline you have. Beyond your deployment, you’re going to have to help your business understand how often they should be doing code refactoring and branching as well.

Now that we’ve discussed the different deployment strategies, let’s see what it takes to succeed in AI.

Succeeding in AI – how well-managed AI companies do infrastructure right

It’s indicative of the complexity of ML systems that many large technology companies that depend heavily on ML have dedicated teams and platforms that focus on building, training, deploying, and maintaining ML models. The following are a few examples of options you can take when building an ML/AI program:

  • Databricks has MLflow: MLflow is an open source platform developed by Databricks to help manage the complete ML life cycle for enterprises. It allows you to run experiments and work with any library, framework, or language. The main benefits are experiment tracking (so you can see how your models are doing between experiments), model management (to manage all versions of your model between teammates), and model deployment (to get a quick view of deployment status within the tool).
  • Google has TensorFlow Extended (TFX): This is Google’s newest product built on TensorFlow and it’s an end-to-end platform for deploying production-level ML pipelines. It allows you to collaborate within and between teams and offers robust capabilities for scalable, high-performance environments.
  • Uber has Michelangelo: Uber is a great example of a company creating their own ML management tool in-house for collaboration and deployment. Earlier, they were using disparate languages, models, and algorithms and had teams that were siloed. After they implemented Michelangelo, they were able to bring in varying skill sets and capabilities under one system. They needed one place for a reliable, recreatable, and standardized pipeline to create, manage, predict, and deploy their data at scale.
  • Meta has FBLearner Flow: Meta also created its own system for managing its numerous AI projects. Since ML is such a foundational part of their product, Meta needed a platform that would allow the following:
    • Every ML algorithm implemented once to be reusable by someone else at a later date
    • Every engineer to be able to write a training pipeline that can be reused
    • Model training to be easy and automated
    • Everybody to be able to search past projects and experiments easily

Effectively, Facebook created an easy-to-use knowledge base and workflow to centralize all their ML ops.

  • Amazon has SageMaker: This is Amazon’s product that allows you to build, train, and deploy your ML models and programs with their own collection of fully managed infrastructure tools and workflows. The idea of this product is to meet their customers where they are and offer low-code or no-code UIs, whether you employ ML engineers or business analysts. The ability to use their infrastructure is also great if you’re already using Amazon services for your cloud infrastructure so that you can take it a step further to automate and standardize your ML/AI program and operations at scale.
  • Airbnb has Bighead: Airbnb created its own ML infrastructure to bring standardization and centralization to its AI/ML organization. They used a collection of tools such as Zipline, Redspot, and DeepThought to orchestrate their ML platform in an effort to do the same as Facebook and Uber: to mitigate errors and discrepancies and minimize repeatable work.

As we can see, there are multiple platforms that can be used to create, train, and deploy ML models. Finally, let’s see what the future of AI looks like.

The promise of AI – where is AI taking us?

So, where is this era of AI implementation headed and what does it mean for all industries? At this point, we’re looking at an industry of geopolitical influence, a technologically obvious decision that comes with a lot of responsibility, cost, and opportunity. As long as companies and product managers are aware of the risks, costs, and level of investment needed to properly care for an AI program, use it as a source of curiosity, and apply AI/ML to projects that create success early on and build from that knowledge, those that invest in AI will find themselves experiencing AI’s promise. This promise is rooted in quantifying prediction and optimization. For example, Highmark Inc. saved more than $260M in 2019 by using ML for fraud detection, GE helped its customers save over $1.6B with their predictive maintenance, and 35% of Amazon’s sales come from their recommendation engine.

When a third of your revenues are coming from an AI algorithm, there’s virtually no argument. Whatever investment you make in AI/ML, make sure you’re leveraging it to its maximum capacity by properly planning and strategizing, finding capable talent that’s aware of the space and potential dangers, and choosing the right scalable infrastructure to limit your refactoring.

As long as your AI/ML projects are directly married to outcomes that impact cost savings or revenue, you’ll likely experience success within your own career if you’re overseeing these projects. Starting small, applying AI to a clear business goal, tracking that goal, and showing off its effectiveness is a smart strategy because, as this chapter details, there are many areas involved in maintaining an AI program, as well as potential hurdles along the way. Justifying the time, investment in headcount, and infrastructure expenses will be challenging if you’re not able to communicate the strength and capabilities of AI to even your most hesitant executive.

This will also be important for your technical resources (data scientists, data engineers, and ML engineers) as well as for your business stakeholders. It’s one thing to know more about the ML algorithms you’ll be using or to get a few recommendations about how to best store your data, but you really won’t have the intimacy and fluency needed to truly be an agent of change within your organization if you don’t iterate with your own projects and grow your knowledge and intuition about what works best from there. We learn through iteration and we build confidence the more we complete a task successfully. This will be the case for you as a product manager as well.

In the previous examples, GE offered cost savings to its customers, Highmark prevented future bottlenecks by predicting fraud, and Amazon grew its revenues through ML. When we think about the promise of AI and where it’s taking us, these examples drive home the idea that this is the latest industrial revolution. It’s not just something that will offer benefits to companies but to everyone all at once. The distribution of those benefits may not be completely equal because, ultimately, it’s companies that are investing in this tech, and they will look to experience the highest return on that investment first, but the point stands that consumers, as well as businesses, will benefit from AI.

Summary

We’ve covered a lot in this chapter, but keep in mind that many of the concepts presented here will be returned to in subsequent chapters for further discussion. It’s almost impossible to overstate how important infrastructure is to AI/ML success, because so much of the performance depends on how we deliver data and how we manage deployments. We covered the basic definitions of ML and DL, as well as the learning types that both can employ. We also covered some of the basics of setting up and maintaining an AI pipeline and included a few examples of how other companies manage this kind of operation.

Building products that leverage AI/ML is an ambitious endeavor, and this first chapter was meant to provide enough of a foundation for the process of setting up an AI program overall, so that we can build on the various aspects of that process in the following chapters without having to introduce too many new concepts so late in the book. If you’re feeling overwhelmed, it only means you’re grasping the scale necessary for building with AI. That’s a great sign! In Chapter 2, we will get into the specifics of using and maintaining the ML models we briefly introduced earlier in this chapter.

Additional resources

For additional information, you can refer to the following resources:



Key benefits

  • Build products that leverage AI for the common good and commercial success
  • Take macro data and use it to show your customers you’re a source of truth
  • Best practices and common pitfalls that impact companies while developing AI products

Description

Product managers working with artificial intelligence will be able to put their knowledge to work with this practical guide to applied AI. This book covers everything you need to know to drive product development and growth in the AI industry. From understanding AI and machine learning to developing and launching AI products, it provides the strategies, techniques, and tools you need to succeed. The first part of the book focuses on establishing a foundation of the concepts most relevant to maintaining AI pipelines. The next part focuses on building an AI-native product, and the final part guides you in integrating AI into existing products. You’ll learn about the types of AI, how to integrate AI into a product or business, and the infrastructure to support the exhaustive and ambitious endeavor of creating AI products or integrating AI into existing products. You’ll gain practical knowledge of managing AI product development processes, evaluating and optimizing AI models, and navigating complex ethical and legal considerations associated with AI products. With the help of real-world examples and case studies, you’ll stay ahead of the curve in the rapidly evolving field of AI and ML. By the end of this book, you’ll have understood how to navigate the world of AI from a product perspective.

What you will learn

  • Build AI products for the future using minimal resources
  • Identify opportunities where AI can be leveraged to meet business needs
  • Collaborate with cross-functional teams to develop and deploy AI products
  • Analyze the benefits and costs of developing products using ML and DL
  • Explore the role of ethics and responsibility in dealing with sensitive data
  • Understand performance and efficacy across verticals


Table of Contents

Preface
Part 1 – Lay of the Land – Terms, Infrastructure, Types of AI, and Products Done Well
Chapter 1: Understanding the Infrastructure and Tools for Building AI Products
Chapter 2: Model Development and Maintenance for AI Products
Chapter 3: Machine Learning and Deep Learning Deep Dive
Chapter 4: Commercializing AI Products
Chapter 5: AI Transformation and Its Impact on Product Management
Part 2 – Building an AI-Native Product
Chapter 6: Understanding the AI-Native Product
Chapter 7: Productizing the ML Service
Chapter 8: Customization for Verticals, Customers, and Peer Groups
Chapter 9: Macro and Micro AI for Your Product
Chapter 10: Benchmarking Performance, Growth Hacking, and Cost
Part 3 – Integrating AI into Existing Non-AI Products
Chapter 11: The Rising Tide of AI
Chapter 12: Trends and Insights across Industry
Chapter 13: Evolving Products into AI Products
Index
Other Books You May Enjoy

