The Definitive Guide to Google Vertex AI

Machine Learning Project Life Cycle and Challenges

Today, machine learning (ML) and artificial intelligence (AI) are integral parts of business strategy for many organizations, and more organizations are using them every year. The major reason for this adoption is the power of ML and AI solutions to garner more revenue, brand value, and cost savings. This increase in the adoption of AI and ML demands more skilled data and ML specialists and technical leaders. If you are an ML practitioner or beginner, this book will help you become a confident ML engineer or data scientist with knowledge of Google’s best practices. In this chapter, we will discuss the basics of the life cycle and the challenges and limitations of ML when developing real-world applications.

ML projects often involve a defined set of steps from problem statements to deployments. It is essential to understand the importance and common challenges involved with these steps to complete a successful and impactful project. In this chapter, we will discuss the importance of understanding the business problem, the common steps involved in a typical ML project life cycle, and the challenges and limitations of ML in detail. This will help new ML practitioners understand the basic project flow; plus, it will help create a foundation for forthcoming chapters in this book.

This chapter covers the following topics:

ML project life cycle
Common challenges in developing real-world ML solutions
Limitations of ML

ML project life cycle

In this section, we will learn about the typical life cycle of an ML project, from defining the problem to model development, and finally, to the operationalization of the model. Figure 1.1 shows the high-level steps almost every ML project goes through. Let’s go through all these steps in detail.

Figure 1.1 – Life cycle of a typical ML project

Just like the Software Development Life Cycle (SDLC), the Machine Learning Project/Development Lifecycle (MDLC) guides the end-to-end process of ML model development and operationalization. At a high level, the life cycle of a typical ML project in an enterprise setting remains somewhat consistent and includes eight key steps:

Define the ML use case: The first step of any ML project is where the ML team works with business stakeholders to assess the business needs around predictive analytics and identifies a use case where ML can be used, along with some success criteria, performance metrics, and possible datasets that can be used to build the models.
For example, if the sales/marketing department of an insurance company called ABC Insurance Inc. wants to better utilize its resources to target customers who are more likely to buy a certain product, they might approach the ML team to build a solution that can sift through all possible leads/customers and, based on the data points for each lead (age, prior purchase, length of policy history, income level, etc.), identify the customers who are most likely to buy a policy. Then the sales team can ask their customer representatives to prioritize reaching out to these customers instead of calling all possible customers blindly. This can significantly improve the outcome of outbound calls by the reps and improve the sales-related KPIs.
Once the use case is defined, the next step is to define a set of KPIs to measure the success of the solution. For this sales use case, this could be the customer sign-up rate—what percentage of the customers whom sales reps talk to sign up for a new insurance policy?
To measure the effectiveness of the ML solution, the sales team and the ML team might agree to measure the increase or decrease in customer sign-up rate once the ML model is live and iteratively improve on the model to optimize the sign-up rate.
At this stage, there will also be a discussion about the possible datasets that can be utilized for the model training. These could include the following:
- Internal customer/product datasets being generated by marketing and sales teams, for example, customer metadata, such as their age, education profile, income level, prior purchase behavior, number and type of vehicles they own, etc.
- External datasets that can be acquired through third parties; for example, an external marketing consultancy might have collected data about the insurance purchase behavior of customers based on the car brand they own. This additional data can be used to predict how likely they are to purchase the insurance policy being sold by ABC Insurance Inc.
Explore/analyze data: The next step is to do a detailed analysis of the datasets. This is usually an iterative process in which the ML team works closely with the data and business SMEs to better understand the nuances of the available datasets, including the following:
- Data sources
- Data granularity
- Update frequency
- Description of individual data points and their business meaning
This is a key step where data scientists/ML engineers analyze the available data and decide what datasets might be relevant to the ML solution being considered, analyze the robustness of the data, and identify any gaps. Issues that the team might identify at this stage could relate to the cleanliness and completeness of data or problems with the timely availability of the data in production. For example, the age of the customer could be a great indicator of their purchase behavior, but if it’s an optional field in the customer profile, only a handful of customers might have provided their date of birth or age.
So, the team would need to figure out if they want to use the field and, if so, how to handle the samples where age is missing. They could also work with sales and marketing teams to make the field a required field whenever a new customer requests an insurance quote online and generates a lead in the system.
Select ML model type: Once the use case has been identified along with the datasets that can possibly be used to train the model, the next step is to consider the types of models that can be used to achieve the requirements. We won’t go too deep into the topic of general model selection here since entire books could be written on the topic, but in the next few chapters, you will see what different model types can be built for specific use cases in Vertex AI. At a very high level, the key considerations at this stage are as follows:
- Type of model: For example, for the insurance customer/lead ranking example, we could build a classification model that will predict whether a new customer is high/medium/low in terms of their likelihood to purchase a policy. Or a regression model could be built to output a sales probability number for each likely customer.
- Does the conventional ML model satisfy our requirements or do we need a deep learning model?
- Explainability requirements: Does the use case require an explanation for each prediction as to why the sample was classified a certain way?
- Single versus ensemble model: Do we need a single model to give us the final prediction, or do we need to employ a set of interconnected models that feed into each other? For example, a first model might assign a customer to a particular customer group, and the next model might use that grouping to identify the final likelihood of purchase.
- Separation of models: For example, sometimes we might build a single global model for the entire customer base, or we might need separate models for each region due to significant differences in products and user behavior in different regions.
Feature engineering: This process is usually the most time-consuming and involves several steps:
1. Data cleanup–Imputing missing values where possible, dropping fields with too many missing values
2. Data and feature augmentation–Joining datasets to bring in additional fields, and cross-joining existing features to generate new features
3. Feature analysis–Calculating feature correlation and analyzing collinearity, checking for data leakage in features
Again, since this is an extremely broad topic, we are not diving too deep into it and suggest you refer to other books on this topic.
Iterate over the model design/build: The actual design and build of the ML model is an iterative process involving the following key steps:
1. Select model architecture
2. Split acquired data into train/validation/test subsets
3. Run model training experiments, tune hyperparameters
4. Evaluate trained models with the test dataset
5. Rank and select the best models
Figure 1.2 shows the typical ML model development life cycle:

Figure 1.2 – ML model development life cycle

Consensus on results: Once a satisfactory model has been obtained, the ML team shares the results with the business stakeholders to ensure the results fully align with the business needs and performs additional optimizations and post-processing steps to make the model predictions usable by the business. To assure business stakeholders that ML solution is aligned with the business goals and is accurate enough to drive value, ML teams could use one of a number of approaches:
- Evaluate using historical test datasets: ML teams can run historical data through the new ML models and evaluate the predictions against the ground truth values. For example, in the insurance use case discussed previously, the ML team can take last month’s data on customer leads and use the ML model to predict which customers are most likely to purchase a new insurance policy. Then they can compare the model’s predictions against the actual purchase history from the previous month and see how accurate the model’s predictions were. If the model’s output is close to the real purchase behavior of customers, then the model is working as desired, and this information can be presented to business stakeholders to convince them of the ML solution’s efficacy in driving additional revenue. On the contrary, if the model’s output significantly deviates from the customer’s behavior, the ML team needs to go back and work on improving the model. This usually is an iterative process and can take a number of iterations, depending on the complexity of the model.
- Evaluate with live data: In some scenarios, an organization might decide to conduct a small pilot in a production environment with real-time data to assess the performance of the new ML model. This is usually done in the following scenarios:
  - When there is no historical data available to conduct the evaluation or where testing with historical data is not expected to be an accurate; for example, during the onset of COVID, customer behavior patterns abruptly changed to the extent that testing with any historical data became nearly useless
  - When there is an existing model in production being used for critical real-time predictions, the sanity check for the new model needs to be performed not just in terms of its accuracy but also its subtle impact on downstream KPIs such as revenue per user session
In such cases, teams might deploy the model in production, divert a small number of prediction requests to the newer model, and periodically compare the overall impact on the KPIs. For example, in the case of a recommendation model deployed on an e-commerce website, a recommendation model might start recommending products that are comparatively cheaper than the predictions from the older model already live in production. In this scenario, the likelihood of a customer completing a purchase would go up, but at the same time, the revenue generated per user session would decrease, impacting overall revenue for the organization. So, although it might seem like the ML model is working as designed, it might not be considered a success by the business/sales stakeholders, and more discussions would be required to optimize it.
Operationalize model: Once the model has been approved for deployment in production, the ML team will work with their organization’s IT and data engineering teams to deploy the model so that other applications can start utilizing it to generate insights. Depending on the size of the organization, there can be significant overlap in the roles these teams play.
The actual deployment architecture would depend on the following:
- Prediction SLAs – Ranging from periodic batch jobs to solutions that require sub-second prediction performance.
- Compliance requirements – Can the user data be sent to third-party cloud providers, or does it need to always reside within an organization’s data centers?
- Infrastructure requirements – This depends on the size of the model and its compute requirements. Small models can be served from a shared compute node. Some large models might need a large GPU-connected node.
We will discuss this topic in detail in later chapters, but the following figure shows some key components you might consider as part of your deployment architecture.

Figure 1.3 – Key components of ML model training and deployment

Monitor and retrain: It might seem as if the ML team’s job is done once the model has been operationalized, but in real-world deployments, most models require periodic or sometimes constant monitoring to ensure the model is operating within the required performance thresholds. Model performance could become sub-optimal for several reasons:
- Data drift: Changes in data being used to generate predictions could change significantly and impact the model’s performance. As we discussed before, during COVID, customer behavior changed significantly. Models that were trained on pre-COVID customer behavior data were not equipped to handle this sudden change in usage patterns. Change due to the pandemic was relatively rare but high-impact, but there are plenty of other smaller changes in prediction input data that might impact your model’s performance adversely. The impact could range from a subtle drop in accuracy to a model generating erroneous responses. So, it is important to keep an eye on the key performance metrics of your ML solution.
- Change prediction request volume: If your solution was designed to handle 100 requests per second but now is seeing periodic bursts in traffic of around 1,000 requests per second, your solution might not be able to keep up with the demand, or latency might go above acceptable levels. So, your solution also needs to have monitoring and certain levels of auto-scaling built in to handle such scenarios. For larger changes in traffic volumes, you might even need to completely rethink the serving architecture.
There would be scenarios where through monitoring, you will discover that your ML model no longer meets the prediction accuracy and requires retraining. If the change in data patterns is expected, the ML team should design the solution to support automatic periodic retraining. For example, in the retail industry, product catalogs, pricing, and promotions constantly evolve, requiring regular retraining of the models. In other scenarios, the change might be gradual or unexpected, and when the monitoring system alerts the ML team of the model performance degradation, they need to take a call on retraining the model with more recent data, or maybe even completely rebuilding the model with new features.

Now that we have a good idea of the life cycle of an ML project, let’s learn about some of the common challenges faced by ML developers when creating and deploying ML solutions.

Common challenges in developing real-world ML solutions

A real-world ML project is always filled with some unexpected challenges that we get to experience at different stages. The main reason for this is that the data present in the real world, and the ML algorithms, are not perfect. Though these challenges hamper the performance of the overall ML setup, they don’t prevent us from creating a valuable ML application. In a new ML project, it is difficult to know the challenges up front. They are often found during different stages of the project. Some of these challenges are not obvious and require skilled or experienced ML practitioners (or data scientists) to identify them and apply countermeasures to reduce their effect.

In this section, we will understand some of the common challenges encountered during the development of a typical ML solution. The following list shows some common challenges we will discuss in more detail:

Data collection and security
Non-representative training set
Poor quality of data
Underfitting of the training dataset
Overfitting of the training dataset
Infrastructure requirements

Now, let’s learn about each of these common challenges in detail.

Data collection and security

One of the most common challenges that organizations face is data availability. ML algorithms require a large amount of good-quality data in order to provide quality results. Thus, the availability of raw data is critical for a business if it wants to implement ML. Sometimes, even if the raw data is available, gathering data is not the only concern; we often need to transform or process the data in a way that our ML algorithm supports.

Data security is another important challenge that is very frequently faced by ML developers. When we get data from a company, it is essential to differentiate between sensitive and non-sensitive information to implement ML correctly and efficiently. The sensitive part of data needs to be stored in fully secured servers (storage systems) and should always be kept encrypted. Sensitive data should be avoided for security purposes, and only the less-sensitive data access should be given to trusted team members working on the project. If the data contains Personally Identifiable Information (PII), it can still be used by anonymizing it properly.

Non-representative training data

A good ML model is one that performs equally well on unseen data and training data. It is only possible when your training data is a good representative of most possible business scenarios. Sometimes, when the dataset is small, it may not be a true representative of the inherent distribution, and the resulting model may provide inaccurate predictions on unseen datasets despite having high-quality results on the training dataset. This kind of non-representative data is either the result of sampling bias or the unavailability of data. Thus, an ML model trained on such a non-representative dataset may have less value when it is deployed in production.

If it is impossible to get a true representative training dataset for a business problem, then it’s better to limit the scope of the problem to only the scenarios for which we have a sufficient amount of training samples. In this way, we will only get known scenarios in the unseen dataset, and the model should provide quality predictions. Sometimes, the data related to a business problem keeps changing with time, and it may not be possible to develop a single static model that works well; in such cases, continuous retraining of the model on the latest data becomes essential.

Poor quality of data

The performance of ML algorithms is very sensitive to the quality of training samples. A small number of outliers, missing data cases, or some abnormal scenarios can affect the quality of the model significantly. So, it is important to treat such scenarios carefully while analyzing the data before training any ML algorithm. There are multiple methods for identifying and treating outliers; the best method depends upon the nature of the problem and the data itself. Similarly, there are multiple ways of treating the missing values as well. For example, mean, median, mode, and so on are some frequently used methods to fill in missing data. If the training data size is sufficiently large, dropping a small number of rows with missing values is also a good option.

As discussed, the quality of the training dataset is important if we want our ML system to learn accurately and provide quality results on the unseen dataset. It means that the data pre-processing part of the ML life cycle should be taken very seriously.

Underfitting the training dataset

Underfitting an ML model means that the model is too simple to learn the inherent information or structure of the training dataset. It may occur when we try to fit a non-linear distribution using a linear ML algorithm such as linear regression. Underfitting may also occur when we utilize only a minimal set of features (that may not have much information about the target distribution) while training the model. This type of model can be too simple to learn the target distribution. An underfitted model learns too little from the training data and, thus, makes mistakes on unseen or test datasets.

There are multiple ways to tackle the problem of underfitting. Here is a list of some common methods:

Feature engineering – add more features that represent target distribution
Non-linear algorithms – switch to a non-linear algorithm if the target distribution is not linear
Removing noise from the data
Add more power to the model – increase trainable parameters, increase depth or number of trees in tree-based ensembles

Just like underfitting the model on training data, overfitting is also a big issue. Let’s deep dive into it.

Overfitting the training dataset

The overfitting problem is the opposite of the underfitting problem. Overfitting is the scenario when the ML model learns too much unnecessary information from the training data and fails to generalize on a test or unseen dataset. In this case, the model performs extremely well on the training dataset, but the metric value (such as accuracy) is very low on the test set. Overfitting usually occurs when we implement a very complex algorithm on simple datasets.

Some common methods to address the problem of overfitting are as follows:

Increase training data size – ML models often overfit on small datasets
Use simpler models – When problems are simple or linear in nature, choose simple ML algorithms
Regularization – There are multiple regularization methods that prevent complex models from overfitting on the training dataset
Reduce model complexity – Use a smaller number of trainable parameters, train for a smaller number of epochs, and reduce the depth of tree-based models

Overfitting and underfitting are common challenges and should be addressed carefully, as discussed earlier. Now, let’s discuss some infrastructure-related challenges.

Infrastructure requirements

ML is expensive. A typical ML project often involves crunching large datasets with millions or billions of samples. Slicing and dicing such datasets requires a lot of memory and high-end multi-core processors. Additionally, once the development of the project is complete, dedicated servers are required to deploy the models and match the scale of consumers. Thus, business organizations willing to practice ML need some dedicated infrastructure to implement and consume ML efficiently. This requirement increases further when working with large, deep learning models such as transformers, large language models (LLMs), and so on. Such models usually require a set of accelerators, graphical processing units (GPUs), or tensor processing units (TPUs) for training, finetuning, and deployment.

As we have discussed, infrastructure is critical for practicing ML. Companies that lack such infrastructure can consult with other firms or adopt cloud-based offerings to start developing ML-based applications.

Now that we understand the common challenges faced during the development of an ML project, we should be able to make more informed decisions about them. Next, let’s learn about some of the limitations of ML.

Limitations of ML

ML is very powerful, but it’s not the answer to every single problem. There are problems that ML is just not suitable for, and there are some cases where ML can’t be applied due to technical or business constraints. As an ML practitioner, it is important to develop the ability to find relevant business problems where ML can provide significant value instead of applying it blindly everywhere. Additionally, there are algorithm-specific limitations that can render an ML solution not applicable in some business applications. In this section, we will learn about some common limitations of ML that should be kept in mind while finding relevant use cases.

Keep in mind that the limitations we are discussing in this section are very general. In real-world applications, there are more limitations possible due to the nature of the problem we are solving. Some common limitations that we will discuss in detail are as follows:

Data-related concerns
Deterministic nature of problems
Lack of interpretability and reproducibility
Concerns related to cost and customizations
Ethical concerns and bias

Let’s now deep dive into each of these common limitations.

Data-related concerns

The quality of an ML model highly depends upon the quality of the training data it is provided with. Data present in the real world is often noisy, incomplete, unlabeled, and sometimes unusable. Moreover, most supervised learning algorithms require large amounts of properly labeled training data to produce good results. The training data requirements of some algorithms (e.g., deep learning) are so high that even manually labeling data is not an option. And even if we manage to label the data manually, it is often error-prone due to human bias.

Another major issue is incompleteness or missing data. For example, consider the problem of automatic speech recognition. In this case, model results are highly biased toward the accent present in the training dataset. A model that is trained on the American accent doesn’t produce good results on other accented speech. Since accents change significantly as we travel to different parts of the world, it is hard to gather and label relevant amounts of training data for every possible accent. For this reason, developing a single speech recognition model that works for everyone is not yet feasible, and thus, the tech giants providing speech recognition solutions often develop accent-specific models. Developing a new model for each new accent is not very scalable.

Deterministic nature of problems

ML has achieved great success in solving some highly complex problems, such as numerical weather prediction. One problem with most of the current ML algorithms is that they are stochastic in nature and thus cannot be trusted blindly when the problem is deterministic. Considering the case of numerical weather prediction, today we have ML models that can predict rain, wind speed, air pressure, and so on, with acceptable accuracy, but they completely fail to understand the physics behind real weather systems. For example, an ML model might provide negative value estimations of parameters such as density.

However, it is very likely that these kinds of limitations can be overcome in the near future. Future research in the field of ML might discover new algorithms that are smart enough to understand the physics of our world. Such models will open infinite possibilities in the future.

Lack of interpretability and reproducibility

One major issue with many ML algorithms (and often with neural networks) is the lack of interpretability of results. Many business applications, such as fraud detection and disease prediction, require a justification for model results. If an ML model classifies a financial transaction as fraud, it should also provide solid evidence for the decision; otherwise, this output may not be useful for the business. Deep learning or neural network models often lack interpretability, and the explainability of such models is an active area of research. Multiple methods have been developed for model interpretability or explainability purposes. Though these methods can provide some insights into the results, they are still far from the actual requirements.

Reproducibility, on the other hand, is another complex and growing issue with ML solutions. Some of the latest research papers might show us great improvements in results using some technological advancements on a fixed set of datasets, but the same method may not work in real-world scenarios. Secondly, ML models are often unstable, which means that they produce different results when trained on different partitions of the dataset. This is a challenging situation because models developed for one business segment may be completely useless for another business segment, even though the underlying problem statement is similar. This makes them less reusable.

Concerns related to cost and customizations

Developing and maintaining ML solutions is often expensive, more so in the case of deep learning algorithms. Development costs may come from employing highly skilled developers as well as the infrastructure needed for data analytics and ML experimentation. Deep learning models usually require high-compute resources such as GPUs and TPUs for training and experimentation. Running a hyperparameter tuning job with such models is even more costly and time-consuming. Once the model is ready for production, it requires dedicated resources for deployment, monitoring, and maintenance. This cost further increases as you scale your deployments to serve a large number of customers, and even more if there are very low latency concerns. Thus, it is very important to understand the value that our solution is going to bring before jumping into the development phase and check whether it is worth the investment.

Another concern with the ML solutions is their lack of customizations. ML models are often very difficult to customize, meaning it can be hard to change their parameters or make them adapt to new datasets. Pre-built general-purpose ML solutions often do not work well on specific business use cases, and this leaves them with two choices – either to develop the solution from scratch or customize the prebuilt general-purpose solutions. Though the customization of prebuilt models seems like a better choice here, even the customization is not easy in the case of ML models. ML model customization requires a skilled set of data engineers and ML specialists with a deep understanding of technical concepts such as deep learning, predictive modeling, and transfer learning.

Ethical concerns and bias

ML is quite powerful and is adopted today by many organizations to guide their business strategy and decisions. As we know, some of these ML algorithms are black boxes; they may not provide reasons behind their decisions. ML systems are trained on a finite set of datasets, and they may not apply to some real-world scenarios; if those scenarios are encountered in the future, we can’t tell what decision the ML system will take. There might be ethical concerns related to such black-box decisions. For example, if a self-driving car is involved in a road accident, whom should you blame – the driver, the team that developed the AI system, or the car manufacturer? Thus, it is clear that the current advancements in ML and AI are not suitable for ethical or moral decision-making. Also, we need a framework to solve ethical concerns involving ML and AI systems.

The accuracy and speed of ML solutions are often commendable, but these solutions cannot always be trusted to be fair and unbiased. Consider AI software that recognizes faces or objects in a given image; this system could go wrong on photos where the camera is not able to capture racial sensitivity properly, or it may classify a certain type of dog (that is somewhat similar to a cat) as a cat. This kind of bias may come from a biased set of training or testing datasets used for developing AI systems. Data present in the real world is often collected and labeled by humans; thus, the bias that exists in humans is transferred into AI systems. Avoiding bias completely is impossible as we all are humans and are thus biased, but there are measures that can be taken to reduce it. Establishing a culture of ethics and building teams from diverse backgrounds can be a good step to reduce bias to a certain extent.