AI is everywhere. From recommending products on your favorite websites to optimizing the supply chains of Fortune 500 companies to forecasting demand for shops of all sizes, AI has emerged as a dominant force. Yet, as AI becomes more and more prevalent in the workplace, a worrisome trend has emerged: most AI projects fail.
Failure occurs for a variety of technical and non-technical reasons. Sometimes, it's because the AI model performs poorly. Other times, it's due to data issues. Machine learning algorithms require reliable, accurate, timely data, and sometimes your data fails to meet those standards. When data isn't the issue and your model performs well, failure usually occurs because end users simply do not trust AI to guide their decision making.
For every worrisome trend, however, there is a promising solution. Microsoft and a host of other companies have developed automated machine learning (AutoML) to increase the success of your AI projects. In this book, you will learn how to use AutoML on Microsoft's Azure cloud platform. This book will teach you how to boost your productivity if you are a data scientist. If you are not a data scientist, this book will enable you to build machine learning models and harness the power of AI.
In this chapter, we will begin by defining AI and machine learning and explaining why companies have had such trouble seeing a return on their investment in AI. Then, we will take a deeper dive into how data scientists work and why that workflow is inherently slow and mistake-prone from a project success perspective. Finally, we will conclude the chapter by introducing AutoML as the key to unlocking productivity in machine learning projects.
In this chapter, we will cover the following topics:
- Explaining data science's ROI problem
- Analyzing why AI projects fail slowly
- Solving the ROI problem with AutoML
Explaining data science's ROI problem
Data scientist has been consistently ranked the best job in America by Forbes Magazine from 2016 to 2019, yet the best job in America has not produced the best results for the companies employing data scientists. According to VentureBeat, 87% of data science projects fail to make it into production. This means that most of the work that data scientists perform does not impact their employer in any meaningful way.
By itself, this is not a problem. If data scientists were cheap and plentiful, companies would see a return on their investment. However, this is simply not the case. According to the 2020 LinkedIn Salary stats, data scientists earn a total compensation of around $111,000 across all career levels in the United States. It's also very easy for them to find jobs.
Burtch Works, a United States-based executive recruiting firm, reports that, as of 2018, data scientists stayed at their job for only 2.6 years on average, and 17.6% of all data scientists changed jobs that year. Data scientists are expensive and hard to keep.
Likewise, if data scientists worked fast, a return on investment (ROI) would still be possible even though 87% of their projects fail to have an impact. Failing fast means that the projects that do succeed make it into production quickly and the department delivers value. Failing slow means that the department fails to deliver.
Unfortunately, most data science departments fail slow. To understand why, you must first understand what machine learning is, how it differs from traditional software development, and the five steps common to all machine learning projects.
Defining machine learning, data science, and AI
Machine learning is the process of training statistical models to make predictions using data. It is a category within AI. AI is defined as computer programs that perform cognitive tasks such as decision making that would normally be performed by a human. Data science is a career field that combines computer science, machine learning, and other statistical techniques to solve business problems.
Data scientists use a variety of machine learning algorithms to solve business problems. Machine learning algorithms are best thought of as a defined set of mathematical computations to perform on data to make predictions. Common applications of machine learning that you may experience in everyday life include predicting when your credit card was used to make a fraudulent transaction, determining how much money you should be given when applying for a loan, and figuring out which items are suggested to you when shopping online. All of these decisions, big and small, are determined mechanistically through machine learning.
There are many types of algorithms, but it's not important for you to know them all. Random Forest, XGBoost, LightGBM, deep learning, CART decision trees, multilinear regression, naïve Bayes, logistic regression, and k-nearest neighbor are all examples of machine learning algorithms. These algorithms are powerful because they work by learning patterns in data that would be too complex or subtle for any human being to detect on their own.
Imagine you are a restaurant manager and you want to forecast how much money you will make next month by running an advertising campaign. To accomplish this with machine learning, you would want to collect all of your sales data from previous years, including the results of previous campaigns. Since you have past results and are using them to make predictions, this is an example of supervised learning.
Unsupervised learning simply groups like data points together. It's useful when you have a lot of information about your customers and would like to group them into buckets so that you can advertise to them in a more targeted fashion. Azure AutoML, however, is strictly for supervised learning tasks. Thus, you always need to have past results available in your data when creating new AutoML models.
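To make the supervised versus unsupervised distinction concrete, here is an illustrative sketch using scikit-learn and synthetic data; the numbers and variable names are invented, and outside of AutoML this is one common way to try both kinds of task:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)

# Supervised learning: past ad spend (input) and past sales (known output).
ad_spend = rng.uniform(1_000, 10_000, size=(100, 1))
sales = 5 * ad_spend.ravel() + rng.normal(0, 2_000, size=100)

model = LinearRegression().fit(ad_spend, sales)  # learn from past results
next_month = model.predict([[7_500]])            # predict a new outcome

# Unsupervised learning: no known output, just grouping similar customers.
customers = rng.uniform(0, 1, size=(100, 2))     # e.g. spend and visit rate
segments = KMeans(n_clusters=3, n_init=10).fit_predict(customers)
```

The regression model needs the `sales` column of past results to learn from; the clustering model receives no such column, which is exactly why Azure AutoML, a supervised-learning tool, always requires one.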
Machine learning versus traditional software
Traditional software development and machine learning development differ tremendously. Programmers are used to creating software that takes in input and delivers output based on explicitly defined rules. Data scientists, on the other hand, collect the desired output first before making a program. They then use this output data along with input data to create a program that learns how to predict output from input.
For example, maybe you would like to build a model that predicts how many car accidents will occur in a given city on a given day. First, you would begin by collecting historical data such as the number of car crashes (the desired output) and any data that you guess would be useful in predicting that number (the input data). Weather data, the day of the week, the amount of traffic, and data related to city events can all serve as input.
Once you collect the data, your next step is to create a statistical program that finds hidden patterns between the input and output data; this is called model training. After you train your model, your next step is to set up an inference program that uses new input data to predict how many car accidents will happen that day using your trained model.
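The train-then-infer split described above can be sketched as two small programs. This is illustrative only: the crash data is synthetic, the column meanings are invented, and scikit-learn stands in for whatever tooling you actually use:

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=42)

# --- Training program: historical inputs and the desired output ---
# Invented columns: rainfall, day of week, traffic volume (all scaled 0-1).
X_train = rng.uniform(0, 1, size=(365, 3))
y_train = (X_train @ np.array([10.0, 2.0, 25.0])).round()  # crashes per day

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)                # find hidden input-output patterns
joblib.dump(model, "crash_model.joblib")   # hand the trained model off

# --- Inference program: new input data + trained model -> prediction ---
model = joblib.load("crash_model.joblib")
today = np.array([[0.7, 0.5, 0.9]])        # today's weather and traffic
predicted_crashes = model.predict(today)[0]
```

Note that the "rules" relating weather and traffic to crash counts are never written by the programmer; they are learned from the historical output data, which is the key contrast with traditional software.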
Another major difference is that, with machine learning, you never know in advance what data you will need to create your solution, and you never know how well that solution will perform until you build it. Since data scientists never know what data they need to solve any given problem, they must ask business experts for advice and use their intuition to identify the right data to collect.
These differences are important because successful machine learning projects look very different from successful traditional software projects; confusing the two leads to failed projects. Managers with an IT background but lacking a data science background often try to follow methods and timelines inappropriate for a machine learning project.
Frankly, it's unrealistic to assign hard timelines to a process where you don't know what data you will need or what algorithms will work, and many data science projects fail simply because they weren't given adequate time or support. There is, however, a recipe for success.
The five steps to machine learning success
Now that we know what machine learning is and how it differs from traditional software development, the next step is to learn how a typical machine learning project is structured. There are many ways you could divide the process, but there are roughly five parts, as shown in the following diagram:
Let's look at each of these steps in turn.
Understanding the business problem
For example, a problem in the world of professional basketball may be: "We are really bad at drafting European basketball players. We would like to get better at selecting the right players for our team." You will need to figure out what the business means by a good player. Along the way, you may discover that most players brought over from Europe play only a few games before being sent home, and that this costs the team millions of wasted dollars.
Armed with this information, you then need to translate the problem to make it solvable by machine learning. Think about it clearly: "We will use each player's historical in-game statistics and demographic information to predict the longevity of his career in the NBA" would make a good machine learning project. Translating a business problem into an AI problem always means using data to try to predict either a number (the number of games played in the NBA) or a category (whether the player will head home after a handful of games).
Collecting and cleansing data
Collecting and cleansing data involves the following steps:

- Identifying and gaining access to data sources
- Retrieving all of the data you want
- Joining all of your data together
- Removing errors in the data
- Applying business logic to create a clean dataset even a layman could understand
This is harder than it sounds. Data is often dirty and hard to find.
With our basketball case, this would mean scraping publicly available data from the web to get each player's in-game statistics and demographic information. Errors are nearly guaranteed, so you will have to build in logic to remove or fix nonsensical numbers. No human being is 190 inches tall, for example, but centimeters and inches are often confused.
The best test for whether you have properly cleansed a dataset and made it clear is to give it to a layman and ask simple questions. "How tall is player Y? How many NBA games did player X participate in during his career?". If they can answer, you have succeeded.
Transforming data for machine learning
Once you have an easily understandable, cleansed dataset, the next step is transforming data for machine learning, which is called feature engineering. Feature engineering is the process of altering data for machine learning algorithms. Some features are necessary for the algorithm to work, while other features make it easier for the algorithm to find patterns. Common feature engineering techniques include one-hot encoding categorical variables, scaling numeric values, removing outliers, and filling in null values.
A complication is that different algorithms require different types of feature engineering. Unlike most algorithms, XGBoost does not require you to fill in null values. Decision trees aren't affected by outliers much, but outliers throw off regression models. Going back to our basketball problem, you would likely have to replace null values, scale numeric values so that each column contains only a range of numbers from 0 to 1, and one-hot encode categorical variables.
One-hot encoding categorical variables simply means taking one column with many categories and turning it into many columns with either a one or a zero. For example, if you have one column with the values USA, Canada, or Mexico, one-hot encoding that column would create three columns, one for each country. A row with a product from the United States would have a 1 in the USA column and a 0 in the Canada and Mexico columns.
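The three techniques above (filling nulls, min-max scaling, and one-hot encoding) can be sketched in a few lines of pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "Canada", "Mexico", "USA"],
    "height_cm": [198.0, None, 185.0, 201.0],
})

# Fill in null values with the column median
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

# Scale numeric values into the 0-to-1 range (min-max scaling)
col = df["height_cm"]
df["height_cm"] = (col - col.min()) / (col.max() - col.min())

# One-hot encode the categorical column: one 0/1 column per country
df = pd.get_dummies(df, columns=["country"])
```

After the last step, the single `country` column becomes `country_USA`, `country_Canada`, and `country_Mexico`, with exactly one of them set per row, just as described above.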
Training the machine learning model
Now that you have your data and you have it in just the right format, it's time to train a machine learning model. Although this step gets a lot of glamour and hype, training machine learning models is a process that is both quick and slow. With today's technology, most machine learning models can be trained with only a few lines of code.
Hyperparameter tuning, in contrast, can take a very long time. Each machine learning algorithm has settings you can control, called hyperparameters. Hyperparameter tuning is the process of retraining machine learning algorithms multiple times until you find the right set of hyperparameters.
Some algorithms such as Random Forest do not get much benefit out of hyperparameter tuning. Others, such as XGBoost or LightGBM, often improve drastically. Depending on the size of your data, the algorithm you're using, and the amount of compute you have available, hyperparameter tuning can take days to weeks to finish.
Notice how much you have to know about individual algorithms to become a successful data scientist? This is one of the reasons why the field has such a high barrier to entry. Do not be intimidated, but please keep this in mind as we introduce AutoML.
Delivering results to end users
You have now trained your model and tuned its parameters, and you can confidently predict which European players the NBA team should draft. Maybe you have achieved 80% accuracy, maybe 90%, but your predictions will definitely help the business. Despite your results, you still have to get end users to accept your model, trust your model, and use it. Unlike traditional software, this can require a Herculean effort.
First, end users will want to know why the model is giving its prediction, and if you used the wrong algorithm, answering that question is impossible. Black-box models use algorithms whose inner workings are inherently opaque. Then, even if you can give the business explanations, the user may feel uncomfortable with that 80% accuracy number. "What does that mean?", they will ask.
Visualizations are key to relieving some of their fears. For your basketball model, you might decide simply to show the business pictures of the players they should draft, along with some simple graphs showing how many players your model correctly predicted would become NBA stars and how many European NBA stars it failed to predict.
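One concrete way to produce the "why" behind predictions is permutation feature importance: shuffle one input column at a time and measure how much model accuracy drops. This is a sketch using scikit-learn on synthetic data; the basketball feature names are invented, and the technique is one option among several for explaining a model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(seed=1)
feature_names = ["points_per_game", "height_cm", "jersey_number"]

# Synthetic data: success depends on the first two features only,
# so "jersey_number" is deliberately irrelevant noise.
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Rank features by how much shuffling each one hurts accuracy
ranking = sorted(zip(feature_names, result.importances_mean),
                 key=lambda pair: pair[1], reverse=True)
```

A bar chart of `ranking` is exactly the kind of simple visual that lets an end user see which player attributes drive the model's draft recommendations.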
Putting it all together
You now know what machine learning is, how it differs from traditional software development, and the five steps inherent to any machine learning project. Unfortunately, many people in the industry do not understand any of these things. Most businesses are new to data science. Many businesses believe that data science is much more similar to software development than it is, and this interferes with the machine learning project process.
End users are confused by data scientists' questions because they don't realize that the data scientist is trying to translate their business problem into a machine learning problem. IT is confused as to why data scientists ask for access to so much data because they don't realize that data scientists don't know what data they will need before trying it out. Management is confused as to why their data scientists spend so little time building models and so much time cleansing and transforming data.
Thus, steps 1 and 2 of the machine learning process often take longer than expected. Business users fail to communicate their business problem to data scientists in a useful way, IT is slow to grant data scientists access to data, and data scientists struggle with understanding the data they receive. Step 5 is also complicated because end users expect models to be perfectly explainable like a typical software program, and earning their trust takes time.
Given that misinterpretation slows down the other steps, the rest of the data science process must be fast for companies to see ROI. Transforming data and training models is the core of data science work, after all. It is exactly what they were trained to do and it should be fast. As we shall see in the next section, this is rarely the case.
Analyzing why AI projects fail slowly
Data scientists often come from research backgrounds and approach the work of building machine learning models methodically. After receiving a business problem and determining what an AI solution would look like, the typical process runs as follows:
- Gather the data.
- Cleanse the data.
- Transform the data.
- Build a machine learning model.
- Determine whether the model is acceptable.
- Tune the hyperparameters.
- Deploy the model.
If the model is acceptable in step 5, data scientists will proceed to step 6. If the model is unacceptable, they will return to step 3 and try different models and transformations. This process can be seen in Figure 1.2:
While this process follows the five steps outlined in Figure 1.1, it's long and cumbersome. There are also specific drawbacks related to transforming data, building models, tuning hyperparameters, and deploying models. We will now take a closer look at the drawbacks inherent in this process that cause data science projects to fail slowly instead of quickly, greatly impacting the ROI problem:
- Data has to be transformed differently by algorithms. Data transformation can be a tedious process. Learning how to fill in null values, one-hot encode categorical variables, remove outliers, and scale datasets appropriately isn't easy and takes years of experience to master. It also takes a lot of code, and it's easy for novice programmers to make mistakes, especially when learning a new programming language.
Furthermore, different data transformations are required for different algorithms. Some algorithms can handle null values while others cannot. Some can handle highly imbalanced datasets, those in which you only have a few samples of a category you're trying to predict, while other algorithms break.
One algorithm may perform well if you remove outliers while another will be completely unaffected. Whenever you choose to try a new model, there's a high likelihood that you will need to spend time reengineering your data for your algorithm.
- Some algorithms require hyperparameter tuning to perform well. Unlike Random Forest, algorithms such as XGBoost and LightGBM only perform well if you tune their hyperparameters, but once tuned, they can perform exceptionally well.
Thus, you have two choices: stick to models that perform decently without tuning or spend days to weeks tuning a model with high potential but no guarantee of success. Furthermore, you need a massive amount of algorithm-specific knowledge to become a successful data scientist using the traditional approach.
- Hyperparameter tuning takes a lot of compute hours. Days to weeks to train a single machine learning model may seem like an exaggeration, but in practice, it's common. At most companies, GPUs and CPUs are a limited resource, and datasets can be quite large. Certain companies have lines to get access to computing power, and data scientists require a lot of it.
When hyperparameter tuning, it's common to perform something called a grid search, where a model is trained for every possible combination of hyperparameter values over a specified search space. For example, say algorithm X has hyperparameters A, B, and C, and you want to try setting A to 1, 2, and 3, B to 0 and 1, and C to 0.5, 1, and 1.5. Tuning this model requires building 3 x 2 x 3 = 15 machine learning models to find the ideal combination.
Now, 15 models may take a few minutes or a few days to train depending on the size of the data you are using, the algorithm you wish to employ, and your computing capacity. However, 15 models is a small number. Many modern machine learning algorithms require you to tune five to six hyperparameters over a wide range of values, producing hundreds to thousands of models in search of the best performance.
- Deploying models is hard and it isn't taught in school. Even after you transform data correctly and find the perfect model with the best set of hyperparameters, you still need to deploy it to be used by the business. When new data comes in, it needs to get scored by the model and delivered somewhere. Monitoring is necessary as machine learning models do malfunction occasionally and must be retrained. Yet, data scientists are rarely trained in model deployment. It's considered more of an IT function.
As a result of this lack of training, many companies use hacky infrastructures with an array of slapped-together technologies for machine learning deployment. A database query may be kicked off by third-party scheduling software. Data may be transformed in one computer language, stored in a file on a file share, and picked up by another process that scores it with the model in a different language and saves the results back to the file share. Fragile, ad hoc solutions are the norm, and for most data scientists, building them is entirely on-the-job training.
- Data scientists focus on accuracy instead of explainability. Fundamentally, data scientists are trained to build the most accurate AI they can. Kaggle competitions are all about accuracy and new algorithms are judged based on how well they perform compared to solutions of the past.
Businesspeople, on the other hand, often care more about the why behind the prediction. Forgetting to include explainability undermines the trust end users need to place in machine learning, and as a result, many models end up unused.
All in all, building machine learning models takes a lot of time, and when 87% of AI projects fail to make it to production, that's a lot of time wasted. Gathering data and cleansing data are processes that, by themselves, take a lot of time.
Transforming data for each model, tuning hyperparameters, and figuring out and implementing a deployment solution can take even longer. In such a scenario, it's easy to focus on finding the best model possible and overlook the most important part of the project: earning the trust of your end users and ensuring your solution gets used. Luckily, there's a solution to a lot of these problems.
Solving the ROI problem with AutoML
Given the high failure rate of AI projects, the lack of understanding that businesses have regarding how machine learning works, and the long length of time each project takes, Microsoft and other companies have worked to develop solutions that allow faster development and a higher success rate. One such solution is AutoML.
By automating a lot of the work that data scientists do, and by harnessing the power of cloud computing, AutoML on Azure allows data scientists to work faster and allows even non-specialists to build AI solutions.
Specifically, AutoML on Azure transforms data, builds models, and tunes hyperparameters for you. Deployment is possible with a few button clicks and explainability is inherently built into the solution. Compared to the traditional machine learning process in Figure 1.2, the AutoML process in Figure 1.3 is much more straightforward:
Following the AutoML approach allows you to fail much faster and gets you to a state where you need to decide between adding more data or dropping the project. Instead of wasting time tuning models that have no chance of working, AutoML gives you a definitive answer after only a single AutoML run. Let's take a look at a more detailed description of the advantages of AutoML.
Advantages of AutoML on Azure
- AutoML transforms data automatically: Once you have a cleansed, error-free dataset in an easy-to-understand format, you can simply load that data into AutoML. You do not need to fill in null values, one-hot encode categorical values, scale data, remove outliers, or worry about balancing datasets except in extreme cases. This is all done via AutoML's intelligent feature engineering. There are even data guardrails that automatically detect any problems in your dataset that may lead to a poorly built model.
- AutoML trains models with the best algorithms: After you load your data into AutoML, it will start training models using the most up-to-date algorithms. Depending on your settings and the size of your compute, AutoML will train these models in parallel using the Azure cloud. At the end of your run, AutoML will even build complex ensemble models combining the results of your highest performing models.
- AutoML tunes hyperparameters for you: As you use AutoML on Azure, you will notice that it will often create models using the same algorithms over and over again.
You may notice that while early on in the run it was trying a wide range of algorithms, by the end of the run, it's focusing on only one or two. This is because it is testing out different hyperparameters. While it may not find the absolute best set of hyperparameters on any given run, it is likely to deliver a high-performing, well-tuned model.
- AutoML has super-fast deployment: Models built using AutoML on Azure can be deployed to a REST API endpoint in just a few clicks. The accompanying script details the data schema that you need to pass to the endpoint. Once you have created the REST API, you can call it from anywhere to score data and store the results in a database of your choice.
- AutoML has in-built explainability: Recently, Microsoft has focused on responsible AI. A key element of responsible AI is being able to explain how your machine learning model is making decisions.
AutoML-generated models come with a dashboard showing the importance of the different features used by your model. This is available for all of the models you train with AutoML unless you turn on the option to use black-box deep learning algorithms. Even individual data points can be explained, greatly helping your model to earn the trust and acceptance of business end users.
- AutoML enables data scientists to iterate faster: Through intelligent feature engineering, parallel model training, and automatic hyperparameter tuning, AutoML lets data scientists fail faster and succeed faster. If you cannot get decent performance with AutoML, you know that you need to add more data.
Conversely, if you do achieve great performance with AutoML, you can either choose to deploy the model as is or use AutoML as a baseline to compare against your hand-coded models. At this point in time, it's expected that the best data scientists will be able to manually build models that outperform AutoML in some cases.
- AutoML enables non-data scientists to do data science: Traditional machine learning has a high barrier to entry. You have to be an expert at statistics, computer programming, and data engineering to succeed in data science, and those are just the hard skills.
AutoML, on the other hand, can be performed by anyone who knows how to shape data. With a bit of SQL and database knowledge, you can harness the power of AI and build and deploy machine learning models that deliver business value fast.
- AutoML is the wave of the future: Just as AI has evolved from a buzzword to a practice, the way that machine learning solutions get created needs to evolve from research projects to well-oiled machines. AutoML is a key piece of that well-oiled machine, and AutoML on Azure has many features that will empower you to fail and succeed faster. From data transformation to deployment to end user acceptance, AutoML makes machine learning easier and more accessible than ever before.
- AutoML is widely available: Microsoft's AutoML is not only available on Azure but can also be used inside Power BI, ML.NET, SQL Server, Synapse, and HDInsight. As it matures further, expect it to be incorporated into more and more Azure and non-Azure Microsoft services.
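As the deployment point above notes, scoring a deployed model is an ordinary HTTPS POST of JSON to the endpoint. The exact request schema comes from the scoring script AutoML generates for your model; the column names, URL, and key below are placeholders for illustration:

```python
import json

# Hypothetical input row for the basketball example; the real schema is
# defined by the scoring script generated with your deployed model.
payload = {
    "data": [
        {"height_cm": 198, "games_played_europe": 120, "points_per_game": 14.2}
    ]
}
body = json.dumps(payload)

# Sending the request (placeholder URL and key, not executed here):
#
#   import requests
#   headers = {"Authorization": "Bearer <endpoint-key>",
#              "Content-Type": "application/json"}
#   response = requests.post("https://<your-endpoint>/score",
#                            data=body, headers=headers)
#   predictions = response.json()
```

The response contains the model's predictions, which your calling application can then write to any database or report of your choice.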
AI and machine learning may have captured the world's imagination, but there's a large gap between the pie-in-the-sky promises of AI and the reality of AI projects. Machine learning projects, in particular, fail often and slowly. Traditional managers treat data science projects like software engineering projects, and data scientists work in a manual, time-consuming manner. Luckily, AutoML has emerged as a way to speed up projects, and Microsoft has created its AutoML offering with your needs in mind.
You are now primed for Chapter 2, Getting Started with Azure Machine Learning Service, which will introduce you to the Microsoft Azure Machine Learning workspace. You will create an Azure Machine Learning workspace and all of the necessary components required to start an AutoML project. By the end of the chapter, you will have a firm grasp of all of the different components of Azure Machine Learning Studio and how they interact.