In this book, we focus on all three aspects of data science: explaining machine learning (ML) algorithms in business applications, demonstrating how they can be implemented in a scalable environment, and showing how to evaluate models and present evaluation metrics as business Key Performance Indicators (KPIs). This book shows how Amazon Web Services (AWS) Machine Learning tools can be effectively used on large datasets. We present various scenarios where mastering machine learning algorithms in AWS helps data scientists perform their jobs more effectively.
Let's take a look at the topics we will cover in this chapter:
- How AWS empowers data scientists
- Identifying candidate problems that can be solved using machine learning
- Machine Learning project life cycle
- Deploying models
The number of digital data records stored on the internet has grown enormously in the last decade. Due to the drop in storage costs and new sources of digital data, it is predicted that the amount of digital data available in 2025 will be 163 zettabytes (1,630,000,000,000 terabytes). Moreover, the amount of data generated every day is increasing at an alarming pace, with almost 90% of current data having been generated in the last two years alone. With more than 3.5 billion people having access to the internet, this data is generated not only by professionals and large companies, but by each of those 3.5 billion internet users.
Moreover, since companies understand the importance of data, they store all of their transactional data in the hope of analyzing it and uncovering interesting trends that could help their business make important decisions. Financial investors likewise crave storing and understanding every bit of information they can get about companies, and train their quantitative analysts, or quants, to make investment decisions.
It is up to the data scientists of the world to analyze this data and find gems of information from it. In the last decade, the data science team has become one of the most important teams in every organization. When data science teams were first created, most of the data would fit in Microsoft Excel sheets and the task was to find statistical trends in the data and provide actionable insights to business teams. However, as the amount of data has increased and machine learning algorithms have become more sophisticated and potent, the scope of data science teams has expanded.
In the following diagram, we can see the three basic skills that a data scientist needs:
The job description for a data scientist varies from company to company. However, in general, a data scientist needs the following three crucial skills:
- Machine learning: Machine learning algorithms provide tools to analyze and learn from large amounts of data and to generate predictions or recommendations from it. They are important tools for analyzing structured (databases) and unstructured (text documents) data and inferring actionable insights from it. Since data scientists have access to a large library of algorithms that could solve a given problem, they should be experts in a plethora of machine learning algorithms and understand which algorithm should be applied in a given situation.
- Computer programming: A data scientist should be an adept programmer who can write code to access various machine learning and statistical libraries. Programming languages such as Scala, Python, and R provide a number of libraries for applying machine learning algorithms to a dataset. Knowledge of such tools helps a data scientist perform complex tasks in a reasonable amount of time, which is crucial in a business environment.
- Communication: Along with discovering trends in the data and building complex machine learning models, a data scientist is also tasked with explaining these findings to business teams. Hence, a data scientist must not only possess good communication skills but also good analytics and visualization skills. This will help them present complex data models in a way that is easily understood by people who are not familiar with machine learning. This also helps data scientists convey their findings to business teams and provide them with guidance on expected outcomes.
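As a small illustration of the computer programming skill above, a few lines of scikit-learn (a popular Python machine learning library used later in this book) are enough to train and evaluate a classifier. The dataset and model choice here are illustrative only, not a recommendation:

```python
# A minimal sketch of training and evaluating a classifier with
# scikit-learn. The dataset (iris) and model are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

The library hides the optimization details, letting the data scientist focus on choosing the right algorithm and evaluating it.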
Machine learning research spans decades and has deep roots in mathematics and statistics. ML algorithms can be used to solve problems in many business applications. In application areas such as advertising, predictive algorithms are used to predict where to find new customers based on trends among previous purchasers. Regression algorithms are used to predict stock prices based on prior trends. Services such as Netflix use recommendation algorithms to study the history of a user and enhance the discoverability of new shows that they may be interested in. Artificial Intelligence (AI) applications such as self-driving cars rely heavily on image recognition algorithms that utilize deep learning to effectively discover and label objects on the road. It is important for a data scientist to understand the nuances of different machine learning algorithms and where they should be applied. Using pre-existing libraries helps a data scientist explore various algorithms for a given application area and evaluate them. AWS offers a large number of libraries that can be used to perform machine learning tasks, as explained in the machine learning algorithms and deep learning algorithms parts of this book.
It is also important for data scientists to be able to understand the scale of data that they are working with. There might be tasks related to medical research that span thousands of patients with hundreds of features, which can be processed on a single machine. However, tasks such as advertising, where companies collect several petabytes of data on customers based on every online advertisement that is served, may require several thousand machines to compute and train machine learning algorithms. Deep learning algorithms are GPU-intensive and require a different type of machine than other machine learning algorithms. In this book, for each algorithm, we describe how it can be implemented simply using Python libraries and then how it can be scaled on large AWS clusters using technologies such as Spark and AWS SageMaker. We also discuss how TensorFlow is used for deep learning applications.
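A back-of-the-envelope calculation often settles the single-machine-versus-cluster question described above. The row counts, feature counts, and 8-byte numeric values below are illustrative assumptions, not benchmarks:

```python
# Rough estimate of the dense in-memory size of a numeric dataset, used
# to decide between a single machine and a cluster. Figures are
# illustrative assumptions only.
BYTES_PER_VALUE = 8  # one 64-bit floating-point number

def estimate_gb(rows, features):
    """Approximate dense in-memory size of a numeric dataset in GB."""
    return rows * features * BYTES_PER_VALUE / 1e9

# Medical-research scale: thousands of patients, hundreds of features.
medical_gb = estimate_gb(rows=10_000, features=500)
# Advertising scale: billions of ad impressions, thousands of features.
ads_gb = estimate_gb(rows=2_000_000_000, features=1_000)

print(f"Medical dataset: ~{medical_gb:.2f} GB (fits on a laptop)")
print(f"Ads dataset: ~{ads_gb:,.0f} GB (needs S3/Spark/SageMaker)")
```

The first dataset fits comfortably in a laptop's memory; the second is thousands of times larger than any single machine can hold, which is what pushes the work onto a cluster.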
It is also crucial for data scientists to understand the customers of their machine learning tasks. Not only is it challenging to find which algorithm works for a specific application area, it is also important to gather evidence on how that algorithm enhances the application area and present it to the product owners. Hence, we also discuss how to evaluate each algorithm and visualize the results where necessary. AWS offers a large array of tools for evaluating machine learning algorithms and presenting the results.
Finally, a data scientist also needs to be able to decide which types of machines are best suited for their needs on AWS. Once the algorithm is implemented, there are important considerations regarding how it can be deployed on large clusters in the most economical way. AWS offers more than 25 hardware alternatives, called instance types, that can be selected. We will discuss case studies on how an application is deployed on production clusters and the various issues a data scientist can face during this process.
A typical machine learning project life cycle starts by understanding the problem at hand. Typically, someone in the organization (possibly a data scientist or business stakeholder) feels that some part of their business can be improved by the use of machine learning. For example, a music streaming company could conjecture that providing recommendations of songs similar to those played by a user would improve user engagement with the platform. Once we understand the business context and possible business actions to take, a data science team will need to consider several aspects during the project life cycle.
The following diagram describes various steps in a machine learning project life cycle:
We need to obtain data and organize it appropriately for the current problem (in our example, this could mean building a dataset linking users to songs they've listened to in the past). Depending on the size of the data, we might pick different technologies for storing it. For example, it might be fine to train on a local machine using scikit-learn if we're working with a few million records. However, if the data doesn't fit on a single computer, then we must consider AWS solutions such as S3 for storage, and Apache Spark or SageMaker's built-in algorithms for model building.
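For the music streaming example above, organizing the raw data could start as simply as aggregating play events into a user-to-songs mapping. The events below are fabricated purely for illustration:

```python
# Illustrative only: turning raw (user, song) play events into a
# user -> songs listening-history mapping, the kind of dataset a
# recommender would be trained on. All data is fabricated.
from collections import defaultdict

raw_events = [
    ("alice", "song_a"), ("alice", "song_b"),
    ("bob",   "song_b"), ("bob",   "song_c"),
    ("carol", "song_a"), ("carol", "song_c"),
]

history = defaultdict(set)
for user, song in raw_events:
    history[user].add(song)

print(dict(history))
```

At a few million records this structure fits in memory on one machine; at advertising scale, the same aggregation would be expressed as a Spark job reading from S3.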
Before applying a machine learning algorithm, we need to consider how to assess the effectiveness of our strategy. In some cases, we can use part of our data to simulate the performance of the algorithm. However, on other occasions, the only viable way to evaluate the application of an algorithm is by doing some controlled testing (A/B testing) and determining whether the use cases in which the algorithm was applied resulted in a better outcome. In our music streaming example, this could mean selecting a panel of users and recommending songs to them using the new algorithm. We can run statistical tests to determine whether these users effectively stayed longer on the platform. Evaluation metrics should be determined based on the business KPI and should show significant improvement over existing processes.
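As a hedged sketch of the controlled testing described above, the following compares a retention metric between a control panel and a panel receiving the new recommendations. The conversion counts are fabricated, and a two-proportion z-test is just one of several tests a team might choose:

```python
# Two-proportion z-test on fabricated A/B-test results: did the panel
# that received the new recommendation algorithm stay on the platform
# at a higher rate than the control panel?
import math

control_retained, control_total = 420, 1000      # old experience
treatment_retained, treatment_total = 465, 1000  # new recommendations

p1 = control_retained / control_total
p2 = treatment_retained / treatment_total
p_pool = (control_retained + treatment_retained) / (control_total + treatment_total)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / treatment_total))
z = (p2 - p1) / se  # |z| > 1.96 is significant at the 5% level

print(f"lift = {p2 - p1:.3f}, z = {z:.2f}")
```

If the test shows a statistically significant lift in the business KPI, that is the evidence a data scientist presents to justify rolling the algorithm out more widely.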
We need to iterate on the complex problem of creating the algorithm. This entails exploring the data to gain a deep understanding of the underlying variables. Once we have an idea of the kind of algorithm we want to apply, we'll need to further prepare the data, possibly combining it with other data sources (for example, census data). In our example, this could mean creating a song similarity matrix. Once we have the data, we can train a model (capable of making predictions) and test that model against holdout data to see how it performs. There are many considerations in this process that make it complex:
- How the data is encoded (for example, how the song matrix is constructed)
- What algorithm is used (for example, collaborative filtering or content-based filtering)
- What parameter values your model takes (for example, values for smoothing constants or prior distributions)
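To make the first two considerations concrete, here is one hypothetical way to encode the listening history and compute a song-to-song cosine similarity matrix, a common starting point for collaborative filtering. Both the data and the encoding are illustrative choices, not the book's prescribed method:

```python
# Illustrative encoding for collaborative filtering: each song is a
# binary vector over users (1 = the user played it), and song-to-song
# similarity is the cosine between those vectors. Data is fabricated.
import math

plays = {  # song -> set of users who played it
    "song_a": {"alice", "carol"},
    "song_b": {"alice", "bob"},
    "song_c": {"bob", "carol"},
}

def cosine(users_x, users_y):
    """Cosine similarity between two binary song vectors."""
    overlap = len(users_x & users_y)
    return overlap / math.sqrt(len(users_x) * len(users_y))

similarity = {
    (a, b): cosine(ua, ub)
    for a, ua in plays.items()
    for b, ub in plays.items()
    if a < b  # each unordered pair once
}
print(similarity)
```

Changing the encoding (play counts instead of binary flags), the algorithm (content-based features instead of co-listening), or the parameters (similarity thresholds, smoothing) each produces a different model, which is exactly why this step is iterative.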
Our goal in this book is to make this step easier for you by presenting iterations a data scientist would undergo in the task of creating a successful model using real-world applications as examples.
Once we generate a model that abides by our initial KPI requirements, we need to deploy it in the production environment. This could be something as simple as creating a list of neighborhoods and political issues to address in each neighborhood, or something as complex as shipping the model to thousands of machines to make real-time decisions about which advertisements to buy for a particular marketing campaign. Once the model is deployed to production, it is important to keep monitoring those KPIs to make sure we're still solving the problem we aimed for initially. Sometimes a model can have negative effects due to changing trends, and another model needs to be trained. For instance, listeners may lose interest over time in continually hearing the same music style, and the process must start all over again.
In this chapter, we first learned how AWS empowers machine learning practitioners and data scientists. We then looked at the various AWS tools that are available for machine learning, after which we learned about the machine learning project life cycle. Finally, we learned how to deploy models.
In the next chapter, we will discuss various popular machine learning algorithms and see how to implement them at scale on AWS. Before continuing to the next chapter, we advise readers who are new to AWS to go through the appendix, Getting Started with AWS, which covers the process of creating a new AWS account.
- Define three applications you can identify on your mobile phone that implement machine learning. For each application, define what the project life cycle is based on the steps presented in this chapter.
- Search for three data scientist job positions and carefully review the job requirements. For each of the requirements, classify whether the skill falls under communication, machine learning, or computer programming.
- As a data scientist, it is important to be aware of the applications around you that are generating data that can be used for machine learning. Based on the electronic devices you use, make a list of data that you generate every day. Define three machine learning applications that can use the data that you generate.