I once told a friend who works as a software developer about one of the largest European data science conferences. He showed genuine interest and asked whether we could go together. Sure, I said. Let's broaden our knowledge together. It will be great to talk to you about machine learning. Several days later, we were sitting in the middle of a large conference hall. The first speaker came on stage and talked about the technical tricks he had used to win several data science competitions. When the next speaker talked about tensor algebra, I noticed a glazed look in my friend's eyes.
— What's up? I asked.
— I'm just wondering when they'll show us the robots.
To avoid having incorrect expectations, we need to inform ourselves. Before building a house, you'd better know how a hammer works. Having basic knowledge of the domain you manage is vital for any kind of manager. A software development manager needs to understand computer programming. A factory manager needs to know the manufacturing processes. A data science manager is no exception. The first part of this book gives simple explanations of the main concepts behind data science. We will dissect and explore it bit by bit.
Data science has become popular, and many business people and technical professionals have an increasing interest in understanding it and applying it to solve their problems. People often form their first opinions about data science from information they pick up in the background: news sites, social networks, and so on. Unfortunately, most of those sources misguide rather than give a realistic picture of data science and machine learning.
Instead of explaining, the media describes ultimate magical tools that easily solve all our problems. The technological singularity is near. A universal income economy is around the corner. All of this would be true only if machines could actually learn and think like humans. In fact, we are far from creating general-purpose, self-learning, and self-improving algorithms.
This chapter explores current possibilities and modern applications of the main tools of data science: machine learning and deep learning.
In this chapter, we will cover the following topics:
- Defining AI
- Introduction to machine learning
- Introduction to deep learning
- Deep learning use case
- Introduction to causal inference
The media and the news use AI as a substitute buzzword for any technology related to data analysis. In fact, AI is a subfield of computer science and mathematics. It all started in the 1950s, when several researchers started asking whether computers can learn, think, and reason. 70 years later, we still do not know the answer. However, we have made significant progress in a specific kind of AI that solves well-specified, narrow tasks: weak AI.
Science fiction novels tell about machines that can reason and think like humans. In scientific language, they are described as strong AI. Strong AI can think like a human, and its intellectual abilities may be much more advanced. The creation of strong AI remains the main long-term dream of the scientific community. However, practical applications are all about weak AI. While strong AI tries to solve the problem of general intelligence, weak AI is focused on solving one narrow cognition task, such as vision, speech, or listening. Examples of weak AI tasks are diverse: speech recognition, image classification, and customer churn prediction. Weak AI plays an important role in our lives, changing the way we work, think, and live. We can find successful applications of weak AI in every area of our lives. Medicine, robotics, marketing, logistics, art, and music all benefit from recent advances in weak AI.
How does AI relate to machine learning? What is deep learning? And how do we define data science? These popular questions are better answered graphically:
This diagram includes all the technical topics that will be discussed in this book:
- AI is a general scientific field that covers everything related to weak and strong AI. We won't focus much on AI, since most practical applications come from its subfields, which we define and discuss through the rest of Section 1: What is Data Science?
- Machine learning is a subfield of AI that studies algorithms that can adapt their behavior based on incoming data without explicit instructions from a programmer.
- Deep learning is a subfield of machine learning that studies a specific kind of machine learning model called deep neural networks.
- Data science is a multidisciplinary field that uses a set of tools to extract knowledge from data and support decision making. Machine learning and deep learning are among the main tools of data science.
The ultimate goal of data science is to solve problems by extracting knowledge from data and giving support for complex decisions. The first part of solving a problem is getting a good understanding of its domain. You need to understand the insurance business before using data science for risk analysis. You need to know the details of the goods manufacturing process before designing an automated quality assurance process. First, you understand the domain. Then, you find a problem. If you skip this part, you have a good chance of solving the wrong problem.
After coming up with a good problem definition, you seek a solution. Suppose that you have created a model that solves a task. A machine learning model in a vacuum is rarely interesting to anyone, and on its own it is not useful. To make it useful, we need to wrap our models in something that can be seen and acted upon. In other words, we need to create software around models. Data science always comes hand in hand with creating software systems. Any machine learning model needs software. Without software, models would just lie in computer memory, not helping anyone.
So, data science is never only about science. Business knowledge and software development are also important. Without them, no solution would be complete.
Data science has huge potential. It already affects our daily lives. Healthcare companies are learning to diagnose and predict major health issues. Businesses use it to find new strategies for winning new customers and personalize their services. We use big data analysis in genetics and particle physics. Thanks to advances in data science, self-driving cars are now a reality.
Thanks to the internet and global computerization, we create vast amounts of data daily. Ever-increasing volumes of data allow us to automate human labor.
Sadly, for each use case that improves our lives, we can easily find two that make them worse. To give you a disturbing example, let's look at China. The Chinese government is experimenting with a new social credit system. It uses surveillance cameras to track the daily lives of its citizens on a grand scale. Computer vision systems can recognize and log every action that you make while commuting to work, waiting in lines at a government office, or going home after a party. A special social score is then calculated based on your monitored actions. This score affects the lives of real people. In particular, public transport fees can change depending on your score; low scores can prohibit you from interviewing for a range of government jobs.
On the other hand, this same technology can be used to help people. For example, it can be used to track criminals in large crowds. The way you apply this new technology can bring the world closer to George Orwell's 1984, or make it a safer place. The general public must be more conscious of these choices, as they might have lasting effects on their lives.
Another disturbing example comes from businesses that adopted machine learning-based hiring algorithms, only to discover months later that the algorithms had introduced a bias against women. It is becoming clear that we do not give the right amount of attention to the ethics of data science. While companies such as Google create internal ethics boards, there is still no governmental control over the unethical use of modern technology. Until such programs arrive, I strongly encourage you to consider the ethical implications of using data science. We all want a better world to live in. Our future, and the future of our children, depends on the small decisions we make each day.
Like any set of tools, data science has its limitations. Before diving into a project with ambitious ideas, it is important to consider the current limits of possibility. A task that seems easily solvable may be unsolvable in practice.
Insufficient understanding of the technical side of data science can lead to serious problems in your projects. You can start a project only to discover that you cannot solve the task at all. Even worse, you may find out that nothing works as intended only after deployment. Depending on your use case, failures like these can affect real people. Understanding the main principles behind data science will help you avoid many technical risks that can predetermine a project's fate before it has even started.
Machine learning is by far the most important tool of a data scientist. It allows us to create algorithms that discover patterns in data with thousands of variables. We will now explore different types and capabilities of machine learning algorithms.
Machine learning is a scientific field that studies algorithms that can learn to perform tasks without specific instructions, relying on patterns discovered in data. For example, we can use algorithms to predict the likelihood of having a disease or assess the risk of failure in complex manufacturing equipment. Every machine learning algorithm follows a simple formula. In the following diagram, you can see a high-level decision process that is based on a machine learning algorithm. Each machine learning model consumes data to produce information that can support human decisions or fully automate them:
We will now explore the meaning of each block in more detail in the next section.
When solving a task using machine learning, you generally want to automate a decision-making process or get insights to support your decisions. For example, you may want an algorithm to output a list of possible diseases, given a patient's medical history and current condition. If machine learning solves your task completely, or end to end, the algorithm's output can be used to make a final decision without further thinking; in our example, determining the disease the patient is suffering from and prescribing suitable medication automatically. The execution of that decision can be manual or automatic. Machine learning applications like these provide a complete solution to the task. Let's look at digital advertising as an example. An algorithm can predict whether you will click on an ad. If our goal is to maximize clicks, we can make automated, personalized decisions about which advertisement to show to each user, solving the click-through rate maximization problem end to end.
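As a sketch, the end-to-end ad decision might look like the following. The model, its features, and its weights are made-up stand-ins for a real trained click predictor:

```python
import math

# A made-up sketch of an end-to-end ad decision. predict_click_prob stands in
# for a trained click model; the weights and features below are invented.
def predict_click_prob(user, ad):
    """Toy linear score squashed into (0, 1) with a sigmoid."""
    score = 0.8 * user["interest_match"][ad] + 0.2 * user["activity"]
    return 1 / (1 + math.exp(-score))

def choose_ad(user, ads):
    """End-to-end decision: show the ad with the highest predicted click probability."""
    return max(ads, key=lambda ad: predict_click_prob(user, ad))

user = {"activity": 0.5, "interest_match": {"sneakers": 2.0, "insurance": -1.0}}
print(choose_ad(user, ["sneakers", "insurance"]))  # -> sneakers
```

The decision requires no human in the loop: the model's output directly determines which ad is shown.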
Another option is to create an algorithm that provides you with an insight. You can use this insight as part of a decision-making process. This way, the outputs of many machine learning algorithms can take part in complex decision making. To illustrate this, we'll look at a warehouse security surveillance system. It monitors all surveillance cameras and identifies employees from the video feed. If the system does not recognize a person as an employee, it raises an alert. This setup uses two machine learning models: face detection and face recognition. At first, the face detection model searches for faces in each frame of the video. Next, the face recognition model identifies a person as an employee by searching the face database. Each model does not solve the employee identification task alone. Yet each model provides an insight that is a part of the decision-making process.
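The two-model pipeline can be sketched as follows; `detect_faces` and `recognize` are hypothetical stand-ins for real face detection and face recognition models:

```python
# A sketch of the surveillance pipeline described above. The detector and
# recognizer are hypothetical stand-ins for trained models; face signatures
# and names are invented.
EMPLOYEES = {"face_007": "Alice", "face_042": "Bob"}

def detect_faces(frame):
    """Stand-in detector: returns the face signatures present in the frame."""
    return frame["faces"]

def recognize(face, database):
    """Stand-in recognizer: looks the face up in the employee database."""
    return database.get(face)            # None -> unknown person

def alerts_for(frame):
    """Raise an alert for every detected face that is not a known employee."""
    return [face for face in detect_faces(frame) if recognize(face, EMPLOYEES) is None]

print(alerts_for({"faces": ["face_007", "face_999"]}))  # -> ['face_999']
```

Neither model decides anything alone; only their combination produces the alert.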
You may have noticed that algorithms in our examples work with different data types. In the digital advertising example, we have used structured customer data. The surveillance example used video feeds from cameras. In fact, machine learning algorithms can work with different data types.
We can divide all of the world's data into two categories: structured and unstructured. Most data is unstructured. Images, audio recordings, documents, books, and articles all represent unstructured data. Unstructured data is a natural byproduct of modern life. Smartphones and social networks facilitate the creation of endless data streams, and it now takes little effort to snap a photo or shoot a video. Analyzing unstructured data is so difficult that practical, general-purpose solutions did not appear until around 2010.
Structured data is hard to gather and maintain, but it is the easiest to analyze. The reason is that we often collect it for this exact purpose. Structured data is typically stored inside computer databases and files. In digital advertising, ad networks apply huge effort to collect as much data as possible. Data is gold for advertising companies. They collect your browsing history, link clicks, time spent on site pages, and many other features for each user. Vast amounts of data allow the creation of accurate click probability models that can personalize your ad browsing experience. Personalization increases click probabilities, which increases advertisers' profits.
To increase their data supply, modern enterprises build their business processes in a way that generates as much structured data as possible. For example, banks record every financial transaction you make. This information is necessary to manage accounts, but they also use it as the main data source for credit scoring models. Those models use customers' financial profiles to estimate their credit default risk probability.
The difficulty of analyzing unstructured data comes from its high dimensionality. To explain this, let's take a data table with two columns of numbers: x and y. We can say that this data has two dimensions.
Each value in this dataset is displayed on the following plot:
As you can see, we can have an accurate guess of what the value of y will be given a value of x. To do this, we can look at the corresponding point on the line. For example, for x = 10, y will be equal to 8, which matches the real data points depicted as blue dots.
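A minimal least-squares fit reproduces this kind of guess; the (x, y) points below are synthetic, chosen to lie close to the line described above:

```python
# Synthetic (x, y) points lying close to the line y = 0.8x, fitted with
# ordinary least squares; at x = 10 the fitted line gives y close to 8.
xs = [0, 2, 4, 6, 8, 10]
ys = [0.1, 1.7, 3.4, 4.7, 6.5, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    """Read the value of y off the fitted line for a given x."""
    return slope * x + intercept

print(round(predict(10), 1))  # -> 8.0
```

With only two dimensions, a straight line is enough to capture the relationship.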
Now, let's shift to a photo taken with an iPhone. Such an image has a resolution of 4,032 x 3,024 pixels, and each pixel has three color channels: red, green, and blue. If we represent this image as numbers, we get over 12 million pixels, and with three values per pixel, more than 36 million numbers per photo. In other words, our data has a dimensionality in the tens of millions.
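Counting both the pixels and the individual color values makes the scale concrete:

```python
# Counting the numbers in one iPhone photo: 4,032 x 3,024 pixels, with three
# color values (red, green, blue) per pixel.
width, height, channels = 4032, 3024, 3
pixels = width * height          # 12,192,768 pixels
values = pixels * channels       # 36,578,304 raw numbers in one photo
print(pixels, values)
```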
In the following screenshot, you can see an image represented as a set of numbers:
Using machine learning algorithms on high-dimensional data can become problematic. Many machine learning algorithms suffer from a problem called the curse of dimensionality. To create a good model that recognizes objects in a photo, you need a model that's much more complex than a simple line. Greater model complexity increases the model's appetite for data, so models that work well on unstructured data usually require vast numbers of training samples.
Before the emergence of data science, machine learning, and deep learning, there was statistics. All fields related to data analysis have statistics at their core. From its very start, statistics was an alloy of many fields. The reason for this is that statistics was (and is) aimed at solving practical problems. In the 17th century, statisticians applied mathematics to data to make inferences and support decisions regarding economics and demographics. Doesn't this sound like data science? Here is an interesting fact: the first machine learning algorithm, linear regression, was invented over 200 years ago by Carl Friedrich Gauss. We still use it today, and its implementation is present in all major machine learning and statistical software packages.
Let's look at the common use cases for machine learning. Arguably the most common one is prediction. Predictive algorithms tell us when something will happen, but not necessarily why it will happen. Some examples of prediction tasks are: Will this user churn in the next month? Does this person have a high risk of developing Alzheimer's disease? Will there be a traffic jam in the next hour? Often, we want explanations instead of predictions. Solving an inference task means finding supporting evidence for some claim in the data by asking Why? questions. Why did this person win the election? Why does this medicine work while another doesn't? Statistical inference helps us find an explanation or prove the efficiency of our actions. When we do inference, we seek answers about the present or the past. When we try to look into the future, prediction comes into play.
Sometimes, we are not interested in predicting the future or finding evidence. We want machines to recognize complex patterns in data, such as objects in an image, or to analyze the sentiment of a text message. This group of tasks is called recognition.
Machine learning covers many types and flavors of models. But why do we need such variety of different algorithms? The reason lies in a theorem called the no free lunch theorem. It states that there is no best model that will consistently give you the best results on each task for every dataset. Each algorithm has its own benefits and pitfalls. It may work flawlessly on one task, but fail miserably at another. One of the most important goals of a data scientist is to find the best model to solve the problem at hand.
The no free lunch theorem states that there is no best model that solves all tasks well. The consequence of this is that we have many algorithms that specialize in solving specific tasks.
For instance, let's look at demand forecasting for a fashion retail warehouse. A retailer sells a fixed set of clothes at its stores. Before an item makes it to the shelves, it must be bought from the manufacturer and transferred to a warehouse. Let's assume that the logistics cycle takes two weeks. How do we know the best quantity of each item to order? There is a good chance that we have item sales data for each store. We can use it to create a predictive model that estimates average customer demand for each item in the warehouse catalog over the next two weeks. That is, we forecast the average number of items to be bought over the next two weeks. The simplest model would take the average demand for each item over the last two weeks as an estimate of average future demand. Simple statistical models like this are frequently used at real retail stores, so we are not oversimplifying. To be more general, let's call the thing we want to forecast the target variable. In this case, the target variable is demand. To build a forecasting model, we use two weeks of historical data to calculate an arithmetic average for each item, and then use those averages as estimates of future values. In a way, the historical data was used to teach our model how to predict the target variable. When a model learns to perform a task from input/output examples, we call the process supervised learning. It would be an overstatement to call our average calculator a supervised learning algorithm, but nothing stops us from doing so (at least technically).
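The two-week-average forecast can be sketched in a few lines; the item names and daily sales figures below are made up:

```python
# A sketch of the "average of the last two weeks" demand forecast.
# Item names and sales figures are invented; each list holds 14 days of
# units sold for one item.
daily_sales = {
    "t-shirt": [3, 5, 4, 6, 2, 7, 5, 4, 6, 5, 3, 4, 6, 5],
    "jeans":   [1, 0, 2, 1, 1, 0, 2, 1, 1, 2, 0, 1, 1, 1],
}

def forecast_demand(sales_by_item, window=14):
    """Estimate future average daily demand as the mean of the last `window` days."""
    return {item: sum(days[-window:]) / min(window, len(days))
            for item, days in sales_by_item.items()}

forecast = forecast_demand(daily_sales)
print(forecast)
```

Historical input/output pairs taught the "model" everything it knows, which is exactly the supervised learning pattern in miniature.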
Supervised learning is not limited to simple models. In general, the more complex your model is, the more data it requires to train. The more training data you have, the better your model will be. To illustrate, we will look into the world of information security. Our imaginary client is a large telecoms company. Over the years, they have experienced many security breaches in their network. Thankfully, specialists have recorded and thoroughly investigated all fraudulent activity on the network. Security experts labeled each fraudulent activity in network logs. Having lots of data with labeled training examples, we can train a supervised learning model to distinguish between normal and fraudulent activity. This model will recognize suspicious behavior from vast amounts of incoming data. However, this will only be the case if experts labeled the data correctly. Our model won't correct them if they didn't. This principle is called garbage in, garbage out. Your model can only be as good as your data.
Both the retail and security examples use supervised learning, but let's look at the target variables more closely. The forecasting algorithm used demand as the target variable. Demand is a continuous number ranging from 0 to infinity. On the other hand, the security model has a fixed number of outcomes.
Network activity is either normal or fraudulent. We call the first type of target variable continuous, and the second type categorical. The target variable type strongly indicates which kind of task we can solve. Prediction tasks with continuous target variables are called regression problems. When the total number of outcomes is limited, we say that we are solving a classification problem. Classification models assign data points to categories, while regression models estimate quantities.
Here are some examples:
- House price estimation is a regression task.
- Predicting user ad clicks is a classification task.
- Predicting HDD utilization in a cloud storage service is a regression task.
- Identifying the risk of credit default is a classification task.
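A purely illustrative helper makes the distinction concrete; the rule below is a simplification, since real projects declare their target types explicitly:

```python
# A purely illustrative rule of thumb: categorical example targets suggest
# classification, numeric ones suggest regression.
def task_type(sample_targets):
    if all(isinstance(t, (bool, str)) for t in sample_targets):
        return "classification"
    return "regression"

print(task_type([250_000.0, 310_500.0, 199_900.0]))  # house prices -> regression
print(task_type(["click", "no_click", "click"]))     # ad clicks -> classification
```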
You can consider yourself lucky if you have found a good labeled dataset. You're even luckier if the dataset is large, contains no missing labels, and is a good match for an end-to-end solution to a problem. The labels we use for supervised learning are a scarce resource; a total absence of labels is a lot more common than a fully labeled dataset. This means that, often, we cannot use supervised learning. But the absence of labels does not mean we are doomed. One solution is to label data by hand. You can assign this task to your employees if the data cannot be shared outside of the company. Otherwise, a much simpler and faster solution is to use crowdsourcing services such as Amazon Mechanical Turk, where you can outsource data labeling to a large number of people, paying a small fee for each data point.
While it's convenient, labeling data is not always affordable, and may be impossible. A learning process where the target variable is missing or can be derived from the data itself is called unsupervised learning. While supervised learning implies that the data was labeled, unsupervised learning removes this limitation, allowing an algorithm to learn from data without guidance.
For example, the marketing department may want to discover new segments of customers with similar buying habits. Those insights can be used to tailor marketing campaigns and increase revenue in each segment. One way to discover hidden structures in your dataset is to use a clustering algorithm. Many clustering algorithms can work with unlabeled data. This characteristic makes them particularly interesting.
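As a sketch, here is a minimal k-means implementation segmenting synthetic customers, each described by [average basket value, visits per month]; all numbers are made up, and a real pipeline would use a library implementation:

```python
import random

# A minimal k-means sketch for customer segmentation. Each point is
# [average basket value, visits per month]; all numbers are synthetic.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest center
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [[sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

budget = [[10, 2], [12, 3], [9, 2]]
premium = [[95, 8], [100, 9], [110, 7]]
centers, clusters = kmeans(budget + premium, k=2)
print(sorted(len(c) for c in clusters))        # two clean segments of three customers
```

Note that no labels were involved: the algorithm discovered the budget and premium segments from the data alone.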
Sometimes, labels hide inside the raw data. Look at the task of music creation. Suppose we want to create an algorithm that composes new music. We can use supervised learning without explicit labeling in this case. The next note in a sequence serves as a great label for this task. Starting with a single note, the model predicts the next note. Taking the previous two, it outputs the third. This way, we can add as many new notes as we need.
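This self-labeling trick is easy to see in code: each prefix's label is simply the note that follows it. The melody below is arbitrary, and the resulting (input, target) pairs could feed any supervised sequence model:

```python
# Self-labeling a melody: the "label" for each prefix is the next note,
# so no human annotation is needed. The melody itself is arbitrary.
melody = ["C", "E", "G", "C", "E", "G", "A", "G"]

pairs = [(melody[:i], melody[i]) for i in range(1, len(melody))]
for context, target in pairs[:3]:
    print(context, "->", target)
```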
Now, let's see whether we can apply machine learning to games. If we take a single game, we may label some data and use supervised learning, but scaling this approach across all games is not practical. On a typical gaming console, you use the same controller to play very different games. Try to recall the first time you ever played a video game; I suppose it was Mario. It is likely that you were unsure of what to do. You might have pressed a few buttons and watched a jumping character. Piece by piece, you figured out the rules of the game and started playing. I wouldn't be surprised if, after a few hours of experience, you felt confident and could finish the first levels.
Using the knowledge we have gained so far, can we design a machine learning algorithm that will learn how to play games? It might be tempting to use supervised learning, but think first. You had no training data when you took the controller for the first time in your life. But can we create algorithms that would figure out game rules by themselves?
It is easy to write a good bot for a specific computer game if we know the rules in advance. Almost all modern computer games have rule-based AIs or non-player characters that can interact with the player. Some games even have such advanced AI that all gameplay builds around this feature. If you are interested, look at Alien: Isolation, released in 2014. The biggest limitation of those algorithms is that they are game-specific. They cannot learn and act based on experience.
This was the case until 2015, when deep learning researchers discovered a way to train machine learning models to play Atari games as humans do: look at the screen and perform actions by using a game controller. The only difference was that the algorithm was not using physical eyes or hands to play the game. It received each frame of the game through RAM and acted via a virtual controller. Most importantly, the model received the current game score in each incoming frame. At the start, the model performed random actions. Some actions led to a score increase that was received as positive feedback. Over time, the model learned input/output or frame/action pairs that corresponded to higher scores. The results were stunning. After 75 hours of training, the model played Breakout at an amateur level. It wasn't provided with any prior knowledge of the game. All it saw were raw pixels. Sometime later, the algorithm had learned how to play Breakout better than humans. The exact same model can be trained to play different games. The learning framework that was used to train such models is called reinforcement learning. In reinforcement learning, an agent (player) learns a policy (a specific way of performing actions based on incoming inputs) that maximizes the reward (game score) in an unknown environment (a computer game).
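The reinforcement learning loop itself can be sketched with tabular Q-learning. The Atari work used deep neural networks over raw pixels; this toy corridor environment is only meant to show the act/observe/update cycle:

```python
import random

# A minimal tabular Q-learning sketch. The agent starts at cell 0 of a
# 3-cell corridor and must learn, from reward alone, that moving right
# toward the goal pays off. The environment and parameters are toy choices.
ACTIONS = (-1, +1)            # move left / move right
GOAL = 2                      # reaching the rightmost cell pays reward 1

def step(state, action):
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=1):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:                    # explore
                action = rng.choice(ACTIONS)
            else:                                         # exploit current estimates
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)  # the learned policy moves right toward the reward
```

The agent is never told the rules; like the Atari models, it infers which actions lead to reward purely from feedback.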
Of course, there are limitations. Remember, there are no free lunches in the machine learning restaurant. While this algorithm performed well on a large set of games, it failed completely at others. In particular, a game called Montezuma's Revenge had stymied every model until 2018. The problem with this game is that you need to perform a specific series of actions over a long time before getting even a small reward signal.
To solve tasks in complex environments, reinforcement learning algorithms need extremely large amounts of data. You may have seen the news about the OpenAI Five model beating professional players in a complex multiplayer esports game called Dota 2. To give you a sense of the scale, the OpenAI team used a cluster of 256 GPUs and 128,000 CPU cores to train their agent. Each day, the model played 180 years' worth of games against itself. This happened in parallel on a large computing cluster, so it took much less time in reality.
Another large victory for reinforcement learning was, of course, the game of Go. The number of possible board positions in Go is larger than the total number of atoms in the observable universe, making the game very hard for computers to tackle. Computers defeated the best human chess player in 1997; for Go, it took another 18 years. If you are interested, Google filmed a documentary about AlphaGo, the algorithm that beat world champions at Go.
Reinforcement learning works well when you can completely simulate your environment; that is, when you know the rules of the game in advance. Yet the real-world problems we most want to solve are often exactly those whose rules we cannot fully specify. This chicken-and-egg problem makes applying reinforcement learning tricky. Still, it can be used to solve real-world problems. A team of scientists used reinforcement learning to find the optimal parameters for a rocket engine. This was possible thanks to complete physical models of the engine's inner workings. They used these models to create a simulation in which a reinforcement learning algorithm changed the parameters and design of the engine to find the optimal setup.
Before writing this section, I thought about the many ways we could draw a line between machine learning and deep learning. Each of them was contradictory in some way. In truth, you can't separate deep learning from machine learning, because deep learning is a subfield of machine learning. Deep learning studies a specific kind of machine learning model called deep neural networks. The mathematical foundations of neural networks date back to the 1940s and 1950s, with the perceptron, an ancestor of modern neural networks, appearing in 1958; the theory behind modern networks took shape in the 1980s. Still, neural networks failed to show good results until the 2010s. Why?
The answer is simple: hardware. Training big neural networks takes a great amount of computation power, but not just any computation power will do. It turns out that neural networks perform a lot of matrix operations under the hood. Strangely, rendering computer graphics also involves many matrix operations; so many, in fact, that each computer has a dedicated circuit for them: a GPU. Nvidia knew of the scientific need for fast matrix operations, so it developed a special programming framework called CUDA. CUDA allows you to harness the power of your GPU not only for computer graphics, but for general computing tasks as well. GPUs can perform insane amounts of parallel matrix operations: while modern desktop CPUs typically have fewer than ten cores, modern graphics cards have thousands, so they can run thousands of floating-point operations in parallel. GPUs specialize in solving one specific task much faster than general-purpose CPUs.
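A back-of-the-envelope count shows why this parallelism matters: a single dense neural network layer is one big matrix multiply, and the work grows with every dimension. The layer sizes below are arbitrary:

```python
# One dense neural-network layer is a single matrix multiply: a batch of 32
# inputs through a 1,024-to-1,024 layer already costs ~33.5 million
# multiply-adds. Layer sizes here are arbitrary illustrative choices.
def matmul_flops(batch, n_in, n_out):
    """Multiply-add count for a (batch x n_in) @ (n_in x n_out) product."""
    return batch * n_in * n_out

print(matmul_flops(32, 1024, 1024))  # -> 33554432
```

Every one of those multiply-adds is independent of the others, which is exactly the workload thousands of GPU cores can share.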
All this meant that scientists could train larger neural networks. The art and science of training large neural networks is called deep learning. The origin of the word deep in the name comes from the specific structure of neural networks that allows them to be efficient and accurate in their predictions. We will look more into the internals of neural networks in Chapter 2, Testing Your Models.
Deep learning is fantastic at solving tasks with unstructured datasets. To illustrate this, let's look at a famous dataset and machine learning competition called ImageNet. The full dataset contains over 14 million images classified into roughly 22,000 distinct categories (the annual competition uses a 1,000-category subset). To solve ImageNet, an algorithm must learn to identify the object in a photo. While human performance on this task is around 95% accuracy, the best neural network models surpassed this level in 2015.
Traditional machine learning algorithms are not very good at working with unstructured data. However, they are equally important because the gap in performance between traditional machine learning models and deep learning models is not so big in the domain of structured datasets. Most winners of data science competitions on structured datasets do not use deep neural networks. They use simpler models because they showed better results on structured datasets. Those models train faster, do not need specialized hardware, and use less computational power.
To differentiate other machine learning algorithms from deep learning, practitioners often refer to deep learning as the field that studies neural networks, and use machine learning for every other model. Drawing a hard line between machine learning and deep learning is technically incorrect, but for lack of better terms, the community has settled on these loose definitions.
We write every day, whether it is documents, tweets, electronic messages, books, or emails; the list goes on and on. Using algorithms to understand natural language is difficult because our language is ambiguous, complex, and full of exceptions and corner cases. The first attempts at natural language processing (NLP) involved building rule-based systems. Linguists carefully designed hundreds and thousands of rules to perform seemingly simple tasks, such as part-of-speech tagging.
From this section's title, you can probably guess it all changed with deep learning. Deep learning models can perform many more text processing tasks without the need to explicitly state complex grammatical rules and parsers. Deep learning models took the NLP world by storm. They can perform a wide range of NLP tasks with much better quality than previous generation NLP models. Deep neural networks translate text to another language with near-human accuracy. They are also quite accurate at doing part-of-speech tagging.
Neural networks also do a good job of solving comprehension problems: question answering and text summarization. In question answering, the model gets a chunk of text and a question about its contents. For example, given the introduction to this section, Introduction to deep learning, a question-answering model could correctly answer the following queries:
- How many cores do modern desktop CPUs have? (less than 10)
- When did neural networks originate? (1980s)
Text summarization seeks to extract the main points from the source. If we feed the first few paragraphs of this section into a text summarization model, we will get the following results:
> Now it is time to draw a line between machine learning and deep learning. In truth, we can't do this, because deep learning is a sub-field of machine learning. Formally, deep learning studies a specific set of models called neural networks. The theory behind neural networks originated in the 1980s.

You can try out a sample text summarization model online at http://textsummarization.net/.
Another intersecting NLP problem is text classification. By labeling many texts as emotionally positive or negative, we can create a sentiment analysis model. As you already know, we can train this kind of model using supervised learning. Sentiment analysis models can give powerful insights when used to measure reactions to news or the general mood around Twitter hashtags.
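To make the supervised learning recipe concrete, here is a minimal sentiment classifier sketch in pure Python. The tiny labeled dataset and the word-counting approach (a naive Bayes-style scorer with Laplace smoothing) are invented for illustration; real systems train neural networks on millions of labeled texts:

```python
import math
from collections import Counter

# Toy labeled dataset (hypothetical examples, not from any real corpus)
train = [
    ("i love this product it is great", "pos"),
    ("what a wonderful great experience", "pos"),
    ("this is terrible i hate it", "neg"),
    ("awful experience very bad product", "neg"),
]

# "Training" step: count how often each word appears in each class
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())

vocab = set(counts["pos"]) | set(counts["neg"])

def classify(text):
    """Pick the class with the highest smoothed log-likelihood."""
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        score = 0.0
        for word in text.split():
            # Laplace smoothing so unseen words do not zero out the score
            score += math.log((c[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("i love this great product"))    # pos
print(classify("this is a bad awful product"))  # neg
```

The mechanics are the same at scale: labeled examples go in, and a model that maps new text to a label comes out.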
Text classification is also used to solve automated email and document tagging. We can use neural networks to process large chunks of emails and to assign appropriate tags to them.
The pinnacle of practical NLP is the creation of dialog systems, or chatbots. Chatbots can be used to automate common scenarios at IT support departments and call centers. However, creating a bot that reliably and consistently solves its task is not easy. Clients tend to communicate with bots in rather unexpected ways, so you will have a lot of corner cases to cover. NLP research is not yet at the point of providing an end-to-end conversational model that can solve the task on its own.
General chatbots use several models to understand user requests and prepare an answer:
- The intent classification model determines the user's request.
- The entity recognition model extracts all named entities from the user's message.
- The response generator model takes the input from the intent classification and entity recognition models and generates a response. The response generator can also use a knowledge database to look up extra information to enrich the response.
- Sometimes, the response generator creates several responses. Then, a separate response ranking model selects the most appropriate one.
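The pipeline above can be sketched with plain functions standing in for each model. All intent names, rules, and response templates here are invented; a production bot would replace each keyword rule with a trained model:

```python
import re

def classify_intent(message):
    """Intent classification model: determine what the user wants."""
    text = message.lower()
    if "password" in text:
        return "password_reset"
    if "order" in text:
        return "order_status"
    return "unknown"

def extract_entities(message):
    """Entity recognition model: pull out named entities (here, order IDs)."""
    return {"order_id": re.findall(r"#(\d+)", message)}

def generate_responses(intent, entities):
    """Response generator: draft one or more candidate answers."""
    if intent == "order_status" and entities["order_id"]:
        oid = entities["order_id"][0]
        return [f"Order #{oid} is on its way.",
                f"Let me check order #{oid} for you."]
    if intent == "password_reset":
        return ["You can reset your password from the login page."]
    return ["Sorry, I did not understand. Could you rephrase?"]

def rank_responses(candidates):
    """Response ranking model: pick the best candidate (here, the shortest)."""
    return min(candidates, key=len)

message = "Where is my order #1234?"
reply = rank_responses(
    generate_responses(classify_intent(message), extract_entities(message)))
print(reply)  # Order #1234 is on its way.
```

Even in this toy form, you can see why corner cases pile up: every model in the chain can misfire, and the errors compound.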
NLP models can also generate texts of arbitrary length based on initial input. State-of-the-art models output results that are arguably indistinguishable from human-written texts.
While the results are very compelling, we have yet to find useful and practical applications for text generation models. Writers and content creators could potentially benefit from them by expanding a list of key points into a coherent text, but these models lack the means to control their output, which makes them difficult to use in practice.
We have explored how deep learning can understand text. Now, let's explore how deep learning models can see. In 2010, the first ImageNet Large Scale Visual Recognition Challenge was held. The task was to create a classification model that solved the ambitious problem of recognizing which object is in an image. In total, there are around 22,000 categories to choose from. The dataset contains over 14 million labeled images. If a person sat and picked the five most likely objects for each image, they would be wrong around 5% of the time.
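The 5% figure refers to the top-5 error used in ImageNet evaluation: a prediction counts as correct if the true label appears among the model's five highest-scoring guesses. A small sketch with invented class scores:

```python
def top5_error(predictions, true_labels):
    """Fraction of examples whose true label is NOT among the
    five highest-scoring predicted classes."""
    misses = 0
    for scores, truth in zip(predictions, true_labels):
        # Sort class labels by predicted score, highest first
        top5 = sorted(scores, key=scores.get, reverse=True)[:5]
        if truth not in top5:
            misses += 1
    return misses / len(true_labels)

# Two toy examples, each scoring six made-up classes
preds = [
    {"cat": 0.5, "dog": 0.2, "fox": 0.1, "car": 0.1, "cup": 0.05, "pen": 0.05},
    {"cat": 0.4, "dog": 0.3, "fox": 0.1, "car": 0.1, "cup": 0.05, "pen": 0.05},
]
print(top5_error(preds, ["dog", "pen"]))  # 0.5: "pen" ranks sixth in example 2
```

With 22,000 categories, allowing five guesses per image makes the metric far more forgiving than exact-match accuracy.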
In 2015, a deep neural network surpassed human performance on ImageNet. Since then, many computer vision algorithms have been rendered obsolete. Deep learning allows us not only to classify images, but also to do object detection and instance segmentation.
The following two pictures help to describe the difference between object detection and instance segmentation:
The preceding photo shows us that object detection models recognize objects and place bounding boxes around them.
In the following image, from https://github.com/matterport/Mask_RCNN, we can see that instance segmentation models find the exact outlines of objects:
Thus, the main practical uses of deep learning in computer vision are essentially the same tasks at different levels of resolution:
- Image classification: Determining the class of an image from a predetermined set of categories
- Object detection: Finding bounding boxes for objects inside an image and assigning a class probability for each bounding box
- Instance segmentation: Doing pixel-wise segmentation of an image, outlining every object from a predetermined class list
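A standard way to score object detection is intersection over union (IoU), which measures how well a predicted bounding box overlaps the true one. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes do not intersect)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union

# Two equal boxes shifted by half their height: 50 overlap / 150 union
print(iou((0, 0, 10, 10), (0, 5, 10, 15)))  # 0.3333333333333333
```

A detection is typically counted as correct when its IoU with the true box exceeds a threshold such as 0.5.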
Computer vision algorithms have found applications in cancer screening, handwriting recognition, face recognition, robotics, self-driving cars, and many other areas.
Another interesting direction in computer vision is generative models. While the models we have already examined perform recognition tasks, generative models change images, or even create entirely new ones. Style transfer models can change the stylistic appearance of an image to look more like another image. This kind of model can be used to transform photos into paintings that look and feel like the work of an artist, as follows:
Another promising approach for training generative models is called Generative Adversarial Networks (GANs). You use two models to train GANs: a generator and a discriminator. The generator creates images. The discriminator tries to tell the real images from your dataset apart from the generated ones. Over time, the generator learns to create more realistic images, while the discriminator learns to spot ever more subtle mistakes in the image generation process. The results of this approach speak for themselves. State-of-the-art models can generate realistic human faces, as you can see on page 3 of Nvidia's paper, https://arxiv.org/pdf/1812.04948.pdf. Take a look at the following images. These photos are not real. A deep neural network generated them:
We can also use GANs to perform conditional image generation. The word conditional means that we can specify some parameters for the generator. In particular, we can specify a type of object or texture that is being generated. For example, Nvidia's landscape generator software can transform a simple color-coded image, where specific colors represent soil, sky, water, and other objects, to realistic-looking photos.
To show how deep learning may work in practical settings, we will explore product matching.
Up-to-date pricing is very important for large internet retailers. In situations where your competitor lowers the price of a popular product, late reaction leads to large profit losses. If you know the correct market price distributions for your product catalog, you can always remain a step ahead of your competitors. To create such a distribution for a single product, you first need to find this product description on a competitor's site. While automated collection of product descriptions is easy, product matching is the hard part.
Once we have a large volume of unstructured text, we need to extract product attributes from it. To do this, we first need to tell whether two descriptions refer to the same product. Suppose that we have collected a large dataset of similar product descriptions. If we shuffle all pairs in our data, we will get another dataset of non-similar product descriptions. Using lots of examples of similar and dissimilar product descriptions, we can train an NLP model that can identify similar product descriptions. We may also think about comparing photos on the retailer's sites to find similar products. To do this, we can apply computer vision models to do the matching. Even with those two models, the total matching accuracy will probably be insufficient for our requirements. Another way to boost it is to extract product attributes from textual descriptions. We may train word-tagging models or develop a set of matching rules to do this task. Matching accuracy will increase, along with the diversity and descriptiveness of the data sources we use.
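As a toy illustration of the matching idea, similar descriptions tend to share many words. The product descriptions below are invented, and Jaccard word overlap is only a stand-in for the trained NLP models a real system would use:

```python
def jaccard(a, b):
    """Word-overlap similarity between two descriptions, from 0 to 1."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Pairs describing the same product on our site and a competitor's site
matching_pairs = [
    ("acme laptop 15 inch 8gb ram", "acme 15 inch laptop with 8gb ram"),
    ("bolt electric kettle 1.7l", "bolt kettle electric 1.7l white"),
]

# Re-pairing the right-hand sides yields non-matching (negative) examples,
# mirroring the shuffling trick described above
non_matching_pairs = [
    (matching_pairs[0][0], matching_pairs[1][1]),
    (matching_pairs[1][0], matching_pairs[0][1]),
]

THRESHOLD = 0.5  # hypothetical decision boundary
for a, b in matching_pairs + non_matching_pairs:
    verdict = "match" if jaccard(a, b) > THRESHOLD else "no match"
    print(f"{jaccard(a, b):.2f} {verdict}: {a!r} vs {b!r}")
```

A trained model replaces the fixed threshold and hand-picked similarity function, but the supervised setup (similar pairs versus shuffled pairs) is the same.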
Up to this point, we have talked about predictive models. The main purpose of a predictive model is to recognize and forecast; the explanation behind the model's reasoning is of lower priority. Causal inference, on the contrary, tries to explain relationships in the data rather than make predictions about future events. In causal inference, we check whether the outcome of some action was really caused by that action, and not by so-called confounding variables: variables that influence both the action and the outcome and can thereby create the illusion of a causal link. Let's compare causal inference and predictive models through several questions that they can help to answer:
- Prediction models:
- When will our sales double?
- What is the probability of this client buying a certain product?
- Causal inference models:
- Was this cancer treatment effective? Or is the effect apparent only because of the complex relationships between variables in the data?
- Was the new version of our recommendation model better than the other? If it is, by what amount does it increase our sales compared to the old model?
- What factors cause some books to be bestsellers?
- What factors cause Alzheimer's disease?
There is a mantra amongst statisticians: correlation does not imply causation. If some variables change together or have similar values, it does not mean that they are connected in some logical way. There are many great examples of absurd correlations in real-world data at http://www.tylervigen.com/spurious-correlations.
Look at the examples in the following screenshots:
Some findings are as bewildering as they are humorous:
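It is easy to manufacture such a correlation: any two quantities that both trend upward over time will correlate strongly, whether or not they are related. The yearly figures below are entirely made up:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two invented yearly series that simply both grow over time
cheese_kg_per_capita = [14.1, 14.3, 14.6, 14.8, 15.1,
                        15.3, 15.6, 15.9, 16.1, 16.4]
data_science_confs = [3, 5, 8, 12, 18, 25, 33, 41, 52, 60]

# Strong correlation, yet cheese hardly causes conferences
print(f"r = {pearson(cheese_kg_per_capita, data_science_confs):.2f}")
```

The coefficient comes out well above 0.9, despite there being no plausible causal link between the two series.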
Seeking explanations and measuring effects is important when data-driven decisions affect people's lives. For example, when a new drug is invented, we need to check that it really works. To do this, we need to collect data and measure the statistical effect of using the new drug versus not using it.
Causal inference poses the problem in a rather ambitious way: the simplest way to measure the effect of some treatment would be to split our universe into two parts. In the first, we do not apply the treatment and act as if the drug was never discovered. In the second, we apply the new treatment. Unfortunately, the creation of new universes is far beyond the capabilities of statisticians. But they have come up with a reasoning framework that allows you to design experiments and collect data as if it came from two separate, independent universes.
The simplest way to do this is to conduct a randomized experiment:
- First, we will randomly sample a test group of people from the entire globe.
- Then, each person will be assigned the new drug or a sugar pill (placebo) with 50% probability.
- After a while, we can measure the treatment's effect in the two groups.
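We can see why randomization works with a small simulation. All numbers here are invented: each person gets a noisy baseline recovery score, the hypothetical drug adds a fixed 0.15 boost, and a fair coin flip decides who gets the drug and who gets the placebo:

```python
import random
from statistics import mean

random.seed(42)

def run_trial(n=10_000, drug_effect=0.15):
    """Simulate one randomized trial and return the estimated effect."""
    treated, control = [], []
    for _ in range(n):
        baseline = random.gauss(0.5, 0.1)  # individual recovery score
        if random.random() < 0.5:          # coin-flip assignment
            treated.append(baseline + drug_effect)
        else:
            control.append(baseline)       # placebo group
    # Because assignment is random, the difference in group means is an
    # unbiased estimate of the true effect
    return mean(treated) - mean(control)

print(f"estimated effect: {run_trial():.3f}")  # close to the true 0.15
```

Randomization makes the two groups statistically interchangeable, so the difference in outcomes can be attributed to the treatment rather than to hidden differences between the groups.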
Studies like this can be very complex to execute. Imagine selecting a random sample of the entire world's population each time you need to test a new drug. Moreover, you cannot transport all individuals to a single place, because a sudden change of environment or climate may affect the treatment results. Such an experiment may also be considered unethical, especially if the treatment is associated with a risk of death. Causal inference allows you to design more complex experimental setups that are still equivalent to a randomized experiment under certain conditions. This way, you can create an ethical and realistic study with statistical rigor.
Another important feature of causal inference is a set of methods that work on observational data. It is not always feasible to conduct an experiment for a hypothesis check. Causal inference can be used to apply predictive models to measure effects on observational data that was not specifically collected for this sole purpose. For example, we can use customer data to measure and quantify the efficiency of marketing campaigns. Observational studies are convenient to execute as they require no experimental setup. However, they can only give you a strong educated guess about real causal relationships. It is always recommended to design and conduct a proper experiment before making data-driven decisions.
The framework of applying a treatment to a test group is very general. It is not limited to medical studies and can measure and explain the effects of any change. Data scientists are often occupied by the question of whether using a machine learning model is better than not using one at all and, if so, how much benefit it brings. Thanks to causal inference, we can find the answer: in this setting, the machine learning model plays the role of the treatment.
The only way to measure the real effect is to check both approaches on real users. While difficult to conduct in the physical world, purely randomized experiments are easy to run on the internet. If you use machine learning models at a large internet company with many customers, designing a randomized experiment can seem easy: you randomly assign one of two versions of your software to each user and wait until a sufficiently large sample of data has been collected.
However, you should be wary of many things that can distort the results:
- Confounders: Hidden biases in the data. Customer lifestyle, social factors, or environmental exposure can affect your seemingly random sample of users.
- Selection bias: A flaw in test group selection. For example, randomly selecting test participants from a single region could skew the study.
- Measurement errors: Erroneous or inconsistent data collection can lead to misleading results.
In this chapter, we have explored the practical applications of AI, data science, machine learning, deep learning, and causal inference. We have defined machine learning as a field that studies algorithms that use data to support decisions and give insights without specific instructions. There are three main machine learning methodologies: supervised, unsupervised, and reinforcement learning. In practice, the most common types of task we solve using machine learning are regression and classification. Next, we described deep learning as a subset of machine learning devoted to studying neural network algorithms. The main application domains of deep learning are computer vision and NLP. We have also touched on the important topic of causal inference: the field that studies a set of methods for discovering causal relationships in data. You now know a lot about general data science capabilities. But can machine learning models successfully solve your specific set of problems?
In the next chapter, we will learn to distinguish good solutions from bad ones by performing model testing.