If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. The stories begin with today's reality: computers being taught to play simple games like tic-tac-toe and to automate routine tasks. As the stories go, machines are later given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted."
Thankfully, at the time of this writing, machines still require user input.
Though your impressions of machine learning may be colored by these mass-media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us with making sense of the world's massive data stores.
Putting popular misconceptions aside, by the end of this chapter, you will gain a more nuanced understanding of machine learning. You will also be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. You will learn:
The origins, applications, and pitfalls of machine learning
How computers transform data into knowledge and action
Steps to match a machine learning algorithm to your data
The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems.
Beginning at birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others.
From the advent of written language, human observations have been recorded. Hunters monitored the movement of animal herds; early astronomers recorded the alignment of planets and stars; and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in ever-growing computerized databases.
The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors, such as cameras, microphones, chemical noses, electronic tongues, and pressure sensors mimic the human ability to see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human's limited and subjective attention, an electronic sensor never takes a break and has no emotions to skew its perception.
Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error due to hardware limitations. Others are limited by their scope. A black-and-white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope.
Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. Weather sensors record temperature and pressure data; surveillance cameras watch sidewalks and subway tunnels; and all manner of electronic behaviors are monitored: transactions, communications, social media relationships, and many others.
This deluge of data has led some to state that we have entered an era of big data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting datasets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense of it all.
The field of study interested in the development of computer algorithms for transforming data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in the volume of data necessitated additional computing power, which in turn spurred the development of statistical methods for analyzing large datasets. This created a cycle of advancement allowing even larger and more interesting data to be collected, and enabling today's environment in which endless streams of data are available on virtually any topic.
A closely related sibling of machine learning, data mining, is concerned with the generation of novel insight from large databases. As the term implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, a potential point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem.
Virtually all data mining involves the use of machine learning, but not all machine learning requires data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates. On the other hand, if the computer is learning how to drive the car itself, this is purely machine learning without data mining.
Most people have heard of Deep Blue, the chess-playing computer that in 1997 was the first to win a game against a world champion. Another famous computer, Watson, defeated two human opponents on the television trivia game show Jeopardy in 2011. Based on these stunning accomplishments, some have speculated that computer intelligence will replace workers in information technology occupations, just as machines replaced workers in fields and assembly lines.
The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action.
Without completely discounting the achievements of Deep Blue and Watson, it is important to note that neither is even as intelligent as a typical five-year-old. For more on why "comparing smarts is a slippery business," see the Popular Science article FYI: Which Computer Is Smarter, Watson Or Deep Blue?, by Will Grunewald, 2012: https://www.popsci.com/science/article/2012-12/fyi-which-computer-smarter-watson-or-deep-blue.
Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer: the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.
To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good.
Machine learning is most successful when it augments, rather than replaces, the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer; assists engineers and programmers with efforts to create smarter homes and automobiles; and helps social scientists to build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any effort that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.
Though it is impossible to list every use case for machine learning, a look at recent success stories identifies several prominent examples:
Identification of unwanted spam messages in email
Segmentation of customer behavior for targeted advertising
Forecasts of weather behavior and long-term climate changes
Reduction of fraudulent credit card transactions
Actuarial estimates of financial damage of storms and natural disasters
Prediction of popular election outcomes
Development of algorithms for auto-piloting drones and self-driving cars
Optimization of energy use in homes and office buildings
Projection of areas where criminal activity is most likely
Discovery of genetic sequences linked to diseases
By the end of this book, you will understand the basic machine learning algorithms that are employed to teach computers to perform these tasks. For now, it suffices to say that no matter what the context is, the machine learning process is the same. Regardless of the task, an algorithm takes data and identifies patterns that form the basis for further action.
Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, emulates a relatively limited subset of the capabilities of the human brain. It offers little flexibility to extrapolate outside of strict parameters and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what an algorithm has learned before setting it loose in the real world.
Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple inferences about logical next steps. Take, for instance, the banner advertisements seen on many websites. These are served according to patterns learned by data mining the browsing history of millions of users. Based on this data, someone who views websites selling shoes is interested in buying shoes and should therefore see advertisements for shoes. The problem is that this becomes a never-ending cycle in which, even after shoes have been purchased, additional shoe advertisements are served, rather than advertisements for shoelaces and shoe polish.
Many people are familiar with the deficiencies of machine learning's ability to understand or translate language, or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to "Beat up Martin" was misinterpreted by the Newton as "Eat up Martha."
Machine language processing has improved enough in the time since the Apple Newton that Google, Apple, and Microsoft are all confident in their ability to offer voice-activated virtual concierge services such as Google Assistant, Siri, and Cortana. Still, these services routinely struggle to answer relatively simple questions. Furthermore, online translation services sometimes misinterpret sentences that a toddler would readily understand, and the predictive text feature on many devices has led to a number of humorous "autocorrect fail" sites that illustrate computers' ability to understand basic language but completely misunderstand context.
Some of these mistakes are surely to be expected. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes misunderstand context. In spite of the fact that machine learning is rapidly improving at language processing, the consistent shortcomings illustrate the important fact that machine learning is only as good as the data it has learned from. If context is not explicit in the input data, then just like a human, the computer will have to make its best guess from its limited set of past experiences.
At its core, machine learning is simply a tool that assists us with making sense of the world's complex data. Like any tool, it can be used for good or for evil. Where machine learning goes most wrong is when it is applied so broadly, or so callously, that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless can lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to at least briefly consider the ethical implications of the art.
Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain, and constantly in flux. Caution should be exercised when obtaining or analyzing data in order to avoid breaking laws; violating terms of service or data use agreements; or abusing the trust or violating the privacy of customers or the public.
The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, was at one time, "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states, "above all, do no harm."
Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in a store. Many have equipped checkout lanes with devices that print coupons for promotions based on a customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless, but consider what happens when this practice is taken a bit further.
One possibly apocryphal tale concerns a large retailer in the United States that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items such as diapers, baby formula, and toys.
Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.
After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain called to offer an apology, it was the father who ultimately apologized after confronting his daughter and discovering that she was indeed pregnant!
Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information, such as health data, is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion when choosing how to reveal the pattern its machine learning analysis had discovered.
For more detail on how retailers use machine learning to identify pregnancies, see the New York Times Magazine article, titled How Companies Learn Your Secrets, by Charles Duhigg, 2012: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.
As machine learning algorithms are more widely applied, we find that computers may learn some unfortunate behaviors of human societies. Sadly, this includes perpetuating race or gender discrimination and reinforcing negative stereotypes. For example, researchers have found that Google's online advertising service is more likely to show ads for high-paying jobs to men than women, and is more likely to display ads for criminal background checks to black people than white people.
Proving that these types of missteps are not limited to Silicon Valley, a Twitter chatbot service developed by Microsoft was quickly taken offline after it began spreading Nazi and anti-feminist propaganda. Often, algorithms that at first seem "content neutral" quickly start to reflect majority beliefs or dominant ideologies. An algorithm created by Beauty.AI to reflect an objective conception of human beauty sparked controversy when it favored almost exclusively white people. Imagine the consequences if this had been applied to facial recognition software for criminal activity!
For more information about the real-world consequences of machine learning and discrimination see the New York Times article When Algorithms Discriminate, by Claire Cain Miller, 2015: https://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html.
To limit the ability of algorithms to discriminate illegally, certain jurisdictions have well-intentioned laws that prevent the use of racial, ethnic, religious, or other protected class data for business reasons. However, excluding this data from a project may not be enough because machine learning algorithms can still inadvertently learn to discriminate. If a certain segment of people tends to live in a certain region, buys a certain product, or otherwise behaves in a way that uniquely identifies them as a group, machine learning algorithms can infer the protected information from other factors. In such cases, you may need to completely de-identify these people by excluding any potentially identifying data in addition to the already-protected statuses.
Apart from the legal consequences, inappropriate use of data may hurt the bottom line. Customers may feel uncomfortable or become spooked if aspects of their lives they consider private are made public. In recent years, a number of high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed or their data was used for purposes beyond what the users had originally intended. The fact that privacy expectations differ by context, by age cohort, and by locale adds complexity to deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin on your project, in addition to being aware of ever-more-restrictive regulations such as the European Union's newly-implemented General Data Protection Regulation (GDPR) and the inevitable policies that will follow in its footsteps.
Finally, it is important to note that as machine learning algorithms become progressively more important to our everyday lives, there are greater incentives for nefarious actors to work to exploit them. Sometimes, attackers simply want to disrupt algorithms for laughs or notoriety—such as "Google bombing," the crowd-sourced method of tricking Google's algorithms to highly rank a desired page.
Other times, the effects are more dramatic. A timely example of this is the recent wave of so-called fake news and election meddling, propagated via the manipulation of advertising and recommendation algorithms that target people according to their personality. To avoid giving such control to outsiders, when building machine learning systems, it is crucial to consider how they may be influenced by a determined individual or crowd.
Social media scholar danah boyd (styled lowercase) presented a keynote at the Strata Data Conference 2017 in New York City that discussed the importance of hardening machine learning algorithms to attackers. For a recap, refer to: https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b.
The consequences of malicious attacks on machine learning algorithms can also be deadly. Researchers have shown that by creating an "adversarial attack" that subtly distorts a street sign with carefully chosen graffiti, an attacker might cause an autonomous vehicle to misinterpret a stop sign, potentially resulting in a fatal crash. Even in the absence of ill intent, software bugs and human errors have already led to fatal accidents in autonomous vehicle technology from Uber and Tesla. With such examples in mind, it is of the utmost importance and ethical concern that machine learning practitioners should worry about how their algorithms will be used and abused in the real world.
A formal definition of machine learning attributed to computer scientist Tom M. Mitchell states that a machine learns whenever it is able to utilize its experience such that its performance improves on similar experiences in the future. Although this definition is intuitive, it completely ignores the process of exactly how experience can be translated into future action—and, of course, learning is always easier said than done!
Where human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit. For this reason, although it is not strictly necessary to understand the theoretical basis of learning, this foundation helps us to understand, distinguish, and implement machine learning algorithms.
As you compare machine learning to human learning, you may find yourself examining your own mind in a different light.
Regardless of whether the learner is a human or a machine, the basic learning process is similar. It can be divided into four interrelated components:
Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
Abstraction involves the translation of stored data into broader representations and concepts.
Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.
Although the learning process has been conceptualized here as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind's eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, computers make these processes explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, utilized for future action, and treated as a data "science."
The data science buzzword suggests a relationship among the data, the machine, and the people who guide the learning process. The term's growing use in job descriptions and academic degree programs reflects its operationalization as a field of study concerned with both statistical and computational theory, as well as the technological infrastructure enabling machine learning and its applications. The field often asks its practitioners to be compelling storytellers, balancing an audacity in the use of data with the limitations of what one may infer and forecast from the data. To be a strong data scientist, therefore, requires a strong understanding of how the learning algorithms work.
All learning begins with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random-access memory (RAM) in combination with a central processing unit (CPU).
It may seem obvious, but the ability to store and retrieve data alone is insufficient for learning. Stored data is merely ones and zeros on a disk. It is a collection of memories, meaningless without a broader context. Without a higher level of understanding, knowledge is purely recall, limited to what has been seen before and nothing else.
To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to know that perfect recall would unlikely be of much assistance. Even if you could memorize material perfectly, this rote learning would provide no benefit without knowing the exact questions and answers that would appear on the exam. Otherwise, you would need to memorize answers to every question that could conceivably be asked, on a subject in which there is likely to be an infinite number of questions. Obviously, this is an unsustainable strategy.
Instead, a better approach is to spend time selectively, and memorize a relatively small set of representative ideas, while developing an understanding of how the ideas relate and apply to unforeseen circumstances. In this way, important broader patterns are identified, rather than memorizing each and every detail, nuance, and potential application.
This work of assigning a broader meaning to stored data occurs during the abstraction process, in which raw data comes to represent a wider, more abstract concept or idea. This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images:
The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("This is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, in spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that observers are able to connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that can be held in the hand. Abstracted connections like this are the basis of knowledge representation, the formation of logical structures that assist with turning raw sensory information into meaningful insight.
During a machine's process of knowledge representation, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte's pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts.
There are many different types of models. You may already be familiar with some. Examples include:
Relational diagrams, such as trees and graphs
Logical if/else rules
Groupings of data known as clusters
The choice of model is typically not left up to the machine. Instead, the learning task and the type of data on hand inform model selection. Later in this chapter, we will discuss in more detail the methods for choosing the appropriate model type.
The process of fitting a model to a dataset is known as training. When the model has been trained, the data has been transformed into an abstract form that summarizes the original information.
You might wonder why this step is called "training" rather than "learning." First, note that the process of learning does not end with data abstraction—the learner must still generalize and evaluate its training. Second, the word "training" better connotes the fact that the human teacher trains the machine student to understand the data in a specific way.
It is important to note that a learned model does not itself provide new data, yet it does result in new knowledge. How can this be? The answer is that imposing an assumed structure on the underlying data gives insight into the unseen. It supposes a new concept that describes a manner in which data elements may be related.
Take, for instance, the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity, but the force we now know as gravity was always present. It simply wasn't recognized until Newton expressed it as an abstract concept that relates some data to other data—specifically, by becoming the g term in a model that explains observations of falling objects.
Most models will not result in the development of theories that shake up scientific thought for centuries. Still, your abstraction might result in the discovery of important, but previously unseen, patterns and relationships among data. A model trained on genomic data might find several genes that when combined are responsible for the onset of diabetes, banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity, or psychologists might identify a combination of personality characteristics indicating a new disorder. These underlying patterns were always present, but by presenting information in a different format, a new idea is conceptualized.
The next step in the learning process is to use the abstracted knowledge for future action. However, among the countless underlying patterns that may be identified during the abstraction process and the myriad ways to model those patterns, some patterns will be more useful than others. Unless the production of abstractions is limited to the useful set, the learner will be stuck where it started, with a large pool of information but no actionable insight.
Formally, the term generalization is defined as the process of turning abstracted knowledge into a form that can be utilized for future action, on tasks that are similar, but not identical, to those the learner has seen before. It acts as a search through the entire set of models (that is, theories or inferences) that could be established from the data during training.
If you can imagine a hypothetical set containing every possible way the data might be abstracted, generalization involves the reduction of this set into a smaller and more manageable set of important findings.
In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks. Normally, it is not feasible to reduce the number of patterns by examining them one-by-one and ranking them by future utility. Instead, machine learning algorithms generally employ shortcuts that reduce the search space more quickly. To this end, the algorithm will employ heuristics, which are educated guesses about where to find the most useful inferences.
Heuristics utilize approximations and other rules of thumb, which means they are not guaranteed to find the best model of the data. However, without taking these shortcuts, finding useful information in a large dataset would be infeasible.
Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized your gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics.
The incredible human ability to make quick decisions often relies not on computer-like logic, but rather on emotion-guided heuristics. Sometimes, this can result in illogical conclusions. For example, more people express fear of airline travel than automobile travel, despite automobiles being statistically more dangerous. This can be explained by the availability heuristic, which is the tendency for people to estimate the likelihood of an event by how easily examples can be recalled. Accidents involving air travel are highly publicized. Being traumatic events, they are likely to be recalled very easily, whereas car accidents barely warrant a mention in the newspaper.
The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. The algorithm is said to have a bias if the conclusions are systematically erroneous, which implies that they are wrong in a consistent or predictable manner.
For example, suppose that a machine learning algorithm learned to identify faces by finding two dark circles representing eyes, positioned above a straight line indicating a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. Faces with glasses, turned at an angle, looking sideways, or with certain skin tones might not be detected by the algorithm. Similarly, it could be biased toward faces with other skin tones, face shapes, or characteristics that conform to its understanding of the world.
In modern usage, the word "bias" has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a little difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, studies in the field of psychology have suggested that individuals born with damage to the portions of the brain responsible for emotion may be ineffectual at decision-making and might spend hours debating simple decisions, such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us from some information, while also allowing us to utilize other information for action. It is how machine learning algorithms choose among the countless ways to understand a set of data.
Bias is a necessary evil associated with the abstraction and generalization processes inherent in any learning task. In order to drive action in the face of limitless possibility, all learning must have a bias. Consequently, each learning strategy has weaknesses; there is no single learning algorithm to rule them all. Therefore, the final step in the learning process is to evaluate its success, and to measure the learner's performance in spite of its biases. The information gained in the evaluation phase can then be used to inform additional training if needed.
Once you've had success with one machine learning technique, you might be tempted to apply it to every task. It is important to resist this temptation because no machine learning approach is best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit: http://www.no-free-lunch.org.
Generally, evaluation occurs after a model has been trained on an initial training dataset. Then, the model is evaluated on a separate test dataset in order to judge how well its characterization of the training data generalizes to new, unseen cases. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case—mistakes are almost always inevitable.
In part, models fail to generalize perfectly due to the problem of noise, a term that describes unexplained or unexplainable variations in data. Noisy data is caused by seemingly random events, such as:
Measurement error due to imprecise sensors that sometimes add or subtract a small amount from the readings
Issues with human subjects, such as survey respondents reporting random answers to questions in order to finish more quickly
Data quality problems, including missing, null, truncated, incorrectly coded, or corrupted values
Phenomena that are so complex or so little understood that they impact the data in ways that appear to be random
Trying to model noise is the basis of a problem called overfitting; because most noisy data is unexplainable by definition, attempting to explain the noise will result in models that do not generalize well to new cases. Efforts to explain the noise also typically result in more complex models that miss the true pattern the learner is trying to identify.
A model that performs relatively well during training but relatively poorly during evaluation is said to be overfitted to the training dataset because it does not generalize well to the test dataset. In practical terms, this means that it has identified a pattern in the data that is not useful for future action; the generalization process has failed. Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well the methods are able to handle noisy data and avoid overfitting is an important point of distinction among them.
So far, we've focused on how machine learning works in theory. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task, any machine learning algorithm can be deployed by following these steps:
Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source, such as a text file, spreadsheet, or database.
Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs.
Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model.
Model evaluation: Each machine learning model results in a biased solution to the learning problem, which means that it is important to evaluate how well the algorithm learned from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset, or you may need to develop measures of performance specific to the intended application.
Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the model's performance. Sometimes it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work, as in step two of this process.
After these steps have been completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to provide score data for predictions (possibly in real time); for projections of financial data; to generate useful insight for marketing or research; or to automate tasks, such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next-generation learner.
The practice of machine learning involves matching the characteristics of the input data to the biases of the available learning algorithms. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes between input datasets.
The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units, such as person-years, which denote cases where the same person is tracked over multiple years, and each person-year comprises a person's data for one year.
The unit of observation is related, but not identical, to the unit of analysis, which is the smallest unit from which inference is made. Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people (the unit of observation) might be used to analyze trends across countries (the unit of analysis).
Datasets that store the units of observation and their properties can be described as collections of:
Examples: Instances of the unit of observation for which properties have been recorded
Features: Recorded properties or attributes of examples that may be useful for learning
It is easiest to understand features and examples through real-world scenarios. For instance, to build a learning algorithm to identify spam emails, the unit of observation could be email messages, examples would be specific individual messages, and the features might consist of the words used in the messages.
For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be genomic markers from biopsied cells in addition to patient characteristics, such as weight, height, or blood pressure.
People and machines differ in the types of complexity they are suited to handle in the input data. Humans are comfortable consuming unstructured data, such as free-form text, pictures, or sound. They are also flexible handling cases in which some observations have a wealth of features, while others have very little.
On the other hand, computers generally require data to be structured, which means that each example of the phenomenon has the same features, and these features are organized in a form that a computer may understand. To use the brute force of the machine on large, unstructured datasets usually requires a transformation of the input data to a structured form.
The following spreadsheet shows data that has been gathered in matrix format. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles for sale, while the columns record each automobile's features, such as the price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning. As you will see in later chapters, when forms of data are encountered in specialized applications, they are ultimately transformed into a matrix prior to machine learning.
A dataset's features may come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature comprises a set of categories, the feature is called categorical or nominal. A special type of categorical variable is called ordinal, which designates a nominal variable with categories falling in an ordered list. One example of an ordinal variable is clothing sizes, such as small, medium, and large; another is a measurement of customer satisfaction on a scale from "not at all happy" to "somewhat happy" to "very happy." For any given dataset, thinking about what the features represent, their types, and their units, will assist with determining an appropriate machine learning algorithm for the learning task.
Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step toward using data to drive the desired action.
A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features.
Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hour.
Now, because predictive models are given clear instructions on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output.
The often-used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether:
An email message is spam
A person has cancer
A football team will win or lose
An applicant will default on a loan
In classification, the target feature to be predicted is a categorical feature known as the class, which is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Classification is so widely used in machine learning that there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data. We will see examples of these later in this chapter and throughout this book.
Supervised learners can also be used to predict numeric data, such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression is not the only method for numeric prediction, it is by far the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between the inputs and the target, including both the magnitude and uncertainty of the relationship.
Since it is easy to convert numbers to categories (for example, ages 13 to 19 are teenagers) and categories to numbers (for example, assign 1 to all males and 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm.
A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining.
For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is the goal of market basket analysis, which is applied to retailers' transactional purchase data. Here, retailers hope to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunscreen, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items.
Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hotspots for criminal activity.
The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis, which identifies groups of individuals with similar behavior or demographic information in order to target them with advertising campaigns based on their shared characteristics. With this approach, the machine identifies the clusters, but human intervention is required to interpret them. For example, given a grocery store's five customer clusters, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. Despite this human effort, this is still less work than creating a unique appeal for each customer.
Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but rather is focused on learning how to learn more effectively. A meta-learning algorithm uses the result of past learning to inform additional learning.
This encompasses learning algorithms that learn to work together in teams called ensembles, as well as algorithms that seem to evolve over time in a process called reinforcement learning. Meta-learning can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible.
Some of the most exciting work being done in the field of machine learning today is in the domain of meta-learning. For instance, adversarial learning involves learning about a model's weaknesses in order to strengthen its future performance or harden it against malicious attack. There is also heavy investment in research and development efforts to make bigger and faster ensembles, which can model massive datasets using high-performance computers or cloud-computing environments.
The following table lists the general types of machine learning algorithms covered in this book. Although this covers only a fraction of the entire set of machine learning algorithms, learning these methods will provide a sufficient foundation for making sense of any other methods you may encounter in the future.
Supervised learning algorithms
Classification rule learners
Support vector machines
Unsupervised learning algorithms
To begin applying machine learning to a real-world project, you will need to determine which of the four learning tasks your project represents: classification, numeric prediction, pattern detection, or clustering. The task will drive the choice of algorithm. For instance, if you are undertaking pattern detection, you are likely to employ association rules. Similarly, a clustering problem will likely utilize the k-means algorithm, and numeric prediction will utilize regression analysis or regression trees.
For classification, more thought is needed to match a learning problem to an appropriate classifier. In these cases, it is helpful to consider the various distinctions among the algorithms—distinctions that will only be apparent by studying each of the classifiers in depth. For instance, within classification problems, decision trees result in models that are readily understood, while the models of neural networks are notoriously difficult to interpret. If you were designing a credit scoring model, this could be an important distinction because the law often requires that the applicant must be notified about the reasons he or she was rejected for the loan. Even if the neural network is better at predicting loan defaults, if its predictions cannot be explained, then it is useless for this application.
To assist with algorithm selection, in every chapter the key strengths and weaknesses of each learning algorithm are listed. Although you will sometimes find that these characteristics exclude certain models from consideration, in many cases the choice of algorithm is arbitrary. When this is true, feel free to use whichever algorithm you are most comfortable with. Other times, when predictive accuracy is the primary goal, you may need to test several models and choose the one that fits best, or use a meta-learning algorithm that combines several different learners to utilize the strengths of each.
Many of the algorithms needed for machine learning are not included as part of the base R installation. Instead, the algorithms are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open-source software, there is no additional charge for this functionality.
A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages.
If you are interested in the breadth of R packages, you can view a list at Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world to provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN. The CRAN website is available at http://cran.r-project.org/index.html.
If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble.
The Packages link on the left side of the CRAN page will take you to a page where you can browse the packages in alphabetical order or sorted by publication date. At the time of this writing, a total of 13,904 packages were available—over two times the number since the second edition of this book was written, and over three times since the first edition! Clearly, the R community is thriving, and this trend shows no sign of slowing!
The Task Views link on the left side of the CRAN page provides a curated list of packages by subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at https://CRAN.R-project.org/view=MachineLearning.
Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. To demonstrate the use of packages, we will install and load the
RWeka package developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka, Computational Statistics Vol. 24, pp 225-232 for more information). The
RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. For more information on Weka, see http://www.cs.waikato.ac.nz/~ml/weka/.
To use the
RWeka package, you will need to have Java installed, if it isn't already (many computers come with Java preinstalled). Java is a set of programming tools, available for free, that allow for the use of cross-platform applications such as Weka. For more information and to download Java for your system, visit http://www.java.com.
The most direct way to install a package is via the
install.packages() function. To install the
RWeka package, at the R command prompt simply type:
R will then connect to CRAN and download the package in the correct format for your operating system. Some packages, such as
RWeka, require additional packages to be installed before they can be used. These are called dependencies. By default, the installer will automatically download and install any dependencies.
The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed.
The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the
lib option as follows:
> install.packages("RWeka", lib = "/path/to/library")
The installation function also provides additional options for installing from a local file, installing from source, or using experimental versions. You can read about these options in the help file by using the following command:
More generally, the question mark operator can be used to obtain help on any R function. Simply type
? before the name of the function.
In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users with the
library() function as they are needed.
The name of this function leads some people to incorrectly use the terms "library" and "package" interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself.
To load the
RWeka package installed previously, you can type the following:
RWeka, there are several other R packages that will be used in later chapters. Installation instructions will be provided as these additional packages are needed.
To unload an R package, use the
detach() function. For example, to unload the
RWeka package shown previously, use the following command:
> detach("package:RWeka", unload = TRUE)
This will free up any resources used by the package.
Before you begin working with R, it is highly recommended to also install the open-source RStudio desktop application. RStudio is an additional interface to R that includes functionalities that make it far easier, more convenient, and more interactive to work with R code. It is available free of charge at https://www.rstudio.com/.
The RStudio interface includes an integrated code editor, an R command-line console, a file browser, and an R object browser. R code syntax is automatically colorized, and the code's output, plots, and graphics are displayed directly within the environment, which makes it much easier to follow long or complex statements and programs. More advanced features allow R project and package management; integration with source control or version control tools, such as Git and Subversion; database connection management; and the compilation of R output to HTML, PDF, or Microsoft Word formats.
RStudio is a key reason why R is a top choice for data scientists today. It wraps the power of R programming and its tremendous library of machine learning and statistical packages in an easy-to-use and easy-to-install development interface. It is not only ideal for learning R, but can also grow with you as you learn R's more advanced functionality.
Machine learning originated at the intersection of statistics, database science, and computer science. It is a powerful tool, capable of finding actionable insight in large quantities of data. Still, as we have seen in this chapter, caution must be used in order to avoid common abuses of machine learning in the real world.
Conceptually, the learning process involves the abstraction of data into a structured representation, and the generalization of the structure into action that can be evaluated for utility. In practical terms, a machine learner uses data containing examples and features of the concept to be learned, then summarizes this data in the form of a model, which is used for predictive or descriptive purposes. These purposes can be grouped into tasks including classification, numeric prediction, pattern detection, and clustering. Among the many possible methods, machine learning algorithms are chosen on the basis of the input data and the learning task.
R provides support for machine learning in the form of community-authored packages. These powerful tools are available to download at no cost, but need to be installed before they can be used. Each chapter in this book will introduce such packages as they are needed.
In the next chapter, we will further introduce the basic R commands that are used to manage and prepare data for machine learning. Though you might be tempted to skip this step and jump directly into applications, a common rule of thumb suggests that 80 percent or more of the time spent on typical machine learning projects is devoted to the step of data preparation, also known as "data wrangling." As a result, investing some effort into learning how to do this effectively will pay dividends for you later on.