Greetings to you, fellow sentient being; welcome to our exciting journey. The journey itself is to understand the concepts and inner workings behind an elusively powerful computing paradigm: the artificial neural network (ANN). While this notion has been around for almost half a century, the ideas credited with its birth (such as what an agent is, or how an agent may learn from its surroundings) date back to Aristotelian times, and perhaps even to the dawn of civilization itself. Unfortunately, people in the time of Aristotle were not blessed with the ubiquity of big data, or the speeds of Graphics Processing Unit (GPU)-accelerated and massively parallelized computing, which today open up some very promising avenues for us. We now live in an era where the majority of our species has access to the building blocks and tools required to assemble artificially intelligent systems. While covering the entire developmental timeline that brings us here today is slightly beyond the scope of this book, we will attempt to briefly summarize some pivotal concepts and ideas that will help us think intuitively about our problem here.
In this chapter, we will cover the following topics:
- Defining our goal
- Knowing our tools
- Understanding neural networks
- Observing the brain
- Information modeling and functional representations
- Some fundamental refreshers in data science
Essentially, our task here is to conceive a mechanism that is capable of dealing with any data that it is introduced to. In doing so, we want this mechanism to detect any underlying patterns present in our data, in order to leverage them for our own benefit. Succeeding at this task means that we will be able to translate any form of raw data into knowledge, in the form of actionable business insights, burden-alleviating services, or life-saving medicines. Hence, what we actually want is to construct a mechanism that is capable of universally approximating any possible function that could represent our data; the elixir of knowledge, if you will. Do step back and imagine such a world for a moment; a world where the deadliest diseases may be cured in minutes. A world where all are fed, and all may choose to pursue the pinnacle of human achievement in any discipline without fear of persecution, harassment, or poverty. Too much of a promise? Perhaps. Achieving this utopia will take a bit more than designing efficient computer systems. It will require us to evolve our moral perspective in parallel, and to reconsider our place on this planet as individuals, as a species, and as a whole. But you will be surprised by how much computers can help us get there.
It's important here to understand that it is not just any kind of computer system that we are talking about. This is something very different from what our computing forefathers, such as Babbage and Turing, dealt with. This is not a simple Turing machine or difference engine (although many, if not all, of the concepts we will review in our journey relate directly back to those enlightened minds and their inventions). Hence, our goal will be to cover the pivotal academic contributions, practical experimentation, and implementation insights that followed from decades, if not centuries, of scientific research behind the fundamental concept of generating intelligence; a concept that is arguably most innate to us humans, yet so scarcely understood.
We will mainly be working with the two most popular deep learning frameworks that exist, and are freely available to the public at large. This does not mean that we will completely limit our implementations and exercises to these two platforms. It may well occur that we experiment with other prominent deep learning frameworks and backends. We will, however, try to use either TensorFlow or Keras, due to their widespread popularity, large support community, and flexibility in interfacing with other prominent backend and frontend frameworks (such as Theano, Caffe, or Node.js, respectively). We will now provide a little background information on Keras and TensorFlow:
Many have named Keras the lingua franca of deep learning, due to its user friendliness, modularity, and extensibility. Keras is a high-level application programming interface for neural networks, and focuses on enabling fast experimentation. It is written in Python and is capable of running on top of backends such as TensorFlow, CNTK, or Theano. Keras was initially developed as part of the research effort of the ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System) project. Its name is a reference to the Greek word κέρας (keras), which literally translates to horn. The word alludes to a play on words dating back to ancient Greek literature, referring to the horn of Amalthea (also known as Cornucopia), an eternal symbol of abundance.
TensorFlow is an open source software library for high-performance numerical computation using a data representation known as tensors. It allows people like you and me to implement something called dataflow graphs. A dataflow graph is essentially a structure that describes how data moves through a network, or a series of processing nodes. Every node in the graph represents a mathematical operation, and each connection (or edge) between nodes is a multidimensional data array, or tensor. In this manner, TensorFlow provides a flexible API that allows easy deployment of computation across a variety of platforms (such as CPUs, GPUs, and Google's own Tensor Processing Units (TPUs)), and from desktops, to clusters of servers, to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team, it provides an excellent programmatic interface that supports neural network design and deep learning.
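To make the idea of a dataflow graph a little more tangible, here is a minimal sketch in plain Python. It is not the TensorFlow API: the names `Node`, `constant`, and `run` are hypothetical and exist only for this illustration, and the values flowing along the edges are plain numbers rather than tensors. The point is simply that we first describe the operations and how they connect, and only then push data through them.

```python
# Toy dataflow graph in plain Python.
# Illustrative assumption: Node, constant, and run are made-up names,
# and this is NOT how TensorFlow is implemented or called.

class Node:
    """A graph node: one operation plus the upstream nodes that feed it."""

    def __init__(self, op, inputs=()):
        self.op = op                  # callable that performs the computation
        self.inputs = tuple(inputs)   # edges: nodes whose outputs flow into this one


def constant(value):
    """A source node that simply emits a fixed value."""
    return Node(lambda: value)


def run(node):
    """Evaluate a node by first evaluating everything upstream of it."""
    upstream_values = [run(parent) for parent in node.inputs]
    return node.op(*upstream_values)


# Describe the graph for (a * b) + c. Nothing is computed at this point.
a, b, c = constant(2.0), constant(3.0), constant(4.0)
product = Node(lambda x, y: x * y, inputs=(a, b))
total = Node(lambda x, y: x + y, inputs=(product, c))

# Only now does data flow through the graph.
result = run(total)
print(result)
```

Separating the description of a computation from its execution is what lets a framework decide where each operation should run, whether on a CPU, a GPU, or another device.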
We begin our journey with an attempt to gain a fundamental understanding of the concept of learning. Moreover, what we are really interested in is how such a rich and complex phenomenon as learning has been implemented on what many call the most advanced computer known to humankind. As we will observe, scientists seem to continuously find inspiration from the inner workings of our own biological neural networks. If nature has indeed figured out a way to leverage loosely connected signals from the outside world and patch them together as a continuous flow of responsive and adaptive awareness (something most humans will concur with), we would indeed like to know exactly what tricks and treats it may have used to do so. Yet, before we can move on to such topics, we must establish a baseline to understand why the notion of neural networks is far different from most modern machine learning (ML) techniques.
It is extremely hard to draw a parallel between neural networks and any other algorithmic approach to problem-solving that we have thus far. Linear regression, for example, simply calculates a line of best fit by minimizing the mean of squared errors with respect to the plotted observation points. Similarly, centroid clustering just repeatedly assigns each data point to its nearest cluster center and recomputes those centers, until the assignments settle into a stable configuration.
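To show just how compact such a classical technique can be, the following sketch fits a line of best fit to a handful of points using the closed-form least-squares solution. The function names and the five data points are invented for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # since the normalizing factor cancels in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared vertical distance between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical observations that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, mean_squared_error(xs, ys, slope, intercept))
```

The whole procedure is a couple of sums and a division, which is why it is so easy to explain. A neural network, by contrast, chains many such small calculations together.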
Neural networks, on the other hand, are not that easily explicable, and there are many reasons for this. One way of looking at this is that a neural network is an algorithm that itself is composed of different algorithms, performing smaller local calculations as data propagates through it. This definition of neural networks presented here is, of course, not complete. We will iteratively improve it throughout this book, as we go over more complex notions and neural network architectures. Yet, for now, we may well begin with a layman's definition: a neural network is a mechanism that automatically learns associations between the inputs you feed it (such as images) and the outputs you are interested in (that is, whether an image has a dog, a cat, or an attack helicopter).
So, now we have a rudimentary idea of what a neural network is: a mechanism that takes inputs and learns associations to predict some outputs. This versatile mechanism is, of course, not limited to being fed images only. Indeed, such networks are equally capable of taking inputs such as some text or recorded audio, and guessing whether they are looking at Shakespeare's Hamlet, or listening to Billie Jean, respectively. But how could such a mechanism compensate for the variety of data, in both form and size, while still producing relevant results? To understand this, many academics find it useful to examine how nature solves this problem. In fact, the millions of years of evolution that occurred on our planet, through genetic mutations and environmental conditions, have produced something quite similar. Better yet, nature has even equipped each of us with a version of this universal function approximator, right between our two ears! We speak, of course, of the human brain.
Before we briefly delve into this notorious comparison, it is important for us to clarify here that it is indeed just a comparison, and not a parallel. We do not propose that neural networks work exactly in the manner that our brains do, as this would not only anger quite a few neuroscientists, but would also do no justice to the engineering marvel represented by the anatomy of the mammalian brain. This comparison, however, helps us to better understand the workflow by which we may design systems that are capable of picking up relevant patterns from data. The versatility of the human brain, be it in making musical orchestras, art masterpieces, or pioneering scientific machinery such as the Large Hadron Collider, shows how the same architecture is capable of learning and applying highly complex and specialized knowledge to great feats. It turns out that nature is a pretty smart cookie, and hence we can learn a lot of valuable lessons by just observing how it has gone about implementing something so novel as a learning agent.
Quarks build up atoms, atoms build up molecules, and molecules grouped together may, once in a while, build up chemically excitable biomechanical units. We call these units cells; the fundamental building blocks of all biological life forms. Now, cells themselves come in exuberant variety, but one specific type of them is of interest to us here. It is a specific class of cells, known as nerve cells, or neurons. Why? Well, it turns out that if you take about 10^11 neurons and set them up in a specific, complementary configuration, you get an organ that is capable of discovering fire, agriculture, and space travel. To realize how these bundles of neurons learn, however, we must first comprehend how one single neuron works. As you will see, it is the repetitive architecture in our brain, composed of these very same neurons, that gives rise to the grander phenomenon that we (pompously) call intelligence.
A neuron is simply an electrically excitable cell that receives, processes, and transmits information through electrical and chemical signals. Dendrites extend from the neuron cell body and receive messages from other neurons. When we say that neurons receive or send messages, what we actually mean is that they transmit electrical impulses along their axons. Lastly, neurons are excitable. In other words, the right impulse supplied to a neuron will produce electrical events, known as action potentials. When a neuron reaches its action potential (or spikes), it releases a neurotransmitter, which is a chemical that travels a tiny distance across a synapse before reaching other neurons. Any time a neuron spikes, neurotransmitters are released from hundreds of its synapses, reaching the dendrites of other neurons that themselves may or may not spike, depending on the nature of the impulse. This is the very mechanism that allows these vast networks of neurons to communicate, compute, and work together to solve complex tasks that we humans face daily.
So, all a neuron really does is take in some electric input, undergo some sort of processing, and then fire if the outcome is positive, or remain inactive if the outcome of that processing is negative. What do we mean here by whether an outcome is positive? To understand this, it is useful to digress briefly on how information and knowledge are represented in our own brains.
Consider a task where you have to correctly classify images of dogs, cats, and attack helicopters. One way of thinking about a neuronal learning system is that we dedicate several neurons to represent the various features that exist in the three respective classes. In other words, let's say that we have employed three expert neurons for our classification task here. Each one of these neurons is an expert in what one of the three classes looks like: a dog, a cat, or an attack helicopter, respectively.
How are they experts? Well, for now, we can think that each of our domain expert neurons is supported by its own cabinet of employees and support staff, all diligently working for these experts, collecting and representing different breeds of dogs, cats, and attack helicopters, respectively. But we don't deal with their support staff for the time being. At the moment, we simply present any image to each of our three domain experts. If the picture is of a dog, our dog expert neuron immediately recognizes the creature and fires, almost as if it were saying, Hello, I believe this is a dog. Trust me, I'm an expert. Similarly, when we present our three experts a picture of a cat, our cat neuron will signal to us that it has detected a cat in our image by firing. While this is not exactly how each neuron represents real-world objects, such as cats and dogs, it helps us gain a functional understanding of neuron-based learning systems. Hopefully, you have enough information now to be introduced to the biological neuron's less sophisticated brother, the artificial neuron, next.
In reality, many neuroscientists argue that this idea of a unified representative neuron, such as our cat expert neuron, doesn't really exist in our brain. They note how such a mechanism will require our brain to have thousands of neurons dedicated only to specific faces we have known, such as our grandmother, the baker around the corner, or Donald Trump. Instead, they postulate a more distributed representation architecture. This distributed theory states that a specific stimulus, such as the picture of a cat, is represented by (and will trigger) a unique pattern of firing neurons, widely distributed in the brain. In other words, a cat will be represented by perhaps (a wild guess) 100 different neurons, each dedicated to identifying specific cat-like features from the image (such as its ears, tail, eyes, and general body shape). The intuition here is that some of these cat neurons may be recombined with other neurons to represent other images that have elements of cat within. The picture of a jaguar, or the cartoon cat Garfield, for example, could be reconstructed using a subset of the very same cat neurons, in conjunction with some other neurons that have learned attributes that are more specific to the size of jaguars, or Garfield's famous orange and black stripes, perhaps.
In some curious medical cases, patients with physical trauma to the head have not only failed to associate with their loved ones when confronted with them, but even claimed that these very loved ones were impostors just disguised as their loved ones! While a bizarre occurrence, such situations may shed more light onto the exact mechanisms of neural learning. Clearly, the patient recognizes this person, as some neurons encoding the visual patterns corresponding to the features of their loved ones (such as face and clothes) are fired. However, since they report this disassociation with these same loved ones despite being able to recognize them, it must mean that not all of the neurons that would normally fire upon coming across this loved one did so: some of them (including the neurons encoding the emotional representations our patient may have for this person) did not fire at the moment when our patient met their significant acquaintance.
These sorts of distributed representations may well allow our brain the versatility in extrapolating patterns from very little data, as we observe ourselves capable of doing. Modern neural networks, for example, still require you to provide them with hundreds (if not thousands) of images before they can reliably predict whether they are looking at a bus or a toaster. My three-year-old niece, on the other hand, is able to match this accuracy with about three to five pictures of buses and toasters each. Even more fascinating is the fact that neural networks running on large computing clusters can, at times, draw megawatts of power to perform computations. My niece only needs about 12 watts. She will get what she needs from a few biscuits, or perhaps a small piece of a cake that she carefully sneaks away from the kitchen.
Before a deeper dive into various network architectures and some hands-on examples, it would be a pity if we did not elaborate a little on the pivotal notion of gaining information through processing real-world signals. We speak of the science of quantifying the amount of information present in a signal, also referred to as information theory. While we don't wish to provide a deep mathematical overview on this notion, it is useful to know some background on learning from a probabilistic perspective.
Intuitively, learning that an unlikely event has occurred is more informative than learning that an expected event has occurred. If I were to tell you that you can buy food at all supermarkets today, I won't be met with gasps of surprise. Why? Well, I haven't really told you something beyond your expectations. Conversely, if I told you that you cannot buy food at all supermarkets today, perhaps due to some general strike, well, then you would be surprised. You would be surprised because an unlikely piece of information has been presented (in our case, this is the word not, appearing in the configuration previously presented). Such intuitive knowledge is what we attempt to codify, in the field of information theory. Other similar notions include the following:
- An event with a lower likelihood of occurrence should have higher information content
- An event with a higher likelihood of occurrence should have lower information content
- An event with a guaranteed occurrence should have no information content
- Independent events should have additive information content
Mathematically, we can actually satisfy all of these conditions by using the simple equation modeling the self-information of an event (x), as follows:
$$I(x) = -\log P(x)$$
With the natural logarithm, I(x) is denoted in the nat unit; one nat is the amount of information gained by observing an event of probability 1/e. Although the preceding equation is nice and neat, it only allows us to deal with a single outcome; this is not too helpful in modeling the dependent complexities of the real world. What if we wanted to quantify the amount of uncertainty in an entire probability distribution of events? Then, we employ another measure, known as Shannon entropy, as shown in the following equation:
$$H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]$$
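As a quick sanity check on these ideas, the sketch below takes self-information to be the negative natural logarithm of an event's probability, and Shannon entropy to be the probability-weighted average of that quantity over a distribution. The function names are our own, and the probabilities are made up for illustration.

```python
import math


def self_information(p):
    """Information (in nats) gained by observing an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return math.log(1.0 / p)  # equivalent to -log(p)


def shannon_entropy(distribution):
    """Expected self-information (in nats) over a discrete distribution."""
    # Outcomes with zero probability contribute nothing to the average.
    return sum(p * self_information(p) for p in distribution if p > 0.0)


# Rarer events carry more information than common ones.
print(self_information(0.01), self_information(0.99))

# A guaranteed event carries no information.
print(self_information(1.0))

# For independent events, information adds up.
p_a, p_b = 0.5, 0.2
print(self_information(p_a * p_b), self_information(p_a) + self_information(p_b))

# A fair coin is more uncertain than a heavily biased one.
print(shannon_entropy([0.5, 0.5]), shannon_entropy([0.99, 0.01]))
```

Each printed line corresponds to one of the intuitions listed earlier: unlikely events are more informative, certain events are uninformative, and independent events combine additively.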
Let's say you're a soldier stuck behind enemy lines. Your goal is to let your allies know what kind of enemies are coming their way. Sometimes, the enemy may send tanks, but more often, they send patrols of people. Now, the only way you can signal your friends is by using a radio with simple binary signals. You need to figure out the best way to communicate with your allies, so as to not waste your precious time and get discovered by the enemy. How do you do this? Well, first you map out many sequences of binary bits, each specific sequence corresponding to a specific type of enemy (such as patrols or tanks). With a little knowledge of the environment, you already know that patrols are much more frequent than tanks. It stands to reason then, that you probably will be using the binary signal for patrol much more often than the one for tank. Hence, you will allocate fewer binary bits to communicate the presence of an incoming patrol, as you know you will be sending that signal more often than others. What you're doing is exploiting your knowledge about the distribution over types of enemies to reduce the number of bits that you need to send on average. In fact, if you have access to the overall underlying distribution of incoming patrols and tanks, then you could theoretically use the smallest number of bits to communicate most efficiently with the friendlies on the other side. We do this by using the optimal number of bits at each transmission. The average number of bits needed to represent a signal in this way is known as the entropy of this data, and can be formulated with the following equation:
$$H(y) = \sum_i y_i \log_2 \frac{1}{y_i}$$
Here, H(y) denotes the optimal average number of bits needed to represent an event drawn from the probability distribution y, and y_i simply refers to the probability of event i. So, supposing that seeing an enemy patrol is 256 times more likely to happen than seeing an enemy tank, we would model the number of bits to use to encode the presence of an enemy patrol, as follows:
$$\begin{aligned} \text{Patrol bits} &= \log_2 \frac{1}{256\, p_{\text{Tank}}} \\ &= \log_2 \frac{1}{p_{\text{Tank}}} + \log_2 \frac{1}{2^8} \\ &= \text{Tank bits} - 8 \end{aligned}$$
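The arithmetic above can be checked numerically. The sketch below assumes, purely for illustration, that patrols and tanks are the only two kinds of enemy, so that their probabilities must sum to one; the variable and function names are our own.

```python
import math


def optimal_bits(p):
    """Ideal code length, in bits, for a signal that occurs with probability p."""
    return math.log2(1.0 / p)


# Assumption for this sketch: only two enemy types exist, and a patrol is
# 256 times more likely than a tank, so p_tank + 256 * p_tank = 1.
p_tank = 1.0 / 257.0
p_patrol = 256.0 * p_tank

tank_bits = optimal_bits(p_tank)
patrol_bits = optimal_bits(p_patrol)

# The frequent signal gets the shorter code; the gap is exactly log2(256) bits.
print(tank_bits, patrol_bits, tank_bits - patrol_bits)

# Entropy: the average number of bits per transmission under this optimal code.
entropy_bits = p_tank * tank_bits + p_patrol * patrol_bits
print(entropy_bits)
```

Because nearly every transmission is the very short patrol signal, the average cost per message ends up well below a single bit, which is exactly the saving the soldier is after.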
Cross entropy is yet another mathematical notion, allowing us to compare two distinct probability distributions, denoted by p and q. In fact, as you will see later, we often employ cross-entropy-based loss functions in neural networks when dealing with categorical outputs. Essentially, the cross entropy between two probability distributions (https://en.wikipedia.org/wiki/Probability_distribution), (p, q), over the same underlying set of events, measures the average number of bits needed to identify an event picked at random from the set, under a condition; the condition being that the coding scheme used is optimized for a predicted probability distribution, q, rather than the true distribution, p. We will revisit this notion in later chapters to clarify and implement our understandings:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
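As a small illustration of why cross entropy is useful as a loss, the sketch below computes it as the negative sum, over all events, of the true probability times the logarithm of the predicted probability. The three-class distributions are invented for this example, with the true distribution written as a one-hot label.

```python
import math


def cross_entropy(p, q):
    """Cross entropy (in nats) between true distribution p and predicted q.

    Terms where p is zero are skipped. A predicted probability of exactly
    zero for an event that actually occurs would raise a ValueError here,
    which mirrors the fact that such a prediction is infinitely costly.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical classes: (dog, cat, attack helicopter). The image is a dog.
true_label = [1.0, 0.0, 0.0]
confident_and_right = [0.8, 0.1, 0.1]
hesitant_and_wrong = [0.2, 0.5, 0.3]

print(cross_entropy(true_label, confident_and_right))
print(cross_entropy(true_label, hesitant_and_wrong))

# Coding with the wrong distribution never beats coding with the true one.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
print(cross_entropy(p, p), cross_entropy(p, q))
```

The better the predicted distribution matches the true one, the lower the value, which is the behavior we want from a quantity that a network will be trained to minimize.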
Earlier, we discussed how neurons may electrically propagate information and communicate with other neurons using chemical reactions. These same neurons help us determine what a cat or dog looks like. But these neurons never actually see the full image of a cat. All they deal with is chemical and electric impulses. These networks of neurons can carry out their task only because of other sensory preprocessing organs, such as our eyes and optic nerve, that have prepared the data in an appropriate format for our neurons to be able to interpret. Our eyes take in the electromagnetic radiation (or light) that represents the image of a cat, and convert it into efficient representations thereof, communicated through electrical impulses. Hence, a prime difference between artificial and biological neurons relates to the medium of their intercommunication. As we saw, biological neurons use chemicals and electrical impulses as a means of communication. Artificial neurons, in contrast, rely on the universal language of mathematics to represent patterns from data. In fact, there exists a whole discipline surrounding the concept of representing real-world phenomena mathematically for the purpose of knowledge extraction. This discipline, as many of you are familiar with, is known as data science.
Pick up any book on data science; there is a fair chance that you will come across an elaborate explanation, involving the intersection of fields such as statistics and computer science, as well as some domain knowledge. As you flip through the pages rapidly, you will notice some nice visualizations, graphs, bar charts, the works. You will be introduced to statistical models, significance tests, data structures, and algorithms, each providing impressive results for some demonstrative use case. This is not data science. These are indeed the very tools you will be using as a successful data scientist. However, the essence of what data science is can be summarized in a much simpler manner: data science is the scientific domain that deals with generating actionable knowledge from raw data. This is done by iteratively observing a real-world problem, quantifying the overall phenomena in different dimensions, or features, and predicting future outcomes that permit desired ends to be achieved. ML is just the discipline of teaching machines data science.
While some computer scientists may appreciate this recursive definition, some of you may ponder what is meant by quantifying a phenomenon. Well, you see, most observations in the real world, be it the amount of food you eat, the kind of programs you watch, or the colors you like on your clothes, can be all defined as (approximate) functions of some other quasi-dependent features. For example, the amount of food you will eat in any given day can be defined as a function of other things, such as how much you ate in your previous meal, your general inclination to certain types of food, or even the amount of physical exertion you get.
Similarly, the kind of programs you like to watch may be approximated by features such as your personality traits, interests, and the amount of free time in your schedule. Reductively speaking, we work with quantifying and representing differences between observations (for example, the viewing habits between people), to deduce a functional predictive rule that machines may work with.
We induce these rules by defining the possible outcomes that we are trying to predict (for example, whether a given person likes comedies or thrillers) as a function of input features (for example, how this person ranks on the Big Five personality test) that we collect when observing a phenomenon at large (such as personalities and the viewing habits of a population).
If you have selected the right set of features, you will be able to derive a function that is able to reliably predict the output classes that you are interested in (in our case, this is viewer preferences). What do I mean by the right features? Well, it stands to reason that viewing habits have more to do with a person's personality traits than their travel habits. Predicting whether someone is inclined towards horror movies as a function of, say, their eye color and real-time GPS coordinates, will be quite useless, as they are not informative to what we are trying to predict. Hence, we always choose relevant features (through domain knowledge or significance tests) to reductively represent a real-world phenomenon. Then, we simply use this representation to predict the future outcomes that we are interested in. This representation itself is what we call a predictive model.
As you saw, we can represent real-world observations by redefining them as a function of different features. The speed of an object, for example, is a function of the distance it traveled over a given time. Similarly, the color of a pixel on your TV screen is actually a function of the red, green, and blue intensity values that make up that pixel. These elements are what data scientists call features or dimensions of your data. When our observations are labeled with the outcome that we want to predict, we deal with a supervised learning task, as we can check what our model has learned against what is truly the case. When our observations are unlabeled, we instead calculate the distances between our observation points to find similar groups in our data. This is known as unsupervised ML. Hence, in this manner, we can start building a model of a real-world phenomenon, by simply representing it using informative features.
The natural question that follows is: how exactly do we build a model? Long story short, the features that we choose to collect while observing an outcome can all be plotted on a high-dimensional space. While this may sound complicated, it is just an extension of the Cartesian coordinate system that you may be familiar with from high school mathematics. Let's recall how to represent a single point on a graph, using the Cartesian coordinate system. For this task, we require two values, x and y. This is an example of a two-dimensional feature space, with the x and y axes each being a dimension in the representational space. Add a z axis, and we get a three-dimensional feature space. Essentially, we define ML problems in an n-dimensional feature space, where n refers to the number of features that we have on the phenomenon we are trying to predict. In our previous case of predicting viewer preference, if we solely use the Big Five personality test scores as input features, we will essentially have a five-dimensional feature space, where each dimension corresponds to a person's score on one of the five personality dimensions. In fact, modern ML problems can range from 100 to 100,000 dimensions (and sometimes even more). Since the number of possible configurations of features increases exponentially with the number of features, it becomes quite hard, even for computers, to conceive and compute in such proportions. This problem in ML is generally referred to as the curse of dimensionality.
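A rough way to feel the curse of dimensionality is to count how many distinct cells a feature space contains once each feature is cut into a fixed number of bins. The sketch below assumes ten bins per feature, which is an arbitrary choice made only for illustration, as is the function name.

```python
def grid_cells(bins_per_feature, n_features):
    """Number of distinct cells when every feature is split into equal bins."""
    return bins_per_feature ** n_features


# Assumption for this sketch: each feature is discretized into 10 bins.
for n_features in (1, 2, 3, 5, 10, 100):
    cells = grid_cells(10, n_features)
    print(f"{n_features:>3} features -> {cells:.3e} possible configurations")
```

With five personality scores there are already 100,000 cells to cover, and with 100 features the count dwarfs any dataset we could ever hope to collect, so most of the space is inevitably left unobserved.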
Once we have a high-dimensional representation of relevant data, we can commence the task of deriving a predictive function. We do this by using algorithms, which are essentially a set of preprogrammed recursive instructions that categorize and divide our high-dimensional data representation in a certain manner. These algorithms (these are most commonly clustering, classification, and regression) recursively separate our data points (that is, personality rankings per person) on the feature space into smaller groups where the data points are comparatively more alike. In this manner, we use algorithms to iteratively segment our high-dimensional feature space into smaller regions, which will eventually correspond to our output classes (ideally). Hence, we can reliably predict the output class of any future data points simply by placing them on our high-dimensional feature space and comparing them to the regions corresponding to our model's predicted output classes. Congratulations, we have a predictive model!
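To make this concrete, here is a minimal sketch of one very simple way to carve a feature space into regions: a nearest-centroid classifier. It is only one of many possible techniques, chosen here for brevity, and the Big Five scores and viewer labels below are entirely made up.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equally sized feature tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def fit_centroids(samples):
    """Map each output class to the center of its training points."""
    return {label: centroid(points) for label, points in samples.items()}


def predict(centroids, point):
    """Assign a point to the class whose center is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical Big Five scores (five features per person), grouped by preference.
training_data = {
    "comedy": [(0.8, 0.6, 0.9, 0.7, 0.2), (0.7, 0.5, 0.8, 0.8, 0.3)],
    "thriller": [(0.4, 0.7, 0.3, 0.4, 0.6), (0.5, 0.8, 0.2, 0.3, 0.7)],
}

model = fit_centroids(training_data)

# A new viewer is placed on the feature space and compared to each region.
new_viewer = (0.75, 0.5, 0.9, 0.7, 0.2)
print(predict(model, new_viewer))
```

Each class effectively owns the region of the five-dimensional space that lies closer to its center than to any other, and predicting a new point amounts to asking which region it falls into.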
Every time we choose to define an observation as a function of some features, we open up a Pandora's box of semi-causally linked features, where each feature itself could be redefined (or quantified) as a function of other features. In doing so, we might want to take a step back, and consider what exactly we are trying to represent. Is our model capturing relevant patterns? Can we rely on our data? Will our resources, be it algorithms or computational firepower, suffice for learning from the data we have?
Recall our earlier scenario of predicting the quantity of food an individual is likely to consume in each meal. The features that we discussed, such as their physical exertion, could be redefined as a function of their metabolic and hormonal activity. Similarly, dietary preferences could be redefined as a function of their gut bacteria and stool composition. Each of these redefinitions adds new features to our model, bringing with them additional complexity.
Perhaps we can even achieve a greater accuracy in predicting exactly how much takeout you should order. Would this be worth the effort of getting a stomach biopsy every day? Or installing a state-of-the-art electron microscope in your toilet? Most of you will agree: no, it would not be. How did we come to this consensus? Simply by assessing our use case of dietary prediction and selecting features that are relevant enough to predict what we want to predict, in a fashion that is reliable and proportional to our situation. A complex model supplemented by high-quality hardware (such as toilet sensors) is unnecessary and unrealistic for the use case of dietary prediction. You could as easily achieve a functional predictive model based on easily obtainable features, such as purchase history and prior preferences.
The essence of this story is that you may define any observable phenomenon as a function of other phenomena in a recursive manner, but a clever data scientist will know when to stop by picking appropriate features that reasonably fit the use case, are readily observable and verifiable, and robustly deal with all relevant situations. All we need is to approximate a function that reliably predicts the output classes for our data points. Inducing a representation of our phenomenon that is either too complex or too simplistic will naturally lead to the demise of our ML project.
Before we march forth in our journey to understand, build, and master neural networks, we must at least refresh our perception of some fundamental ML concepts. For example, it is important to understand that you are never modeling a phenomenon completely. You are only functionally representing a part of it. This helps you think about data intuitively, as but a small piece of the larger puzzle that is the general phenomenon you are trying to understand. This also helps you realize that times change. Both the importance of features and the surrounding environment are subject to such change, which erodes the predictive power of your model. Such intuition is naturally built with practice and domain knowledge.
In the following section, we will briefly refresh our memory with some classic pitfalls of ML use cases, with a few simple scenario-driven examples. This is important to do as we will notice these same problems reappear when we undertake our main journey of understanding and applying neural networks to various use cases.
Consider the problem of predicting the weather. We will begin constructing our predictive model by doing some feature selection. With some domain knowledge, we initially identify the feature air pressure as a relevant predictor. We will record different Pa values (Pascals, a measure of air pressure) over different days on the island of Hawaii, where we live. Some of these days turn out to be sunny, and others rainy.
After several sunny days, your predictive model tells you that there is a very high chance of the following day also being sunny, yet it rains. Why? This is simply because your model has not seen enough instances of both prediction classes (sunny and rainy days) to accurately assess the chance of rain. In this case, it is said to have unbalanced class priors, which misrepresent the overall weather pattern. According to your model, there are only sunny days, because it has only seen sunny days so far.
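A quick way to see this problem is to estimate the class priors directly from the labels collected so far. The following is a minimal sketch under assumed data: the week of observations and the helper name class_priors are hypothetical.

```python
from collections import Counter

def class_priors(labels, classes):
    """Estimate each class prior as its share of the observed labels."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

# Hypothetical first week of observations: every day happened to be sunny.
observed = ["sunny"] * 7

priors = class_priors(observed, classes=["sunny", "rainy"])
majority_class = max(priors, key=priors.get)

print(priors)          # rainy gets a prior of zero: it has never been seen
print(majority_class)  # a model relying on these priors always says sunny
```

A prior of zero for rainy days does not mean that it never rains; it only means that the sample collected so far does not represent the overall weather pattern.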
You have collected about two months' worth of air pressure data, and balanced the number of observations in each of your output classes. Your prediction accuracy has steadily increased, but starts tapering off at a suboptimal level (let's say 61%). Suddenly, your model's accuracy starts dropping again, as it gets colder and colder outside. Here, we face the problem of underfitting, as our simplistic model is unable to capture the underlying pattern in our data, which is now driven by the seasonal coming of winter. There are a few simple remedies to this situation. Most prominently, we may simply improve our model by adding more predictive features, such as the outside temperature. Doing so, we observe after a few days of data collection that once again, our accuracy climbs up, as the additional feature adds more information to our model, increasing its predictive power. In other cases of underfitting, we may instead choose a more computationally-intensive predictive model, add more data and engineer better features, or relax any mathematical constraints on the model (for example, by lowering the lambda hyperparameter for regularization).
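The effect of adding a feature can be sketched with a toy experiment. The readings below are invented for illustration (pressure in hPa, temperature in degrees Celsius), the classifier is a simple nearest-centroid rule chosen for brevity, and accuracy is measured on the training data itself, so treat the numbers as a demonstration of the idea rather than a proper evaluation.

```python
from math import dist

def centroids(rows, labels):
    """Compute the mean feature vector of each class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: tuple(sum(col) / len(col) for col in zip(*members))
            for label, members in groups.items()}

def accuracy(rows, labels):
    """Share of rows whose nearest class centroid matches their label."""
    cents = centroids(rows, labels)
    hits = sum(min(cents, key=lambda c: dist(cents[c], row)) == label
               for row, label in zip(rows, labels))
    return hits / len(rows)

# Hypothetical readings: (pressure_hPa, temperature_C, label).
data = [
    (1012, 28, "sunny"), (1014, 27, "sunny"),
    (1008, 29, "sunny"), (1009, 30, "sunny"),
    (1007, 18, "rainy"), (1009, 17, "rainy"),
    (1013, 16, "rainy"), (1012, 19, "rainy"),
]
labels = [label for _, _, label in data]
pressure_only = [(p,) for p, _, _ in data]
pressure_and_temperature = [(p, t) for p, t, _ in data]

print(accuracy(pressure_only, labels))             # pressure alone underfits
print(accuracy(pressure_and_temperature, labels))  # temperature adds information
```

In this made-up sample, pressure alone barely separates the two classes, while the extra temperature feature gives the same simple model enough information to tell them apart.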
After collecting a few years of data, you confidently boast to your farmer friend that you have developed a robust predictive model with 96% accuracy. Your friend says, "Well, great news, can I have it?" You, being an altruist and philanthropist, immediately agree and send him the code. A day later, the same friend calls back from his home in Guangdong province in China, angry that your model did not work and has ruined his crop harvest. What happened here? This was simply a case of overfitting our model to the tropical climate of Hawaii, which does not generalize well outside of this sample. Our model did not see enough of the variation that actually exists in the possible values of pressure and temperature, with the corresponding labels of sunny and rainy, to be able to predict the weather on another continent. In fact, since our model only saw Hawaiian temperatures and air pressures, it memorized trivial patterns in the data (for example, there are never two rainy days in a row) and uses these patterns as rules for making a prediction, instead of picking up on more informative trends. One simple remedy here is, of course, to gather more weather data in China, and fine-tune your prediction model to the local weather dynamics. In other similar situations involving overfitting, you may attempt to select a simpler model, or denoise the data by removing outliers and errors and centering it with respect to mean values.
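The gap between performance on familiar data and on data from elsewhere can be sketched as follows. Everything here is an assumption made for illustration: the pressure readings (in hPa) for both locations are invented, and the model is a deliberately simple threshold rule rather than anything we have built so far.

```python
def fit_threshold(pressures, labels):
    """Place the decision threshold halfway between the two class means."""
    sunny = [p for p, y in zip(pressures, labels) if y == "sunny"]
    rainy = [p for p, y in zip(pressures, labels) if y == "rainy"]
    return (sum(sunny) / len(sunny) + sum(rainy) / len(rainy)) / 2

def accuracy(threshold, pressures, labels):
    """Predict sunny above the threshold, rainy otherwise, and score it."""
    predictions = ["sunny" if p > threshold else "rainy" for p in pressures]
    hits = sum(pred == y for pred, y in zip(predictions, labels))
    return hits / len(labels)

# Hypothetical pressure readings collected in Hawaii (training sample).
hawaii_p = [1016, 1015, 1017, 1011, 1010, 1012]
hawaii_y = ["sunny", "sunny", "sunny", "rainy", "rainy", "rainy"]

# Hypothetical readings from Guangdong, where the relationship differs.
guangdong_p = [1020, 1022, 1019, 1008]
guangdong_y = ["rainy", "rainy", "sunny", "sunny"]

threshold = fit_threshold(hawaii_p, hawaii_y)
print(accuracy(threshold, hawaii_p, hawaii_y))        # looks perfect at home
print(accuracy(threshold, guangdong_p, guangdong_y))  # falls apart elsewhere
```

The rule fits the sample it was derived from perfectly, which is exactly why evaluating a model only on the data it has already seen tells us little about how it will behave on another continent.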
After explaining to your dear friend from China (henceforth referred to as Chan) the miscalculation that just occurred, you instruct him to set up sensors and start collecting local air pressure and temperature readings to construct a labeled dataset of sunny and rainy days, just like you did in Hawaii. Chan diligently places sensors on his roof and in his fields. Unfortunately, Chan's roof is made of a reinforced metal alloy with a high thermal conductivity, which erratically inflates the readings from both the pressure and temperature sensors on the roof. This corrupted data, when fed to our predictive model, will naturally produce suboptimal results, as the learned line is perturbed by noisy and misrepresentative data. A clear remedy would be to replace the sensors, or simply discard the faulty sensor readings.
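If we cannot tell in advance which readings come from the faulty sensors, one possible way to discard them is a simple outlier filter. The sketch below uses the median absolute deviation, which is a technique we are swapping in for illustration rather than something prescribed by the scenario; the temperature readings, the helper name drop_outliers, and the cutoff of three deviations are all assumptions.

```python
from statistics import median

def drop_outliers(readings, max_deviations=3.0):
    """Keep readings within a few median absolute deviations of the median."""
    center = median(readings)
    mad = median(abs(r - center) for r in readings)
    if mad == 0:
        # All readings are (nearly) identical; nothing to filter.
        return list(readings)
    return [r for r in readings if abs(r - center) <= max_deviations * mad]

# Hypothetical temperature readings in degrees Celsius; the two inflated
# values stand in for the sensors sitting on the hot metal roof.
temperatures = [21.0, 22.0, 21.5, 48.0, 22.5, 55.0, 21.0]

cleaned = drop_outliers(temperatures)
print(cleaned)
```

The median is used instead of the mean because a handful of wildly inflated readings would drag the mean toward them, making the faulty values look less unusual than they are.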
Eventually, using enough data from Hawaii, China, and some other places in the world, we notice a clear and globally generalizable pattern, which we can use to predict the weather. So, everybody is happy, until one day, your prediction model tells you it is going to be a bright sunny day, and a tornado comes knocking on your door. What happened? Where did we go wrong? Well, it turns out that when it comes to tornadoes, our two-featured binary classification model does not incorporate enough information about our problem (this being the dynamics of tornadoes) to allow us to approximate a function that reliably predicts this specifically devastating outcome. So far, our model did not even try to predict tornadoes, and we only collected data for sunny and rainy days.
A climatologist here might say, "Well, then start collecting data on altitude, humidity, wind speed, and direction, and add some labeled instances of tornadoes to your data," and, indeed, this would help us fend off future tornadoes. That is, until an earthquake hits the continental shelf and causes a tsunami. This illustrative example shows that, whatever model you choose to use, you need to keep tracking relevant features and have enough data for each prediction class (such as whether it is sunny, rainy, tornado-ey, and so on) to achieve good predictive accuracy. Having a good prediction model simply means that you have discovered a mechanism that is capable of using the data you have collected so far to induce a set of predictive rules that are seemingly being obeyed.
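A simple sanity check along these lines is to count the labeled examples per prediction class before trusting a model. This is a minimal sketch; the label counts, the helper name underrepresented, and the minimum of 30 examples per class are arbitrary assumptions for illustration.

```python
from collections import Counter

def underrepresented(labels, classes, min_count):
    """List the classes that have fewer than min_count labeled examples."""
    counts = Counter(labels)
    return [c for c in classes if counts[c] < min_count]

# Hypothetical dataset: plenty of sunny and rainy days, almost no tornadoes,
# and no tsunamis at all.
labels = ["sunny"] * 120 + ["rainy"] * 95 + ["tornado"] * 2
classes = ["sunny", "rainy", "tornado", "tsunami"]

too_rare = underrepresented(labels, classes, min_count=30)
print(too_rare)
```

Classes that the model is expected to predict but that barely appear in the data (or never appear at all) are flagged, which is a cue to collect more examples before relying on predictions for them.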
In this chapter, we gained a functional overview of biological neural networks, with a small and brief preview covering concepts such as neural learning and distributed representations. We also refreshed our memory on some classic data science dilemmas that are equally relevant for neural networks as they are for other ML techniques. In the following chapter, we will finally dive into the much-anticipated learning mechanism loosely inspired by our biological neural networks, as we explore the basic architecture of an ANN. We amicably describe ANNs in such a manner because, despite aiming to work as effectively as their biological counterparts, they are not quite there yet. In the next chapter, you will go over the main implementation considerations involved in designing ANNs and progressively discover the complexity that such an endeavour entails.
- Symbolic versus connectionist learning: http://www.cogsci.rpi.edu/~rsun/sun.encyc01.pdf
- History of artificial intelligence: http://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/
- History of the human brain: http://www.mybrain.co.uk/public/learn_history4.php