Probabilistic graphical models, or simply graphical models as we will refer to them in this chapter, are models that use a graph to describe the conditional independence relationships among a set of random variables. This topic has received an increasing amount of attention in recent years, and probabilistic graphical models have been successfully applied to tasks ranging from medical diagnosis to image segmentation. In this chapter, we'll present some of the necessary background that will pave the way to understanding the most basic graphical model, the Naïve Bayes classifier. We will then look at a slightly more complicated graphical model, known as the Hidden Markov Model, or HMM for short. To get started in this field, we must first learn about graphs and why they are useful.
You're reading from Mastering Predictive Analytics with R
Graph theory is a branch of mathematics that deals with mathematical objects known as graphs. Here, a graph does not have its everyday meaning of a diagram or plot with an x and y axis. In graph theory, a graph consists of two sets. The first is a set of vertices, which are also referred to as nodes; we typically use integers to label and enumerate them. The second is a set of edges between these vertices.
Thus, a graph is nothing more than a description of some points and the connections between them. The connections can have a direction so that an edge goes from the source or tail vertex to the target or head vertex. In this case, we have a directed graph. Alternatively, the edges can have no direction, so that the graph is undirected.
A common way to describe a graph is via the adjacency matrix. If we have V vertices in the graph, an adjacency matrix is a V×V matrix whose entries are 0 if the vertices represented by their row and column indices are not connected by an edge, and 1 if they are. In a directed graph, the entry in row i and column j is 1 only when there is an edge from vertex i to vertex j, so the matrix need not be symmetric.
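As a small illustration (this example is our own, not from the chapter's code), here is how a four-vertex directed graph could be described by its adjacency matrix in R:

```r
# Edge list of a small directed graph: each row is (tail, head).
edges <- matrix(c(1, 2,
                  1, 3,
                  2, 4,
                  3, 4), ncol = 2, byrow = TRUE)

V <- 4
adj <- matrix(0, nrow = V, ncol = V)
adj[edges] <- 1     # entry [i, j] becomes 1 when there is an edge i -> j
adj
#      [,1] [,2] [,3] [,4]
# [1,]    0    1    1    0
# [2,]    0    0    0    1
# [3,]    0    0    0    1
# [4,]    0    0    0    0
```

For an undirected graph, each edge would set both [i, j] and [j, i], making the matrix symmetric.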
Suppose we are interested in two events, A and B. In this case, event A might represent the event that a patient has appendicitis and event B might represent a patient having a high white blood cell count. The conditional probability of event A given event B is essentially the probability that event A will occur when we know that event B has already happened.
Formally, we define the conditional probability of event A given event B as the joint probability of both events occurring divided by the probability of event B occurring:

P(A | B) = P(A, B) / P(B)
Note that this is consistent with the way in which we define statistical independence. Statistical independence occurs when the joint probability of two events is just the product of the individual probabilities of the two events, that is, P(A, B) = P(A) × P(B). If we substitute this in our previous equation, we have:

P(A | B) = P(A) × P(B) / P(B) = P(A)
This makes sense intuitively: if two events are independent of each other, then knowing that event B has occurred does not change the probability that event A will occur.
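To make this concrete, here is a small numeric sketch in R for the appendicitis example; the probabilities are made up purely for illustration:

```r
# Made-up joint probabilities for the two events:
# A = the patient has appendicitis, B = a high white blood cell count.
p_A_and_B <- 0.008   # P(A and B)
p_B       <- 0.05    # P(B)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_A_given_B <- p_A_and_B / p_B
p_A_given_B          # 0.16

# If A and B were independent, P(A and B) = P(A) * P(B),
# and the conditional probability would collapse to just P(A):
p_A <- 0.01
(p_A * p_B) / p_B    # 0.01, that is, P(A)
```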
We know from statistics that statistical independence means that the joint probability of two random variables A and B is just the product of their (marginal) probabilities. Sometimes, two variables may not be statistically independent of each other to begin with, but observing a third variable, C, might result in them becoming independent of each other. In short, we say that events A and B are conditionally independent given C, and we can express this as:

P(A, B | C) = P(A | C) × P(B | C)
For example, suppose that J represents the event of being given a job offer at a particular company and G represents the event of being accepted into graduate school at a particular university. Both of these might depend on a variable U, a person's performance on their undergraduate degree. This can be summarized in a graph as:
When we don't know U, a person's performance on their undergraduate degree, knowing that they were accepted into graduate school might increase our belief in their undergraduate performance, and through it, the probability that they will also receive a job offer. Once we observe U directly, however, learning G tells us nothing further about J: the two become conditionally independent.
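A short R sketch with made-up numbers shows both effects at once: J and G are dependent marginally, yet independent once we condition on their common cause U:

```r
# Made-up numbers: U (a strong undergraduate record) is a common cause of
# J (receiving a job offer) and G (being accepted into graduate school).
p_U <- 0.3                  # P(U = 1)
p_J_given_U <- c(0.8, 0.2)  # P(J = 1 | U = 1), P(J = 1 | U = 0)
p_G_given_U <- c(0.9, 0.3)  # P(G = 1 | U = 1), P(G = 1 | U = 0)

# Marginal probabilities of J and G, summing U out:
p_J <- p_J_given_U[1] * p_U + p_J_given_U[2] * (1 - p_U)   # 0.38
p_G <- p_G_given_U[1] * p_U + p_G_given_U[2] * (1 - p_U)   # 0.48

# Joint probability of J and G, again summing U out:
p_JG <- p_J_given_U[1] * p_G_given_U[1] * p_U +
        p_J_given_U[2] * p_G_given_U[2] * (1 - p_U)        # 0.258

# Marginally, J and G are dependent: P(J, G) != P(J) * P(G)
p_JG - p_J * p_G    # 0.0756, not zero

# But given U they are independent by construction:
# P(J, G | U) = P(J | U) * P(G | U)
```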
Bayesian networks are a type of graphical model that involves a directed acyclic graph structure. We often refer to the tail node of a directed edge in a graphical model as the parent and the head node as the child or descendant. In fact, we generalize this latter notion so that if there is a path from node A to node B in the model, node B is a descendant of node A. We can distinguish the special case where node A is connected to node B by a single edge by saying that the latter is a direct descendant.
The parent relationship and the descendant relationship are mutually exclusive in a Bayesian network because the graph has no cycles. Bayesian networks have the distinguishing property that, given its parents, every node in the network is conditionally independent of all other nodes in the network that are not its descendants. This is sometimes referred to as the local Markov property. It is an important property because it means that we can easily factorize the joint probability function of all the random variables in the network into a product of local terms, one per node, with each node conditioned only on its parents:

P(X1, ..., Xn) = P(X1 | parents(X1)) × ... × P(Xn | parents(Xn))
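As a minimal R sketch (with made-up probability tables), here is this factorization for a three-node chain A -> B -> C, where each node's term conditions only on its parent:

```r
# A chain-structured Bayesian network A -> B -> C over binary variables,
# with made-up conditional probability tables.
p_A <- 0.6                               # P(A = 1)
p_B_given_A <- c(`0` = 0.1, `1` = 0.7)   # P(B = 1 | A = a)
p_C_given_B <- c(`0` = 0.2, `1` = 0.9)   # P(C = 1 | B = b)

# The local Markov property gives the factorization
# P(A, B, C) = P(A) * P(B | A) * P(C | B)
joint <- function(a, b, c) {
  pa <- if (a == 1) p_A else 1 - p_A
  pb <- if (b == 1) p_B_given_A[as.character(a)] else 1 - p_B_given_A[as.character(a)]
  pc <- if (c == 1) p_C_given_B[as.character(b)] else 1 - p_C_given_B[as.character(b)]
  unname(pa * pb * pc)
}

# The eight joint probabilities must sum to 1:
grid <- expand.grid(a = 0:1, b = 0:1, c = 0:1)
sum(mapply(joint, grid$a, grid$b, grid$c))   # 1
```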
We now have the necessary tools to learn about our first and simplest graphical model, the Naïve Bayes classifier. This is a directed graphical model that contains a single parent node and a series of child nodes representing random variables that are dependent only on this node with no dependencies between them. Here is an example:
We usually interpret our single parent node as the causal node, so in our particular example, the value of the Sentiment node will influence the value of the sad node, the fun node, and so on. As this is a Bayesian network, the local Markov property explains the core assumption of the model: given the Sentiment node, all other nodes are conditionally independent of each other.
In practice, we use the Naïve Bayes classifier in a context where we can observe and measure the child nodes and attempt to estimate the parent node as our output. Thus, the child nodes will be the input features of our model, and the parent node will be the output class that we are trying to predict.
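One common R implementation of this model is naiveBayes() from the e1071 package (an assumption on our part; the chapter's own examples may use a different package). The built-in iris data stands in for a real data set here:

```r
library(e1071)

# The output class (the parent node) is modeled from the observed
# features (the child nodes), assumed independent given the class:
model <- naiveBayes(Species ~ ., data = iris)

# Predict the class that best explains a new set of observed features:
predict(model, iris[c(1, 51, 101), ])
```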
The first application we will study in detail comes from the field of biology. There, we learn that the basic building blocks of DNA molecules are actually four fundamental molecules known as nucleotides. These are called Thymine, Cytosine, Adenine, and Guanine, and it is the order in which these molecules appear in a DNA strand that encodes the genetic information carried by the DNA.
An interesting problem in molecular biology is finding promoter sequences within a larger DNA strand. These are special sequences of nucleotides that play an important role in regulating a genetic process known as gene transcription. This is the first step in the mechanism by which information in the DNA is read.
The molecular biology (promoter gene sequences) data set, hosted by the UCI Machine Learning repository at https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Promoter+Gene+Sequences), contains a number of gene sequences from DNA belonging to the bacterium E. coli.
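Before such sequences can be fed to a classifier, each one must be turned into a set of features. One simple sketch (using a made-up strand; the chapter's actual preprocessing may differ) is to treat every position in the sequence as a categorical feature taking one of the four nucleotide values:

```r
# A made-up example strand; real sequences come from the UCI data set.
dna_seq <- "tactagcaatacgcttgcgttcggtg"

# One categorical feature per position, with values in {t, c, a, g}:
features <- strsplit(dna_seq, "")[[1]]
names(features) <- paste0("pos", seq_along(features))
head(features, 5)
# pos1 pos2 pos3 pos4 pos5
#  "t"  "a"  "c"  "t"  "a"
```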
In this section, we will model the patterns of letters that form English words. Beyond having different words, and sometimes alphabets, languages differ from each other in the patterns of letters that are used to form words. English words have a characteristic distribution of letters and letter sequences, and in this section we will try to model the process of word formation in a very simplistic way by using a hidden Markov model.
The emitted symbols of our model will be the letters themselves, but this time we don't know what the states could be as we are using unlabeled data. For this reason, we are going to provide just the number of states that we want our model to have, and then use the Baum-Welch algorithm to train the parameters of our HMM.
All we need for this task is a corpus of text in English. Earlier in this chapter, we studied movie reviews with the Naïve Bayes classifier, so we will use these for convenience, although other sources of English text would serve equally well.
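As a minimal sketch of the training step, the CRAN HMM package (an assumption; any package providing Baum-Welch would do) can fit a two-state model to a stream of observed letters. The starting parameters below are arbitrary, and the toy observation string stands in for a real corpus:

```r
library(HMM)

# A toy observation sequence of letters; in practice this would come
# from a large corpus of English text.
obs <- strsplit("themodelwilllearnletterpatterns", "")[[1]]

# Arbitrary non-uniform starting parameters for a 2-state model; a fully
# uniform start would leave Baum-Welch stuck at a symmetric fixed point.
set.seed(1)
emis <- matrix(runif(2 * 26), nrow = 2)
emis <- emis / rowSums(emis)               # emission rows must sum to 1

hmm <- initHMM(States  = c("S1", "S2"),
               Symbols = letters,          # the 26 lower-case letters
               startProbs    = c(0.5, 0.5),
               transProbs    = matrix(c(0.6, 0.4,
                                        0.3, 0.7), nrow = 2, byrow = TRUE),
               emissionProbs = emis)

# Baum-Welch iteratively re-estimates transition and emission probabilities:
trained <- baumWelch(hmm, obs, maxIterations = 50)
trained$hmm$transProbs     # estimated state-transition matrix
```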
In this chapter, we introduced ourselves to one of the most active areas of research in machine learning, namely the field of probabilistic graphical models. These models involve using a graphical structure to encode conditional independence relations between random variables. We saw how Bayes' Theorem, a very simple formula that essentially tells us how we can predict cause by observing effect, can be used to build a simple classifier known as the Naïve Bayes classifier. This is a simple model in which we try to predict the output class that best explains a set of observed features, all of which are assumed to be independent of each other given the output class.
We used this model to predict user sentiment on a set of movie reviews where the features were the words that were present in the reviews. Although we obtained reasonable accuracy, we found that the assumptions in our model are quite strict and prevent us from doing substantially better. Often, a Naïve Bayes model is best used as a baseline against which more sophisticated models can be compared.