# Introduction to the Markov Process

In this chapter, we will develop the basic concepts that we need to understand **Hidden Markov Models** (**HMM**). We will cover the following topics:

- Random processes
- Markov processes
- Markov chains or discrete-time Markov processes
- Continuous-time Markov chains

# Random variables

As we always do in statistics, let's start with a simple example of rolling a dice. If we consider rolling a fair dice, the outcome of the dice can be anything from 1 to 6, and is random. To represent such situations (the outcome of rolling the dice in this case), in mathematics we use the concept of random variables. We come across a lot of such variables in our everyday lives. Another example could be ordering food at a restaurant. In this case, the outcome could be any food item on the menu. In general terms, a random variable is a variable whose possible values are outcomes of a random phenomenon. The possible states of the outcomes are also known as the **domain of the random variable**, and the outcome is based on the probability distribution defined over the domain of the random variable.

Coming back to rolling the dice, the domain of the random variable outcome*, O, *is given by *domain(O) = (1, 2, 3, 4, 5, 6),* and the probability distribution is given by a uniform distribution *P(o) = 1/6 ∀ ∈ domain(O)*. Similarly, in the case of the restaurant example, for the random variable *choosing a dish*, the domain would be every item on the menu, and the probability distribution would depend on your food preference. In both of the previous examples, the domain of the random variable has discrete variables; such random variables are known as **discrete random variables**. But it's also possible for the domain to be a continuous space. For example, consider the random variable representing the stock price of Google* *tomorrow. The domain of this random variable will be all positive real numbers with most of the probability mass distributed around ±5% of today's price. Such random variables are known as **continuous random variables**.

# Random processes

In the previous section, we discussed random variables that are able to mathematically represent the outcomes of a single random phenomenon. But what if we want to represent these random events over some period of time or the length of an experiment? For example, let's say we want to represent the stock prices for a whole day at intervals of every one hour, or we want to represent the height of a ball at intervals of every one second after being dropped from some height in a vacuum. For such situations, we would need a set of random variables, each of which will represent the outcome at the given instance of time. These sets of random variables that represent random variables over a period of time are also known as **random processes**.** **It is worth noting that the domains of all these random variables are the same. Therefore, we can also think of the process as just changing the states.

Now let's discuss the previous two examples in more detail. Starting with the example of dropping the ball from a height in a vacuum, if we know the exact value of gravity and the height from which the ball is being dropped, we will be able to determine the exact location of the ball at every interval of one second using Newton's laws of motion.

Such random processes, in which we can deterministically find the state of each random variable given the initial conditions (in this case, dropping the ball, zero initial velocity) and the parameters of the system (in this case, the value of gravity), are known as **deterministic random processes** (commonly called **deterministic processes**).

Now let's go to the second example; representing the stock price over time. In this case, even if we know the current price and the exact probability distribution of the price at the next one hour mark, we won't be able to deterministically compute the value. These random processes, in which we can't determine the state of a process, even if we are given the initial conditions and all the parameters of the system, are known as **stochastic random processes** (commonly called **processes**). A very good way of understanding or getting a feel for a stochastic process is to think of it as being the opposite of a deterministic process.

# Markov processes

A stochastic process is called a **Markov process **if the state of the random variable at the next instance of time depends only on the outcome of the random variable at the current time. In simplistic mathematical terms, for a stochastic process, *S = {R1, R _{2}, . . ., R_{n}} = {R}_{t=1, . . ., n}*, to be a Markov process, it must satisfy the following condition:

According to the previous condition, the probability distribution for any variable at any given instance in a Markov process is a conditional distribution, which is conditioned only on the random variable at the last time instance. This property of a system, such that the future states of the system depend only on the current state of the system, is also known as the **Markov property**. Systems satisfying the Markov property are also known as **memoryless systems** since they don't need to remember the previous states to compute the distribution of the next state, or, in other words, the next state depends only on the current state of the system.

A very common example used to explain the Markov process is a drunk man walking along a street. We consider that, since the man is drunk, he can either take a step backward, a step forward, or stay in his current position, which is given by some distribution of these, let's say *[0.4, 0.4, 0.2]*. Now, given the position of the man at any given instance in time, his position at the next instance depends only on his current position and the parameters of the system (his step size and the probability distribution of possible actions). Therefore, this is an example of a Markov process.

In the previous example, let's assume that the drunk man takes an action (steps forward/backward or stays in his position) at fixed intervals of time and his step size is always the same. With these considerations, the Markov process in our example has a discrete state space. Also, since the man takes steps after fixed intervals of time, we can think of it as a discrete time. But Markov processes don't need to have discrete state space or discrete time intervals. Considering discrete and continuous time as well as discrete and continuous state space, we can categorize Markov processes into four main categories:

- Discrete time and discrete state space
- Discrete time and continuous state space
- Continuous time and discrete state space
- Continuous time and continuous state space

We will discuss each of these categories of Markov process in more detail in the following sections.

# Installing Python and packages

Before moving ahead, we need to set up Python and all the packages required to run the code examples. For all the code examples in this book, we will be using Python 3.4. All the example code in the book is also available on GitHub at https://github.com/PacktPublishing/HandsOnMarkovModelswithPython. We highly recommend using Miniconda to set up your environment for running the examples. Miniconda can be downloaded from https://conda.io/miniconda.html.

# Installation on Windows

Miniconda can be installed on a Windows system by just double-clicking on the downloaded `.exe` file and following the installation instructions. After installation, we will need to create a `conda` environment and install all the required packages in the environment. To create a new Python 3.4 environment with the name `hmm`, run the following command:

conda create -n hmm python=3.4

After creating the environment, we will need to activate it and install the required packages in it. This can be done using the following commands:

activate hmmconda install numpy scipy

# Installation on Linux

On Linux, after downloading the `Miniconda` file, we will need to give it execution permissions and then install it. This can be done using the following commands:

chmod +x Miniconda.sh./Miniconda.sh

After executing the file, we can simply follow the installation instructions. Once installed, we will need to create a new environment and install the required packages. We can create a new Python 3.4 environment with the name `hmm` using the following commands:

conda create -n hmm python=3.4

Once the environment has been created, we will need to activate it and install the packages inside it using the following:

source activate hmmconda install numpy scipy

# Markov chains or discrete-time Markov processes

A Markov chain is a type of Markov process in which the time is discrete. However, there is a lot of disagreement among researchers on what categories of Markov process should be called **Markov chain**. But, most commonly, it is used to refer to discrete-state-space Markov processes. Therefore, a Markov chain is a stochastic process over a discrete state space satisfying the Markov property. More formally, we can say that a discrete-time Markov chain is a sequence of random variables *X _{1}*,

*X*,

_{2}*X*, ... that satisfy the Markov property, namely that the probability of moving from the current state to the next state depends only on the present state and not on any of the previous states. In terms of the probability distribution, we can say that, given that the system is at time instance

_{3}*n*, the conditional distribution of the states at the next time instance,

*n + 1*, is conditionally independent of the state of the system at time instances

*{1, 2, . . ., n-1}*, given the state of the random variable at time instance

*n*. This can be written as follows:

Markov chains are often represented using directed graphs. The nodes in the directed graphs represent the different possible states of the random variables, and the edges represent the probability of the system going from one state to the other in the next time instance. Let's take a simple example of predicting the weather to understand this representation better. We will consider that there are three possible states of the random variable *Weather={Sunny, Rainy, Snowy}*, and possible Markov chains for this can be represented as shown in *Figure 1.1*:

*and showing the probability of the random variable switching to other states in the next time instance*

One of the main points to understand in Markov chains is that we are modeling the outcomes of a sequence of random variables over time. This is sometimes confusing for people since the model is represented using a single graph, which doesn't mention anything about time. So, the name state transitions is not a particularly good name for this, since the state is not changing for any random variable; rather, we are trying to determine the state of the next random variable given the observed state of our current random variable. Coming back to our example, we can see that the nodes of the graph represent the different possible states of the random variable *Weather, *and the edges between them show the probability of the next random variable taking the different possible states, given the state of the current random variable. The self-loops show the probability of the model staying in its current state. In the previous Markov chain, let's say we know that the observed state of the current random variable is *Sunny, *then the probability that the random variable at the next time instance will also take the value *Sunny *is *0.8*. It could also take the value *Rainy *with a probability of *0.19*, or *Snowy* with a probability of *0.01*. One thing to note here is that the sum of all the probability values on all the outward edges from any state should equal 1, since it's an exhaustive event.

Now, let's try to code this simple Markov chain. We will start by defining a simple `MarkovChain` class, and we will keep on adding methods to this class as we go through this chapter:

import numpy as np

class MarkovChain(object):

def __init__(self, transition_prob):

"""

Initialize the MarkovChain instance.

Parameters

----------

transition_prob: dict

A dict object representing the transition probabilities in

Markov Chain. Should be of the form: {'state1': {'state1':

0.1, 'state2': 0.4}, 'state2': {...}}

"""

self.transition_prob = transition_prob

self.states = list(transition_prob.keys())

def next_state(self, current_state):

"""

Returns the state of the random variable at the next time

instance.

Parameters

----------

current_state: str

The current state of the system.

"""

return np.random.choice(

self.states, p=[self.transition_prob[current_state][next_state]

for next_state in self.states])

def generate_states(self, current_state, no=10):

"""

Generates the next states of the system.

Parameters

----------

current_state: str

The state of the current random variable.

no: int

The number of future states to generate.

"""

future_states = []

for i in range(no):

next_state = self.next_state(current_state)

future_states.append(next_state)

current_state = next_state

return future_states

Now, we can try out our example with this `MarkovChain` class:

>>> transition_prob = {'Sunny': {'Sunny': 0.8, 'Rainy': 0.19,

'Snowy': 0.01},

'Rainy': {'Sunny': 0.2, 'Rainy': 0.7,

'Snowy': 0.1},

'Snowy': {'Sunny': 0.1, 'Rainy': 0.2,

'Snowy': 0.7}}

>>> weather_chain = MarkovChain(transition_prob=transition_prob)

>>> weather_chain.next_state(current_state='Sunny')

'Sunny'

>>> weather_chain.next_state(current_state='Snowy')

'Snowy'

>>> weather_chain.generate_states(current_state='Snowy', no=10)

['Snowy', 'Snowy', 'Snowy', 'Rainy', 'Snowy', 'Snowy', 'Rainy', 'Rainy', 'Snowy', 'Snowy']

So far in the discussion, we have considered that the probability space of the variables doesn't change over different instances of time. This is known as a **time-homogeneous Markov chain**, but it is also possible to have a **time-inhomogeneous Markov chain**, which also has a lot of applications but is outside the scope of this book.

# Parameterization of Markov chains

In the code for the Markov chain in the previous section, we used a dictionary to parameterize the Markov chain that had the probability values of all the possible state transitions. Another way of representing state transitions is using a **transition matrix**. The transition matrix, as the name suggests, uses a tabular representation for the transition probabilities. The transition matrix for the example in *Figure 1.1* is shown in the following table.

The following table shows the transition matrix for the Markov chain shown in *Figure 1.1*. The probability values represent the probability of the system going from the state in the row to the states mentioned in the columns:

States |
Sunny |
Rainy |
Snowy |

Sunny |
0.8 | 0.19 | 0.01 |

Rainy |
0.2 | 0.7 | 0.1 |

Snowy |
0.1+ | 0.2 | 0.7 |

The transition matrix represents the same information as in the dictionary, but in a more compact way. For this reason, the transition matrix is the standard way of representing Markov chains. Let's modify our `MarkovChain` class so that it can accept a transition matrix:

import numpy as np

class MarkovChain(object):

def __init__(self, transition_matrix, states):

"""

Initialize the MarkovChain instance.

Parameters

----------

transition_matrix: 2-D array

A 2-D array representing the probabilities of change of

state in the Markov Chain.

states: 1-D array

An array representing the states of the Markov Chain. It

needs to be in the same order as transition_matrix.

"""

self.transition_matrix = np.atleast_2d(transition_matrix)

self.states = states

self.index_dict = {self.states[index]: index for index in

range(len(self.states))}

self.state_dict = {index: self.states[index] for index in

range(len(self.states))}

def next_state(self, current_state):

"""

Returns the state of the random variable at the next time

instance.

Parameters

----------

current_state: str

The current state of the system.

"""

return np.random.choice(

self.states,

p=self.transition_matrix[self.index_dict[current_state], :])

def generate_states(self, current_state, no=10):

"""

Generates the next states of the system.

Parameters

----------

current_state: str

The state of the current random variable.

no: int

The number of future states to generate.

"""

future_states = []

for i in range(no):

next_state = self.next_state(current_state)

future_states.append(next_state)

current_state = next_state

return future_states

Running this code should also give similar results to what we got in the previous section. Using a transition matrix might not seem like a good idea because it requires us to create extra variables to store the indices. But, in cases when we have hundreds of states, using a transition matrix is much more efficient than using the simple dictionary implementation. In the case of a transition matrix, we can simply use NumPy indexing to get the probability values in the `next_state` method, whereas we were looping over all the state names in the previous implementation:

>>> transition_matrix = [[0.8, 0.19, 0.01],

[0.2, 0.7, 0.1],

[0.1, 0.2, 0.7]]

>>> weather_chain = MarkovChain(transition_matrix=transition_matrix,

states=['Sunny', 'Rainy', 'Snowy'])

>>> weather_chain.next_state(current_state='Sunny')

'Sunny'

>>> weather_chain.next_state(current_state='Snowy')

'Sunny'

>>> weather_chain.generate_states(current_state='Snowy', no=10)

['Snowy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',

'Rainy', 'Rainy', 'Sunny', 'Sunny']

# Properties of Markov chains

In this section, we will talk about the different properties of Markov chains, namely reducibility, periodicity, transience and recurrence, ergodicity, and steady-state analysis and limiting distributions. We will also try some simple examples of our `MarkovChain` class to show these properties.

# Reducibility

A Markov chain is said to be **irreducible** if we can reach any state of the given Markov chain from any other state. In terms of states, state *j* is said to be **accessible** from another state *i *if a system that started at state *i* has a non-zero probability of getting to the state *j*. In more formal terms, state *j* is said to be accessible from state *i* if an integer *n _{ij} ≥ 0* exists such that the following condition is met:

The *n _{ij}* here is basically the number of steps it takes to go from state

*i*to

*j*, and it can be different for different pairs of values for

*i*and

*j*. Also, for a given state

*i*, if all the values for

*n*, it means that all the states of the Markov chain are directly accessible from it. The accessibility relation is reflexive and transitive, but not necessary symmetric. We can take a simple example to understand this property:

_{ij}= 0In the previous example, it can be clearly seen that all of the states are accessible from all other states and hence are irreducible.

*Figure 1.2*and

*Figure 1.3*that we haven't represented edges if probability values are 0. This helps to keep the model less complicated and easier to read.

In the following example, we can see that state **D** is not accessible from **A**, **B**, or **C**. Also, state **C** is not accessible from either **A** or **B**. But all the states are accessible from state **D**, and states **A** and **B** are accessible from **C**:

We can also add a couple of methods to our `MarkovChain` class to check which states in our chain are reachable and whether our chain is irreducible:

from itertools import combinations

def is_accessible(self, i_state, f_state):

"""

Check if state f_state is accessible from i_state.

Parameters

----------

i_state: str

The state from which the accessibility needs to be checked.

f_state: str

The state to which accessibility needs to be checked.

"""

reachable_states = [i_state]

for state in reachable_states:

if state == self.index_dict[f_state]:

return True

else:

reachable_states.append(np.nonzero(

self.transition_matrix[self.index_dict[i_state], :])[0])

return False

def is_irreducible(self):

"""

Check if the Markov Chain is irreducible.

"""

for (i, j) in combinations(self.states, self.states):

if not self.is_accessible(i, j):

return False

return True

Let's give our examples a try using the examples in *Figure 1.2* and *Figure 1.3*:

>>> transition_irreducible = [[0.5, 0.5, 0, 0],

[0.25, 0, 0.5, 0.25],

[0.25, 0.5, 0, 0.25],

[0, 0, 0.5, 0.5]]

>>> transition_reducible = [[0.5, 0.5, 0, 0],

[0, 1, 0, 0],

[0.25, 0.5, 0, 0],

[0, 0, 0.25, 0.75]]

>>> markov_irreducible = MarkovChain(transition_matrix=transition_irreducible,

states=['A', 'B', 'C', 'D'])

>>> markov_reducible = MarkovChain(transition_matrix=transition_reducible,

states=['A', 'B', 'C', 'D'])

>>> markov_irreducible.is_accessible(i_state='A', f_state='D')

True

>>> markov_irreducible.is_accessible(i_state='B', f_state='D')

True

>>> markov_irreducible.is_irreducible()

True

>>> markov_reducible.is_accessible(i_state='A', f_state='D')

False

>>> markov_reducible.is_accessible(i_state='D', f_state='A')

True

>>> markov_reducible.is_accessible(i_state='C', f_state='D')

False

>>> markov_reducible.is_irreducible()

False

# Periodicity

State *i* is said to have period *k* if any possible path to return to state *i* would be a multiple of *k* steps. Formally, it is defined like this:

Here, *gcd *means the **greatest common divisor** (**GCD**). Basically, *k* is the GCD of the length/number of steps of all possible paths from state *i* back to itself. If there are no possible paths from state *i *back to itself, then the period for it is not defined. We also need to note that *k* has nothing to do with the number of steps required to return to the starting state. For example, let's say that for any given state the number of steps required to return to it are *(4, 6, 8, 12, 16)*. In this case *k=2*, but the minimum number of steps required to return is *4,* and *2* doesn't even appear in the list of possible numbers of steps.

For any given state in the Markov chain, if *k=1*, the state is said to be **aperiodic**. A Markov chain is called aperiodic if all of its states are aperiodic. One major thing to note is that, in the case of an irreducible Markov chain, a single aperiodic state is enough to imply that all the states are aperiodic. Let's take a simple example and check the periodicity of different states:

In the previous example, we can easily see that for state **A** the possible paths to return are **A** -> **B** -> **C** -> **A** or **A** -> **B** -> **C** -> **D** -> **E** -> **C** -> **A**. For these two paths, the path lengths are 3 and 6, respectively, and hence state **A** has a period of 3. Similarly, **B**, **C**, **D**, and **E** also each has a period of 3 in the Markov chain, and hence the Markov chain is also periodic:

In this example, we added a couple of extra edges, due to which the possible path lengths for **A** are now *3, 5, 7, ...;* and for **B** are *2, 3, 4, 5, ...* And, since the GCD of these path lengths is 1, states **A** and **B** are both now aperiodic. Similarly, we can compute the period of other nodes, each of which is also 1, and hence the Markov chain is also aperiodic.

Let's now add a couple of new methods to our `MarkovChain` class to compute the period of different states and check whether our model is aperiodic:

def get_period(self, state):

"""

Returns the period of the state in the Markov Chain.

Parameters

----------

state: str

The state for which the period needs to be computed.

"""

return gcd([len(i) for i in all_possible_paths])

def is_aperiodic(self):

"""

Checks if the Markov Chain is aperiodic.

"""

periods = [self.get_period(state) for state in self.states]

for period in periods:

if period != 1:

return False

return True

Let's now try out our methods on our examples. In this example, we will randomly assign probability values to different transitions:

>>> transition_periodic = [[0, 1, 0, 0, 0],

[0, 0, 1, 0, 0],

[0.5, 0, 0, 0.5, 0],

[0, 0, 0, 0, 1],

[0, 0, 1, 0, 0]]

>>> transition_aperiodic = [[0, 1, 0, 0, 0],

[0, 0, 1, 0, 0],

[0.5, 0.25, 0, 0.25, 0],

[0, 0, 0, 0, 1],

[0, 0, 0.5, 0.5, 0]]

>>> markov_periodic = MarkovChain(transition_matrix=transition_periodic,

states=['A', 'B', 'C', 'D', 'E'])

>>> markov_aperiodic = MarkovChain(transition_matrix=transition_aperiodic,

states=['A', 'B', 'C', 'D', 'E'])

>>> markov_periodic.get_period('A')

3

>>> markov_periodic.get_period('C')

3

>>> markov_aperiodic.is_periodic()

False

>>> markov_aperiodic.get_period('A')

1

>>> markov_aperiodic.get_period('B')

1

>>> markov_aperiodic.is_periodic()

True

# Transience and recurrence

Given that we start at state *i*, it is called **transient **if there is a non-zero probability that we will never return to state *i*. To define this in more formal terms, let's consider a random variable *T _{i}* as the first return time to state

*i*:

Let's now define another term, , as the probability of the system returns to state *i* after *n* steps:

Now we can define that any given state *i* is transient if the following condition is met:

In the preceding equation, we are basically checking whether the total sum of probabilities of returning to state *i* in step sizes less than is less than *1*. If the total sum is less than *1*, it would mean that the probability of *T _{i}* to be is greater than

*0*which would mean that the state

*i*is transient. The given state

*i*is called

**recurrent**if it is not transient:

In the preceding example, we can see that states **A** and **B** are transient because **A** doesn't have any incoming edge. **B** does have an incoming edge, but it's incoming from another transient state and therefore it is also transient. Hence, once the system leaves state **A** or **B**, it won't be able to come back.

It is really simple to check whether a given state is transient or not. We can simply check whether there are any incoming edges from other states or not. If not, the state is transient. Let's write a simple method to check this for our `MarkovChain` class:

def is_transient(self, state):

"""

Checks if a state is transient or not.

Parameters

----------

state: str

The state for which the transient property needs to be checked.

"""

if all(self.transition_matrix[~self.index_dict[state], self.index_dict[state]] == 0):

return True

else:

return False

Now we can use this method in our example in *Figure 1.6* to check which nodes are transient:

>>> transient_matrix = [[0, 0.5, 0.5, 0],

[0, 0, 0.25, 0.75],

[0, 0, 0, 1],

[0, 0, 0.5, 0.5]]

>>> transient_markov = MarkovChain(transition_matrix=transient_matrix,

states=['A', 'B', 'C', 'D']) >>> transient_markov.is_transient('A')

True

>>> transient_markov.is_transient('B')

True

>>> transient_markov.is_transient('C')

False

In the following subsections, we will talk about the statistical properties of the random variable *T _{i}*.

# Mean recurrence time

The first-return time for the initial state *i* is also known as the **hitting time**. It was represented using the random variable *T _{i}* in the previous section. The

**mean recurrence time**of state

*i*is defined as its expected return time:

If the mean recurrence time, *M _{i}*, is finite, the state is called

**positive recurrent**, otherwise it is called

**null recurrent**.

# Expected number of visits

As is evident from the name, the **expected number of visits **for any state *i* is the number of times the system is expected to be in that state. Also, a given state *i* is recurrent if and only if the expected number of visits to *i* is infinite:

# Absorbing states

State *i* is said to be an **absorbing state **if it is impossible for a system to leave that state once it reaches it. For a state to be an absorbing state, the probability of staying in the same state should be *1*, and all the other probabilities should be *0*:

In a Markov chain, if all the states are absorbing, then we call it an absorbing Markov chain:

*1*Again, we can add a very simple method to check for absorbing states in our `MarkovChain` class:

def is_absorbing(self, state):

"""

Checks if the given state is absorbing.

Parameters

----------

state: str

The state for which we need to check whether it's absorbing

or not.

"""

state_index = self.index_dict[state]

if self.transition_matrix[state_index, state_index]

We can again check whether our state in the example is absorbing by creating a Markov chain and using the `is_absorbing` method:

>>> absorbing_matrix = [[0, 1, 0],

[0.5, 0, 0.5],

[0, 0, 1]]

>>> absorbing_chain = MarkovChain(transition_matrix=absorbing_matrix,

states=['A', 'B', 'C'])

>>> absorbing_chain.is_absorbing('A')

False

>>> absorbing_chain.is_absorbing('C')

True

# Ergodicity

State *i* is said to be ergodic** **if it is recurrent, has a period of *1*, and has a finite mean recurrence time. If all the states of a Markov chain are ergodic, then it's an ergodic Markov chain. In general terms, a Markov chain is ergodic if there is a number *N*, such that any state in the system can be reached from any other state in any number of steps greater than or equal to the number *N*. Therefore, in the case of a fully connected transition matrix, where all transitions have a non-zero probability, this condition is fulfilled with *N=1*.

# Steady-state analysis and limiting distributions

In a Markov chain, vector *π* is called the **stationary distribution** if *∀ j ∈ s* satisfies the following conditions:

The stationary distribution is one of the most important properties of Markov chains, and we will talk about it in much more detail in later sections of this chapter.

# Continuous-time Markov chains

Continuous-time Markov chains are quite similar to discrete-time Markov chains except for the fact that in the continuous case we explicitly model the transition time between the states using a positive-value random variable. Also, we consider the system at all possible values of time instead of just the transition times.

# Exponential distributions

The random variable *x* is said to have an exponential distribution with a rate of distribution of *λ* if its probability density function is defined as follows:

Here, the rate of distribution *λ* needs to be greater than *0*. We can also compute the expectation of *X* as this:

We see that the expectation of *X* is inversely proportional to the rate of learning. This means that an exponential distribution with a higher rate of learning would have a lower expectation. The exponential distribution is often used to model problems that involve modelling time until some event happens. A simple example could be modelling the time before an alarm clock goes off, or the time before a server comes to your table in a restaurant. And, as we know , the higher the learning rate, the sooner we would expect the event to happen, and hence the name *learning rate*.

We can also compute the second moment and the variance of the exponential distribution:

And, using the first moment and the second moment, we can compute the variance of the distribution:

Now we will move on to some of the properties of the exponential distribution that are relevant to our example:

**Memoryless**:*Figure 1.x*shows a plot of an exponential distribution. In the diagram, we can clearly see that the graph after any given point (*a*in this case) is an exact copy of the original distribution. We can also say that an exponential distribution that is conditioned on (*X > a*) still stays exponential. If we think about this property in terms of our examples, it means that if we had an alarm clock, and at any time,*t*, we check that it still hasn't gone off, we can still determine the distribution over the time ahead of*t*, which will be the same exponential distribution. This property of the exponential distribution is known as being**memoryless**,**Probability of minimum value**:*n*independent exponential distributions over the random variables*X*, . . .,_{0}*X*with learning rates_{n}*λ*, respectively. For these distributions, we can prove that the distribution of_{0}, ..., λ_{n}*min(X*is also an exponential distribution with learning rate :_{0}, . . ., X_{n})

We will use both of these properties of the exponential distribution in our example for the continuous time Markov chain in a later section.

# Poisson process

The Poisson process is a continuous process, and there can be multiple interpretations of it, which lead to different possible definitions. In this section, we will start with the formal definition and build up to a more simple, intuitive definition. A continuous-time stochastic process *N(t):t > 0* is a **Poisson process** with a rate *λ > 0* if the following conditions are met:

*N(0) = 0*- It has
**stationary**and**independent increments** - The distribution of
*N(t)*is Poisson with mean*λt*:

First of all, we need to define what the stationary and independent increments are. For a continuous-time stochastic process, *X(t): ≥ 0*, an increment is defined as the difference in state of the system between two time instances; that is, given two time instances *s* and *t* with *s < t*, the increment from time *s* to time *t* is *X(t) - X(s)*. As the name suggests, a process is said to have a stationary increment if its distribution for the increment depends only on the time difference.

In other words, a process is said to have a stationary increment if the distribution of *X(t _{1}) - X(s_{1})* is equal to

*X(t*if

_{2}) - X(s_{2})*t*and

_{1}> s_{1},t_{2}> s_{2}*t*. A process is said to have an independent increment if any two increments in disjointed time intervals are independent; that is, if

_{1}- s_{1}= t_{2}-s_{2}*t*, then the increments

_{1}> s_{1}> t_{2}> s_{2}*X(t*and

_{2}) - X(s_{2})*X(t1) - X(s1)*are independent.

Now let's come back to defining the Poisson process. The Poisson process is essentially a counting process that counts the number of events that have occurred before time *t*. This count of the number of events before time *t* is given by *N(t),* and, similarly, the number of events occurring between time intervals *t* and *t + s* is given by *N(t + s) - N(t)*. The value *N(t + s) - N(t)* is Poisson-distributed with a mean *λ _{s}*. We can see that the Poisson process has stationary increments in fixed time intervals, but as , the value of

*N(t)*will also approach infinity; that is, . Another thing worth noting is that, as the value of

*λ*increases, the number of events happening will also increase, and that is why

*λ*is also known as the

**rate of the process**.

This brings us to our second simplified definition of the Poisson process. A continuous-time stochastic process *N(t): t ≥ 0* is called a Poisson process with the rate of learning *λ > 0* if the following conditions are met:

*N(0) = 0*- It is a counting process; that is,
*N(T)*gives the count of the number of events that have occurred before time*t* - The times between the events are distributed independently and identically, with an exponential distribution having a learning rate of
*λ*

# Continuous-time Markov chain example

Now, since we have a basic understanding of exponential distributions and the Poisson process, we can move on to the example to build up a continuous-time Markov chain. In this example, we will try to show how the properties of exponential distributions can be used to build up generic continuous-time Markov chains. Let's consider a hotel reception where *n* receptionists are working in parallel. Also consider that the guests arrive according to a Poisson process, with rate *λ*, and the service time for each guest is represented using an exponential random variable with learning rate *µ*. Also, if all the receptionists are busy when a new guest arrives, he/she will depart without getting any service. Now let's consider that a new guest arrives and finds all the receptionists are busy, and let's try to compute the expected number of busy receptionists in the next time interval.

Let's start by assuming that *T _{k}* represents the number of

*k*busy receptionists in the next time instance. We can also use

*T*to represent the expected number of busy receptionists found by the next arriving guest if

_{k}*k*receptionists are busy at the current time instance. These two representations of

*T*are equivalent because of the memoryless property of exponential distributions.

_{k}Firstly, *T _{0}* is clearly

*0*, because if there are currently

*0*busy receptionists, the next arrival will also find

*0*busy receptionists for sure. Now, considering

*T*, if there are currently

_{1}*i*busy receptionists, the next arriving guest will find

*1*busy receptionist if the time to the next arrival is less than the remaining service time for the busy receptionist. From the memoryless property, we know that the next arrival time is exponentially distributed with a learning rate of

*λ*, and the remaining service time is also exponentially distributed with a learning rate of

*µ*. Therefore, the probability that the next guest will find one receptionist busy is and hence the following is true:

In general, we consider the situation that *k* receptionists are busy. We can obtain an expression for *T _{k}* by conditioning on what happens first. When we have

*k*receptionists busy, we can think of basically

*k+1*independent exponential distributions:

*k*exponentials with a learning rate of

*µ*for the remaining service time for each receptionist, and

*1*exponential distribution with a learning rate of

*λ*for the next arriving guest. In our case, we want to condition on whether a service completion happens first or a new guest arrives first. The time for a service completion will be the minimum of the

*k*exponentials. This first completion time is also exponentially distributed with a learning rate of

*kµ*. Now, the probability of having a service completion before the next guest arrives is . Similarly, the probability of the next thing happening being a guest arrival is .

Now, based on this, we can say that if the next event is service completion, then the expected number of busy receptionists will be *T _{k-1}*. Otherwise, if a guest arrives first, there will be

*k*busy receptionists. Therefore we have the following:

We need to just solve this recursion now. *T _{2}* will be given by this equation:

If we continue this same pattern, we will get *T _{3}* as follows:

We can see a pattern in the values of *T _{1}* and

*T*, and therefore we can write a general term for it as follows:

_{2}Let's point out our observations on the previous example:

- At any given time instance, if there are
*i*busy receptionists, for*i < n*there are*i + 1*independent exponential distributions, with*i*of them having rate*µ*, and*1*of them having rate*λ*. The time until the process makes a jump is exponential, and its rate is given by*iµ + λ*. If all the receptionists are busy, then only the*n*exponential distributions corresponding to the service time can trigger a jump, and the time until the process makes a jump is exponential with rate*nµ*. - When the process jumps from state
*i*, for*i < n*, it jumps to state*i + 1*with probability , and jumps to state*i - 1*with probability of . - When the process makes a jump from state
*i*, we can start up a whole new set of distributions corresponding to the state we jumped to. This is because, even though some of the old exponential distributions haven't triggered, it's equivalent to resetting or replacing those distributions.

*i*, regardless of when the time is, the distribution of how long we stay in state

*i*and the probabilities of where we jump to next when we leave state

*i*are the same. In other words, the process is time-homogeneous.

The preceding description of a continuous-time stochastic process corresponds to a continuous-time Markov chain. In the next section, we will try to define it in a more formal way.

# Continuous-time Markov chain

In the previous section, we showed an example of a continuous-time Markov chain to give an indication of how it works. Let's now move on to formally define it. In a continuous-time Markov chain with a discrete state space *S,* for each state *i ∈ S* we have an associated set of *n _{i}* independent exponential distributions with rates

*q*, where

_{i}, j_{1}, ..., q_{i},j_{ni}*j*is the set of possible states the process may jump to when it leaves state

_{1}, ..., j_{ni}*i*. And, when the process enters state

*i*, the amount of time it spends in state

*i*is exponentially distributed with rate

*v*, and when it leaves state

_{i}= q_{i}j_{1}+...+q_{i}j_{ni}*i*it will go to state

*j*with probability for

_{l}*l = 1, ...,n*.

_{i}We can also extend the Markov property from the discrete-time case to continuous time.

For a continuous-time stochastic process *(X(t) : t ≥ 0)* with state space *S*, we say it has the Markov property if the following condition is met:

Here, *0 ≤ t _{1} ≤ t2 ≤. . . .t_{n-1 }≤ s ≤ t* is any non-decreasing sequence of

*n + 1*times, and

*i*are any

_{1, i}2, . . ., i_{n-1}, i, j_{ }∈ S*n + 1*states in the state space, for any integer

*n ≥ 1*.

Similarly, we can extend time-homogeneity to the case of continuous-time Markov chains. We say that a continuous-time Markov chain is time homogenous if, for any *s ≤ t*, and any states *i*, *j ∈ S*, the following condition is met:

As in the case of discrete-time Markov chains, a continuous-time Markov chain does not need to be time-homogeneous, but non-homogeneous Markov chains are out of scope for this book. For more details on non-homogeneous Markov chains, you can refer to Cheng-Chi Huang's thesis on the topic: https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=8613&context=rtd.

Now let's define the transition probability for a continuous-time Markov chain. Just as the rates *q _{ij}* in a continuous-time Markov chain are the counterpart of the transition probabilities

*p*in a discrete-time Markov chain, there is a counterpart to the n-step transition probabilities

_{ij}*p*for a time-homogeneous, continuous-time Markov chain, which is defined as follows:

_{ij}(t)# Summary

In this chapter, we gave a detailed introduction to Markov chains. We talked about different types of Markov chains, mainly chains with a discrete state space, with either discrete time or continuous time. We also introduced the concepts of time-homogeneous and non-time-homogeneous Markov chains. We discussed the different properties of Markov chains in detail, and provided relevant examples and code.

Markov chains and their properties are the basic concepts on which HMMs are built. In the next chapter, we will discuss HMMs in much more detail.