In this article, we take a look at some of the deep learning models available in R. Pioneering advances in neural network research over the last decade have opened up a new frontier in machine learning, generally known as **deep learning**. One general definition of deep learning is *a class of machine learning techniques, where many layers of information processing stages in hierarchical supervised architectures are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones*. Although there are many similar definitions and architectures for deep learning, two elements are common to all of them: *multiple layers of nonlinear information processing* and *supervised or unsupervised learning of feature representations at each layer from the features learned at the previous layer*. The initial work on deep learning was based on multilayer neural network models. More recently, many other forms of models have also been used, such as deep kernel machines and deep Q-networks.

Researchers had experimented with multilayer neural networks in previous decades as well. However, two problems limited progress in learning with such architectures. The first is that learning the parameters of the network is a nonconvex optimization problem, and starting from random initial conditions, one often gets stuck at a poor local minimum. The second is that the associated computational requirements were huge. A breakthrough for the first problem came when Geoffrey Hinton developed a fast algorithm for learning a special class of neural networks called **deep belief nets** (**DBN**). We will describe DBNs in more detail in later sections. The high computational requirements were met by advances in computing using **general-purpose graphics processing units** (**GPGPUs**). What made deep learning so popular for practical applications is the significant improvement in accuracy achieved in automatic speech recognition and computer vision. For example, the **word error rate** in automatic speech recognition of Switchboard conversational speech had saturated at around 40% after years of research. Using deep learning, however, the word error rate was reduced dramatically to close to 10% within a few years. Another well-known example is how a **deep convolutional neural network** achieved an error rate of 15.3% in the 2012 ImageNet Large Scale Visual Recognition Challenge, compared with 26.2% for the best of the other state-of-the-art methods.

In this article, we will describe one class of deep learning models called deep belief networks. Interested readers are encouraged to read the book by Li Deng and Dong Yu for a detailed understanding of the various methods and applications of deep learning. We will also illustrate the use of DBNs with the R package **darch**.

## Restricted Boltzmann machines

A **restricted Boltzmann machine** (**RBM**) is a two-layer network (a bipartite graph), in which one layer is a visible layer (*v*) and the second layer is a hidden layer (*h*). Every node in the visible layer is connected to every node in the hidden layer by an undirected edge, and there are no connections between nodes in the same layer.

An RBM is characterized by the joint distribution of the states of all visible units $v = \{v_1, v_2, \ldots, v_M\}$ and the states of all hidden units $h = \{h_1, h_2, \ldots, h_N\}$, given by:

$$P(v, h \mid \theta) = \frac{\exp\big(-E(v, h \mid \theta)\big)}{Z}$$

Here, $E(v, h \mid \theta)$ is called the **energy function** and $Z = \sum_{v} \sum_{h} \exp\big(-E(v, h \mid \theta)\big)$ is the normalization constant, known as the **partition function** in statistical physics.

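There are mainly two types of RBMs. In the first one, both *v* and *h* are Bernoulli random variables. In the second type, *h* is a Bernoulli random variable whereas *v* is a Gaussian random variable. For the Bernoulli RBM, the energy function is given by:

$$E(v, h \mid \theta) = -\sum_{i=1}^{M} \sum_{j=1}^{N} W_{ij} v_i h_j - \sum_{i=1}^{M} b_i v_i - \sum_{j=1}^{N} a_j h_j$$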

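Here, $W_{ij}$ represents the weight of the edge between nodes $v_i$ and $h_j$; $b_i$ and $a_j$ are bias parameters for the visible and hidden layers, respectively. For this energy function, the exact expressions for the conditional probabilities can be derived as follows:

$$p(h_j = 1 \mid v) = \sigma\Big(\sum_{i=1}^{M} W_{ij} v_i + a_j\Big), \qquad p(v_i = 1 \mid h) = \sigma\Big(\sum_{j=1}^{N} W_{ij} h_j + b_i\Big)$$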

Here, $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function.
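The closed-form conditional can be checked numerically on a tiny network. The following is a minimal sketch, assuming the Bernoulli energy $E(v,h) = -\sum_{i,j} W_{ij} v_i h_j - \sum_i b_i v_i - \sum_j a_j h_j$ and the conditional $p(h_j = 1 \mid v) = \sigma(\sum_i W_{ij} v_i + a_j)$. The sizes, the random parameters, and the helper names `energy` and `sigmoid` are illustrative choices, not part of any package.

```r
# Tiny Bernoulli RBM: M = 3 visible units, N = 2 hidden units (illustrative sizes).
set.seed(1)
M <- 3
N <- 2
W <- matrix(rnorm(M * N), nrow = M, ncol = N)  # W[i, j]: weight between v_i and h_j
b <- rnorm(M)                                  # visible biases b_i
a <- rnorm(N)                                  # hidden biases a_j

sigmoid <- function(x) 1 / (1 + exp(-x))

# Energy E(v, h) = -v' W h - b' v - a' h
energy <- function(v, h) -sum(v * (W %*% h)) - sum(b * v) - sum(a * h)

# Enumerate all 2^N hidden states, one per row.
hidden_states <- as.matrix(expand.grid(rep(list(0:1), N)))

# p(h_1 = 1 | v) by brute force: normalize exp(-E) over all hidden states.
v <- c(1, 0, 1)
unnorm <- apply(hidden_states, 1, function(h) exp(-energy(v, h)))
p_brute <- sum(unnorm[hidden_states[, 1] == 1]) / sum(unnorm)

# The same probability from the closed-form conditional.
p_closed <- sigmoid(sum(W[, 1] * v) + a[1])

all.equal(p_brute, p_closed)
```

The brute-force value needs only the energy function, so agreement between the two numbers is a direct check that the logistic form follows from the energy. Enumeration is feasible here only because the network is tiny.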

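If the input variables are continuous, one can use the Gaussian RBM, whose energy function (for unit variance) is given by:

$$E(v, h \mid \theta) = -\sum_{i=1}^{M} \sum_{j=1}^{N} W_{ij} v_i h_j + \frac{1}{2} \sum_{i=1}^{M} (v_i - b_i)^2 - \sum_{j=1}^{N} a_j h_j$$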

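Also, in this case, the conditional probabilities of $v_i$ and $h_j$ become:

$$p(h_j = 1 \mid v) = \sigma\Big(\sum_{i=1}^{M} W_{ij} v_i + a_j\Big), \qquad p(v_i \mid h) = \mathcal{N}\Big(v_i;\ \sum_{j=1}^{N} W_{ij} h_j + b_i,\ 1\Big)$$

The second expression is a normal distribution with mean $\sum_{j=1}^{N} W_{ij} h_j + b_i$ and variance 1.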
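The mean and the unit variance can be verified by completing the square. The derivation below is a sketch that assumes the common unit-variance form of the Gaussian RBM energy, $E = -\sum_{i,j} W_{ij} v_i h_j + \frac{1}{2}\sum_i (v_i - b_i)^2 - \sum_j a_j h_j$; the symbol $\mu_i$ is introduced here only for illustration, and terms that do not depend on $v_i$ are absorbed into the proportionality constant.

```latex
\begin{aligned}
p(v_i \mid h)
  &\propto \exp\Big( v_i \sum_{j=1}^{N} W_{ij} h_j - \tfrac{1}{2}(v_i - b_i)^2 \Big) \\
  &\propto \exp\Big( -\tfrac{1}{2} v_i^2 + v_i \mu_i \Big),
     \qquad \mu_i = \sum_{j=1}^{N} W_{ij} h_j + b_i \\
  &\propto \exp\Big( -\tfrac{1}{2}(v_i - \mu_i)^2 \Big)
\end{aligned}
```

The last line is the kernel of a normal density with mean $\mu_i$ and variance 1.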

Now that we have described the basic architecture of an RBM, how is it trained? If we use the standard approach of taking the gradient of the log-likelihood, we get the following update rule:

$$\Delta W_{ij} = \epsilon \big( \mathbb{E}_{\text{data}}(v_i h_j) - \mathbb{E}_{\text{model}}(v_i h_j) \big)$$

Here, $\epsilon$ is the learning rate, $\mathbb{E}_{\text{data}}(v_i h_j)$ is the expectation of $v_i h_j$ computed using the dataset, and $\mathbb{E}_{\text{model}}(v_i h_j)$ is the same expectation computed using the model. However, one cannot use this exact expression for updating the weights because $\mathbb{E}_{\text{model}}(v_i h_j)$ is difficult to compute.

The breakthrough that solved this problem, and hence made it possible to train deep neural networks, came when Hinton and his team proposed an algorithm called **Contrastive Divergence** (**CD**). The essence of the algorithm is described in the next paragraph.

The idea is to approximate $\mathbb{E}_{\text{model}}(v_i h_j)$ using values of $v_i$ and $h_j$ generated by Gibbs sampling from the conditional distributions mentioned previously. One scheme for doing this is as follows:

- Initialize $v^{t=0}$ from the dataset.
- Find $h^{t=0}$ by sampling from the conditional distribution $h^{t=0} \sim p(h \mid v^{t=0})$.
- Find $v^{t=1}$ by sampling from the conditional distribution $v^{t=1} \sim p(v \mid h^{t=0})$.
- Find $h^{t=1}$ by sampling from the conditional distribution $h^{t=1} \sim p(h \mid v^{t=1})$.

Once we find the values of $v^{t=1}$ and $h^{t=1}$, we use $v_i^{t=1} h_j^{t=1}$, the product of the $i$th component of $v^{t=1}$ and the $j$th component of $h^{t=1}$, as an approximation for $\mathbb{E}_{\text{model}}(v_i h_j)$. This is called the **CD-1 algorithm**. One can generalize this to use the values from the $k$th step of Gibbs sampling, which is known as the **CD-k algorithm**. One can see a connection between RBMs and Bayesian inference here: since the CD algorithm resembles a posterior density estimate, one could say that RBMs are trained using a Bayesian inference approach.
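The CD-1 scheme can be sketched in a few lines of base R for a Bernoulli RBM. This is an illustrative sketch, not the darch implementation: the function names `cd1_update` and `sample_bernoulli`, the learning rate `lr`, the batch averaging, and the bias updates (which mirror the weight update) are all assumptions made for this example.

```r
set.seed(42)
sigmoid <- function(x) 1 / (1 + exp(-x))

# Draw 0/1 samples with the given matrix of success probabilities.
sample_bernoulli <- function(p) {
  matrix(rbinom(length(p), size = 1, prob = p), nrow = nrow(p), ncol = ncol(p))
}

# One CD-1 update on a batch V0 (rows = samples, columns = visible units).
# W: M x N weights, b: visible biases (length M), a: hidden biases (length N).
cd1_update <- function(V0, W, b, a, lr = 0.1) {
  n <- nrow(V0)
  # h^{t=0} ~ p(h | v^{t=0})
  H0 <- sample_bernoulli(sigmoid(sweep(V0 %*% W, 2, a, "+")))
  # v^{t=1} ~ p(v | h^{t=0})
  V1 <- sample_bernoulli(sigmoid(sweep(H0 %*% t(W), 2, b, "+")))
  # h^{t=1} ~ p(h | v^{t=1})
  H1 <- sample_bernoulli(sigmoid(sweep(V1 %*% W, 2, a, "+")))
  # Data term minus the one-step approximation of the model term,
  # both averaged over the batch.
  dW <- (t(V0) %*% H0 - t(V1) %*% H1) / n
  list(W = W + lr * dW,
       b = b + lr * colMeans(V0 - V1),
       a = a + lr * colMeans(H0 - H1))
}

# Usage on a small made-up binary dataset with two repeated patterns.
V <- matrix(c(1, 1, 0, 0,
              1, 1, 0, 0,
              0, 0, 1, 1,
              0, 0, 1, 1), ncol = 4, byrow = TRUE)
M <- ncol(V)
N <- 2
params <- list(W = matrix(rnorm(M * N, sd = 0.01), M, N),
               b = rep(0, M), a = rep(0, N))
for (epoch in 1:500) {
  params <- cd1_update(V, params$W, params$b, params$a)
}
dim(params$W)
```

Running more Gibbs steps before forming the second product would turn this into CD-k; only the sampling chain inside `cd1_update` would need to be extended.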

Although the Contrastive Divergence algorithm looks simple, one needs to be very careful in training RBMs; otherwise, the model can overfit. Readers who are interested in using RBMs in practical applications should refer to the technical report where this is discussed in detail.

## Deep belief networks

One can stack several RBMs, one on top of another, such that the values of the hidden units in layer $n-1$ ($h_{i,n-1}$) become the values of the visible units in the $n$th layer ($v_{i,n}$), and so on. The resulting network is called a deep belief network. It was one of the main architectures used for pretraining in early deep learning networks. The idea of pretraining a neural network (NN) is the following: in the standard three-layer (input-hidden-output) NN, one can start with random initial values for the weights and, using the backpropagation algorithm, find a good minimum of the negative log-likelihood function. However, when the number of layers increases, the straightforward application of backpropagation does not work because, starting from the output layer, as we compute the gradient values for the layers deeper inside, their magnitude becomes very small. This is called the **vanishing gradient** problem. As a result, the network gets trapped in a poor local minimum. Backpropagation still works if we start from the neighborhood of a good minimum. To achieve this, a deep neural network (DNN) is often pretrained in an unsupervised way using a DBN: instead of starting from random weights, first train a DBN in an unsupervised way and then use its weights as the initial weights of the corresponding supervised DNN. Such DNNs pretrained using DBNs were found to perform much better.
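A rough numerical illustration of why gradients shrink with depth: with sigmoid units, each layer that the error signal passes back through multiplies it by a sigmoid derivative, which is never larger than 0.25. The sketch below looks only at that factor and deliberately ignores the weights, so it is a simplified bound under that assumption rather than a full backpropagation computation; the depths chosen are arbitrary.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid_grad <- function(x) sigmoid(x) * (1 - sigmoid(x))

# The sigmoid derivative peaks at x = 0.
max_grad <- sigmoid_grad(0)

# Largest possible product of sigmoid derivatives after passing back
# through `depth` layers (weights ignored).
depth <- c(1, 3, 5, 10)
bound <- max_grad^depth
data.frame(depth = depth, bound = bound)
```

Even in this best case the factor falls below one in a million by ten layers, which is why a good starting point for the weights matters so much in deep networks.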

The layer-wise pretraining of a DBN proceeds as follows. Start with the first RBM and train it using the input data in the visible layer and the CD algorithm (or one of its more recent, improved variants). Then, stack a second RBM on top of this. For this RBM, use values sampled from the hidden-layer conditional distribution $p(h \mid v)$ of the first RBM as the values for the visible layer. Continue this process for the desired number of layers. The outputs of the hidden units of the top layer can also be used as inputs for training a supervised model. For this, add a conventional NN layer at the top of the DBN with the desired number of classes as the number of output nodes. The input for this NN is the output from the top layer of the DBN. This is called the **DBN-DNN architecture**. Here, the DBN's role is to automatically generate highly efficient features (the output of the top layer of the DBN) from the input data for the supervised NN in the top layer. The architecture of a five-layer DBN-DNN for a binary classification task is shown in the following figure:
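The greedy stacking procedure can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the darch implementation: `train_rbm`, `pretrain_dbn`, and `bern` are hypothetical helper names, each RBM is trained with a plain CD-1 loop on the full batch, and the data are random bits used only to exercise the code.

```r
set.seed(7)
sigmoid <- function(x) 1 / (1 + exp(-x))

# Sample a 0/1 matrix with the given matrix of success probabilities.
bern <- function(p) (matrix(runif(length(p)), nrow(p), ncol(p)) < p) * 1

# Train one Bernoulli RBM with CD-1 on the full batch V.
train_rbm <- function(V, n_hidden, epochs = 200, lr = 0.1) {
  W <- matrix(rnorm(ncol(V) * n_hidden, sd = 0.01), ncol(V), n_hidden)
  a <- rep(0, n_hidden)   # hidden biases
  b <- rep(0, ncol(V))    # visible biases
  for (e in seq_len(epochs)) {
    H0 <- bern(sigmoid(sweep(V %*% W, 2, a, "+")))
    V1 <- bern(sigmoid(sweep(H0 %*% t(W), 2, b, "+")))
    H1 <- bern(sigmoid(sweep(V1 %*% W, 2, a, "+")))
    W <- W + lr * (t(V) %*% H0 - t(V1) %*% H1) / nrow(V)
    a <- a + lr * colMeans(H0 - H1)
    b <- b + lr * colMeans(V - V1)
  }
  list(W = W, a = a, b = b)
}

# Greedy stacking: hidden samples of one RBM are the visible data of the next.
pretrain_dbn <- function(X, hidden_sizes) {
  rbms <- list()
  input <- X
  for (k in seq_along(hidden_sizes)) {
    rbms[[k]] <- train_rbm(input, hidden_sizes[k])
    input <- bern(sigmoid(sweep(input %*% rbms[[k]]$W, 2, rbms[[k]]$a, "+")))
  }
  list(rbms = rbms, top_features = input)
}

# Usage: 20 random binary samples with 6 features, two stacked RBMs.
X <- matrix(rbinom(20 * 6, 1, 0.5), nrow = 20)
dbn <- pretrain_dbn(X, hidden_sizes = c(4, 3))
dim(dbn$top_features)
```

The `top_features` matrix plays the role of the DBN output that would be fed to the supervised layer in a DBN-DNN, and the stored `W` matrices are the weights that would initialize the corresponding DNN.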

The last layer is trained using the backpropagation algorithm in a supervised manner for the two classes $c_1$ and $c_2$. We will illustrate the training and classification with such a DBN-DNN using the darch R package.

## The darch R package

The darch package, written by Martin Drees, is one of the R packages by which one can begin doing deep learning in R. It implements the DBN described in the previous section. The package can be downloaded from https://cran.r-project.org/web/packages/darch/index.html.

The main class in the darch package implements deep architectures and provides the ability to train them with Contrastive Divergence and to fine-tune them with backpropagation, resilient backpropagation, and conjugate gradients. New instances of the class are created with the *newDArch* constructor. It is called with the following arguments: a vector containing the number of nodes in each layer, the batch size, a Boolean variable indicating whether to use the **ff** package for computing weights and outputs, and the name of the function for generating the weight matrices. Let us create a network with two input units, four hidden units, and one output unit:

```
install.packages("darch")  # one-time installation
library(darch)

# Network with 2 input units, 4 hidden units, and 1 output unit
darch <- newDArch(c(2, 4, 1), batchSize = 2,
                  genWeightFunc = generateWeights)
# INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers.
# INFO [2015-07-19 18:50:29] Generating RBMs.
# INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible and 4 hidden units.
# INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible and 1 hidden units.
```

Let us train the DBN with a toy dataset. We use a toy dataset because training on any realistic example would take a long time: hours, if not days. Let us create an input dataset containing two columns and four rows, together with the corresponding outputs:

```
# Four input rows with two columns each, and one target value per row
inputs <- matrix(c(0, 0, 0, 1, 1, 0, 1, 1), ncol = 2, byrow = TRUE)
outputs <- matrix(c(0, 1, 1, 0), nrow = 4)
```
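The toy dataset above is the XOR truth table: each output is 1 exactly when the two inputs differ. A quick check, as a sketch using only base R (the variable name `xor_target` is introduced here for illustration):

```r
inputs <- matrix(c(0, 0, 0, 1, 1, 0, 1, 1), ncol = 2, byrow = TRUE)
outputs <- matrix(c(0, 1, 1, 0), nrow = 4)

# XOR of the two input columns, converted from logical to 0/1
xor_target <- as.numeric(xor(inputs[, 1], inputs[, 2]))
all(outputs[, 1] == xor_target)
```

XOR is not linearly separable, which is why it is a common minimal test for networks with at least one hidden layer.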

Now, let us pretrain the DBN using the input data:

`darch <- preTrainDArch(darch, inputs, maxEpoch = 1000)`

We can look at the weights learned at any layer using the *getLayerWeights()* function. Let us see what the hidden layer looks like:

```
getLayerWeights(darch, index = 1)
# [[1]]
#           [,1]        [,2]      [,3]      [,4]
# [1,]  8.167022   0.4874743 -7.563470 -6.951426
# [2,]  2.024671 -10.7012389  1.313231  1.070006
# [3,] -5.391781   5.5878931  3.254914  3.000914
```

Now, let's run backpropagation for supervised learning. For this, we first need to set the layer functions to *sigmoidUnitDerivative*:

```
layers <- getLayers(darch)
# Set the unit function of every layer, from the last layer to the first
for (i in length(layers):1) {
  layers[[i]][[2]] <- sigmoidUnitDerivative
}
setLayers(darch) <- layers
rm(layers)
```

Finally, the following two lines perform the backpropagation:

```
setFineTuneFunction(darch) <- backpropagation
darch <- fineTuneDArch(darch, inputs, outputs, maxEpoch = 1000)
```

We can see the prediction quality of DBN on the training data itself by running *darch* as follows:

```
# Run the network on the training inputs and read the output layer
darch <- getExecuteFunction(darch)(darch, inputs)
outputs_darch <- getExecOutputs(darch)
outputs_darch[[2]]
#              [,1]
# [1,] 9.998474e-01
# [2,] 4.921130e-05
# [3,] 9.997649e-01
# [4,] 3.796699e-05
```

Compared with the actual output, the DBN has predicted the wrong output for the first and second input rows. Since this example is meant only to illustrate how to use the darch package, we are not worried about the 50% accuracy here.
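The 50% figure can be reproduced from the numbers printed above. The sketch below copies the four network outputs and the target values from the toy dataset; the 0.5 cutoff for turning an output into a class label is an assumption made for this illustration.

```r
# Network outputs as printed above, and the targets from the toy dataset
predicted <- c(9.998474e-01, 4.921130e-05, 9.997649e-01, 3.796699e-05)
actual <- c(0, 1, 1, 0)

# Threshold at 0.5 (an assumed cutoff) to get class labels
predicted_class <- as.integer(predicted > 0.5)

wrong_rows <- which(predicted_class != actual)
accuracy <- mean(predicted_class == actual)

wrong_rows  # rows 1 and 2
accuracy    # 0.5
```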

## Other deep learning packages in R

Although there are some other deep learning packages in R, such as **deepnet** and **RcppDL**, R does not yet have good native libraries for deep learning compared with libraries in other languages, such as **Cuda** (C++) and **Theano** (Python). The main alternative is a wrapper for the Java-based open source deep learning project H2O. This R package, **h2o**, allows running H2O via its REST API from within R. Readers who are interested in serious deep learning projects and applications should use H2O through the h2o package in R. H2O must be installed on your machine to use h2o.

## Summary

We have learned about one of the latest advances in neural networks, called deep learning. It can be used to solve many problems that involve highly cognitive elements, such as computer vision and natural language processing. Artificial intelligence systems using deep learning have achieved accuracies comparable to human performance in tasks such as speech recognition and image classification.

To learn more about Bayesian modeling in R, check out *Learning Bayesian Models with R* (https://www.packtpub.com/big-data-and-business-intelligence/learning-bayesian-models-r).
