In this chapter, we will cover the following topics:
- Installing R with an IDE
- Installing a Jupyter Notebook application
- Starting with the basics of machine learning in R
- Setting up deep learning tools/packages in R
- Installing MXNet in R
- Installing TensorFlow in R
- Installing H2O in R
- Installing all three packages at once using Docker
This chapter will get you started with deep learning and help you set up your systems to develop deep learning models. The chapter is more focused on giving the audience a heads-up on what is expected from the book and the prerequisites required to go through the book. The current book is intended for students or professionals who want to quickly build a background in the applications of deep learning. The book will be more practical and application-focused using R as a tool to build deep learning models.
For a detailed theory on deep learning, refer to Deep Learning by Goodfellow et al. 2016. For a machine learning background refer Python Machine Learning by S. Raschka, 2015.
We will use the R programming language to demonstrate applications of deep learning. You are expected to have the following prerequisites throughout the book:
- Basic R programming knowledge
- Basic understanding of Linux; we will use the Ubuntu (16.04) operating system
- Basic understanding of machine learning concepts
- For Windows or macOS, a basic understanding of Docker
Before we begin, let's install an IDE for R. For R the most popular IDEs are Rstudio and Jupyter. Rstudio is dedicated to R whereas Jupyter provide multi-language support including R. Jupyter also provides an interactive environment and allow you to combine code, text, and graphics into a single notebook.
R supports multiple operating systems such as Windows, macOS X, and Linux. The installation files for R can be downloaded from any one of the mirror sites at Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/. The CRAN is also a major repository for packages in R. The programming language R is available under both 32-bit and 64-bit architectures.
- Of r-base-dev is also highly recommended as it has many inbuilt functions. It also enables the
install.packages()command, which is used to compile and install new R packages directly from the CRAN using the R console. The default R console looks as follows:
Default R console
- For programming purposes, an Integrated Development Environment (IDE) is recommended as it helps enhance productivity. One of the most popular open source IDEs for R is Rstudio. Rstudio also provides you with an Rstudio server, which facilitates a web-based environment to program in R. The interface for the Rstudio IDE is shown in the following screenshot:
Rstudio Integrated Development Environment for R
Another famous editor these days is the Jupyter Notebook app. This app produces notebook documents that integrate documentation, code, and analysis together. It supports many computational kernels including R. It is a server, client-side, web-based application that can be accessed using a browser.
Jupyter Notebook can be installed using the following steps:
- Jupyter Notebook can be installed using
pip3 install --upgrade pip pip3 install jupyter
- If you have installed Anaconda, then the default computational kernel installed is Python. To install an R computation kernel in Jupyter within the same environment, type the following command in a terminal:
conda install -c r r-essentials
- To install the R computational kernel in a new environment named
new-envwithin conda, type as follows:
conda create -n new-env -c r r-essentials
- Another way to include the R computational kernel in Jupyter Notebook uses the
IRkernelpackage. To install through this process, start the R IDE. The first step is to install dependencies required for the
chooseCRANmirror(ind=55) # choose mirror for installation install.packages(c('repr', 'IRdisplay', 'crayon', 'pbdZMQ', 'devtools'), dependencies=TRUE)
- Once all the dependencies are installed from CRAN, install the
IRkernalpackage from GitHub:
library(devtools) library(methods) options(repos=c(CRAN='https://cran.rstudio.com')) devtools::install_github('IRkernel/IRkernel')
- Once all the requirements are satisfied, the R computation kernel can be set up in Jupyter Notebook using the following script:
library(IRkernel) IRkernel::installspec(name = 'ir32', displayname = 'R 3.2')
- Jupyter Notebook can be started by opening a shell/terminal. Run the following command to start the Jupyter Notebook interface in the browser, as shown in the screenshot following this code:
Jupyter Notebook with the R computation engine
R, as with most of the packages utilized in this book, is supported by most operating systems. However, you can make use of Docker or VirtualBox to set up a working environment similar to the one used in this book.
For Docker installation and setup information, refer to https://docs.docker.com/ and select the Docker image appropriate to your operating system. Similarly, VirtualBox binaries can be downloaded and installed at https://www.virtualbox.org/wiki/Downloads.
Deep learning is a subcategory of machine learning inspired by the structure and functioning of a human brain. In recent times, deep learning has gained a lot of traction primarily because of higher computational power, bigger datasets, and better algorithms with (artificial) intelligent learning abilities and more inquisitiveness for data-driven insights. Before we get into the details of deep learning, let's understand some basic concepts of machine learning that form the basis for most analytical solutions.
Machine learning is an arena of developing algorithms with the ability to mine natural patterns from the data such that better decisions are made using predictive insights. These insights are pertinent across a horizon of real-world applications, from medical diagnosis (using computational biology) to real-time stock trading (using computational finance), from weather forecasting to natural language processing, from predictive maintenance (in automation and manufacturing) to prescriptive recommendations (in e-commerce and e-retail), and so on.
The following figure elucidates two primary techniques of machine learning; namely, supervised learning and unsupervised learning:
Classification of different techniques in machine learning
Supervised learning: Supervised learning is a form of evidence-based learning. The evidence is the known outcome for a given input and is in turn used to train the predictive model. Models are further classified into regression and classification, based on the outcome data type. In the former, the outcome is continuous, and in the latter the outcome is discrete. Stock trading and weather forecasting are some widely used applications of regression models, and span detection, speech recognition, and image classification are some widely used applications of classification models.
Some algorithms for regression are linear regression, Generalized Linear Models (GLM), Support Vector Regression (SVR), neural networks, decision trees, and so on; in classification, we have logistic regression, Support Vector Machines (SVM), Linear discriminant analysis (LDA), Naive Bayes, nearest neighbors, and so on.
Semi-supervised learning: Semi-supervised learning is a class of supervised learning using unsupervised techniques. This technique is very useful in scenarios where the cost of labeling an entire dataset is highly impractical against the cost of acquiring and analyzing unlabeled data.
Unsupervised learning: As the name suggests, learning from data with no outcome (or supervision) is called unsupervised learning. It is a form of inferential learning based on hidden patterns and intrinsic groups in the given data. Its applications include market pattern recognition, genetic clustering, and so on.
Some widely used clustering algorithms are k-means, hierarchical, k-medoids, Fuzzy C-means, hidden markov, neural networks, and many more.
Let's take a look at linear regression in supervised learning:
- Let's begin with a simple example of linear regression where we need to determine the relationship between men's height (in cms) and weight (in kgs). The following sample data represents the height and weight of 10 random men:
data <- data.frame("height" = c(131, 154, 120, 166, 108, 115, 158, 144, 131, 112), "weight" = c(54, 70, 47, 79, 36, 48, 65, 63, 54, 40))
- Now, generate a linear regression model as follows:
lm_model <- lm(weight ~ height, data)
- The following plot shows the relationship between men's height and weight along with the fitted line:
plot(data, col = "red", main = "Relationship between height and weight",cex = 1.7, pch = 1, xlab = "Height in cms", ylab = "Weight in kgs") abline(lm(weight ~ height, data))
Linear relationship between weight and height
- In semi-supervised models, the learning is primarily initiated using labeled data (a smaller quantity in general) and then augmented using unlabeled data (a larger quantity in general).
Let's perform K-means clustering (unsupervised learning) on a widely used dataset, iris.
- This dataset consists of three different species of iris (Setosa, Versicolor, and Virginica) along with their distinct features such as sepal length, sepal width, petal length, and petal width:
data(iris) head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa
- The following plots show the variation of features across irises. Petal features show a distinct variation as against sepal features:
library(ggplot2) library(gridExtra) plot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) geom_point(size = 2) ggtitle("Variation by Sepal features") plot2 <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) geom_point(size = 2) ggtitle("Variation by Petal features") grid.arrange(plot1, plot2, ncol=2)
Variation of sepal and petal features by length and width
- As petal features show a good variation across irises, let's perform K-means clustering using the petal length and petal width:
set.seed(1234567) iris.cluster <- kmeans(iris[, c("Petal.Length","Petal.Width")], 3, nstart = 10) iris.cluster$cluster <- as.factor(iris.cluster$cluster)
- The following code snippet shows a cross-tab between clusters and species (irises). We can see that cluster 1 is primarily attributed with setosa, cluster 2 with versicolor, and cluster 3 with virginica:
> table(cluster=iris.cluster$cluster,species= iris$Species) species cluster setosa versicolor virginica 1 50 0 0 2 0 48 4 3 0 2 46 ggplot(iris, aes(Petal.Length, Petal.Width, color = iris.cluster$cluster)) + geom_point() + ggtitle("Variation by Clusters")
- The following plot shows the distribution of clusters:
Variation of irises across three clusters
Model evaluation is a key step in any machine learning process. It is different for supervised and unsupervised models. In supervised models, predictions play a major role; whereas in unsupervised models, homogeneity within clusters and heterogeneity across clusters play a major role.
Some widely used model evaluation parameters for regression models (including cross validation) are as follows:
- Coefficient of determination
- Root mean squared error
- Mean absolute error
- Akaike or Bayesian information criterion
Some widely used model evaluation parameters for classification models (including cross validation) are as follows:
- Confusion matrix (accuracy, precision, recall, and F1-score)
- Gain or lift charts
- Area under ROC (receiver operating characteristic) curve
- Concordant and discordant ratio
Some of the widely used evaluation parameters of unsupervised models (clustering) are as follows:
- Contingency tables
- Sum of squared errors between clustering objects and cluster centers or centroids
- Silhouette value
- Rand index
- Matching index
- Pairwise and adjusted pairwise precision and recall (primarily used in NLP)
Bias and variance are two key error components of any supervised model; their trade-off plays a vital role in model tuning and selection. Bias is due to incorrect assumptions made by a predictive model while learning outcomes, whereas variance is due to model rigidity toward the training dataset. In other words, higher bias leads to underfitting and higher variance leads to overfitting of models.
In bias, the assumptions are on target functional forms. Hence, this is dominant in parametric models such as linear regression, logistic regression, and linear discriminant analysis as their outcomes are a functional form of input variables.
Variance, on the other hand, shows how susceptible models are to change in datasets. Generally, target functional forms control variance. Hence, this is dominant in non-parametric models such as decision trees, support vector machines, and K-nearest neighbors as their outcomes are not directly a functional form of input variables. In other words, the hyperparameters of non-parametric models can lead to overfitting of predictive models.
The major deep learning packages are developed in C/C++ for efficiency purposes and wrappers are developed in R to efficiently develop, extend, and execute deep learning models.
A lot of open source deep learning libraries are available. The prominent libraries in this area are as follows:
There are other prominent packages available on the market such as H2O, CNTK (Microsoft Cognitive Toolkit), darch, Mocha, and ConvNetJS. There are a lot of wrappers that are developed around these packages to support the easy development of deep learning models, such as Keras and Lasagne in Python and MXNet, both supporting multiple languages.
- This chapter will cover the MXNet and TensorFlow packages (developed in C++ and CUDA for a highly optimized performance in GPU).
- Additionally, the
h2opackage will be used to develop some deep learning models. The
h2opackage in R is implemented as a REST API, which connects to the H2O server (it runs as Java Virtual Machines (JVM)). We will provide quick setup instructions for these packages in the following sections
This section will cover the installation of MXNet in R.
The MXNet package is a lightweight deep learning architecture supporting multiple programming languages such as R, Python, and Julia. From a programming perspective, it is a combination of symbolic and imperative programming with support for CPU and GPU.
The CPU-based MXNet in R can be installed using the prebuilt binary package or the source code where the libraries need to be built. In Windows/mac, prebuilt binary packages can be download and installed directly from the R console. MXNet requires the R version to be 3.2.0 and higher. The installation requires the
drat package from CRAN. The
drat package helps maintain R repositories and can be installed using the
To install MXNet on Linux (13.10 or later), the following are some dependencies:
- Git (to get the code from GitHub)
- libatlas-base-dev (to perform linear algebraic operations)
- libopencv-dev (to perform computer vision operations)
To install MXNet with a GPU processor, the following are some dependencies:
- Microsoft Visual Studio 2013
- The NVIDIA CUDA Toolkit
- The MXNet package
- cuDNN (to provide a deep neural network library)
Another quick way to install
mxnet with all the dependencies is to use the prebuilt Docker image from the
chstone repository. The
mxnet-gpu Docker image will be installed using the following tools:
- MXNet for R and Python
- Ubuntu 16.04
- CUDA (Optional for GPU)
- cuDNN (Optional for GPU)
- The following R command installs MXNet using prebuilt binary packages, and is hassle-free. The
dratpackage is then used to add the
dlmcrepository from git followed by the
install.packages("drat", repos="https://cran.rstudio.com") drat:::addRepo("dmlc") install.packages("mxnet")
2. The following code helps install MXNet in Ubuntu (V16.04). The first two lines are used to install dependencies and the remaining lines are used to install MXNet, subject to the satisfaction of all the dependencies:
sudo apt-get update sudo apt-get install -y build-essential git libatlas-base-dev libopencv-dev git clone https://github.com/dmlc/mxnet.git ~/mxnet --recursive cd ~/mxnet cp make/config.mk . echo "USE_BLAS=openblas" >>config.mk make -j$(nproc)
3. If MXNet is to be built for GPU, the following
config needs to be updated before the
echo "USE_CUDA=1" >>config.mk echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk echo "USE_CUDNN=1" >>config.mk
A detailed installation of MXNet for other operating systems can be found at http://mxnet.io/get_started/setup.html.
4. The following command is used to install MXNet (GPU-based) using Docker with all the dependencies:
docker pull chstone/mxnet-gpu
This section will cover another very popular open source machine learning package, TensorFlow, which is very effective in building deep learning models.
TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The
tensorflow package in R gives you access to the TensorFlow API composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations.
tensorflow package in R calls the Python tensorflow API for execution, which is essential to install the
tensorflow package in both R and Python to make R work. The following are the dependencies for
- Python 2.7 / 3.x
- R (>3.2)
- devtools package in R for installing TensorFlow from GitHub
- TensorFlow in Python
- Once all the mentioned dependencies are installed,
tensorflowcan be installed from
devtoolsdirectly using the
install_githubcommand as follows:
- Before loading
tensorflowin R, you need to set up the path for Python as the system environment variable. This can be done directly from the R environment, as shown in the following command:
If the Python
tensorflow module is not installed, R will give the following error:
Error raised by R if tensorflow in Python is not installed
tensorflow in Python can be installed using
pip install tensorflow # Python 2.7 with no GPU support pip3 install tensorflow # Python 3.x with no GPU support pip install tensorflow-gpu # Python 2.7 with GPU support pip3 install tensorflow-gpu # Python 3.x with GPU support
TensorFlow follows directed graph philosophy to set up computational models where mathematical operations are represented as nodes with each node supporting multiple input and output while the edges represent the communication of data between nodes. There are also edges known as control dependencies in TensorFlow that do not represent the data flow; rather the provide information related to control dependence such as node for the control dependence must finish processing before the destination node of control dependence starts executing.
An example TensorFlow graph for logistic regression scoring is shown in the following diagram:
TensorFlow graph for logistic regression
The preceding figure illustrates the TensorFlow graph to score logistic regression with optimized weights:
The MatMul node performs matrix multiplication between input feature matrix X and optimized weight β. The constant C is then added to the output from the MatMul node. The output from Add is then transformed using the Sigmoid function to output Pr(y=1|X).
Get started with TensorFlow in R using resources at https://rstudio.github.io/tensorflow/.
H2O is another very popular open source library to build machine learning models. It is produced by H2O.ai and supports multiple languages including R and Python. The H2O package is a multipurpose machine learning library developed for a distributed environment to run algorithms on big data.
To set up H2O, the following systems are required:
- 64-bit Java Runtime Environment (version 1.6 or later)
- Minimum 2 GB RAM
H2O from R can be called using the
h2o package. The
h2o package has the following dependencies:
For machines that do not have curl-config installed, the RCurl dependency installation will fail in R and curl-config needs to be installed outside R.
- H2O can be installed directly from CRAN with the dependency parameter TRUE to install all CRAN-related
h2odependencies. This command will install all the R dependencies required for the
install.packages("h2o", dependencies = T)
- The following command is used to call the
h2opackage in the current R environment. The first-time execution of the
h2opackage will automatically download the JAR file before launching H2O, as shown in the following figure:
library(h2o) localH2O = h2o.init()
Starting H2O cluster
- The H2O cluster can be accessed using clusterip and port information. The current H2O cluster is running on localhost at port
54321, as shown in the following screenshot:
H2O cluster running in the browser
Let's build a logistic regression interactively using the H2O browser.
- Start a new flow, as shown in the following screenshot:
Creating a new flow in H2O
- Import a dataset using the Data menu, as shown in the following screenshot:
Importing files to the H2O environment
- The imported file in H2O can be parsed into the hex format (the native file format for H2O) using the Parse these files action, which will appear once the file is imported to the H2O environment:
Parsing the file to the hex format
- The parsed data frame in H2O can be split into training and validation using the
Split Frameaction, as shown in the following screenshot:
Splitting the dataset into training and validation
- Select the model from the
Model menu and set up the model-related parameters. An example for a glm model is seen in the following screenshot:
Building a model in H2O
predictaction can be used to score another hex data frame in H2O:
Scoring in H2O
For more complicated scenarios that involve a lot of preprocessing, H2O can be called from R directly. This book will focus more on building models using H2O from R directly. If H2O is set up at a different location instead of localhost, then it can be connected within R by defining the correct
port at which the cluster is running:
localH2O = h2o.init(ip = "localhost", port = 54321, nthreads = -1)
Another critical parameter is the number of threads to be used to build the model; by default, n threads are set to -2, which means that two cores will be used. The value of -1 for n threads will make use of all available cores.
http://docs.h2o.ai/h2o/latest-stable/index.html#gettingstarted is very good using H2O in interactive mode.
Docker is a software-contained platform that is used to host multiple software or apps side by side in isolated containers to get better computing density. Unlike virtual machines, containers are built only using libraries and the settings required by any software but do not bundle the entire operating system, thus making it lightweight and efficient.
Setting up all three packages could be quite cumbersome depending on the operating system utilized. The following dockerfile code can be used to set up an environment with
mxnet with GPU, and
h2o installed with all the dependencies:
FROM chstone/mxnet-gpu:latest MAINTAINER PKS Prakash <prakash5801> # Install dependencies RUN apt-get update && apt-get install -y python2.7 python-pip python-dev ipython ipython-notebook python-pip default-jre # Install pip and Jupyter notebook RUN pip install --upgrade pip && pip install jupyter # Add R to Jupyter kernel RUN Rscript -e "install.packages(c('repr', 'IRdisplay', 'crayon', 'pbdZMQ'), dependencies=TRUE, repos='https://cran.rstudio.com')" && Rscript -e "library(devtools); library(methods); options(repos=c(CRAN='https://cran.rstudio.com')); devtools::install_github('IRkernel/IRkernel')" && Rscript -e "library(IRkernel); IRkernel::installspec(name = 'ir32', displayname = 'R 3.2')" # Install H2O RUN Rscript -e "install.packages('h2o', dependencies=TRUE, repos='http://cran.rstudio.com')" # Install tensorflow fixing the proxy port RUN pip install tensorflow-gpu RUN Rscript -e "library(devtools); devtools::install_github('rstudio/tensorflow')"
The current image is created on top of the
chstone/mxnet-gpu Docker image.
The chstone/mxnet-gpu is a docker hub repository at https://hub.docker.com/r/chstone/mxnet-gpu/.
Docker will all dependencies can be installed using following steps:
- Save the preceding code to a location with a name, say,
- Using the command line, go to the file location and use the following command and it is also shown in the screenshot after the command:
docker run -t "TagName:FILENAME"
Building the docker image
- Access the image using the
docker imagescommand as follows:
View docker images
- Docker images can be executed using the following command:
docker run -it -p 8888:8888 -p 54321:54321 <<IMAGE ID>>
Running a Docker image
Here, the option -i is for interactive mode and -t is to allocate --tty. The option -p is used to forward the port. As we will be running Jupyter on port
8888 and H2O on
54321, we have forwarded both ports to accessible from the local browser.