You're reading from  Learning Bayesian Models with R

Product type: Book
Published in: Oct 2015
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781783987603
Edition: 1st Edition
Author (1)
Hari Manassery Koduvely

Dr. Hari M. Koduvely is an experienced data scientist working at the Samsung R&D Institute in Bangalore, India. He has a PhD in statistical physics from the Tata Institute of Fundamental Research, Mumbai, India, and post-doctoral experience from the Weizmann Institute, Israel, and Georgia Tech, USA. Prior to joining Samsung, the author has worked for Amazon and Infosys Technologies, developing machine learning-based applications for their products and platforms. He also has several publications on Bayesian inference and its applications in areas such as recommendation systems and predictive health monitoring. His current interest is in developing large-scale machine learning methods, particularly for natural language understanding.

Chapter 2. The R Environment

R is currently one of the most popular programming environments for statistical computing. It evolved as an open source language from the S programming language developed at Bell Labs. The main creators of R are two academics, Robert Gentleman and Ross Ihaka, from the University of Auckland in New Zealand.

The main reasons for the popularity of R, apart from its being free software under the GNU General Public License, are the following:

  • R is very easy to use. It is an interpreted language and at the same time can be used for procedural programming.

  • R supports both functional and object-oriented paradigms. It has very strong graphical and data visualization capabilities.

  • Through its LaTeX-like documentation support, R can be used for making high-quality documentation.

  • Being open source software, R has a large number of contributed packages that make almost all statistical modeling possible in this environment.

This chapter is intended to give a basic introduction to R...

Setting up the R environment and packages


R is free software distributed under the GNU General Public License. It comes with a base package and also has a large number of user-contributed packages for advanced analysis and modeling. A popular graphical user interface-based editor for R is RStudio. In this section, we will learn how to download R, set up the R environment on your computer, and write a simple R program.

Installing R and RStudio

The Comprehensive R Archive Network (CRAN) hosts all releases of R and the contributed packages. R for Windows can be installed by downloading the binary of the base package from http://cran.r-project.org; a standard installation should be sufficient. For Linux and Mac OS X, the webpage gives instructions on how to download and install the software. At the time of writing this book, the latest release was version 3.1.2. Various packages need to be installed separately from the package page. One can install any package from the R command prompt using the following...

Managing data in R


Before we start any serious programming in R, we need to learn how to import data into an R environment and which data types R supports. Often, for a particular analysis, we will not use the entire dataset. Therefore, we need to also learn how to select a subset of the data for any analysis. This section will cover these aspects.
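As a rough illustration of importing data and selecting a subset, the following sketch reads a small table and keeps only some rows and columns. The data and the names autos, small, and small_alt are hypothetical and chosen only for this example; in practice, read.table() would usually be given a file path rather than inline text.

```r
# Hypothetical toy data supplied inline instead of a file on disk.
raw <- "name mpg cylinders
carA 18 8
carB 26 4
carC 31 4"

# header = TRUE tells read.table() that the first line holds column names.
autos <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)

# Select the rows with 4 cylinders and keep only the name and mpg columns.
small <- autos[autos$cylinders == 4, c("name", "mpg")]

# The same selection written with subset().
small_alt <- subset(autos, cylinders == 4, select = c(name, mpg))
```

Both forms should give the same two-row data frame; bracket indexing is more explicit, while subset() is often more convenient at the command prompt.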

Data Types in R

R has five basic data types as follows:

  • Integer

  • Numeric (real)

  • Complex

  • Character

  • Logical (TRUE/FALSE)

The default representation of numbers in R is a double-precision real number (numeric). If you want an explicit integer representation, you need to add the suffix L. For example, simply entering 1 at the command prompt will store 1 as a numeric object. To store 1 as an integer, you need to enter 1L. The command class(x) gives the class (type) of the object x. Therefore, entering class(1) at the command prompt gives the answer numeric, whereas entering class(1L) gives the answer integer.
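The behavior described above can be checked with a short sketch. The variable name value_classes is an assumption made for this example only.

```r
# One literal value for each of the five basic data types.
values <- list(1, 1L, 2 + 3i, "a", TRUE)

# class() reports the type of each value.
value_classes <- sapply(values, class)
print(value_classes)
# "numeric" "integer" "complex" "character" "logical"

# is.integer() distinguishes the two number representations.
is.integer(1)    # FALSE: stored as a double
is.integer(1L)   # TRUE: the L suffix requests an integer
```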

R also has a special number Inf that represents...

Writing R programs


Although much data analysis in R can be carried out interactively at the command prompt, more complex tasks often require writing R scripts. As mentioned in the introduction, R supports both the functional and the object-oriented programming paradigms. In this section, some of the standard syntax of R programming is described.

Control structures

Control structures are meant for controlling the flow of execution of a program. The standard control structures are as follows:

  • if and else: To test a condition

  • for: To loop over a set of statements for a fixed number of times

  • while: To loop over a set of statements while a condition is true

  • repeat: To execute an infinite loop

  • break: To break the execution of a loop

  • next: To skip an iteration of a loop

  • return: To exit a function
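The control structures listed above can be seen together in a minimal sketch. The function first_odds and the variables total, parity, and steps are hypothetical names introduced only for illustration.

```r
# Hypothetical helper: collect the first n odd integers.
first_odds <- function(n) {
  if (n <= 0) {
    return(integer(0))             # return: exit the function early
  }
  odds <- integer(0)
  i <- 0L
  repeat {                         # repeat: loops until break is reached
    i <- i + 1L
    if (i %% 2L == 0L) next        # next: skip even numbers
    odds <- c(odds, i)
    if (length(odds) == n) break   # break: leave the loop
  }
  odds
}

# for: a fixed number of iterations
total <- 0
for (k in 1:5) {
  total <- total + k
}

# if and else: test a condition
parity <- if (total %% 2 == 0) "even" else "odd"

# while: iterate as long as the condition holds
x <- 100
steps <- 0
while (x > 1) {
  x <- x / 2
  steps <- steps + 1
}

first_odds(4)   # 1 3 5 7
total           # 15
parity          # "odd"
steps           # 7
```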

Functions

If one wants to use R for more serious programming, it is essential to know how to write functions. They make the language more powerful and elegant. R has many...

Data visualization


One of the powerful features of R is its set of functions for generating high-quality plots and visualizing data. The graphics functions in R can be divided into three groups:

  • High-level plotting functions to create new plots, add axes, labels, and titles.

  • Low-level plotting functions to add more information to an existing plot. This includes adding extra points, lines, and labels.

  • Interactive graphics functions to interactively add information to, or extract information from, an existing plot.
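The first two groups can be sketched with base graphics as follows. The data (a sine curve) and the annotations are assumptions chosen for illustration. The guard around pdf(NULL) is there only so that the code can also run as a script without writing a file.

```r
# When run non-interactively, open a device that creates no file.
if (!interactive()) pdf(NULL)

x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)

# High-level function: starts a new plot with axes, labels, and a title.
plot(x, y, type = "l", xlab = "x", ylab = "sin(x)", main = "Sine curve")

# Low-level functions: add information to the plot that is already open.
points(x[which.max(y)], max(y), pch = 19)   # mark the largest value
abline(h = 0, lty = 2)                      # dashed horizontal reference line
text(pi, 0.2, "zero crossing at pi")        # text label at (pi, 0.2)

if (!interactive()) invisible(dev.off())
```

The third group needs an interactive session. Base R functions such as locator() and identify() belong to it, so they are not shown in this script.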

The R base package itself contains several graphics functions. For more advanced graph applications, one can use packages such as ggplot2, grid, or lattice. In particular, ggplot2 is very useful for generating visually appealing, multilayered graphs. It is based on the concept of grammar of graphics. Due to lack of space, we are not covering these packages in this book. Interested readers should consult the book by Hadley Wickham (reference 4 in the References section of this chapter).

High...

Sampling


Often, we are interested in creating a representative dataset, for some analysis or design of experiments, by sampling from a population. This is particularly the case for Bayesian inference, as we will see in the later chapters, where samples are drawn from the posterior distribution for inference. Therefore, it is useful to learn in this chapter how to sample N points from some well-known distributions.

Before we use any particular sampling method, readers should note that R, like any other computer program, uses pseudorandom number generators for sampling. It is useful to supply a starting seed to get reproducible results. This can be done using the set.seed(n) command with an integer n as the seed.
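The effect of the seed can be checked with a minimal sketch. The seed value 42 and the variable names are arbitrary choices for this example. The draws use runif(), which is described in the next subsection.

```r
set.seed(42)          # arbitrary seed, chosen only for illustration
first <- runif(3)

set.seed(42)          # resetting the same seed restarts the same sequence
second <- runif(3)

third <- runif(3)     # no reset: the generator has moved on

identical(first, second)   # TRUE: same seed, same numbers
identical(first, third)    # FALSE: later draws from the same sequence
```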

Random uniform sampling from an interval

To generate n random numbers (numeric) that are uniformly distributed in the interval [a, b], one can use the runif() function:

>runif(5,1,10)  #generates 5 random numbers between 1 and 10
[1]  7.416    9.846    3.093   2.656...

Exercises


For the following exercises in this chapter, we use the Auto MPG dataset from the UCI Machine Learning repository (references 5 and 6 in the References section of this chapter). The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets.html. The dataset contains the fuel consumption of cars in the US measured during 1970-1982. Along with consumption values, there are attribute variables, such as the number of cylinders, displacement, horsepower, weight, acceleration, year, origin, and the name of the car:

  • Load the dataset into R using the read.table() function.

  • Produce a box plot of mpg values for each car name.

  • Write a function that will compute the scaled value (subtract the mean and divide by standard deviation) of a column whose name is given as an argument of the function.

  • Use the lapply() function to compute scaled values for all variables.

  • Produce a scatter plot of mpg versus acceleration for each car name using coplot(). Use legends to annotate the graph.

References


  1. Matloff N. The Art of R Programming – A Tour of Statistical Software Design. No Starch Press. 2011. ISBN-10: 1593273843

  2. Teetor P. R Cookbook. O'Reilly Media. 2011. ISBN-10: 0596809158

  3. Wickham H. Advanced R. Chapman & Hall/CRC The R Series. 2015. ISBN-10: 1466586966

  4. Wickham H. ggplot2: Elegant Graphics for Data Analysis (Use R!). Springer. 2010. ISBN-10: 0387981403

  5. Auto MPG Data Set, UCI Machine Learning repository, https://archive.ics.uci.edu/ml/datasets/Auto+MPG

  6. Quinlan R. "Combining Instance-Based and Model-Based Learning". In: Tenth International Conference of Machine Learning. 236-243. University of Massachusetts, Amherst. Morgan Kaufmann. 1993

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Summary


In this chapter, you were introduced to the R environment. Having read through this chapter, you have learned how to import data into R, select subsets of data for analysis, and write simple R programs using functions and control structures. You should also now be familiar with the graphical capabilities of R and some advanced capabilities, such as loop functions. In the next chapter, we will begin the central theme of this book, Bayesian inference.
