Naive Bayes is a classification algorithm based on Bayes' theorem, with a strong (naive) assumption that the features are independent of one another. Bayes' theorem estimates the probability of an event given prior knowledge of conditions related to it. So, overall, we use a set of feature values to estimate a class, assuming that the same conditions hold true when those features have similar values.
Our first implementation of naive Bayes uses the R programming language. The algorithm is implemented in the e1071 library; e1071 appears to have been the department identifier at the school where the package was developed.
We first install the package and load the libraries:
#install.packages("e1071", repos="http://cran.r-project.org")
library(e1071)
library(caret)
set.seed(7317)
data(iris)
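With the packages loaded and the data attached, a minimal naive Bayes fit on iris might look like the following sketch; the 75/25 split and the confusion table are illustrative choices, not from the original text:

```r
library(e1071)
library(caret)
set.seed(7317)
data(iris)

# hold out 25% of the rows for testing
indices <- createDataPartition(iris$Species, p=0.75, list=FALSE)
training <- iris[indices,]
testing <- iris[-indices,]

# naiveBayes() estimates P(Species | features) under the
# feature-independence assumption
model <- naiveBayes(Species ~ ., data=training)

# classify the held-out rows and compare against the known species
predictions <- predict(model, testing)
table(predictions, testing$Species)
```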
Some notes on these steps:
- The install.packages call is commented out, as we don't want to run it every time we run the script.
- e1071 is the package that contains the naive Bayes algorithm.
- The caret package contains the data partition function we use to split data into training and test sets.

Using nearest neighbor, we have an unclassified object and a set of objects whose classifications are known. We then take the attributes of the unclassified object, compare them against the known classifications in place, and select the class that is closest to our unknown. The comparisons resolve to Euclidean geometry, computing the distances between two points (where the known objects' attributes fall in comparison to the unknown object's attributes).
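This comparison can be sketched in R with the knn function from the class package; the package choice, the split, and k=3 are assumptions for illustration, not from the original text:

```r
library(class)   # assumed package; provides the knn function
set.seed(7317)

# split iris into classified (training) and unclassified (test) rows
indices <- sample(nrow(iris), floor(0.75 * nrow(iris)))
train <- iris[indices, 1:4]
test  <- iris[-indices, 1:4]

# knn measures the Euclidean distance from each test row to every
# training row and votes among the k closest classified neighbors
predicted <- knn(train, test, cl=iris$Species[indices], k=3)
table(predicted, iris$Species[-indices])
```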
For this example, we are using the housing data from the UCI Machine Learning Repository (ics.uci.edu). First, we load the data and assign column names:
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
summary(housing)
We reorder the data so that the key (the housing price, MDEV) is in ascending order:
housing <- housing[order(housing$MDEV),]
Now, we can split the data into a training set and a test set:
#install.packages("caret")
library(caret)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
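Applying the comparison to the housing data, a nearest-neighbor regression might be sketched with the knn.reg function from the FNN package; the package, the feature subset, and k=5 are assumptions, not from the original text:

```r
library(FNN)     # assumed package providing knn.reg for regression
library(caret)
set.seed(7317)

# load and split the housing data, repeating the steps above
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# an illustrative subset of features to measure distance over
trainX <- training[, c("RM", "LSTAT", "PRATIO")]
testX  <- testing[,  c("RM", "LSTAT", "PRATIO")]

# each test price is estimated from its 5 nearest (Euclidean) neighbors
fit <- knn.reg(trainX, testX, y=training$MDEV, k=5)
head(fit$pred)
```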
In this section, we will use decision trees to predict values. A decision tree has a logical flow: decisions are made based on attribute values, following the tree down to a leaf, where a classification is then provided.
For this example, we are using automobile characteristics, such as vehicle weight, to determine whether the vehicle will produce good mileage. The information is extracted from the page at https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.DecisionTrees. I copied the data out to Excel and then wrote it as a CSV for use in this example.
We load the rpart and caret libraries. rpart contains the decision tree modeling functions; caret has the data partition function:
library(rpart)
library(caret)
set.seed(3277)
We load in our mpg dataset and split it into a training and a testing set:
carmpg <- read.csv("car-mpg.csv")
indices <- createDataPartition(carmpg$mpg, p=0.75, list=FALSE)
training <- carmpg[indices,]
testing <- carmpg[-indices,]
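With the split in place, a tree fit might be sketched as follows; the formula and the method="class" setting are assumptions, since only the mpg column of the CSV is referenced in the original text:

```r
library(rpart)
library(caret)
set.seed(3277)

# assumes car-mpg.csv (from the page cited above) is in the working directory
carmpg <- read.csv("car-mpg.csv")
indices <- createDataPartition(carmpg$mpg, p=0.75, list=FALSE)
training <- carmpg[indices,]
testing <- carmpg[-indices,]

# fit a classification tree predicting mpg from all other attributes
fit <- rpart(mpg ~ ., data=training, method="class")

# classify the held-out rows and tabulate the results
predictions <- predict(fit, testing, type="class")
table(predictions, testing$mpg)
```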
We can model the housing data as a neural network where the different data elements are inputs to the system and the output of the network is the house price. With a neural net, we end up with a graphical model that provides the weights to apply to each input in order to arrive at our housing price.
There is a neural network package available in R. We load that in:
#install.packages('neuralnet', repos="http://cran.r-project.org")
library("neuralnet")
Load in the housing data:
filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
housing <- read.table(filename)
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
"RM", "AGE", "DIS", "RAD", "TAX", "PRATIO",
"B", "LSTAT", "MDEV")
Split up the housing data into training and test sets (we have seen this coding in prior examples):
housing <- housing[order(housing$MDEV),]
#install.packages("caret")
library(caret)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
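Once the data is split, a fit with neuralnet might be sketched as follows; the hidden-layer size and the spelled-out formula are illustrative choices (older versions of neuralnet do not accept the "." formula shorthand):

```r
#install.packages('neuralnet', repos="http://cran.r-project.org")
library("neuralnet")
library(caret)

# load and split the housing data, repeating the steps above
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# spell out the formula explicitly
f <- MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS +
  RAD + TAX + PRATIO + B + LSTAT

# one hidden layer of 5 units (an illustrative choice) and a linear
# output, since we are predicting a continuous price
nn <- neuralnet(f, data=training, hidden=5, linear.output=TRUE)

# run the held-out rows (all columns except MDEV) through the network
results <- compute(nn, testing[, 1:13])
head(results$net.result)
```

In practice the inputs are usually scaled before training; without scaling, neuralnet may fail to converge on this data.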
The random forests algorithm builds a number of randomized decision trees and aggregates their predictions into a single model, within the parameters used to drive the modeling.
With R we include the packages we are going to use:
#install.packages("randomForest", repos="http://cran.r-project.org")
library(randomForest)
Load the data:
filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
housing <- read.table(filename)
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
"RM", "AGE", "DIS", "RAD", "TAX", "PRATIO",
"B", "LSTAT", "MDEV")
Split it up:
housing <- housing[order(housing$MDEV),]
#install.packages("caret")
library(caret)
set.seed(5557)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
nrow(training)
nrow(testing)
Calculate our model:
forestFit <- randomForest(MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PRATIO + B + LSTAT, data=training)
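A sketch of fitting the model and evaluating it on the held-out rows might look like the following; the "." formula shorthand (all remaining columns as predictors) and the mean-squared-error measure are my assumptions, not from the original text:

```r
library(randomForest)
library(caret)
set.seed(5557)

# load and split the housing data, repeating the steps above
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# '.' is shorthand for all remaining columns as predictors
forestFit <- randomForest(MDEV ~ ., data=training)

# predict held-out prices and compute the mean squared error
predictions <- predict(forestFit, newdata=testing)
mean((predictions - testing$MDEV)^2)

# which inputs contributed most to the trees
importance(forestFit)
```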
In this chapter, we used several machine learning algorithms, implementing some of them in both R and Python to compare and contrast. We used naive Bayes to determine how the data might be classified. We applied nearest neighbor in a couple of different ways to see our results. We used decision trees to come up with a predictive algorithm. We tried to use a neural network to explain housing prices. Finally, we used the random forest algorithm on the same data—with the best results! In the next chapter, we will look at optimizing Jupyter notebooks.