Chapter 9. Machine Learning Using Jupyter

In this chapter, we will use several machine learning algorithms under Jupyter. We include code in both R and Python to show the breadth of options available to the Jupyter developer.

Naive Bayes


Naive Bayes is a classification algorithm based on Bayes' theorem, with the simplifying (naive) assumption that the features are independent of one another. Bayes' theorem estimates the probability of an event from prior knowledge of related conditions. In effect, we use a set of observed feature values to estimate the class of a new observation, on the assumption that observations with similar feature values belong to the same class.
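
In symbols, Bayes' theorem (stated here for reference; the text above gives it only in prose) says that the probability of a class C given observed features x1, ..., xn is

P(C | x1, ..., xn) = P(x1, ..., xn | C) * P(C) / P(x1, ..., xn)

and the naive independence assumption lets P(x1, ..., xn | C) factor into the product P(x1 | C) * ... * P(xn | C), which makes each per-feature probability easy to estimate from the training data.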

Naive Bayes using R

Our first implementation of naive Bayes uses the R programming language. The algorithm is implemented in the e1071 package; e1071 appears to have been the department identifier at the school where the package was developed.

We first install the package, and load the library:

#install.packages("e1071", repos="http://cran.r-project.org")
library(e1071)   # provides the naiveBayes() implementation
library(caret)   # provides data-partitioning helpers
set.seed(7317)   # make the results reproducible
data(iris)       # load the built-in iris dataset

Some notes on these steps:

  • The install.packages call is commented out as we don't want to run this every time we run the script.
  • e1071 is the naive Bayes algorithm package.
  • The caret package contains...
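
As a rough illustration of how such an example typically continues (this is a sketch, not the book's exact code; the split proportion, seed, and evaluation shown below are assumptions), we can partition iris with caret and fit the classifier with naiveBayes():

# Split iris into training and testing sets (a 75/25 split is assumed here)
indices <- createDataPartition(iris$Species, p=0.75, list=FALSE)
training <- iris[indices,]
testing <- iris[-indices,]

# Fit the naive Bayes classifier and evaluate it on the held-out rows
model <- naiveBayes(Species ~ ., data=training)
predictions <- predict(model, testing)
table(predictions, testing$Species)   # confusion matrix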

Nearest neighbor estimator


With nearest neighbor, we have an unclassified object and a set of objects that have already been classified. We take the attributes of the unclassified object, compare them against the attributes of the known, classified objects, and assign our unknown the class of the closest match. The comparison is typically a Euclidean distance computed between the attribute values of the unknown object and those of each known object.
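
Concretely, for two objects with attribute vectors a = (a1, ..., an) and b = (b1, ..., bn), the Euclidean distance is

d(a, b) = sqrt((a1 - b1)^2 + ... + (an - bn)^2)

and the unknown object takes the class of the known object (or the majority class of the k closest known objects) at the smallest distance.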

Nearest neighbor using R

For this example, we are using the Boston housing data from the UCI Machine Learning Repository (archive.ics.uci.edu). First, we load the data and assign column names:

housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data") 
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV") 
summary(housing)

We reorder the data so the key (the housing price MDEV) is in ascending order:

housing <- housing[order(housing$MDEV),] 

Now, we can split the data into a training...
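
As a sketch of one way to continue (the split proportion, seed, and choice of k below are assumptions rather than the book's code), we can partition the data with caret and run a k-nearest-neighbor regression on MDEV using caret's knnreg():

# Partition the housing data into training and testing sets
library(caret)
set.seed(5557)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# k-nearest-neighbor regression predicting the housing price MDEV
knnFit <- knnreg(MDEV ~ ., data=training, k=5)
predicted <- predict(knnFit, testing)
cor(predicted, testing$MDEV)   # rough measure of fit quality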

Decision trees


In this section, we will use decision trees to predict values. A decision tree has a logical flow where decisions are made based on attribute values, following the tree down to a leaf node, where a classification is then provided.

For this example, we are using automobile characteristics, such as vehicle weight, to determine whether the vehicle will produce good mileage. The information is extracted from the page at https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.DecisionTrees. I copied the data out to Excel and then wrote it as a CSV for use in this example.

Decision trees in R

We load the rpart and caret libraries. rpart provides the decision tree model; caret provides the data partition function:

library(rpart)   # decision tree modeling
library(caret)   # createDataPartition()
set.seed(3277)   # make the partition reproducible

We load in our mpg dataset and split it into a training and testing set:

carmpg <- read.csv("car-mpg.csv") 
indices <- createDataPartition(carmpg$mpg, p=0.75, list=FALSE) 
training <- carmpg[indices,] 
testing...
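
A sketch of how the example typically continues (the formula below, and the assumption that mpg is a categorical good/bad label, come from the surrounding description rather than from code shown here):

# Hold out the remaining rows for testing
testing <- carmpg[-indices,]

# Fit a classification tree predicting the mpg class from the other attributes
fit <- rpart(mpg ~ ., data=training, method="class")
plot(fit, uniform=TRUE, margin=0.1)
text(fit, use.n=TRUE, cex=0.8)

# Evaluate the tree on the held-out data
predictions <- predict(fit, testing, type="class")
table(predictions, testing$mpg)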

Neural networks


We can model the housing data as a neural network where the different data elements are inputs into the system and the output of the network is the house price. With a neural net, we end up with a graphical model that provides the weights to apply to each input in order to arrive at our housing price.

Neural networks in R

There is a neural network package, neuralnet, available in R. We load it in:

#install.packages('neuralnet', repos="http://cran.r-project.org") 
library("neuralnet")

Load in the housing data:

filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data" 
housing <- read.table(filename) 
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",  
                       "RM", "AGE", "DIS", "RAD", "TAX", "PRATIO", 
                       "B", "LSTAT", "MDEV")

Split the housing data into training and test sets (we have seen this code in prior examples):

housing <- housing[order(housing$MDEV),] 
#install.packages("caret") 
library(caret...
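
As a sketch of one way to finish the split and fit the network (the hidden-layer size is an assumption, and the formula is written out explicitly because older versions of neuralnet do not accept the dot shorthand):

# Finish the split, following the same pattern as the earlier examples
set.seed(5557)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# Train a small network predicting MDEV from the other columns
nnFit <- neuralnet(MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE +
                   DIS + RAD + TAX + PRATIO + B + LSTAT,
                   data=training, hidden=4, linear.output=TRUE)
plot(nnFit)   # visualize the network and its fitted weights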

Random forests


The random forests algorithm builds a number of randomized decision trees and combines their predictions, producing a model that generally performs better than any single tree within the parameters used to drive the model.

Random forests in R

With R we include the packages we are going to use:

install.packages("randomForest", repos="http://cran.r-project.org") 
library(randomForest) 

Load the data:

filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data" 
housing <- read.table(filename) 
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",  
                       "RM", "AGE", "DIS", "RAD", "TAX", "PRATIO", 
                       "B", "LSTAT", "MDEV") 

Split it up:

housing <- housing[order(housing$MDEV),] 
#install.packages("caret") 
library(caret) 
set.seed(5557) 
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE) 
training <- housing[indices,] 
testing <- housing[-indices,] 
nrow(training) 
nrow(testing) 

Calculate our model:

forestFit <- randomForest(MDEV ~ CRIM ...
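
As a sketch of the full model call (the dot shorthand below stands in for listing every remaining predictor and is an assumption of equivalent intent, not the book's exact line):

# Fit a random forest predicting MDEV from all remaining columns
forestFit <- randomForest(MDEV ~ ., data=training)
forestFit   # print a summary of the fitted forest

# Predict on the held-out rows and compute the test-set RMSE
forestPredict <- predict(forestFit, newdata=testing)
sqrt(mean((forestPredict - testing$MDEV)^2))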

Summary


In this chapter, we used several machine learning algorithms, some of them in both R and Python, to compare and contrast the approaches. We used naive Bayes to classify data based on feature probabilities. We applied nearest neighbor in a couple of different ways to classify and predict values. We used decision trees to build a predictive model. We tried to use a neural network to explain housing prices. Finally, we used the random forest algorithm on the same data, with the best results. In the next chapter, we will look at optimizing Jupyter notebooks.
