Naive Bayes is a classification algorithm based on Bayes' theorem, with a strong (naive) assumption that the features are independent of one another. Bayes' theorem estimates the probability of an event given prior knowledge of conditions related to it. So, overall, we use a set of feature values to estimate a class, assuming that the same conditions hold true when those features have similar values.
Our first implementation of naive Bayes uses the R programming language. The algorithm is implemented in the e1071 library; e1071 appears to have been the department identifier at the school where the package was developed.
We first install the package and load the libraries:
#install.packages("e1071", repos="http://cran.r-project.org")
library(e1071)
library(caret)
set.seed(7317)
data(iris)
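With the packages loaded and the data attached, a minimal naive Bayes fit on iris might look like the following sketch; the 75/25 split and the confusion table are illustrative choices, not from the original text:

```r
library(e1071)
library(caret)
set.seed(7317)
data(iris)

# hold out 25% of the rows for testing
indices <- createDataPartition(iris$Species, p=0.75, list=FALSE)
training <- iris[indices,]
testing <- iris[-indices,]

# naiveBayes() estimates P(Species | features) under the
# feature-independence assumption
model <- naiveBayes(Species ~ ., data=training)

# classify the held-out rows and compare against the known species
predictions <- predict(model, testing)
table(predictions, testing$Species)
```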
Some notes on these steps:
- The install.packages call is commented out, as we don't want to run it every time we run the script.
- e1071 is the package that contains the naive Bayes algorithm.
- The caret package contains the data partition function we use to split data into training and test sets.

Using nearest neighbor, we have an unclassified object and a set of objects whose classifications are known. We then take the attributes of the unclassified object, compare them against the known classifications in place, and select the class that is closest to our unknown. The comparisons resolve to Euclidean geometry, computing the distances between two points (where the known objects' attributes fall in comparison to the unknown object's attributes).
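This comparison can be sketched in R with the knn function from the class package; the package choice, the split, and k=3 are assumptions for illustration, not from the original text:

```r
library(class)   # assumed package; provides the knn function
set.seed(7317)

# split iris into classified (training) and unclassified (test) rows
indices <- sample(nrow(iris), floor(0.75 * nrow(iris)))
train <- iris[indices, 1:4]
test  <- iris[-indices, 1:4]

# knn measures the Euclidean distance from each test row to every
# training row and votes among the k closest classified neighbors
predicted <- knn(train, test, cl=iris$Species[indices], k=3)
table(predicted, iris$Species[-indices])
```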
For this example, we are using the housing data from the UCI Machine Learning Repository (ics.uci.edu). First, we load the data and assign column names:
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
summary(housing)
We reorder the data so that the key (the housing price, MDEV) is in ascending order:
housing <- housing[order(housing$MDEV),]
Now, we can split the data into a training set and a test set:
#install.packages("caret")
library(caret)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
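Applying the comparison to the housing data, a nearest-neighbor regression might be sketched with the knn.reg function from the FNN package; the package, the feature subset, and k=5 are assumptions, not from the original text:

```r
library(FNN)     # assumed package providing knn.reg for regression
library(caret)
set.seed(7317)

# load and split the housing data, repeating the steps above
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# an illustrative subset of features to measure distance over
trainX <- training[, c("RM", "LSTAT", "PRATIO")]
testX  <- testing[,  c("RM", "LSTAT", "PRATIO")]

# each test price is estimated from its 5 nearest (Euclidean) neighbors
fit <- knn.reg(trainX, testX, y=training$MDEV, k=5)
head(fit$pred)
```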
In this section, we will use decision trees to predict values. A decision tree has a logical flow: decisions are made based on attribute values, following the tree down to a leaf, where a classification is then provided.
For this example, we are using automobile characteristics, such as vehicle weight, to determine whether the vehicle will produce good mileage. The information is extracted from the page at https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.DecisionTrees. I copied the data out to Excel and then wrote it as a CSV for use in this example.
We load the rpart and caret libraries. rpart contains the decision tree modeling functions; caret has the data partition function:
library(rpart)
library(caret)
set.seed(3277)
We load in our mpg dataset and split it into a training and a testing set:
carmpg <- read.csv("car-mpg.csv")
indices <- createDataPartition(carmpg$mpg, p=0.75, list=FALSE)
training <- carmpg[indices,]
testing <- carmpg[-indices,]
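With the split in place, a tree fit might be sketched as follows; the formula and the method="class" setting are assumptions, since only the mpg column of the CSV is referenced in the original text:

```r
library(rpart)
library(caret)
set.seed(3277)

# assumes car-mpg.csv (from the page cited above) is in the working directory
carmpg <- read.csv("car-mpg.csv")
indices <- createDataPartition(carmpg$mpg, p=0.75, list=FALSE)
training <- carmpg[indices,]
testing <- carmpg[-indices,]

# fit a classification tree predicting mpg from all other attributes
fit <- rpart(mpg ~ ., data=training, method="class")

# classify the held-out rows and tabulate the results
predictions <- predict(fit, testing, type="class")
table(predictions, testing$mpg)
```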
We can model the housing data as a neural network where the different data elements are inputs to the system and the output of the network is the house price. With a neural net, we end up with a graphical model that provides the weights to apply to each input in order to arrive at our housing price.
There is a neural network package available in R. We load that in:
#install.packages('neuralnet', repos="http://cran.r-project.org")
library("neuralnet")
Load in the housing data:
filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
housing <- read.table(filename)
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
"RM", "AGE", "DIS", "RAD", "TAX", "PRATIO",
"B", "LSTAT", "MDEV")
Split up the housing data into training and test sets (we have seen this coding in prior examples):
housing <- housing[order(housing$MDEV),]
#install.packages("caret")
library(caret)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
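Once the data is split, a fit with neuralnet might be sketched as follows; the hidden-layer size and the spelled-out formula are illustrative choices (older versions of neuralnet do not accept the "." formula shorthand):

```r
#install.packages('neuralnet', repos="http://cran.r-project.org")
library("neuralnet")
library(caret)

# load and split the housing data, repeating the steps above
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# spell out the formula explicitly
f <- MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS +
  RAD + TAX + PRATIO + B + LSTAT

# one hidden layer of 5 units (an illustrative choice) and a linear
# output, since we are predicting a continuous price
nn <- neuralnet(f, data=training, hidden=5, linear.output=TRUE)

# run the held-out rows (all columns except MDEV) through the network
results <- compute(nn, testing[, 1:13])
head(results$net.result)
```

In practice the inputs are usually scaled before training; without scaling, neuralnet may fail to converge on this data.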
The random forests algorithm builds a number of randomized decision trees and aggregates their predictions into a single model, within the parameters used to drive the modeling.
With R we include the packages we are going to use:
#install.packages("randomForest", repos="http://cran.r-project.org")
library(randomForest)
Load the data:
filename = "http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
housing <- read.table(filename)
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
"RM", "AGE", "DIS", "RAD", "TAX", "PRATIO",
"B", "LSTAT", "MDEV")
Split it up:
housing <- housing[order(housing$MDEV),]
#install.packages("caret")
library(caret)
set.seed(5557)
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]
nrow(training)
nrow(testing)
Calculate our model:
forestFit <- randomForest(MDEV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PRATIO + B + LSTAT, data=training)
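A sketch of fitting the model and evaluating it on the held-out rows might look like the following; the "." formula shorthand (all remaining columns as predictors) and the mean-squared-error measure are my assumptions, not from the original text:

```r
library(randomForest)
library(caret)
set.seed(5557)

# load and split the housing data, repeating the steps above
housing <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                       "DIS", "RAD", "TAX", "PRATIO", "B", "LSTAT", "MDEV")
indices <- createDataPartition(housing$MDEV, p=0.75, list=FALSE)
training <- housing[indices,]
testing <- housing[-indices,]

# '.' is shorthand for all remaining columns as predictors
forestFit <- randomForest(MDEV ~ ., data=training)

# predict held-out prices and compute the mean squared error
predictions <- predict(forestFit, newdata=testing)
mean((predictions - testing$MDEV)^2)

# which inputs contributed most to the trees
importance(forestFit)
```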
In this chapter, we used several machine learning algorithms, implementing some of them in both R and Python to compare and contrast. We used naive Bayes to determine how the data might be classified. We applied nearest neighbor in a couple of different ways to see our results. We used decision trees to come up with a predictive algorithm. We tried to use a neural network to explain housing prices. Finally, we used the random forest algorithm on the same data—with the best results! In the next chapter, we will look at optimizing Jupyter notebooks.