In recent times, machine learning (ML) and data science have gained popularity like never before, and the field is expected to grow exponentially in the coming years. First of all, what is machine learning? And why should someone take pains to understand its principles? Well, we have the answers for you. One simple example is book recommendations on e-commerce websites: when someone searches for a particular book, the site recommends other products that were frequently bought together, giving users an idea of what else they might like. Sounds like magic, right? In fact, utilizing machine learning, we can achieve much more than this.
Machine learning is a branch of study in which a model can learn automatically from experience based on data, without being explicitly programmed as in statistical models. Over a period of time and with more data, model predictions become better.
In this first chapter, we will introduce the basic concepts and terminology of both statistics and machine learning, to create a foundation for understanding the similarity between the two streams. It is aimed at readers who are either full-time statisticians or software engineers implementing machine learning and who would like to understand the statistical workings behind ML methods. We will quickly cover the fundamentals necessary for understanding the building blocks of models.
In this chapter, we will cover the following:
Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of numerical data.
Statistics is mainly classified into two subbranches: descriptive statistics and inferential statistics.
Statistical modeling is applying statistics on data to find underlying hidden relationships by analyzing the significance of the variables.
Machine learning is the branch of computer science that utilizes past experience to learn from and use its knowledge to make future decisions. Machine learning is at the intersection of computer science, engineering, and statistics. The goal of machine learning is to generalize a detectable pattern or to create an unknown rule from given examples. An overview of machine learning landscape is as follows:
Machine learning is broadly classified into three categories (supervised, unsupervised, and reinforcement learning); nonetheless, based on the situation, these categories can be combined to achieve the desired results for particular applications:
In some cases, when the number of variables is very high, we initially perform unsupervised learning to reduce the dimensionality, followed by supervised learning. Similarly, in some artificial intelligence applications, supervised learning combined with reinforcement learning could be utilized for solving a problem; an example is self-driving cars in which, initially, images are converted to some numeric format using supervised learning and combined with driving actions (left, forward, right, and backward).
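As a minimal sketch of the first combination (using scikit-learn's built-in digits dataset and illustrative parameter values, not an example from this text), unsupervised dimensionality reduction can precede a supervised classifier:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

digits = load_digits()                            # 64 input variables per observation
pipe = Pipeline([
    ('pca', PCA(n_components=10)),                # unsupervised: reduce 64 dimensions to 10
    ('clf', LogisticRegression(max_iter=1000))    # supervised: classify in the reduced space
])
pipe.fit(digits.data, digits.target)
print("Training accuracy:", round(pipe.score(digits.data, digits.target), 3))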
Though there are inherent similarities between statistical modeling and machine learning methodologies, the relationship is not always apparent to many practitioners. In the following table, we explain the differences succinctly to show the ways in which both streams are similar and where they differ:
| Statistical modeling | Machine learning |
| --- | --- |
| Formalization of relationships between variables in the form of mathematical equations. | Algorithm that can learn from the data without relying on rule-based programming. |
| Required to assume the shape of the model curve prior to fitting the model on the data (for example, linear, polynomial, and so on). | Does not need to assume the underlying shape, as machine learning algorithms can learn complex patterns automatically from the provided data. |
| A statistical model predicts the output with 85 percent accuracy and with 90 percent confidence about it. | A machine learning model just predicts the output with 85 percent accuracy. |
| In statistical modeling, various diagnostics of parameters are performed, such as p-values. | Machine learning models do not perform statistical significance tests on parameters. |
| Data is split 70 percent - 30 percent into training and testing data. The model is developed on training data and tested on testing data. | Data is split 50 percent - 25 percent - 25 percent into training, validation, and testing data. Models are developed on training data, hyperparameters are tuned on validation data, and the final model is evaluated against test data. |
| Statistical models can be developed on a single dataset, called training data, as diagnostics are performed at both the overall accuracy and individual variable level. | Due to the lack of diagnostics on variables, machine learning algorithms need to be trained on two datasets, called training and validation data, to ensure two-point validation. |
| Statistical modeling is mostly used for research purposes. | Machine learning is very apt for implementation in a production environment. |
| From the school of statistics and mathematics. | From the school of computer science. |
The development and deployment of machine learning models involves a series of steps similar to the statistical modeling process, in order to develop, validate, and implement the models. The steps are as follows:
Statistics itself is a vast subject on which a complete book could be written; however, here the attempt is to focus on key concepts that are essential from a machine learning perspective. In this section, a few fundamentals are covered; the remaining concepts will be covered in later chapters wherever they are necessary for understanding the statistical equivalents of machine learning.
Predictive analytics depends on one major assumption: that history repeats itself!
By fitting a predictive model on historical data after validating key measures, the same model will be utilized for predicting future events based on the same explanatory variables that were significant on past data.
The banking and pharmaceutical industries were the first movers in implementing statistical models; over a period of time, analytics expanded to other industries as well.
Statistical models are a class of mathematical models that are usually specified by mathematical equations that relate one or more variables to approximate reality. Assumptions embodied by statistical models describe a set of probability distributions, which distinguishes them from non-statistical, mathematical, or machine learning models.
Statistical models always start with some underlying assumptions that all the variables should satisfy; only then is the performance provided by the model statistically significant. Hence, knowing the various bits and pieces involved in all the building blocks provides a strong foundation for being a successful statistician.
In the following sections, we describe the various fundamentals with relevant code:
Usually, it is expensive to perform an analysis on an entire population; hence, most statistical methods are about drawing conclusions about a population by analyzing a sample.
The Python code for the calculation of mean, median, and mode using a numpy array and the stats package is as follows:
>>> import numpy as np
>>> from scipy import stats
>>> data = np.array([4,5,1,2,7,2,6,9,3])
# Calculate Mean
>>> dt_mean = np.mean(data) ; print ("Mean :",round(dt_mean,2))
# Calculate Median
>>> dt_median = np.median(data) ; print ("Median :",dt_median)
# Calculate Mode
>>> dt_mode = stats.mode(data); print ("Mode :",dt_mode[0][0])
The output of the preceding code is as follows:
We have used a NumPy array instead of a basic list as the data structure; the reason behind using this is the scikit-learn
package built on top of NumPy array in which all statistical models and machine learning algorithms have been built on NumPy array itself. The mode
function is not implemented in the numpy
package, hence we have used SciPy's stats
package. SciPy is also built on top of NumPy arrays.
The R code for descriptive statistics (mean, median, and mode) is given as follows:
data <- c(4,5,1,2,7,2,6,9,3)
dt_mean = mean(data) ; print(round(dt_mean,2))
dt_median = median (data); print (dt_median)
func_mode <- function (input_dt) {
  unq <- unique(input_dt)
  unq[which.max(tabulate(match(input_dt,unq)))]
}
dt_mode = func_mode (data); print (dt_mode)
We have used the default stats package for R; however, the mode function is not built in, hence we have written custom code for calculating the mode.
The Python code is as follows:
>>> from statistics import variance, stdev
>>> game_points = np.array([35,56,43,59,63,79,35,41,64,43,93,60,77,24,82])
# Calculate Variance
>>> dt_var = variance(game_points) ; print ("Sample variance:", round(dt_var,2))
# Calculate Standard Deviation
>>> dt_std = stdev(game_points) ; print ("Sample std.dev:", round(dt_std,2))
# Calculate Range
>>> dt_rng = np.max(game_points,axis=0) - np.min(game_points,axis=0) ; print ("Range:",dt_rng)
#Calculate percentiles
>>> print ("Quantiles:")
>>> for val in [20,80,100]:
...     dt_qntls = np.percentile(game_points,val)
...     print (str(val)+"%" ,dt_qntls)
# Calculate IQR
>>> q75, q25 = np.percentile(game_points, [75 ,25]); print ("Inter quartile range:",q75-q25)
The output of the preceding code is as follows:
The R code for dispersion (variance, standard deviation, range, quantiles, and IQR) is as follows:
game_points <- c(35,56,43,59,63,79,35,41,64,43,93,60,77,24,82)
dt_var = var(game_points); print(round(dt_var,2))
dt_std = sd(game_points); print(round(dt_std,2))
range_val<-function(x) return(diff(range(x)))
dt_range = range_val(game_points); print(dt_range)
dt_quantile = quantile(game_points,probs = c(0.2,0.8,1.0)); print(dt_quantile)
dt_iqr = IQR(game_points); print(dt_iqr)
The steps involved in hypothesis testing are as follows:
The null hypothesis is that μ ≥ 1000 (the mean weight of all chocolates is at least 1,000 g).
Collected sample:
Calculate test statistic:
t = (990 - 1000) / (12.5 / sqrt(30)) = -4.3818
Critical t value from the t table = t0.05,29 = 1.699 => -t0.05,29 = -1.699
p-value = 7.03e-05
Test statistic is -4.3818, which is less than the critical value of -1.699. Hence, we can reject the null hypothesis (your friend's claim) that the mean weight of a chocolate is above 1,000 g.
Also, another way of deciding on the claim is by using the p-value. A p-value less than 0.05 means the observed sample mean is significantly different from the claimed value, hence we can reject the null hypothesis:
The Python code is as follows:
>>> from scipy import stats
>>> xbar = 990; mu0 = 1000; s = 12.5; n = 30
# Test Statistic
>>> t_smple = (xbar-mu0)/(s/np.sqrt(float(n))); print ("Test Statistic:",round(t_smple,2))
# Critical value from t-table
>>> alpha = 0.05
>>> t_alpha = stats.t.ppf(alpha,n-1); print ("Critical value from t-table:",round(t_alpha,3))
#Lower tail p-value from t-table
>>> p_val = stats.t.sf(np.abs(t_smple), n-1); print ("Lower tail p-value from t-table", p_val)
The R code for T-distribution is as follows:
xbar = 990; mu0 = 1000; s = 12.5 ; n = 30
t_smple = (xbar - mu0)/(s/sqrt(n));print (round(t_smple,2))
alpha = 0.05
t_alpha = qt(alpha,df= n-1);print (round(t_alpha,3))
p_val = pt(t_smple,df = n-1);print (p_val)
Example: Assume that the test scores of an entrance exam fit a normal distribution. Furthermore, the mean test score is 52 and the standard deviation is 16.3. What is the percentage of students scoring 67 or more in the exam?
The Python code is as follows:
>>> from scipy import stats
>>> xbar = 67; mu0 = 52; s = 16.3
# Calculating z-score
>>> z = (xbar - mu0)/s
# Calculating probability under the curve
>>> p_val = 1- stats.norm.cdf(z)
>>> print ("Prob. to score more than 67 is ",round(p_val*100,2),"%")
The R code for normal distribution is as follows:
xbar = 67; mu0 = 52; s = 16.3
pr = 1- pnorm(67, mean=52, sd=16.3)
print(paste("Prob. to score more than 67 is ",round(pr*100,2),"%"))
The test is usually performed by calculating χ² from the data and comparing it with the critical χ² value with (m-1)(n-1) degrees of freedom from the table; if the calculated value is higher than the table value, the two variables are judged to be dependent:
Example: In the following table, calculate whether the smoking habit has an impact on exercise behavior:
The Python code is as follows:
>>> import pandas as pd
>>> from scipy import stats
>>> survey = pd.read_csv("survey.csv")
# Tabulating 2 variables with row & column variables respectively
>>> survey_tab = pd.crosstab(survey.Smoke, survey.Exer, margins = True)
While creating a table using the crosstab function, we also obtain the row and column totals as extra fields. However, in order to create the observed table, we need to extract just the variables part and ignore the totals:
# Creating observed table for analysis
>>> observed = survey_tab.iloc[0:4,0:3]
The chi2_contingency function in the stats package uses the observed table and subsequently calculates its expected table, followed by calculating the p-value in order to check whether the two variables are dependent or not. If the p-value is less than 0.05, there is a strong dependency between the two variables, whereas if it is greater than 0.05, there is no significant dependency between the variables:
>>> contg = stats.chi2_contingency(observed= observed)
>>> p_value = round(contg[1],3)
>>> print ("P-value is: ",p_value)
The p-value is 0.483, which means there is no significant dependency between the smoking habit and exercise behavior.
The R code for chi-square is as follows:
survey = read.csv("survey.csv",header=TRUE)
tbl = table(survey$Smoke,survey$Exer)
p_val = chisq.test(tbl)
print(p_val)
Example: A fertilizer company developed three new types of universal fertilizers after research that can be utilized to grow any type of crop. In order to find out whether all three have a similar crop yield, they randomly chose six crop types in the study. In accordance with the randomized block design, each crop type will be tested with all three types of fertilizer separately. The following table represents the yield in g/m². At the 0.05 level of significance, test whether the mean yields for the three new types of fertilizers are all equal:
| Fertilizer 1 | Fertilizer 2 | Fertilizer 3 |
| --- | --- | --- |
| 62 | 54 | 48 |
| 62 | 56 | 62 |
| 90 | 58 | 92 |
| 42 | 36 | 96 |
| 84 | 72 | 92 |
| 64 | 34 | 80 |
The Python code is as follows:
>>> import pandas as pd
>>> from scipy import stats
>>> fetilizers = pd.read_csv("fetilizers.csv")
Calculating the one-way ANOVA using the stats package:
>>> one_way_anova = stats.f_oneway(fetilizers["fertilizer1"], fetilizers["fertilizer2"], fetilizers["fertilizer3"])
>>> print ("Statistic :", round(one_way_anova[0],2),", p-value :",round(one_way_anova[1],3))
Result: The p-value came out less than 0.05, hence we can reject the null hypothesis that the mean crop yields of the fertilizers are equal. Fertilizers make a significant difference to crop yield.
The R code for ANOVA is as follows:
fetilizers = read.csv("fetilizers.csv",header=TRUE)
r = c(t(as.matrix(fetilizers)))
f = c("fertilizer1","fertilizer2","fertilizer3")
k = 3; n = 6
tm = gl(k,1,n*k,factor(f))
blk = gl(n,k,k*n)
av = aov(r ~ tm + blk)
smry = summary(av)
print(smry)
Some terms used in a confusion matrix are:
Precision = TP / (TP + FP)
Recall (sensitivity) = TP / (TP + FN)
Specificity = TN / (TN + FP)
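As a quick illustration of these formulas, the following sketch computes them from hypothetical confusion matrix counts (the numbers are made up for demonstration):

TP, FP, FN, TN = 45, 5, 10, 40   # hypothetical confusion matrix counts

precision   = TP / (TP + FP)     # of the predicted positives, how many are truly positive
recall      = TP / (TP + FN)     # of the actual positives, how many were caught (sensitivity)
specificity = TN / (TN + FP)     # of the actual negatives, how many were correctly rejected

print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("Specificity:", round(specificity, 3))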
Area under curve is utilized for setting the threshold of cut-off probability to classify the predicted probability into various classes; we will be covering how this method works in upcoming chapters.
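As a minimal sketch of computing the area under the curve with scikit-learn's roc_auc_score (the labels and predicted probabilities here are made up):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                    # hypothetical actual classes
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]  # hypothetical predicted probabilities
print("AUC:", round(roc_auc_score(y_true, y_prob), 3))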
In order to answer this question, a probability of default model (or behavioral scorecard, in technical terms) needs to be developed using independent variables from the past 24 months and a dependent variable from the next 12 months. After preparing the data with X and Y variables, it is split randomly 70 percent - 30 percent into train and test data; this method is called in-time validation, as both train and test samples are from the same time period:
Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]. Here, R² = sample R-squared value, n = sample size, and k = number of predictors (or variables).
The adjusted R-squared value is the key metric for evaluating the quality of linear regressions. Any linear regression model having an adjusted R² >= 0.7 is considered good enough to implement.
Example: The R-squared value of a sample is 0.5, the sample size is 50, and there are 10 independent variables. The calculated adjusted R-squared is 1 - (1 - 0.5) × 49 / 39 ≈ 0.372.
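The following snippet verifies the worked example in code:

R2, n, k = 0.5, 50, 10
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print("Adjusted R-squared:", round(adj_R2, 4))   # prints 0.3718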
AIC = -2 × ln(L) + 2k. Here, L = the maximized value of the likelihood function and k = number of predictors or variables.
The idea of AIC is to penalize the objective function if extra variables without strong predictive abilities are included in the model. This is a kind of regularization in logistic regression.
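A minimal sketch of how the penalty works (the log-likelihood values are hypothetical): a larger model with only a marginally better likelihood ends up with a worse (higher) AIC:

def aic(log_likelihood, k):
    # AIC = -2*ln(L) + 2k; lower is better, and extra variables are penalized via k
    return -2 * log_likelihood + 2 * k

print(aic(log_likelihood=-120.0, k=5))   # 250.0
print(aic(log_likelihood=-119.5, k=8))   # 255.0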
Entropy = -Σ p_i log2(p_i), where the sum runs over the n classes; here, n = number of classes. Entropy is maximal at the middle, with the value of 1, and minimal at the extremes, with the value of 0. A low value of entropy is desirable, as it will segregate classes better:
Example: Given two types of coin in which the first one is a fair one (1/2 head and 1/2 tail probabilities) and the other is a biased one (1/3 head and 2/3 tail probabilities), calculate the entropy for both and justify which one is better with respect to modeling:
From both values (1.0 for the fair coin and approximately 0.918 for the biased coin), the decision tree algorithm chooses the biased coin rather than the fair coin as an observation splitter, due to the fact that its entropy value is lower.
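The following sketch reproduces the two entropy values:

import math

def entropy(probs):
    # Shannon entropy in bits: -sum(p * log2(p))
    return -sum(p * math.log2(p) for p in probs if p > 0)

print("Fair coin entropy:", round(entropy([1/2, 1/2]), 4))     # 1.0
print("Biased coin entropy:", round(entropy([1/3, 2/3]), 4))   # 0.9183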
Information gain = entropy of parent - Σ (weighted % × entropy of child)
Weighted % = number of observations in a particular child / sum of observations in all child nodes
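As a worked illustration of the information gain formula, with hypothetical split counts:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Parent node: 10 observations, 5 per class; split into children of 6 (5 vs 1) and 4 (0 vs 4)
parent = entropy([5/10, 5/10])
child1 = entropy([5/6, 1/6])
child2 = entropy([0/4, 4/4])
info_gain = parent - (6/10 * child1 + 4/10 * child2)
print("Information gain:", round(info_gain, 4))   # about 0.61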
Gini impurity = 1 - Σ p_i², where the sum runs over the i classes; here, i = number of classes. The similarity between Gini and entropy is shown as follows:
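The similarity can also be verified numerically with a small sketch that evaluates both measures across class probabilities; both peak at p = 0.5 and shrink toward the extremes:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini impurity: 1 - sum(p_i^2)
    return 1 - sum(p ** 2 for p in probs)

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print("p =", p, "Gini:", round(gini([p, 1 - p]), 3), "Entropy:", round(entropy([p, 1 - p]), 3))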
Every model has both bias and variance error components in addition to white noise. Bias and variance are inversely related to each other; while trying to reduce one component, the other component of the model will increase. The true art lies in creating a good fit by balancing both. The ideal model will have both low bias and low variance.
Errors from the bias component come from erroneous assumptions in the underlying learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs; this phenomenon causes an underfitting problem.
On the other hand, errors from the variance component come from the model's sensitivity to even small changes in the training data; high variance can cause an overfitting problem:
An example of a high bias model is logistic or linear regression, in which the fit of the model is merely a straight line; it may have a high error component because a linear model cannot approximate the underlying data well.
An example of a high variance model is a decision tree, in which the model may create an overly wiggly curve as a fit, and even a small change in the training data will cause a drastic change in the fit of the curve.
At the moment, state-of-the-art models utilize high variance models such as decision trees and perform ensembling on top of them to reduce the errors caused by high variance, while at the same time not compromising on increased errors due to the bias component. The best example of this category is random forest, in which many decision trees are grown independently and ensembled in order to come up with the best fit; we will cover this in upcoming chapters:
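A minimal sketch of this variance reduction effect (on a synthetic scikit-learn dataset, with illustrative parameters) compares a single deep decision tree against a random forest of 100 trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Single tree test accuracy:", round(tree.score(X_test, y_test), 3))
print("Random forest test accuracy:", round(forest.score(X_test, y_test), 3))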
In practice, data is usually split randomly 70-30 or 80-20 into train and test datasets in statistical modeling, where the training data is utilized for building the model and its effectiveness is checked on the test data:
In the following code, we split the original data into train and test data by 70 percent - 30 percent. An important point to consider here is that we set the seed value for the random number generator in order to obtain the same observations in the training and testing data every time the sampling is repeated. Repeatability is very much needed in order to reproduce the results:
# Train & Test split
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> original_data = pd.read_csv("mtcars.csv")
In the following code, train_size is 0.7, which means 70 percent of the data is put into the training dataset and the remaining 30 percent into the testing dataset. The random state is the seed used for pseudo-random number generation, which makes the results reproducible by producing exactly the same observations in each split every time the code is run:
>>> train_data,test_data = train_test_split(original_data,train_size = 0.7,random_state=42)
The R code for the train and test split for statistical modeling is as follows:
full_data = read.csv("mtcars.csv",header=TRUE)
set.seed(123)
numrow = nrow(full_data)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = full_data[trnind,]
test_data = full_data[-trnind,]
There seems to be an analogy between statistical modeling and machine learning that we will cover in subsequent chapters in depth. However, a quick view is provided as follows: in statistical modeling, linear regression with two independent variables tries to fit the best plane with the least errors, whereas in machine learning the same problem is posed as minimizing a sum of squared error terms (squaring ensures the function becomes convex, which enables faster convergence and also guarantees a global optimum) and is optimized over the coefficient values rather than the independent variables:
Machine learning utilizes optimization for tuning all the parameters of various algorithms. Hence, it is a good idea to know some basics about optimization.
Before stepping into gradient descent, an introduction to convex and non-convex functions is very helpful. Convex functions are functions in which a line segment drawn between any two random points on the function lies on or above the function, whereas this isn't true for non-convex functions. It is important to know whether the function is convex or non-convex because, for convex functions, any local optimum is also the global optimum, whereas for non-convex functions, a local optimum does not guarantee the global optimum:
Does it seem like a tough problem? One workaround is to initiate the search process from several different random locations; the search will then usually converge to the global optimum:
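A minimal sketch of this multi-start idea, using scipy.optimize.minimize on a made-up non-convex function:

import numpy as np
from scipy.optimize import minimize

f = lambda x: np.sin(3 * x[0]) + 0.1 * x[0] ** 2    # non-convex, several local minima

# Run a local optimizer from several random starting points and keep the best result
rng = np.random.RandomState(0)
results = [minimize(f, x0=rng.uniform(-5, 5, size=1)) for _ in range(10)]
best = min(results, key=lambda r: r.fun)
print("Best x:", best.x[0], "f(x):", best.fun)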
In the following code, a comparison has been made between applying linear regression in a statistical way and gradient descent in a machine learning way on the same dataset:
>>> import numpy as np
>>> import pandas as pd
The following code describes reading data using a pandas DataFrame:
>>> train_data = pd.read_csv("mtcars.csv")
Converting the DataFrame variables into NumPy arrays in order to process them in scikit-learn, as scikit-learn itself is built on NumPy arrays:
>>> X = np.array(train_data["hp"]) ; y = np.array(train_data["mpg"])
>>> X = X.reshape(32,1); y = y.reshape(32,1)
Importing linear regression from the scikit-learn package; this works on the least squares method:
>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression(fit_intercept = True)
Fitting a linear regression model on the data and displaying the intercept and the coefficient of the single variable (the hp variable):
>>> model.fit(X,y)
>>> print ("Linear Regression Results" )
>>> print ("Intercept",model.intercept_[0] ,"Coefficient", model.coef_[0])
Now we will apply gradient descent from scratch; in future chapters, we will use the scikit-learn built-in modules rather than working from first principles. Here, however, an illustration is provided of the internal workings of the optimization method on which the whole of machine learning is built.
Defining the gradient descent function gradient_descent with the following parameters:

x: Independent variable.
y: Dependent variable.
learn_rate: Learning rate with which gradients are updated; too low causes slower convergence and too high causes overshooting of the gradients.
conv_threshold: Convergence threshold on the change in error; iterations stop once the improvement falls below it.
batch_size: Number of observations considered at each iteration for updating the gradients; a high number causes a lower number of iterations and a lower number causes an erratic decrease in errors. Ideally, the batch size should be a minimum of 30 due to statistical significance. However, various settings need to be tried to check which one is better.
max_iter: Maximum number of iterations, beyond which the algorithm gets auto-terminated:

>>> def gradient_descent(x, y, learn_rate, conv_threshold, batch_size, max_iter):
... converged = False
... iter = 0
... m = batch_size
... t0 = np.random.random(x.shape[1])
... t1 = np.random.random(x.shape[1])
Mean square error calculation: squaring of the error has been performed to create the convex function, which has nice convergence properties:

... MSE = (sum([(t0 + t1*x[i] - y[i])**2 for i in range(m)])/ m)
The following code runs the algorithm until it meets the convergence criteria:
... while not converged:
... grad0 = 1.0/m * sum([(t0 + t1*x[i] - y[i]) for i in range(m)])
... grad1 = 1.0/m * sum([(t0 + t1*x[i] - y[i])*x[i] for i in range(m)])
... temp0 = t0 - learn_rate * grad0
... temp1 = t1 - learn_rate * grad1
... t0 = temp0
... t1 = temp1
Calculate the new error with the updated parameters, in order to check whether the change from the previous error is within the predefined convergence threshold; if it is, stop the iterations and return the parameters:
... MSE_New = (sum( [ (t0 + t1*x[i] - y[i])**2 for i in range(m)] ) / m)
... if abs(MSE - MSE_New ) <= conv_threshold:
...             print ('Converged, iterations: ', iter)
... converged = True
... MSE = MSE_New
... iter += 1
... if iter == max_iter:
...             print ('Maximum iterations reached')
... converged = True
... return t0,t1
The following code describes running the gradient descent function with the defined values: learning rate = 0.00003, convergence threshold = 1e-8, batch size = 32, and maximum number of iterations = 1500000:
>>> if __name__ == '__main__':
... Inter, Coeff = gradient_descent(x = X,y = y,learn_rate=0.00003 , conv_threshold = 1e-8, batch_size=32,max_iter=1500000)
... print ('Gradient Descent Results')
... print (('Intercept = %s Coefficient = %s') %(Inter, Coeff))
The R code for linear regression versus gradient descent is as follows:
# Linear Regression
train_data = read.csv("mtcars.csv",header=TRUE)
model <- lm(mpg ~ hp, data = train_data)
print (coef(model))

# Gradient descent
gradDesc <- function(x, y, learn_rate, conv_threshold, batch_size, max_iter) {
  m <- runif(1, 0, 1)
  c <- runif(1, 0, 1)
  ypred <- m * x + c
  MSE <- sum((y - ypred) ^ 2) / batch_size
  converged = F
  iterations = 0
  while (converged == F) {
    m_new <- m - learn_rate * ((1 / batch_size) * (sum((ypred - y) * x)))
    c_new <- c - learn_rate * ((1 / batch_size) * (sum(ypred - y)))
    m <- m_new
    c <- c_new
    ypred <- m * x + c
    MSE_new <- sum((y - ypred) ^ 2) / batch_size
    if (MSE - MSE_new <= conv_threshold) {
      converged = T
      return(paste("Iterations:", iterations, "Optimal intercept:", c, "Optimal slope:", m))
    }
    iterations = iterations + 1
    if (iterations > max_iter) {
      converged = T
      return(paste("Iterations:", iterations, "Optimal intercept:", c, "Optimal slope:", m))
    }
    MSE = MSE_new
  }
}

gradDesc(x = train_data$hp, y = train_data$mpg, learn_rate = 0.00003,
         conv_threshold = 1e-8, batch_size = 32, max_iter = 1500000)
The loss function or cost function in machine learning is a function that maps the values of variables onto a real number intuitively representing some cost associated with the variable values. Optimization methods are applied to minimize the loss function by changing the parameter values, which is the central theme of machine learning.
Zero-one loss is L0-1 = 1 if m < 0 and 0 if m >= 0, where m is the margin. The difficult part of this loss is that it is neither differentiable nor convex, and optimizing it directly is NP-hard. Hence, in order to make optimization feasible and solvable, it is replaced by different surrogate losses for different problems.
Surrogate losses are used in machine learning in place of the zero-one loss; since the zero-one loss is not differentiable, these approximated losses, chosen per problem, are used instead.
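As an illustration, the following sketch evaluates each loss as a function of the margin m = y * f(x) (hinge, logistic, and squared losses are common surrogate choices; the margin grid is made up):

import numpy as np

m = np.linspace(-2, 2, 9)             # margin values

zero_one = (m < 0).astype(float)      # 1 for misclassification, 0 otherwise
hinge    = np.maximum(0, 1 - m)       # used by SVMs
logistic = np.log2(1 + np.exp(-m))    # used by logistic regression
squared  = (1 - m) ** 2               # used by least squares classification

for name, loss in [("zero-one", zero_one), ("hinge", hinge),
                   ("logistic", logistic), ("squared", squared)]:
    print(name, np.round(loss, 2))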
When to stop tuning the hyperparameters of a machine learning model is a million-dollar question. This problem can mostly be solved by keeping tabs on the training and testing errors. While increasing the complexity of a model, the following stages occur:
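Broadly, the model first underfits, then fits well, then overfits. A minimal sketch of these stages (on a synthetic dataset with illustrative parameters): as tree depth grows, the training error keeps falling while the test error eventually starts rising, signalling overfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, random_state=7)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, random_state=7)

for depth in [1, 2, 4, 8, 16]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=7).fit(X_tr, y_tr)
    print("depth", depth,
          "train error", round(1 - model.score(X_tr, y_tr), 3),
          "test error", round(1 - model.score(X_ts, y_ts), 3))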
Cross-validation is not popular in the statistical modeling world for many reasons: statistical models are linear in nature and robust, and do not have a high variance/overfitting problem, so the model fit remains much the same on train and test data, which does not hold true in the machine learning world. Also, in statistical modeling, many tests are performed at the individual parameter level apart from the aggregated metrics, whereas in machine learning we do not have visibility at the individual parameter level:
In the following code, both the R and Python implementations are provided. If no percentages are provided, the default split is 50 percent for train data, 25 percent for validation data, and 25 percent for the remaining test data.
The Python implementation has only a single train and test split function, hence we use it twice, and we use the number of observations rather than a percentage to perform the split (unlike in the previous train and test split example). Hence, a customized function is needed to split the data into three datasets:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> original_data = pd.read_csv("mtcars.csv")
>>> def data_split(dat,trf = 0.5,vlf=0.25,tsf = 0.25):
... nrows = dat.shape[0]
... trnr = int(nrows*trf)
... vlnr = int(nrows*vlf)
The following Python code splits the data into training and the remaining data. The remaining data will be further split into validation and test datasets:
... tr_data,rmng = train_test_split(dat,train_size = trnr,random_state=42)
... vl_data, ts_data = train_test_split(rmng,train_size = vlnr,random_state=45)
... return (tr_data,vl_data,ts_data)
Implementation of the split function on the original data to create three datasets (by 50 percent, 25 percent, and 25 percent splits) is as follows:
>>> train_data, validation_data, test_data = data_split(original_data, trf=0.5, vlf=0.25, tsf=0.25)
The R code for the train, validation, and test split is as follows:
# Train Validation & Test samples
trvaltest <- function(dat,prop = c(0.5,0.25,0.25)){
nrw = nrow(dat)
trnr = as.integer(nrw *prop[1])
vlnr = as.integer(nrw*prop[2])
set.seed(123)
trni = sample(1:nrow(dat),trnr)
trndata = dat[trni,]
rmng = dat[-trni,]
vlni = sample(1:nrow(rmng),vlnr)
valdata = rmng[vlni,]
tstdata = rmng[-vlni,]
mylist = list("trn" = trndata,"val"= valdata,"tst" = tstdata)
return(mylist)
}
outdata = trvaltest(mtcars,prop = c(0.5,0.25,0.25))
train_data = outdata$trn; valid_data = outdata$val; test_data = outdata$tst
Cross-validation is another way of ensuring robustness in the model at the expense of computation. In the ordinary modeling methodology, a model is developed on train data and evaluated on test data. In some extreme cases, train and test might not have been homogeneously selected and some unseen extreme cases might appear in the test data, which will drag down the performance of the model.
In the cross-validation methodology, on the other hand, the data is divided into equal parts, and training is performed on all parts of the data except one, on which the performance is evaluated. This process is repeated as many times as there are parts.
Example: In five-fold cross-validation, data will be divided into five parts, subsequently trained on four parts of the data, and tested on the one part of the data. This process will run five times, in order to cover all points in the data. Finally, the error calculated will be the average of all the errors:
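A minimal sketch of five-fold cross-validation with scikit-learn (on the built-in iris dataset, for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
scores = cross_val_score(DecisionTreeClassifier(random_state=42),
                         iris.data, iris.target, cv=5)   # five equal folds
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))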
Grid search in machine learning is a popular way to tune the hyperparameters of the model in order to find the best combination for determining the best fit:
In the following code, implementation has been performed to determine whether a particular user will click an ad or not. Grid search has been implemented using a decision tree classifier for classification purposes. Tuning parameters are the depth of the tree, the minimum number of observations in terminal node, and the minimum number of observations required to perform the node split:
# Grid search
>>> import pandas as pd
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> input_data = pd.read_csv("ad.csv",header=None)
>>> X_columns = set(input_data.columns.values)
>>> y = input_data[len(input_data.columns.values)-1]
>>> X_columns.remove(len(input_data.columns.values)-1)
>>> X = input_data[list(X_columns)]
Split the data into train and test sets:
>>> X_train, X_test,y_train,y_test = train_test_split(X,y,train_size = 0.7,random_state=33)
Create a pipeline wrapping the classifier, over which the grid search will explore parameter combinations:
>>> pipeline = Pipeline([
... ('clf', DecisionTreeClassifier(criterion='entropy')) ])
Combinations to explore are given as parameters in Python dictionary format:
>>> parameters = {
... 'clf__max_depth': (50,100,150),
... 'clf__min_samples_split': (2, 3),
... 'clf__min_samples_leaf': (1, 2, 3)}
The n_jobs field is for selecting the number of cores in a computer; -1 means it uses all the cores in the computer. The scoring methodology here is accuracy, and many other options can be chosen, such as precision, recall, and f1:
>>> grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
>>> grid_search.fit(X_train, y_train)
Predict using the best parameters of grid search:
>>> y_pred = grid_search.predict(X_test)
The output is as follows:
>>> print ('\n Best score: \n', grid_search.best_score_)
>>> print ('\n Best parameters set: \n')
>>> best_parameters = grid_search.best_estimator_.get_params()
>>> for param_name in sorted(parameters.keys()):
...     print ('\t%s: %r' % (param_name, best_parameters[param_name]))
>>> print ("\n Confusion Matrix on Test data \n",confusion_matrix(y_test,y_pred))
>>> print ("\n Test Accuracy \n",accuracy_score(y_test,y_pred))
>>> print ("\nPrecision Recall f1 table \n",classification_report(y_test, y_pred))
The R code for grid searches on decision trees is as follows:
# Grid Search on Decision Trees
library(rpart)
input_data = read.csv("ad.csv",header=FALSE)
input_data$V1559 = as.factor(input_data$V1559)
set.seed(123)
numrow = nrow(input_data)
trnind = sample(1:numrow,size = as.integer(0.7*numrow))
train_data = input_data[trnind,];test_data = input_data[-trnind,]
minspset = c(2,3);minobset = c(1,2,3)
initacc = 0
for (minsp in minspset){
for (minob in minobset){
tr_fit = rpart(V1559 ~.,data = train_data,method = "class",minsplit = minsp, minbucket = minob)
tr_predt = predict(tr_fit,newdata = train_data,type = "class")
tble = table(tr_predt,train_data$V1559)
acc = (tble[1,1]+tble[2,2])/sum(tble)
acc
if (acc > initacc){
tr_predtst = predict(tr_fit,newdata = test_data,type = "class")
tblet = table(test_data$V1559,tr_predtst)
acct = (tblet[1,1]+tblet[2,2])/sum(tblet)
acct
print(paste("Best Score"))
print( paste("Train Accuracy ",round(acc,3),"Test Accuracy",round(acct,3)))
print( paste(" Min split ",minsp," Min obs per node ",minob))
print(paste("Confusion matrix on test data"))
print(tblet)
precsn_0 = (tblet[1,1])/(tblet[1,1]+tblet[2,1])
precsn_1 = (tblet[2,2])/(tblet[1,2]+tblet[2,2])
print(paste("Precision_0: ",round(precsn_0,3),"Precision_1: ",round(precsn_1,3)))
rcall_0 = (tblet[1,1])/(tblet[1,1]+tblet[1,2])
rcall_1 = (tblet[2,2])/(tblet[2,1]+tblet[2,2])
print(paste("Recall_0: ",round(rcall_0,3),"Recall_1: ",round(rcall_1,3)))
initacc = acc
}
}
}
Machine learning models are classified mainly into supervised, unsupervised, and reinforcement learning methods. We will be covering detailed discussions about each technique in later chapters; here is a very basic summary of them:
In this chapter, we have gained a high-level view of the various basic building blocks and subcomponents involved in statistical modeling and machine learning, such as mean, variance, interquartile range, p-value, the bias versus variance trade-off, AIC, Gini, and area under the curve with respect to the statistics context, and cross-validation, gradient descent, and grid search concepts with respect to machine learning. We have explained all the concepts with the support of both Python and R code, using various libraries such as numpy, scipy, pandas, and scikit-learn in Python and the basic stats package in R. In the next chapter, we will learn to draw parallels between statistical models and machine learning models, using linear regression problems and ridge/lasso regression in machine learning, with both Python and R code.