
You're reading from Hands-On Predictive Analytics with Python

Product type: Book
Published in: Dec 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789138719
Edition: 1st
Author (1)

Alvaro Fuentes

Alvaro Fuentes is a senior data scientist with a background in applied mathematics and economics. He has more than 14 years of experience in various analytical roles and is an analytics consultant at one of the 'Big Three' global management consulting firms, leading advanced analytics projects in industries such as banking, technology, and consumer goods. Alvaro is also an author and trainer in analytics and data science and has published courses and books, such as 'Become a Python Data Analyst' and 'Hands-On Predictive Analytics with Python'. He has also taught data science and related topics to thousands of students, both on-site and online, through platforms such as Springboard, Simplilearn, Udemy, and BSG Institute, among others.

Predicting Numerical Values with Machine Learning

Let's review what we have done so far: the business problem has been formulated, the data has been acquired and prepared, and we have a good understanding of the features and their possible relationships after applying exploratory data analysis (EDA). Now, it is finally time to build our first predictive models!

However, before building models for predictions, we should understand some of the basic foundational concepts of the field that we'll use in this book: machine learning (ML). We begin by providing a brief overview of what ML is and what the main ML techniques are. This is, of course, not a book on ML; for us, ML is just a tool, so we won't get into the theoretical or technical details that you would find in a typical ML book, which usually dedicates one chapter to each family of models. In addition, ML...

Technical requirements

  • Python 3.6 or higher
  • Jupyter Notebook
  • Recent versions of the following Python libraries: NumPy, pandas, matplotlib, Seaborn, and scikit-learn
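A quick way to check that your environment meets these requirements is a small version probe. This is just a convenience sketch (not from the book); it only reports what is installed and won't fail if a library is missing:

```python
import sys
from importlib import import_module

print("Python", sys.version.split()[0])

# Probe each required library and report its version if available.
for name in ("numpy", "pandas", "matplotlib", "seaborn", "sklearn"):
    try:
        mod = import_module(name)
        print(name, mod.__version__)
    except ImportError:
        print(name, "not installed")
```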

Introduction to ML

Machine learning is a term that has seen an explosion in popularity, mainly because it works. It has produced very good results when applied to many scientific and industrial problems, and it is present, in one form or another, in many technological products and services people use daily. If you interact with the internet, use apps on your smartphone, check your email, or do any telecommunications or banking transactions, then you have definitely interacted with an ML model. This is not a book about ML: we will focus on the basic concepts necessary to use ML as a tool for predictive analytics, we won't delve deeper into this exciting field, and there will be many important things that we leave out. However, because of the huge rise in interest in the subject, there are many excellent resources covering everything from deeply...

Practical considerations before modeling

We now have a basic understanding of some of the most important conceptual and theoretical aspects of ML. In this section, we will talk about some of the practical things we need to do before building a model, including some further data processing that is needed before the data can be fed into model training. We will also introduce our main tool for model building: scikit-learn.

Introducing scikit-learn

If you go to the main web page of scikit-learn, the first things you will read are the following statements about it:

  • Simple and efficient tool for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source...

Multiple linear regression (MLR)

In scikit-learn, ML models are implemented as classes known as estimators: an estimator is any object that learns from data, mainly models or transformers. All estimators have a fit method, which is used to train the estimator on a dataset, like this: estimator.fit(data).

It is important to note that the estimator has two kinds of parameters:

  • Estimator parameters: All the parameters of an estimator can be set when it is instantiated, or later by modifying the corresponding attribute. Some of these estimator parameters correspond to the ML model's hyperparameters. We will talk more about model hyperparameters later.
  • Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object, ending with an underscore.

Since scikit-learn has a very consistent API using estimators, it...
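The estimator pattern described above can be sketched with a minimal example (the tiny dataset here is invented for illustration; the target follows y = 2x + 1 exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1

# Estimator parameters are set at instantiation:
model = LinearRegression(fit_intercept=True)

# Training: estimated parameters are computed from the data.
model.fit(X, y)

# Estimated parameters are attributes ending with an underscore:
print(model.coef_)       # slope, approximately [2.0]
print(model.intercept_)  # intercept, approximately 1.0
```

The same fit/predict pattern applies to every estimator in the library, which is what makes the API so consistent.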

Lasso regression

Lasso is a clever modification of the multiple regression model that automatically excludes features that have little relevance to the accuracy of predictions. It uses a regularization strategy to perform variable selection, in order to try to enhance the prediction accuracy of the multiple regression model. The equation that the lasso regression model uses to make predictions is the same as in the multiple regression case: a linear combination of all the features, each multiplied by its own coefficient. The modification is made in the quantity that the algorithm is trying to minimize; if we have P predictors, then the problem now is to find the combination of weights (w) that will minimize the following quantity:
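The equation image did not survive extraction; as a reconstruction consistent with the description above, the standard lasso objective is:

```latex
\min_{w_0,\, w} \; \sum_{i=1}^{N} \left( y_i - w_0 - \sum_{j=1}^{P} w_j x_{ij} \right)^2 \;+\; \alpha \sum_{j=1}^{P} \lvert w_j \rvert
```

Here α controls the strength of the penalty; implementations often scale the first term by a constant factor such as 1/(2N), which does not change the character of the solution.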

Note that the first part of the quantity is almost the same as in the case of the MLR (except for the constant multiplying...
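Lasso's variable selection can be seen in a brief sketch (the synthetic data is invented for illustration; only the first two features are truly relevant to the target):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The target depends only on the first two features; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1)
model.fit(X, y)

# The L1 penalty shrinks the coefficients of the irrelevant
# features to (essentially) zero, excluding them from the model.
print(model.coef_)
```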

KNN

KNN is a method that can be used for both regression and classification problems. It belongs to the class of non-parametric models because, unlike parametric models, its predictions are not based on the calculation of any parameters. Examples of parametric models are the regression models we just discussed, where the weights are the parameters. Despite its simplicity (or perhaps because of it), KNN frequently produces very good results, comparable to those produced by more complex and elaborate models. In its most basic implementation, it is easy to understand how it works: for a fixed number K (the number of neighbors) and a given observation whose target value we want to predict, do the following:

  • Find the K data points that are closest in their feature...
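The procedure can be sketched with scikit-learn's KNeighborsRegressor on a tiny invented dataset (chosen so the nearest neighbors are easy to verify by hand):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# K = 2 neighbors; prediction is the average of their target values.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)

# For x = 2.2, the two closest points are x = 2 and x = 3,
# so the prediction is (2 + 3) / 2 = 2.5.
pred = knn.predict([[2.2]])
print(pred)
```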

Training versus testing error

The point of splitting the dataset into training and testing sets was to simulate the situation of using the model to make predictions on data the model has not seen. As we said before, the whole point is to generalize what we have learned from the observed data. The training MSE (or any metric calculated on the training dataset) may give us a biased view of the performance of our model, especially because of the possibility of overfitting. The metrics of performance we get from the training dataset will tend to be too optimistic. Let's take a look again at our illustration of overfitting:

If we calculate the training MSE for these three cases, we will definitely get the lowest (and hence apparently the best) value for the third model, the degree-16 polynomial; as we can see, the model passes through many of the points, making the error for those points exactly 0. However...
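The pattern of training MSE shrinking as model complexity grows can be reproduced with a small sketch (the noisy sine data and degrees here are invented for illustration, echoing the degree-16 polynomial above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=20)

# Training MSE for polynomials of increasing degree: the most
# complex model fits the training points almost exactly.
mses = {}
for degree in (1, 3, 16):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    mses[degree] = mean_squared_error(y, model.predict(x))
    print(degree, round(mses[degree], 4))
```

The degree-16 model posts the lowest training MSE, yet this says nothing about how it would perform on unseen data, which is exactly why we hold out a test set.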

Summary

This was a dense chapter! We introduced some of the most important concepts of ML; we know that ML has three main branches, supervised, unsupervised, and reinforcement learning, and that we will be using only supervised learning in this book. Supervised learning has two types of tasks, regression and classification, whose only difference is the type of target we want to predict. We also talked about the very abstract concepts of hypothesis set and learning algorithm, and we even invented our (very bad) pseudo-ML model.

We also talked about the very important concept of generalization, which is the whole point of building ML models: to be able to learn how to map the features to the target using the data we have, and then use this knowledge to make predictions with data that we don't have yet. Cross-validation is a set of techniques to evaluate models; the most basic...

Further reading

  • Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. Springer Series in Statistics.
  • Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
  • Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.
  • Raschka, S., & Mirjalili, V. (2017). Python Machine Learning. Packt Publishing.
  • Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems (pp. 1473-1480).