Hands-On Data Analysis with Pandas - Second Edition


Chapter 10: Making Better Predictions – Optimizing Models

In the previous chapter, we learned how to build and evaluate our machine learning models. However, we didn't touch upon what we can do to improve their performance. Of course, we could try a different model and see if it performs better, unless requirements dictate that we use a specific method, whether for legal reasons or so that we can explain how it works. We want to make sure we use the best version of the model that we can, and for that, we need to discuss how to tune our models.

This chapter will introduce techniques for optimizing machine learning model performance using scikit-learn, continuing from Chapter 9, Getting Started with Machine Learning in Python. Nonetheless, it should be noted that there is no panacea: it is entirely possible to try everything we can think of and still have a model with little predictive value; such is the nature of modeling...

Chapter materials

In this chapter, we will be working with three datasets. The first two come from data on wine quality donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php) by P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. They contain information on the chemical properties of various wine samples, along with a rating of each sample's quality from a blind tasting session by a panel of wine experts. These files can be found in the data/ folder inside this chapter's folder in the GitHub repository (https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_10) as winequality-red.csv and winequality-white.csv for red and white wine, respectively.

Our third dataset was collected using the Open Exoplanet Catalogue database (https://github.com/OpenExoplanetCatalogue/open_exoplanet_catalogue/), which provides data in XML format. The parsed planet data can be found in the data/planets.csv file. For the exercises...
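As a quick orientation, the files might be read in with pandas as in the following sketch, which assumes the repository's data/ layout and comma-separated files (the UCI originals use semicolons, so adjust the sep argument if working from those):

import pandas as pd

# chemical properties plus a quality rating for each wine sample
red_wine = pd.read_csv('data/winequality-red.csv')
white_wine = pd.read_csv('data/winequality-white.csv')

# parsed planet data from the Open Exoplanet Catalogue
planets = pd.read_csv('data/planets.csv')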

Hyperparameter tuning with grid search

No doubt you have noticed that we can provide various parameters to model classes when we instantiate them. These parameters are not learned from the data itself and are referred to as hyperparameters. Some examples of these are regularization terms, which we will discuss later in this chapter, and weights. Through the process of model tuning, we seek to optimize our model's performance by tuning these hyperparameters.
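For instance, with scikit-learn, hyperparameters are the arguments we pass when instantiating the model class; the values below are arbitrary, for illustration only:

from sklearn.tree import DecisionTreeClassifier

# max_depth and min_samples_leaf are hyperparameters: we choose them
# up front rather than learning them from the data during fit()
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=0)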

How can we know we are picking the best values to optimize our model's performance? One way is to use a technique called grid search to tune these hyperparameters. Grid search allows us to define a search space and test every combination of hyperparameters in that space, keeping the combination that results in the best model according to the scoring criterion we define.
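A minimal sketch of this with scikit-learn's GridSearchCV follows; the estimator, search space, and scoring metric are illustrative choices, and X_train and y_train are assumed to exist:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# the search space: every combination of these values is tested
search_space = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 10],
}

# 5-fold cross-validated grid search; `scoring` sets the criterion
# used to pick the best combination (macro-averaged F1 here)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    search_space, cv=5, scoring='f1_macro'
)
grid.fit(X_train, y_train)

print(grid.best_params_)  # hyperparameters of the best model
print(grid.best_score_)   # its cross-validated score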

Remember the elbow point method we discussed in Chapter 9, Getting Started with Machine Learning in Python, for finding a good...

Feature engineering

When trying to improve performance, we may also consider ways to provide the best features (model inputs) to our model through the process of feature engineering. The Preprocessing data section in Chapter 9, Getting Started with Machine Learning in Python, introduced us to feature transformation when we scaled, encoded, and imputed our data. Unfortunately, feature transformation may mute some elements of our data that we want to use in our model, such as the unscaled value of the mean of a specific feature. For this situation, we can create a new feature with this value; this and other new features are added during feature construction (sometimes called feature creation).
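For example, with the wine data in a DataFrame called df, feature construction might look like the following sketch (the column names are assumptions based on the dataset description):

# preserve information that scaling would erase: flag samples whose
# alcohol content is above the unscaled mean, before any scaling
df['above_mean_alcohol'] = (df.alcohol > df.alcohol.mean()).astype(int)

# derive a new feature from existing ones
df['total_acidity'] = df['fixed acidity'] + df['volatile acidity']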

Feature selection is the process of determining which features to train the model on. This can be done manually or through another process, such as machine learning. When choosing features for our model, we want features that have an impact on our dependent variable without unnecessarily...
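As a sketch of one programmatic approach in scikit-learn (the scorer and k are arbitrary choices here, and X_train and y_train are assumed to exist):

from sklearn.feature_selection import SelectKBest, f_classif

# keep the five features with the strongest univariate relationship
# to the target (ANOVA F-statistic, suited to classification)
selector = SelectKBest(f_classif, k=5)
X_selected = selector.fit_transform(X_train, y_train)
print(selector.get_support())  # Boolean mask of the features kept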

Ensemble methods

Ensemble methods combine many models (often weak ones) to create a stronger one that either minimizes the average error between observed and predicted values (the bias) or improves how well it generalizes to unseen data (minimizes the variance). We have to strike a balance between complex models, which may increase variance as they tend to overfit, and simple models, which may have high bias as they tend to underfit. This is called the bias-variance trade-off, illustrated in the following subplots:

Figure 10.11 – The bias-variance trade-off

Ensemble methods can be broken down into three categories: boosting, bagging, and stacking. Boosting trains many weak learners, which learn from each other's mistakes to reduce bias, making a stronger learner. Bagging, on the other hand, short for bootstrap aggregation, trains many models on bootstrap samples of the data and aggregates the results (using voting for classification...
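A brief sketch of what these look like in scikit-learn, assuming X_train and y_train exist (the estimators and their settings are illustrative):

from sklearn.ensemble import (
    GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
)
from sklearn.linear_model import LogisticRegression

# bagging: a random forest aggregates many trees, each trained on a
# bootstrap sample of the data
bagged = RandomForestClassifier(n_estimators=100, random_state=0)

# boosting: each tree tries to correct the errors of the ones before it
boosted = GradientBoostingClassifier(random_state=0)

# voting: combine different models and take the majority vote
voter = VotingClassifier([
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', bagged),
])
voter.fit(X_train, y_train)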

Inspecting classification prediction confidence

As we saw with ensemble methods, when we know the strengths and weaknesses of our model, we can employ strategies to attempt to improve performance. We may have two models for the same classification task that most likely won't agree on everything. Say we know that one does better on edge cases, while the other is better on the more common ones; in that case, we would likely want to investigate a voting classifier to improve our performance. How can we know how the models perform in different situations, though?

By looking at the probabilities the model predicts for an observation belonging to each class, we can gain insight into how confident our model is when it is correct and when it errs. We can use our pandas data wrangling skills to make quick work of this. Let's see how confident our original white_or_red model from Chapter 9, Getting Started with Machine Learning in Python, was in its predictions:

>...
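A sketch of this kind of confidence check, assuming white_or_red, X_test, and y_test carry over from Chapter 9:

import pandas as pd

# probability of each class for every test observation
proba = white_or_red.predict_proba(X_test)

# pair each prediction's confidence with whether it was correct
results = pd.DataFrame({
    'is_correct': white_or_red.predict(X_test) == y_test,
    'confidence': proba.max(axis=1),  # confidence in the predicted class
})

# compare confidence when the model is right versus when it errs
print(results.groupby('is_correct').confidence.describe())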

Addressing class imbalance

When faced with a class imbalance in our data, we may want to try to balance the training data before we build a model around it. To do so, we can use one of the following imbalanced sampling techniques:

  • Over-sample the minority class.
  • Under-sample the majority class.

In the case of over-sampling, we sample a larger proportion from the minority class to get closer to the number of observations in the majority class; this may involve a technique such as bootstrapping or generating new data similar to the existing values (using machine learning algorithms such as nearest neighbors). Under-sampling, on the other hand, uses less data overall by reducing the number of observations taken from the majority class. The decision to use over-sampling or under-sampling depends on the amount of data we started with and, in some cases, on computational costs. In practice, we wouldn't try either of these without first trying to build the model...
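As a sketch of both approaches, assuming the imbalanced-learn package (imblearn) is installed and that X_train and y_train exist; note that we resample only the training data:

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# over-sample the minority class by resampling with replacement...
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# ...or by synthesizing new minority samples from nearest neighbors
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)

# under-sample the majority class instead
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)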

Regularization

When working with regressions, we may look to add a penalty term to our regression equation to reduce overfitting by penalizing certain coefficient choices the model makes; this is called regularization. We are looking for the coefficients that will minimize this penalty term. The idea is to shrink the coefficients toward zero for features that don't contribute much to reducing the error of the model. Some common techniques are ridge regression, LASSO (short for Least Absolute Shrinkage and Selection Operator) regression, and elastic net regression, which combines the LASSO and ridge penalty terms. Note that, since these techniques rely on the magnitude of the coefficients, the data should be scaled beforehand.

Ridge regression, also called L2 regularization, punishes high coefficients (β) by adding the sum of the squares of the coefficients to the cost function (which regression looks to minimize when fitting), as per the following penalty term:

...
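In scikit-learn, these penalties are available through the Ridge, Lasso, and ElasticNet estimators. A minimal sketch follows, scaling first since the penalties depend on coefficient magnitude; the alpha value is arbitrary, and X_train and y_train are assumed to exist:

from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scale first: the penalty depends on the size of the coefficients
model = Pipeline([
    ('scale', StandardScaler()),
    # L2 penalty; swap in Lasso (L1) or ElasticNet (both) as needed
    ('regression', Ridge(alpha=1.0)),
]).fit(X_train, y_train)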

Summary

In this chapter, we reviewed various techniques we can employ to improve model performance. We learned how to use grid search to find the best hyperparameters in a search space, and how to tune our model using the scoring metric of our choosing with GridSearchCV. This means we don't have to accept the default in the score() method of our model and can customize it to our needs.

In our discussion of feature engineering, we learned how to reduce the dimensionality of our data using techniques such as PCA and feature selection. We saw how to use the PolynomialFeatures class to add interaction terms to models with categorical and numerical features. Then, we learned how to use the FeatureUnion class to augment our training data with transformed features. In addition, we saw how decision trees can help us understand which features in the data contribute most to the classification or regression task at hand, using feature importances. This helped us see the importance of...

Exercises

Complete the following exercises to practice the skills covered in this chapter. Be sure to consult the Machine learning workflow section in the Appendix as a refresher on the process of building models:

  1. Predict star temperature with elastic net linear regression as follows (a starter scaffold is sketched after step f):

    a) Using the data/stars.csv file, build a pipeline to normalize the data with a MinMaxScaler object and then run elastic net linear regression using all the numeric columns to predict the temperature of the star.

    b) Run grid search on the pipeline to find the best values for alpha, l1_ratio, and fit_intercept for the elastic net in the search space of your choice.

    c) Train the model on 75% of the initial data.

    d) Calculate the R² of your model.

    e) Find the coefficients for each regressor and the intercept.

    f) Visualize the residuals using the plot_residuals() function from the ml_utils.regression module.
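A possible starting scaffold for this exercise (a sketch only: the temperature column name and the search-space values are assumptions, and steps e) and f) are left to the reader):

import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

stars = pd.read_csv('data/stars.csv')
X = stars.drop(columns='temperature').select_dtypes('number')  # column name assumed
y = stars.temperature

# train on 75% of the data, as the exercise specifies
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)

pipeline = Pipeline([('scale', MinMaxScaler()), ('en', ElasticNet())])
search_space = {  # illustrative values only
    'en__alpha': [0.1, 1, 10],
    'en__l1_ratio': [0.1, 0.5, 0.9],
    'en__fit_intercept': [True, False],
}
grid = GridSearchCV(pipeline, search_space, cv=5).fit(X_train, y_train)
print(grid.score(X_test, y_test))  # R² on the held-out 25%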

  2. Perform multiclass classification of white wine quality using a support vector machine and feature...

Further reading

Check out the following resources for more information on the topics covered in this chapter:
