scikit-learn Cookbook

Over 50 recipes to incorporate scikit-learn into every step of the data science pipeline, from feature extraction to model building and model evaluation

Trent Hauck


Book Details

ISBN 13: 9781783989485
Paperback: 214 pages

Book Description

Python is quickly becoming the go-to language for analysts and data scientists thanks to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. Its consistent API and rich set of features make it well suited to a wide range of machine learning problems.

The book starts by walking through different methods to prepare your data, whether that means filling in missing values or turning the categories in text columns into indicator variables. Once the data is ready, you'll learn techniques aligned with different objectives, from datasets with known outcomes, such as sales by state, to more open-ended problems, such as clustering similar customers. Finally, you'll learn how to polish your algorithms so that they are both accurate and robust to new data.
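The preparation workflow described above (imputing missing values, turning categorical columns into indicator variables, then fitting a model) can be sketched with current scikit-learn classes. The book predates some of these names (for example `SimpleImputer` and `ColumnTransformer`), so this is an illustrative sketch rather than code from the book:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: a numeric column with a missing value and a categorical column.
X = pd.DataFrame({
    "sales": [100.0, np.nan, 150.0, 90.0],
    "state": ["WA", "OR", "WA", "CA"],
})
y = np.array([0, 1, 1, 0])

# Impute the numeric column; one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["sales"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["state"]),
])

# Chain preprocessing and a classifier into one estimator.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
print(model.predict(X))
```

Because the preprocessing lives inside the Pipeline, the same imputation and encoding are applied automatically to any new data passed to `predict`.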

Table of Contents

Chapter 1: Premodel Workflow
Introduction
Getting sample data from external sources
Creating sample data for toy analysis
Scaling data to the standard normal
Creating binary features through thresholding
Working with categorical variables
Binarizing label features
Imputing missing values through various strategies
Using Pipelines for multiple preprocessing steps
Reducing dimensionality with PCA
Using factor analysis for decomposition
Kernel PCA for nonlinear dimensionality reduction
Using truncated SVD to reduce dimensionality
Decomposition to classify with DictionaryLearning
Putting it all together with Pipelines
Using Gaussian processes for regression
Defining the Gaussian process object directly
Using stochastic gradient descent for regression
Chapter 2: Working with Linear Models
Introduction
Fitting a line through data
Evaluating the linear regression model
Using ridge regression to overcome linear regression's shortfalls
Optimizing the ridge regression parameter
Using sparsity to regularize models
Taking a more fundamental approach to regularization with LARS
Using linear methods for classification – logistic regression
Directly applying Bayesian ridge regression
Using boosting to learn from errors
Chapter 3: Building Models with Distance Metrics
Introduction
Using KMeans to cluster data
Optimizing the number of centroids
Assessing cluster correctness
Using MiniBatch KMeans to handle more data
Quantizing an image with KMeans clustering
Finding the closest objects in the feature space
Probabilistic clustering with Gaussian Mixture Models
Using KMeans for outlier detection
Using k-NN for regression
Chapter 4: Classifying Data with scikit-learn
Introduction
Doing basic classifications with Decision Trees
Tuning a Decision Tree model
Using many Decision Trees – random forests
Tuning a random forest model
Classifying data with support vector machines
Generalizing with multiclass classification
Using LDA for classification
Working with QDA – a nonlinear LDA
Using Stochastic Gradient Descent for classification
Classifying documents with Naïve Bayes
Label propagation with semi-supervised learning
Chapter 5: Postmodel Workflow
Introduction
K-fold cross validation
Automatic cross validation
Cross validation with ShuffleSplit
Stratified k-fold
Poor man's grid search
Brute force grid search
Using dummy estimators to compare results
Regression model evaluation
Feature selection
Feature selection on L1 norms
Persisting models with joblib
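Several of the Chapter 1 recipes (scaling to the standard normal, PCA for dimensionality reduction, Pipelines, stochastic gradient descent) combine naturally into a single estimator. A minimal sketch of that combination, using the digits dataset bundled with scikit-learn rather than any example from the book:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scale features, project to 20 principal components, then fit an SGD classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("sgd", SGDClassifier(random_state=0)),
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 3))
```

The number of components (20) is an arbitrary illustrative choice; in practice it would itself be tuned with the cross-validation tools covered in Chapter 5.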

What You Will Learn

  • Work with algorithms of varying complexity while learning to analyze data along the way
  • Handle common data problems such as feature extraction and missing data
  • Evaluate your models both on their own terms and against other models
  • Pick up just enough math to reason about the connections between various algorithms
  • Customize machine learning algorithms to fit your problem, and learn how to modify them when the situation calls for it
  • Incorporate other packages from the Python ecosystem to munge and visualize your dataset
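The evaluation and tuning points above (cross validation, grid search) come together in scikit-learn's `GridSearchCV`. A minimal sketch using the bundled iris dataset; the specific parameter grid is illustrative, not taken from the book:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Exhaustively search over tree depth, scoring each candidate
# with 5-fold cross validation.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, None]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

After fitting, `grid.best_estimator_` is a tree refit on the full dataset with the winning depth, ready for prediction or persistence with joblib.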
