5. Model Validation and Optimization

Overview

In this chapter, you will learn how to use k-fold cross-validation to test model performance, as well as how to use validation curves to optimize model parameters. You will also learn how to implement dimensionality reduction techniques such as Principal Component Analysis (PCA). By the end of this chapter, you will have completed an end-to-end machine learning project and produced a final model that can be used to make business decisions.

Introduction

As we've seen in the previous chapters, it's easy to train models with scikit-learn using just a few lines of Python code. This is made possible by abstracting away the computational complexity of the algorithm, including details such as constructing cost functions and optimizing model parameters. In other words, we deal with a black box whose internal operations are hidden from us.

While the simplicity offered by this approach is quite nice on the surface, it does nothing to prevent the misuse of algorithms—for example, by selecting the wrong model for a dataset, overfitting on the training set, or failing to test properly on unseen data.

In this chapter, we'll show you how to avoid some of these pitfalls while training classification models and equip you with the tools to produce trustworthy results. We'll introduce k-fold cross-validation and validation curves, and then look at ways to use them in Jupyter.
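
As a quick preview of the second of these tools, here is a minimal sketch of a validation curve built with scikit-learn's validation_curve helper. The KNN model, the n_neighbors parameter range, and the example dataset are illustrative assumptions on our part, not the chapter's own choices:

    # Hypothetical example: sweep a KNN hyperparameter and score each
    # value with cross-validation (the basis of a validation curve)
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import validation_curve
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    param_range = np.arange(1, 21)

    train_scores, val_scores = validation_curve(
        KNeighborsClassifier(), X, y,
        param_name="n_neighbors", param_range=param_range, cv=5)

    # Plotting the mean train/validation scores against param_range
    # gives the validation curve; the peak of the validation score
    # suggests a good parameter value
    best = param_range[val_scores.mean(axis=1).argmax()]
    print("best n_neighbors:", best)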

We'll also introduce...

Assessing Models with k-Fold Cross-Validation

Thus far, we have trained models on a subset of the data and then assessed performance on the unseen portion, called the test set. This is good practice because the model's performance on the data that's used for training is not a good indicator of its effectiveness as a predictor. It's very easy to increase accuracy on a training dataset by overfitting a model, which results in poorer performance on unseen data.
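
To make this concrete, here is a small sketch showing how a model can score perfectly on its training data yet noticeably worse on held-out data. The dataset and the unconstrained decision tree are our own illustrative choices, not the chapter's:

    # Hypothetical example: an unconstrained decision tree overfits,
    # so its training accuracy overstates its real performance
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)
    print("train accuracy:", model.score(X_train, y_train))  # typically 1.0
    print("test accuracy:", model.score(X_test, y_test))     # noticeably lower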

That being said, simply training models on data that's been split in this way is not good enough. There is natural variance in the data that causes accuracies to differ (even if only slightly) depending on the training and test splits. Furthermore, using only one training/test split to compare models can introduce bias toward certain models and lead to overfitting.

k-Fold cross-validation offers a solution to this problem and allows the variance to be accounted for by way of an error estimate on each accuracy...
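
In scikit-learn, this takes only a few lines via cross_val_score; the model and dataset below are illustrative assumptions rather than the chapter's own:

    # Hypothetical example: 10-fold cross-validation with scikit-learn.
    # Each fold serves once as the validation set while the model is
    # trained on the remaining nine folds
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)

    # The spread of the fold scores provides the error estimate on
    # the accuracy
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")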

Dimensionality Reduction with PCA

Dimensionality reduction can be as simple as removing unimportant features from the training data. However, it's usually not obvious that removing a set of features will boost model performance. Even features that are highly noisy may offer some valuable information that models can learn from. For these reasons, we should know about better methods for reducing data dimensionality, such as the following:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)

These techniques allow for data compression, where the most important information from a large group of features can be encoded in just a few features.

In this section, we'll focus on PCA. This technique transforms the data by projecting it onto a new subspace of orthogonal principal components, where the components with the largest eigenvalues encode the most information for training the model. Then, we can simply select a set of...
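
As a minimal sketch of what this looks like in scikit-learn (the example dataset and the choice of two components are illustrative assumptions):

    # Hypothetical example: compress a many-featured dataset down to
    # its top two principal components
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)  # 30 features

    # PCA is sensitive to feature scale, so standardize first
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)     # shape (n_samples, 2)

    # Fraction of the total variance captured by each retained component
    print(pca.explained_variance_ratio_)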

Summary

In this chapter, we have seen how to use Jupyter Notebooks to perform parameter optimization and model selection.

We built upon the work we did in the previous chapter, where we trained predictive classification models for our binary problem and saw how decision boundaries are drawn for SVM, KNN, and Random Forest models. We improved on these simple models by using validation curves to optimize parameters and explored how dimensionality reduction can improve model performance as well.

Finally, at the end of the last exercise, we explored how the final model can be used in practice to make data-driven decisions. This demonstration connected our results back to the original business problem that motivated our modeling work.

In the next chapter, we will depart from machine learning and focus on data acquisition instead. Specifically, we will discuss methods for extracting web data and learn about HTTP requests, web scraping with Python, and more data processing...
