You're reading from  Applied Data Science with Python and Jupyter

Product type: Book
Published in: Oct 2018
Reading level: Beginner
ISBN-13: 9781789958171
Edition: 1st Edition
Author: Alex Galea

Alex Galea has been practicing data analytics professionally since graduating with a master's degree in physics from the University of Guelph, Canada. He developed a keen interest in Python while researching quantum gases as part of his graduate studies. Alex currently works in web data analytics, where Python continues to play a key role in his work. He blogs frequently about data-centric projects that involve Python and Jupyter Notebooks.

Chapter 2. Data Cleaning and Advanced Machine Learning

Note

Learning Objectives

By the end of this chapter, you will be able to:

  • Plan a machine learning classification strategy

  • Preprocess data to prepare it for machine learning

  • Train classification models

  • Use validation curves to tune model parameters

  • Use dimensionality reduction to enhance model performance

Note

In this chapter, you will learn data preprocessing and machine learning by completing several practical exercises.

Introduction


Consider a small food-delivery business that is looking to optimize its product. An analyst might look at the appropriate data and determine what type of food people are enjoying most. Perhaps they find that a large number of people are ordering the spiciest food options, indicating the business might be losing out on customers who desire something even spicier. This is quite basic, or as some might say, "vanilla" analytics.

In a separate task, the analyst could employ predictive analytics by modeling the order volumes over time. With enough data, they could predict the future order volumes and therefore guide the restaurant as to how many staff are required each day. This model could take factors such as the weather into account to make the best predictions. For instance, a heavy rainstorm could be an indicator to staff more delivery personnel to make up for slow travel times. With historical weather data, that type of signal could be encoded into the model. This prediction...

Preparing to Train a Predictive Model


Here, we will cover the preparation required to train a predictive model. Although not as technically glamorous as training the models themselves, this step should not be taken lightly. It's very important to ensure you have a good plan before proceeding with the details of building and training a reliable model. Furthermore, once you've decided on the right plan, there are technical steps in preparing the data for modeling that should not be overlooked.

Note

We must be careful not to go so deep into the weeds of technical tasks that we lose sight of the goal. Technical tasks include things that require programming skills, for example, constructing visualizations, querying databases, and validating predictive models. It's easy to spend hours trying to implement a specific feature or get the plots looking just right. Doing this sort of thing is certainly beneficial to our programming skills, but we should not forget to ask ourselves if it's really worth...

Training Classification Models


As you saw in the previous chapter, libraries such as scikit-learn and platforms such as Jupyter let us train predictive models in just a few lines of code. This is possible because the difficult computations involved in optimizing model parameters are abstracted away. In other words, we deal with a black box whose internal operations are hidden. With this simplicity comes the danger of misusing algorithms, for example, by overfitting during training or failing to properly test on unseen data. We'll show how to avoid these pitfalls while training classification models and produce trustworthy results with the use of k-fold cross-validation and validation curves.
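
As a rough illustration of the two safeguards named above, the following sketch scores a classifier with k-fold cross-validation and then computes a validation curve over one hyperparameter. It is not the chapter's own example: the synthetic dataset, the decision tree model, the choice of five folds, and the max_depth range are all assumptions made for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    validation_curve,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# k-fold cross-validation: every sample is used for testing exactly once,
# so the score is always measured on data the model did not train on
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print("accuracy per fold:", scores.round(3))
print("mean accuracy: %.3f" % scores.mean())

# Validation curve: training and validation scores as max_depth grows.
# A widening gap between the two suggests overfitting.
depths = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=cv,
)
for depth, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print("max_depth=%2d  train=%.3f  validation=%.3f" % (depth, tr, te))
```

A reasonable reading of the output is to pick the depth where the validation score levels off or peaks, rather than the depth with the highest training score.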

Introduction to Classification Algorithms

Recall the two types of supervised machine learning: regression and classification. In regression, we predict a continuous target variable. For example, recall the linear and polynomial models from the first chapter. In this chapter...

Summary


In this chapter, we have seen how predictive models can be trained in Jupyter Notebooks.

To begin with, we talked about how to plan a machine learning strategy. We thought about how to design a plan that can lead to actionable business insights and stressed the importance of using the data to help set realistic business goals. We also explained machine learning terminology such as supervised learning, unsupervised learning, classification, and regression.

Next, we discussed methods for preprocessing data using scikit-learn and pandas. This included lengthy discussions and examples of a surprisingly time-consuming part of machine learning: dealing with missing data.
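
For readers who want a quick reminder of what dealing with missing data can look like, here is a minimal sketch using pandas and scikit-learn. The toy table, its column names, and the choice of median and mode as fill values are hypothetical, and the code assumes a scikit-learn version that provides SimpleImputer (0.20 or later); it is not the chapter's own dataset or method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table with gaps in numeric and categorical columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "salary": [50000.0, 62000.0, np.nan, 58000.0],
    "department": ["sales", None, "support", "sales"],
})

# Step 1: count the missing values in each column
print(df.isnull().sum())

# Option A: drop every row that has at least one missing value
dropped = df.dropna()

# Option B: keep all rows and fill the gaps instead
filled = df.copy()
num_cols = ["age", "salary"]
# Numeric columns: replace NaN with the column median
filled[num_cols] = SimpleImputer(strategy="median").fit_transform(filled[num_cols])
# Categorical column: replace missing entries with the most frequent value
filled["department"] = filled["department"].fillna(filled["department"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is simpler but discards half of this toy table, while filling keeps every row at the cost of introducing estimated values. Which trade-off is acceptable depends on how much data is missing and why.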

In the latter half of the chapter, we trained predictive classification models for our binary problem, comparing how decision boundaries are drawn for various models such as the SVM, k-Nearest Neighbors, and Random Forest. We then showed how validation curves can be used to make good parameter choices and how dimensionality...
