Hands-On Gradient Boosting with XGBoost and scikit-learn

By Corey Wade
About this book
XGBoost is an industry-proven, open-source software library that provides a gradient boosting framework for scaling billions of data points quickly and efficiently. The book introduces machine learning and XGBoost in scikit-learn before building up to the theory behind gradient boosting. You’ll cover decision trees and analyze bagging in the machine learning context, learning hyperparameters that extend to XGBoost along the way. You’ll build gradient boosting models from scratch and extend gradient boosting to big data while recognizing speed limitations using timers. Details in XGBoost are explored with a focus on speed enhancements and deriving parameters mathematically. With the help of detailed case studies, you’ll practice building and fine-tuning XGBoost classifiers and regressors using scikit-learn and the original Python API. You'll leverage XGBoost hyperparameters to improve scores, correct missing values, scale imbalanced datasets, and fine-tune alternative base learners. Finally, you'll apply advanced XGBoost techniques like building non-correlated ensembles, stacking models, and preparing models for industry deployment using sparse matrices, customized transformers, and pipelines. By the end of the book, you’ll be able to build high-performing machine learning models using XGBoost with minimal errors and maximum speed.
Publication date: October 2020
Publisher: Packt
Pages: 310
ISBN: 9781839218354

 

Chapter 1: Machine Learning Landscape

Welcome to Hands-On Gradient Boosting with XGBoost and scikit-learn, a book that will teach you the foundations, tips, and tricks of XGBoost, one of the best machine learning algorithms for making predictions from tabular data.

The focus of this book is XGBoost, also known as Extreme Gradient Boosting. The structure, function, and raw power of XGBoost will be fleshed out in increasing detail in each chapter. The chapters unfold to tell an incredible story: the story of XGBoost. By the end of this book, you will be an expert in leveraging XGBoost to make predictions from real data.

In the first chapter, XGBoost is presented in a sneak preview. It makes a guest appearance in the larger context of machine learning regression and classification to set the stage for what's to come. 

This chapter focuses on preparing data for machine learning, a process also known as data wrangling. In addition to building machine learning models, you will...

 

Previewing XGBoost

Machine learning gained recognition with the first neural network in the 1940s, followed by the first machine learning checkers program in the 1950s. After some quiet decades, the field of machine learning took off when Deep Blue famously beat world chess champion Garry Kasparov in the 1990s. With a surge in computational power, the 1990s and early 2000s produced a plethora of academic papers revealing new machine learning algorithms such as random forests and AdaBoost.

The general idea behind boosting is to transform weak learners into strong learners by iteratively improving upon errors. The key idea behind gradient boosting is to use gradient descent to minimize errors: each new learner is fit to the residuals, the errors left behind by the learners that came before it. This evolutionary strand, from standard machine learning algorithms to gradient boosting, is the focus of the first four chapters of this book.
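
To make this concrete, here is a minimal sketch of the residual-fitting loop at the heart of gradient boosting, assuming squared-error loss (where the negative gradient equals the residuals) and using synthetic data purely for illustration:

    # A minimal sketch of gradient boosting for regression: each weak
    # learner is fit to the residuals of the ensemble built so far.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(seed=2)
    X = rng.uniform(0, 10, size=(200, 1))           # synthetic inputs
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

    learning_rate = 0.1
    prediction = np.full_like(y, y.mean())          # start from the mean

    for _ in range(50):
        residuals = y - prediction                  # errors left so far
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)                      # weak learner fits the residuals
        prediction += learning_rate * tree.predict(X)

    print(f"Training RMSE: {np.sqrt(np.mean((y - prediction) ** 2)):.3f}")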

XGBoost is short for Extreme Gradient Boosting. The Extreme part refers to pushing the limits of computation to achieve gains...

 

Data wrangling

Data wrangling is a comprehensive term that encompasses the various stages of data preprocessing before machine learning can begin. Data loading, data cleaning, data analysis, and data manipulation are all included within the sphere of data wrangling.

This first chapter presents data wrangling in detail. The examples are meant to cover standard data wrangling challenges that can be swiftly handled by pandas, Python's go-to library for data analytics. No prior experience with pandas is required, though basic familiarity will be beneficial. All code is explained so that readers new to pandas may follow along.

Dataset 1 – Bike rentals

The bike rentals dataset is our first dataset. The data source is the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), a world-famous data warehouse that is free to the public. Our bike rentals dataset has been adjusted from the original dataset (https://archive.ics.uci...
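
Whichever copy of the adjusted data you use, loading it and running a first null-value check with pandas might look like this (the filename below is a placeholder, not the book's exact file):

    import pandas as pd

    # 'bike_rentals_cleaned.csv' stands in for your local copy of the
    # adjusted UCI bike rentals data
    df_bikes = pd.read_csv('bike_rentals_cleaned.csv')

    print(df_bikes.head())          # inspect the first five rows
    print(df_bikes.isna().sum())    # count null values per column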

 

Predicting regression

Machine learning algorithms aim to predict the values of one output column using data from one or more input columns. The predictions rely on mathematical equations determined by the general class of machine learning problems being addressed. Most supervised learning problems are classified as regression or classification. In this section, machine learning is introduced in the context of regression.

Predicting bike rentals

In the bike rentals dataset, df_bikes['cnt'] is the number of bike rentals on a given day. Predicting this column would be of great use to a bike rental company. Our problem is to predict the correct number of bike rentals on a given day based on data such as whether the day is a holiday or working day, the forecasted temperature, humidity, windspeed, and so on.
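
Since the chapter goes on to compare linear regression with XGBoost on this data, here is a hedged sketch of that comparison; it assumes df_bikes has been loaded as above, keeps numeric columns only, and drops 'casual' and 'registered', which (as noted below) sum to 'cnt' and would leak the target:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    # Drop the target and its leaky components; keep numeric features only
    X = df_bikes.drop(columns=['cnt', 'casual', 'registered']).select_dtypes('number')
    y = df_bikes['cnt']

    # Cross-validated RMSE for a linear baseline versus XGBoost
    for model in (LinearRegression(), XGBRegressor()):
        scores = cross_val_score(model, X, y, cv=5,
                                 scoring='neg_root_mean_squared_error')
        print(type(model).__name__, 'RMSE:', round(-scores.mean(), 1))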

According to the dataset, df_bikes['cnt'] is the sum of df_bikes['casual'] and df_bikes['registered']. If df_bikes[...

 

Predicting classification

You learned that XGBoost may have an edge in regression, but what about classification? XGBoost has a classification model, but will it perform as accurately as well-tested classification models such as logistic regression? Let's find out.

What is classification?

Unlike regression, classification predicts target columns with a limited number of possible outputs; a machine learning algorithm that predicts such columns is categorized as a classification algorithm. The possible outputs may include the following:

  • Yes, No

  • Spam, Not Spam

  • 0, 1

  • Red, Blue, Green, Yellow, Orange

Dataset 2 – The census

We will move a little more swiftly through the second dataset, the Census Income Data Set (https://archive.ics.uci.edu/ml/datasets/Census+Income), to predict personal income.

Data wrangling

Before implementing machine learning, the dataset must be preprocessed. When testing new algorithms, it's essential to have all numerical columns with no null values...
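
As an illustrative sketch under assumptions, rather than the book's exact recipe, the preprocessing and classifier comparison might look like the following; the filename and the 'income' column labels are assumptions based on the UCI Census Income data:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    # 'census_income.csv' is a placeholder filename for the Census Income data
    df_census = pd.read_csv('census_income.csv')
    df_census = df_census.dropna()                  # remove rows with null values

    # Binarize the target; stripping whitespace guards against ' >50K'-style labels
    y = (df_census['income'].str.strip() == '>50K').astype(int)

    # One-hot encode text columns so every feature is numerical
    X = pd.get_dummies(df_census.drop(columns=['income']))

    # Cross-validated accuracy for logistic regression versus XGBoost
    for model in (LogisticRegression(max_iter=1000), XGBClassifier()):
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(type(model).__name__, 'accuracy:', round(scores.mean(), 3))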

 

Summary

Your journey through XGBoost has officially begun! You started this chapter by learning the fundamentals of data wrangling and pandas, essential skills for all machine learning practitioners, with a focus on correcting null values. Next, you learned how to build machine learning models in scikit-learn by comparing linear regression with XGBoost. Then, you prepared a dataset for classification and compared logistic regression with XGBoost. In both cases, XGBoost was the clear winner.

Congratulations on building your first XGBoost models! Your initiation into data wrangling and machine learning using the pandas, NumPy, and scikit-learn libraries is complete.

In Chapter 2, Decision Trees in Depth, you will improve your machine learning skills by building decision trees, the base learners of XGBoost machine learning models, and fine-tuning hyperparameters to improve results.

About the Author
  • Corey Wade

    Corey Wade, M.S. Mathematics, M.F.A. Writing & Consciousness, is the founder and director of Berkeley Coding Academy where he teaches Machine Learning and AI to teens from all over the world. Additionally, Corey chairs the Math Department at Berkeley Independent Study where he has received multiple grants to run after-school coding programs to help bridge the tech skills gap. Additional experiences include teaching Natural Language Processing with Hello World, developing Data Science curricula with Pathstream, and publishing statistics and machine learning models with Towards Data Science, Springboard, and Medium.
