Chapter 1: Machine Learning Landscape
Welcome to Hands-On Gradient Boosting with XGBoost and Scikit-Learn, a book that will teach you the foundations, tips, and tricks of XGBoost, the best machine learning algorithm for making predictions from tabular data.
The focus of this book is XGBoost, also known as Extreme Gradient Boosting. The structure, function, and raw power of XGBoost will be fleshed out in increasing detail in each chapter. The chapters unfold to tell an incredible story: the story of XGBoost. By the end of this book, you will be an expert in leveraging XGBoost to make predictions from real data.
In the first chapter, XGBoost is presented in a sneak preview. It makes a guest appearance in the larger context of machine learning regression and classification to set the stage for what's to come.
This chapter focuses on preparing data for machine learning, a process also known as data wrangling. In addition to building machine learning models, you will...
Previewing XGBoost
Machine learning gained recognition with the first neural network in the 1940s, followed by the first machine learning checkers champion in the 1950s. After some quiet decades, the field of machine learning took off when Deep Blue famously beat world chess champion Garry Kasparov in the 1990s. With a surge in computational power, the 1990s and early 2000s produced a plethora of academic papers revealing new machine learning algorithms such as random forests and AdaBoost.
The general idea behind boosting is to transform weak learners into strong learners by iteratively improving upon errors. The key idea behind gradient boosting is to use gradient descent to minimize a loss function, with each new learner trained on the residuals (the errors) of the learners before it. This evolutionary strand, from standard machine learning algorithms to gradient boosting, is the focus of the first four chapters of this book.
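The idea of fitting each new weak learner to the residuals can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not XGBoost's implementation: the weak learner is a one-split "stump" on a single feature, the loss is squared error, and the function names (fit_stump, gradient_boost) and the toy data are hypothetical.

```python
# Minimal gradient boosting sketch for squared error.
# Assumptions: one input feature, one-split stumps as weak learners,
# and toy data made up for illustration.

def fit_stump(xs, residuals):
    """Return a one-split predictor that best fits the residuals."""
    best = None
    # Try every distinct x (except the largest) as a split threshold,
    # so that both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predict function and the training MSE after each round."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    history = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction suggested by the new stump.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
predict, history = gradient_boost(xs, ys)
print(f"MSE after round 1:  {history[0]:.3f}")
print(f"MSE after round 50: {history[-1]:.3f}")
```

Each stump on its own is a weak learner, but the training error shrinks as their scaled predictions are added together, which is the sense in which boosting turns weak learners into a strong one.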
XGBoost is short for Extreme Gradient Boosting. The Extreme part refers to pushing the limits of computation to achieve gains...
Data wrangling
Data wrangling is a comprehensive term that encompasses the various stages of data preprocessing before machine learning can begin. Data loading, data cleaning, data analysis, and data manipulation are all included within the sphere of data wrangling.
This first chapter presents data wrangling in detail. The examples are meant to cover standard data wrangling challenges that can be swiftly handled by pandas, Python's special library for handling data analytics. Although no experience with pandas is required, basic knowledge of pandas will be beneficial. All code is explained so that readers new to pandas may follow along.
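As a rough sketch of what a data analysis and data cleaning step can look like in pandas, the following counts null values and fills them with column medians. The DataFrame, its column names (temp, hum, windspeed), and its values are made up for illustration and are not the book's dataset; filling with the median is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration; this is not the bike rentals data.
df = pd.DataFrame({
    "temp": [0.34, np.nan, 0.20, 0.23],
    "hum": [0.81, 0.70, np.nan, 0.59],
    "windspeed": [0.16, 0.25, 0.25, 0.16],
})

# Data analysis: count the null values in each column.
null_counts = df.isna().sum()

# Data cleaning: fill each null with the median of its own column.
df_clean = df.fillna(df.median())

print(null_counts)
print(df_clean)
```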
Dataset 1 – Bike rentals
The bike rentals dataset is our first dataset. The data source is the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), a world-famous data warehouse that is free to the public. Our bike rentals dataset has been adjusted from the original dataset (https://archive.ics.uci...
Predicting regression
Machine learning algorithms aim to predict the values of one output column using data from one or more input columns. The predictions rely on mathematical equations determined by the general class of machine learning problems being addressed. Most supervised learning problems are classified as regression or classification. In this section, machine learning is introduced in the context of regression.
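To make the input-columns-to-output-column framing concrete, here is a small regression sketch using scikit-learn's LinearRegression. The data is synthetic and generated on the spot (an assumption for illustration, not the bike rentals dataset), so the numbers say nothing about real model performance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: two input columns and one continuous output column.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Hold out part of the data to score predictions on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root mean squared error: a common way to score regression predictions.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.3f}")
```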
Predicting bike rentals
In the bike rentals dataset, df_bikes['cnt'] is the number of bike rentals in a given day. Predicting this column would be of great use to a bike rental company. Our problem is to predict the correct number of bike rentals on a given day based on data such as whether this day is a holiday or working day, forecasted temperature, humidity, windspeed, and so on.
According to the dataset, df_bikes['cnt'] is the sum of df_bikes['casual'] and df_bikes['registered']. If df_bikes[...
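The relationship between these three columns can be checked directly in pandas. The sketch below uses a few toy rows for illustration rather than the full dataset. Dropping the two component columns afterwards is an assumption about a sensible next step, since columns that add up to the target would let a model recover it by simple addition.

```python
import pandas as pd

# Toy rows for illustration; column names follow the text.
df_bikes = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# Check that the target column equals the sum of its two components.
component_sum = df_bikes["casual"] + df_bikes["registered"]
is_sum = bool(component_sum.eq(df_bikes["cnt"]).all())
print(is_sum)

# Assumed next step: remove the components so they cannot leak the target.
df_model = df_bikes.drop(columns=["casual", "registered"])
print(list(df_model.columns))
```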
Predicting classification
You learned that XGBoost may have an edge in regression, but what about classification? XGBoost has a classification model, but will it perform as accurately as well-tested classification models such as logistic regression? Let's find out.
What is classification?
Unlike regression, which predicts continuous values, classification predicts a target column that has a limited number of possible outputs. A machine learning algorithm that makes such predictions is called a classification algorithm. The possible outputs may include the following:
Yes, No
Spam, Not Spam
0, 1
Red, Blue, Green, Yellow, Orange
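A target with a limited set of outputs can be illustrated with a small scikit-learn sketch. The data here is synthetic, generated by make_classification with two classes (0 and 1), and the model is logistic regression scored by cross-validated accuracy. All parameter choices are assumptions for illustration and do not reproduce the book's experiments.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data: every target value is 0 or 1.
X, y = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# The distinct target values form a small, fixed set of classes.
classes = sorted(set(y.tolist()))
print(classes)

# Score a classifier with 5-fold cross-validation.
# For classifiers, the default score is accuracy.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")
```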
Dataset 2 – The census
We will move a little more swiftly through the second dataset, the Census Income Data Set (https://archive.ics.uci.edu/ml/datasets/Census+Income), to predict personal income.
Data wrangling
Before implementing machine learning, the dataset must be preprocessed. When testing new algorithms, it's essential to have all numerical columns with no null values...
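One common way to reach an all-numerical table with no null values is to fill missing numbers and one-hot encode text columns. The sketch below does this with pandas on a made-up frame. The column names (age, education) and values are hypothetical stand-ins, and the exact steps the chapter applies to the census data may differ.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; not the actual census data.
df_census = pd.DataFrame({
    "age": [39.0, 50.0, np.nan, 53.0],
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
})

# Fill the missing numeric value with the column median.
df_census["age"] = df_census["age"].fillna(df_census["age"].median())

# Convert the text column into 0/1 indicator columns, one per category.
df_numeric = pd.get_dummies(df_census, columns=["education"], dtype=int)

print(df_numeric.dtypes)
print(df_numeric.isna().sum().sum())
```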
Summary
Your journey through XGBoost has officially begun! You started this chapter by learning the fundamentals of data wrangling and pandas, essential skills for all machine learning practitioners, with a focus on correcting null values. Next, you learned how to build machine learning models in scikit-learn by comparing linear regression with XGBoost. Then, you prepared a dataset for classification and compared logistic regression with XGBoost. In both cases, XGBoost was the clear winner.
Congratulations on building your first XGBoost models! Your initiation into data wrangling and machine learning using the pandas, NumPy, and scikit-learn libraries is complete.
In Chapter 2, Decision Trees in Depth, you will improve your machine learning skills by building decision trees, the base learners of XGBoost machine learning models, and fine-tuning hyperparameters to improve results.