Free eBook - The Data Science Workshop - Second Edition

5 (1 reviews total)
By Anthony So , Thomas V. Joseph , Robert Thas John and 2 more
  • A new free eBook every day on the latest in tech
  • 30 permanently free eBooks from our core tech library
  1. 1. Introduction to Data Science in Python
About this book

Where there’s data, there’s insight. With so much data being generated, there is immense scope to extract meaningful information that’ll boost business productivity and profitability. By learning to convert raw data into game-changing insights, you’ll open new career paths and opportunities.

The Data Science Workshop begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You’ll learn to select a relevant metric and even assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you’ll get hands-on with approaches such as grid search and random search.

Next, you’ll learn dimensionality reduction techniques to easily handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. In a bid to help you automatically create new features that improve your model, the book demonstrates how to use the automated feature engineering tool. You’ll also understand how to use the orchestration and scheduling workflow to deploy machine learning models in batch.

By the end of this book, you’ll have the skills to start working on data science projects confidently. By the end of this book, you’ll have the skills to start working on data science projects confidently.

Publication date:
August 2020


2. Regression


This chapter is an introduction to linear regression analysis and its application to practical problem-solving in data science. You will learn how to use Python, a versatile programming language, to carry out regression analysis and examine the results. The use of the logarithm function to transform inherently non-linear relationships between variables and to enable the application of the linear regression method of analysis will also be introduced.

By the end of this chapter, you will be able to identify and import the Python modules required for regression analysis; use the pandas module to load a dataset and prepare it for regression analysis; create a scatter plot of bivariate data and fit a regression line through it; use the methods available in the Python statsmodels module to fit a regression model to a dataset; explain the results of simple and multiple linear regression analysis; assess the goodness of fit of a linear regression model; and...



The previous chapter provided a primer to Python programming and an overview of the data science field. Data science is a relatively young multidisciplinary field of study. It draws its concepts and methods from the traditional fields of statistics, computer science, and the broad field of artificial intelligence (AI), especially the subfield of AI called machine learning:

Figure 2.1: The data science models

As you can see in Figure 2.1, data science aims to make use of both structured and unstructured data, develop models that can be effectively used, make predictions, and also derive insights for decision making.

A loose description of structured data will be any set of data that can be conveniently arranged into a table that consists of rows and columns. This kind of data is normally stored in database management systems.

Unstructured data, however, cannot be conveniently stored in tabular form – an example of such a dataset...


Simple Linear Regression

In Figure 2.3, you can see the crime rate per capita and the median value of owner-occupied homes for the city of Boston, which is the largest city of the Commonwealth of Massachusetts. We seek to use regression analysis to gain an insight into what drives crime rates in the city.

Such analysis is useful to policy makers and society in general because it can help with decision-making directed toward the reduction of the crime rate, and hopefully the eradication of crime across communities. This can make communities safer and increase the quality of life in society.

This is a data science problem and is of the supervised machine learning type. There is a dependent variable named crime rate (let's denote it Y), whose variation we seek to understand in terms of an independent variable, named Median value of owner-occupied homes (let's denote it X).

In other words, we are trying to understand the variation in crime rate based on different...


Multiple Linear Regression

In the simple linear regression discussed previously, we only have one independent variable. If we include multiple independent variables in our analysis, we get a multiple linear regression model. Multiple linear regression is represented in a way that's similar to simple linear regression.

Let's consider a case where we want to fit a linear regression model that has three independent variables, X1, X2, and X3. The formula for the multiple linear regression equation will look like Figure 2.5:

Figure 2.5: Multiple linear regression equation

Each independent variable will have its own coefficient or parameter (that is, β1 β2 or β3). The βs coefficient tells us how a change in their respective independent variable influences the dependent variable if all other independent variables are unchanged.

Estimating the Regression Coefficients (β0, β1, β2 and β3)

The regression...


Conducting Regression Analysis Using Python

Having discussed the basics of regression analysis, it is now time to get our hands dirty and actually do some regression analysis using Python.

To begin with our analysis, we need to start a session in Python and load the relevant modules and dataset required.

All of the regression analysis we will do in this chapter will be based on the Boston Housing dataset. The dataset is good for teaching and is suitable for linear regression analysis. It presents the level of challenge that necessitates the use of the logarithm function to transform variables in order to achieve a better level of model fit to the data. The dataset contains information on a collection of properties in the Boston area and can be used to determine how the different housing attributes of a specific property affect the property's value.

The column headings of the Boston Housing dataset CSV file can be explained as follows:

  • CRIM – per capita...

Multiple Regression Analysis

In the exercises and activity so far, we have used only one independent variable in our regression analysis. In practice, as we have seen with the Boston Housing dataset, processes and phenomena of analytic interest are rarely influenced by only one feature. To be able to model the variability to a higher level of accuracy, therefore, it is necessary to investigate all the independent variables that may contribute significantly toward explaining the variability in the dependent variable. Multiple regression analysis is the method that is used to achieve this.

Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels Formula API

In this exercise, we will be using the plus operator (+) in the patsy formula string to define a linear regression model that includes more than one independent variable.

To complete this activity, run the code in the following steps in your Colab notebook:

  1. Open a new Colab notebook file and...

Assumptions of Regression Analysis

Due to the parametric nature of linear regression analysis, the method makes certain assumptions about the data it analyzes. When these assumptions are not met, the results of the regression analysis may be misleading to say the least. It is, therefore, necessary to check any analysis work to ensure the regression assumptions are not violated.

Let's review the main assumptions of linear regression analysis that we must ensure are met in order to develop a good model:

  1. The relationship between the dependent and independent variables must be linear and additive.

    This means that the relationship must be of the straight-line type, and if there are many independent variables involved, thus multiple linear regression, the weighted sum of these independent variables must be able to explain the variability in the dependent variable.

  2. The residual terms (ϵi) must be normally distributed. This is so that the standard error of estimate...

Explaining the Results of Regression Analysis

A primary objective of regression analysis is to find a model that explains the variability observed in a dependent variable of interest. It is, therefore, very important to have a quantity that measures how well a regression model explains this variability. The statistic that does this is called R-squared (R2). Sometimes, it is also called the coefficient of determination. To understand what it actually measures, we need to take a look at some other definitions.

The first of these is called the Total Sum of Squares (TSS). TSS gives us a measure of the total variance found in the dependent variable from its mean value.

The next quantity is called the Regression sum of squares (RSS). This gives us a measure of the amount of variability in the dependent variable that our model explains. If you imagine us creating a perfect model with no errors in prediction, then TSS will be equal to RSS. Our hypothetically perfect model will provide...



This chapter introduced the topic of linear regression analysis using Python. We learned that regression analysis, in general, is a supervised machine learning or data science problem. We learned about the fundamentals of linear regression analysis, including the ideas behind the method of least squares. We also learned about how to use the pandas Python module to load and prepare data for exploration and analysis.

We explored how to create scatter graphs of bivariate data and how to fit a line of best fit through them. Along the way, we discovered the power of the statsmodels module in Python. We explored how to use it to define simple linear regression models and to solve the model for the relevant parameters. We also learned how to extend that to situations where the number of independent variables is more than one – multiple linear regressions. We investigated approaches by which we can transform a non-linear relation between a dependent and independent...

About the Authors
  • Anthony So

    Anthony So is a renowned leader in data science. He has extensive experience in solving complex business problems using advanced analytics and AI in different industries including financial services, media, and telecommunications. He is currently the chief data officer of one of the most innovative fintech start-ups. He is also the author of several best-selling books on data science, machine learning, and deep learning. He has won multiple prizes at several hackathon competitions, such as Unearthed, GovHack, and Pepper Money. Anthony holds two master's degrees, one in computer science and the other in data science and innovation.

    Browse publications by this author
  • Thomas V. Joseph

    Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

    Browse publications by this author
  • Robert Thas John

    Robert Thas John is a Google developer expert in machine learning. His day job involves working as a data engineer on the Google Cloud Platform by building, training, and deploying large-scale machine learning models. He also makes decisions about how to store and process large amounts of data. He has more than 10 years of experience in building enterprise-grade solutions and working with data. He spends his free time learning or contributing to the developer community. He frequently travels to speak at technology events or to mentor developers. He also writes a blog on data science.

    Browse publications by this author
  • Andrew Worsley

    Andrew David Worsley is an independent consultant and educator with expertise in the areas of machine learning, statistics, cloud computing, and artificial intelligence. He has practiced data science in several countries across a multitude of industries including retail, financial services, marketing, resources, and healthcare.

    Browse publications by this author
  • Dr. Samuel Asare

    Dr. Samuel Asare is a professional engineer with enthusiasm for Python programming, research, and writing. He is highly skilled in applying data science methods to the extraction of useful insights from large data sets. He possesses solid skills in project management processes. Samuel has previously held positions, in industry and academia, as a process engineer and a lecturer of materials science and engineering respectively. Presently, he is pursuing his passion for solving industry problems, using data science methods, and writing.

    Browse publications by this author
Latest Reviews (1 reviews total)
El contenido del libro me gusta bastante, y voy en el primer capitulo y cumple con mis espectativas
Recommended For You
The Data Analysis Workshop

Learn how to analyze data using Python models with the help of real-world use cases and guidance from industry experts

By Gururajan Govindan and 2 more
The Data Wrangling Workshop - Second Edition

A beginner’s guide to simplifying Extract, Transform, Load (ETL) processes with the help of hands-on tips, tricks, and best practices, in a fun and interactive way

By Brian Lipp and 2 more
The Data Visualization Workshop

Explore a modern approach to visualizing data with Python and transform large real-world datasets into expressive visual graphics using this beginner-friendly workshop

By Mario Döbler and 1 more
Machine Learning for Algorithmic Trading - Second Edition

Leverage machine learning to design and back-test automated trading strategies for real-world markets using pandas, TA-Lib, scikit-learn, LightGBM, SpaCy, Gensim, TensorFlow 2, Zipline, backtrader, Alphalens, and pyfolio.

By Stefan Jansen