The Data Science Workshop

By Anthony So , Thomas V. Joseph , Robert Thas John and 2 more
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Free Chapter
    2. Regression
About this book
You already know you want to learn data science, and a smarter way to learn data science is to learn by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python or even build an advanced model for detecting potential bank frauds with effective modern data science. You'll learn from real examples that lead to real results. Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time you can jump into a single exercise each day or spend an entire weekend training a model using sci-kit learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding. Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book. Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll learn about machine learning algorithms like a data scientist, learning along the way. This process means that you'll find that your new skills stick, embedded as best practice. A solid foundation for the years ahead.
Publication date:
January 2020


2. Regression


By the end of this chapter, you will be able to identify and import the Python modules required for regression analysis; use the pandas module to load a dataset and prepare it for regression analysis; create a scatter plot of bivariate data and fit a regression line through it; use the methods available in the Python statsmodels module to fit a regression model to a dataset; explain the results of simple and multiple linear regression analysis; assess the goodness of fit of a linear regression model; and apply linear regression analysis as a tool for practical problem-solving.

This chapter is an introduction to linear regression analysis and its application to practical problem-solving in data science. You will learn how to use Python, a versatile programming language, to carry out regression analysis and examine the results. The use of the logarithm function to transform inherently non-linear relationships between variables and to enable the application...



The previous chapter provided a primer to Python programming and an overview of the data science field. Data science is a relatively young multidisciplinary field of study. It draws its concepts and methods from the traditional fields of statistics, computer science, and the broad field of artificial intelligence (AI), especially the subfield of AI called machine learning:

Figure 2.1: The data science models

As you can see in Figure 2.1, data science aims to make use of both structured and unstructured data, develop models that can be effectively used, make predictions, and also derive insights for decision making.

A loose description of structured data will be any set of data that can be conveniently arranged into a table that consists of rows and columns. This kind of data is normally stored in database management systems.

Unstructured data, however, cannot be conveniently stored in tabular form – an example of such a dataset...


Simple Linear Regression

In Figure 2.3, you can see the crime rate per capita and the median value of owner-occupied homes for the city of Boston, which is the largest city of the Commonwealth of Massachusetts. We seek to use regression analysis to gain an insight into what drives crime rates in the city.

Such analysis is useful to policy makers and society in general because it can help with decision-making directed toward the reduction of the crime rate, and hopefully the eradication of crime across communities. This can make communities safer and increase the quality of life in society.

This is a data science problem and is of the supervised machine learning type. There is a dependent variable named crime rate (let's denote it Y), whose variation we seek to understand in terms of an independent variable, named Median value of owner-occupied homes (let's denote it X).

In other words, we are trying to understand the variation in crime rate based on different...


Multiple Linear Regression

In the simple linear regression discussed previously, we only have one independent variable. If we include multiple independent variables in our analysis, we get a multiple linear regression model. Multiple linear regression is represented in a way that's similar to simple linear regression.

Let's consider a case where we want to fit a linear regression model that has three independent variables, X1, X2, and X3. The formula for the multiple linear regression equation will look like Equation 2.2:

Figure 2.5: Multiple linear regression equation

Each independent variable will have its own coefficient or parameter (that is, β1 β2 or β3 ). The βs coefficient tells us how a change in their respective independent variable influences the dependent variable if all other independent variables are unchanged.

Estimating the Regression Coefficients (β0, β1, β2 and β3)

The regression...


Conducting Regression Analysis Using Python

Having discussed the basics of regression analysis, it is now time to get our hands dirty and actually do some regression analysis using Python.

To begin with our analysis, we need to start a session in Python and load the relevant modules and dataset required.

All of the regression analysis we will do in this chapter will be based on the Boston Housing dataset. The dataset is good for teaching and is suitable for linear regression analysis. It presents the level of challenge that necessitates the use of the logarithm function to transform variables in order to achieve a better level of model fit to the data. The dataset contains information on a collection of properties in the Boston area and can be used to determine how the different housing attributes of a specific property affect the property's value.

The column headings of the Boston Housing dataset CSV file can be explained as follows:

  • CRIM – per capita...

Multiple Regression Analysis

In the exercises and activity so far, we have used only one independent variable in our regression analysis. In practice, as we have seen with the Boston Housing dataset, processes and phenomena of analytic interest are rarely influenced by only one feature. To be able to model the variability to a higher level of accuracy, therefore, it is necessary to investigate all the independent variables that may contribute significantly toward explaining the variability in the dependent variable. Multiple regression analysis is the method that is used to achieve this.

Exercise 2.05: Fitting a Multiple Linear Regression Model Using the Statsmodels formula API

In this exercise, we will be using the plus operator (+) in the patsy formula string to define a linear regression model that includes more than one independent variable.

To complete this activity, run the code in the following steps in your Colab notebook:

  1. Open a new Colab notebook file and...

Assumptions of Regression Analysis

Due to the parametric nature of linear regression analysis, the method makes certain assumptions about the data it analyzes. When these assumptions are not met, the results of the regression analysis may be misleading to say the least. It is, therefore, necessary to check any analysis work to ensure the regression assumptions are not violated.

Let's review the main assumptions of linear regression analysis that we must ensure are met in order to develop a good model:

  1. The relationship between the dependent and independent variables must be linear and additive.

    This means that the relationship must be of the straight-line type, and if there are many independent variables involved, thus multiple linear regression, the weighted sum of these independent variables must be able to explain the variability in the dependent variable.

  2. The residual terms (ϵi) must be normally distributed. This is so that the standard error...

Explaining the Results of Regression Analysis

A primary objective of regression analysis is to find a model that explains the variability observed in a dependent variable of interest. It is, therefore, very important to have a quantity that measures how well a regression model explains this variability. The statistic that does this is called R-squared (R2). Sometimes, it is also called the coefficient of determination. To understand what it actually measures, we need to take a look at some other definitions.

The first of these is called the Total Sum of Squares (TSS). TSS gives us a measure of the total variance found in the dependent variable from its mean value.

The next quantity is called the Regression sum of squares (RSS). This gives us a measure of the amount of variability in the dependent variable that our model explains. If you imagine us creating a perfect model with no errors in prediction, then TSS will be equal to RSS. Our hypothetically perfect model will provide...



This chapter introduced the topic of linear regression analysis using Python. We learned that regression analysis, in general, is a supervised machine learning or data science problem. We learned about the fundamentals of linear regression analysis, including the ideas behind the method of least squares. We also learned about how to use the pandas Python module to load and prepare data for exploration and analysis.

We explored how to create scatter graphs of bivariate data and how to fit a line of best fit through them. Along the way, we discovered the power of the statsmodels module in Python. We explored how to use it to define simple linear regression models and to solve the model for the relevant parameters. We also learned how to extend that to situations where the number of independent variables is more than one – multiple linear regressions. We investigated approaches by which we can transform a non-linear relation between a dependent and independent variable...

About the Authors
  • Anthony So

    Anthony So is a renowned leader in data science. He has extensive experience in solving complex business problems using advanced analytics and AI in different industries including financial services, media, and telecommunications. He is currently the chief data officer of one of the most innovative fintech start-ups. He is also the author of several best-selling books on data science, machine learning, and deep learning. He has won multiple prizes at several hackathon competitions, such as Unearthed, GovHack, and Pepper Money. Anthony holds two master's degrees, one in computer science and the other in data science and innovation.

    Browse publications by this author
  • Thomas V. Joseph

    Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

    Browse publications by this author
  • Robert Thas John

    Robert Thas John is a Google developer expert in machine learning. His day job involves working as a data engineer on the Google Cloud Platform by building, training, and deploying large-scale machine learning models. He also makes decisions about how to store and process large amounts of data. He has more than 10 years of experience in building enterprise-grade solutions and working with data. He spends his free time learning or contributing to the developer community. He frequently travels to speak at technology events or to mentor developers. He also writes a blog on data science.

    Browse publications by this author
  • Andrew Worsley

    Andrew David Worsley is an independent consultant and educator with expertise in the areas of machine learning, statistics, cloud computing, and artificial intelligence. He has practiced data science in several countries across a multitude of industries including retail, financial services, marketing, resources, and healthcare.

    Browse publications by this author
  • Dr. Samuel Asare

    Dr. Samuel Asare is a professional engineer with enthusiasm for Python programming, research, and writing. He is highly skilled in applying data science methods to the extraction of useful insights from large data sets. He possesses solid skills in project management processes. Samuel has previously held positions, in industry and academia, as a process engineer and a lecturer of materials science and engineering respectively. Presently, he is pursuing his passion for solving industry problems, using data science methods, and writing.

    Browse publications by this author
The Data Science Workshop
Unlock this book and the full library FREE for 7 days
Start now