You're reading from The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition

Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a data scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and a lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from machine learning to business intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of The Applied Artificial Intelligence Workshop, published by Packt.

Chapter 9: Data Modeling – Preprocessing

In this chapter, you will learn two important processes used to prepare data for modeling – splitting and scaling. You will learn how to use the sklearn StandardScaler and MinMaxScaler classes for scaling, and the train_test_split function for splitting. You will also be introduced to the reasons behind scaling and exactly what these methods do. As part of exploring splitting and scaling, you will use sklearn's LinearRegression and statsmodels to create simple linear regression models.
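
To give a feel for these tools before applying them to real data, here is a minimal sketch of the calls involved; the array names, the 80/20 split, and the random seed are illustrative choices rather than anything prescribed by the chapter:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# illustrative data: 100 samples, 3 features, and a target
X = np.random.rand(100, 3)
y = np.random.rand(100)

# hold out 20% of the rows for testing (the ratio is an illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MinMaxScaler follows the same fit_transform/transform pattern
X_train_minmax = MinMaxScaler().fit_transform(X_train)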

By the end of this chapter, you will be comfortable preparing datasets to begin modeling. The main ideas you will learn in this chapter are as follows:

  • Exploring independent and dependent variables
  • Understanding data scaling and normalization
  • Activity 9.01 – Data splitting, scaling, and modeling

An introduction to data modeling

Consider a statement such as "the weather depends on the season." If we wanted to confirm this statement with data, we would collect information about the weather during different times of the year. The statement asserts a model – a model of the weather that says we can say something about the weather if we know the season. Proposing and evaluating such models is data modeling.

Often, we would like to understand the relationships within our data (numbers and other types of information), and in the previous chapter, we used visualization methods for that. Here, we can go a little deeper and ask questions such as "are the independent variables correlated with each other?" or "is the output a linear function of the input?" In some cases, we can answer these questions with charts; in other cases, we may construct a mathematical model. A mathematical model is simply a function that transforms some input data into an output.
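
The first of those questions, for example, can often be answered directly in pandas with the DataFrame's .corr() method; a minimal sketch with made-up columns:

import pandas as pd

# illustrative DataFrame of three candidate independent variables
df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10],   # perfectly correlated with x1
    'x3': [5, 3, 6, 2, 7]})

# pairwise Pearson correlations between the columns
print(df.corr())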

Data modeling, the topic...

Exploring dependent and independent variables

In this chapter, you will learn about dependent and independent variables. You will learn about the need for scaling and normalization of data, in addition to performing those operations. You will also use some basic modeling methods to analyze your data.

At a high level, we can say a dependent variable is related to one or more independent variables in a linear or non-linear way. Linear models are easy to understand. A linear model relating one Y to one X is just a line. With multiple X variables, each one has a coefficient that gives its effect on Y, and since all those effects are independent, we just add all the effects together in a multivariate linear model. In a non-linear model, Y depends on X in a more complex way, such as Y being a function of X². We can create non-linear models nearly as easily as linear models in pandas using some simple additional modules. We'll explore how to do that in the following chapter.
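
As a preview of that idea, here is a minimal sketch using sklearn's LinearRegression and a hand-made squared column; the column names, coefficients, and noise level are illustrative, not taken from the book:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# illustrative data in which Y depends on both X and X squared
rng = np.random.default_rng(42)
data = pd.DataFrame({'X': np.linspace(-3, 3, 50)})
data['Y'] = 1.5 * data['X'] + 0.8 * data['X'] ** 2 + rng.normal(0, 0.1, 50)

# linear model: Y as a straight-line function of X
linear = LinearRegression().fit(data[['X']], data['Y'])

# "non-linear" model: add X squared as a column, then fit a linear model to both
data['X_sq'] = data['X'] ** 2
quadratic = LinearRegression().fit(data[['X', 'X_sq']], data['Y'])

print(linear.coef_, quadratic.coef_)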

Much...

Understanding data scaling and normalization

If we inspect the coefficients of our mpg model from the Training, validation, and test splits section earlier, we see that they range over several orders of magnitude. The code here iterates over the variable names and coefficient values, taking advantage of Python's built-in enumerate() function, which iterates over the column names but also returns a counter, which we capture in coef and use to index the model coefficients. For reference, the code prints the range of each variable in the data used to fit the model:

print('var\t  coef\t\t\t    range')
for coef, var in enumerate(my_data.columns[1:-1]):
    print(var, '\t', round(my_model.coef_[0][coef], 5), 
          '\twith range ', round(float(my_data[var].max() - 
               ...
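
The snippet is truncated here; a sketch of how the completed loop might look, assuming my_data and my_model carry over from the earlier splits section and that the last line simply finishes the max-minus-min range calculation (the rounding precision is a guess):

print('var\t  coef\t\t\t    range')
for coef, var in enumerate(my_data.columns[1:-1]):
    print(var, '\t', round(my_model.coef_[0][coef], 5),
          '\twith range ', round(float(my_data[var].max() -
                                       my_data[var].min()), 2))  # a guess at how the truncated line ends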

Activity 9.01 – Data splitting, scaling, and modeling

You are charged with analyzing the performance of a combined cycle power plant and are given data on the full-load electrical power production along with environmental variables (such as temperature or humidity). In the first part of the activity, you will split the data manually and with sklearn, then you will scale the data, construct a simple linear model, and output the results:

  1. For this activity, all you will need is the Pandas library, the modules from sklearn, and numpy. Load them in the first cell of the notebook.
  2. Use the power_plant.csv dataset – 'Datasets\\power_plant.csv'. Read the data into a Pandas DataFrame, print out the shape, and list the first five rows.

The independent variables are as follows:

  • AT – ambient temperature
  • V – exhaust vacuum level
  • AP – ambient pressure
  • RH – relative humidity

The dependent variable is...
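
The remainder of the activity text is truncated here, but a minimal sketch of the workflow the steps describe might look like the following; the target column name PE, the 80/20 split, and the choice of StandardScaler are illustrative assumptions, not the book's solution:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# read the data and inspect it, as step 2 asks
power = pd.read_csv('Datasets\\power_plant.csv')
print(power.shape)
print(power.head())

# assumption: the dependent variable column is named 'PE'
X = power[['AT', 'V', 'AP', 'RH']]
y = power['PE']

# split, scale (fitting the scaler on the training data only), and fit a simple linear model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LinearRegression().fit(X_train_s, y_train)
print('R^2 on the test data:', model.score(X_test_s, y_test))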

Summary

In this chapter, you learned how to split and scale data for downstream modeling tasks. You can now split data manually when that is appropriate, but you are also familiar with the sklearn methods that simplify splitting. You also saw how different scaling methods work and learned why min/max scaling might be used in some models and standardization in others. You've seen how to make simple linear regression models, a topic to which we will return in the next chapter. Along the way, you learned why it is important to split data and hold some back from the modeling step in order to measure performance on new data. You now have the basic toolkit for preparing data for modeling, which is where we will begin the next chapter.
