You're reading from The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition

Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a data scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and a lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from machine learning to business intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of The Applied Artificial Intelligence Workshop, published by Packt.

Chapter 9: Data Modeling – Preprocessing

In this chapter, you will learn two important processes used to prepare data for modeling – splitting and scaling. You will learn how to use the sklearn StandardScaler and MinMaxScaler classes for scaling, and the train_test_split function for splitting. You will also be introduced to the reasons behind scaling and exactly what these methods do. As part of exploring splitting and scaling, you will use sklearn's LinearRegression and statsmodels to create simple linear regression models.
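
To give a feel for these tools before applying them to real data, here is a minimal sketch of the calls involved; the array names, the 80/20 split, and the random seed are illustrative choices rather than anything prescribed by the chapter:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# illustrative data: 100 samples, 3 features, and a target
X = np.random.rand(100, 3)
y = np.random.rand(100)

# hold out 20% of the rows for testing (the ratio is an illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MinMaxScaler follows the same fit_transform/transform pattern
X_train_minmax = MinMaxScaler().fit_transform(X_train)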

By the end of this chapter, you will be comfortable preparing datasets to begin modeling. The main ideas you will learn in this chapter are as follows:

  • Exploring independent and dependent variables
  • Understanding data scaling and normalization
  • Activity 9.01 – Data splitting, scaling, and modeling

An introduction to data modeling

Consider a statement such as "the weather depends on the season." If we wanted to confirm this statement with data, we would collect information about the weather during different times of the year. The statement asserts a model – a model of the weather that says we can say something about the weather if we know the season. Proposing and evaluating such models is data modeling.

Often, we would like to understand the relationships within our data (numbers and other types of information), and in the previous chapter, we used visualization methods for that. Here, we can go a little deeper and ask questions such as "are the independent variables correlated with each other?" or "is the output a linear function of the input?" In some cases, we can answer these questions with charts; in other cases, we may construct a mathematical model. A mathematical model is simply a function that transforms some input data into an output.
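
The first of those questions, for example, can often be answered directly in pandas with the DataFrame's .corr() method; a minimal sketch with made-up columns:

import pandas as pd

# illustrative DataFrame of three candidate independent variables
df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 4, 6, 8, 10],   # perfectly correlated with x1
    'x3': [5, 3, 6, 2, 7]})

# pairwise Pearson correlations between the columns
print(df.corr())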

Data modeling, the topic...

Exploring dependent and independent variables

In this chapter, you will learn about dependent and independent variables. You will learn about the need for scaling and normalization of data, in addition to performing those operations. You will also use some basic modeling methods to analyze your data.

At a high level, we can say a dependent variable is related to one or more independent variables in a linear or non-linear way. Linear models are easy to understand. A linear model relating one Y to one X is just a line. With multiple X variables, each one has a coefficient that gives its effect on Y, and since all those effects are independent, we just add all the effects together in a multivariate linear model. In a non-linear model, Y depends on X in a more complex way, such as Y being a function of X². We can create non-linear models nearly as easily as linear models in pandas using some simple additional modules. We'll explore how to do that in the following chapter.
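
As a preview of that idea, here is a minimal sketch using sklearn's LinearRegression and a hand-made squared column; the column names, coefficients, and noise level are illustrative, not taken from the book:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# illustrative data in which Y depends on both X and X squared
rng = np.random.default_rng(42)
data = pd.DataFrame({'X': np.linspace(-3, 3, 50)})
data['Y'] = 1.5 * data['X'] + 0.8 * data['X'] ** 2 + rng.normal(0, 0.1, 50)

# linear model: Y as a straight-line function of X
linear = LinearRegression().fit(data[['X']], data['Y'])

# "non-linear" model: add X squared as a column, then fit a linear model to both
data['X_sq'] = data['X'] ** 2
quadratic = LinearRegression().fit(data[['X', 'X_sq']], data['Y'])

print(linear.coef_, quadratic.coef_)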

Much...

Understanding data scaling and normalization

If we inspect the coefficients of our mpg model from the Training, validation, and test splits section earlier, we see that they range over several orders of magnitude. The code here iterates over the variable names and coefficient values, taking advantage of Python's built-in enumerate() function, which iterates over the column names but also returns a counter, which we capture in coef and use to index the model coefficients. For reference, the code prints the range of each variable in the data used to fit the model:

print('var\t  coef\t\t\t    range')
for coef, var in enumerate(my_data.columns[1:-1]):
    print(var, '\t', round(my_model.coef_[0][coef], 5), 
          '\twith range ', round(float(my_data[var].max() - 
               ...
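
The snippet is truncated here; a sketch of how the completed loop might look, assuming my_data and my_model carry over from the earlier splits section and that the last line simply finishes the max-minus-min range calculation (the rounding precision is a guess):

print('var\t  coef\t\t\t    range')
for coef, var in enumerate(my_data.columns[1:-1]):
    print(var, '\t', round(my_model.coef_[0][coef], 5),
          '\twith range ', round(float(my_data[var].max() -
                                       my_data[var].min()), 2))  # a guess at how the truncated line ends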

Activity 9.01 – Data splitting, scaling, and modeling

You are charged with analyzing the performance of a combined cycle power plant and are given data on the full-load electrical power production along with environmental variables (such as temperature or humidity). In the first part of the activity, you will split the data manually and with sklearn, then you will scale the data, construct a simple linear model, and output the results:

  1. For this activity, all you will need is the Pandas library, the modules from sklearn, and numpy. Load them in the first cell of the notebook.
  2. Use the power_plant.csv dataset – 'Datasets\\power_plant.csv'. Read the data into a Pandas DataFrame, print out the shape, and list the first five rows.

The independent variables are as follows:

  • AT – ambient temperature
  • V – exhaust vacuum level
  • AP – ambient pressure
  • RH – relative humidity

The dependent variable is...
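
The remainder of the activity text is truncated here, but a minimal sketch of the workflow the steps describe might look like the following; the target column name PE, the 80/20 split, and the choice of StandardScaler are illustrative assumptions, not the book's solution:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# read the data and inspect it, as step 2 asks
power = pd.read_csv('Datasets\\power_plant.csv')
print(power.shape)
print(power.head())

# assumption: the dependent variable column is named 'PE'
X = power[['AT', 'V', 'AP', 'RH']]
y = power['PE']

# split, scale (fitting the scaler on the training data only), and fit a simple linear model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LinearRegression().fit(X_train_s, y_train)
print('R^2 on the test data:', model.score(X_test_s, y_test))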

Summary

In this chapter, you learned how to split and scale data for downstream modeling tasks. You can now split data manually when that is appropriate, but you are also familiar with the sklearn methods that simplify splitting. You also saw how different scaling methods work and learned why min/max scaling might be used in some models and standardization in others. You've seen how to make simple linear regression models, a topic to which we will return in the next chapter. Along the way, you learned why it is important to split data and hold some back from the modeling step in order to measure performance on new data. You now have the basic toolkit for preparing data for modeling, which is where we will begin the next chapter.
