
You're reading from  The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition
Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries, from government R&D to startups to $1B public companies. His experience focuses on analytics, including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/TensorFlow, and AWS and Azure machine learning services. As a machine learning consultant, he has developed and deployed ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop", published by Packt.

Chapter 10: Data Modeling – Modeling Basics

In this chapter, you will learn how to discover patterns in data using resampling and smoothing. The .resample(), .rolling(), and .ewm() pandas methods will be introduced, and you will learn how to use them to filter out noise and perform other useful explorations of data series. You will learn how sampling can sometimes include data from future times, which is a problem for predictive modeling, and how to address that. At the end of the chapter, you will see how a combination of scaling (introduced in Chapter 9, Data Modeling – Preprocessing) and smoothing can reveal interesting similarities between different data series that might otherwise be overlooked.

By the end of this chapter, you will be skilled at applying scaling, sampling, and smoothing in a variety of ways to your data analyses.
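As a first orientation, the three methods named above can be tried on a small synthetic series. This is only a minimal sketch: the data, the weekly frequency, and the window and span of 14 are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a slow sine wave plus random noise (illustrative data only)
rng = np.random.default_rng(42)
dates = pd.date_range("2022-01-01", periods=120, freq="D")
signal = np.sin(np.linspace(0, 4 * np.pi, len(dates)))
noisy = pd.Series(signal + rng.normal(scale=0.5, size=len(dates)), index=dates)

# Downsample to weekly means: one value per calendar week
weekly = noisy.resample("W").mean()

# Trailing 14-day moving average; the first 13 values are NaN
rolling_mean = noisy.rolling(window=14).mean()

# Exponentially weighted mean: recent points get more weight than older ones
ewm_mean = noisy.ewm(span=14).mean()

print(weekly.head())
print(rolling_mean.dropna().head())
print(ewm_mean.head())
```

Each result is smoother than the raw series; they differ in how many past points contribute and how those points are weighted.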

This chapter covers the following topics:

  • Learning the modeling basics
  • Predicting future values of time series
  • ...

Introduction to data modeling

Data is often provided to you in a form that isn't completely suitable for analysis and modeling. As an example, suppose you are trying to summarize and analyze the sales of students selling cookies in an effort to raise money for a school trip. You would like to get an idea of the expected sales per student per week, in order to recognize students putting in effort and achieving higher sales. Unfortunately, the data for any given student comes in at somewhat random times, making comparisons more difficult. You decide to take each student's sales and fill in the missing days by interpolating between the days for which you have data. The process is quite tedious, and part-way through, you realize you will also have to go back and divide each day by the weekly total, otherwise you are inflating the total sales. Pandas provides the .resample() method you saw in Chapter 9, Data Modeling – Preprocessing, and by combining that with a .rolling...
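To make the cookie-sales scenario concrete, here is one possible way to put irregularly timed records onto a common weekly grid with .resample(). The student names, dates, and box counts are invented for illustration, and weekly totals are only one of several reasonable summaries.

```python
import pandas as pd

# Hypothetical sales records arriving on irregular days (made-up numbers)
sales = pd.DataFrame(
    {
        "student": ["Ana", "Ana", "Ana", "Ben", "Ben"],
        "date": pd.to_datetime(
            ["2022-03-01", "2022-03-04", "2022-03-10",
             "2022-03-02", "2022-03-09"]
        ),
        "boxes": [3, 5, 2, 4, 6],
    }
)

# Index by date, then total each student's sales within each calendar week
weekly = (
    sales.set_index("date")
    .groupby("student")["boxes"]
    .resample("W")
    .sum()
)
print(weekly)
```

Because every student now has one value per week, their sales can be compared directly without interpolating day by day.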

Learning the modeling basics

So far, we've talked about data modeling in a somewhat abstract sense. In this and the next chapter, we will focus on the tools that help us gain insights from data and construct some basic predictive models using that data. We will begin by defining the modeling landscape in more depth, then look at some of the tools provided directly in pandas.

Modeling tools

In Chapter 9, Data Modeling – Preprocessing, we introduced the scikit-learn (sklearn) LinearRegression estimator and showed how to fit a simple multiple linear regression model. While a vast range of modeling tools is available for Python, sklearn is perhaps one of the most widely used, for everything from regression to classification and even basic neural networks. The sklearn website (https://scikit-learn.org/stable/) describes the ecosystem as follows:

  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built...
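As a reminder of the kind of model mentioned at the start of this section, the following minimal sketch fits a multiple linear regression with sklearn. The data is synthetic, and the coefficients 3.0, -2.0, and the intercept 1.0 are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

# Fit the model; the estimated coefficients should be close to 3.0 and -2.0
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)
```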

Predicting future values of time series

You have seen how smoothing can uncover important information in a series that might otherwise be hidden by noise. Since smoothing is such an easy data modeling method, it might be tempting to use it to make predictions. The problem is that, in many cases, smoothing data and aligning the result to the original series means that each point in the smoothed series includes information from future values. Using such values as predictions is therefore an example of data leakage, discussed in the Avoiding information leakage section of Chapter 9, Data Modeling – Preprocessing.
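The leakage described above can be seen in a small experiment. In this sketch (synthetic data, with a window of 3 chosen arbitrarily), one observation is changed, and we check which smoothed values at earlier dates are affected. A centered window reacts to the future change, while a trailing window shifted by one step does not.

```python
import numpy as np
import pandas as pd

# Small synthetic daily series (illustrative values only)
dates = pd.date_range("2022-01-03", periods=10, freq="D")
series = pd.Series(np.arange(10, dtype=float), index=dates)

# Centered window: the value at position t averages t-1, t, and t+1,
# so it depends on a future observation
centered = series.rolling(window=3, center=True).mean()

# Trailing window shifted by one step: the value at position t averages
# t-3, t-2, and t-1, so it uses only past observations
trailing = series.rolling(window=3).mean().shift(1)

# Change the observation at position 6 and recompute both versions
changed = series.copy()
changed.iloc[6] = 100.0
centered_changed = changed.rolling(window=3, center=True).mean()
trailing_changed = changed.rolling(window=3).mean().shift(1)

# Position 5 lies before the changed point
print(centered.iloc[5], centered_changed.iloc[5])  # differ: position 5 saw position 6
print(trailing.iloc[5], trailing_changed.iloc[5])  # equal: position 5 saw only 2 to 4
```

Only the trailing, shifted version could have been computed at the time each value was needed, which is the property a predictive feature requires.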

Suppose you are again analyzing the SPX index data you saw in Chapter 9, Data Modeling – Preprocessing:

  1. Here, you read the data, convert the dates to datetimes, and make a simple plot over a limited time range:
    SPX = pd.read_csv('Datasets/spx.csv')
    SPX['date'] = pd.to_datetime(SPX['...

Activity 10.01 – Normalizing and smoothing data

Suppose you are an analyst in a financial advisory firm. Your manager has given you three stock symbols and requested your input on how their price behavior may be correlated. You are provided with a stocks.csv data file, which contains the symbols, closing prices, trading volumes, and a sentiment indicator (some view of the quality of the stocks, but you are not told the exact definition). Your initial goal is to determine whether all three stocks show similar market characteristics and, if any or all of them do, to make an initial visualization using smoothing. The long-term goal is to build some predictive models, so you will split the data into train and test sets. As this is time series data, it's important to split on time, not randomly. For this activity, all you will need is the pandas library, a scaling module from sklearn, and matplotlib. Load them in the first cell of the notebook:

import pandas...
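The time-based split mentioned in this activity can be sketched as follows. This is not the activity's solution: the data is a synthetic stand-in for stocks.csv, the close column name is hypothetical, and StandardScaler and the 80/20 split are assumptions, since the text does not say which scaler or split ratio to use.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the activity data; the column name is hypothetical
rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", periods=100, freq="D")
prices = pd.DataFrame(
    {"close": 50 + rng.normal(size=100).cumsum()}, index=dates
)

# Split on time: the earliest 80% of rows train, the latest 20% test
split_point = int(len(prices) * 0.8)
train = prices.iloc[:split_point]
test = prices.iloc[split_point:]

# Fit the scaler on the training period only, then apply it to both sets
scaler = StandardScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), index=train.index, columns=train.columns
)
test_scaled = pd.DataFrame(
    scaler.transform(test), index=test.index, columns=test.columns
)

print(train.index.max(), test.index.min())
```

Fitting the scaler on the training rows alone keeps statistics from the test period out of the training data, in the same spirit as splitting on time.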

Summary

In this chapter, you built on the topics of independent and dependent variables, and of splitting data into train/validation/test sets for modeling and for obtaining unbiased estimates of model performance. You learned a range of basic data modeling methods using resampling (upsampling and downsampling the data frequency) and rolling window approaches to smoothing and estimating. You began your detailed investigation of data modeling with pandas tools for smoothing and resampling data, and with some particular capabilities for handling time series. Importantly, you saw that smoothing methods can highlight patterns in very noisy data and that smoothing can be non-uniform in time, such as when using .ewm() or a custom weighting function. With these foundational methods in hand, the next chapter will conclude data modeling with a deeper exploration of linear regression and then of powerful non-linear modeling methods, using Random Forest as a regression model.
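As a reminder of what a custom weighting function can look like, the sketch below applies hand-picked weights inside a rolling window through .rolling().apply(). The weights and the helper name weighted_mean are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Short synthetic series (illustrative values only)
series = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

# Hypothetical weights for a 3-point window: oldest, middle, newest
weights = np.array([1.0, 2.0, 3.0])

def weighted_mean(window):
    # window holds the values in time order, oldest first
    return np.dot(window, weights) / weights.sum()

# raw=True passes each window to the function as a NumPy array
custom = series.rolling(window=3).apply(weighted_mean, raw=True)

# For comparison, an exponentially weighted mean, which also
# weights recent points more heavily
ewm = series.ewm(span=3).mean()

print(custom)
print(ewm)
```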
