You're reading from The Pandas Workshop

Product type: Book

Published in: Jun 2022

Reading Level: Beginner

Publisher: Packt

ISBN-13: 9781800208933

Edition: 1st Edition

Languages

Python

Tools

NumPy Pandas

Concepts

Data Science

Authors (4):

Blaine Bateman

Saikat Basak

Thomas V. Joseph

William So

Chapter 10: Data Modeling – Modeling Basics

In this chapter, you will learn how to discover patterns in data using resampling and smoothing. The .resample(), .rolling(), and .ewm() pandas methods will be introduced, and you will learn how to use them to filter out noise and perform other useful explorations of data series. You will learn how sampling can sometimes include data from future times, which is a problem for predictive modeling, and how to address that. At the end of the chapter, you will see how a combination of scaling (introduced in Chapter 9, Data Modeling – Preprocessing) and smoothing can reveal interesting similarities between different data series that might otherwise be overlooked.

By the end of this chapter, you will be skilled at applying scaling, sampling, and smoothing in a variety of ways to your data analyses.
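As a rough orientation, the three methods named above can be sketched on a small synthetic series. The data, the 7-day window, and the span value are assumptions chosen for illustration and do not come from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 60 days of a noisy upward trend (not from the book).
rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=60, freq="D")
series = pd.Series(np.linspace(0, 6, 60) + rng.normal(0, 1, 60), index=idx)

# Downsample: one mean value per calendar week.
weekly_mean = series.resample("W").mean()

# Trailing 7-day moving average; the first 6 values are NaN
# because their windows are incomplete.
rolling_mean = series.rolling(window=7).mean()

# Exponentially weighted mean; recent points get more weight.
ewm_mean = series.ewm(span=7).mean()
```

Each result is an ordinary Series, so the three can be plotted next to the original to compare how much noise each one removes.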

This chapter covers the following topics:

Learning the modeling basics
Predicting future values of time series

Introduction to data modeling

Data is often provided to you in a form that isn't completely suitable for analysis and modeling. As an example, suppose you are trying to summarize and analyze the sales of students selling cookies in an effort to raise money for a school trip. You would like to get an idea of the expected sales per student per week, in order to recognize students putting in effort and achieving higher sales. Unfortunately, the data for any given student comes in at somewhat random times, making comparisons more difficult. You decide to take each student's sales and fill in the missing days by interpolating between the days for which you have data. The process is quite tedious, and partway through, you realize you will also have to go back and divide each day by the weekly total; otherwise, you are inflating the total sales. Pandas provides the .resample() method you saw in Chapter 9, Data Modeling – Preprocessing, and by combining that with a .rolling...
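One way the cookie-sales scenario might look in pandas is sketched below. The dates, the `boxes_sold` column name, and the 2-week window are hypothetical. Summing into weekly bins avoids the manual interpolation step, so totals are not inflated.

```python
import pandas as pd

# Hypothetical sales records for one student, logged on irregular days.
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-03-01", "2022-03-04", "2022-03-05", "2022-03-12", "2022-03-20"]
        ),
        "boxes_sold": [3, 5, 2, 4, 6],
    }
).set_index("date")

# Sum sales into weekly bins (weeks end on Sunday by default).
weekly_sales = sales["boxes_sold"].resample("W").sum()

# Smooth the weekly totals with a trailing 2-week average.
smoothed = weekly_sales.rolling(window=2, min_periods=1).mean()
```

Because every student ends up on the same weekly index, their series can be compared directly.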

Learning the modeling basics

So far, we've talked about data modeling in a somewhat abstract sense. In this and the next chapter, we will focus on the tools that help us gain insights from data and construct some basic predictive models using that data. We will begin by defining the modeling landscape in more depth, then look at some of the tools provided directly in pandas.

Modeling tools

In Chapter 9, Data Modeling – Preprocessing, we introduced the scikit-learn (sklearn) LinearRegression estimator and showed how to fit a simple multiple linear regression model. While there is a vast range of modeling tools available for Python, sklearn is perhaps one of the most widely used for everything from regression to classification and even basic neural networks. The sklearn ecosystem is described (see https://scikit-learn.org/stable/) as follows:

Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built...

Predicting future values of time series

You have seen how smoothing can be used to uncover important information in a series that might be hidden by noise. It might be tempting to think that smoothing is a very easy data modeling method, so why not use it to make predictions? The issue that arises is, in many cases, the process of smoothing data and aligning it to the original series means you are using information for any given point in the smoothed series that includes future values. Therefore, using such values as predictions is an example of data leakage, discussed in Chapter 9, Data Modeling – Preprocessing in the Avoiding information leakage section.
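The leakage issue can be made concrete with a minimal sketch on a toy series. The series and the 3-point window are assumptions for illustration. A centered window reacts to a jump before it happens, while a trailing window does not.

```python
import pandas as pd

# Hypothetical series: flat at 0, then a jump to 10 on the last day.
idx = pd.date_range("2022-01-01", periods=7, freq="D")
series = pd.Series([0, 0, 0, 0, 0, 0, 10], index=idx, dtype=float)

# Centered window: each point averages the previous, current, and NEXT value,
# so the smoothed value on day 6 already "knows" about the jump on day 7.
centered = series.rolling(window=3, center=True).mean()

# Trailing window: each point averages the current and two previous values only.
trailing = series.rolling(window=3).mean()

# Shifting the trailing average by one step gives a value that was fully
# known before each date, so it can serve as a naive forecast.
naive_forecast = trailing.shift(1)
```

Comparing `centered` and `trailing` on the second-to-last day shows the difference: only the centered average has moved away from zero.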

Suppose you are again analyzing the SPX index data you saw in Chapter 9, Data Modeling – Preprocessing. Here, you read the data, convert the dates to datetimes, and make a simple plot over a limited time range:
```
SPX = pd.read_csv('Datasets/spx.csv')
SPX['date'] = pd.to_datetime(SPX['...
```

Activity 10.01 – Normalizing and smoothing data

Suppose you are an analyst in a financial advisory firm. Your manager has given you three stock symbols and asked for your input on how their price behavior may be correlated. You are provided a stocks.csv data file, which contains the symbols, closing prices, trading volumes, and a sentiment indicator (some view of the quality of the stocks, but you are not told the exact definition). Your initial goal here is to determine whether all three stocks show similar market characteristics and, if any or all of them do, to make an initial visualization using smoothing. The long-term goal is to try to build some predictive models, so you will split the data into train and test sets. Because the data is a time series, it's important to split on time, not randomly. For this activity, all you will need is the pandas library, a scaling module from sklearn, and matplotlib. Load them in the first cell of the notebook:

import pandas...
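The time-based split described in the activity might be sketched as follows. The toy frame stands in for stocks.csv, and its column names and the 80/20 cutoff are assumptions. The activity calls for a scaling module from sklearn; plain pandas arithmetic is used here instead, to keep the sketch self-contained.

```python
import pandas as pd

# Hypothetical stand-in for stocks.csv; column names are assumptions.
dates = pd.date_range("2022-01-03", periods=10, freq="B")
stocks = pd.DataFrame(
    {"date": dates, "close": [10.0, 11, 12, 11, 13, 14, 15, 14, 16, 17]}
)

# Split on time: rows before the cutoff date train, the rest test.
stocks = stocks.sort_values("date")
cutoff = stocks["date"].iloc[int(len(stocks) * 0.8)]
train_df = stocks[stocks["date"] < cutoff].copy()
test_df = stocks[stocks["date"] >= cutoff].copy()

# Scale with statistics from the training period only, so no
# information from the test period leaks into the transformation.
mean, std = train_df["close"].mean(), train_df["close"].std()
train_df["close_scaled"] = (train_df["close"] - mean) / std
test_df["close_scaled"] = (test_df["close"] - mean) / std
```

Fitting the scaling on the training rows and reusing it on the test rows mirrors the fit-then-transform pattern of sklearn scalers.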

Summary

In this chapter, you built on the topics of independent and dependent variables and of splitting data into train/validation/test sets to obtain unbiased estimates of model performance. Here, you learned a range of basic data modeling methods using resampling (upsampling and downsampling the data frequency) and rolling window approaches to smoothing and estimating. You began your detailed investigation of data modeling with pandas tools for smoothing and resampling data, and some particular capabilities to handle time series. Importantly, you saw that smoothing methods can highlight patterns in very noisy data and that smoothing can be non-uniform in time, such as using .ewm() or a custom weighting function. With these foundational methods in hand, the next chapter will conclude data modeling with a deeper exploration of linear regression and then more powerful non-linear modeling methods, using Random Forest as a regression model.
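To illustrate smoothing that is non-uniform in time, here is a minimal sketch comparing .ewm() with a custom weighting function applied through .rolling().apply(). The toy series, the alpha value, and the weights are assumptions and are not taken from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series.
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Exponential weighting: each older point counts half as much as the next newer one.
ewm_mean = series.ewm(alpha=0.5).mean()

# Custom weighting: a trailing 3-point window where the newest point
# gets weight 3, the middle point 2, and the oldest point 1.
weights = np.array([1.0, 2.0, 3.0])
custom_mean = series.rolling(window=3).apply(
    lambda window: np.dot(window, weights) / weights.sum(), raw=True
)
```

Both results lean toward the most recent values, which is what distinguishes them from a plain moving average.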

Authors (4)

Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop" published by Packt.
