
You're reading from The Data Analysis Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839211386
Edition: 1st Edition
Authors (3):
Gururajan Govindan
Gururajan Govindan

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision-making and machine learning with Python.

Shubhangi Hora

Shubhangi Hora is a data scientist, Python developer, and published writer. With a background in computer science and psychology, she is particularly passionate about healthcare-related AI, including mental health. Shubhangi is also a trained musician.

Konstantin Palagachev

Konstantin Palagachev holds a Ph.D. in applied mathematics and optimization, with an interest in operations research and data analysis. He is recognized for his passion for delivering data-driven solutions and expertise in the area of urban mobility, autonomous driving, insurance, and finance. He is also a devoted coach and mentor, dedicated to sharing his knowledge and passion for data science.


ARIMA Models

  1. Autoregressive Integrated Moving Average (ARIMA) models are a class of statistical models that try to explain the behavior of a time series using its own past values. ARIMA models are defined by a set of parameters (p, d, q), each corresponding to a different component of the model:

    • Autoregressive of order p: An autoregressive model of order p (AR(p) for short) models the current time series entry as a linear combination of its last p values. The mathematical formulation is as follows:

      Y_t = α + β_1·Y_(t-1) + β_2·Y_(t-2) + … + β_p·Y_(t-p) + ϵ_t

      Figure 1.40: Expression for an autoregressive model of order p

Here, α is the intercept term, Y_(t-i) is the lag-i term of the series with the respective coefficient β_i, while ϵ_t is the error term (that is, a normally distributed random variable with mean 0 and variance σ²).
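To make the AR(p) recursion concrete, here is a minimal sketch (not from the book) that simulates an AR(2) process with NumPy and recovers its coefficients by ordinary least squares; the coefficient values are arbitrary:

```python
import numpy as np

# Simulate an AR(2) process: Y_t = alpha + b1*Y_(t-1) + b2*Y_(t-2) + eps_t
rng = np.random.default_rng(42)
alpha, b1, b2 = 0.5, 0.6, 0.2   # arbitrary, stationary coefficients
n = 5000
eps = rng.normal(0, 1, n)       # error terms: mean 0, variance 1
y = np.zeros(n)
for t in range(2, n):
    y[t] = alpha + b1 * y[t - 1] + b2 * y[t - 2] + eps[t]

# Recover (alpha, b1, b2) by regressing Y_t on its two lagged values
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
coef, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
print(coef)  # estimates should be roughly [0.5, 0.6, 0.2]
```

With 5,000 samples, the least-squares estimates land close to the true coefficients, which is exactly what an AR model fit does under the hood.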

  • Moving average of order q: A moving average model of order q (MA(q) for short) attempts to model the current value of the time series as a linear combination of its past error terms. Mathematically speaking, it has the following formula:

    Y_t = α + ϵ_t + ϕ_1·ϵ_(t-1) + … + ϕ_q·ϵ_(t-q)

    Figure 1.41: Expression for the moving average model of order q

As in the autoregressive model, α represents a bias term; ϕ_1, …, ϕ_q are parameters to be estimated in the model; and ϵ_t, …, ϵ_(t-q) are the error terms at times t, …, t-q, respectively.
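As an illustration (not part of the exercise), an MA(1) process can be simulated directly from its error terms; its lag-1 autocorrelation has the known closed form ϕ/(1 + ϕ²), which the sample estimate should approach. The coefficient values below are arbitrary:

```python
import numpy as np

# Simulate an MA(1) process: Y_t = alpha + eps_t + phi*eps_(t-1)
rng = np.random.default_rng(0)
alpha, phi = 1.0, 0.7           # arbitrary illustrative values
eps = rng.normal(0, 1, 100_000)
y = alpha + eps[1:] + phi * eps[:-1]

# Sample lag-1 autocorrelation vs. the theoretical value phi / (1 + phi^2)
yc = y - y.mean()
acf1 = (yc[1:] * yc[:-1]).mean() / yc.var()
print(round(acf1, 3), round(phi / (1 + phi**2), 3))
```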

  • Integrated component of order d: The integrated component represents a transformation of the original time series, in which the transformed series is obtained by taking the difference between Y_t and Y_(t-d), hence the following:

    Z_t = Y_t − Y_(t-d)

    Figure 1.42: Expression for an integrated component of order d

The integration term is used for detrending the original time series and making it stationary. Note that we already saw this type of transformation when we subtracted the previous entry in the number of rides, that is, when we applied an integration term of order 1.
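A quick sketch of this idea with pandas (the series here is made up): differencing once turns a linear trend into a constant series:

```python
import pandas as pd

# A made-up series with a clear linear trend
s = pd.Series([10, 13, 16, 19, 22, 25])

# First-order integration: Z_t = Y_t - Y_(t-1)
z = s.diff().dropna()
print(z.tolist())  # [3.0, 3.0, 3.0, 3.0, 3.0] -- the trend is removed
```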

In general, when we apply an ARIMA model of order (p,d,q) to a time series, {Y_t}, we proceed as follows:

  1. First, integrate the original time series of order d to obtain the new series:

     Z_t = Y_t − Y_(t-d)

     Figure 1.43: Integrating the original time series

  2. Then, apply a combination of the AR(p) and MA(q) models, also known as the autoregressive moving average model, or ARMA(p,q), to the transformed series, {Z_t}:

     Z_t = α + β_1·Z_(t-1) + … + β_p·Z_(t-p) + ϵ_t + ϕ_1·ϵ_(t-1) + … + ϕ_q·ϵ_(t-q)

     Figure 1.44: Expression for ARMA

Here, the coefficients α, β_1, …, β_p, ϕ_1, …, ϕ_q are to be estimated.
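As a sketch (coefficients chosen arbitrarily for illustration), the ARMA(p,q) equation can be simulated by combining the AR and MA recursions:

```python
import numpy as np

# Simulate ARMA(1,1): Z_t = alpha + beta1*Z_(t-1) + eps_t + phi1*eps_(t-1)
rng = np.random.default_rng(7)
alpha, beta1, phi1 = 0.2, 0.5, 0.3
n = 20_000
eps = rng.normal(0, 1, n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = alpha + beta1 * z[t - 1] + eps[t] + phi1 * eps[t - 1]

# For a stationary ARMA(1,1), the long-run mean is alpha / (1 - beta1)
print(z.mean())  # should be close to 0.4
```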

A standard method for finding the parameters (p,d,q) of an ARIMA model is to compute the autocorrelation and partial autocorrelation functions (ACF and PACF for short). The autocorrelation function measures the Pearson correlation between the lagged values in a time series as a function of the lag:

ACF(k) = Corr(Y_t, Y_(t-k))

Figure 1.45: Autocorrelation function as a function of the lag

In practice, the ACF measures the complete correlation between the current entry, Y_t, and its past entries, lagged by k. Note that when computing ACF(k), the correlation between Y_t and the intermediate values (Y_(t-1), …, Y_(t-k+1)) is not removed. In order to account only for the correlation between Y_t and Y_(t-k), we often refer to the PACF, which measures only the direct impact of Y_(t-k) on Y_t.

ACF and PACF are, in general, used to determine the order of integration when modeling a time series with an ARIMA model. For each lag, the correlation coefficient and level of significance are computed. In general, we aim for an integrated series in which only the first few lags have a correlation above the significance threshold. We will demonstrate this in the following exercise.
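Before the exercise, here is a small sketch (not from the book) of the idea behind these plots: a nonstationary random walk has lag-1 autocorrelation close to 1, while its first-order integrated (differenced) series behaves like white noise. The `acf` helper below is a simplified stand-in for the statsmodels estimator, using a plain Pearson correlation between the series and its lag-k copy:

```python
import numpy as np

def acf(y, k):
    """Simplified ACF: Pearson correlation between the series and its lag-k copy."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[k:], y[:-k])[0, 1]

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(0, 1, 5000))  # random walk: nonstationary

print(acf(walk, 1))           # close to 1: strong autocorrelation
print(acf(np.diff(walk), 1))  # close to 0: differencing yields white noise
```

This is the same diagnostic we apply to the registered rides below: difference the series until the autocorrelation dies out quickly.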

Exercise 1.07: ACF and PACF Plots for Registered Rides

In this exercise, we will plot the autocorrelation and partial autocorrelation functions for the registered number of rides:

  1. Access the necessary methods for plotting the ACF and PACF contained in the Python package, statsmodels:
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
  2. Define a 3 x 3 grid and plot the ACF and PACF for the original series of registered rides, as well as for its first- and second-order integrated series:
    fig, axes = plt.subplots(3, 3, figsize=(25, 12))
    # plot original series
    original = daily_rides["registered"]
    axes[0,0].plot(original)
    axes[0,0].set_title("Original series")
    plot_acf(original, ax=axes[0,1])
    plot_pacf(original, ax=axes[0,2])
    # plot first order integrated series
    first_order_int = original.diff().dropna()
    axes[1,0].plot(first_order_int)
    axes[1,0].set_title("First order integrated")
    plot_acf(first_order_int, ax=axes[1,1])
    plot_pacf(first_order_int, ax=axes[1,2])
    # plot second order integrated series
    second_order_int = first_order_int.diff().dropna()
    axes[2,0].plot(second_order_int)
    axes[2,0].set_title("Second order integrated")
    plot_acf(second_order_int, ax=axes[2,1])
    plot_pacf(second_order_int, ax=axes[2,2])

    The output should be as follows:

    Figure 1.46: Autocorrelation and partial autocorrelation plots of registered rides

    As you can see from the preceding figure, the original series exhibits several autocorrelation coefficients that are above the threshold. The first order integrated series has only a few, which makes it a good candidate for further modeling (hence, selecting an ARIMA(p,1,q) model). Finally, the second order integrated series presents a large negative autocorrelation at lag 1, which, in general, is a sign that the order of integration is too large.

    Now focus on finding the model parameters and the coefficients for an ARIMA(p,d,q) model, based on the observed registered rides. The general approach is to try different combinations of parameters and choose the one that minimizes a certain information criterion, for instance, the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC):

    Akaike Information Criterion:

    AIC = 2k − 2·ln(L̂)

    Figure 1.47: Expression for AIC

    Bayesian Information Criterion:

    BIC = k·ln(n) − 2·ln(L̂)

    Figure 1.48: Expression for BIC

    Here, k is the number of parameters in the selected model, n is the number of samples, and L̂ is the maximized value of the likelihood function. As you can see, there is no substantial difference between the two criteria and, in general, both are used. If the two criteria select different optimal models, we tend to choose a model in between.
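Both criteria are simple functions of the parameter count, the sample size, and the log-likelihood, so they are easy to compute by hand. In this sketch the numbers (5 parameters, 731 observations, log-likelihood of -1200) are hypothetical:

```python
import math

def aic(k, log_lik):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical model: 5 parameters, 731 samples, log-likelihood of -1200
print(aic(5, -1200.0))       # 2410.0
print(bic(5, 731, -1200.0))  # slightly higher: BIC penalizes parameters more
```

Note that for n > e² ≈ 7.4 samples, the BIC penalty per parameter (ln n) exceeds the AIC penalty (2), so BIC tends to prefer smaller models.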

  3. In the following code snippet, fit an ARIMA(p,d,q) model to the registered column. Note that the pmdarima package is not included in Anaconda by default; therefore, you need to install it via the following:
    conda install -c saravji pmdarima

    And then perform the following:

    from pmdarima import auto_arima
    # the series of registered rides, as taken from daily_rides earlier
    registered = daily_rides["registered"]
    model = auto_arima(registered, start_p=1, start_q=1, \
                       max_p=3, max_q=3, information_criterion="aic")
    print(model.summary())

    Python's pmdarima package has a special function that automatically finds the best parameters for an ARIMA(p,d,q) model based on the AIC. Here is the resulting model:

    Figure 1.49: The resulting model based on AIC

    As you can see, the best selected model was ARIMA(3,1,3), with the coef column containing the coefficients for the model itself.

  4. Finally, evaluate how well the number of rides is approximated by the model by using the model.predict_in_sample() function:
    # plot original and predicted values
    plot_data = pd.DataFrame(registered)
    plot_data['predicted'] = model.predict_in_sample()
    plot_data.plot(figsize=(12, 8))
    plt.ylabel("number of registered rides")
    plt.title("Predicted vs actual number of rides")
    plt.savefig('figs/registered_arima_fit.png', format='png')

    The output should be as follows:

    Figure 1.50: Predicted versus the actual number of registered rides

As you can see, the predicted column follows the original one quite well, although it is unable to capture many of the sharp rises and falls in the registered series.

Note

To access the source code for this specific section, please refer to https://packt.live/37xHHKQ.

You can also run this example online at https://packt.live/30MqcFf. You must execute the entire Notebook in order to get the desired result.

In this first chapter, we presented various techniques from data analysis and statistics. These will serve as a basis for further analysis in the next chapter, helping us to improve our results and obtain a better understanding of the problems we are about to deal with.

Activity 1.01: Investigating the Impact of Weather Conditions on Rides

In this activity, you will investigate the impact of weather conditions and their relationship with the other weather-related columns (temp, atemp, hum, and windspeed), as well as their impact on the number of registered and casual rides. The following steps will help you to perform the analysis:

  1. Import the initial hour data.
  2. Create a new column in which weathersit is mapped to the four categorical values specified in Exercise 1.01, Preprocessing Temporal and Weather Features (clear, cloudy, light_rain_snow, and heavy_rain_snow).
  3. Define a Python function that accepts as input the hour data, a column name, and a weather condition, and then returns a seaborn regplot in which regression plots are produced between the provided column name and the registered and casual rides for the specified weather condition.
  4. Produce a 4 x 4 plot in which each column represents a specific weather condition (clear, cloudy, light_rain_snow, and heavy_rain_snow) and each row represents one of the specified four columns (temp, atemp, hum, and windspeed). A useful function for producing the plot might be the matplotlib.pyplot.subplot() function.

    Note

    For more information on the matplotlib.pyplot.subplot() function, refer to https://pythonspot.com/matplotlib-subplot/.

  5. Define a second function that accepts the hour data, a column name, and a specific weather condition as input, and then prints the Pearson correlation and p-value between the provided column and the registered and casual rides for the specified weather condition (once computed between the registered rides and the specified column, and once between the casual rides and the specified column).
  6. Iterate over the four columns (temp, atemp, hum, and windspeed) and the four weather conditions (clear, cloudy, light_rain_snow, and heavy_rain_snow), printing the correlation for each column and each weather condition by using the function defined in Step 5.

    The output should be as follows:

    Figure 1.51: Correlation between weather and the registered/casual rides
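One possible shape for the Step 5 helper (the function name and the tiny DataFrame below are assumptions for illustration, not the book's solution) uses scipy.stats.pearsonr:

```python
import pandas as pd
from scipy.stats import pearsonr

def print_correlations(hour_data, col, weather_condition):
    """Print Pearson correlation and p-value between rides and `col`
    for rows matching the given weather condition."""
    subset = hour_data[hour_data["weathersit"] == weather_condition]
    for rides in ("registered", "casual"):
        corr, pval = pearsonr(subset[rides], subset[col])
        print(f"{weather_condition} | {rides} vs {col}: "
              f"corr={corr:.3f}, p-value={pval:.3f}")

# Tiny made-up frame to show the call shape
df = pd.DataFrame({"weathersit": ["clear"] * 5,
                   "registered": [10, 20, 30, 40, 50],
                   "casual": [5, 9, 16, 18, 27],
                   "temp": [0.2, 0.4, 0.5, 0.7, 0.9]})
print_correlations(df, "temp", "clear")
```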

Note

The solution for this activity can be found via this link.

