You're reading from  Mastering Predictive Analytics with R

Product type: Book
Published in: Jun 2015
Reading level: Expert
ISBN-13: 9781783982806
Edition: 1st Edition
Authors (2):
Rui Miguel Forte

Rui Miguel Forte is currently the chief data scientist at Workable. He was born and raised in Greece and studied in the UK. He is an experienced data scientist, having over 10 years of work experience in a diverse array of industries spanning mobile marketing, health informatics, education technology, and human resources technology. His projects have included predictive modeling of user behavior in mobile marketing promotions, speaker intent identification in an intelligent tutor, information extraction techniques for job applicant resumes and fraud detection for job scams. He currently teaches R, MongoDB, and other data science technologies to graduate students in the Business Analytics MSc program at the Athens University of Economics and Business. In addition, he has lectured in a number of seminars, specialization programs, and R schools for working data science professionals in Athens. His core programming knowledge is in R and Java, and he has extensive experience working with a variety of database technologies such as Oracle, PostgreSQL, MongoDB, and HBase. He holds a Master’s degree in Electrical and Electronic Engineering from Imperial College London and is currently researching machine learning applications in information extraction and natural language processing.

Chapter 9. Time Series Analysis

Many models that we come across involve observing a process of some sort over a period of time in order to learn to predict how that process will behave in the future. As we are dealing with a process that generates observations indexed by time, we refer to these models as time series models. Classic examples of time series are stock market indexes, volume of sales of a company's product over time, and changing weather attributes such as temperature and rainfall during the year.

In this chapter, we will focus on univariate time series, that is to say, time series that involve monitoring how a single variable fluctuates over time. To do this, we begin with some basic tools for describing time series, followed by an overview of a number of fundamental examples. It turns out that there is a wide variety of different approaches to modeling time series; in this chapter, we will focus primarily on ARIMA models, but we will also provide pointers on a few alternatives...

Fundamental concepts of time series


A time series is just a sequence of random variables, $Y_1, Y_2, \ldots, Y_T$, indexed by an evenly spaced sequence of points in time. Time series are ubiquitous in everyday life; we can observe the total amount of rainfall in millimeters over yearly periods for consecutive years, the average daytime temperature over consecutive days, the price of a particular share in the stock market at the close of every day of trading, or the total number of patients in a doctor's waiting room every half hour. As we can see, examples abound.
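As a minimal sketch of what "indexed by an evenly spaced sequence of points in time" looks like in R, the base function ts() attaches a regular time index to a vector of observations. The rainfall values below are simulated purely for illustration, and the variable name rainfall is hypothetical:

```r
# Simulated yearly rainfall totals in millimeters (made-up values, illustration only)
set.seed(1)
rainfall <- ts(round(rnorm(10, mean = 600, sd = 50)),
               start = 2000, frequency = 1)

time(rainfall)                  # the evenly spaced time index, one point per year
frequency(rainfall)             # number of observations per unit of time
recent <- window(rainfall, start = 2005)  # sub-series from 2005 onward
```

Functions such as window() rely on this time index, which is why it is usually convenient to store a series as a ts object rather than as a bare numeric vector.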

To analyze time series data, we use the concept of a stochastic process, which is just a sequence of random variables that are generated via an underlying mechanism that is stochastic or random, as opposed to deterministic. From the perspective of the predictive modeler, our goal is to study time series in order to build a model that best describes the behavior of a finite set of samples that we have obtained, in order for us to predict how...

Some fundamental time series


We will begin our study of time series by looking at two famous but very simple examples. These will not only give us a feel for the field, but as we will see later on, they will also become integral building blocks to describe more complex time series.

White noise

A basic but very important type of time series is known as discrete white noise, or simply white noise. In a white noise time series, the random variables that are generated all have a mean of 0 and a finite, identical variance $\sigma_w^2$, and the random variables at different time steps are uncorrelated with each other. Although not all texts enforce this requirement, most also specify that the variables are independent and identically distributed (iid).

The iid property essentially requires that each random variable come from the exact same distribution, such as a normal distribution with a particular mean and standard deviation. The property also requires that two variables from...
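The properties of white noise can be checked on a simulated series. The sketch below assumes Gaussian white noise (one possible choice of iid distribution) with a hypothetical standard deviation of 2, and inspects the sample mean, the sample variance, and the sample autocorrelations at nonzero lags:

```r
set.seed(42)
n <- 1000
sigma_w <- 2                              # hypothetical standard deviation
w <- rnorm(n, mean = 0, sd = sigma_w)     # iid normal draws: Gaussian white noise

mean(w)                                   # should be close to 0
var(w)                                    # should be close to sigma_w^2 = 4

# Sample autocorrelations; element 1 is lag 0 (always 1), so drop it
wn_acf <- acf(w, lag.max = 20, plot = FALSE)
max(abs(wn_acf$acf[-1]))                  # small for an uncorrelated series
```

Calling acf(w) without plot = FALSE draws the familiar correlogram, in which the bars at nonzero lags should mostly stay inside the confidence band.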

Stationarity


We have often seen that in predictive modeling, we need to make certain important but limiting assumptions in order to build practical models. With time series models, one of the most common assumptions, and one that renders the modeling task significantly simpler, is the stationarity assumption.

Stationarity essentially means that the probabilistic behavior of a time series does not change with the passage of time. There are two versions of the stationarity property that are commonly used. A stochastic process is said to be strictly stationary when the joint probability distribution of a sequence of points starting at time $t$, $Y_t, Y_{t+1}, \ldots, Y_{t+n}$, is the same as the joint probability distribution of another sequence of points starting at a different time $T$, $Y_T, Y_{T+1}, \ldots, Y_{T+n}$.

To be strictly stationary, this property must hold for any choice of time t and T, and for any sequence length n. In particular, because we can choose n = 1, this means that the probability distributions...
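One way to build intuition for stationarity is to simulate many realizations of a process and compare its behavior at different time steps. The sketch below contrasts white noise with a random walk, taken here to be the cumulative sum of white noise. The number of realizations and the series length are arbitrary choices:

```r
set.seed(123)
n_series <- 2000   # number of independent realizations (arbitrary)
n_steps  <- 100    # length of each realization (arbitrary)

# Each column is one realization of standard normal white noise
noise <- matrix(rnorm(n_series * n_steps), nrow = n_steps, ncol = n_series)

# A random walk is the running sum of the white noise terms
walks <- apply(noise, 2, cumsum)

# Variance across realizations at each time step
var_noise <- apply(noise, 1, var)
var_walk  <- apply(walks, 1, var)

var_noise[c(10, 100)]   # roughly constant over time
var_walk[c(10, 100)]    # grows with the time step
```

The variance of the white noise stays near 1 at every time step, whereas the variance of the random walk grows roughly in proportion to the time step, so its distribution clearly depends on time.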

Stationary time series models


In this section, we will describe a few stationary time series models. As we will see, these can be used to model a number of real-world processes.

Moving average models

A moving average (MA) process is a stochastic process in which the random variable at time step t is a linear combination of the most recent (in time) terms of a white noise process. Concretely, we can write this in an equation as follows:
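The displayed equation is missing from this excerpt. Under the common convention, and assuming an order $q$ and coefficients $\theta_1, \ldots, \theta_q$ (symbols chosen here for illustration, not confirmed by the text), a moving average process of order $q$ is usually written as:

```latex
Y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q}
```

Some texts write the coefficients with minus signs instead; the two conventions describe the same family of processes.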

In the previous equation, and henceforth, we will assume that the e terms are white noise random variables with mean 0 and variance $\sigma_w^2$. We can describe a moving average process in an equivalent way by making use of the backshift operator, B. The backshift operator is an operator that when applied to a random variable in a stochastic process at time t, produces the random variable at the previous time step, t-1. For example:
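The original display is not reproduced in this excerpt. Following the definition just given, applying the backshift operator to a term of the series, or to a white noise term, can be written as:

```latex
B\,Y_t = Y_{t-1}, \qquad B\,e_t = e_{t-1}
```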

We can obtain random variables further back in time by successive applications of the backshift operator. $B^2$, for example, indicates the...
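To make the idea of a linear combination of recent white noise terms concrete, here is a minimal sketch that builds a second-order moving average series by hand. The coefficients 0.6 and 0.3 are hypothetical, and the plus-sign convention for the coefficients is an assumption:

```r
set.seed(7)
theta <- c(0.6, 0.3)     # hypothetical MA(2) coefficients
e <- rnorm(5000)         # white noise terms

# One-sided convolution: y[t] = e[t] + 0.6 * e[t-1] + 0.3 * e[t-2]
y_full <- stats::filter(e, filter = c(1, theta),
                        method = "convolution", sides = 1)

# The first two values are NA because e[0] and e[-1] do not exist
y <- as.numeric(y_full[!is.na(y_full)])

# Sample autocorrelations at lags 1 to 5
ma_acf <- acf(y, lag.max = 5, plot = FALSE)$acf[-1]
round(ma_acf, 2)
```

For these coefficients, the theoretical autocorrelation is about 0.54 at lag 1, about 0.21 at lag 2, and exactly 0 beyond lag 2, so the sample values at lags 3 to 5 should be close to zero. The built-in arima.sim() function offers a more direct way to simulate such series.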

Non-stationary time series models


In this section, we will look at some models that are non-stationary but nonetheless have certain properties that allow us to either derive a stationary model or model the non-stationary behavior.

Autoregressive integrated moving average models

The random walk process is an example of a time series model that is itself non-stationary, but the sequence of differences between consecutive points, $Y_t$ and $Y_{t+1}$, which we can write as $\Delta Y_t$, is stationary. This differenced sequence is nothing but the white noise sequence, which we know to be stationary.

If we were to take the difference between consecutive output points of the differenced sequence, we would again obtain another sequence, which we call a second order differenced sequence.

Generalizing this notion of differencing, we can say that a $d$th order difference is obtained by repeatedly computing differences between consecutive terms $d$ times, to obtain a new sequence with points, $W_t$, from an original sequence, $Y_t$. We can...
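Differencing is available in base R through the diff() function, whose differences argument sets the order. As a small sketch on a simulated random walk (the series length is arbitrary), first-order differencing recovers the underlying white noise, and second-order differencing is the same as differencing twice:

```r
set.seed(99)
w <- rnorm(500)          # white noise
y <- cumsum(w)           # random walk built from the white noise

d1 <- diff(y)            # first-order difference: y[i + 1] - y[i]
all.equal(d1, w[-1])     # matches the white noise, minus its first term

d2 <- diff(y, differences = 2)   # second-order difference
all.equal(d2, diff(diff(y)))     # same as differencing the differences

length(y) - length(d2)   # each round of differencing drops one point
```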

Predicting intense earthquakes


Having reviewed several time series models, we are now ready for some practical examples. Our first data set is a time series of earthquakes having magnitude that exceeds 4.0 on the Richter scale in Greece over the period between the year 2000 and the year 2008. This data set was recorded by the Observatory of Athens and is hosted on the website of the University of Athens, Faculty of Geology, Department of Geophysics & Geothermics. The data is available online at http://www.geophysics.geol.uoa.gr/catalog/catgr_20002008.epi.

We will import these data directly by using the package RCurl. From this package, we will use the function getURL(), which retrieves the contents of a particular address on the Internet. We then wrap the result in textConnection(), a base R function that lets read.table() read the retrieved text as if it were a file. Once we have the data, we provide meaningful names for the columns using information from the website:

> library("RCurl")
> seismic_raw <- read.table(textConnection(getURL(
...

Predicting lynx trappings


Our second data set, known as the lynx data set, is very famous and is provided with the core distribution of R. It was first presented in a 1942 paper by C. Elton and M. Nicholson, titled The ten-year cycle in numbers of the lynx in Canada, which appears in the Journal of Animal Ecology. The data consist of the annual number of Canadian lynx trapped in the Mackenzie River district over the period 1821-1934. We can load the data as follows:

> data(lynx)

The following diagram shows a plot of the lynx data:

We will repeat the exact same series of analysis steps as we did with the earthquake data. First, we will create a grid of parameter combinations and use this to train multiple models. Then we will pick the best one on account of it having the smallest AIC value. Finally, we will use the chosen parameter combination to train a model and forecast the next few data points. The reader is encouraged to also experiment with auto.arima().

> d <- 0:2
> p <- 0:6
>...
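The listing above is cut off in this excerpt, so the following is only a rough sketch of the kind of AIC-based grid search that the text describes, not the author's code. The range chosen for q, the helper name fit_aic, and the forecast horizon of five points are all assumptions:

```r
data(lynx)

# Grid of candidate ARIMA(p, d, q) orders; the q range is an assumption
grid <- expand.grid(p = 0:6, d = 0:2, q = 0:2)

# Hypothetical helper: fit one model and return its AIC, or NA if fitting fails
fit_aic <- function(p, d, q) {
  tryCatch(AIC(arima(lynx, order = c(p, d, q))),
           error = function(e) NA_real_)
}

grid$aic <- mapply(fit_aic, grid$p, grid$d, grid$q)

# Pick the combination with the smallest AIC (which.min skips NA values)
best <- grid[which.min(grid$aic), ]

# Refit the chosen model and forecast the next five points
best_model <- arima(lynx, order = c(best$p, best$d, best$q))
fc <- predict(best_model, n.ahead = 5)
fc$pred
```

Some parameter combinations may emit convergence warnings during fitting, which is expected in a search like this. AIC values are also not strictly comparable across different orders of differencing, so the selected model is best treated as a starting point.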

Predicting foreign exchange rates


Our third and final data set will be constructed from a historical database of Euro Foreign Exchange Reference rates provided by the website of the European Central Bank. We can download a zipped archive containing the data from http://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist.zip. When unzipped, this archive contains a file titled eurofxref-hist.csv, which we can directly import into R using the read.csv() function:

> eurofxref.hist <- read.csv("eurofxref-hist.csv",
                             stringsAsFactors = FALSE)
> eurofxref.hist[1 : 6, 1 : 6]
        Date    USD    JPY    BGN CYP    CZK
1 2014-09-05 1.2948 136.27 1.9558 N/A 27.596
2 2014-09-04 1.3015 136.89 1.9558 N/A 27.662
3 2014-09-03 1.3151 138.11 1.9558 N/A 27.658
4 2014-09-02 1.3115 137.63 1.9558 N/A 27.784
5 2014-09-01 1.3133 136.97 1.9558 N/A 27.738
6 2014-08-29 1.3188 137.11 1.9558 N/A 27.725

As we can see, our data frame contains the conversion rates for several different currencies...
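Two features of the output above matter for time series work: the rows are in reverse chronological order, and missing rates are stored as the string N/A. The sketch below shows one possible way of handling both, using only the six rows printed above so that it runs without the downloaded file. The names fx and usd are hypothetical:

```r
# The six rows shown above, embedded as text so the example is self-contained
fx <- read.csv(text = "Date,USD,JPY,BGN,CYP,CZK
2014-09-05,1.2948,136.27,1.9558,N/A,27.596
2014-09-04,1.3015,136.89,1.9558,N/A,27.662
2014-09-03,1.3151,138.11,1.9558,N/A,27.658
2014-09-02,1.3115,137.63,1.9558,N/A,27.784
2014-09-01,1.3133,136.97,1.9558,N/A,27.738
2014-08-29,1.3188,137.11,1.9558,N/A,27.725",
               stringsAsFactors = FALSE,
               na.strings = "N/A")   # read the N/A strings as missing values

fx$Date <- as.Date(fx$Date)          # parse the dates
fx <- fx[order(fx$Date), ]           # sort from oldest to newest

usd <- fx$USD                        # one currency as a univariate series
usd
```

The same na.strings argument can be passed when reading the full eurofxref-hist.csv file, which keeps the currency columns numeric rather than character.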

Other time series models


In this chapter, we spent most of our time on studying models that describe a time series in terms of the patterns of correlations between different points in time. This approach led us to the ARIMA family of models, which we have seen are highly configurable and have successfully been employed in many real-world problems. There is a diverse array of methods that have been applied to the time series problem and in fact we have seen a few elsewhere in this book as well.

The neural networks that we studied in Chapter 4, Neural Networks, and the hidden Markov models that we saw in Chapter 8, Probabilistic Graphical Models, are two such examples. Sometimes, we can treat a time series as a regression problem, and so techniques from this area can be leveraged too.

One other important class of methods is exponential smoothing. There are two key premises behind methods that use this approach. The first of these is that a time series is usually decomposed into a number of different...
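As a brief illustration of this class of methods, base R provides the HoltWinters() function. The sketch below fits only the simplest variant, simple exponential smoothing, which tracks a level and has no trend or seasonal component. Applying it to the lynx data is a choice made here for convenience, not something taken from the text:

```r
data(lynx)

# Simple exponential smoothing: switch off the trend (beta) and seasonal (gamma) parts
hw <- HoltWinters(lynx, beta = FALSE, gamma = FALSE)

hw$alpha                                  # estimated smoothing parameter

# Forecast the next three points; the forecast is flat at the last level
hw_forecast <- predict(hw, n.ahead = 3)
hw_forecast
```

A smoothing parameter close to 1 means that the fitted level follows the most recent observation closely, while a value near 0 means that the level changes slowly.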

Summary


The focus of this chapter was on understanding the fundamental tools that are useful in studying time series. Time series analysis is a very large field, but in this brief synopsis, we explored the basic concepts that are essential to further study. We started off by looking at some properties of time series such as the autocorrelation function and saw how this, along with the partial autocorrelation function, can provide important clues about the underlying process involved.

Next, we introduced stationarity, which is a very useful property of some time series that in a nutshell says that the statistical behavior of the underlying process does not change over time. We introduced white noise as a stochastic process that forms the basis of many other processes. In particular, it appears in the random walk process, the moving average (MA) process, as well as the autoregressive process (AR). These, in turn, we saw can be combined to yield even more complex time series.

In order to handle...
