You're reading from Learning NumPy Array

Product type: Book
Published in: Jun 2014
Reading level: Intermediate
Publisher: Packt Publishing
ISBN-13: 9781783983902
Edition: 1st Edition

Author: Ivan Idris

Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. He enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy 1.5 Beginner's Guide and NumPy Cookbook, both published by Packt Publishing.

Chapter 4. Simple Predictive Analytics with NumPy

Following the exploration of the meteorological data in the previous chapter, we will now try to predict temperature. Usually, weather prediction is accomplished with complex models and top-of-the-line supercomputers. Most people don't have access to such resources, so we will cut corners and use simpler models. The list of topics covered in this chapter is as follows:

  • Autocorrelation

  • Autoregressive models

  • Robust statistics

Examining autocorrelation of average temperature with pandas


The pandas (Python data analysis) library builds on NumPy, Matplotlib, and other Python libraries, wrapping them in convenient high-level data structures. You can find more information, including installation instructions, on the pandas website at http://pandas.pydata.org/pandas-docs/stable/install.html. Most good APIs, such as NumPy, seem to follow the Unix philosophy: keep it simple and do one thing well. This philosophy results in many small tools and utilities that can be used as building blocks for something bigger. The pandas library mimics the R programming language in its approach.

The pandas library has plotting subroutines, which can plot lag and autocorrelation plots. Autocorrelation is correlation within a dataset and can be indicative of a trend. For instance, if we have a lag of one day, we can see whether the average temperature of yesterday influences the temperature today. For that to be the case, the autocorrelation value needs to be relatively high.
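As a quick illustration of the idea (using a synthetic series with an annual cycle rather than the KNMI data, and Python 3 print syntax), the lag-1 autocorrelation can be computed directly with the pandas Series.autocorr method:

```python
import numpy as np
import pandas as pd

# Synthetic "daily temperature": an annual cycle plus noise, so that
# consecutive days are strongly correlated.
rng = np.random.default_rng(42)
days = np.arange(730)
temp = 10 + 8 * np.cos(2 * np.pi * days / 365.25) + rng.normal(0, 1, days.size)

series = pd.Series(temp)
print(series.autocorr(lag=1))    # close to 1: yesterday is a good predictor
print(series.autocorr(lag=180))  # about half a year apart: negative
```

The pandas plotting helpers autocorrelation_plot and lag_plot draw the corresponding diagrams for a whole range of lags.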

Describing data with pandas DataFrames


Luckily, pandas has descriptive statistics utilities. We will read the average wind speed, temperature, and pressure values from the KNMI De Bilt data file into a pandas DataFrame. This object is similar to the R dataframe, which is like a data table in a spreadsheet or a database. The columns are labeled, the data can be indexed, and you can run computations on the data. We will then print out descriptive statistics and a correlation matrix as shown in the following steps:

  1. Read the CSV file with the pandas read_csv function. This function works in a similar fashion to the NumPy loadtxt function:

    to_float = lambda x: .1 * float(x.strip() or np.nan)
    to_date = lambda x: dt.strptime(x, "%Y%m%d")
    cols = [4, 11, 25]
    conv_dict = dict( (col, to_float) for col in cols) 
    
    conv_dict[1] = to_date
    cols.append(1)
     
    headers = ['dates', 'avg_ws', 'avg_temp', 'avg_pres']
    df = pd.read_csv(sys.argv[1], usecols=cols, names=headers, index_col=[0], converters=conv_dict)
  2. Print...
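To show what the descriptive statistics step produces, here is a minimal sketch on a made-up DataFrame with the same column names (the values are random, not the De Bilt measurements):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "avg_ws": rng.normal(4, 1.5, n),     # average wind speed
    "avg_temp": rng.normal(10, 6, n),    # average temperature
    "avg_pres": rng.normal(1013, 8, n),  # average pressure
})

print(df.describe())  # count, mean, std, min, quartiles, and max per column
print(df.corr())      # pairwise correlation matrix
```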

Correlating weather and stocks with pandas


We will try to correlate stock market data for the Netherlands with the DataFrame we produced last time from the KNMI De Bilt weather data. As a proxy for the stock market, we will use closing prices of the EWN ETF. This might not be the best choice, by the way, so if you have a better idea, please use the appropriate stock ticker. The steps for this exercise are provided as follows:

  1. Download the EWN data from Yahoo Finance, with a special function. The code is as follows:

    #EWN start Mar 22, 1996
    start = dt(1996, 3, 22)
    end = dt(2013, 5, 4)
    
    symbol = "EWN"
    quotes = finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
  2. Create a DataFrame object with the available dates in the downloaded data:

    dt_idx = pd.DatetimeIndex(quotes.date)  # build an index from the quote dates
    df2 = pd.DataFrame(quotes.close, index=dt_idx, columns=[symbol])
  3. Join the new DataFrame object with DataFrame of the weather data. We will then obtain the correlation matrix:

    df3 = df.join(df2)
    
    print df3.corr()

    The correlation matrix is as follows:

As...
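The join-then-correlate pattern can be sketched independently of the Yahoo Finance download; here two synthetic frames stand in for the weather data and the EWN closes. Since the price series covers business days only, join() leaves NaN values on the other days, and corr() ignores those pairwise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2000-01-01", periods=500, freq="D")
weather = pd.DataFrame({"avg_temp": rng.normal(10, 6, 500)}, index=dates)

bdates = pd.bdate_range("2000-01-01", periods=300)  # business days only
prices = pd.DataFrame({"EWN": 20 + rng.normal(0, 1, 300).cumsum()},
                      index=bdates)

joined = weather.join(prices)  # left join on the date index
print(joined.corr())
```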

Predicting temperature


Temperature is a thermodynamic variable that quantifies how hot or cold something is. To predict temperature, we can apply our knowledge of thermodynamics and meteorology. In general, this would result in the creation of complex weather models with a multitude of inputs. However, this is beyond the scope of this book, so we will try to keep our continuing example simple and manageable.

Autoregressive model with lag 1

What will the temperature be tomorrow? Probably, the same as today but a bit different. We can assume that temperature is a function of the temperature of the preceding day. This can be justified with the autocorrelation plot that we created earlier. To keep it simple, we will assume further that the function is a polynomial. We will define a cutoff point to be used for the fit. Ninety percent of the data should be used for that purpose. Let's model this idea with NumPy:

  1. Fit the data to polynomials of different degrees with the polyfit function as shown in the following...
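A self-contained sketch of this lag-1 idea, on synthetic data instead of the De Bilt series: regress today's value on yesterday's with polyfit, then measure the absolute error on the held-out 10 percent. Only the degree-1 fit is shown here:

```python
import numpy as np

rng = np.random.default_rng(7)
days = np.arange(3650)
temp = 10 + 8 * np.cos(2 * np.pi * days / 365.25) + rng.normal(0, 2, days.size)

cutoff = int(0.9 * len(temp))  # fit on the first 90 percent of the data

# Degree-1 polynomial: temp[t] is approximated by a * temp[t-1] + b
coeffs = np.polyfit(temp[:cutoff - 1], temp[1:cutoff], 1)

pred = np.polyval(coeffs, temp[cutoff - 1:-1])
delta = np.abs(pred - temp[cutoff:])
print(coeffs)
print("% within 2 degrees:", 100.0 * (delta < 2).mean())
```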

Analyzing intra-year daily average temperatures


We are going to have a look at the temperature variation within a year by converting dates to the corresponding day of the year in numbers. This number is between 1 and 366, where 1 corresponds to January 1st and 365 (or 366) corresponds to December 31st. Perform the following steps to analyze the intra-year daily average temperature:

  1. Initialize arrays for days 1 through 365 (day 366, which occurs only in leap years, is ignored here), with averages initialized to zeros:

    rng = np.arange(1, 366)
    avgs = np.zeros(365)
    avgs2 = np.zeros(365)
  2. Calculate averages by the day of the year before and after a cutoff point. Note that the indices returned by np.where are relative to the slice that was searched, so we index into that same slice of temp:

    for i in rng: 
       indices = np.where(days[:cutoff] == i)
       avgs[i-1] = temp[:cutoff][indices].mean()
       indices = np.where(days[cutoff+1:] == i)
       avgs2[i-1] = temp[cutoff+1:][indices].mean()
  3. Fit the averages before the cutoff point to a quadratic polynomial (just a first-order approximation):

    poly = np.polyfit(rng, avgs, 2)
    print poly

    The following polynomial coefficients in descending power are printed:

    [ -4.91329859e-04...
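The same averaging-and-fitting procedure can be tried end to end on synthetic data (an invented annual cycle, not the KNMI values), which makes it easy to see why the leading coefficient comes out negative: the intra-year profile is concave, peaking in mid-year:

```python
import numpy as np

rng = np.random.default_rng(3)
years = 20
days = np.tile(np.arange(1, 366), years)  # day-of-year label for each sample
temp = (10 - 8 * np.cos(2 * np.pi * (days - 1) / 365)
        + rng.normal(0, 2, days.size))

doy = np.arange(1, 366)
avgs = np.array([temp[days == d].mean() for d in doy])

poly = np.polyfit(doy, avgs, 2)  # coefficients in descending powers
print(poly)
```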

Introducing the day-of-the-year temperature model


Continuing with the work we did in the previous example, I would like to propose a new model, where temperature is a function of the day of the year (between 1 and 366). Of course, this model is not complete, but can be used as a component in a more advanced model, which should take into account the previous autoregressive model that we did with lag 2. The procedure for this model is illustrated as follows:

  1. Fit the temperature data before the cutoff point to a quadratic polynomial just as in the previous section but without averaging:

    poly = np.polyfit(days[:cutoff], temp[:cutoff], 2)
    print poly

    Believe it or not, we get practically the same polynomial coefficients that we got earlier:

    [ -4.91072584e-04   1.92682505e-01  -3.97182941e+00]
    
  2. Calculate the absolute difference between the predicted and actual values:

    delta = np.abs(np.polyval(poly, days[cutoff:]) - temp[cutoff:])
  3. Plot a histogram of the absolute error:

    plt.hist(delta, bins = 10, normed = True)
    plt.show...

Modeling temperature with the SciPy leastsq function


So, now we have two ideas: either the temperature today depends on the temperature of yesterday and the day before yesterday (we assume that some linear combination of the two is formed), or the temperature depends on the day of the year (a number between 1 and 366). We can combine these ideas, but then the question is how. It seems that we could have a multiplicative model or an additive model.

Let's choose the additive model since it seems simpler. This means that we assume that temperature is the sum of the autoregressive component and a cyclical component. It's easy to write this down into one equation. We will use the SciPy leastsq function to minimize the square of the error of this equation. The procedure for this model is illustrated as follows:

  1. Define a function that computes the error of our model. The code is as follows:

    def error(p, d, t, lag2, lag1):
       l2, l1, d2, d1, d0 = p
     
       # subtract the whole model (AR part plus quadratic day-of-year part)
       return t - (l2 * lag2 + l1 * lag1 + d2 * d ** 2 + d1 * d + d0)...
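Putting the pieces together on synthetic data (an invented cycle plus noise; the parenthesized expression is assumed to be the intended additive model), the leastsq call looks like this:

```python
import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(11)
days = np.tile(np.arange(1, 366), 10).astype(float)
temp = (10 - 8 * np.cos(2 * np.pi * (days - 1) / 365)
        + rng.normal(0, 1, days.size))

def error(p, d, t, lag2, lag1):
    l2, l1, d2, d1, d0 = p
    # residual = data minus (autoregressive part + quadratic day part)
    return t - (l2 * lag2 + l1 * lag1 + d2 * d ** 2 + d1 * d + d0)

p0 = [0.1, 0.1, 0.0, 0.0, 0.0]
params = leastsq(error, p0,
                 args=(days[2:], temp[2:], temp[:-2], temp[1:-1]))[0]
print(params)
```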

Day-of-year temperature take two


The quadratic polynomial approximation for the day-of-the-year temperature fit can be improved upon. We haven't used any of the NumPy trigonometric functions until now. Those should be a good fit for this problem. So, let's try a trigonometric function and fit again using a function from the scipy.optimize module (leastsq to be precise) as follows:

  1. Set up a simple model function and an error function to be minimized, as shown in the following code snippet:

    def model(p, d):
       a, b, w, c = p
       return a + b * np.cos(w * d + c)
     
    def error(p, d, t):
       return t - model(p, d)
  2. Give the initial guess and fit the data:

    p0 = [.1, 1, .01, .01]
    params = leastsq(error, p0, args=(days, temp))[0]
    print params

    We get the following parameters:

    [ 9.6848106  -7.59870042 -0.01766333 -5.83349705]
    

Note

Here, the third parameter is approximately equal to -2 pi over 365 (the angular frequency of an annual cycle). I believe that the first parameter is equal to the average of all the temperatures, and we can come up with similar explanations...
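That interpretation of the parameters is easy to check on clean synthetic data: generate a pure annual cosine and confirm that leastsq recovers a mean-like offset and an angular frequency of about 2 pi / 365 (up to sign):

```python
import numpy as np
from scipy.optimize import leastsq

days = np.arange(1, 3651).astype(float)  # ten years of daily samples
temp = 9.7 - 7.6 * np.cos(2 * np.pi * days / 365)  # noiseless annual cycle

def model(p, d):
    a, b, w, c = p
    return a + b * np.cos(w * d + c)

def error(p, d, t):
    return t - model(p, d)

p0 = [10.0, -7.0, 0.0172, 0.0]
params = leastsq(error, p0, args=(days, temp))[0]
print(params)
print("2*pi/365 =", 2 * np.pi / 365)
```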

Moving-average temperature model with lag 1


The moving average model of a time series represents the data as oscillations around the mean of the data. It is assumed that the lag components are white noise (not a politically incorrect term as far as I know), which forms a linear combination. We will again use the leastsq function to fit a model:

  1. We will start off with a simple moving-average model. It has only one lag component and therefore only one coefficient. The code snippet is as follows:

    def model(p, ma1):
       return p * ma1
  2. Call the leastsq function. Here, we subtract the mean from the data:

    params = leastsq(error, p0, args=(temp[1:cutoff] - mu, temp[:cutoff-1] - mu))[0]
    print params

    The program prints the following parameter:

    [ 0.94809073]
    

    We get the following plot for the absolute error histogram, which is comparable to the autoregressive model results:
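A full, runnable version of this lag-1 fit on synthetic data (generated here by a lag-1 recursion around a mean of 10, since the De Bilt series isn't reproduced) would look roughly as follows; the error function is the usual data-minus-model residual:

```python
import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(5)
n = 2000
temp = np.empty(n)
temp[0] = 10.0
for i in range(1, n):
    temp[i] = 10 + 0.9 * (temp[i - 1] - 10) + rng.normal(0, 1)

cutoff = int(0.9 * n)
mu = temp[:cutoff].mean()

def model(p, ma1):
    return p * ma1

def error(p, t, ma1):
    return t - model(p, ma1)

params = leastsq(error, [0.5], args=(temp[1:cutoff] - mu,
                                     temp[:cutoff - 1] - mu))[0]
print(params)  # close to the 0.9 used to generate the data
```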

The Autoregressive Moving Average temperature model


The Autoregressive Moving Average (ARMA) model mixes the Autoregressive (AR) and Moving Average (MA) models. We have already discussed both models. Informally, we can say that we have the autoregressive component with white noise around it. Part of this white noise can be modeled as a linear combination of lag components plus some constant as follows:

  1. Define an autoregressive model with lag 2 using linear coefficients we obtained with a previous script:

    def ar(a):
       ar_p = [1.06517683, -0.08293789]
     
       return ar_p[0] * a[1:-1] + ar_p[1] * a[:-2]
  2. Define the moving average model with lag 1:

    def model(p, ma1):
       c0, c1 = p
     
       return c0 + c1 * ma1
  3. Subtract the autoregressive model values from the data, giving us the error terms (white noise):

    err_terms = temp[cutoff+1:] - ar(temp[cutoff-1:])

    Most of the code for this model should appear familiar to you as shown in the following code:

    import sys
    import numpy as np
    import matplotlib.pyplot as plt...
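Here is a compact sketch of the whole ARMA procedure on synthetic AR(2) data (the lag coefficients below are the ones quoted in the text, reused on invented data, so the residuals come out close to white noise):

```python
import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(9)
n = 4000
temp = np.full(n, 10.0)
for i in range(2, n):
    temp[i] = (10 + 1.06 * (temp[i - 1] - 10)
               - 0.08 * (temp[i - 2] - 10) + rng.normal(0, 1))

cutoff = int(0.9 * n)

def ar(a):
    ar_p = [1.06517683, -0.08293789]  # AR(2) coefficients from the text
    return ar_p[0] * a[1:-1] + ar_p[1] * a[:-2]

# what is left over after subtracting the AR part
err_terms = temp[cutoff + 1:] - ar(temp[cutoff - 1:])

def model(p, ma1):
    c0, c1 = p
    return c0 + c1 * ma1

def error(p, t, ma1):
    return t - model(p, ma1)

params = leastsq(error, [0.0, 0.1],
                 args=(err_terms[1:], err_terms[:-1]))[0]
print(params)
```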

The time-dependent temperature mean adjusted autoregressive model


It's a mouthful, but it's not nearly as complicated as it sounds. Let's parse the title in the following points:

  • As we found out, the average temperature for each day of the year seems to fit an annual cycle. It may have to do with the rotation of the Earth around the Sun.

  • There appears to be a trend of increasing temperature. Some have called that global warming and blame industry and human beings in general for it. Without getting into a political discussion, let's assume that there is truth in this claim. Further, let's assume for now that this trend depends on the year. I know I will get into trouble for this, but let's also assume for now that the relation is based on a first-degree polynomial (a straight line).

  • For the sake of argument, let's claim that the previous two points together form a time-dependent mean. We will model what is left over as a linear combination of autoregressive lag components.

We need to perform...
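The points above can be sketched on synthetic data (an invented cycle plus a small linear warming trend): estimate the per-day-of-year mean, fit a straight line in the year to what remains, and treat the sum as the time-dependent mean. The residual is what the autoregressive lag components would then model:

```python
import numpy as np

rng = np.random.default_rng(13)
years = 20
doy = np.tile(np.arange(1, 366), years)
year = np.repeat(np.arange(years), 365).astype(float)
# annual cycle + slow linear warming (0.03 degrees/year) + noise
temp = (10 - 8 * np.cos(2 * np.pi * (doy - 1) / 365)
        + 0.03 * year + rng.normal(0, 1.5, doy.size))

# per-day-of-year averages capture the annual cycle
day_avg = np.array([temp[doy == d].mean() for d in np.arange(1, 366)])

# a first-degree polynomial in the year captures the trend
trend = np.polyfit(year, temp - day_avg[doy - 1], 1)
print(trend)  # the slope should be near the injected 0.03

mean_t = day_avg[doy - 1] + np.polyval(trend, year)
resid = temp - mean_t  # model this remainder with AR lag components
```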

Outliers analysis of average De Bilt temperature


Outliers are values in a dataset that are considered extreme. Outliers can be caused by measurement or other types of errors, or they can be the result of a natural phenomenon. There are several definitions of outliers. In this example, we will use the definition of mild outliers, which depends on the positions of the first and third quartiles. A quarter and three quarters of the items in the dataset are smaller than the first and third quartile values, respectively. The difference between these specific quartiles is called the interquartile range. It's a robust measure of dispersion, similar to the standard deviation. Mild outliers are defined to be 1.5 interquartile ranges away from either the first or the third quartile. We can study the temperature outliers as follows:

  1. Find the first quartile with a function from SciPy:

    q1 = scoreatpercentile(temp, 25)
  2. Find the third quartile:

    q3 = scoreatpercentile(temp, 75)
  3. Find the indices...
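A self-contained version of the quartile computation, on synthetic data with a few extreme values injected, looks like this (the upper fence is included for completeness):

```python
import numpy as np
from scipy.stats import scoreatpercentile

rng = np.random.default_rng(21)
temp = rng.normal(10, 5, 10000)
temp[:10] = 60.0  # inject some clearly extreme readings

q1 = scoreatpercentile(temp, 25)
q3 = scoreatpercentile(temp, 75)
iqr = q3 - q1

# mild outliers: more than 1.5 interquartile ranges outside the quartiles
outliers = np.where((temp < q1 - 1.5 * iqr) | (temp > q3 + 1.5 * iqr))[0]
print(len(outliers), "outliers out of", len(temp))
```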

Using more robust statistics


We can make our code from the section The time-dependent temperature mean adjusted autoregressive model more robust by doing the following:

  • Computing the median instead of the mean

    avgs[i-1] = np.median(temp[indices])
  • Ignoring the outliers with a masked array

    temp[:cutoff] = ma.masked_array(temp[:cutoff], temp[:cutoff] < (q1 - 1.5 * irq))

We get slightly different output with the modified code, with about 70 percent of the values predicted having an absolute error of less than 2 degrees Celsius:

AR params [ 0.95095073 -0.17373633]
% delta less than 2 70.8567244325
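The effect of these robust alternatives is easy to demonstrate on a tiny made-up sample with one absurd reading: the mean is dragged down, while the median and the outlier-masked mean stay sensible:

```python
import numpy as np
import numpy.ma as ma

temp = np.array([8.0, 9.5, 10.2, 9.8, 10.1, -40.0])  # one absurd reading

q1, q3 = np.percentile(temp, [25, 75])
iqr = q3 - q1
masked = ma.masked_array(temp, temp < (q1 - 1.5 * iqr))

print(temp.mean())      # pulled far down by the -40 value
print(np.median(temp))  # barely affected
print(masked.mean())    # the masked entry is ignored entirely
```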

Summary


In this chapter, we learned several simple techniques to predict temperature. Of course, they are not at the level of meteorologists who have access to supercomputers and can apply complex equations. But we did come pretty far with our simple approach.

In the next chapter, we will switch to different datasets. The next chapter will focus on signal processing techniques.
