You're reading from Learning NumPy Array

Product type: Book
Published in: Jun 2014
Reading level: Intermediate
Publisher: Packt Publishing
ISBN-13: 9781783983902
Edition: 1st Edition

Author: Ivan Idris

Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. He enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy 1.5 Beginner's Guide and NumPy Cookbook, both published by Packt Publishing.

Chapter 4. Simple Predictive Analytics with NumPy

Following the exploration of the meteorological data in the previous chapter, we will now try to predict temperature. Usually, weather prediction is accomplished with complex models and top-of-the-line supercomputers. Most people don't have access to such resources, so we will cut corners and use simpler models. The list of topics covered in this chapter is as follows:

  • Autocorrelation

  • Autoregressive models

  • Robust statistics

Examining autocorrelation of average temperature with pandas


The pandas (Python data analysis) library builds on NumPy, Matplotlib, and other Python libraries, wrapping them in convenient high-level data structures. You can find more information, including installation instructions, on the pandas website at http://pandas.pydata.org/pandas-docs/stable/install.html. Most good APIs, such as NumPy, seem to follow the Unix philosophy: keep it simple and do one thing well. This philosophy results in many small tools and utilities that can be used as building blocks for something bigger. The pandas library mimics the R programming language in its approach.

The pandas library has plotting subroutines, which can plot lag and autocorrelation plots. Autocorrelation is correlation within a dataset and can be indicative of a trend. For instance, if we have a lag of one day, we can see whether the average temperature of yesterday influences the temperature today. For that to be the case, the autocorrelation value needs to be relatively high.
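As a quick illustration of the idea (using a synthetic series with an annual cycle rather than the KNMI data, and Python 3 print syntax), the lag-1 autocorrelation can be computed directly with the pandas Series.autocorr method:

```python
import numpy as np
import pandas as pd

# Synthetic "daily temperature": an annual cycle plus noise, so that
# consecutive days are strongly correlated.
rng = np.random.default_rng(42)
days = np.arange(730)
temp = 10 + 8 * np.cos(2 * np.pi * days / 365.25) + rng.normal(0, 1, days.size)

series = pd.Series(temp)
print(series.autocorr(lag=1))    # close to 1: yesterday is a good predictor
print(series.autocorr(lag=180))  # about half a year apart: negative
```

The pandas plotting helpers autocorrelation_plot and lag_plot draw the corresponding diagrams for a whole range of lags.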

Describing data with pandas DataFrames


Luckily, pandas has descriptive statistics utilities. We will read the average wind speed, temperature, and pressure values from the KNMI De Bilt data file into a pandas DataFrame. This object is similar to the R dataframe, which is like a data table in a spreadsheet or a database. The columns are labeled, the data can be indexed, and you can run computations on the data. We will then print out descriptive statistics and a correlation matrix as shown in the following steps:

  1. Read the CSV file with the pandas read_csv function. This function works in a similar fashion to the NumPy loadtxt function:

    to_float = lambda x: .1 * float(x.strip() or np.nan)
    to_date = lambda x: dt.strptime(x, "%Y%m%d")
    cols = [4, 11, 25]
    conv_dict = dict( (col, to_float) for col in cols) 
    
    conv_dict[1] = to_date
    cols.append(1)
     
    headers = ['dates', 'avg_ws', 'avg_temp', 'avg_pres']
    df = pd.read_csv(sys.argv[1], usecols=cols, names=headers, index_col=[0], converters=conv_dict)
  2. Print...
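To show what the descriptive statistics step produces, here is a minimal sketch on a made-up DataFrame with the same column names (the values are random, not the De Bilt measurements):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "avg_ws": rng.normal(4, 1.5, n),     # average wind speed
    "avg_temp": rng.normal(10, 6, n),    # average temperature
    "avg_pres": rng.normal(1013, 8, n),  # average pressure
})

print(df.describe())  # count, mean, std, min, quartiles, and max per column
print(df.corr())      # pairwise correlation matrix
```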

Correlating weather and stocks with pandas


We will try to correlate stock market data for the Netherlands with the DataFrame we produced last time from the KNMI De Bilt weather data. As a proxy for the stock market, we will use closing prices of the EWN ETF. This might not be the best choice, by the way, so if you have a better idea, please use the appropriate stock ticker. The steps for this exercise are provided as follows:

  1. Download the EWN data from Yahoo Finance, with a special function. The code is as follows:

    #EWN start Mar 22, 1996
    start = dt(1996, 3, 22)
    end = dt(2013, 5, 4)
    
    symbol = "EWN"
    quotes = finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
  2. Create a DataFrame object with the available dates in the downloaded data:

    dt_idx = pd.DatetimeIndex(quotes.date)  # build an index from the quote dates
    df2 = pd.DataFrame(quotes.close, index=dt_idx, columns=[symbol])
  3. Join the new DataFrame object with DataFrame of the weather data. We will then obtain the correlation matrix:

    df3 = df.join(df2)
    
    print df3.corr()

    The correlation matrix is as follows:

As...
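The join-then-correlate pattern can be sketched independently of the Yahoo Finance download; here two synthetic frames stand in for the weather data and the EWN closes. Since the price series covers business days only, join() leaves NaN values on the other days, and corr() ignores those pairwise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2000-01-01", periods=500, freq="D")
weather = pd.DataFrame({"avg_temp": rng.normal(10, 6, 500)}, index=dates)

bdates = pd.bdate_range("2000-01-01", periods=300)  # business days only
prices = pd.DataFrame({"EWN": 20 + rng.normal(0, 1, 300).cumsum()},
                      index=bdates)

joined = weather.join(prices)  # left join on the date index
print(joined.corr())
```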

Predicting temperature


Temperature is a thermodynamic variable that quantifies how hot or cold something is. To predict temperature, we can apply our knowledge of thermodynamics and meteorology. In general, this would result in the creation of complex weather models with a multitude of inputs. However, this is beyond the scope of this book, so we will try to keep our continuing example simple and manageable.

Autoregressive model with lag 1

What will the temperature be tomorrow? Probably, the same as today but a bit different. We can assume that temperature is a function of the temperature of the preceding day. This can be justified with the autocorrelation plot that we created earlier. To keep it simple, we will assume further that the function is a polynomial. We will define a cutoff point to be used for the fit. Ninety percent of the data should be used for that purpose. Let's model this idea with NumPy:

  1. Fit the data to polynomials of different degrees with the polyfit function as shown in the following...
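A self-contained sketch of this lag-1 idea, on synthetic data instead of the De Bilt series: regress today's value on yesterday's with polyfit, then measure the absolute error on the held-out 10 percent. Only the degree-1 fit is shown here:

```python
import numpy as np

rng = np.random.default_rng(7)
days = np.arange(3650)
temp = 10 + 8 * np.cos(2 * np.pi * days / 365.25) + rng.normal(0, 2, days.size)

cutoff = int(0.9 * len(temp))  # fit on the first 90 percent of the data

# Degree-1 polynomial: temp[t] is approximated by a * temp[t-1] + b
coeffs = np.polyfit(temp[:cutoff - 1], temp[1:cutoff], 1)

pred = np.polyval(coeffs, temp[cutoff - 1:-1])
delta = np.abs(pred - temp[cutoff:])
print(coeffs)
print("% within 2 degrees:", 100.0 * (delta < 2).mean())
```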

Analyzing intra-year daily average temperatures


We are going to have a look at the temperature variation within a year by converting dates to the corresponding day of the year in numbers. This number is between 1 and 366, where 1 corresponds to January 1st and 365 (or 366) corresponds to December 31st. Perform the following steps to analyze the intra-year daily average temperature:

  1. Initialize arrays for days 1 through 365 (day 366, which occurs only in leap years, is ignored here), with averages initialized to zeros:

    rng = np.arange(1, 366)
    avgs = np.zeros(365)
    avgs2 = np.zeros(365)
  2. Calculate averages by the day of the year before and after a cutoff point. Note that the indices returned by np.where are relative to the slice that was searched, so we index into that same slice of temp:

    for i in rng: 
       indices = np.where(days[:cutoff] == i)
       avgs[i-1] = temp[:cutoff][indices].mean()
       indices = np.where(days[cutoff+1:] == i)
       avgs2[i-1] = temp[cutoff+1:][indices].mean()
  3. Fit the averages before the cutoff point to a quadratic polynomial (just a first-order approximation):

    poly = np.polyfit(rng, avgs, 2)
    print poly

    The following polynomial coefficients in descending power are printed:

    [ -4.91329859e-04...
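The same averaging-and-fitting procedure can be tried end to end on synthetic data (an invented annual cycle, not the KNMI values), which makes it easy to see why the leading coefficient comes out negative: the intra-year profile is concave, peaking in mid-year:

```python
import numpy as np

rng = np.random.default_rng(3)
years = 20
days = np.tile(np.arange(1, 366), years)  # day-of-year label for each sample
temp = (10 - 8 * np.cos(2 * np.pi * (days - 1) / 365)
        + rng.normal(0, 2, days.size))

doy = np.arange(1, 366)
avgs = np.array([temp[days == d].mean() for d in doy])

poly = np.polyfit(doy, avgs, 2)  # coefficients in descending powers
print(poly)
```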

Introducing the day-of-the-year temperature model


Continuing with the work we did in the previous example, I would like to propose a new model, where temperature is a function of the day of the year (between 1 and 366). Of course, this model is not complete, but can be used as a component in a more advanced model, which should take into account the previous autoregressive model that we did with lag 2. The procedure for this model is illustrated as follows:

  1. Fit the temperature data before the cutoff point to a quadratic polynomial just as in the previous section but without averaging:

    poly = np.polyfit(days[:cutoff], temp[:cutoff], 2)
    print poly

    Believe it or not, we get practically the same polynomial coefficients that we got earlier:

    [ -4.91072584e-04   1.92682505e-01  -3.97182941e+00]
    
  2. Calculate the absolute difference between the predicted and actual values:

    delta = np.abs(np.polyval(poly, days[cutoff:]) - temp[cutoff:])
  3. Plot a histogram of the absolute error:

    plt.hist(delta, bins = 10, normed = True)
    plt.show...

Modeling temperature with the SciPy leastsq function


So, now we have two ideas: either the temperature today depends on the temperature of yesterday and the day before yesterday (we assume that some linear combination of the two is formed), or the temperature depends on the day of the year (a number between 1 and 366). We can combine these ideas, but then the question is how. It seems that we could have a multiplicative model or an additive model.

Let's choose the additive model since it seems simpler. This means that we assume that temperature is the sum of the autoregressive component and a cyclical component. It's easy to write this down into one equation. We will use the SciPy leastsq function to minimize the square of the error of this equation. The procedure for this model is illustrated as follows:

  1. Define a function that computes the error of our model. The code is as follows:

    def error(p, d, t, lag2, lag1):
       l2, l1, d2, d1, d0 = p
     
       # subtract the whole model (AR part plus quadratic day-of-year part)
       return t - (l2 * lag2 + l1 * lag1 + d2 * d ** 2 + d1 * d + d0)...
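Putting the pieces together on synthetic data (an invented cycle plus noise; the parenthesized expression is assumed to be the intended additive model), the leastsq call looks like this:

```python
import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(11)
days = np.tile(np.arange(1, 366), 10).astype(float)
temp = (10 - 8 * np.cos(2 * np.pi * (days - 1) / 365)
        + rng.normal(0, 1, days.size))

def error(p, d, t, lag2, lag1):
    l2, l1, d2, d1, d0 = p
    # residual = data minus (autoregressive part + quadratic day part)
    return t - (l2 * lag2 + l1 * lag1 + d2 * d ** 2 + d1 * d + d0)

p0 = [0.1, 0.1, 0.0, 0.0, 0.0]
params = leastsq(error, p0,
                 args=(days[2:], temp[2:], temp[:-2], temp[1:-1]))[0]
print(params)
```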

Day-of-year temperature take two


The quadratic polynomial approximation for the day-of-the-year temperature fit can be improved upon. We haven't used any of the NumPy trigonometric functions until now. Those should be a good fit for this problem. So, let's try a trigonometric function and fit again using a function from the scipy.optimize module (leastsq to be precise) as follows:

  1. Set up a simple model function and an error function to be minimized, as shown in the following code snippet:

    def model(p, d):
       a, b, w, c = p
       return a + b * np.cos(w * d + c)
     
    def error(p, d, t):
       return t - model(p, d)
  2. Give the initial guess and fit the data:

    p0 = [.1, 1, .01, .01]
    params = leastsq(error, p0, args=(days, temp))[0]
    print params

    We get the following parameters:

    [ 9.6848106  -7.59870042 -0.01766333 -5.83349705]
    

Note

Here, the third parameter is approximately equal to -2 pi over 365 (the angular frequency of an annual cycle). I believe that the first parameter is equal to the average of all the temperatures, and we can come up with similar explanations...
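That interpretation of the parameters is easy to check on clean synthetic data: generate a pure annual cosine and confirm that leastsq recovers a mean-like offset and an angular frequency of about 2 pi / 365 (up to sign):

```python
import numpy as np
from scipy.optimize import leastsq

days = np.arange(1, 3651).astype(float)  # ten years of daily samples
temp = 9.7 - 7.6 * np.cos(2 * np.pi * days / 365)  # noiseless annual cycle

def model(p, d):
    a, b, w, c = p
    return a + b * np.cos(w * d + c)

def error(p, d, t):
    return t - model(p, d)

p0 = [10.0, -7.0, 0.0172, 0.0]
params = leastsq(error, p0, args=(days, temp))[0]
print(params)
print("2*pi/365 =", 2 * np.pi / 365)
```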

Moving-average temperature model with lag 1


The moving average model of a time series represents the data as oscillations around the mean of the data. It is assumed that the lag components are white noise (not a politically incorrect term as far as I know), which forms a linear combination. We will again use the leastsq function to fit a model:

  1. We will start off with a simple moving-average model. It has only one lag component and therefore only one coefficient. The code snippet is as follows:

    def model(p, ma1):
       return p * ma1
  2. Call the leastsq function. Here, we subtract the mean from the data:

    params = leastsq(error, p0, args=(temp[1:cutoff] - mu, temp[:cutoff-1] - mu))[0]
    print params

    The program prints the following parameter:

    [ 0.94809073]
    

    We get the following plot for the absolute error histogram, which is comparable to the autoregressive model results:
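A full, runnable version of this lag-1 fit on synthetic data (generated here by a lag-1 recursion around a mean of 10, since the De Bilt series isn't reproduced) would look roughly as follows; the error function is the usual data-minus-model residual:

```python
import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(5)
n = 2000
temp = np.empty(n)
temp[0] = 10.0
for i in range(1, n):
    temp[i] = 10 + 0.9 * (temp[i - 1] - 10) + rng.normal(0, 1)

cutoff = int(0.9 * n)
mu = temp[:cutoff].mean()

def model(p, ma1):
    return p * ma1

def error(p, t, ma1):
    return t - model(p, ma1)

params = leastsq(error, [0.5], args=(temp[1:cutoff] - mu,
                                     temp[:cutoff - 1] - mu))[0]
print(params)  # close to the 0.9 used to generate the data
```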

The Autoregressive Moving Average temperature model


The Autoregressive Moving Average (ARMA) model mixes the Autoregressive (AR) and Moving Average (MA) models. We have already discussed both models. Informally, we can say that we have the autoregressive component with white noise around it. Part of this white noise can be modeled as a linear combination of lag components plus some constant as follows:

  1. Define an autoregressive model with lag 2 using linear coefficients we obtained with a previous script:

    def ar(a):
       ar_p = [1.06517683, -0.08293789]
     
       return ar_p[0] * a[1:-1] + ar_p[1] * a[:-2]
  2. Define the moving average model with lag 1:

    def model(p, ma1):
       c0, c1 = p
     
       return c0 + c1 * ma1
  3. Subtract the autoregressive model values from the data, giving us the error terms (white noise):

    err_terms = temp[cutoff+1:] - ar(temp[cutoff-1:])

    Most of the code for this model should appear familiar to you as shown in the following code:

    import sys
    import numpy as np
    import matplotlib.pyplot as plt...
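Here is a compact sketch of the whole ARMA procedure on synthetic AR(2) data (the lag coefficients below are the ones quoted in the text, reused on invented data, so the residuals come out close to white noise):

```python
import numpy as np
from scipy.optimize import leastsq

rng = np.random.default_rng(9)
n = 4000
temp = np.full(n, 10.0)
for i in range(2, n):
    temp[i] = (10 + 1.06 * (temp[i - 1] - 10)
               - 0.08 * (temp[i - 2] - 10) + rng.normal(0, 1))

cutoff = int(0.9 * n)

def ar(a):
    ar_p = [1.06517683, -0.08293789]  # AR(2) coefficients from the text
    return ar_p[0] * a[1:-1] + ar_p[1] * a[:-2]

# what is left over after subtracting the AR part
err_terms = temp[cutoff + 1:] - ar(temp[cutoff - 1:])

def model(p, ma1):
    c0, c1 = p
    return c0 + c1 * ma1

def error(p, t, ma1):
    return t - model(p, ma1)

params = leastsq(error, [0.0, 0.1],
                 args=(err_terms[1:], err_terms[:-1]))[0]
print(params)
```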

The time-dependent temperature mean adjusted autoregressive model


It's a mouthful, but it's not nearly as complicated as it sounds. Let's parse the title in the following points:

  • As we found out, the average temperature for each day of the year seems to fit an annual cycle. It may have to do with the rotation of the Earth around the Sun.

  • There appears to be a trend of increasing temperature. Some have called that global warming and blame industry and human beings in general for it. Without getting into a political discussion, let's assume that there is truth in this claim. Further, let's assume for now that this trend depends on the year. I know I will get into trouble for this, but let's also assume for now that the relation is based on a first-degree polynomial (a straight line).

  • For the sake of argument, let's claim that the previous two points together form a time-dependent mean. We will model what is left over as a linear combination of autoregressive lag components.

We need to perform...
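The points above can be sketched on synthetic data (an invented cycle plus a small linear warming trend): estimate the per-day-of-year mean, fit a straight line in the year to what remains, and treat the sum as the time-dependent mean. The residual is what the autoregressive lag components would then model:

```python
import numpy as np

rng = np.random.default_rng(13)
years = 20
doy = np.tile(np.arange(1, 366), years)
year = np.repeat(np.arange(years), 365).astype(float)
# annual cycle + slow linear warming (0.03 degrees/year) + noise
temp = (10 - 8 * np.cos(2 * np.pi * (doy - 1) / 365)
        + 0.03 * year + rng.normal(0, 1.5, doy.size))

# per-day-of-year averages capture the annual cycle
day_avg = np.array([temp[doy == d].mean() for d in np.arange(1, 366)])

# a first-degree polynomial in the year captures the trend
trend = np.polyfit(year, temp - day_avg[doy - 1], 1)
print(trend)  # the slope should be near the injected 0.03

mean_t = day_avg[doy - 1] + np.polyval(trend, year)
resid = temp - mean_t  # model this remainder with AR lag components
```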

Outliers analysis of average De Bilt temperature


Outliers are values in a dataset that are considered extreme. Outliers can be caused by measurement or other types of errors, or they can be the result of a natural phenomenon. There are several definitions of outliers. In this example, we will use the definition of mild outliers, which depends on the positions of the first and third quartiles. A quarter and three quarters of the items in the dataset are smaller than the first and third quartile values, respectively. The difference between these specific quartiles is called the interquartile range. It's a robust measure of dispersion, similar to the standard deviation. Mild outliers are defined to be 1.5 interquartile ranges away from either the first or the third quartile. We can study the temperature outliers as follows:

  1. Find the first quartile with a function from SciPy:

    q1 = scoreatpercentile(temp, 25)
  2. Find the third quartile:

    q3 = scoreatpercentile(temp, 75)
  3. Find the indices...
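A self-contained version of the quartile computation, on synthetic data with a few extreme values injected, looks like this (the upper fence is included for completeness):

```python
import numpy as np
from scipy.stats import scoreatpercentile

rng = np.random.default_rng(21)
temp = rng.normal(10, 5, 10000)
temp[:10] = 60.0  # inject some clearly extreme readings

q1 = scoreatpercentile(temp, 25)
q3 = scoreatpercentile(temp, 75)
iqr = q3 - q1

# mild outliers: more than 1.5 interquartile ranges outside the quartiles
outliers = np.where((temp < q1 - 1.5 * iqr) | (temp > q3 + 1.5 * iqr))[0]
print(len(outliers), "outliers out of", len(temp))
```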

Using more robust statistics


We can make our code from the section The time-dependent temperature mean adjusted autoregressive model more robust by doing the following:

  • Computing the median instead of the mean

    avgs[i-1] = np.median(temp[indices])
  • Ignoring the outliers with a masked array

    temp[:cutoff] = ma.masked_array(temp[:cutoff], temp[:cutoff] < (q1 - 1.5 * irq))

We get slightly different output with the modified code, with about 70 percent of the values predicted having an absolute error of less than 2 degrees Celsius:

AR params [ 0.95095073 -0.17373633]
% delta less than 2 70.8567244325
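The effect of these robust alternatives is easy to demonstrate on a tiny made-up sample with one absurd reading: the mean is dragged down, while the median and the outlier-masked mean stay sensible:

```python
import numpy as np
import numpy.ma as ma

temp = np.array([8.0, 9.5, 10.2, 9.8, 10.1, -40.0])  # one absurd reading

q1, q3 = np.percentile(temp, [25, 75])
iqr = q3 - q1
masked = ma.masked_array(temp, temp < (q1 - 1.5 * iqr))

print(temp.mean())      # pulled far down by the -40 value
print(np.median(temp))  # barely affected
print(masked.mean())    # the masked entry is ignored entirely
```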

Summary


In this chapter, we learned several simple techniques to predict temperature. Of course, they are not at the level of meteorologists who have access to supercomputers and can apply complex equations. But we did come pretty far with our simple approach.

In the next chapter, we will switch to different datasets. The next chapter will focus on signal processing techniques.
