Learning NumPy Array
Supercharge your scientific Python computations by understanding how to use the NumPy library effectively with this book and ebook.
In this article by Ivan Idris, author of Learning NumPy Array, we will learn about some signal-processing techniques and use them to analyze time-series data. As example data, we will use the sunspot data provided by a Belgian scientific institute. This data can be downloaded from several places on the Internet, and it is also provided as sample data by the statsmodels library. There are a number of things we can do with the data, such as:
- Trying to determine periodic cycles within the data. This can be done, but it is a bit advanced, so we will just get you started.
- Smoothing the data to filter out noise.
Introducing the Sunspot data
Sunspots are dark spots visible on the Sun's surface. This phenomenon has been studied for many centuries by astronomers. Evidence has been found for periodic sunspot cycles. We can download up-to-date annual sunspot data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. This is provided by the Belgian Solar Influences Data Analysis Center. The data goes back to 1700 and contains more than 300 annual averages. In order to determine sunspot cycles, scientists have successfully used the Hilbert-Huang transform (refer to http://en.wikipedia.org/wiki/Hilbert%E2%80%93Huang_transform). A major part of this transform is the so-called Empirical Mode Decomposition (EMD) method. The entire algorithm contains many iterative steps, and we will cover only some of them here. EMD reduces data to a group of Intrinsic Mode Functions (IMFs). You can compare this to the way the Fast Fourier Transform decomposes a signal into a superposition of sine and cosine terms.
Extracting IMFs is done via a sifting process. The sifting of a signal is related to separating out components of a signal one at a time. The first step of this process is identifying local extrema. We will perform the first step and plot the data with the extrema we found. Let's download the data in CSV format. We also need to reverse the array to have it in the correct chronological order. The following code snippet finds the indices of the local minima and maxima respectively:
```python
mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
```

Note that argrelmin() and argrelmax() return a tuple of index arrays (one per axis), so we take the first element to get a plain array of indices.
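To see what these functions return, here is a small sketch on a made-up toy array (not the sunspot series), showing that the indices of interior local minima and maxima are picked out while the endpoints are ignored:

```python
import numpy as np
from scipy import signal

# Illustrative toy series with obvious troughs and a peak
toy = np.array([3.0, 1.0, 2.0, 5.0, 4.0, 2.0, 6.0])

# argrelmin/argrelmax return a tuple of index arrays, one per axis
mins = signal.argrelmin(toy)[0]
maxs = signal.argrelmax(toy)[0]

print(mins)  # [1 5] -> local minima at values 1.0 and 2.0
print(maxs)  # [3]   -> local maximum at value 5.0 (the endpoint 6.0 is not counted)
```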
Now we need to concatenate these arrays and use the indices to select the corresponding values. The following code accomplishes that and also plots the data:
```python
import numpy as np
import sys
import matplotlib.pyplot as plt
from scipy import signal

# Load the annual sunspot averages from the CSV file given on the command line
data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)

# Reverse to chronological order
data = data[::-1]

# Indices of local minima and maxima
mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
extrema = np.concatenate((mins, maxs))

year_range = np.arange(1700, 1700 + len(data))
plt.plot(1700 + extrema, data[extrema], 'go')
plt.plot(year_range, data)
plt.show()
```
We will see the following chart:
In this plot, you can see that the extrema are indicated with dots.
The next steps in the sifting process require us to interpolate with a cubic spline of the minima and maxima. This creates an upper envelope and a lower envelope, which should surround the data. The mean of the envelopes is needed for the next iteration of the EMD process. We can interpolate minima with the following code snippet:
```python
spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
min_rng = np.arange(mins.min(), mins.max())
l_env = spl_min(min_rng)
```
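To make the envelope idea concrete, here is a minimal sketch on hypothetical minima (the indices and values below are made up, standing in for `mins` and `data[mins]`). A cubic spline through the minima gives a smooth lower envelope, valid only between the first and last minimum:

```python
import numpy as np
from scipy import interpolate

# Hypothetical minima indices and values (stand-ins for mins and data[mins])
min_idx = np.array([2, 9, 17, 25, 33])
min_val = np.array([5.0, 3.0, 6.0, 2.0, 4.0])

# Cubic spline through the minima; needs at least 4 points
spl_min = interpolate.interp1d(min_idx, min_val, kind='cubic')

# The spline is only valid inside [min_idx.min(), min_idx.max()]
min_rng = np.arange(min_idx.min(), min_idx.max())
l_env = spl_min(min_rng)

print(l_env.shape)   # (31,)
print(l_env[0])      # 5.0 -- the spline passes through the knots
```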
Similar code can be used to interpolate the maxima. We need to be aware that the interpolation results are only valid within the range over which we interpolate: it starts at the first local minimum (or maximum) and ends at the last one. Unfortunately, the interpolation ranges we can define in this way for the maxima and minima do not match perfectly. So, for the purpose of plotting, we need to extract a shorter range that lies within both the maxima and minima interpolation ranges. Have a look at the following code:
```python
import numpy as np
import sys
import matplotlib.pyplot as plt
from scipy import signal
from scipy import interpolate

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)

# Reverse to chronological order
data = data[::-1]

mins = signal.argrelmin(data)[0]
maxs = signal.argrelmax(data)[0]
extrema = np.concatenate((mins, maxs))
year_range = np.arange(1700, 1700 + len(data))

# Lower envelope: cubic spline through the minima
spl_min = interpolate.interp1d(mins, data[mins], kind='cubic')
min_rng = np.arange(mins.min(), mins.max())
l_env = spl_min(min_rng)

# Upper envelope: cubic spline through the maxima
spl_max = interpolate.interp1d(maxs, data[maxs], kind='cubic')
max_rng = np.arange(maxs.min(), maxs.max())
u_env = spl_max(max_rng)

# Range covered by both interpolations
inclusive_rng = np.arange(max(min_rng[0], max_rng[0]), min(min_rng[-1], max_rng[-1]))
mid = (spl_max(inclusive_rng) + spl_min(inclusive_rng)) / 2

plt.plot(year_range, data)
plt.plot(1700 + min_rng, l_env, '-x')
plt.plot(1700 + max_rng, u_env, '-x')
plt.plot(1700 + inclusive_rng, mid, '--')
plt.show()
```
The code produces the following chart:
What you see is the observed data, with the computed envelopes and mid line. Obviously, negative values don't make any sense in this context. However, for the algorithm we only need to care about the mid line of the upper and lower envelopes. In these first two sections, we basically performed the first iteration of the EMD process. The full algorithm is a bit more involved, so we will leave it up to you whether you want to continue with this analysis on your own.
Moving averages
Moving averages are tools commonly used to analyze time-series data. A moving average defines a window of previously seen data that is averaged each time the window slides forward one period. The different types of moving average differ essentially in the weights used for averaging. The exponential moving average, for instance, has exponentially decreasing weights with time. This means that older values have less influence than newer values, which is sometimes desirable.
For the exponential moving average, we can compute the normalized, exponentially decreasing weights as follows in NumPy code:
```python
weights = np.exp(np.linspace(-1., 0., N))
weights /= weights.sum()
```
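As a sketch of how these weights could be applied, the snippet below convolves them with an illustrative series (the window size N and the price series are assumptions, not from the original data). One subtlety worth noting: because convolution reverses the kernel, these ascending weights end up emphasizing the oldest sample in each window; pass `weights[::-1]` if you want the newest sample weighted most heavily.

```python
import numpy as np

N = 5
prices = np.arange(1.0, 11.0)  # illustrative 10-point series: 1..10

weights = np.exp(np.linspace(-1., 0., N))
weights /= weights.sum()       # weights now sum to 1

# Keep only the windows fully covered by the data
ema = np.convolve(weights, prices)[N-1:-N+1]
print(len(ema))  # 6 values for a 10-point series and a 5-point window
```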
A simple moving average uses equal weights which, in code, looks as follows:
```python
def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]
```
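As a quick sanity check on toy data (the function is repeated here so the snippet runs standalone), each output value is simply the mean of a window of n consecutive inputs:

```python
import numpy as np

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(sma(values, 2))  # [1.5 2.5 3.5 4.5] -- means of adjacent pairs
```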
The following code computes and plots simple moving averages of the sunspot data with 11-year and 22-year windows:
```python
import numpy as np
import sys
import matplotlib.pyplot as plt

data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1)

# Reverse to chronological order
data = data[::-1]
year_range = np.arange(1700, 1700 + len(data))

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

sma11 = sma(data, 11)
sma22 = sma(data, 22)

plt.plot(year_range, data, label='Data')
plt.plot(year_range[10:], sma11, '-x', label='SMA 11')
plt.plot(year_range[21:], sma22, '--', label='SMA 22')
plt.legend()
plt.show()
```
In the following plot, we see the original data and the simple moving averages for 11- and 22-year periods. As you can see, moving averages are not a good fit for this data; this is generally the case for sinusoidal data.
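One way to see why moving averages struggle with sinusoidal data: averaging over a window that spans a full period cancels the oscillation almost entirely. Here is a minimal sketch with a synthetic sine wave of period 11 (loosely inspired by the roughly 11-year sunspot cycle; the data here is made up):

```python
import numpy as np

# A pure sinusoid with an 11-sample period
t = np.arange(110)
wave = np.sin(2 * np.pi * t / 11)

def sma(arr, n):
    weights = np.ones(n) / n
    return np.convolve(weights, arr)[n-1:-n+1]

# Averaging over exactly one full period wipes out the oscillation
smoothed = sma(wave, 11)
print(wave.max())              # close to 1.0
print(np.abs(smoothed).max())  # essentially 0 -- the signal is gone
```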
This article gave us examples of signal processing and time-series analysis. We looked at the sifting process and performed the first iteration of the EMD algorithm. We also learned about moving averages, which are tools commonly used to analyze time-series data.
About the Author
Ivan Idris was born in Bulgaria to Indonesian parents. He moved to the Netherlands and graduated from university with a degree in Experimental Physics.
His graduation thesis had a strong emphasis on Applied Computer Science. After graduating, he worked for several companies as a Java Developer, Data Warehouse Developer, and QA Analyst.
His main professional interests are business intelligence, big data, and cloud computing. He enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy Beginner's Guide, NumPy Cookbook, and Learning NumPy Array.