Loading data files
When performing a statistical analysis of a particular problem, you often have some data stored in a file. You can save your variables (or the entire workspace) using different file formats and then load them back in again. Octave can, of course, also load data from files generated by other programs. There are certain restrictions when you do this, which we will discuss here. In what follows, we will only consider ASCII files, that is, readable text files.
When you load data from an ASCII file using the load command, the data is treated as a two-dimensional array. We can then think of the data as a matrix, where the lines in the file represent the matrix rows and the number fields in each line the matrix columns. For this matrix to be well defined, the data must be organized such that all rows have the same number of columns (and, therefore, all columns the same number of rows). For example, the content of a file called series.dat can be:
Next, we load this into Octave's workspace:
octave:1> load -ascii series.dat;
whereby the data is stored in the variable named series. In fact, Octave is capable of loading the data even if you do not specify the ASCII format. The number of rows and columns are then:
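The number of rows and columns can be retrieved with the built-in size function; a minimal sketch, assuming series.dat was loaded into the variable series as above:

```octave
% size returns the dimensions of the loaded matrix: the number of
% lines in the file and the number of values per line.
[nrows, ncols] = size(series)
```

size(series, 1) and size(series, 2) give the number of rows and columns separately.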
I prefer the file extension .dat, but again this is optional and can be anything you wish, say .txt, .ascii, .data, or nothing at all.
In the data files you can have:
- Octave comments
- Data blocks separated by blank lines (or equivalent empty rows)
- Tabs or single or multiple spaces as number separators
Thus, the following data file will successfully load into Octave:
# First block
1 232 334
2 245 334
3 456 342
4 555 321
# Second block
1 231 334
2 244 334
3 450 341
4 557 327
The resulting variable is a matrix with 8 rows and 3 columns. If you know the number of blocks or the block sizes, you can then separate the blocked data.
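For example, with the two four-line blocks above, the separation can be sketched with simple row indexing (here the hypothetical variable name data stands for whatever variable the file was loaded into):

```octave
% Split the 8x3 matrix into its two 4x3 blocks by row indexing.
blockA = data(1:4, :);   % first block (rows 1 to 4)
blockB = data(5:8, :);   % second block (rows 5 to 8)
```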
Now, the following data stored in the file bad.dat will not load into Octave's workspace:
1 232.1 334
because line 1 has three columns whereas lines 2-4 have two columns. If you try to load this file, Octave will complain:
octave:3> load -ascii bad.dat
error: load: bad.dat: inconsistent number of columns near line 2
error: load: unable to extract matrix size from file 'bad.dat'
Simple descriptive statistics
Consider an Octave function mcintgr and its vectorized version mcintgrv. This function can evaluate the integral of a mathematical function f over some interval [a, b] where f is positive. The Octave function is based on the Monte Carlo method, and the return value, that is, the integral, is therefore a stochastic variable. When we calculate a given integral, we should, as a minimum, present the result as a mean or another appropriate measure of a central value together with an associated statistical uncertainty. This is true for any other stochastic variable, be it the height of the pupils in a class, the length of a plant's leaves, and so on.
In this section, we will use Octave for the most simple statistical description of stochastic variables.
Histogram and moments
Let us calculate the integral given in Equation (5.9) one thousand times using the vectorized version of the Monte Carlo integrator:
octave:4> for i=1:1000
> s(i) = mcintgrv("sin", 0, pi, 1000);
> endfor
The array s now contains a sequence of numbers which we know are approximately 2. Before we make any quantitative statistical description, it is always a good idea to first plot a histogram of the data as this gives an approximation to the true underlying probability distribution of the variable s. The easiest way to do this is by using Octave's hist function which can be called using:
octave:5> hist(s, 30, 1)
The first argument, s, to hist is the stochastic variable, the second is the number of bins that s should be grouped into (here we have used 30), and the third argument gives the sum of the heights of the histogram (here we set it to 1). The histogram is shown in the figure below. If hist is called simply as hist(s), s is grouped into ten bins and the sum of the heights of the histogram is equal to the number of elements in s.
From the figure, we see that mcintgrv produces a sequence of random numbers that appear to be normally (or Gaussian) distributed with a mean of 2. This is what we expected. It then makes good sense to describe the variable via the sample mean, defined as:

$\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i$

where N is the number of samples (here 1000) and $s_i$ the i'th data point, as well as the sample variance, given by:

$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (s_i - \bar{s})^2$
The variance is a measure of the width of the distribution and therefore an estimate of the statistical uncertainty of the mean value. Sometimes, one uses the standard deviation instead of the variance. The standard deviation is simply the square root of the variance, $\sigma = \sqrt{\sigma^2}$.
To calculate the sample mean, sample variance, and standard deviation in Octave, you use:
octave:6> mean(s)
ans = 1.9999
octave:7> var(s)
ans = 0.002028
octave:8> std(s)
ans = 0.044976
In the statistical description of the data, we can also include the skewness, which measures the symmetry of the underlying distribution around the mean. If it is positive, it is an indication that the distribution has a long tail stretching towards positive values with respect to the mean. If it is negative, the distribution has a long negative tail. The skewness is often defined as:

$\mathrm{skewness} = \frac{1}{N\sigma^3}\sum_{i=1}^{N} (s_i - \bar{s})^3$

We can calculate this in Octave via the skewness function:
octave:9> skewness(s)
ans = -0.15495
This result is a bit surprising because we would assume from the histogram that the data set represents numbers picked from a normal distribution, which is symmetric around the mean and therefore has zero skewness. It illustrates an important point: be careful when using the skewness as a direct measure of the distribution's symmetry; you need a very large data set to get a good estimate.
You can also calculate the kurtosis, which measures the flatness of the sample distribution compared to a normal distribution. A negative kurtosis indicates a relatively flat distribution around the mean, while a positive kurtosis indicates that the sample distribution has a sharp peak around the mean. The kurtosis is defined by the following:

$\mathrm{kurtosis} = \frac{1}{N\sigma^4}\sum_{i=1}^{N} (s_i - \bar{s})^4 - 3$

It can be calculated by the kurtosis function:
octave:10> kurtosis(s)
ans = -0.02310
The kurtosis has the same problem as the skewness—you need a very large sample size to obtain a good estimate.
As you may know, the sample mean, variance, skewness, and kurtosis are examples of sample moments. The mean is related to the first moment, the variance to the second moment, and so forth. Now, the moments are not uniquely defined. One can, for example, define the k'th absolute sample moment $p_k^a$ and the k'th central sample moment $p_k^c$ as:

$p_k^a = \frac{1}{N}\sum_{i=1}^{N} s_i^k \quad \text{and} \quad p_k^c = \frac{1}{N}\sum_{i=1}^{N} (s_i - \bar{s})^k$
Notice that the first absolute moment is simply the sample mean, but the first central sample moment is zero. In Octave, you can easily retrieve the sample moments using the moment function, for example, to calculate the second central sample moment you use:
octave:11> moment(s, 2, 'c')
ans = 0.002022
Here the first input argument is the sample data, the second defines the order of the moment, and the third argument specifies whether we want the central moment 'c' or absolute moment 'a' which is the default. Compare the output with the output from Command 7—why is it not the same?
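If you want to check the reason numerically: var normalizes by N - 1, whereas the central sample moment divides by N, so the two differ by a factor of N/(N - 1). A small sketch, using the data vector s from above:

```octave
% Rescaling the second central moment by N/(N - 1) recovers the
% sample variance computed by var.
N = length(s);
moment(s, 2, 'c') * N/(N - 1)   % equals var(s)
```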
Comparing data sets
Above, it was shown how you can use Octave to perform the very simplest statistical description of a single data set. In this section, we shall see how to statistically compare two data sets. What exactly do we mean by a statistical comparison? For example, we could test whether two data sets statistically have the same mean (this is known as Student's t-test), or whether they have the same underlying probability distribution (the χ²-test).
In Octave, you can perform almost all known statistical tests: Student's t-test, the z-test, the Kolmogorov-Smirnov test, and many more. Here I will only show you how to perform one variant of the t-test and how to compute the Pearson correlation coefficient.
The correlation coefficient
The following table shows the height and weight of boys from age 2 to 15:
One can easily see that both height and weight increase with age. To see if the two data sets are actually correlated, we need to be a bit more careful. Usually, the correlation between two data sets is quantified by Pearson's correlation coefficient, which is given by:

$r_p = \frac{1}{(N-1)\sigma_x\sigma_y}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$

where $\sigma_x$ is the standard deviation of the data set $x_i$ and $\sigma_y$ is the standard deviation of $y_i$. Values of $r_p$ near one indicate good correlation between the two data sets. The Pearson correlation coefficient is easily calculated in Octave using cor (or corrcoef). This function has the syntax:
r = cor(x, y)
No need to explain, I guess.
Assume that we have stored the data in an ASCII file called boys.dat like this:
# Age weight Height
2 12.5 85.5
3 13.2 93.2
octave:12> load -ascii boys.dat;
We then need to find the correlation between the second and the third column:
octave:13> cor(boys(:,2), boys(:,3))
ans = 0.97066
That is, the two data sets are indeed correlated, which we would expect.
The Student's t-test
The following sequence of numbers shows the height of the pupils in a class of 21 children (in centimetres):
156.92 140.00 163.20 167.24 149.84 149.21 166.86 152.01
147.53 157.56 154.48 170.33 155.82 162.24 161.43 174.94
146.30 151.08 150.82 154.49 165.98
The mean is 157.07 centimetres, while the national average height is 161.11 centimetres. Under the assumption that the heights of the pupils are normally distributed around the mean, can we show that the mean is statistically the same as the national average? Octave's t_test function can help. A simple version of the syntax is:
pvalue = t_test(x, m)
Here pvalue is the probability that the null hypothesis (that the two means are the same) is true, x is the data, and m is the mean that we test against.
Suppose we have the heights stored in an array called heights. To perform the test, we use:
octave:14> t_test(heights, 161.11)
ans = 0.0508369
which means that we cannot conclude with certainty that the mean height in the class is the same as the national average height, assuming no variance in the latter. Usually, one accepts the null hypothesis for pvalue > 0.05, so here we have a borderline situation.
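t_test can also return the value of the test statistic and the number of degrees of freedom, which is useful if you want to report the test in full. A small sketch, using the heights array from above:

```octave
% Request the t statistic and degrees of freedom along with the p-value.
[pval, t, df] = t_test(heights, 161.11)
```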
The table below lists some of the most common statistical test functions:
Function fitting
In many areas of science, you often want to fit a function to data. This function can represent either an empirical or a theoretical model. There are many reasons to do this; for example, if the theoretical model agrees with the observed data values, the theory is likely to be right, and hopefully you have gained new insight into the phenomenon you are studying.
In this section, we will discuss some of Octave's fitting functionality. I will not go into detail about the algorithms behind the fitting functions; this would simply take up too much space and would not be of much relevance here.
Suppose we want to investigate the length of the leaves in two different families of trees at different heights. Normally, the leaves are largest near the ground in order to increase the photosynthesis. The figure below shows fabricated data of the leaf length as a function of height from the ground for two imaginary families of trees called tree A (red squares) and tree B (blue triangles). For some reason, we have the idea that the leaf length for tree A, which we denote $y_A$, is a linear function of the height x, but for tree B, the leaf length $y_B$ follows a third order polynomial in the height. That is, we should test if the models:

$y_A = c_1 x + c_0 \quad \text{and} \quad y_B = c_3 x^3 + c_2 x^2 + c_1 x + c_0$

can fit the data well if we use the polynomial coefficients as fitting parameters.
In Octave this is straightforward using polyfit. This function can be called via the following syntax:
[cfit s] = polyfit(x, y, n)
where x is the independent/free variable (in this case the height), y is the measured data (leaf length), and n is the degree of the polynomial. The first output is an array with the polynomial coefficients and therefore has length n+1, and the second is a structure that contains information about the fit. We shall study some of the important structure fields below.
Time for action – using polyfit
- Assume that we have loaded the relevant data into Octave and stored the leaf lengths of tree A and B in variables yA and yB. The corresponding heights are stored in xA and xB.
- To fit the linear model to the data in yA, we do the following:
octave:15> [cA sA] = polyfit(xA, yA, 1);
- We must check if the fit made any sense at all. First, we can plot the resulting linear fit together with the data:
octave:16> plot(xA, yA, 'rs', xA, sA.yf, 'r');
The resulting fit is shown in the figure below with red squares and a red line.
- The fit of yB to the third order polynomial follows the same procedure:
octave:17> [cB sB] = polyfit(xB, yB, 3);
octave:18> hold on; plot(xB, yB, 'bv', xB, sB.yf, 'b');
The plot is shown in the figure below:
Notice that I have used the plotting style formats 'rs' and 'bv'. Depending on your plotting backend, these may not be supported. You can change them to, for example, 'ro' and 'b+'.
What just happened?
polyfit finds the polynomial coefficients $c_n, c_{n-1}, \ldots, c_1, c_0$ such that the difference between the polynomial (our statistical model) $y = y(x_i, c_n, c_{n-1}, \ldots, c_1, c_0)$ and the data points $y_i$ is minimized by some measure. This measure is the sum of the squared residuals, $r_i = y_i - y(x_i)$, that is:

$\sum_{i=1}^{N} r_i^2 = \sum_{i=1}^{N} \left(y_i - y(x_i, c_n, c_{n-1}, \ldots, c_1, c_0)\right)^2$

where N is the number of fitting points. This fitting procedure is called a least squares fit.
As mentioned above, polyfit returns a structure that stores information about the fit. In Command 16, we use one of the structure fields yf that contains the fitted values y (xi, cn, cn–1,...,c1,c0) to plot the resulting fit. We could alternatively have used the polynomial coefficients returned in cA:
octave:20> plot(xA, yA, 'rs', xA, polyval(cA, xA), 'r');
Goodness of the fit
From the figure above, it looks as if the fits represent the data quite well, that is, the polynomials seem to be good models of the leaf length variation. A visual verification of the fits and models in general is not really enough. Scientists often prefer some objective quantity to indicate whether the fit is satisfactory or not.
polyfit stores a quantity in the structure field normr, namely the 2-norm of the residuals. This is given by:

$\|r\|_2 = \sqrt{\sum_{i=1}^{N} r_i^2}$

This is, however, not of much help here because this quantity depends on the absolute values of the residuals. One can instead use the correlation coefficient:

$r^2 = 1 - \frac{\|r\|_2^2}{(N-1)\sigma_y^2} \qquad (7.9)$

where $\sigma_y$ is the standard deviation of the data $y_i$.
You can see that for small residuals (and possibly a good fit), the correlation coefficient is close to 1; if the fit is poor, it is close to 0. Unfortunately, polyfit won't calculate the quantity for you, but you can easily do it yourself.
Time for action – calculating the correlation coefficient
Let us try to calculate the correlation coefficient for the fit of the leaf length for tree A. We just need to follow Equation (7.9):
octave:21> denom = (length(yA) - 1)*var(yA);
octave:22> rcor = 1 - sA.normr^2/denom
rcor = 0.96801
This gives an indication that the fit is good as we expected.
What just happened?
In Command 21, we calculated the denominator in Equation (7.9). Notice that instead of calculating the square of the standard deviation, we simply use the variance found with var. From Equation (7.9), we see that the 2-norm of the residuals enters the numerator. This is already calculated by polyfit and stored in the structure field normr, so we use it in the evaluation of the correlation coefficient.
If there is a systematic deviation between the fit and the data, the model may have to be rejected. Such deviations can be small enough that they are not captured by the correlation coefficient; they can, however, be seen in a residual plot. In Octave, residual plots are straightforward to make, and I leave this to you as an exercise.
Of course, not everything can be described via simple polynomial models. Phenomena related to growth are for example often described via an exponential function or a power law. These two examples are trivial to solve with the polynomial fitting technique discussed above, but you may have particular fitting models that need a more general fitting procedure. Octave can help you with this as well.
Before we discuss how to fit more general functions to data, it is worthwhile to see if we can transform the fitting function into a different form that allows us to use the polynomial fitting procedure above.
A very simple example of such a transform is if the data set follows a power law function, say:

$y = b x^a \qquad (7.10)$

We can transform this into a linear equation by taking the logarithm on both sides:

$\ln(y) = a\ln(x) + \ln(b) \qquad (7.11)$
In this way, the new variable $y' = \ln(y)$ is a linear function of $x' = \ln(x)$; that is, we can write $y' = ax' + b'$ with $b' = \ln(b)$. We then fit Equation (7.11) to the transformed data using the polyfit function. Remember to transform the fitted parameter $b'$ back using the inverse transform, $b = e^{b'}$.
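As a small sketch of this workflow, we can generate noise-free power law data ourselves and recover the parameters with polyfit in log-log space (the values b = 2.5 and a = 1.3 are simply made up for the illustration):

```octave
% Fit y = b*x^a by fitting the straight line y' = a*x' + b'
% to the log-transformed data.
x = linspace(1, 10, 50);
y = 2.5*x.^1.3;                  % fabricated data with b = 2.5, a = 1.3
c = polyfit(log(x), log(y), 1);  % c(1) is the slope a, c(2) is b'
a = c(1)                         % recovers the exponent a
b = exp(c(2))                    % inverse transform: b = e^(b')
```

With noisy data, the recovered parameters will of course scatter around the true values.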
In the example above, the transform is trivial. You could consider other possible ways of transforming your model. For example, if it is possible to Fourier transform the data and the model it could perhaps be a better choice to perform the fit in Fourier space.
General least squares fitting
In case you cannot transform the model into a proper form that allows you to use polyfit, you can use leasqr. This function comes with the optimization package optim which can be downloaded from the Octave-Forge web page. leasqr is a really powerful function that allows you to perform even the most complicated fits—if I must choose my favorite Octave function, I think this would be it.
The syntax is as follows:
[yfit pfit cvg iter ...] = leasqr(x, y, p, fun, opt)
Describing all the inputs and outputs in sufficient detail would take a great deal of effort, so we shall only discuss a limited number of features, but these will take us quite far. If you are interested in knowing more about the function simply type: help leasqr.
The inputs are:
x: the independent variable
y: the measured/observed data
p: an initial parameter guess
fun: the model (fitting function)
opt: up to six optional arguments specifying the weights of each data point, maximum number of iterations the algorithm can perform, and so on
leasqr can return 10 outputs, the first four are:
yfit: the fitted function values
pfit: the fitted parameter values
cvg: is 1 if the fit converged (likely to be successful), 0 if not
iter: number of iterations done
The crucial point when using leasqr is that you must provide the model in the form of an Octave function that can be fitted to the data set. This function must be dependent on at least one parameter and follow the syntax:
y = fun(x,p)
Again, x is the free variable, and p is the parameter list.
Let us illustrate leasqr by fitting the following two-parameter model:

$y = \frac{1}{1 + \alpha x^{\beta}}$

to some data that we simply generate ourselves. Note that α and β are the fitting parameters.
Time for action – using leasqr
- Let us generate data with normally distributed random noise:
octave:23> x = linspace(0, 5); y = 1./(1 + 1.2*x.^1.8) + \
> 0.1*randn(1, 100);
- Then we specify the model using the following function definition:
octave:24> function y = ffun(x, p)
> y = 1./(1 + p(1)*x.^p(2));
> endfunction
- Give an initial guess of the parameters α and β:
octave:25> p = [0.5 0.0];
- We can now fit the model to data:
octave:26> [yfit pfit cvg iter] = leasqr(x, y, p, "ffun");
- We can check whether the fitting algorithm converged or not, and how many iterations it used:
octave:27> cvg, iter
cvg = 1
iter = 6
- The values of the fitted parameters are, of course, important:
octave:28> pfit
The fitted values are very close to the parameter values used to generate the data in Command 23. The fit is plotted together with the data in the figure below.
What just happened?
In Command 23, we instantiated the free variable x, and set it to be a vector with element values between 0 and 5 and consisting of one hundred elements. The variable y then plays the role of the dependent data set. We added a little noise with low amplitude to make it more realistic, hence the call to randn. In Command 24, we defined the model through an Octave function and we then set our initial guess in Command 25. This guess needs to be reasonable—how far from the true parameter value it can be depends on the model and the data. Sometimes the algorithm may not converge, and you must then use another (a better) initial guess. It should be clear what Commands 26 to 28 do.
Always have a valid reason to use a specific functional form as a model. A fit of a model with too many or too few free parameters will fail to converge to the data set even though the function has the correct form.
Note that if you follow the commands above, your fitted parameters may be a bit different and the number of iterations may be six plus or minus one depending on what random numbers your machine generated.
This article covered how to perform simple statistical analysis and function fitting in Octave.