
# Loading data files

When performing a statistical analysis of a particular problem, you often have some data stored in a file. You can save your variables (or the entire workspace) using different file formats and then load them back in again. Octave can, of course, also load data from files generated by other programs. There are certain restrictions when you do this, which we will discuss here. In what follows, we will only consider ASCII files, that is, readable text files.

When you load data from an ASCII file using the *load* command, the data is treated as a two-dimensional array. We can then think of the data as a matrix, where the lines in the file correspond to the matrix rows and the numbers on each line to the matrix columns. For this matrix to be well defined, the data must be organized such that all the rows have the same number of columns (and therefore the columns the same number of rows). For example, the content of a file called *series.dat* can be:
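The original file listing is not reproduced here; as an illustration (the numbers are placeholders, chosen to match the 4-by-3 size reported below), such a file could look like:

```
1 232 334
2 245 334
3 456 342
4 555 321
```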

Next, we need to load this into Octave's workspace:

octave:1> load -ascii series.dat;

whereby the data is stored in the variable named *series*. In fact, Octave is capable of loading the data even if you do not specify the ASCII format. The number of rows and columns are then:

octave:2> size(series)

ans =

4 3

I prefer the file extension *.dat*, but again this is optional and can be anything you wish, say *.txt, .ascii, .data*, or nothing at all.

In the data files you can have:

- Octave comments
- Data blocks separated by blank lines (or, equivalently, empty rows)
- Tabs or single or multiple spaces separating the numbers

Thus, the following data file will successfully load into Octave:

# First block

1 232 334

2 245 334

3 456 342

4 555 321

# Second block

1 231 334

2 244 334

3 450 341

4 557 327

The resulting variable is a matrix with 8 rows and 3 columns. If you know the number of blocks or the block sizes, you can then separate the blocks afterwards.
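With two equal-sized blocks as in the file above, separating them by row index could be sketched as follows (the filename and variable name are assumptions; *load* names the variable after the file):

```octave
load -ascii twoblocks.dat        % hypothetical filename; creates variable twoblocks
blockA = twoblocks(1:4, :);      % rows 1-4: first block
blockB = twoblocks(5:8, :);      % rows 5-8: second block
```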

Now, the following data stored in the file *bad.dat* will not load into Octave's workspace:

1 232.1 334

2 245.2

3 456.23

4 555.6

because line 1 has three columns whereas lines 2-4 have two columns. If you try to load this file, Octave will complain:

octave:3> load -ascii bad.dat

error: load: bad.dat: inconsistent number of columns near line 2

error: load: unable to extract matrix size from file 'bad.dat'

# Simple descriptive statistics

Consider an Octave function *mcintgr* and its vectorized version *mcintgrv*. This function can evaluate the integral of a mathematical function *f* over some interval *[a; b]* where the function is positive. The Octave function is based on the Monte Carlo method, and the return value, that is, the integral, is therefore a stochastic variable. When we calculate a given integral, we should as a minimum present the result as a mean or another appropriate measure of a central value, together with an associated statistical uncertainty. This is true for any other stochastic variable, whether it is the height of the pupils in a class, the length of a plant's leaves, and so on.

In this section, we will use Octave for the most simple statistical description of stochastic variables.

## Histogram and moments

Let us calculate the integral given in Equation (5.9) one thousand times using the vectorized version of the Monte Carlo integrator:

octave:4> for i=1:1000

> s(i) = mcintgrv("sin", 0, pi, 1000);

> endfor

The array s now contains a sequence of numbers which we know are approximately 2. Before we make any quantitative statistical description, it is always a good idea to first plot a histogram of the data as this gives an approximation to the true underlying probability distribution of the variable s. The easiest way to do this is by using Octave's *hist* function which can be called using:

octave:5> hist(s, 30, 1)

The first argument, s, to *hist* is the stochastic variable, the second is the number of bins that s should be grouped into (here we have used 30), and the third argument gives the sum of the heights of the histogram (here we set it to 1). The histogram is shown in the figure below. If *hist* is called via the command *hist(s)*, s is grouped into ten bins and the bar heights are simply the counts, so they sum to the number of data points, *length(s)*.

From the figure, we see that *mcintgrv* produces a sequence of random numbers that appear to be normally (or Gaussian) distributed with a mean of 2. This is what we expected. It then makes good sense to describe the variable via the sample mean, defined as:

$$\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_{i}$$

where *N* is the number of samples (here 1000) and $s_{i}$ the *i*'th data point, as well as the sample variance, given by:

$$\sigma^{2} = \frac{1}{N-1}\sum_{i=1}^{N}\left(s_{i} - \bar{s}\right)^{2}$$

The variance is a measure of the width of the distribution and therefore an estimate of the statistical uncertainty of the mean value. Sometimes one uses the standard deviation instead of the variance. The standard deviation is simply the square root of the variance, $\sigma = \sqrt{\sigma^{2}}$.

To calculate the sample mean, sample variance, and the standard deviation in Octave, you use:

octave:6> mean(s)

ans = 1.9999

octave:7> var(s)

ans = 0.002028

octave:8> std(s)

ans = 0.044976
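As a sanity check, the three quantities can also be computed directly from their definitions; this sketch assumes the array s from the commands above:

```octave
N  = length(s);
mu = sum(s)/N;                    % sample mean
v  = sum((s - mu).^2)/(N - 1);    % sample variance (note the N-1)
sd = sqrt(v);                     % standard deviation, equal to std(s)
```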

In the statistical description of the data, we can also include the skewness, which measures the degree of asymmetry of the underlying distribution around the mean. If it is positive, it is an indication that the distribution has a long tail stretching towards positive values with respect to the mean. If it is negative, it has a long negative tail. The skewness is often defined as:

$$\text{skewness} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{s_{i} - \bar{s}}{\sigma}\right)^{3}$$

We can calculate this in Octave via:

octave:9> skewness(s)

ans = -0.15495

This result is a bit surprising, because we would assume from the histogram that the data set represents numbers picked from a normal distribution, which is symmetric around the mean and therefore has zero skewness. It illustrates an important point—be careful when using the skewness as a direct measure of the distribution's symmetry—you need a very large data set to get a good estimate.

You can also calculate the kurtosis, which measures the flatness of the sample distribution compared to a normal distribution. Negative kurtosis indicates a relatively flatter distribution around the mean, and positive kurtosis indicates that the sample distribution has a sharp peak around the mean. The kurtosis is defined by the following:

$$\text{kurtosis} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{s_{i} - \bar{s}}{\sigma}\right)^{4} - 3$$

It can be calculated by the *kurtosis* function.

octave:10> kurtosis(s)

ans = -0.02310

The kurtosis has the same problem as the skewness—you need a very large sample size to obtain a good estimate.
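The skewness and excess kurtosis can likewise be checked against their definitions. A sketch, assuming the array s from above and that Octave normalizes with the population standard deviation std(s, 1) (worth verifying with *help skewness* on your installation):

```octave
mu = mean(s);
sd = std(s, 1);                  % population standard deviation
sk = mean(((s - mu)/sd).^3);     % skewness
ku = mean(((s - mu)/sd).^4) - 3; % excess kurtosis
```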

## Sample moments

As you may know, the sample mean, variance, skewness, and kurtosis are examples of sample moments. The mean is related to the first moment, the variance to the second moment, and so forth. Now, the moments are not uniquely defined. One can, for example, define the *k*'th absolute sample moment $p^{k}_{a}$ and the *k*'th central sample moment $p^{k}_{c}$ as:

$$p^{k}_{a} = \frac{1}{N}\sum_{i=1}^{N} s_{i}^{k} \qquad p^{k}_{c} = \frac{1}{N}\sum_{i=1}^{N}\left(s_{i} - \bar{s}\right)^{k}$$

Notice that the first absolute moment is simply the sample mean, but the first central sample moment is zero. In Octave, you can easily retrieve the sample moments using the *moment* function, for example, to calculate the second central sample moment you use:

octave:11> moment(s, 2, 'c')

ans = 0.002022

Here the first input argument is the sample data, the second defines the order of the moment, and the third argument specifies whether we want the central moment '*c*' or absolute moment '*a*' which is the default. Compare the output with the output from Command 7—why is it not the same?
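A hint for the comparison: *var* normalizes with N-1 while the central moment uses N, so the relation can be sketched as follows (assuming the array s from above):

```octave
N = length(s);
moment(s, 2, 'c')*N/(N - 1)   % rescales the central moment to var(s)
```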


## Comparing data sets

Above, it was shown how you can use Octave to perform the very simplest statistical description of a single data set. In this section, we shall see how to statistically compare two data sets. What exactly do we mean by a statistical comparison? For example, we could test whether two data sets statistically have the same means (this is known as Student's t-test), or whether they have the same underlying probability distribution (the χ²-test).

In Octave, you can perform almost all known statistical tests: Student's t-test, the z-test, the Kolmogorov-Smirnov test, and many more. Here I will only show you how to perform one variant of the t-test and how to compute the Pearson correlation coefficient.

### The correlation coefficient

The following table shows the height and weight of boys from age 2 to 15:

One can easily see that both height and weight increase with age. To see whether the two data sets are actually correlated, we need to be a bit more careful. Usually the correlation between two data sets is quantified using Pearson's correlation coefficient, which is given by:

$$r_{p} = \frac{\sum_{i=1}^{N}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)}{(N-1)\,\sigma_{x}\sigma_{y}}$$

where $\sigma_{x}$ is the standard deviation of the data set $x_{i}$ and $\sigma_{y}$ is the standard deviation of $y_{i}$. Values of $r_{p}$ close to one indicate good correlation between the two data sets. The Pearson correlation coefficient is easily calculated in Octave using *cor* (or *corrcoef*). This function has the syntax:

r = cor(x, y)

No need to explain, I guess.

Assume that we have stored the data in an ASCII file called *boys.dat* like this:

# Age weight Height

2 12.5 85.5

3 13.2 93.2

...

octave:12> load -ascii boys.dat;

We then need to find the correlation between the second and the third column:

octave:13> cor(boys(:,2), boys(:,3))

ans = 0.97066

That is, the two data sets are indeed correlated, which we would expect.
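Pearson's coefficient can also be evaluated directly from its definition; this sketch assumes the *boys* matrix loaded above:

```octave
x = boys(:,2);  y = boys(:,3);
rp = sum((x - mean(x)).*(y - mean(y))) / ((length(x) - 1)*std(x)*std(y));
```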

### The student t-test

The following sequence of numbers shows the height of the pupils in a class of 21 children (in centimetres):

156.92 140.00 163.20 167.24 149.84 149.21 166.86 152.01

147.53 157.56 154.48 170.33 155.82 162.24 161.43 174.94

146.30 151.08 150.82 154.49 165.98

The mean is 157.07 centimetres. The national average height is 161.11 centimetres. Under the assumption that the height of the pupils is normally distributed around the mean, can we show that the mean is statistically the same as the national average? Octave's *t_test* can help. A simple version of the syntax is:

pvalue = t_test(x, m)

Here *pvalue* is the probability that the null hypothesis (that the two means are the same) is true, *x* is the data, and *m* is the mean that we test against.

Suppose we have the heights stored in an array called *heights*. To perform the test, we use:

octave:14> t_test(heights, 161.11)

ans = 0.0508369

which means that we cannot definitely conclude that the mean height in the class is the same as the national average height (assuming no variance in the latter). Usually one accepts the null hypothesis for *pvalue > 0.05*, so here we have a borderline situation.
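For intuition, the t statistic underlying this test can be sketched as follows (assuming the *heights* array from above; the p-value then follows from the t distribution with N-1 degrees of freedom):

```octave
N = length(heights);
t = (mean(heights) - 161.11) / (std(heights)/sqrt(N));  % t statistic
```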

The table below lists some of the most common statistical test functions:

# Function fitting

In many areas of science, you want to fit a function to data. This function can represent either an empirical or a theoretical model. There are many reasons to do this, for example, if the theoretical model agrees with the observed data values, the theory is likely to be right and hopefully you have gained new insight into the phenomenon you are studying.

In this section, we will discuss some of Octave's fitting functionality. I will not go into detail about the algorithms behind the fitting functions—this would simply take up too much space and not be of much relevance for the points made here.

## Polynomial fitting

Suppose we want to investigate the length of the leaves in two different families of trees at different heights. Normally the leaves are largest near the ground, in order to increase the photosynthesis. The figure below shows fabricated data of the leaf length as a function of height from the ground for two imaginary families of trees, called tree A (red squares) and tree B (blue triangles). For some reason, we have the idea that the leaf length for tree A, which we denote by *y^{A}*, is a linear function of the height *x*, but that for tree B the leaf length *y^{B}* follows a third order polynomial in the height. That is, we should test whether the models:

$$y^{A}(x) = c_{1}x + c_{0} \qquad y^{B}(x) = c_{3}x^{3} + c_{2}x^{2} + c_{1}x + c_{0}$$

can fit the data well if we use the polynomial coefficients as fitting parameters.

In Octave this is straightforward using *polyfit*. This function can be called via the following syntax:

[cfit s] = polyfit(x, y, n)

where *x* is the independent/free variable (in this case the height), *y* is the measured data (leaf length), and *n* is the degree of the polynomial. The first output is an array with the polynomial coefficients and therefore has length *n+1*, and the second is a structure that contains information about the fit. We shall study some of the important structure fields below.

# Time for action – using polyfit

- Assume that we have loaded the relevant data into Octave and stored the leaf lengths of tree A and B in the variables *yA* and *yB*. The corresponding heights are stored in *xA* and *xB*.
- To fit the linear model to the data in *yA*, we do the following:

octave:15> [cA sA] = polyfit(xA, yA, 1);

- We must check whether the fit makes any sense at all. First, we can plot the resulting linear fit together with the data:

octave:16> plot(xA, yA, 'rs', xA, sA.yf, 'r');

The result is shown in the figure below with red squares and a red line.

- The fit of *y^{B}* to the third order polynomial follows the same procedure:

octave:17> [cB sB] = polyfit(xB, yB, 3);

octave:18> hold on; plot(xB, yB, 'bv', xB, sB.yf, 'b');

The plot is shown in the figure below:

*Notice that I have used the plotting style format 'rs' and 'bv'. Depending on your plotting backend, this may not be supported. You can change it to, for example, 'ro' and 'b+'*

*What just happened?*

*polyfit* finds the polynomial coefficients $c_{n}, c_{n-1}, \ldots, c_{1}, c_{0}$ such that the difference between the polynomial (our statistical model) $y(x_{i}, c_{n}, c_{n-1}, \ldots, c_{1}, c_{0})$ and the data points $y_{i}$ is minimized by some measure. This measure is the sum of the squared residuals, $r_{i} = y_{i} - y(x_{i})$, that is:

$$\sum_{i=1}^{N} r_{i}^{2} = \sum_{i=1}^{N}\left(y_{i} - y(x_{i}, c_{n}, \ldots, c_{0})\right)^{2}$$

where N is the number of fitting points. This fitting procedure is called a least squares fit.

As mentioned above, *polyfit* returns a structure that stores information about the fit. In Command 16, we used one of the structure fields, *yf*, which contains the fitted values $y(x_{i}, c_{n}, c_{n-1}, \ldots, c_{1}, c_{0})$, to plot the resulting fit. We could alternatively have used the polynomial coefficients returned in *cA*:

octave:19> cA

cA =

-0.75172 10.52164

together with *polyval*:

octave:20> plot(xA, yA, 'rs', xA, polyval(cA, xA), 'r');

## Goodness of the fit

From the figure above, it looks as if the fits represent the data quite well, that is, the polynomials seem to be good models of the leaf length variation. A visual verification of the fits and models in general is not really enough. Scientists often prefer some objective quantity to indicate whether the fit is satisfactory or not.

*polyfit* stores a quantity in the structure field *normr*, namely the 2-norm of the residuals. This is given by:

$$\text{normr} = \sqrt{\sum_{i=1}^{N} r_{i}^{2}}$$

This is, however, not of much help here, because this quantity depends on the absolute values of the residuals. One can instead use the correlation coefficient:

$$r_{cor} = 1 - \frac{\text{normr}^{2}}{(N-1)\,\sigma_{y}^{2}}$$

You can see that for small residuals (and possibly a good fit), the correlation coefficient is close to 1; if the fit is poor, it is close to 0. Unfortunately, *polyfit* won't calculate this quantity for you, but you can easily do it yourself.


# Time for action – calculating the correlation coefficient

Let us try to calculate the correlation coefficient for the fit of the leaf length for tree A. We just need to follow Equation (7.9):

octave:21> denom = (length(yA) - 1)*var(yA);

octave:22> rcor = 1 - sA.normr^2/denom

rcor = 0.96801

This gives an indication that the fit is good as we expected.

*What just happened?*

In Command 21, we calculated the denominator in Equation (7.9). Notice that instead of calculating the square of the standard deviation, we simply use the variance found with *var*. From Equation (7.9), we see that the 2-norm of the residuals enters the numerator. This is already calculated in *polyfit* and stored in the structure field *normr*, so we use this in the evaluation of the correlation coefficient.

## Residual plot

If there is a systematic deviation between the fit and the data, the model may have to be rejected. These deviations can be small enough that they are not captured by the correlation coefficient. They can, however, be seen in residual plots. In Octave this is straightforward to do, and I leave it to you as an exercise.
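A minimal residual-plot sketch for the tree A fit, assuming the variables xA, yA, and sA from the commands above:

```octave
resA = yA - sA.yf;              % residuals of the linear fit
plot(xA, resA, 'rs');
xlabel('height'); ylabel('residual');
% systematic trends in this plot hint at a wrong model
```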

## Non-polynomial fits

Of course, not everything can be described via simple polynomial models. Phenomena related to growth are for example often described via an exponential function or a power law. These two examples are trivial to solve with the polynomial fitting technique discussed above, but you may have particular fitting models that need a more general fitting procedure. Octave can help you with this as well.

## Transforms

Before we discuss how to fit more general functions to data, it is worthwhile to see if we can transform the fitting function into a different form that allows us to use the polynomial fitting procedure above.

A very simple example of such a transform is when the data set follows a power law function, say:

$$y = b\,x^{a}$$

We can transform this into a linear equation by taking the logarithm on both sides:

$$\ln(y) = a\ln(x) + \ln(b)$$

In this way, the new variable *y' = ln(y)* is a linear function of *x' = ln(x)*, that is, we can write *y' = ax' + b'* with *b' = ln(b)*. We then fit Equation (7.11) to the transformed data using the *polyfit* function. Remember to transform the fitted parameter *b'* back using the inverse transform, *b = e^{b'}*.
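The whole procedure can be sketched on self-generated data (all values here are illustrative assumptions):

```octave
x = linspace(1, 10, 50);
y = 2.5*x.^1.3;                    % a power law with b = 2.5, a = 1.3
c = polyfit(log(x), log(y), 1);    % fit y' = a*x' + b'
a = c(1);                          % exponent a
b = exp(c(2));                     % prefactor b, back-transformed
```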

In the example above, the transform is trivial. You could consider other possible ways of transforming your model. For example, if it is possible to Fourier transform the data and the model it could perhaps be a better choice to perform the fit in Fourier space.

## General least squares fitting

In case you cannot transform the model into a proper form that allows you to use *polyfit*, you can use *leasqr*. This function comes with the optimization package **optim** which can be downloaded from the Octave-Forge web page. *leasqr* is a really powerful function that allows you to perform even the most complicated fits—if I must choose my favorite Octave function, I think this would be it.

The syntax is as follows:

[yfit pfit cvg iter ...] = leasqr(x, y, p, fun, opt)

Describing all the inputs and outputs in sufficient detail would take a great deal of effort, so we shall only discuss a limited number of features, but these will take us quite far. If you are interested in knowing more about the function simply type: *help leasqr*.

The inputs are:

- *x*: the independent variable
- *y*: the measured/observed data
- *p*: an initial parameter guess
- *fun*: the model (fitting function)
- *opt*: up to six optional arguments specifying the weights of each data point, the maximum number of iterations the algorithm can perform, and so on

*leasqr* can return 10 outputs; the first four are:

- *yfit*: the fitted function values
- *pfit*: the fitted parameter values
- *cvg*: 1 if the fit converged (likely to be successful), 0 if not
- *iter*: the number of iterations performed

The crucial point when using *leasqr* is that you must provide the model in the form of an Octave function that can be fitted to the data set. This function must be dependent on at least one parameter and follow the syntax:

y = fun(x,p)

Again, *x* is the free variable, and *p* is the parameter list.

Let us illustrate *leasqr* by fitting the following two-parameter model:

$$y(x) = \frac{1}{1 + \alpha x^{\beta}}$$

to some data that we simply generate ourselves. Note that α and β are the fitting parameters.

# Time for action – using leasqr

- Let us generate data with normally distributed random noise:

octave:23> x=linspace(0, 5); y = 1./(1 + 1.2*x.^1.8) + \
> randn(1,100)*0.03;

- Then we specify the model using the following function definition:

octave:24> function y = ffun(x, p)
> y = 1./(1+p(1)*x.^p(2));
> endfunction

- Give an initial guess of the parameters α and β:

octave:25> p = [0.5 0.0];

- We can now fit the model to the data:

octave:26> [yfit pfit cvg iter] = leasqr(x, y, p, "ffun");

Easy!

- We can check whether the fitting algorithm converged, and how many iterations it used:

octave:27> cvg, iter

cvg = 1

iter = 6

- The values of the fitted parameters are of course important:

octave:28> pfit

pfit =

1.1962

1.7955

This is very close to the values that we would expect from Command 23, where the data were generated. The fit is plotted together with the data in the figure below.

*What just happened?*

In Command 23, we instantiated the free variable x as a vector of one hundred elements with values between 0 and 5. The variable y then plays the role of the dependent data set. We added a little low-amplitude noise to make it more realistic, hence the call to *randn*. In Command 24, we defined the model through an Octave function, and we then set our initial guess in Command 25. This guess needs to be reasonable—how far it can be from the true parameter values depends on the model and the data. Sometimes the algorithm may not converge, and you must then use another (better) initial guess. It should be clear what Commands 26 to 28 do.

*Always have a valid reason to use a specific functional form as a model. A fit of a model with too many or too few free parameters may fail to converge to the data set even though the function has the correct form.*

Note that if you follow the commands above, your fitted parameters may be a bit different and the number of iterations may be six plus or minus one depending on what random numbers your machine generated.

# Summary

This article covered how to perform the simplest statistical analysis and function fitting.
