File I/O
First, we will learn about file I/O with NumPy. Data is usually stored in files. You will not get far if you cannot read from and write to files.
Time for action – reading and writing files
As an example of file I/O, we will create an identity matrix and store its contents in a file.
Identity matrix creation
- Creating an identity matrix: The identity matrix is a square matrix with ones on the main diagonal and zeros everywhere else.
 
 The identity matrix can be created with the eye function. The only argument we need to give the eye function is the number of ones, which is also the number of rows and columns. So, for instance, for a 2-by-2 matrix, write the following code:
code1
 The output is:
code2
- Saving data: Save the data with the savetxt function. We need to specify the name of the file that we want to save the data in and the array containing the data itself:
code3 
A file called eye.txt should have been created. You can check for yourself whether the contents are as expected.
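The two steps above can be sketched as follows. This is a minimal sketch, not the book's listing: writing into a temporary directory and reading the file back with loadtxt are assumptions added here so that the example cleans up after itself and checks its own result.

```python
import os
import tempfile

import numpy as np

# Build a 2-by-2 identity matrix: ones on the diagonal, zeros elsewhere.
i2 = np.eye(2)
print(i2)

# Assumption: write into a temporary directory so that no file is left behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "eye.txt")

    # savetxt takes the file name first and the array second.
    np.savetxt(path, i2)

    # Read the file back to confirm that the round trip preserves the values.
    restored = np.loadtxt(path)

print(np.array_equal(i2, restored))
```

By default, savetxt writes one row per line, with values separated by spaces.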
What just happened?
Reading and writing files is a necessary skill for data analysis. We made an identity matrix with the eye function and wrote it to a file with savetxt.
CSV files
Files in the comma separated values (CSV) format are encountered quite frequently. Often, the CSV file is just a dump from a database file. Usually, each field in the CSV file corresponds to a database table column. As we all know, spreadsheet programs, such as Excel, can produce CSV files as well.
Time for action – loading from CSV files
How do we deal with CSV files? Luckily, the loadtxt function can conveniently read CSV files, split up the fields, and load the data into NumPy arrays. In the following example, we will load historical price data for Apple (the company, not the fruit). The data is in CSV format. The first column contains a symbol that identifies the stock. In our case, it is AAPL. Next is the date in dd-mm-yyyy format. The third column is empty. Then, in order, we have the open, high, low, and close price. Last, but not least, is the volume of the day. This is what a line looks like:
code4
- Loading the data: For now, we are only interested in the close price and volume. Store them in two arrays, c and v, with the loadtxt function:
code5 
As you can see, data is stored in the data.csv file. We have set the delimiter to , (comma), since we are dealing with a comma separated value file. The usecols parameter is set through a tuple to get the seventh and eighth fields (indices 6 and 7, since columns are counted from zero), which correspond to the close price and volume. The unpack parameter is set to True, which means that the columns will be unpacked and assigned to the c and v variables that will hold the close price and volume, respectively.
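Here is a small, self-contained sketch of such a call. The two rows are made-up values that only follow the column layout described above, and an in-memory io.StringIO buffer stands in for data.csv so that the example does not depend on a file being present.

```python
import io

import numpy as np

# Hypothetical rows in the layout described above:
# symbol, date, empty field, open, high, low, close, volume.
sample = io.StringIO(
    "AAPL,01-02-2011, ,10.0,12.0,9.0,11.0,1000\n"
    "AAPL,02-02-2011, ,11.0,13.0,10.0,12.0,3000\n"
)

# Columns are counted from zero, so the close price and volume
# (seventh and eighth fields) are columns 6 and 7.
# unpack=True returns one array per selected column.
c, v = np.loadtxt(sample, delimiter=",", usecols=(6, 7), unpack=True)

print(c)  # close prices
print(v)  # volumes
```

Because only the selected columns are converted to numbers, the text in the symbol and date columns does not get in the way.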
What just happened?
CSV files are a special type of file that we have to deal with frequently. We read a CSV file containing stock quotes with the loadtxt function. We indicated to the loadtxt function that the delimiter of our file was a comma. We specified which columns we were interested in, through the usecols argument, and set the unpack parameter to True so that the data was unpacked for further use.
Volume weighted average price
Volume weighted average price (VWAP) is a widely used quantity in finance. It is an average price in which each price is weighted by the volume traded at that price. The higher the volume, the more significant a price move typically is.
Time for action – calculating volume weighted average price
These are the actions that we will take:
- Read the data into arrays.
- Calculate VWAP:
code6 
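To make the weighting concrete, here is a minimal sketch with made-up numbers standing in for the c and v arrays. It compares the average function with the same quantity written out by hand; the hand-written formula is added here for illustration only.

```python
import numpy as np

# Made-up close prices and volumes, standing in for the c and v arrays.
c = np.array([10.0, 20.0, 30.0])
v = np.array([100.0, 100.0, 200.0])

# Weighted average of the prices, with the volumes as weights.
vwap = np.average(c, weights=v)

# The same quantity written out: sum(price * volume) / sum(volume).
vwap_manual = np.sum(c * v) / np.sum(v)

print(vwap, vwap_manual)
```

The last price carries twice the volume of the others, so the result is pulled above the plain mean of 20.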
What just happened?
That wasn't very hard, was it? We just called the average function and set its weights parameter to use the v array for weights. By the way, NumPy also has a function to calculate the arithmetic mean.
The mean function
The mean function is quite friendly and not so mean. This function calculates the arithmetic mean of an array. Let's see it in action:
code7
Time weighted average price
Now that we are at it, let's compute the time weighted average price (TWAP) too. It is just a variation on a theme really. The idea is that recent price quotes are more important, so we should give recent prices higher weights. The easiest way is to use the arange function to create an array of increasing integer values, starting at zero, with one value for each element of the close price array. This is not necessarily the correct way. In fact, most of the examples concerning stock price analysis in this book are only illustrative. The following is the TWAP code:
code8
It produces this output:
code9
The TWAP is even higher than the mean.
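A minimal sketch of the idea, using made-up rising prices rather than the book's data, shows why the result can end up above the mean: later prices get larger weights. Note that with arange starting at zero, the oldest price receives a weight of zero and drops out entirely.

```python
import numpy as np

# Made-up close prices that rise over time (oldest first).
c = np.array([10.0, 20.0, 30.0])

# Weights 0, 1, 2: the more recent the price, the higher the weight.
t = np.arange(len(c))

twap = np.average(c, weights=t)

print(twap, np.mean(c))
```

For falling prices, the same weighting would push the result below the mean instead.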
Pop quiz – computing the weighted average
- Which function returns the weighted average of an array?
- Reading from a file: First, we will need to read our file again and store the values for the high and low prices into arrays:
code10
The only thing that changed is the usecols parameter, since the high and low prices are situated in different columns.
- Getting the range: The following code gets the price range:
code11
These are the values returned:
code12
Now, it's trivial to get a midpoint, so it is left as an exercise for the reader.
- Calculating the spread: NumPy allows us to compute the spread of an array with a function called ptp. The ptp function returns the difference between the maximum and minimum values of an array. In other words, it is equal to max(array) - min(array). Call the ptp function:
code13
You will see this:
code14
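The steps above can be sketched as follows. The high and low arrays hold made-up values, and their names are chosen here for readability; they stand in for the arrays loaded from the file.

```python
import numpy as np

# Made-up daily high and low prices, standing in for the loaded columns.
high = np.array([12.0, 13.0, 11.5])
low = np.array([9.0, 10.0, 8.5])

# Highest of the highs and lowest of the lows.
highest = np.max(high)
lowest = np.min(low)
print(highest, lowest)

# ptp ("peak to peak") is the maximum minus the minimum of an array.
spread_high = np.ptp(high)
spread_low = np.ptp(low)
print(spread_high, spread_low)

# The same spread written out by hand.
print(np.max(high) - np.min(high))
```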
    
- Determine the median of the close price: Create a new Python script and call it simplestats.py. You already know how to load the data from a CSV file into an array. So, copy that line of code and make sure that it only gets the close price. The code should now look like this:
code15
The function that will do the magic for us is called median. We will call it and print the result immediately. Add the following line of code:
code16
The program prints the following output:
code17
Since it is our first time using the median function, we would like to check whether this is correct. Not because we are paranoid or anything! Obviously, we could do it by just going through the file and finding the correct value, but that is no fun. Instead, we will mimic the median algorithm by sorting the close price array and printing the middle value of the sorted array. The msort function does the first part for us. We will call the function, store the sorted array, and then print it:
code18
This prints the following output:
code19
Yup, it works! Let's now get the middle value of the sorted array:
code20
It gives us the following output:
code21
Hey, that's a different value than the one the median function gave us. How come? Upon further investigation, we find that the value returned by the median function doesn't even appear in our file. That's even stranger! Before filing bugs with the NumPy team, let's have a look at the documentation. This mystery is easy to solve. It turns out that our naive algorithm only works for arrays with odd lengths. For even-length arrays, the median is the average of the two array values in the middle. Therefore, type the following code:
code22
This prints the following output:
code23
Success! Another statistical measure that we are concerned with is variance. Variance tells us how much a variable varies. In our case, it also tells us how risky an investment is, since a stock price that varies too wildly is bound to get us into trouble.
- Calculate the variance of the close price: With NumPy, this is just a one liner. See the following code:
code24
This gives us the following output:
code25
Not that we don't trust NumPy or anything, but let's double-check using the definition of variance, as found in the documentation. Mind you, this definition might differ from the one in your statistics book, but that is quite common in the field of statistics. The variance is defined as the mean of the squared deviations from the mean, that is, the sum of the squared deviations divided by the number of elements in the array. Some books tell us to divide by the number of elements in the array minus one.
code26
The output is as follows:
code27
Just as we expected!
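Both checks can be sketched on a tiny, made-up array with an even number of elements. This is an illustration under stated assumptions rather than the book's script: np.sort stands in for msort, which is no longer available in recent NumPy releases, and the ddof=1 call is added only to show the "minus one" variant that some books use.

```python
import numpy as np

# Made-up close prices; the even length is what trips up the naive check.
c = np.array([4.0, 1.0, 3.0, 2.0])

median = np.median(c)

# Naive check: sort and take a single middle element.
sorted_c = np.sort(c)
n = len(sorted_c)
naive_middle = sorted_c[(n - 1) // 2]

# Correct check for even lengths: average the two middle elements.
middle_pair_mean = (sorted_c[n // 2] + sorted_c[(n - 1) // 2]) / 2

print(median, naive_middle, middle_pair_mean)

# Variance as NumPy defines it by default: mean of squared deviations.
variance = np.var(c)
variance_manual = np.mean((c - c.mean()) ** 2)

# Variant that divides by the number of elements minus one.
variance_sample = np.var(c, ddof=1)

print(variance, variance_manual, variance_sample)
```

Here the median is a value that does not occur in the array, just like in the walkthrough above.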
 
- weighted average
- waverage
- average
- avg
 
Have a go hero – calculating other averages
Try doing the same calculation using the open price. Calculate the mean for the volume and the other prices.
Value range
Usually, we don't only want to know the average or arithmetic mean of a set of values, which sits somewhere in the middle; we also want the extremes, that is, the full range from the lowest to the highest value. The sample data that we are using here already has those values per day: the high and low price. However, we need to know the highest value of the high price and the lowest value of the low price. After all, how else would we know how much our Apple stocks could gain or lose?
Time for action – finding highest and lowest values
The min and max functions are the answer to our requirement.
What just happened?
We defined a range of highest to lowest values for the price. The highest value was given by applying the max function to the high price array. Similarly, the lowest value was found by applying the min function to the low price array. We also calculated the peak-to-peak distance with the ptp function.
Statistics
Stock traders are interested in the most probable close price. Common sense says that this should be close to some kind of an average. The arithmetic mean and weighted average are ways to find the center of a distribution of values. However, neither is robust; both are sensitive to outliers. For instance, if we had a close price value of a million dollars, this would heavily influence the outcome of our calculations.
Time for action – doing simple statistics
One thing that we can do is use some kind of threshold to weed out outliers, but there is a better way. It is called the median, and it basically picks the middle value of a sorted set of values. For example, if we have the values 1, 2, 3, 4, and 5, the median is 3, since it is in the middle. These are the steps to calculate the median:
What just happened?
Maybe you noticed something new. We suddenly called the mean function on the c array.
Yes, this is legal, because the ndarray object has a mean method. This is for your convenience. For now, just keep in mind that this is possible.
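As a quick illustration with made-up values, the function form and the method form give the same result:

```python
import numpy as np

# Made-up values standing in for the close price array.
c = np.array([1.0, 2.0, 3.0, 6.0])

# Function form: pass the array to np.mean.
mean_from_function = np.mean(c)

# Method form: call mean on the ndarray object itself.
mean_from_method = c.mean()

print(mean_from_function, mean_from_method)
```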