
You're reading from Learning NumPy Array

Product type: Book
Published in: Jun 2014
Reading level: Intermediate
ISBN-13: 9781783983902
Edition: 1st Edition
Author (1)
Ivan Idris

Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. Ivan Idris enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy 1.5 Beginner's Guide and NumPy Cookbook, both published by Packt Publishing.

Chapter 3. Basic Data Analysis with NumPy

In this chapter, we will learn about basic data analysis through an example of historical weather data. Along the way, we will meet functions that make working with NumPy easier.

We will cover the following topics:

  • Functions working on arrays

  • Loading arrays from files containing weather data

  • Simple mathematical and statistical functions

Introducing the dataset


First, we will learn about file I/O with NumPy. Data is usually stored in files, so you will not get far if you cannot read from and write to files.

The Royal Netherlands Meteorological Institute (KNMI) offers daily weather data online (browse to http://www.knmi.nl/climatology/daily_data/download.html). KNMI is the Dutch meteorological service headquartered in De Bilt. Let's download one of the KNMI files from the De Bilt weather station. The file is roughly 10 megabytes. It starts with some explanatory text about the data in Dutch and English. Below that is the data in comma-separated values format. I separated the metadata and the actual data into separate files. The separation is not necessary, because you can skip rows when loading with NumPy. I wrote a simple NumPy script to determine the maximum and minimum temperature for the dataset from a CSV file that was created in the separation process.
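
The loading step described above can be sketched as follows. This is only a minimal illustration, not the script from the book: the in-memory sample, its two explanatory lines, and the column positions are assumptions made up for this example, and the real KNMI file has a different layout. It shows how skiprows lets np.loadtxt jump over the explanatory text and how blank fields can be turned into NaN before taking the minimum and maximum.

```python
import io

import numpy as np

# Hypothetical stand-in for the downloaded file: two explanatory lines,
# then comma-separated rows (station, YYYYMMDD, min temp, max temp).
# Temperatures are in tenths of a degree Celsius; a blank field is missing.
sample = io.StringIO(
    "Explanation line 1\n"
    "Explanation line 2\n"
    "260,19010101,  -65,  -12\n"
    "260,19010102,     ,   31\n"
    "260,19010103,   12,  108\n"
)

# Turn blank fields into NaN so that they can be ignored later.
to_float = lambda x: float(x.strip() or np.nan)

min_temp, max_temp = np.loadtxt(
    sample,
    delimiter=",",
    skiprows=2,          # skip the explanatory text instead of splitting the file
    usecols=(2, 3),
    unpack=True,
    converters={2: to_float, 3: to_float},
    encoding="utf-8",    # converters then receive str rather than bytes
)

# Convert tenths of a degree to degrees and ignore the missing values.
lowest = np.nanmin(min_temp) / 10
highest = np.nanmax(max_temp) / 10
print("Lowest", lowest, "Highest", highest)
```

With a real file, the file name would replace the StringIO object and skiprows would have to match the actual number of header lines.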

The temperatures are given in tenths of a degree Celsius. There...

Determining the daily temperature range


The daily temperature range, or diurnal temperature variation as it is called in meteorology, is usually modest on Earth. In desert areas, and on other planets in general, the variation is greater. We will have a look at the daily temperature range for the data we downloaded in the previous example:

  1. To analyze temperature ranges, we will need to import the NumPy package and the NumPy masked arrays:

    import numpy as np
    import sys
    import numpy.ma as ma
    from datetime import datetime as dt
  2. We will load a bit more data than we did in the previous section: dates of measurements in the YYYYMMDD format and the average daily temperature. Dates require special conversion. First, date strings are converted to dates, and then to numbers, as follows:

    to_float = lambda x: float(x.strip() or np.nan)
    to_date = lambda x: dt.strptime(x, "%Y%m%d").toordinal()
     
    dates, avg_temp, min_temp, max_temp = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1, 11, 12...
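
The snippet above is cut off, so here is a self-contained sketch of the same idea under stated assumptions. The sample rows and the column positions (1 to 4) are hypothetical and do not reflect the KNMI layout. The sketch converts the date strings to ordinal numbers, masks the missing temperatures, and subtracts the minimum from the maximum to obtain the daily range.

```python
import io
from datetime import datetime as dt

import numpy as np
import numpy.ma as ma

# Hypothetical sample: station, date, average, minimum and maximum temperature
# (tenths of a degree Celsius). Column positions are made up for this sketch.
sample = io.StringIO(
    "260,19010101,  -40,  -65,  -12\n"
    "260,19010102,   10,     ,   31\n"
    "260,19010103,   60,   12,  108\n"
)

to_float = lambda x: float(x.strip() or np.nan)
to_date = lambda x: dt.strptime(x.strip(), "%Y%m%d").toordinal()

dates, avg_temp, min_temp, max_temp = np.loadtxt(
    sample, delimiter=",", usecols=(1, 2, 3, 4), unpack=True,
    converters={1: to_date, 2: to_float, 3: to_float, 4: to_float},
    encoding="utf-8",    # converters then receive str rather than bytes
)

# Mask the missing values, then subtract. A day is masked in the result
# if either its minimum or its maximum is missing.
min_temp = ma.masked_invalid(min_temp)
max_temp = ma.masked_invalid(max_temp)
ranges = 0.1 * (max_temp - min_temp)   # degrees Celsius

print("First day", dt.fromordinal(int(dates[0])).date())
print("Valid ranges", ranges.count())
print("Mean range", ranges.mean())
```

Storing dates as ordinals keeps everything in one float array, and dt.fromordinal turns a number back into a date when needed.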

Looking for evidence of global warming


According to the global warming theory, the average temperature on Earth has increased since the end of the 19th century. Over the last century, the temperature has supposedly risen by about 0.8 degrees. Apparently, most of this warming has happened in the last two or three decades. In the future, we can expect the temperature to rise even more, leading to droughts, heat waves, and other unpleasant phenomena. Obviously, some regions will be hit harder than others. Several solutions have been proposed, including reducing greenhouse gas emissions and geo-engineering by spreading special gases in the atmosphere in order to reflect more sunlight.

The data we downloaded from the Dutch Meteorological Institute, KNMI, is not sufficient to prove whether global warming is real or not, but we can certainly examine it further. For instance, we can check whether the temperature in De Bilt (that's where the data was collected) in the first half of...

Comparing solar radiation versus temperature


The Sun is, of course, a very important factor when it comes to temperature. Unfortunately, the De Bilt dataset from the KNMI is missing a lot of data concerning the Sun's radiation. The data is given in joules per square centimeter. There are also other variables in the file that are derived from solar radiation, such as the sunshine duration in hours.

We are going to analyze the radiation data a bit, draw a histogram, and compare it with the daily average temperatures. To compare, we will calculate the correlation coefficient between radiation and temperature and plot yearly relative changes in average temperature and radiation. Originally, a scatter plot seemed a good idea, but that didn't look right with thousands of data points, so instead the data was compressed, as it were. Later, the author realized that radiation data was available only from around 1960 onwards, so it might have been better to plot the correlation coefficient...
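
As a rough sketch of the correlation step, the following example restricts both series to the days that actually have a radiation measurement before calling np.corrcoef. The numbers are invented for illustration and are not taken from the KNMI file.

```python
import numpy as np
import numpy.ma as ma

# Made-up values for illustration; NaN marks days without a radiation measurement.
avg_temp = np.array([20.0, 35.0, 50.0, 80.0, 120.0, 150.0])          # 0.1 degrees Celsius
radiation = np.array([np.nan, 200.0, np.nan, 600.0, 900.0, 1500.0])  # J/cm2

rad = ma.masked_invalid(radiation)

# Keep only the days on which radiation was measured, for both series.
valid = ~ma.getmaskarray(rad)
corr = np.corrcoef(avg_temp[valid], rad.compressed())[0, 1]

print("Days with radiation data", valid.sum())
print("Correlation", corr)
```

Using ma.getmaskarray rather than the mask attribute guarantees a full boolean array even when nothing is masked, so the indexing works in both cases.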

Analyzing wind direction


Wind is the movement of air caused by differences in atmospheric pressure. The KNMI De Bilt data file has a column for the vector mean wind direction in degrees (360 = north, 90 = east, 180 = south, 270 = west, 0 = calm/variable). We will plot a histogram of that data and compute the corresponding average temperature for each wind direction. It seems reasonable to expect that the direction from which the wind originates influences temperature. In other words, some locations tend to be warmer or colder, so air coming from there will be warmer or colder, respectively. The Netherlands, as you may know, doesn't have any mountains, so we don't have to take that into account. We do have to keep the proximity of the North Sea in mind. The Netherlands has a moderate maritime climate with southwestern winds. We can study the wind direction information with the following procedure:

  1. We will load the wind direction and average temperatures into NumPy arrays. Wind direction...
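
One possible way to compute an average temperature per wind direction is sketched below. The grouping into 90-degree sectors and the sample values are assumptions made for this example. A real analysis would probably use finer bins and treat the 0 = calm/variable code separately.

```python
import numpy as np
import numpy.ma as ma

# Made-up daily values; NaN marks a missing wind direction.
wind_direction = np.array([90.0, 90.0, 180.0, np.nan, 270.0, 270.0, 360.0])  # degrees
avg_temp = np.array([100.0, 120.0, 150.0, 80.0, 90.0, 110.0, 40.0])          # 0.1 degrees Celsius

wind_direction = ma.masked_invalid(wind_direction)
valid = ~ma.getmaskarray(wind_direction)
directions = wind_direction.compressed()
temps = avg_temp[valid]

# Count the days per 90-degree sector; the last bin edge (360) is inclusive.
bins = np.arange(0, 361, 90)
counts, edges = np.histogram(directions, bins=bins)

# Sum the temperatures per sector with the same bins, then divide by the counts.
# Sectors without any days become NaN and are masked.
sums, _ = np.histogram(directions, bins=bins, weights=temps)
sector_means = ma.masked_invalid(sums / np.where(counts == 0, np.nan, counts))

for low, high, n, mean in zip(edges[:-1], edges[1:], counts, sector_means):
    print(low, high, n, mean)
```

Passing the temperatures as weights reuses the histogram machinery to build per-sector sums, which avoids an explicit loop over the days.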

Analyzing wind speed


Wind speed is a very important value. The KNMI De Bilt data file also has daily average wind speed data, expressed in meters per second.

We will load the wind direction, wind speed, and average temperature into NumPy arrays. Wind direction and speed have missing values, so some conversion is in order. We will create a masked array from the wind direction and speed values:

import sys
import numpy as np
import numpy.ma as ma

# Convert blank fields to NaN so that they can be masked later
to_float = lambda x: float(x.strip() or np.nan)
wind_direction, wind_speed, avg_temp = np.loadtxt(sys.argv[1], delimiter=',', usecols=(2, 4, 11), unpack=True, converters={2: to_float, 4: to_float})
wind_direction = ma.masked_invalid(wind_direction)
wind_speed = ma.masked_invalid(wind_speed)
print("# Wind Speed values", len(wind_speed.compressed()))
print("Min speed", wind_speed.min(), "Max speed", wind_speed.max())
print("Average", wind_speed.mean(), "Std. Dev", wind_speed.std())

# Correlate only the days that have a valid wind speed
print("Correlation of wind speed and temperature", np.corrcoef(avg_temp[~ma.getmaskarray(wind_speed)], wind_speed.compressed())[0, 1])

Tip...

Analyzing precipitation and sunshine duration


The KNMI De Bilt data file has a column containing precipitation duration values in 0.1 hours. The sunshine duration, also given in 0.1 hours, is derived from global radiation values. Notice the use of the word global and not solar. Hence, other sources of radiation are taken into account here, but the details are not very important right now. We will plot a histogram of precipitation duration values. However, we will omit the days when no rain fell, because there are so many dry days that they skew the overall picture. We will also display the monthly average precipitation and sunshine durations. The following steps describe the study of rainfall and sunshine duration:

  1. We will load the dates converted into months, sunshine, and precipitation duration into NumPy arrays. Again, we convert missing values to NaN. The code is as follows:

    to_float = lambda x: float(x.strip() or np.nan)
    to_month = lambda x: dt.strptime(x, "%Y%m%d").month
    months, sun_hours...
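
To illustrate the idea of leaving the dry days out of the histogram, here is a small sketch with invented duration values. The choice of four bins is arbitrary and only serves the example.

```python
import numpy as np
import numpy.ma as ma

# Made-up precipitation durations in 0.1 hours; NaN marks a missing value.
duration = np.array([0.0, 0.0, 12.0, np.nan, 35.0, 0.0, 7.0, 120.0])

duration = 0.1 * ma.masked_invalid(duration)   # hours

# Drop the missing values and the dry days before building the histogram.
rainy = duration.compressed()
rainy = rainy[rainy > 0]

counts, edges = np.histogram(rainy, bins=4)
print("Rainy days", len(rainy), "of", duration.count(), "valid days")
print("Counts per bin", counts)
```

The counts and edges arrays can then be handed to a plotting library of your choice.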

Analyzing monthly precipitation in De Bilt


Let's take a look at the De Bilt precipitation data from KNMI, which is given in 0.1 mm. KNMI again uses the convention that -1 represents low values, and we will again set those values to 0:

  1. We will load the dates converted to months, rain amounts, and rain duration in hours into NumPy arrays. Again, missing values need to be converted to NaNs. We then create masked arrays from the NumPy arrays with missing values. The code is as follows:

    to_float = lambda x: float(x.strip() or np.nan)
    to_month = lambda x: dt.strptime(x, "%Y%m%d").month
    months, duration, rain = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1, 21, 22), unpack=True, converters={1: to_month, 21: to_float, 22: to_float})
     
    # Remove -1 values
    rain[rain == -1] = 0
     
    # Measurements are in .1 mm 
    rain = .1 * ma.masked_invalid(rain)
     
    # Measurements are in .1 hours 
    duration = .1 * ma.masked_invalid(duration)
  2. We can calculate some simple statistics, such as minimum, maximum, mean, standard deviation...
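
A minimal sketch of such statistics on a masked array is shown below. The rain amounts are invented for illustration. Masked (missing) values are excluded from every statistic automatically.

```python
import numpy as np
import numpy.ma as ma

# Made-up daily rain amounts in 0.1 mm; -1 marks a low value, NaN a missing value.
rain = np.array([0.0, -1.0, 25.0, np.nan, 130.0, -1.0, 48.0])

rain[rain == -1] = 0                    # set the -1 values to 0
rain = 0.1 * ma.masked_invalid(rain)    # millimeters, missing values masked

print("# Rain values", rain.count())
print("Min", rain.min(), "Max", rain.max())
print("Average", rain.mean(), "Std. Dev", rain.std())
```

Note that count() reports the number of unmasked values, which is why it is used here instead of len().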

Analyzing atmospheric pressure in De Bilt


Atmospheric pressure is the pressure exerted by air in the atmosphere. It is defined as force divided by area. The KNMI De Bilt data file has measurements in 0.1 hPa for average, minimum, and maximum daily pressures. We will plot a histogram of the average pressure and monthly minimums, maximums, and averages:

  1. We will load the dates converted to months, and the average, minimum, and maximum pressure, into NumPy arrays. Again, missing values need to be converted to NaNs. The code is as follows:

    to_float = lambda x: 0.1 * float(x.strip() or np.nan)
    to_month = lambda x: dt.strptime(x, "%Y%m%d").month
    months, avg_p, max_p, min_p = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1, 25, 26, 28), unpack=True, converters={1: to_month, 25: to_float, 26: to_float, 28: to_float})
  2. Values are missing from the pressure value columns, so we have to create masked arrays out of NumPy arrays. The following code snippet prints some simple statistics:

    max_p = ma.masked_invalid...
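
The monthly minimums, maximums, and averages mentioned earlier can be obtained by selecting the days of each month with a Boolean index. The following sketch uses invented pressure values and only three months. It is one possible approach, not necessarily the one used in the rest of the chapter.

```python
import numpy as np
import numpy.ma as ma

# Made-up daily values: month numbers and average pressure in hPa (NaN = missing).
months = np.array([1, 1, 1, 2, 2, 3])
avg_p = ma.masked_invalid(np.array([1012.5, np.nan, 1020.1, 1003.4, 1008.0, np.nan]))

monthly = {}
for month in np.unique(months):
    values = avg_p[months == month]
    if values.count() == 0:
        continue    # no valid measurement in this month
    monthly[int(month)] = (values.min(), values.max(), values.mean())

for month, (lowest, highest, mean) in sorted(monthly.items()):
    print(month, lowest, highest, mean)
```

Months without a single valid measurement are skipped here, so they are simply absent from the result.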

Analyzing atmospheric humidity in De Bilt


Relative atmospheric humidity is the partial pressure of water vapor expressed as a percentage of the maximum water vapor pressure at the same temperature. During the summer months, high humidity can make it harder to get rid of excess heat by sweating. Humidity is also related to rain, dew, and fog. The KNMI De Bilt data file provides data on daily relative average, minimum, and maximum humidity in percentages. We will draw a histogram of the daily relative average humidity and a monthly chart:

  1. We will load the dates converted to months, daily relative average humidity, and the minimum and maximum humidity into NumPy arrays. Again, missing values need to be converted into NaNs:

    to_float = lambda x: float(x.strip() or np.nan)
    to_month = lambda x: dt.strptime(x, "%Y%m%d").month
    months, avg_h, max_h, min_h = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1, 35, 36, 38), unpack=True, converters={1: to_month, 35: to_float, 36: to_float, 38: to_float})
  2. Values...

Summary


This chapter explained a great number of common NumPy functions. We explored the data from a KNMI weather station. The exploration is not exhaustive, so I encourage you to play with the data on your own. You should have realized by now how easy it is to do basic data analysis with NumPy and related Python libraries.

In the next chapter, we will go a step further and try to predict temperature using the same data.
