There are several common data structures we will keep coming across.
A list is a basic Python data type for storing a collection of values. A list is created by putting element values inside square brackets. To reuse our list, we can give it a name and store it like this:
evens = [2,4,6,8,10]
When we want a series over a greater range, for instance, to get more data points for our curve of squares to make it smoother, we can use the Python range() function:
evens = list(range(2, 102, 2))
This command gives us all the even numbers from 2 to 100 (both inclusive) and stores them in a list named evens. Note that in Python 3, range() returns a lazy range object rather than a list, so we wrap it in list() to store the actual values.
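A quick check of the endpoints confirms the result:
print(evens[:3])    # [2, 4, 6]
print(evens[-1])    # 100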
Very often, we deal with more complex data. If you need a matrix with multiple columns or want to perform mathematical operations over all elements in a collection, then numpy is for you:
import numpy as np
We abbreviated numpy to np by convention, keeping our code succinct.
np.array() converts a supported data type, a list in this case, into a NumPy array. To produce a NumPy array from our evens list, we do the following:
np.array(evens)
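As a quick illustration of why this helps, arithmetic on a NumPy array applies element-wise, so we can square every value without an explicit loop; a minimal sketch:
evens_np = np.array(evens)
squares = evens_np ** 2   # element-wise squaring over the whole array
print(squares[:3])        # [ 4 16 36]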
A pandas dataframe is useful when we have some non-numerical labels or values in our matrix. Unlike NumPy, it does not require homogeneous data, and columns can be named. There are also functions such as melt() and pivot_table() that make it convenient to reshape the table for analysis and plotting.
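For instance, here is a minimal sketch of reshaping a small dataframe with melt(); the column names are our own, for illustration only:
import pandas as pd
df = pd.DataFrame({'n': [1, 2, 3], 'square': [1, 4, 9]})
# melt() turns the named columns into long-format (variable, value) pairs
long_df = df.melt(var_name='quantity', value_name='value')
print(long_df)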
To convert a list into a pandas dataframe, we do the following:
import pandas as pd
pd.DataFrame(evens)
You can also convert a numpy array into a pandas dataframe.
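For example, with a column label of our own choosing:
pd.DataFrame(np.array(evens), columns=['even'])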
While all this gives you a refresher of the data structures we will be working with, in real life, instead of inventing data, we read it from data sources. A tab-delimited plaintext file is the simplest and most common type of data input. Imagine we have a file called evens.txt containing the aforementioned even numbers. There are two columns; the first column records unneeded information, and we want to load the data in the second column.
Here is what the dummy text file looks like:
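Assuming the first column is simply a row index we do not need, evens.txt might look like this:
1	2
2	4
3	6
...
50	100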
We can initialize an empty list, read the file line by line, split each line, and append the second element to our list:
evens = []
with open('evens.txt') as f:
    for line in f:
        # split on whitespace and keep the second column as an integer
        evens.append(int(line.split()[1]))
Note
Of course, you can also do this in a one-liner:
evens = [int(x.split()[1]) for x in open('evens.txt').readlines()]
We are just trying to go step by step, following the Zen of Python: simple is better than complex.
It is simple when we have a file with only two columns and only one column to read, but it can get tedious when we have an extended table containing thousands of columns and rows that we want to convert into a NumPy matrix later.
NumPy provides a standard one-liner solution:
import numpy as np
np.loadtxt('evens.txt', delimiter='\t', usecols=1, dtype=np.int32)
The first parameter is the path of the data file. The delimiter parameter specifies the string used to separate values, a tab here. Because numpy.loadtxt() splits values separated by any whitespace into columns by default, this argument could have been omitted; we set it for demonstration.
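That is, this shorter call behaves the same for our tab-separated file:
np.loadtxt('evens.txt', usecols=1, dtype=np.int32)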
The usecols and dtype parameters specify which columns to read and what data type each column corresponds to. You may pass a single value to each, or a sequence (such as a list) to read multiple columns.
NumPy also skips lines starting with # by default, which typically marks comment or header lines. You may change this behavior by setting the comments parameter.
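For example, to read both columns as integers in one call (a sketch, assuming the two-column layout shown earlier):
data = np.loadtxt('evens.txt', delimiter='\t', usecols=(0, 1), dtype=np.int32)
print(data.shape)   # (50, 2): one row per line, one column per field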
Similar to NumPy, pandas offers an easy way to load text files into a pandas dataframe:
import pandas as pd
pd.read_csv('evens.txt', delimiter='\t', header=None, usecols=[1])
Note that usecols takes a list rather than a single integer, and header=None tells pandas that our file has no header row.
The separator can be specified with either sep or delimiter, which defaults to a comma (CSV stands for comma-separated values).
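You can also name the columns up front and then select by name instead of position; a sketch, where idx and even are names we made up:
evens_df = pd.read_csv('evens.txt', sep='\t', header=None,
                       names=['idx', 'even'], usecols=['even'])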
There is a long list of less commonly used options available to determine how different data formats, data types, and errors should be handled. You may refer to the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html. Besides flat CSV files, pandas also has built-in functions for reading other common data formats, such as Excel, JSON, HTML, HDF5, SQL, and Google BigQuery.
To stay focused on data visualization, we will not dig deep into data-cleaning methods in this book, but they are a survival skill set that is very helpful in data science. If interested, you can check out resources on data handling with Python.