In the world of data science, raw data comes in many forms and sizes, and a lot of information can be extracted from it. To give an example, Amazon collects clickstream data that records every click the user makes on the website. This data can be utilized to understand whether a user is a price-sensitive customer or prefers more popularly rated products. You must have noticed the recommended products on Amazon; they are derived using such data.
The first step towards such an analysis would be to parse raw data. The parsing of the data involves the following steps:
Extracting data from the source: Data can come in many forms, such as Excel, CSV, JSON, databases, and so on. Python makes it very easy to read data from these sources with the help of some useful packages, which will be covered in this chapter.
Cleaning the data: Once a sanity check has been done, the data needs to be cleaned appropriately so that it can be utilized for analysis. Say you have a dataset about the students of a class, with details about their height, weight, and marks, and certain rows have the height or weight missing. Depending on the analysis being performed, these rows can either be dropped or the missing values can be replaced with the average height or weight.
In this chapter we will cover the following topics:
Exploring arrays with NumPy
Handling data with pandas
Reading and writing data from various formats
Handling missing data
Manipulating data
Python, by default, comes with data structures such as lists, which can be utilized for array operations, but a Python list on its own is not suitable for heavy mathematical operations, as it is not optimized for them.
NumPy is a wonderful Python package created by Travis Oliphant, fundamentally for scientific computing. It helps handle large multidimensional arrays and matrices, and provides a large library of high-level mathematical functions to operate on these arrays.
A NumPy array requires much less memory to store the same amount of data than a Python list, which also helps in reading from and writing to the array in a faster manner.
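As a rough illustration (the exact byte counts depend on your platform and Python version), you can compare the footprint of a list and an array holding the same values:
>>> import sys
>>> import numpy as np
>>> py_list = list(range(1000))
>>> np_arr = np.arange(1000)
>>> sys.getsizeof(py_list)   # the list object alone, not counting the integer objects it points to
>>> np_arr.nbytes            # the packed data buffer: 1000 elements * 8 bytes for int64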
A list of numbers can be passed to the following array function to create a NumPy array object:
>>> import numpy as np
>>> n_array = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
A NumPy array object has a number of attributes that provide information about the array. Here are its important attributes:
ndim: This gives the number of dimensions of the array. The following shows that the array we defined has two dimensions:
>>> n_array.ndim
2
n_array has a rank of 2, which makes it a 2D array.
shape: This gives the size of each dimension of the array:
>>> n_array.shape
(3, 4)
The first dimension of n_array has a size of 3 and the second dimension has a size of 4. This can also be visualized as three rows and four columns.
size: This gives the number of elements:
>>> n_array.size
12
The total number of elements in n_array is 12.
dtype: This gives the datatype of the elements in the array:
>>> n_array.dtype.name
int64
When you have an array of data, you may want to perform certain mathematical operations on it. We will now discuss a few of the important ones in the following sections.
The following commands subtract the b array from the a array to get the resultant c array. The subtraction happens element by element:
>>> a = np.array([11, 12, 13, 14])
>>> b = np.array([ 1,  2,  3,  4])
>>> c = a - b
>>> c
[10 10 10 10]
Do note that when you subtract two arrays, they should have the same shape.
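For instance (a small sketch), subtracting a three-element array from the four-element a array raises an error, because the shapes cannot be matched up element by element:
>>> a - np.array([1, 2, 3])
Traceback (most recent call last):
  ...
ValueError: operands could not be broadcast together with shapes (4,) (3,)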
The following command raises each element to the power of 2 to obtain this result:
>>> b**2
[ 1  4  9 16]
The following command applies cosine to each of the values in the b array to obtain the following result:
>>> np.cos(b)
[ 0.54030231 -0.41614684 -0.9899925  -0.65364362]
The following command applies a conditional operation to each of the elements of the b array, in order to generate the respective Boolean values:
>>> b < 2
[ True False False False]
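Such a Boolean array can also be used as a mask to pick out the elements that satisfy the condition; a brief sketch:
>>> b[b < 2]
[1]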
Two matrices can be multiplied element by element or as a dot product. The following commands will perform the element-by-element multiplication:
>>> A1 = np.array([[1, 1], [0, 1]])
>>> A2 = np.array([[2, 0], [3, 4]])
>>> A1 * A2
[[2 0]
 [0 4]]
The dot product can be performed with the following command:
>>> np.dot(A1, A2)
[[5 4]
 [3 4]]
If you want to select a particular element of an array, it can be achieved using indexes:
>>> n_array[0, 1]
1
The preceding command selects the first row and then picks the second value in that row. It can also be seen as the intersection of the first row and the second column of the matrix.
If a range of values has to be selected on a row, then we can use the following command:
>>> n_array[0, 0:3]
[0 1 2]
The 0:3 slice selects the first three values of the first row.
The whole row of values can be selected with the following command:
>>> n_array[0, :]
[0 1 2 3]
An entire column of values can be selected using the following command:
>>> n_array[:, 1]
[1 5 9]
Once the array has been created, we can change the shape of it too. The following command flattens the array:
>>> n_array.ravel()
[ 0  1  2  3  4  5  6  7  8  9 10 11]
The following command reshapes the array into a six rows and two columns format. Also, note that when reshaping, the new shape should have the same number of elements as the previous one:
>>> n_array.shape = (6, 2)
>>> n_array
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
The array can be transposed too:
>>> n_array.transpose()
[[ 0  2  4  6  8 10]
 [ 1  3  5  7  9 11]]
The pandas library was developed by Wes McKinney when he was working at AQR Capital Management. He wanted a tool that was flexible enough to perform quantitative analysis on financial data. Later, Chang She joined him and helped develop the package further.
The pandas library is an open source Python library, specially designed for data analysis. It has been built on NumPy and makes it easy to handle data. NumPy is a fairly low-level tool that handles matrices really well.
The pandas library brings the richness of R to the world of Python for handling data. It has efficient data structures for processing data, performing fast joins, and reading data from various sources, to name a few capabilities.
The pandas library essentially has three data structures:
Series
DataFrame
Panel
Series is a one-dimensional array, which can hold any type of data, such as integers, floats, strings, and Python objects too. A series can be created by calling the following:
>>> import pandas as pd
>>> pd.Series(np.random.randn(5))
0    0.733810
1   -1.274658
2   -1.602298
3    0.460944
4   -0.632756
dtype: float64
The random.randn function is part of the NumPy package and generates random numbers. The Series function creates a pandas Series that consists of an index, which is the first column, while the second column consists of the random values. At the bottom of the output is the datatype of the Series.
The index of the series can be customized by calling the following:
>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a   -0.929494
b   -0.571423
c   -1.197866
d    0.081107
e   -0.035091
dtype: float64
A series can be derived from a Python dict too:
>>> d = {'A': 10, 'B': 20, 'C': 30}
>>> pd.Series(d)
A    10
B    20
C    30
dtype: int64
DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a table. A DataFrame can be formed from the following data structures:
A 1D NumPy array
Lists
Dicts
Series
A 2D NumPy array
A DataFrame can be created from a dict of series by calling the following commands:
>>> d = {'c1': pd.Series(['A', 'B', 'C']),
...      'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df
    c1  c2
0    A   1
1    B   2
2    C   3
3  NaN   4
The DataFrame can be created using a dict of lists too:
>>> d = {'c1': ['A', 'B', 'C', 'D'],
...      'c2': [1, 2.0, 3.0, 4.0]}
>>> df = pd.DataFrame(d)
>>> print(df)
  c1  c2
0  A   1
1  B   2
2  C   3
3  D   4
A Panel is a data structure that handles 3D data. The following command is an example of panel data:
>>> d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
...      'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
The preceding command shows that there are 2 DataFrames represented by two items. There are four rows represented by four major axes and three columns represented by three minor axes.
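Each item can be pulled back out of the panel as a DataFrame. A minimal sketch, reusing the d dict defined above:
>>> p = pd.Panel(d)
>>> p['Item1']   # returns the 4 x 3 DataFrame stored under Item1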
Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library makes it convenient to read data from these formats or to export data to these formats. We'll use a dataset that contains the weight statistics of school students from the U.S.
We'll be using a file with the following structure:

Column | Description
---|---
LOCATION CODE | Unique location code
COUNTY | The county the school belongs to
AREA NAME | The district the school belongs to
REGION | The region the school belongs to
SCHOOL YEARS | The school year the data is addressing
NO. OVERWEIGHT | The number of overweight students
PCT OVERWEIGHT | The percentage of overweight students
NO. OBESE | The number of obese students
PCT OBESE | The percentage of obese students
NO. OVERWEIGHT OR OBESE | The number of students who are overweight or obese
PCT OVERWEIGHT OR OBESE | The percentage of students who are overweight or obese
GRADE LEVEL | Whether they belong to elementary or high school
AREA TYPE | The type of area
STREET ADDRESS | The address of the school
CITY | The city the school belongs to
Location 1 | The address with longitude and latitude
To read data from a .csv file, the following read_csv function can be used:
>>> d = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> d[0:5]['AREA NAME']
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
The read_csv function takes the path of the .csv file as input. The command after this prints the first five rows of the AREA NAME column in the data.
To write data to a .csv file, the following to_csv function can be used:
>>> d = {'c1': pd.Series(['A', 'B', 'C']),
...      'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df.to_csv('sample_data.csv')
The DataFrame is written to a .csv file using the to_csv method, which takes the path and the filename where the file needs to be created.
In addition to the pandas package, the xlrd package needs to be installed for pandas to read the data from an Excel file:
>>> d=pd.read_excel('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xls')
The preceding function is similar to the CSV reading command. To write to an Excel file, the xlwt package needs to be installed:
>>> df.to_excel('sample_data.xls')
To read the data from a JSON file, Python's standard json package can be used. The following commands help in reading the file:
>>> import json
>>> json_data = open('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.json')
>>> data = json.load(json_data)
>>> json_data.close()
In the preceding commands, the open() function opens a connection to the file, the json.load() function loads the data into Python, and the json_data.close() call closes the connection to the file.
The pandas library also provides a function to read JSON files, which can be accessed using pd.read_json().
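A minimal sketch of this alternative, assuming the JSON file holds a table-like structure that pandas can interpret directly (the filename here is just a placeholder; deeply nested files may need preprocessing first):
>>> d = pd.read_json('Data/sample_data.json')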
To read data from a database, the following function can be used:
>>> pd.read_sql_table(table_name, con)
The preceding command generates a DataFrame from the given table name and SQLAlchemy engine. This function does not support DBAPI connections. The following are descriptions of the parameters used:
table_name: This refers to the name of the SQL table in the database
con: This refers to the SQLAlchemy engine
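A short sketch of how these pieces fit together, assuming a local SQLite database file student_weight.db containing a students table (both names are hypothetical) and the sqlalchemy package installed:
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///student_weight.db')
>>> d = pd.read_sql_table('students', engine)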
The following command reads an SQL query into a DataFrame:
>>> pd.read_sql_query(sql, con)
The following are descriptions of the parameters used:
sql: This refers to the SQL query that is to be executed
con: This refers to the SQLAlchemy engine
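Reusing the hypothetical engine and students table from the previous sketch, an arbitrary query can be run and its result loaded as a DataFrame:
>>> d = pd.read_sql_query("SELECT * FROM students WHERE county = 'ALBANY'", engine)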
The data in its raw form generally requires some cleaning so that it can be analyzed or a dashboard can be created on it. There are many reasons that data might have issues. For example, the Point of Sale system at a retail shop might have malfunctioned and inputted some data with missing values. We'll be learning how to handle such data in the following section.
Generally, most data will have some missing values. There could be various reasons for this: the source system that collects the data might not have collected the values, or they may never have existed. Once you have the data loaded, it is essential to check for missing elements in it. Depending on the requirements, the missing data can be handled by removing rows or by replacing missing values with alternative values.
In the Student Weight data, to check whether the Location 1 column has missing values, the following command can be utilized:
>>> d['Location 1'].isnull()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
The isnull() method outputs TRUE or FALSE for each row of the column. If it's True, then the value is missing. This output can be aggregated to find the number of instances of missing values:
>>> d['Location 1'].isnull().value_counts()
False    3246
True       24
dtype: int64
The preceding command shows that the Location 1 column has 24 instances of missing values. These missing values can be handled by either removing the rows containing them or replacing them with some values. To remove the rows, execute the following command:
>>> d = d.dropna(subset=['Location 1'])
To remove all the rows with an instance of missing values, use the following command:
>>> d = d.dropna(how='any')
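The dropna method accepts a few other useful arguments too; for example (a brief sketch), how='all' removes only the rows in which every column is missing:
>>> d = d.dropna(how='all')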
Let's define a DataFrame to work with:
>>> df = pd.DataFrame(np.random.randn(5, 3), index=['a0', 'a10', 'a20', 'a30', 'a40'], columns=['X', 'Y', 'Z'])
>>> df
            X         Y         Z
a0  -0.854269  0.117540  1.515373
a10 -0.483923 -0.379934  0.484155
a20 -0.038317  0.196770 -0.564176
a30  0.752686  1.329661 -0.056649
a40 -1.383379  0.632615  1.274481
We'll now add some extra row indexes, which will create null values in our DataFrame:
>>> df2 = df.reindex(['a0', 'a1', 'a10', 'a11', 'a20', 'a21', 'a30', 'a31', 'a40', 'a41'])
>>> df2
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1        NaN       NaN       NaN
a10  1.413044  0.615997  0.947334
a11       NaN       NaN       NaN
a20  1.583516  1.388921  0.458771
a21       NaN       NaN       NaN
a30  0.479579  1.427625  1.407924
a31       NaN       NaN       NaN
a40  0.455510 -0.880937  1.375555
a41       NaN       NaN       NaN
If you want to replace the null values in the df2 DataFrame with a value of zero, execute the following command:
>>> df2.fillna(0)
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1   0.000000  0.000000  0.000000
a10  1.413044  0.615997  0.947334
a11  0.000000  0.000000  0.000000
a20  1.583516  1.388921  0.458771
a21  0.000000  0.000000  0.000000
a30  0.479579  1.427625  1.407924
a31  0.000000  0.000000  0.000000
a40  0.455510 -0.880937  1.375555
a41  0.000000  0.000000  0.000000
If you want to fill the value with forward propagation, which means that the value previous to the null value in the column will be used to fill the null value, the following command can be used:
>>> df2.fillna(method='pad')  # filling with forward propagation
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1  -1.193371  0.912654 -0.780461
a10  1.413044  0.615997  0.947334
a11  1.413044  0.615997  0.947334
a20  1.583516  1.388921  0.458771
a21  1.583516  1.388921  0.458771
a30  0.479579  1.427625  1.407924
a31  0.479579  1.427625  1.407924
a40  0.455510 -0.880937  1.375555
a41  0.455510 -0.880937  1.375555
If you want to fill the null values of the column with the column mean, then the following command can be utilized:
>>> df2.fillna(df2.mean())
            X         Y         Z
a0  -1.193371  0.912654 -0.780461
a1   0.547655  0.692852  0.681825
a10  1.413044  0.615997  0.947334
a11  0.547655  0.692852  0.681825
a20  1.583516  1.388921  0.458771
a21  0.547655  0.692852  0.681825
a30  0.479579  1.427625  1.407924
a31  0.547655  0.692852  0.681825
a40  0.455510 -0.880937  1.375555
a41  0.547655  0.692852  0.681825
Sometimes, you would want to modify a string field column in your data. The following techniques explain some of the string operations:
Substring: Let's start by choosing the first five rows of the AREA NAME column in the data as our sample data to modify:
>>> df = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> df['AREA NAME'][0:5]
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
Name: AREA NAME, dtype: object
In order to extract the first word from the AREA NAME column, we'll use the extract function as shown in the following command:
>>> df['AREA NAME'][0:5].str.extract('(\w+)')
0    RAVENA
1    RAVENA
2    RAVENA
3    COHOES
4    COHOES
Name: AREA NAME, dtype: object
In the preceding command, the str attribute of the series is utilized. The str class contains an extract method, to which a regular expression can be fed to extract data, which is very powerful. It is also possible to extract a second word in AREA NAME as a separate column:
>>> df['AREA NAME'][0:5].str.extract('(\w+)\s(\w+)')
        0         1
0  RAVENA  COEYMANS
1  RAVENA  COEYMANS
2  RAVENA  COEYMANS
3  COHOES      CITY
4  COHOES      CITY
To extract data in different columns, the respective regular expression needs to be enclosed in separate parentheses.
Filtering: If we want to filter rows with data on ELEMENTARY schools, then the following command can be used:
>>> df[df['GRADE LEVEL'] == 'ELEMENTARY']
Uppercase: To convert the area name to uppercase, we'll use the following command:
>>> df['AREA NAME'][0:5].str.upper()
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT
Name: AREA NAME, dtype: object
Since the data strings are in uppercase already, there won't be any difference seen.
Lowercase: To convert AREA NAME to lowercase, we'll use the following command:
>>> df['AREA NAME'][0:5].str.lower()
0    ravena coeymans selkirk central school district
1    ravena coeymans selkirk central school district
2    ravena coeymans selkirk central school district
3                        cohoes city school district
4                        cohoes city school district
Name: AREA NAME, dtype: object
Length: To find the length of each element of the AREA NAME column, we'll use the following command:
>>> df['AREA NAME'][0:5].str.len()
0    47
1    47
2    47
3    27
4    27
Name: AREA NAME, dtype: int64
Split: To split AREA NAME based on whitespace, we'll use the following command:
>>> df['AREA NAME'][0:5].str.split(' ')
0    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
1    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
2    [RAVENA, COEYMANS, SELKIRK, CENTRAL, SCHOOL, D...
3                     [COHOES, CITY, SCHOOL, DISTRICT]
4                     [COHOES, CITY, SCHOOL, DISTRICT]
Name: AREA NAME, dtype: object
Replace: If we want to replace all the area names ending with DISTRICT with DIST, then the following command can be used:
>>> df['AREA NAME'][0:5].str.replace('DISTRICT$', 'DIST')
0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DIST
3                        COHOES CITY SCHOOL DIST
4                        COHOES CITY SCHOOL DIST
Name: AREA NAME, dtype: object
The first argument in the replace method is the regular expression used to identify the portion of the string to replace. The second argument is the value for it to be replaced with.
To combine datasets together, the concat function of pandas can be utilized. Let's take the AREA NAME and COUNTY columns with their first five rows:
>>> d[['AREA NAME', 'COUNTY']][0:5]
                                         AREA NAME  COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
3                      COHOES CITY SCHOOL DISTRICT  ALBANY
4                      COHOES CITY SCHOOL DISTRICT  ALBANY
We can divide the data as follows:
>>> p1 = d[['AREA NAME', 'COUNTY']][0:2]
>>> p2 = d[['AREA NAME', 'COUNTY']][2:5]
The first two rows of the data are in p1 and the last three rows are in p2. These pieces can be combined using the concat() function:
>>> pd.concat([p1, p2])
                                         AREA NAME  COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
2  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
3                      COHOES CITY SCHOOL DISTRICT  ALBANY
4                      COHOES CITY SCHOOL DISTRICT  ALBANY
The combined pieces can be identified by assigning a key:
>>> concatenated = pd.concat([p1, p2], keys=['p1', 'p2'])
>>> concatenated
                                               AREA NAME  COUNTY
p1 0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
   1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
p2 2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
   3                        COHOES CITY SCHOOL DISTRICT  ALBANY
   4                        COHOES CITY SCHOOL DISTRICT  ALBANY
Using the keys, the pieces can be extracted back from the concatenated data:
>>> concatenated.ix['p1']
                                         AREA NAME  COUNTY
0  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
1  RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT  ALBANY
Once the missing data is handled, various operations can be performed on the data.
There are a number of aggregation operations, such as average, sum, and so on, which you may want to perform on a numerical field. The following are the methods used to perform them:
Average: To find out the average number of students in ELEMENTARY schools who are obese, we'll first filter the ELEMENTARY data with the following command:
>>> data = d[d['GRADE LEVEL'] == 'ELEMENTARY']
Now, we'll find the mean using the following command:
>>> data['NO. OBESE'].mean()
213.41593780369291
The elementary grade level data is filtered and stored in the data object. The NO. OBESE column, which contains the number of obese students, is selected, and the average is computed using the mean() method.
SUM: To find out the total number of elementary students who are obese across all the schools, use the following command:
>>> data['NO. OBESE'].sum()
219605.0
MAX: To get the maximum number of students that are obese in an elementary school, use the following command:
>>> data['NO. OBESE'].max()
48843.0
MIN: To get the minimum number of students that are obese in an elementary school, use the following command:
>>> data['NO. OBESE'].min()
5.0
STD: To get the standard deviation of the number of obese students, use the following command:
>>> data['NO. OBESE'].std()
1690.3831128098113
COUNT: To count the total number of schools with the ELEMENTARY grade in the DELAWARE county, use the following command:
>>> data = d[(d['GRADE LEVEL'] == 'ELEMENTARY') & (d['COUNTY'] == 'DELAWARE')]
>>> data['COUNTY'].count()
19
The table is filtered for the ELEMENTARY grade and the DELAWARE county. Notice that the conditions are enclosed in parentheses. This ensures that the individual conditions are evaluated separately; the & operator binds more tightly than ==, so if the parentheses are not provided, Python will throw an error.
SQL-like joins can be performed on the DataFrame using pandas. Let's define a lookup DataFrame, which assigns levels to each of the grades using the following command:
>>> grade_lookup = {'GRADE LEVEL': pd.Series(['ELEMENTARY', 'MIDDLE/HIGH', 'MISC']),
...                 'LEVEL': pd.Series([1, 2, 3])}
>>> grade_lookup = pd.DataFrame(grade_lookup)
Let's take the first five rows of the GRADE LEVEL column as an example for performing the joins:
>>> df[['GRADE LEVEL']][0:5]
      GRADE LEVEL
0  DISTRICT TOTAL
1      ELEMENTARY
2     MIDDLE/HIGH
3  DISTRICT TOTAL
4      ELEMENTARY
The following image is a sample of an inner join:
[Image: inner join]
An inner join can be performed with the following command:
>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='inner')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]
   GRADE LEVEL  LEVEL
1   ELEMENTARY      1
4   ELEMENTARY      1
2  MIDDLE/HIGH      2
The join takes place with the join() method. The first argument takes the DataFrame on which the lookup takes place. Note that the grade_lookup DataFrame's index is being set by the set_index() method. This is essential for the join, as without it, the join method won't know which column to join the DataFrame on.
The second argument takes a column of the df DataFrame to join the data on. The third argument defines the join as an inner join.
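The same inner join can also be expressed with pandas' merge function, which may read more naturally if you are used to SQL. A sketch equivalent to the join above:
>>> d_sub = pd.merge(df[0:5], grade_lookup, on='GRADE LEVEL', how='inner')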
The following image is a sample of a left outer join:
[Image: left outer join]
A left outer join can be performed with the following commands:
>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='left')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]
      GRADE LEVEL  LEVEL
0  DISTRICT TOTAL    NaN
1      ELEMENTARY      1
2     MIDDLE/HIGH      2
3  DISTRICT TOTAL    NaN
4      ELEMENTARY      1
You can notice that DISTRICT TOTAL has missing values in the LEVEL column, as the grade_lookup DataFrame does not have an instance for DISTRICT TOTAL.
The following image is a sample of a full outer join:
[Image: full outer join]
The full outer join can be performed with the following commands:
>>> d_sub = df[0:5].join(grade_lookup.set_index(['GRADE LEVEL']), on=['GRADE LEVEL'], how='outer')
>>> d_sub[['GRADE LEVEL', 'LEVEL']]
      GRADE LEVEL  LEVEL
0  DISTRICT TOTAL    NaN
3  DISTRICT TOTAL    NaN
1      ELEMENTARY      1
4      ELEMENTARY      1
2     MIDDLE/HIGH      2
4            MISC      3
It's easy to do an SQL-like group by operation with pandas. Let's say you want to find the sum of the number of obese students in each of the grades; you can use the following command:
>>> d['NO. OBESE'].groupby(d['GRADE LEVEL']).sum()
GRADE LEVEL
DISTRICT TOTAL    127101
ELEMENTARY         72880
MIDDLE/HIGH        53089
This command chooses the number of obese students column, then groups the data by grade level using the groupby method, and finally, the sum method sums up the numbers. The same can be achieved by the following function too:
>>> d['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate(sum)
Here, the aggregate method is utilized. The sum function is passed to obtain the required results.
It's also possible to obtain multiple kinds of aggregations on the same metric. This can be achieved by the following command:
>>> d['NO. OBESE'].groupby(d['GRADE LEVEL']).aggregate([np.sum, np.mean, np.std])
                   sum        mean         std
GRADE LEVEL
DISTRICT TOTAL  127101  128.384848  158.933263
ELEMENTARY       72880   76.958817  100.289578
MIDDLE/HIGH      53089   59.251116   65.905591
In this chapter, we familiarized ourselves with the NumPy and pandas packages. We understood the different datatypes in pandas and how to utilize them. We learned how to perform data cleansing and manipulation, in which we handled missing values and performed string operations. This chapter gives us a foundation for data science; you can dive deeper into NumPy and pandas through the following links:
NumPy documentation: http://docs.scipy.org/doc/
pandas documentation: http://pandas.pydata.org/
In the next chapter, we'll learn about the meaning of inferential statistics and what they do, and also how to make sense of the different concepts in inferential statistics.