Chapter 1: Examining the Distribution of Features and Targets
Machine learning writing and instruction are often algorithm-focused. Sometimes, this gives the impression that all we have to do is choose the right model and that organization-changing insights will follow. But the best place to begin a machine learning project is with an understanding of how the features and targets we will use are distributed.
It is important to make room for the same kind of learning from data that has been central to our work as analysts for decades – studying the distribution of variables, identifying anomalies, and examining bivariate relationships – even as we focus more and more on the accuracy of our predictions.
We will explore tools for doing so in the first three chapters of this book, while also considering implications for model building.
In this chapter, we will use common NumPy and pandas techniques to get a better sense of the attributes of our data. We want to know how key features are distributed before we do any predictive analyses. We also want to know the central tendency, shape, and spread of the distribution of each continuous feature and have a count for each value for categorical features. We will take advantage of very handy NumPy and pandas tools for generating summary statistics, such as the mean, min, and max, as well as standard deviation.
After that, we will create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.
Specifically, in this chapter, we are going to cover the following topics:
- Subsetting data
- Generating frequencies for categorical features
- Generating summary statistics for continuous features
- Identifying extreme values and outliers in univariate analysis
- Using histograms, boxplots, and violin plots to examine the distribution of continuous features
Technical requirements
This chapter will rely heavily on the pandas, NumPy, and Matplotlib libraries, but you don't require any prior knowledge of these. If you have installed Python from a scientific distribution, such as Anaconda or WinPython, then these libraries are probably already installed. If you need to install one of them to run the code in this chapter, you can run pip install [package name]
from a terminal.
Subsetting data
Almost every statistical modeling project I have worked on has required removing some data from the analysis. Often, this is because of missing values or outliers. Sometimes, there are theoretical reasons for limiting our analysis to a subset of the data. For example, we have weather data going back to 1600, but our analysis goals only involve changes in weather since 1900. Fortunately, the subsetting tools in pandas are quite powerful and flexible. We will work with data from the United States National Longitudinal Survey (NLS) of Youth in this section.
Note
The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997 who were born between 1980 and 1985, with annual follow-ups each year through 2017. For this recipe, I pulled 89 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. The NLS data is available for public use at https://www.nlsinfo.org/investigator/pages/search.
Let's start subsetting the data using pandas:
- We will start by loading the NLS data. We also set an index:
import pandas as pd import numpy as np nls97 = pd.read_csv("data/nls97.csv") nls97.set_index("personid", inplace=True)
- Let's select a few columns from the NLS data. The following code creates a new DataFrame that contains some demographic and employment data. A useful feature of pandas is that the new DataFrame retains the index of the old DataFrame, as shown here:
democols = ['gender','birthyear','maritalstatus', 'weeksworked16','wageincome','highestdegree'] nls97demo = nls97[democols] nls97demo.index.name 'personid'
- We can use slicing to select rows by position.
nls97demo[1000:1004]
selects every row, starting from the row indicated by the integer to the left of the colon (1000
, in this case) up to, but not including, the row indicated by the integer to the right of the colon (1004
). The row at1000
is the 1,001st row because of zero-based indexing. Each row appears as a column in the output since we have transposed the resulting DataFrame:nls97demo[1000:1004].T personid 195884 195891 195970\ gender Male Male Female birthyear 1981 1980 1982 maritalstatus NaN Never-married Never-married weeksworked16 NaN 53 53 wageincome NaN 14,000 52,000 highestdegree 4.Bachelors 2.High School 4.Bachelors personid 195996 gender Female birthyear 1980 maritalstatus NaN weeksworked16 NaN wageincome NaN highestdegree 3.Associates
- We can also skip rows over the interval by setting a value for the step after the second colon. The default value for the step is 1. The value for the following step is 2, which means that every other row between
1000
and1004
will be selected:nls97demo[1000:1004:2].T personid 195884 195970 gender Male Female birthyear 1981 1982 maritalstatus NaN Never-married weeksworked16 NaN 53 wageincome NaN 52,000 highestdegree 4.Bachelors 4. Bachelors
- If we do not include a value to the left of the colon, row selection will start with the first row. Notice that this returns the same DataFrame as the
head
method does:nls97demo[:3].T personid 100061 100139 100284 gender Female Male Male birthyear 1980 1983 1984 maritalstatus Married Married Never-married weeksworked16 48 53 47 wageincome 12,500 120,000 58,000 highestdegree 2.High School 2. High School 0.None nls97demo.head(3).T personid 100061 100139 100284 gender Female Male Male birthyear 1980 1983 1984 maritalstatus Married Married Never-married weeksworked16 48 53 47 wageincome 12,500 120,000 58,000 highestdegree 2.High School 2.High School 0. None
- If we use a negative number, -n, to the left of the colon, the last n rows of the DataFrame will be returned. This returns the same DataFrame as the
tail
method does:nls97demo[-3:].T personid 999543 999698 999963 gender Female Female Female birthyear 1984 1983 1982 maritalstatus Divorced Never-married Married weeksworked16 0 0 53 wageincome NaN NaN 50,000 highestdegree 2.High School 2.High School 4. Bachelors nls97demo.tail(3).T personid 999543 999698 999963 gender Female Female Female birthyear 1984 1983 1982 maritalstatus Divorced Never-married Married weeksworked16 0 0 53 wageincome NaN NaN 50,000 highestdegree 2.High School 2.High School 4. Bachelors
- We can select rows by index value using the
loc
accessor. Recall that for thenls97demo
DataFrame, the index ispersonid
. We can pass a list of the index labels to theloc
accessor, such asloc[[195884,195891,195970]]
, to get the rows associated with those labels. We can also pass a lower and upper bound of index labels, such asloc[195884:195970]
, to retrieve the indicated rows:nls97demo.loc[[195884,195891,195970]].T personid 195884 195891 195970 gender Male Male Female birthyear 1981 1980 1982 maritalstatus NaN Never-married Never-married weeksworked16 NaN 53 53 wageincome NaN 14,000 52,000 highestdegree 4.Bachelors 2.High School 4.Bachelors nls97demo.loc[195884:195970].T personid 195884 195891 195970 gender Male Male Female birthyear 1981 1980 1982 maritalstatus NaN Never-married Never-married weeksworked16 NaN 53 53 wageincome NaN 14,000 52,000 highestdegree 4.Bachelors 2.High School 4.Bachelors
- To select rows by position, rather than by index label, we can use the
iloc
accessor. We can pass a list of position numbers, such asiloc[[0,1,2]]
, to the accessor to get the rows at those positions. We can pass a range, such asiloc[0:3]
, to get rows between the lower and upper bound, not including the row at the upper bound. We can also use theiloc
accessor to select the last n rows.iloc[-3:]
selects the last three rows:nls97demo.iloc[[0,1,2]].T personid 100061 100139 100284 gender Female Male Male birthyear 1980 1983 1984 maritalstatus Married Married Never-married weeksworked16 48 53 47 wageincome 12,500 120,000 58,000 highestdegree 2.High School 2.High School 0. None nls97demo.iloc[0:3].T personid 100061 100139 100284 gender Female Male Male birthyear 1980 1983 1984 maritalstatus Married Married Never-married weeksworked16 48 53 47 wageincome 12,500 120,000 58,000 highestdegree 2.High School 2.High School 0. None nls97demo.iloc[-3:].T personid 999543 999698 999963 gender Female Female Female birthyear 1984 1983 1982 maritalstatus Divorced Never-married Married weeksworked16 0 0 53 wageincome NaN NaN 50,000 highestdegree 2.High School 2.High School 4. Bachelors
Often, we need to select rows based on a column value or the values of several columns. We can do this in pandas by using Boolean indexing. Here, we pass a vector of Boolean values (which can be a Series) to the loc
accessor or the bracket operator. The Boolean vector needs to have the same index as the DataFrame.
- Let's try this using the
nightlyhrssleep
column on the NLS DataFrame. We want a Boolean Series that isTrue
for people who sleep 6 or fewer hours a night (the 33rd percentile) andFalse
ifnightlyhrssleep
is greater than 6 or is missing.sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold
creates the boolean Series. If we display the first few values ofsleepcheckbool
, we will see that we are getting the expected values. We can also confirm that thesleepcheckbool
index is equal to thenls97
index:nls97.nightlyhrssleep.head() personid 100061 6 100139 8 100284 7 100292 nan 100583 6 Name: nightlyhrssleep, dtype: float64 lowsleepthreshold = nls97.nightlyhrssleep.quantile(0.33) lowsleepthreshold 6.0 sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold sleepcheckbool.head() personid 100061 True 100139 False 100284 False 100292 False 100583 True Name: nightlyhrssleep, dtype: bool sleepcheckbool.index.equals(nls97.index) True
Since the sleepcheckbool
Series has the same index as nls97
, we can just pass it to the loc
accessor to create a DataFrame containing people who sleep 6 hours or less a night. This is a little pandas magic here. It handles the index alignment for us:
lowsleep = nls97.loc[sleepcheckbool] lowsleep.shape (3067, 88)
- We could have created the
lowsleep
subset of our data in one step, which is what we would typically do unless we need the Boolean Series for some other purpose:lowsleep = nls97.loc[nls97.nightlyhrssleep<=lowsleepthreshold] lowsleep.shape (3067, 88)
- We can pass more complex conditions to the
loc
accessor and evaluate the values of multiple columns. For example, we can select rows wherenightlyhrssleep
is less than or equal to the threshold andchildathome
(number of children living at home) is greater than or equal to3
:lowsleep3pluschildren = \ nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold) & (nls97.childathome>=3)] lowsleep3pluschildren.shape (623, 88)
Each condition in nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold) & (nls97.childathome>3)]
is placed in parentheses. An error will be generated if the parentheses are excluded. The &
operator is the equivalent of and
in standard Python, meaning that both conditions have to be True
for the row to be selected. We could have used |
for or
if we wanted to select the row if either condition was True
.
- Finally, we can select rows and columns at the same time. The expression to the left of the comma selects rows, while the list to the right of the comma selects columns:
lowsleep3pluschildren = \ nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold) & (nls97.childathome>=3), ['nightlyhrssleep','childathome']] lowsleep3pluschildren.shape (623, 2)
We used three different tools to select columns and rows from a pandas DataFrame in the last two sections: the []
bracket operator and two pandas-specific accessors, loc
and iloc
. This will be a little confusing if you are new to pandas, but it becomes clear which tool to use in which situation after just a few months. If you came to pandas with a fair bit of Python and NumPy experience, you will likely find the []
operator most familiar. However, the pandas documentation recommends against using the []
operator for production code. The loc
accessor is used for selecting rows by Boolean indexing or by index label, while the iloc
accessor is used for selecting rows by row number.
This section was a brief primer on selecting columns and rows with pandas. Although we did not go into too much detail on this, most of what you need to know to subset data was covered, as well as everything you need to know to understand the pandas-specific material in the rest of this book. We will start putting some of that to work in the next two sections by creating frequencies and summary statistics for our features.
Generating frequencies for categorical features
Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values, and are either strings or are numerical without having any intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical but it does not make sense to perform arithmetic operations on those values.
Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts
method to generate counts for each categorical value:
- Let's load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
nls97 = pd.read_csv("data/nls97.csv") nls97.set_index("personid", inplace=True) nls97abb = nls97.iloc[:,:20] nls97abb.dtypes gender object birthmonth int64 birthyear int64 highestgradecompleted float64 maritalstatus object childathome float64 childnotathome float64 wageincome float64 weeklyhrscomputer object weeklyhrstv object nightlyhrssleep float64 satverbal float64 satmath float64 gpaoverall float64 gpaenglish float64 gpamath float64 gpascience float64 highestdegree object govprovidejobs object govpricecontrols object dtype: object
Note
Recall from the previous section how column and row selection works with the
loc
andiloc
accessors. The colon to the left of the comma indicates that we want all the rows, while:20
to the right of the comma gets us the first 20 columns. - All of the object type columns in the preceding code are categorical. We can use
value_counts
to see the counts for each value formaritalstatus
. We can also usedropna=False
to getvalue_counts
to show the missing values (NaN
):nls97abb.maritalstatus.value_counts(dropna=False) Married 3066 Never-married 2766 NaN 2312 Divorced 663 Separated 154 Widowed 23 Name: maritalstatus, dtype: int64
- If we just want the number of missing values, we can chain the
isnull
andsum
methods.isnull
returns a Boolean Series containingTrue
values whenmaritalstatus
is missing andFalse
otherwise.sum
then counts the number ofTrue
values, since it will interpretTrue
values as 1 andFalse
values as 0:nls97abb.maritalstatus.isnull().sum() 2312
- You have probably noticed that the
maritalstatus
values were sorted by frequency by default. You can sort them alphabetically by values by sorting the index. We can do this by taking advantage of the fact thatvalue_counts
returns a Series with the values as the index:marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False) type(marstatcnt) <class 'pandas.core.series.Series'> marstatcnt.index Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
- To sort the index, we just need to call
sort_index
:marstatcnt.sort_index() Divorced 663 Married 3066 Never-married 2766 Separated 154 Widowed 23 NaN 2312 Name: maritalstatus, dtype: int64
- Of course, we could have gotten the same results in one step with
nls97.maritalstatus.value_counts(dropna=False).sort_index()
. We can also show ratios instead of counts by settingnormalize
toTrue
. In the following code, we can see that 34% of the responses wereMarried
(notice that we did not setdropna
toTrue
, so missing values have been excluded):nls97.maritalstatus.\ value_counts(normalize=True, dropna=False).\ sort_index() Divorced 0.07 Married 0.34 Never-married 0.31 Separated 0.02 Widowed 0.00 NaN 0.26 Name: maritalstatus, dtype: float64
- pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns into the category data type. In the following code, we're creating a list that contains the column names for the object columns,
catcols
. Then, we're looping through those columns and usingastype
to change the data type tocategory
:catcols = nls97abb.select_dtypes(include=["object"]).columns for col in nls97abb[catcols].columns: ... nls97abb[col] = nls97abb[col].astype('category') ... nls97abb[catcols].dtypes gender category maritalstatus category weeklyhrscomputer category weeklyhrstv category highestdegree category govprovidejobs category govpricecontrols category dtype: object
- Let's check our category features for missing values. There are no missing values for
gender
and very few forhighestdegree
. But the overwhelming majority of values forgovprovidejobs
(the government should provide jobs) andgovpricecontrols
(the government should control prices) are missing. This means that those features probably won't be useful for most modeling:nls97abb[catcols].isnull().sum() gender 0 maritalstatus 2312 weeklyhrscomputer 2274 weeklyhrstv 2273 highestdegree 31 govprovidejobs 7151 govpricecontrols 7125 dtype: int64
- We can generate frequencies for multiple features at once by passing a
value_counts
call toapply
. We can usefilter
to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not setdropna
toFalse
:nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True) govprovidejobs govpricecontrols 1. Definitely 0.25 0.54 2. Probably 0.34 0.33 3. Probably not 0.25 0.09 4. Definitely not 0.16 0.04
- We can use the same frequencies on a subset of our data. If, for example, we want to see the responses of only married people to the government role questions, we can do that subsetting by placing
nls97abb[nls97abb.maritalstatus=="Married"]
beforefilter
:nls97abb.loc[nls97abb.maritalstatus=="Married"].\ filter(like="gov").\ apply(pd.value_counts, normalize=True) govprovidejobs govpricecontrols 1. Definitely 0.17 0.46 2. Probably 0.33 0.38 3. Probably not 0.31 0.11 4. Definitely not 0.18 0.05
- Since, in this case, there were only two gov columns, it may have been easier to do the following:
nls97abb.loc[nls97abb.maritalstatus=="Married", ['govprovidejobs','govpricecontrols']].\ apply(pd.value_counts, normalize=True) govprovidejobs govpricecontrols 1. Definitely 0.17 0.46 2. Probably 0.33 0.38 3. Probably not 0.31 0.11 4. Definitely not 0.18 0.05
Nonetheless, it will often be easier to use filter
since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.
There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted
. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted
is missing.
- We need to do a little bit of cleaning up first, though.
highestgradecompleted
has two logical missing values – an actual NaN value that pandas recognizes as missing and a 95 value that the survey designers intend for us to also treat as missing for most use cases. Let's usereplace
to fix that before moving on:nls97abb.highestgradecompleted.\ replace(95, np.nan, inplace=True)
- We can use NumPy's
where
function to assign values tohighschoolgrad
based on the values ofhighestgradecompleted
. Ifhighestgradecompleted
is null (NaN
), we assignNaN
to our new column,highschoolgrad
. If the value forhighestgradecompleted
is not null, the next clause tests for a value less than 12, settinghighschoolgrad
to 0 if that is true, and to 1 otherwise. We can confirm that the new column,highschoolgrad
, contains the values we want by usinggroupby
to get the min and max values ofhighestgradecompleted
at each level ofhighschoolgrad
:nls97abb['highschoolgrad'] = \ np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \ np.where(nls97abb.highestgradecompleted<12,0,1)) nls97abb.groupby(['highschoolgrad'], dropna=False) \ ['highestgradecompleted'].agg(['min','max','size']) min max size highschoolgrad 0 5 11 1231 1 12 20 5421 nan nan nan 2332 nls97abb['highschoolgrad'] = \ ... nls97abb['highschoolgrad'].astype('category')
While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad
, this would present some modeling challenges if we intended to use highschoolgrad
as a target. There is a pretty substantial class imbalance, with highschoolgrad
equal to 1 class being more than 4 times the size of the 0 group. We should explore using more groups to represent highestgradecompleted
.
- One way to do this with pandas is with the
qcut
function. We can set theq
parameter ofqcut
to6
to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:nls97abb['highgradegroup'] = \ pd.qcut(nls97abb['highestgradecompleted'], q=6, labels=[1,2,3,4,5,6]) nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\ agg(['min','max','size']) min max size highgradegroup 1 5 11 1231 2 12 12 1389 3 13 14 1288 4 15 16 1413 5 17 17 388 6 18 20 943 nls97abb['highgradegroup'] = \ nls97abb['highgradegroup'].astype('category')
- Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs
value_counts
:freqout = open('views/frequencies.txt', 'w') for col in nls97abb.select_dtypes(include=["category"]): print(col, "----------------------", "frequencies", nls97abb[col].value_counts(dropna=False).sort_index(), "percentages", nls97abb[col].value_counts(normalize=True).\ sort_index(), sep="\n\n", end="\n\n\n", file=freqout) freqout.close()
These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts
method. We can use value_counts
to create frequencies a Series at a time, use it with apply
for multiple columns, or iterate over several columns and call value_counts
each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.
Generating summary statistics for continuous and discrete features
Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.
Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.
For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages has largely the same meaning between 25 and 26 as it has between 35 and 36, though this breaks down at the extremes. The interval between 1 and 2 years of age for humans is not at all like that between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when that is true.
To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.
We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.
Note
Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.
Follow these steps to generate the summary statistics:
- Let's load the COVID
.csv
file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set,iso_code
, contains a unique value for each row. We usesample
to view two countries randomly, rather than the first two (we set a value forrandom_state
to get the same results each time we run the code):import pandas as pd import numpy as np import scipy.stats as scistat covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=['lastdate']) covidtotals.set_index("iso_code", inplace=True) covidtotals.shape (221, 16) covidtotals.index.nunique() 221 covidtotals.sample(2, random_state=6).T iso_code ISL CZE lastdate 2021-07-07 2021-07-07 location Iceland Czechia total_cases 6,555 1,668,277 total_deaths 29 30,311 total_cases_mill 19,209 155,783 total_deaths_mill 85 2,830 population 341,250 10,708,982 population_density 3 137 median_age 37 43 gdp_per_capita 46,483 32,606 aged_65_older 14 19 total_tests_thous NaN NaN life_expectancy 83 79 hospital_beds_thous 3 7 diabetes_prevalence 5 7 region Western Europe Western Europe
Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even in terms of population size. (total_cases_mill
and total_deaths_mill
divide cases and deaths per million of population, respectively.) Data analysts are very used to wondering if there is anything else in the data that may explain substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.
- Let's take a look at the data types and number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data on cases and deaths, as well as likely targets, for 192 and 185 rows, respectively. An important data cleaning task we'll have to do will be figuring out what, if anything, we can do about countries that have missing values for our targets. We'll discuss how to handle missing values later:
covidtotals.info() <class 'pandas.core.frame.DataFrame'> Index: 221 entries, AFG to ZWE Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------- -------------- -------------- 0 lastdate 221 non-null datetime64[ns] 1 location 221 non-null object 2 total_cases 192 non-null float64 3 total_deaths 185 non-null float64 4 total_cases_mill 192 non-null float64 5 total_deaths_mill 185 non-null float64 6 population 221 non-null float64 7 population_density 206 non-null float64 8 median_age 190 non-null float64 9 gdp_per_capita 193 non-null float64 10 aged_65_older 188 non-null float64 11 total_tests_thous 13 non-null float64 12 life_expectancy 217 non-null float64 13 hospital_beds_thous 170 non-null float64 14 diabetes_prevalence 200 non-null float64 15 region 221 non-null object dtypes: datetime64[ns](1), float64(13), object(2) memory usage: 29.4+ KB
- Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the
describe
method. The mean and median (50%) are good indicators of the center of the distribution, each with its strengths. It is also good to notice substantial differences between the mean and median, as an indication of skewness. For example, we can see that the mean cases per million is almost twice the median, with 36.7 thousand compared to 19.5 thousand. This is a clear indicator of positive skew. This is also true for deaths per million.
The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times larger than the 25th percentile value in both cases. We can compare that with the percentage of the population aged 65 and older and diabetes prevalence, where the 75th percentile is just four times or two times that of the 25th percentile, respectively. We can tell right away that those two possible features (aged_65_older
and diabetes_prevalence
) would have to do a lot of work to explain the huge variance in our targets:
keyvars = ['location','total_cases_mill','total_deaths_mill', ... 'aged_65_older','diabetes_prevalence'] covidkeys = covidtotals[keyvars] covidkeys.describe() total_cases_mill total_deaths_mill aged_65_older diabetes_prevalence count 192.00 185.00 188.00 200.00 mean 36,649.37 683.14 8.61 8.44 std 41,403.98 861.73 6.12 4.89 min 8.52 0.35 1.14 0.99 25% 2,499.75 43.99 3.50 5.34 50% 19,525.73 293.50 6.22 7.20 75% 64,834.62 1,087.89 13.92 10.61 max 181,466.38 5,876.01 27.05 30.53
- I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The
quantile
method can take a single value for quantile, such asquantile(0.25)
for the 25th percentile, or a list or tuple, such asquantile((0.25,0.5))
for the 25th and 50th percentiles. In the following code, we're usingarange
from NumPy (np.arange(0.0, 1.1, 0.1)
) to generate an array that goes from 0.0 to 1.0 with a 0.1 increment. We would get the same result if we were to usecovidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0])
:covidkeys.quantile(np.arange(0.0, 1.1, 0.1)) total_cases_mill total_deaths_mill aged_65_older diabetes_prevalence 0.00 8.52 0.35 1.14 0.99 0.10 682.13 10.68 2.80 3.30 0.20 1,717.39 30.22 3.16 4.79 0.30 3,241.84 66.27 3.86 5.74 0.40 9,403.58 145.06 4.69 6.70 0.50 19,525.73 293.50 6.22 7.20 0.60 33,636.47 556.43 7.93 8.32 0.70 55,801.33 949.71 11.19 10.08 0.80 74,017.81 1,333.79 14.92 11.62 0.90 94,072.18 1,868.89 18.85 13.75 1.00 181,466.38 5,876.01 27.05 30.53
For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.
- Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions, that they are distributed symmetrically (have low skew), and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets – that is, total cases and deaths per million people in the population. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. For a Gaussian distribution, we expect a value near 0 for skew and 3 for kurtosis.
total_deaths_mill
has values for skew and kurtosis that are worth noting, and thetotal_cases_mill
andaged_65_older
features have excessively low kurtosis (skinny tails):covidkeys.skew() total_cases_mill 1.21 total_deaths_mill 2.00 aged_65_older 0.84 diabetes_prevalence 1.52 dtype: float64 covidkeys.kurtosis() total_cases_mill 0.91 total_deaths_mill 6.58 aged_65_older -0.56 diabetes_prevalence 3.31 dtype: float64
- We can also explicitly test each distribution's normality by looping over the features in the
keyvars
list and running a Shapiro-Wilk test on the distribution (scistat.shapiro(covidkeys[var].dropna())
). Notice that we need to drop missing values withdropna
for the test to run. p-values less than 0.05 indicate that we can reject the null hypothesis of normal, which is the case for each of the four features:for var in keyvars[1:]: stat, p = scistat.shapiro(covidkeys[var].dropna()) print("feature=", var, " p-value=", '{:.6f}'.format(p)) feature= total_cases_mill p-value= 0.000000 feature= total_deaths_mill p-value= 0.000000 feature= aged_65_older p-value= 0.000000 feature= diabetes_prevalence p-value= 0.000000
These results should make us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.
We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.
This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. This also points us in the direction of more work we need to do to prepare our data for analysis. We need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features. This rarely fails to yield additional insights. We will learn how to identify outliers in the next section and create visualizations in the following section.
Identifying extreme values and outliers in univariate analysis
An outlier can be thought of as an observation with feature values, or relationships between feature values, that are so unusual that they cannot help explain relationships in the rest of the data. This matters for modeling because we cannot assume that the outliers will have a neutral impact on our parameter estimates. Sometimes, our models work so hard to construct parameter estimates that can account for patterns in outlier observations that we compromise the model's explanatory or predictive power for all other observations. Raise your hand if you have ever spent days trying to interpret a model only to discover that your coefficients and predictions completely changed once you removed a few outliers.
I should quickly add that there is no agreed-upon definition of an outlier. I offer the preceding definition for use in this book because it helps us distinguish between outliers, as I have described them, and extreme values. There is a fair bit of overlap between the two, but many extreme values are not outliers. This is because such values reflect a natural and explainable trend in a feature, or because they reflect the same relationship between features as is observed throughout the data. The reverse is also true. Some outliers are not extreme values. For example, a target value might be right in the middle of the distribution but have quite unexpected predictor values.
For our modeling, then, it is hard to say that a particular feature or target value is an outlier without referencing multivariate relationships. But it should at least raise a red flag when, in our univariate analysis, we see values well to the left or right of the center. This should prompt us to investigate the observation at that value further, including examining the values of other features. We will look at multivariate relationships in more detail in the next two chapters. Here, and in the next section on visualizations, we will focus on identifying extreme values and outliers when looking at one variable.
A good starting point for identifying an extreme value is to look at its distance from the middle of the distribution. One common method for doing that is to calculate each value's distance from the interquartile range (IQR), which is the distance between the first quartile value and the third quartile value. We often flag any value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile. We can use this method to identify outliers in the COVID-19 data.
Let's get started:
- Let's start by importing the libraries we will need. In addition to pandas and NumPy, we will use Matplotlib and statsmodels for the plots we will create. We will also load the COVID data and select the variables we need:
import pandas as pd import numpy as np import matplotlib.pyplot as plt import statsmodels.api as sm covidtotals = pd.read_csv("data/covidtotals.csv") covidtotals.set_index("iso_code", inplace=True) keyvars = ['location','total_cases_mill','total_deaths_mill', 'aged_65_older','diabetes_prevalence','gdp_per_capita'] covidkeys = covidtotals[keyvars]
- Let's take a look at
total_cases_mill
. We get the first and third quartile values and calculate the interquartile range,1.5*(thirdq-firstq)
. Then, we calculate their thresholds to determine the high and low extreme values, which areinterquartilerange+thirdq
andfirstq-interquartilerange
, respectively (if you are familiar with boxplots, you will notice that this is the same calculation that's used for the whiskers of a boxplot; we will cover boxplots in the next section):thirdq, firstq = covidkeys.total_cases_mill.quantile(0.75), covidkeys.total_cases_mill.quantile(0.25) interquartilerange = 1.5*(thirdq-firstq) extvalhigh, extvallow = interquartilerange+thirdq, firstq-interquartilerange print(extvallow, extvalhigh, sep=" <--> ") -91002.564625 <--> 158336.930375
- This calculation indicates that any value for
total_cases_mill
that's above 158,337 can be considered extreme. We can ignore extreme values on the low end because they would be negative:covidtotals.loc[covidtotals.total_cases_mill>extvalhigh].T iso_code AND MNE SYC lastdate 2021-07-07 2021-07-07 2021-07-07 location Andorra Montenegro Seychelles total_cases 14,021 100,392 16,304 total_deaths 127 1,619 71 total_cases_mill 181,466 159,844 165,792 total_deaths_mill 1,644 2,578 722 population 77,265 628,062 98,340 population_density 164 46 208 median_age NaN 39 36 gdp_per_capita NaN 16,409 26,382 aged_65_older NaN 15 9 total_tests_thous NaN NaN NaN life_expectancy 84 77 73 hospital_beds_thous NaN 4 4 diabetes_prevalence 8 10 11 region Western Eastern East Europe Europe Africa
- Andorra, Montenegro, and Seychelles all have
total_cases_mill
above the threshold amount. This invites us to explore other ways these countries might be exceptional, and whether our features can capture that. We will not dive deeply into multivariate analysis here since we will do that in the next chapter, but it is a good idea to start wrapping our brains around why these extreme values may or may not make sense. Having some means across the full dataset may help us here:covidtotals.mean() total_cases 963,933 total_deaths 21,631 total_cases_mill 36,649 total_deaths_mill 683 population 35,134,957 population_density 453 median_age 30 gdp_per_capita 19,141 aged_65_older 9 total_tests_thous 535 life_expectancy 73 hospital_beds_thous 3 diabetes_prevalence 8
The main difference between these three countries and others is that they have very low populations. Surprisingly, each has a much lower population density than average. That is the opposite of what you would expect and merits further consideration in the analysis we will do throughout this book.
An Alternative to the IQR Calculation
An alternative to using the interquartile range to identify an extreme value would be to use several standard deviations away from the mean, say 3. One drawback of this method is that it is a little more susceptible to extreme values than using the interquartile range.
I find it helpful to produce this kind of analysis for all the key targets and features in my data, so let's automate this method of identifying extreme values. We should also output the results to a file so that we can use them when we need them:
- Let's define a function,
getextremevalues
, that iterates over all of the columns of our DataFrame (except for the first one, which contains the location column), calculates the interquartile range for that column, selects all the observations with values above the high threshold or below the low threshold for that column, and then appends the results to a new DataFrame (dfout
):def getextremevalues(dfin): dfout = pd.DataFrame(columns=dfin.columns, data=None) for col in dfin.columns[1:]: thirdq, firstq = dfin[col].quantile(0.75), \ dfin[col].quantile(0.25) interquartilerange = 1.5*(thirdq-firstq) extvalhigh, extvallow = \ interquartilerange+thirdq, \ firstq-interquartilerange df = dfin.loc[(dfin[col]>extvalhigh) | (dfin[col]<extvallow)] df = df.assign(varname = col, threshlow = extvallow, threshhigh = extvalhigh) dfout = pd.concat([dfout, df]) return dfout
- Now, we can pass our
covidkeys
DataFrame to thegetextremevalues
function to get a DataFrame that contains the extreme values for each column. Then, we can display the number of extreme values for each column, which tells us that there were four extreme values for the total deaths per million people in the population (total_deaths_mill
). Now, we can output the data to an Excel file:extremevalues = getextremevalues(covidkeys) extremevalues.varname.value_counts() gdp_per_capita 9 diabetes_prevalence 8 total_deaths_mill 4 total_cases_mill 3 Name: varname, dtype: int64 extremevalues.to_excel("views/extremevaluescases.xlsx")
- Let's take a closer look at the extreme deaths per million values. We can query the DataFrame we just created to get the
threshhigh
value fortotal_deaths_mill
, which is2654
. We can also get other key features for those countries with the extreme values since we have included that data in the new DataFrame:extremevalues.loc[extremevalues.varname=="total_deaths_mill", 'threshhigh'][0] 2653.752 extremevalues.loc[extremevalues.varname=="total_deaths_mill", keyvars].sort_values(['total_deaths_mill'], ascending=False) location total_cases_mill total_deaths_mill PER Peru 62,830 5,876 HUN Hungary 83,676 3,105 BIH Bosnia and Herzegovina 62,499 2,947 CZE Czechia 155,783 2,830 _65_older diabetes_prevalence gdp_per_capita PER 7 6 12,237 HUN 19 8 26,778 BIH 17 10 11,714 CZE 19 7 32,606
Peru, Hungary, Bosnia and Herzegovina, and Czechia have total_deaths_mill
above the extreme value threshold. One thing that stands out for three of these countries is how much above the average for percent of population 65 or older they are as well (the average for that feature, as we displayed in a preceding table, is 9). Although these are extreme values for deaths, the relationship between the elderly percentage of the population and COVID deaths may account for much of this and can do so without overfitting the model to these extreme cases. We will go through some strategies for teasing that out in the next chapter.
So far, we have discussed outliers and extreme values without referencing distribution shape. What we have implied so far is that an extreme value is a rare value – significantly rarer than the values near the center of the distribution. But this makes the most sense when the feature's distribution approaches normal. If, on the other hand, a feature had a uniform distribution, a very high value would be no rarer than any other value.
In practice, then, we think about extreme values or outliers relative to the distribution of the feature. Quantile-quantile (Q-Q) plots can improve our sense of that distribution by allowing us to view it graphically relative to a theoretical distribution: normal, uniform, log, or others. Let's take a look:
- Let's create a Q-Q plot of total cases per million that's relative to the normal distribution. We can use the
statsmodels
library for this:sm.qqplot(covidtotals[['total_cases_mill']]. \ sort_values(['total_cases_mill']).dropna(),line='s') ) plt.show()
This produces the following plot:
This Q-Q plot makes it clear that the distribution of total cases across countries is not normal. We can see this by how much the data points deviate from the red line. It is a Q-Q plot that we would expect from a distribution with some positive skew. This is consistent with the summary statistics we have already calculated for the total cases feature. It further reinforces our developing sense, in that we will need to be cautious about parametric models and that we will probably have to account for more than just one or two outlier observations.
Let's look at a Q-Q plot for a feature with a distribution that is a little closer to normal. There isn't a great candidate in the COVID data, so we will work with data from the United States National Oceanic and Atmospheric Administration on land temperatures in 2019.
Data Note
The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-2.
- First, let's load the data into a pandas DataFrame and run some descriptive statistics on the temperature feature,
avgtemp
. We must add a few percentile statistics to the normaldescribe
output to get a better sense of the range of values.avgtemp
is the average temperature for the year at each of the 12,095 weather stations. The average temperature across all stations was 11.2 degrees Celsius. The median was 10.4. However, there were some very negative values, including 14 weather stations with an average temperature of less than -25. This contributes to a moderately negative skew, though both the skew and kurtosis are closer to what would be expected from a normal distribution:landtemps = pd.read_csv("data/landtemps2019avgs.csv") landtemps.avgtemp.describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]) count 12,095.0 mean 11.2 std 8.6 min -60.8 5% -0.7 10% 1.7 25% 5.4 50% 10.4 75% 16.9 90% 23.1 95% 27.0 max 33.9 Name: avgtemp, dtype: float64 landtemps.loc[landtemps.avgtemp<-25,'avgtemp'].count() 14 landtemps.avgtemp.skew() -0.2678382583481768 landtemps.avgtemp.kurtosis() 2.1698313707061073
- Now, let's take a look at a Q-Q plot of the average temperature:
sm.qqplot(landtemps.avgtemp.sort_values().dropna(), line='s') plt.tight_layout() plt.show()
This produces the following plot:
Along most of the range, the distribution of average temperatures looks pretty close to normal. The exceptions are the extremely low temperatures, contributing to a small amount of negative skew. There is also some deviation from normal at the high end, though this is much less of an issue (you may have noticed that Q-Q plots for features with negative skew have an umbrella-like shape, while those with positive skews, such as total cases, have more of a bowl-like shape).
We are off to a good start in our efforts to understand the distribution of possible features and targets and, to a related effort, to identify extreme values and outliers. This is important information to have at our fingertips when we construct, refine, and interpret our models. But there is more that we can do to improve our intuition about the data. A good next step is to construct visualizations of key features.
Using histograms, boxplots, and violin plots to examine the distribution of features
We have already generated many of the numbers that would make up the data points of a histogram or boxplot. But we often improve our understanding of the data when we see it represented graphically. We see observations bunched around the mean, we notice the size of the tails, and we see what seem to be extreme values.
Using histograms
Follow these steps to create a histogram:
- We will work with both the COVID data and the temperatures data in this section. In addition to the libraries we have worked with so far, we must import Seaborn to create some plots more easily than we could in Matplotlib:
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns landtemps = pd.read_csv("data/landtemps2019avgs.csv") covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"]) covidtotals.set_index("iso_code", inplace=True)
- Now, let's create a simple histogram. We can use Matplotlib's
hist
method to create a histogram of total cases per million. We will also draw lines for the mean and median:plt.hist(covidtotals['total_cases_mill'], bins=7) plt.axvline(covidtotals.total_cases_mill.mean(), color='red', linestyle='dashed', linewidth=1, label='mean') plt.axvline(covidtotals.total_cases_mill.median(), color='black', linestyle='dashed', linewidth=1, label='median') plt.title("Total COVID Cases") plt.xlabel('Cases per Million') plt.ylabel("Number of Countries") plt.legend() plt.show()
This produces the following plot:
One aspect of the total distribution that this histogram highlights is that most countries (more than 100 of the 192) are in the very first bin, between 0 cases per million and 25,000 cases per million. Here, we can see the positive skew, with the mean pulled to the right by extreme high values. This is consistent with what we discovered when we used Q-Q plots in the previous section.
- Let's create a histogram of average temperatures from the land temperatures dataset:
plt.hist(landtemps['avgtemp']) plt.axvline(landtemps.avgtemp.mean(), color='red', linestyle='dashed', linewidth=1, label='mean') plt.axvline(landtemps.avgtemp.median(), color='black', linestyle='dashed', linewidth=1, label='median') plt.title("Average Land Temperatures") plt.xlabel('Average Temperature') plt.ylabel("Number of Weather Stations") plt.legend() plt.show()
This produces the following plot:
The histogram for the average land temperatures from the land temperatures dataset looks quite different. Except for a few highly negative values, this distribution looks closer to normal. Here, we can see that the mean and the median are quite close and that the distribution looks fairly symmetrical.
- We should take a look at the observations at the extreme left of the distribution. They are all in Antarctica or the extreme north of Canada. Here, we have to wonder if it makes sense to include observations with such extreme values in the models we construct. However, it would be premature to make that determination based on these results alone. We will come back to this in the next chapter when we examine multivariate techniques for identifying outliers:
landtemps.loc[landtemps.avgtemp<-25,['station','country','avgtemp']].\ ... sort_values(['avgtemp'], ascending=True) station country avgtemp 827 DOME_PLATEAU_DOME_A Antarctica -60.8 830 VOSTOK Antarctica -54.5 837 DOME_FUJI Antarctica -53.4 844 DOME_C_II Antarctica -50.5 853 AMUNDSEN_SCOTT Antarctica -48.4 842 NICO Antarctica -48.4 804 HENRY Antarctica -47.3 838 RELAY_STAT Antarctica -46.1 828 DOME_PLATEAU_EAGLE Antarctica -43.0 819 KOHNENEP9 Antarctica -42.4 1299 FORT_ROSS Canada -30.3 1300 GATESHEAD_ISLAND Canada -28.7 811 BYRD_STATION Antarctica -25.8 816 GILL Antarctica -25.5
An excellent way to visualize central tendency, spread, and outliers at the same time is with a boxplot.
Using boxplots
Boxplots show us the interquartile range, with whiskers representing 1.5 times the interquartile range, and data points beyond that range that can be considered extreme values. If this calculation seems familiar, it's because it's the same one we used earlier in this chapter to identify extreme values! Let's get started:
- We can use the Matplotlib
boxplot
method to create a boxplot of total cases per million people in the population. We can draw arrows to show the interquartile range (the first quartile, median, and third quartile) and the extreme value threshold. The three circles above the threshold can be considered extreme values. The line from the interquartile range to the extreme value threshold is typically referred to as the whisker. There are usually whiskers above and below the interquartile range, but the threshold value below the first quartile value would be negative in this case:plt.boxplot(covidtotals.total_cases_mill.dropna(), labels=['Total Cases per Million']) plt.annotate('Extreme Value Threshold', xy=(1.05,157000), xytext=(1.15,157000), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02)) plt.annotate('3rd quartile', xy=(1.08,64800), xytext=(1.15,64800), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02)) plt.annotate('Median', xy=(1.08,19500), xytext=(1.15,19500), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02)) plt.annotate('1st quartile', xy=(1.08,2500), xytext=(1.15,2500), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02)) plt.title("Boxplot of Total Cases") plt.show()
This produces the following plot:
It is helpful to take a closer look at the interquartile range, specifically where the median falls within the range. For this boxplot, the median is at the lower end of the range. This is what we see in distributions with positive skews.
- Now, let's create a boxplot for the average temperature. All of the extreme values are now at the low end of the distribution. Unsurprisingly, given what we have already seen with the average temperature feature, the median line is closer to the center of the interquartile range than with our previous boxplot (we will not annotate the plot this time – we only did this last time for explanatory purposes):
plt.boxplot(landtemps.avgtemp.dropna(), labels=['Boxplot of Average Temperature']) plt.title("Average Temperature") plt.show()
This produces the following plot:
Histograms help us see the spread of a distribution, while boxplots make it easy to identify outliers. We can get a good sense of both the spread of the distribution and the outliers in one graphic with a violin plot.
Using violin plots
Violin plots combine histograms and boxplots into one plot. They show the IQR, median, and whiskers, as well as the frequency of the observations at all the value ranges.
Let's get started:
- We can use Seaborn to create violin plots of both the COVID cases per million and the average temperature features. I am using Seaborn here, rather than Matplotlib, because I prefer its default options for violin plots:
import seaborn as sns fig = plt.figure() fig.suptitle("Violin Plots of COVID Cases and Land Temperatures") ax1 = plt.subplot(2,1,1) ax1.set_xlabel("Cases per Million") sns.violinplot(data=covidtotals.total_cases_mill, color="lightblue", orient="h") ax1.set_yticklabels([]) ax2 = plt.subplot(2,1,2) ax2.set_xlabel("Average Temperature") sns.violinplot(data=landtemps.avgtemp, color="wheat", orient="h") ax2.set_yticklabels([]) plt.tight_layout() plt.show()
This produces the following plot:
The black bar with the white dot in the middle is the interquartile range, while the white dot represents the median. The height at each point (when the violin plot is horizontal) gives us the relative frequency. The thin black lines to the right of the interquartile range for cases per million, and to the right and left for the average temperature are the whiskers. The extreme values are shown in the part of the distribution beyond the whiskers.
If I am going to create just one plot for a numeric feature, I will create a violin plot. Violin plots allow me to see central tendency, shape, and spread all in one graphic. Space does not permit it here, but I usually like to create violin plots of all of my continuous features and save those to a PDF file for later reference.
Summary
In this chapter, we looked at some common techniques for exploring data. We learned how to retrieve subsets of data when that is required for our analysis. We also used pandas methods to generate key statistics on features such as mean, interquartile range, and skew. This gave us a better sense of the central tendency, spread, and shape of the distribution of each feature. It also put us in a better position to identify outliers. Finally, we used the Matplotlib and Seaborn libraries to create histograms, boxplots, and violin plots. This yielded additional insights about the distribution of features, such as the length of the tail and divergence from the normal distribution.
Visualizations are a great supplement to the tools for univariate analysis that we have discussed in this chapter. Histograms, boxplots, and violin plots display the shape and spread of each feature's distribution. Graphically, they show what we may miss by examining a few summary statistics, such as where there is a bulge (or bulges) in the distribution and where the extreme values are. These visualizations will be every bit as helpful when we explore bivariate and multivariate relationships, which we will do in Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets.