Data Cleaning and Exploration with Machine Learning

By Michael Walker
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Free Chapter
    Chapter 1: Examining the Distribution of Features and Targets
About this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results.

As you start with this book, models are carefully chosen to help you grasp the underlying data, including in-feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as you focus more on the accuracy of predictions in this book.

By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.

Publication date:
August 2022
Publisher
Packt
Pages
542
ISBN
9781803241678

 

Chapter 1: Examining the Distribution of Features and Targets

Machine learning writing and instruction are often algorithm-focused. Sometimes, this gives the impression that all we have to do is choose the right model and that organization-changing insights will follow. But the best place to begin a machine learning project is with an understanding of how the features and targets we will use are distributed.

It is important to make room for the same kind of learning from data that has been central to our work as analysts for decades – studying the distribution of variables, identifying anomalies, and examining bivariate relationships – even as we focus more and more on the accuracy of our predictions.

We will explore tools for doing so in the first three chapters of this book, while also considering implications for model building.

In this chapter, we will use common NumPy and pandas techniques to get a better sense of the attributes of our data. We want to know how key features are distributed before we do any predictive analyses. We also want to know the central tendency, shape, and spread of the distribution of each continuous feature and have a count for each value for categorical features. We will take advantage of very handy NumPy and pandas tools for generating summary statistics, such as the mean, min, and max, as well as standard deviation.

After that, we will create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.

Specifically, in this chapter, we are going to cover the following topics:

  • Subsetting data
  • Generating frequencies for categorical features
  • Generating summary statistics for continuous features
  • Identifying extreme values and outliers in univariate analysis
  • Using histograms, boxplots, and violin plots to examine the distribution of continuous features
 

Technical requirements

This chapter will rely heavily on the pandas, NumPy, and Matplotlib libraries, but you don't require any prior knowledge of these. If you have installed Python from a scientific distribution, such as Anaconda or WinPython, then these libraries are probably already installed. If you need to install one of them to run the code in this chapter, you can run pip install [package name] from a terminal.

 

Subsetting data

Almost every statistical modeling project I have worked on has required removing some data from the analysis. Often, this is because of missing values or outliers. Sometimes, there are theoretical reasons for limiting our analysis to a subset of the data. For example, we have weather data going back to 1600, but our analysis goals only involve changes in weather since 1900. Fortunately, the subsetting tools in pandas are quite powerful and flexible. We will work with data from the United States National Longitudinal Survey (NLS) of Youth in this section.

Note

The NLS of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997 who were born between 1980 and 1985, with annual follow-ups each year through 2017. For this recipe, I pulled 89 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. The NLS data is available for public use at https://www.nlsinfo.org/investigator/pages/search.

Let's start subsetting the data using pandas:

  1. We will start by loading the NLS data. We also set an index:
    import pandas as pd
    import numpy as np
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
  2. Let's select a few columns from the NLS data. The following code creates a new DataFrame that contains some demographic and employment data. A useful feature of pandas is that the new DataFrame retains the index of the old DataFrame, as shown here:
    democols = ['gender','birthyear','maritalstatus',
     'weeksworked16','wageincome','highestdegree']
    nls97demo = nls97[democols]
    nls97demo.index.name
    'personid'
  3. We can use slicing to select rows by position. nls97demo[1000:1004] selects every row, starting from the row indicated by the integer to the left of the colon (1000, in this case) up to, but not including, the row indicated by the integer to the right of the colon (1004). The row at 1000 is the 1,001st row because of zero-based indexing. Each row appears as a column in the output since we have transposed the resulting DataFrame:
    nls97demo[1000:1004].T
    personid      195884       195891        195970\
    gender        Male         Male          Female
    birthyear     1981         1980          1982
    maritalstatus NaN          Never-married Never-married
    weeksworked16 NaN          53            53
    wageincome    NaN          14,000        52,000   
    highestdegree 4.Bachelors  2.High School 4.Bachelors
    personid       195996  
    gender         Female  
    birthyear      1980  
    maritalstatus  NaN  
    weeksworked16  NaN  
    wageincome     NaN
    highestdegree  3.Associates    
  4. We can also skip rows over the interval by setting a value for the step after the second colon. The default value for the step is 1. The value for the following step is 2, which means that every other row between 1000 and 1004 will be selected:
    nls97demo[1000:1004:2].T
    personid        195884       195970
    gender          Male         Female
    birthyear       1981         1982
    maritalstatus   NaN          Never-married
    weeksworked16   NaN          53
    wageincome      NaN          52,000
    highestdegree   4.Bachelors  4. Bachelors
  5. If we do not include a value to the left of the colon, row selection will start with the first row. Notice that this returns the same DataFrame as the head method does:
    nls97demo[:3].T
    personid       100061         100139          100284
    gender         Female         Male            Male
    birthyear      1980           1983            1984
    maritalstatus  Married        Married         Never-married
    weeksworked16  48             53              47
    wageincome     12,500         120,000         58,000
    highestdegree  2.High School  2. High School  0.None
    nls97demo.head(3).T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
  6. If we use a negative number, -n, to the left of the colon, the last n rows of the DataFrame will be returned. This returns the same DataFrame as the tail method does:
     nls97demo[-3:].T
    personid       999543          999698        999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors
     nls97demo.tail(3).T
    personid       999543         999698         999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors
  7. We can select rows by index value using the loc accessor. Recall that for the nls97demo DataFrame, the index is personid. We can pass a list of the index labels to the loc accessor, such as loc[[195884,195891,195970]], to get the rows associated with those labels. We can also pass a lower and upper bound of index labels, such as loc[195884:195970], to retrieve the indicated rows:
     nls97demo.loc[[195884,195891,195970]].T
    personid       195884       195891         195970
    gender         Male         Male           Female
    birthyear      1981         1980           1982
    maritalstatus  NaN          Never-married  Never-married
    weeksworked16  NaN          53             53
    wageincome     NaN          14,000         52,000
    highestdegree  4.Bachelors  2.High School  4.Bachelors
     nls97demo.loc[195884:195970].T
    personid       195884       195891         195970
    gender         Male         Male           Female
    birthyear      1981         1980           1982
    maritalstatus  NaN          Never-married  Never-married
    weeksworked16  NaN          53             53
    wageincome     NaN          14,000         52,000
    highestdegree  4.Bachelors  2.High School  4.Bachelors
  8. To select rows by position, rather than by index label, we can use the iloc accessor. We can pass a list of position numbers, such as iloc[[0,1,2]], to the accessor to get the rows at those positions. We can pass a range, such as iloc[0:3], to get rows between the lower and upper bound, not including the row at the upper bound. We can also use the iloc accessor to select the last n rows. iloc[-3:] selects the last three rows:
     nls97demo.iloc[[0,1,2]].T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
     nls97demo.iloc[0:3].T
    personid       100061         100139         100284
    gender         Female         Male           Male
    birthyear      1980           1983           1984
    maritalstatus  Married        Married        Never-married
    weeksworked16  48             53             47
    wageincome     12,500         120,000        58,000
    highestdegree  2.High School  2.High School  0. None
     nls97demo.iloc[-3:].T
    personid       999543         999698         999963
    gender         Female         Female         Female
    birthyear      1984           1983           1982
    maritalstatus  Divorced       Never-married  Married
    weeksworked16  0              0              53
    wageincome     NaN            NaN            50,000
    highestdegree  2.High School  2.High School  4. Bachelors

Often, we need to select rows based on a column value or the values of several columns. We can do this in pandas by using Boolean indexing. Here, we pass a vector of Boolean values (which can be a Series) to the loc accessor or the bracket operator. The Boolean vector needs to have the same index as the DataFrame.

  1. Let's try this using the nightlyhrssleep column on the NLS DataFrame. We want a Boolean Series that is True for people who sleep 6 or fewer hours a night (the 33rd percentile) and False if nightlyhrssleep is greater than 6 or is missing. sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold creates the boolean Series. If we display the first few values of sleepcheckbool, we will see that we are getting the expected values. We can also confirm that the sleepcheckbool index is equal to the nls97 index:
    nls97.nightlyhrssleep.head()
    personid
    100061     6
    100139     8
    100284     7
    100292     nan
    100583     6
    Name: nightlyhrssleep, dtype: float64
    lowsleepthreshold = nls97.nightlyhrssleep.quantile(0.33)
    lowsleepthreshold
    6.0
    sleepcheckbool = nls97.nightlyhrssleep<=lowsleepthreshold
    sleepcheckbool.head()
    personid
    100061    True
    100139    False
    100284    False
    100292    False
    100583    True
    Name: nightlyhrssleep, dtype: bool
    sleepcheckbool.index.equals(nls97.index)
    True

Since the sleepcheckbool Series has the same index as nls97, we can just pass it to the loc accessor to create a DataFrame containing people who sleep 6 hours or less a night. This is a little pandas magic here. It handles the index alignment for us:

lowsleep = nls97.loc[sleepcheckbool]
lowsleep.shape
(3067, 88)
  1. We could have created the lowsleep subset of our data in one step, which is what we would typically do unless we need the Boolean Series for some other purpose:
    lowsleep = nls97.loc[nls97.nightlyhrssleep<=lowsleepthreshold]
    lowsleep.shape
    (3067, 88)
  2. We can pass more complex conditions to the loc accessor and evaluate the values of multiple columns. For example, we can select rows where nightlyhrssleep is less than or equal to the threshold and childathome (number of children living at home) is greater than or equal to 3:
    lowsleep3pluschildren = \
      nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold)
        & (nls97.childathome>=3)]
    lowsleep3pluschildren.shape
    (623, 88)

Each condition in nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold) & (nls97.childathome>3)] is placed in parentheses. An error will be generated if the parentheses are excluded. The & operator is the equivalent of and in standard Python, meaning that both conditions have to be True for the row to be selected. We could have used | for or if we wanted to select the row if either condition was True.

  1. Finally, we can select rows and columns at the same time. The expression to the left of the comma selects rows, while the list to the right of the comma selects columns:
    lowsleep3pluschildren = \
      nls97.loc[(nls97.nightlyhrssleep<=lowsleepthreshold)
        & (nls97.childathome>=3),
        ['nightlyhrssleep','childathome']]
    lowsleep3pluschildren.shape
    (623, 2)

We used three different tools to select columns and rows from a pandas DataFrame in the last two sections: the [] bracket operator and two pandas-specific accessors, loc and iloc. This will be a little confusing if you are new to pandas, but it becomes clear which tool to use in which situation after just a few months. If you came to pandas with a fair bit of Python and NumPy experience, you will likely find the [] operator most familiar. However, the pandas documentation recommends against using the [] operator for production code. The loc accessor is used for selecting rows by Boolean indexing or by index label, while the iloc accessor is used for selecting rows by row number.

This section was a brief primer on selecting columns and rows with pandas. Although we did not go into too much detail on this, most of what you need to know to subset data was covered, as well as everything you need to know to understand the pandas-specific material in the rest of this book. We will start putting some of that to work in the next two sections by creating frequencies and summary statistics for our features.

 

Generating frequencies for categorical features

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values, and are either strings or are numerical without having any intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical but it does not make sense to perform arithmetic operations on those values.

Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.

Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

  1. Let's load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
    nls97abb = nls97.iloc[:,:20]
    nls97abb.dtypes
    gender                   object
    birthmonth               int64
    birthyear                int64
    highestgradecompleted    float64
    maritalstatus            object
    childathome              float64
    childnotathome           float64
    wageincome               float64
    weeklyhrscomputer        object
    weeklyhrstv              object
    nightlyhrssleep          float64
    satverbal                float64
    satmath                  float64
    gpaoverall               float64
    gpaenglish               float64
    gpamath                  float64
    gpascience               float64
    highestdegree            object
    govprovidejobs           object
    govpricecontrols         object
    dtype: object

    Note

    Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.

  2. All of the object type columns in the preceding code are categorical. We can use value_counts to see the counts for each value for maritalstatus. We can also use dropna=False to get value_counts to show the missing values (NaN):
    nls97abb.maritalstatus.value_counts(dropna=False)
    Married          3066
    Never-married    2766
    NaN              2312
    Divorced         663
    Separated        154
    Widowed          23
    Name: maritalstatus, dtype: int64
  3. If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it will interpret True values as 1 and False values as 0:
    nls97abb.maritalstatus.isnull().sum()
    2312
  4. You have probably noticed that the maritalstatus values were sorted by frequency by default. You can sort them alphabetically by values by sorting the index. We can do this by taking advantage of the fact that value_counts returns a Series with the values as the index:
    marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
    type(marstatcnt)
    <class 'pandas.core.series.Series'>
    marstatcnt.index
    Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
  5. To sort the index, we just need to call sort_index:
    marstatcnt.sort_index()
    Divorced         663
    Married          3066
    Never-married    2766
    Separated        154
    Widowed          23
    NaN              2312
    Name: maritalstatus, dtype: int64
  6. Of course, we could have gotten the same results in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we did not set dropna to True, so missing values have been excluded):
    nls97.maritalstatus.\
      value_counts(normalize=True, dropna=False).\
         sort_index()
     
    Divorced             0.07
    Married              0.34
    Never-married        0.31
    Separated            0.02
    Widowed              0.00
    NaN                  0.26
    Name: maritalstatus, dtype: float64
  7. pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values. Since we already know that all of our object columns contain categorical data, we should convert those columns into the category data type. In the following code, we're creating a list that contains the column names for the object columns, catcols. Then, we're looping through those columns and using astype to change the data type to category:
    catcols = nls97abb.select_dtypes(include=["object"]).columns
    for col in nls97abb[catcols].columns:
    ...      nls97abb[col] = nls97abb[col].astype('category')
    ... 
    nls97abb[catcols].dtypes
    gender                   category
    maritalstatus            category
    weeklyhrscomputer        category
    weeklyhrstv              category
    highestdegree            category
    govprovidejobs           category
    govpricecontrols         category
    dtype: object
  8. Let's check our category features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:
    nls97abb[catcols].isnull().sum()
    gender               0
    maritalstatus        2312
    weeklyhrscomputer    2274
    weeklyhrstv          2273
    highestdegree        31
    govprovidejobs       7151
    govpricecontrols     7125
    dtype: int64
  9. We can generate frequencies for multiple features at once by passing a value_counts call to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:
     nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.25                0.54
    2. Probably                0.34                0.33
    3. Probably not            0.25                0.09
    4. Definitely not          0.16                0.04
  10. We can use the same frequencies on a subset of our data. If, for example, we want to see the responses of only married people to the government role questions, we can do that subsetting by placing nls97abb[nls97abb.maritalstatus=="Married"] before filter:
     nls97abb.loc[nls97abb.maritalstatus=="Married"].\
     filter(like="gov").\
       apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.17                0.46
    2. Probably                0.33                0.38
    3. Probably not            0.31                0.11
    4. Definitely not          0.18                0.05
  11. Since, in this case, there were only two gov columns, it may have been easier to do the following:
     nls97abb.loc[nls97abb.maritalstatus=="Married",
       ['govprovidejobs','govpricecontrols']].\
       apply(pd.value_counts, normalize=True)
                      govprovidejobs     govpricecontrols
    1. Definitely               0.17                 0.46
    2. Probably                 0.33                 0.38
    3. Probably not             0.31                 0.11
    4. Definitely not           0.18                 0.05

Nonetheless, it will often be easier to use filter since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.

There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted is missing.

  1. We need to do a little bit of cleaning up first, though. highestgradecompleted has two logical missing values – an actual NaN value that pandas recognizes as missing and a 95 value that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on:
    nls97abb.highestgradecompleted.\
      replace(95, np.nan, inplace=True)
  2. We can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the min and max values of highestgradecompleted at each level of highschoolgrad:
    nls97abb['highschoolgrad'] = \
      np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \
      np.where(nls97abb.highestgradecompleted<12,0,1))
     
    nls97abb.groupby(['highschoolgrad'], dropna=False) \
      ['highestgradecompleted'].agg(['min','max','size'])
                      min       max       size
    highschoolgrad                
    0                   5        11       1231
    1                  12        20       5421
    nan               nan       nan       2332
     nls97abb['highschoolgrad'] = \
    ...  nls97abb['highschoolgrad'].astype('category')

While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, this would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with highschoolgrad equal to 1 class being more than 4 times the size of the 0 group. We should explore using more groups to represent highestgradecompleted.

  1. One way to do this with pandas is with the qcut function. We can set the q parameter of qcut to 6 to create six groups that are as evenly distributed as possible. These groups are now closer to being balanced:
    nls97abb['highgradegroup'] = \
      pd.qcut(nls97abb['highestgradecompleted'], 
       q=6, labels=[1,2,3,4,5,6])
     
    nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
        agg(['min','max','size'])
                      min         max      size
    highgradegroup                
    1                   5          11       1231
    2                  12          12       1389
    3                  13          14       1288
    4                  15          16       1413
    5                  17          17        388
    6                  18          20        943
    nls97abb['highgradegroup'] = \
        nls97abb['highgradegroup'].astype('category')
  2. Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs value_counts:
     freqout = open('views/frequencies.txt', 'w') 
     for col in nls97abb.select_dtypes(include=["category"]):
          print(col, "----------------------",
            "frequencies",
          nls97abb[col].value_counts(dropna=False).sort_index(),
            "percentages",
          nls97abb[col].value_counts(normalize=True).\
            sort_index(),
          sep="\n\n", end="\n\n\n", file=freqout)
     
     freqout.close()

These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to create frequencies a Series at a time, use it with apply for multiple columns, or iterate over several columns and call value_counts each time. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.

 

Generating summary statistics for continuous and discrete features

Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.

Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.

For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages has largely the same meaning between 25 and 26 as it has between 35 and 36, though this breaks down at the extremes. The interval between 1 and 2 years of age for humans is not at all like that between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when that is true.

To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.

We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.

Note

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.

Follow these steps to generate the summary statistics:

  1. Let's load the COVID .csv file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set, iso_code, contains a unique value for each row. We use sample to view two countries randomly, rather than the first two (we set a value for random_state to get the same results each time we run the code):
    import pandas as pd
    import numpy as np
    import scipy.stats as scistat
    covidtotals = pd.read_csv("data/covidtotals.csv",
        parse_dates=['lastdate'])
    covidtotals.set_index("iso_code", inplace=True)
    covidtotals.shape
    (221, 16)
    covidtotals.index.nunique()
    221
    covidtotals.sample(2, random_state=6).T
    iso_code                         ISL               CZE
    lastdate                  2021-07-07        2021-07-07
    location                     Iceland           Czechia
    total_cases                    6,555         1,668,277
    total_deaths                      29            30,311
    total_cases_mill              19,209           155,783
    total_deaths_mill                 85             2,830
    population                   341,250        10,708,982
    population_density                 3               137
    median_age                        37                43
    gdp_per_capita                46,483            32,606
    aged_65_older                     14                19
    total_tests_thous                NaN               NaN
    life_expectancy                   83                79
    hospital_beds_thous                3                 7
    diabetes_prevalence                5                 7
    region                Western Europe    Western Europe

Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even in terms of population size. (total_cases_mill and total_deaths_mill divide cases and deaths per million of population, respectively.) Data analysts are very used to wondering if there is anything else in the data that may explain substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.

  1. Let's take a look at the data types and number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data on cases and deaths, as well as likely targets, for 192 and 185 rows, respectively. An important data cleaning task we'll have to do will be figuring out what, if anything, we can do about countries that have missing values for our targets. We'll discuss how to handle missing values later:
    covidtotals.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 221 entries, AFG to ZWE
    Data columns (total 16 columns):
     #   Column             Non-Null Count         Dtype 
    ---  -------            --------------  --------------
     0   lastdate             221 non-null  datetime64[ns]
     1   location             221 non-null          object
     2   total_cases          192 non-null         float64
     3   total_deaths         185 non-null         float64
     4   total_cases_mill     192 non-null         float64
     5   total_deaths_mill    185 non-null         float64
     6   population           221 non-null         float64
     7   population_density   206 non-null         float64
     8   median_age           190 non-null         float64
     9   gdp_per_capita       193 non-null         float64
     10  aged_65_older        188 non-null         float64
     11  total_tests_thous     13 non-null         float64
     12  life_expectancy      217 non-null         float64
     13  hospital_beds_thous  170 non-null         float64
     14  diabetes_prevalence  200 non-null         float64
     15  region               221 non-null          object
    dtypes: datetime64[ns](1), float64(13), object(2)
    memory usage: 29.4+ KB
  2. Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the describe method. The mean and median (50%) are good indicators of the center of the distribution, each with its strengths. It is also good to notice substantial differences between the mean and median, as an indication of skewness. For example, we can see that the mean cases per million is almost twice the median, with 36.7 thousand compared to 19.5 thousand. This is a clear indicator of positive skew. This is also true for deaths per million.

The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times larger than the 25th percentile value in both cases. We can compare that with the percentage of the population aged 65 and older and diabetes prevalence, where the 75th percentile is just four times or two times that of the 25th percentile, respectively. We can tell right away that those two possible features (aged_65_older and diabetes_prevalence) would have to do a lot of work to explain the huge variance in our targets:

 keyvars = ['location','total_cases_mill','total_deaths_mill',
...  'aged_65_older','diabetes_prevalence']
 covidkeys = covidtotals[keyvars]
 covidkeys.describe()
total_cases_mill total_deaths_mill aged_65_older diabetes_prevalence
count        192.00       185.00    188.00     200.00
mean      36,649.37       683.14      8.61       8.44
std       41,403.98       861.73      6.12       4.89
min            8.52         0.35      1.14       0.99
25%        2,499.75        43.99      3.50       5.34
50%       19,525.73       293.50      6.22       7.20
75%       64,834.62     1,087.89     13.92      10.61
max      181,466.38     5,876.01     27.05      30.53
  1. I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The quantile method can take a single value for quantile, such as quantile(0.25) for the 25th percentile, or a list or tuple, such as quantile((0.25,0.5)) for the 25th and 50th percentiles. In the following code, we're using arange from NumPy (np.arange(0.0, 1.1, 0.1)) to generate an array that goes from 0.0 to 1.0 with a 0.1 increment. We would get the same result if we were to use covidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]):
     covidkeys.quantile(np.arange(0.0, 1.1, 0.1))
          total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
    0.00         8.52       0.35       1.14     0.99
    0.10       682.13      10.68       2.80     3.30
    0.20     1,717.39      30.22       3.16     4.79
    0.30     3,241.84      66.27       3.86     5.74
    0.40     9,403.58      145.06      4.69     6.70
    0.50     19,525.73     293.50      6.22     7.20
    0.60     33,636.47     556.43      7.93     8.32
    0.70     55,801.33     949.71     11.19    10.08
    0.80     74,017.81    1,333.79    14.92    11.62
    0.90     94,072.18    1,868.89    18.85    13.75
    1.00    181,466.38    5,876.01    27.05    30.53

For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.

  1. Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions, that they are distributed symmetrically (have low skew), and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets – that is, total cases and deaths per million people in the population. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. For a Gaussian distribution, we expect a value near 0 for skew and 3 for kurtosis. total_deaths_mill has values for skew and kurtosis that are worth noting, and the total_cases_mill and aged_65_older features have excessively low kurtosis (skinny tails):
    covidkeys.skew()
    total_cases_mill        1.21
    total_deaths_mill       2.00
    aged_65_older           0.84
    diabetes_prevalence     1.52
    dtype: float64
     covidkeys.kurtosis()
    total_cases_mill        0.91
    total_deaths_mill       6.58
    aged_65_older          -0.56
    diabetes_prevalence     3.31
    dtype: float64
  2. We can also explicitly test each distribution's normality by looping over the features in the keyvars list and running a Shapiro-Wilk test on the distribution (scistat.shapiro(covidkeys[var].dropna())). Notice that we need to drop missing values with dropna for the test to run. p-values less than 0.05 indicate that we can reject the null hypothesis of normal, which is the case for each of the four features:
    for var in keyvars[1:]:
          stat, p = scistat.shapiro(covidkeys[var].dropna())
          print("feature=", var, "     p-value=", '{:.6f}'.format(p))
     
    feature= total_cases_mill       p-value= 0.000000
    feature= total_deaths_mill      p-value= 0.000000
    feature= aged_65_older          p-value= 0.000000
    feature= diabetes_prevalence    p-value= 0.000000

These results should make us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.

We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.

This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. This also points us in the direction of more work we need to do to prepare our data for analysis. We need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features. This rarely fails to yield additional insights. We will learn how to identify outliers in the next section and create visualizations in the following section.

 

Identifying extreme values and outliers in univariate analysis

An outlier can be thought of as an observation with feature values, or relationships between feature values, that are so unusual that they cannot help explain relationships in the rest of the data. This matters for modeling because we cannot assume that the outliers will have a neutral impact on our parameter estimates. Sometimes, our models work so hard to construct parameter estimates that can account for patterns in outlier observations that we compromise the model's explanatory or predictive power for all other observations. Raise your hand if you have ever spent days trying to interpret a model only to discover that your coefficients and predictions completely changed once you removed a few outliers.

I should quickly add that there is no agreed-upon definition of an outlier. I offer the preceding definition for use in this book because it helps us distinguish between outliers, as I have described them, and extreme values. There is a fair bit of overlap between the two, but many extreme values are not outliers. This is because such values reflect a natural and explainable trend in a feature, or because they reflect the same relationship between features as is observed throughout the data. The reverse is also true. Some outliers are not extreme values. For example, a target value might be right in the middle of the distribution but have quite unexpected predictor values.

For our modeling, then, it is hard to say that a particular feature or target value is an outlier without referencing multivariate relationships. But it should at least raise a red flag when, in our univariate analysis, we see values well to the left or right of the center. This should prompt us to investigate the observation at that value further, including examining the values of other features. We will look at multivariate relationships in more detail in the next two chapters. Here, and in the next section on visualizations, we will focus on identifying extreme values and outliers when looking at one variable.

A good starting point for identifying an extreme value is to look at its distance from the middle of the distribution. One common method for doing that is to calculate each value's distance from the interquartile range (IQR), which is the distance between the first quartile value and the third quartile value. We often flag any value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile. We can use this method to identify outliers in the COVID-19 data.

Let's get started:

  1. Let's start by importing the libraries we will need. In addition to pandas and NumPy, we will use Matplotlib and statsmodels for the plots we will create. We will also load the COVID data and select the variables we need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    covidtotals = pd.read_csv("data/covidtotals.csv")
    covidtotals.set_index("iso_code", inplace=True)
    keyvars = ['location','total_cases_mill','total_deaths_mill',
      'aged_65_older','diabetes_prevalence','gdp_per_capita']
    covidkeys = covidtotals[keyvars]
  2. Let's take a look at total_cases_mill. We get the first and third quartile values and calculate the interquartile range, 1.5*(thirdq-firstq). Then, we calculate their thresholds to determine the high and low extreme values, which are interquartilerange+thirdq and firstq-interquartilerange, respectively (if you are familiar with boxplots, you will notice that this is the same calculation that's used for the whiskers of a boxplot; we will cover boxplots in the next section):
    thirdq, firstq = covidkeys.total_cases_mill.quantile(0.75), covidkeys.total_cases_mill.quantile(0.25)
    interquartilerange = 1.5*(thirdq-firstq)
    extvalhigh, extvallow = interquartilerange+thirdq, firstq-interquartilerange
    print(extvallow, extvalhigh, sep=" <--> ")
    -91002.564625 <--> 158336.930375
  3. This calculation indicates that any value for total_cases_mill that's above 158,337 can be considered extreme. We can ignore extreme values on the low end because they would be negative:
    covidtotals.loc[covidtotals.total_cases_mill>extvalhigh].T
    iso_code                   AND         MNE         SYC
    lastdate            2021-07-07  2021-07-07  2021-07-07
    location            Andorra     Montenegro  Seychelles
    total_cases          14,021        100,392      16,304
    total_deaths            127          1,619          71
    total_cases_mill    181,466        159,844     165,792
    total_deaths_mill     1,644          2,578         722
    population           77,265        628,062      98,340
    population_density      164             46         208
    median_age              NaN             39          36
    gdp_per_capita          NaN         16,409      26,382
    aged_65_older           NaN             15           9
    total_tests_thous       NaN            NaN         NaN
    life_expectancy          84             77          73
    hospital_beds_thous     NaN              4           4
    diabetes_prevalence       8             10          11
    region              Western        Eastern        East 
                         Europe         Europe      Africa
  4. Andorra, Montenegro, and Seychelles all have total_cases_mill above the threshold amount. This invites us to explore other ways these countries might be exceptional, and whether our features can capture that. We will not dive deeply into multivariate analysis here since we will do that in the next chapter, but it is a good idea to start wrapping our brains around why these extreme values may or may not make sense. Having some means across the full dataset may help us here:
    covidtotals.mean()
    total_cases               963,933
    total_deaths              21,631
    total_cases_mill          36,649
    total_deaths_mill         683
    population                35,134,957
    population_density        453
    median_age                30
    gdp_per_capita            19,141
    aged_65_older             9
    total_tests_thous         535
    life_expectancy           73
    hospital_beds_thous       3
    diabetes_prevalence       8 

The main difference between these three countries and others is that they have very low populations. Surprisingly, each has a much lower population density than average. That is the opposite of what you would expect and merits further consideration in the analysis we will do throughout this book.

An Alternative to the IQR Calculation

An alternative to using the interquartile range to identify an extreme value would be to use several standard deviations away from the mean, say 3. One drawback of this method is that it is a little more susceptible to extreme values than using the interquartile range.

I find it helpful to produce this kind of analysis for all the key targets and features in my data, so let's automate this method of identifying extreme values. We should also output the results to a file so that we can use them when we need them:

  1. Let's define a function, getextremevalues, that iterates over all of the columns of our DataFrame (except for the first one, which contains the location column), calculates the interquartile range for that column, selects all the observations with values above the high threshold or below the low threshold for that column, and then appends the results to a new DataFrame (dfout):
    def getextremevalues(dfin):
          dfout = pd.DataFrame(columns=dfin.columns, 
                               data=None)
          for col in dfin.columns[1:]:
            thirdq, firstq = dfin[col].quantile(0.75), \
              dfin[col].quantile(0.25)
            interquartilerange = 1.5*(thirdq-firstq)
            extvalhigh, extvallow = \
              interquartilerange+thirdq, \
              firstq-interquartilerange
            df = dfin.loc[(dfin[col]>extvalhigh) | (dfin[col]<extvallow)]
            df = df.assign(varname = col,
              threshlow = extvallow,
              threshhigh = extvalhigh)
            dfout = pd.concat([dfout, df])
        return dfout
  2. Now, we can pass our covidkeys DataFrame to the getextremevalues function to get a DataFrame that contains the extreme values for each column. Then, we can display the number of extreme values for each column, which tells us that there were four extreme values for the total deaths per million people in the population (total_deaths_mill). Now, we can output the data to an Excel file:
    extremevalues = getextremevalues(covidkeys)
    extremevalues.varname.value_counts()
    gdp_per_capita         9
    diabetes_prevalence    8
    total_deaths_mill      4
    total_cases_mill       3
    Name: varname, dtype: int64
    extremevalues.to_excel("views/extremevaluescases.xlsx")
  3. Let's take a closer look at the extreme deaths per million values. We can query the DataFrame we just created to get the threshhigh value for total_deaths_mill, which is 2654. We can also get other key features for those countries with the extreme values since we have included that data in the new DataFrame:
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
        'threshhigh'][0]
    2653.752
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
          keyvars].sort_values(['total_deaths_mill'], ascending=False)
          location              total_cases_mill  total_deaths_mill
    PER   Peru                    62,830            5,876
    HUN   Hungary                 83,676            3,105
    BIH   Bosnia and Herzegovina  62,499            2,947
    CZE   Czechia                 155,783           2,830
         _65_older  diabetes_prevalence  gdp_per_capita  
    PER      7              6                12,237
    HUN     19              8                26,778
    BIH     17             10                11,714
    CZE     19              7                32,606

Peru, Hungary, Bosnia and Herzegovina, and Czechia have total_deaths_mill above the extreme value threshold. One thing that stands out for three of these countries is how much above the average for percent of population 65 or older they are as well (the average for that feature, as we displayed in a preceding table, is 9). Although these are extreme values for deaths, the relationship between the elderly percentage of the population and COVID deaths may account for much of this and can do so without overfitting the model to these extreme cases. We will go through some strategies for teasing that out in the next chapter.

So far, we have discussed outliers and extreme values without referencing distribution shape. What we have implied so far is that an extreme value is a rare value – significantly rarer than the values near the center of the distribution. But this makes the most sense when the feature's distribution approaches normal. If, on the other hand, a feature had a uniform distribution, a very high value would be no rarer than any other value.

In practice, then, we think about extreme values or outliers relative to the distribution of the feature. Quantile-quantile (Q-Q) plots can improve our sense of that distribution by allowing us to view it graphically relative to a theoretical distribution: normal, uniform, log, or others. Let's take a look:

  1. Let's create a Q-Q plot of total cases per million that's relative to the normal distribution. We can use the statsmodels library for this:
    sm.qqplot(covidtotals[['total_cases_mill']]. \
      sort_values(['total_cases_mill']).dropna(),line='s')
    )
    plt.show()

This produces the following plot:

Figure 1.1 – Q-Q plot of total cases per million

Figure 1.1 – Q-Q plot of total cases per million

This Q-Q plot makes it clear that the distribution of total cases across countries is not normal. We can see this by how much the data points deviate from the red line. It is a Q-Q plot that we would expect from a distribution with some positive skew. This is consistent with the summary statistics we have already calculated for the total cases feature. It further reinforces our developing sense, in that we will need to be cautious about parametric models and that we will probably have to account for more than just one or two outlier observations.

Let's look at a Q-Q plot for a feature with a distribution that is a little closer to normal. There isn't a great candidate in the COVID data, so we will work with data from the United States National Oceanic and Atmospheric Administration on land temperatures in 2019.

Data Note

The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-2.

  1. First, let's load the data into a pandas DataFrame and run some descriptive statistics on the temperature feature, avgtemp. We must add a few percentile statistics to the normal describe output to get a better sense of the range of values. avgtemp is the average temperature for the year at each of the 12,095 weather stations. The average temperature across all stations was 11.2 degrees Celsius. The median was 10.4. However, there were some very negative values, including 14 weather stations with an average temperature of less than -25. This contributes to a moderately negative skew, though both the skew and kurtosis are closer to what would be expected from a normal distribution:
    landtemps = pd.read_csv("data/landtemps2019avgs.csv")
    landtemps.avgtemp.describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
    count        12,095.0
    mean             11.2
    std               8.6
    min             -60.8
    5%               -0.7
    10%               1.7
    25%               5.4
    50%              10.4
    75%              16.9
    90%              23.1
    95%              27.0
    max              33.9
    Name: avgtemp, dtype: float64
     landtemps.loc[landtemps.avgtemp<-25,'avgtemp'].count()
    14
    landtemps.avgtemp.skew()
    -0.2678382583481768
    landtemps.avgtemp.kurtosis()
    2.1698313707061073
  2. Now, let's take a look at a Q-Q plot of the average temperature:
    sm.qqplot(landtemps.avgtemp.sort_values().dropna(), line='s')
    plt.tight_layout()
    plt.show()

This produces the following plot:

Figure 1.2 – Q-Q plot of average temperatures

Figure 1.2 – Q-Q plot of average temperatures

Along most of the range, the distribution of average temperatures looks pretty close to normal. The exceptions are the extremely low temperatures, contributing to a small amount of negative skew. There is also some deviation from normal at the high end, though this is much less of an issue (you may have noticed that Q-Q plots for features with negative skew have an umbrella-like shape, while those with positive skews, such as total cases, have more of a bowl-like shape).

We are off to a good start in our efforts to understand the distribution of possible features and targets and, to a related effort, to identify extreme values and outliers. This is important information to have at our fingertips when we construct, refine, and interpret our models. But there is more that we can do to improve our intuition about the data. A good next step is to construct visualizations of key features.

 

Using histograms, boxplots, and violin plots to examine the distribution of features

We have already generated many of the numbers that would make up the data points of a histogram or boxplot. But we often improve our understanding of the data when we see it represented graphically. We see observations bunched around the mean, we notice the size of the tails, and we see what seem to be extreme values.

Using histograms

Follow these steps to create a histogram:

  1. We will work with both the COVID data and the temperatures data in this section. In addition to the libraries we have worked with so far, we must import Seaborn to create some plots more easily than we could in Matplotlib:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    landtemps = pd.read_csv("data/landtemps2019avgs.csv")
    covidtotals = pd.read_csv("data/covidtotals.csv", parse_dates=["lastdate"])
    covidtotals.set_index("iso_code", inplace=True)
  2. Now, let's create a simple histogram. We can use Matplotlib's hist method to create a histogram of total cases per million. We will also draw lines for the mean and median:
    plt.hist(covidtotals['total_cases_mill'], bins=7)
    plt.axvline(covidtotals.total_cases_mill.mean(), color='red',
       linestyle='dashed', linewidth=1, label='mean')
    plt.axvline(covidtotals.total_cases_mill.median(), color='black',
       linestyle='dashed', linewidth=1, label='median')
    plt.title("Total COVID Cases")
    plt.xlabel('Cases per Million')
    plt.ylabel("Number of Countries")
    plt.legend()
    plt.show()

This produces the following plot:

Figure 1.3 – Total COVID cases

Figure 1.3 – Total COVID cases

One aspect of the total distribution that this histogram highlights is that most countries (more than 100 of the 192) are in the very first bin, between 0 cases per million and 25,000 cases per million. Here, we can see the positive skew, with the mean pulled to the right by extreme high values. This is consistent with what we discovered when we used Q-Q plots in the previous section.

  1. Let's create a histogram of average temperatures from the land temperatures dataset:
    plt.hist(landtemps['avgtemp'])
    plt.axvline(landtemps.avgtemp.mean(), color='red', linestyle='dashed', linewidth=1, label='mean')
    plt.axvline(landtemps.avgtemp.median(), color='black', linestyle='dashed', linewidth=1, label='median')
    plt.title("Average Land Temperatures")
    plt.xlabel('Average Temperature')
    plt.ylabel("Number of Weather Stations")
    plt.legend()
    plt.show()

This produces the following plot:

Figure 1.4 – Average land temperatures

Figure 1.4 – Average land temperatures

The histogram for the average land temperatures from the land temperatures dataset looks quite different. Except for a few highly negative values, this distribution looks closer to normal. Here, we can see that the mean and the median are quite close and that the distribution looks fairly symmetrical.

  1. We should take a look at the observations at the extreme left of the distribution. They are all in Antarctica or the extreme north of Canada. Here, we have to wonder if it makes sense to include observations with such extreme values in the models we construct. However, it would be premature to make that determination based on these results alone. We will come back to this in the next chapter when we examine multivariate techniques for identifying outliers:
     landtemps.loc[landtemps.avgtemp<-25,['station','country','avgtemp']].\
    ...  sort_values(['avgtemp'], ascending=True)
          station               country          avgtemp
    827   DOME_PLATEAU_DOME_A   Antarctica      -60.8
    830   VOSTOK                Antarctica      -54.5
    837   DOME_FUJI             Antarctica      -53.4
    844   DOME_C_II             Antarctica      -50.5
    853   AMUNDSEN_SCOTT        Antarctica      -48.4
    842   NICO                  Antarctica      -48.4
    804   HENRY                 Antarctica      -47.3
    838   RELAY_STAT            Antarctica      -46.1
    828   DOME_PLATEAU_EAGLE    Antarctica      -43.0
    819   KOHNENEP9             Antarctica      -42.4
    1299  FORT_ROSS             Canada          -30.3
    1300  GATESHEAD_ISLAND      Canada          -28.7
    811   BYRD_STATION          Antarctica      -25.8
    816   GILL                  Antarctica      -25.5

An excellent way to visualize central tendency, spread, and outliers at the same time is with a boxplot.

Using boxplots

Boxplots show us the interquartile range, with whiskers representing 1.5 times the interquartile range, and data points beyond that range that can be considered extreme values. If this calculation seems familiar, it's because it's the same one we used earlier in this chapter to identify extreme values! Let's get started:

  1. We can use the Matplotlib boxplot method to create a boxplot of total cases per million people in the population. We can draw arrows to show the interquartile range (the first quartile, median, and third quartile) and the extreme value threshold. The three circles above the threshold can be considered extreme values. The line from the interquartile range to the extreme value threshold is typically referred to as the whisker. There are usually whiskers above and below the interquartile range, but the threshold value below the first quartile value would be negative in this case:
    plt.boxplot(covidtotals.total_cases_mill.dropna(), labels=['Total Cases per Million'])
    plt.annotate('Extreme Value Threshold', xy=(1.05,157000), xytext=(1.15,157000), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
    plt.annotate('3rd quartile', xy=(1.08,64800), xytext=(1.15,64800), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
    plt.annotate('Median', xy=(1.08,19500), xytext=(1.15,19500), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
    plt.annotate('1st quartile', xy=(1.08,2500), xytext=(1.15,2500), size=7, arrowprops=dict(facecolor='black', headwidth=2, width=0.5, shrink=0.02))
    plt.title("Boxplot of Total Cases")
    plt.show()

This produces the following plot:

Figure 1.5 – Boxplot of total cases

Figure 1.5 – Boxplot of total cases

It is helpful to take a closer look at the interquartile range, specifically where the median falls within the range. For this boxplot, the median is at the lower end of the range. This is what we see in distributions with positive skews.

  1. Now, let's create a boxplot for the average temperature. All of the extreme values are now at the low end of the distribution. Unsurprisingly, given what we have already seen with the average temperature feature, the median line is closer to the center of the interquartile range than with our previous boxplot (we will not annotate the plot this time – we only did this last time for explanatory purposes):
     plt.boxplot(landtemps.avgtemp.dropna(), labels=['Boxplot of Average Temperature'])
     plt.title("Average Temperature")
     plt.show()

This produces the following plot:

Figure 1.6 – Boxplot of average temperature

Figure 1.6 – Boxplot of average temperature

Histograms help us see the spread of a distribution, while boxplots make it easy to identify outliers. We can get a good sense of both the spread of the distribution and the outliers in one graphic with a violin plot.

Using violin plots

Violin plots combine histograms and boxplots into one plot. They show the IQR, median, and whiskers, as well as the frequency of the observations at all the value ranges.

Let's get started:

  1. We can use Seaborn to create violin plots of both the COVID cases per million and the average temperature features. I am using Seaborn here, rather than Matplotlib, because I prefer its default options for violin plots:
    import seaborn as sns
    fig = plt.figure()
    fig.suptitle("Violin Plots of COVID Cases and Land Temperatures")
    ax1 = plt.subplot(2,1,1)
    ax1.set_xlabel("Cases per Million")
    sns.violinplot(data=covidtotals.total_cases_mill, color="lightblue", orient="h")
    ax1.set_yticklabels([])
    ax2 = plt.subplot(2,1,2)
    ax2.set_xlabel("Average Temperature")
    sns.violinplot(data=landtemps.avgtemp, color="wheat",  orient="h")
    ax2.set_yticklabels([])
    plt.tight_layout()
    plt.show()

This produces the following plot:

Figure 1.7 – Violin plots of COVID cases and land temperatures

Figure 1.7 – Violin plots of COVID cases and land temperatures

The black bar with the white dot in the middle is the interquartile range, while the white dot represents the median. The height at each point (when the violin plot is horizontal) gives us the relative frequency. The thin black lines to the right of the interquartile range for cases per million, and to the right and left for the average temperature are the whiskers. The extreme values are shown in the part of the distribution beyond the whiskers.

If I am going to create just one plot for a numeric feature, I will create a violin plot. Violin plots allow me to see central tendency, shape, and spread all in one graphic. Space does not permit it here, but I usually like to create violin plots of all of my continuous features and save those to a PDF file for later reference.

 

Summary

In this chapter, we looked at some common techniques for exploring data. We learned how to retrieve subsets of data when that is required for our analysis. We also used pandas methods to generate key statistics on features such as mean, interquartile range, and skew. This gave us a better sense of the central tendency, spread, and shape of the distribution of each feature. It also put us in a better position to identify outliers. Finally, we used the Matplotlib and Seaborn libraries to create histograms, boxplots, and violin plots. This yielded additional insights about the distribution of features, such as the length of the tail and divergence from the normal distribution.

Visualizations are a great supplement to the tools for univariate analysis that we have discussed in this chapter. Histograms, boxplots, and violin plots display the shape and spread of each feature's distribution. Graphically, they show what we may miss by examining a few summary statistics, such as where there is a bulge (or bulges) in the distribution and where the extreme values are. These visualizations will be every bit as helpful when we explore bivariate and multivariate relationships, which we will do in Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets.

About the Author
  • Michael Walker

    Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island.

    Browse publications by this author
Data Cleaning and Exploration with Machine Learning
Unlock this book and the full library FREE for 7 days
Start now