In this section, we'll look at how individual variables behave—what kind of values they take, what the distribution across those values is, and how those distributions can be represented visually.
Target Variable
The target variable can either have values that are continuous (in the case of a regression problem) or discrete (as in the case of a classification problem). The problem statement we're looking at in this chapter involves predicting whether an earthquake caused a tsunami, that is, the flag_tsunami variable, which takes on two discrete values only—making it a classification problem.
One way of visualizing how many earthquakes resulted in tsunamis and how many didn't involves the use of a bar chart, where each bar represents a single discrete value of the variable, and the height of the bars is equal to the count of the data points having the corresponding discrete value. This gives us a good comparison of the absolute counts of each category.
Exercise 2.06: Plotting a Bar Chart
Let's look at how many of the earthquakes in our dataset resulted in a tsunami. We will do this by using the value_counts() method over the column and using the .plot(kind='bar') function directly on the returned pandas series. This exercise is a continuation of Exercise 2.05: Performing Imputation Using Inferred Values:
- Use plt.figure() to initiate the plotting:
plt.figure(figsize=(8,6))
- Next, type in our primary plotting command:
data.flag_tsunami.value_counts().plot(kind='bar', \
color = ('grey', \
'black'))
- Set the display parameters and display the plot:
plt.ylabel('Number of data points')
plt.xlabel('flag_tsunami')
plt.show()
The output will be as follows:
Figure 2.17: Bar chart showing how many earthquakes resulted in a tsunami
From this bar plot, we can see that most of the earthquakes did not result in tsunamis and that fewer than one-third of the earthquakes actually did. This shows us that the dataset is slightly imbalanced.
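The imbalance read off the bar chart can also be quantified. The following is a minimal sketch that uses a small hypothetical series in place of data.flag_tsunami (the values and their proportions are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for data.flag_tsunami; the real column comes from
# the earthquake dataset loaded earlier in the chapter.
flag_tsunami = pd.Series(["No"] * 7 + ["Tsu"] * 3, name="flag_tsunami")

# normalize=True returns each category's share of the rows instead of its count.
proportions = flag_tsunami.value_counts(normalize=True)
print(proportions)

# Ratio of the largest class to the smallest: 1.0 means perfectly balanced.
imbalance_ratio = proportions.max() / proportions.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

Reporting proportions alongside the absolute counts makes it easier to compare class balance across datasets of different sizes.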
Note
To access the source code for this specific section, please refer to https://packt.live/2Yn4UfR.
You can also run this example online at https://packt.live/37QvoJI. You must execute the entire Notebook in order to get the desired result.
Let's look more closely at what these Matplotlib commands do:
plt.figure(figsize=(8,6)): This command creates a new figure and defines how big our plot should be by providing width and height values in inches. It should come before any plotting command so that the plot is drawn on this figure.
plt.xlabel() and plt.ylabel(): These commands take a string as input and allow us to specify what the labels for the X and Y axes on the plot should be.
plt.show(): This is the final command written when plotting a visualization. It displays the plot inline within the Jupyter notebook.
Categorical Data
Categorical variables are ones that take discrete values representing different categories or levels of observation that can either be string objects or integer values. For example, our target variable, flag_tsunami, is a categorical variable with two categories, Tsu and No.
Categorical variables can be of two types:
- Nominal variables: Variables in which the categories are labeled without any order of precedence are called nominal variables. An example of a nominal variable from our dataset would be location_name. The values that this variable takes cannot be said to be ordered, that is, one location is not greater than the other. Other examples of such variables would be color, types of footwear, ethnicity type, and so on.
- Ordinal variables: Variables that have some order associated with them are called ordinal variables. An example from our dataset would be damage_description, since each value represents an increasing level of damage incurred. Another example could be days of the week, which take values from Monday to Sunday in a known order: Thursday comes after Wednesday but before Friday.
Although ordinal variables can be represented by object data types, they are often represented as numerical data types as well, which can make it difficult to differentiate between them and continuous variables.
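One way to make the order of an ordinal variable explicit in pandas is an ordered categorical. This is a minimal sketch on hypothetical data; the level names follow the damage description categories used in this chapter, but the sample values are invented:

```python
import pandas as pd

# Levels listed from lowest to highest damage.
levels = ["NONE", "LIMITED", "MODERATE", "SEVERE", "EXTREME"]

# Hypothetical observations, not taken from the earthquake dataset.
damage = pd.Series(["SEVERE", "NONE", "MODERATE", "LIMITED", "SEVERE"])

# ordered=True tells pandas that the categories have a meaningful order.
damage_ordered = pd.Series(pd.Categorical(damage, categories=levels, ordered=True))

# Ordered categoricals support comparisons and min/max.
print(damage_ordered.max())
above_limited = (damage_ordered > "LIMITED").tolist()
print(above_limited)

# The integer codes follow the declared order of the levels.
codes = damage_ordered.cat.codes.tolist()
print(codes)
```

A nominal variable such as location_name would use the same construct with ordered=False, in which case comparisons such as greater than are not allowed.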
One of the major challenges faced when dealing with categorical variables in a dataset is high cardinality, that is, a large number of categories or distinct values with each value appearing a relatively small number of times. For example, location_name has a large number of unique values, with each value occurring a small number of times in the dataset.
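A quick way to gauge cardinality, and one common (but not the only) way to reduce it, is sketched below. The location names and the min_count threshold are hypothetical choices made for illustration:

```python
import pandas as pd

# Hypothetical stand-in for data.location_name.
locations = pd.Series(
    ["JAPAN", "JAPAN", "JAPAN", "CHILE", "CHILE", "PERU", "ITALY", "TURKEY"],
    name="location_name",
)

counts = locations.value_counts()

# Share of distinct values among all rows: values close to 1.0 mean that
# almost every row has its own category.
cardinality_ratio = locations.nunique() / len(locations)
print(f"Cardinality ratio: {cardinality_ratio:.3f}")

# Assumed threshold: categories seen fewer than min_count times are "rare".
min_count = 2
rare = counts[counts < min_count].index

# Replace rare categories with a single catch-all label.
grouped = locations.where(~locations.isin(rare), "OTHER")
print(grouped.value_counts())
```

Grouping rare values trades detail for more reliable counts per category, so the threshold should be chosen with the downstream model in mind.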
Additionally, non-numerical categorical variables will always require some form of preprocessing to be converted into a numerical format so that they can be ingested for training by a machine learning model. It can be a challenge to encode categorical variables numerically without losing out on contextual information that, despite being easy for humans to interpret (due to domain knowledge or otherwise just plain common sense), would be hard for a computer to automatically understand. For example, a geographical feature such as country or location name by itself would give no indication of the geographical proximity of different values, but that might just be an important feature—what if earthquakes that occur at locations in South East Asia trigger more tsunamis than those that occur in Europe? There would be no way of capturing that information by merely encoding the feature numerically.
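The loss of context described above can be seen in a small sketch. The country values and the region_map lookup are hypothetical; the mapping stands in for domain knowledge that has to be supplied by hand:

```python
import pandas as pd

# Hypothetical country column.
df = pd.DataFrame({"country": ["JAPAN", "ITALY", "INDONESIA", "JAPAN"]})

# Integer codes are assigned in alphabetical order of the categories
# (INDONESIA=0, ITALY=1, JAPAN=2), which says nothing about geography.
codes = df["country"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding avoids the artificial order but treats every pair of
# countries as equally different from each other.
one_hot = pd.get_dummies(df["country"], prefix="country")
print(one_hot)

# Proximity has to be added explicitly, for example with a hand-built
# lookup table (an assumption for illustration, not part of the dataset).
region_map = {
    "JAPAN": "East Asia",
    "INDONESIA": "South East Asia",
    "ITALY": "Europe",
}
df["region"] = df["country"].map(region_map)
print(df)
```

Neither encoding on its own places INDONESIA closer to JAPAN than to ITALY; the engineered region column is what carries that information.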
Exercise 2.07: Identifying Data Types for Categorical Variables
Let's establish which variables in our Earthquake dataset are categorical and which are continuous. As we now know, categorical variables can also have numerical values, so having a numeric data type doesn't guarantee that a variable is continuous. This exercise is a continuation of Exercise 2.05: Performing Imputation Using Inferred Values:
- Find all the columns that are numerical and object types. We use the .select_dtypes() method on the DataFrame to create subset DataFrames having numeric (np.number) and categorical (object) columns, and then print the column names for each. For numeric columns, use this command:
numeric_variables = data.select_dtypes(include=[np.number])
numeric_variables.columns
The output will be as follows:
Figure 2.18: All columns that are numerical
For categorical columns, use this command (the np.object alias was removed in NumPy 1.24, so we pass the string 'object' instead):
object_variables = data.select_dtypes(include=['object'])
object_variables.columns
The output will be as follows:
Figure 2.19: All columns that are object types
Here, it is evident that the columns that are object types are categorical variables. To differentiate between the categorical and continuous variables from the numeric columns, let's see how many unique values there are for each of these features.
- Find the number of unique values for numeric features. We use the nunique() method on the DataFrame to find the number of unique values in each column and sort the resulting series in ascending order. For numeric columns, use this command:
numeric_variables.nunique().sort_values()
The output will be as follows:
Figure 2.20: Number of unique values for numeric features
For categorical columns, use this command:
object_variables.nunique().sort_values()
The output will be as follows:
Figure 2.21: Number of unique values for categorical columns
Note
To access the source code for this specific section, please refer to https://packt.live/2YlSmFt.
You can also run this example online at https://packt.live/31hnuIr. You must execute the entire Notebook in order to get the desired result.
For the numeric variables, we can see that the top nine have significantly fewer unique values than the remaining columns, and it's likely that these are categorical variables. However, we must keep in mind that some of them might simply be continuous variables with a narrow range of rounded values. Also, month and day would not be considered categorical variables here.
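The check for numeric columns with few unique values can be wrapped in a small helper. This is a sketch under stated assumptions: the DataFrame, its column names, the helper name low_cardinality_numeric, and the max_unique cutoff are all hypothetical, and the result is only a list of candidates to review by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "intensity": [3, 4, 4, 5, 3, 4, 5, 3],   # few distinct values
    "month": [1, 2, 3, 4, 5, 6, 7, 8],
    "focal_depth": [10.5, 33.0, 12.1, 60.2, 8.9, 25.4, 41.7, 15.3],
    "location_name": list("abcdefgh"),       # object column, ignored below
})


def low_cardinality_numeric(frame, max_unique):
    """Return unique-value counts of numeric columns with at most max_unique values."""
    numeric = frame.select_dtypes(include=[np.number])
    unique_counts = numeric.nunique().sort_values()
    return unique_counts[unique_counts <= max_unique]


# Assumed cutoff; a sensible value depends on the size of the dataset.
candidates = low_cardinality_numeric(df, max_unique=5)
print(candidates)
```

Columns such as month or day can still pass such a filter on a larger dataset, which is why the output should be treated as a shortlist rather than a final classification.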
Exercise 2.08: Calculating Category Value Counts
For columns with categorical values, it is useful to see what the unique values (categories) of the feature are, along with the frequencies of these categories, that is, how often each distinct value occurs in the dataset. Let's find the number of occurrences of each label from 0 to 4, as well as of NaN values, for the injuries_description categorical variable. This exercise is a continuation of Exercise 2.07: Identifying Data Types for Categorical Variables:
- Use the value_counts() method on the injuries_description column to find the frequency of each category. Using value_counts gives us the frequencies of each value in decreasing order in the form of a pandas series:
counts = data.injuries_description.value_counts(dropna=False)
counts
The output should be as follows:
Figure 2.22: Frequency of each category
- Sort the values in increasing order of the ordinal variable. If we want the frequencies in the order of the values themselves, we can sort the series by its index (that is, the ordinal variable) and then reset the index to give us a DataFrame:
counts.sort_index().reset_index()
The output will be as follows:
Figure 2.23: Sorted values
Note
To access the source code for this specific section, please refer to https://packt.live/2Yn5URj.
You can also run this example online at https://packt.live/314dYIr. You must execute the entire Notebook in order to get the desired result.
Exercise 2.09: Plotting a Pie Chart
Since our target variable in our sample data is categorical, the example in Exercise 2.06: Plotting a Bar Chart, showed us one way of visualizing how the categorical values are distributed (using a bar chart). Another plot that can make it easy to see how each category functions as a fraction of the overall dataset is a pie chart. Let's plot a pie chart to visualize the distribution of the discrete values of the damage_description variable. This exercise is a continuation of Exercise 2.08, Calculating Category Value Counts:
- Format the data into the form that needs to be plotted. Here, we run value_counts() over the column and sort the series by index:
counts = data.damage_description.value_counts()
counts = counts.sort_index()
- Plot the pie chart. The ax.pie() method plots the pie chart using the count data. We will use the same three steps for plotting as described in Exercise 2.06: Plotting a Bar Chart:
fig, ax = plt.subplots(figsize=(10,10))
slices = ax.pie(counts, \
                labels=counts.index, \
                colors = ['white'], \
                wedgeprops = {'edgecolor': 'black'})
patches = slices[0]
hatches = ['/', '\\', '|', '-', '+', 'x', 'o', 'O', '.', '*']
for patch in range(len(patches)):
    patches[patch].set_hatch(hatches[patch])
plt.title('Pie chart showing counts for\ndamage_description '\
          'categories')
plt.show()
The output will be as follows:
Figure 2.24: Pie chart showing counts for damage_description categories
Note
To access the source code for this specific section, please refer to https://packt.live/37Ovj9s.
You can also run this example online at https://packt.live/37OvotM. You must execute the entire Notebook in order to get the desired result.
Figure 2.24 tells us the relative number of items in each of the five damage description categories. Note that it would be good practice to do the extra work to change the uninformative labels to the categories—recall from the EDA discussion that:
0 = NONE
1 = LIMITED (roughly corresponding to less than $1 million)
2 = MODERATE (~$1 to $5 million)
3 = SEVERE (~>$5 to $24 million)
4 = EXTREME (~$25 million or more)
In addition, while the pie chart gives us a quick visual impression of which are the largest and smallest categories, we get no idea of the actual quantities, so adding those labels would increase the value of the chart. You can use the code in the repository for this book to update the chart.
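As a rough sketch of both suggestions, the labels can be built from the category names and the counts before plotting. The counts below are hypothetical; with the real data, counts would be the sorted value counts of damage_description:

```python
import pandas as pd

# Category names as listed in the EDA discussion.
damage_labels = {0: "NONE", 1: "LIMITED", 2: "MODERATE", 3: "SEVERE", 4: "EXTREME"}

# Hypothetical counts standing in for
# data.damage_description.value_counts().sort_index().
counts = pd.Series([10, 40, 25, 15, 10], index=[0, 1, 2, 3, 4])

# Combine the readable name and the absolute count into one label per slice.
pie_labels = [f"{damage_labels[code]} (n={n})" for code, n in counts.items()]
print(pie_labels)

# The list can then be passed to the pie call in place of counts.index,
# for example ax.pie(counts, labels=pie_labels, ...).
```

The dictionary lookup also works if the index holds floats such as 0.0, because Python treats 0 and 0.0 as the same dictionary key.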
Continuous Data
Continuous variables can take any number of values and are usually integer (for example, number of deaths) or float data types (for example, the height of a mountain). It's useful to get an idea of the basic statistics of the values in the feature: the minimum, maximum, and percentile values we see from the output of the describe() function give us a fair estimate of this.
However, for continuous variables, it is also very useful to see how the values are distributed in the range they operate in. Since we cannot simply find the counts of individual values, instead, we order the values in ascending order, group them into evenly-sized intervals, and find the counts for each interval. This gives us the underlying frequency distribution and plotting this gives us a histogram, which allows us to examine the shape, central values, and amount of variability in the data.
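The binning step behind a histogram can be reproduced by hand. The following sketch uses a handful of hypothetical magnitude values and an assumed bin width of 1:

```python
import numpy as np
import pandas as pd

# Hypothetical magnitudes, already in ascending order for readability.
magnitudes = pd.Series([5.1, 5.4, 5.9, 6.2, 6.3, 6.8, 7.0, 7.1, 7.7, 8.3])

# Evenly sized intervals [5, 6), [6, 7), [7, 8), [8, 9).
bin_edges = np.arange(5, 10, 1)

# right=False makes each interval include its left edge and exclude its right.
binned = pd.cut(magnitudes, bins=bin_edges, right=False)

# Counting the values per interval gives the frequency distribution that a
# histogram draws as bars.
frequency = binned.value_counts().sort_index()
print(frequency)
```

Changing bin_edges changes both the number of bars and their heights, which is why the choice of bins affects how a histogram looks.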
Histograms give us an easy view of the data that we're looking at. They tell us about the behavior of the values at a glance in terms of the underlying distribution (for example, a normal or exponential distribution), the presence of outliers, skewness, and more.
Note
It is easy to get confused between a bar chart and a histogram. The major difference is that a histogram is used to plot continuous data that has been binned to visualize the frequency distribution, while bar charts can be used for a variety of other use cases, including to represent categorical variables as we have done. Additionally, with histograms, the number of bins is something we can vary, so the range of values in a bin is determined by the number of bins, as is the height of the bars in the histogram. In a bar chart, the width of the bars does not generally convey meaning, and the height is usually a property of the category, like a count.
One of the most common frequency distributions is a Gaussian (or normal) distribution. This is a symmetric distribution that has a bell-shaped curve, which indicates that the values near the middle of the range have the highest occurrences in the dataset with a symmetrically decreasing frequency of occurrences as we move away from the middle. You almost certainly have seen examples of Gaussian distributions, because many natural and man-made processes generate values that vary nearly like the Gaussian distribution. Thus, it is extremely common to see data compared to the Gaussian distribution.
It is a probability distribution and the area under the curve equals one, as shown in Figure 2.25:
Figure 2.25: Gaussian (normal) distribution
The normal distribution is characterized entirely by two parameters: the mean (µ) and the standard deviation (σ). In Figure 2.25, for example, the mean is at 7.5. However, a significant amount of real data does not follow a normal distribution and may be asymmetric. The asymmetry of data is often referred to as skew.
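To make the role of the two parameters concrete, the sketch below evaluates the normal density on a grid and checks numerically that the area under the curve is one. The mean of 7.5 matches Figure 2.25; the standard deviation of 1.5 is an assumed value chosen only for illustration:

```python
import numpy as np

mu = 7.5      # mean, as in Figure 2.25
sigma = 1.5   # standard deviation (assumed for this sketch)

# Evaluate the density on a fine grid covering 8 standard deviations each side.
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Approximate the integrals with a simple Riemann sum.
dx = x[1] - x[0]
area = np.sum(pdf) * dx
within_one_sigma = np.sum(pdf[np.abs(x - mu) <= sigma]) * dx

print(f"Total area: {area:.4f}")
print(f"Area within one standard deviation of the mean: {within_one_sigma:.4f}")
```

Roughly 68% of the probability lies within one standard deviation of the mean, whatever values µ and σ take.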
Skewness
A distribution is said to be skewed if it is not symmetric in nature, and skewness measures the asymmetry of a variable about its mean. The value can be positive or negative (or undefined). In the former case, the tail is on the right-hand side of the distribution, while the latter indicates that the tail is on the left-hand side.
However, it must be noted that a thick and short tail would have the same effect on the value of skewness as a long, thin tail.
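The sign convention can be checked on synthetic data. This is a minimal sketch using randomly generated samples (not the earthquake data) and the pandas .skew() method:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# An exponential sample has a long tail on the right-hand side.
right_tailed = pd.Series(rng.exponential(scale=1.0, size=10_000))

# Mirroring the sample moves the tail to the left-hand side.
left_tailed = -right_tailed

# A normal sample is symmetric about its mean.
symmetric = pd.Series(rng.normal(size=10_000))

print(f"Right-tailed skew: {right_tailed.skew():.2f}")  # positive
print(f"Left-tailed skew:  {left_tailed.skew():.2f}")   # negative
print(f"Symmetric skew:    {symmetric.skew():.2f}")     # close to zero
```

Mirroring a sample flips the sign of its skew without changing the magnitude, which is a quick sanity check when interpreting the values.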
Kurtosis
Kurtosis is a measure of the tailedness of the distribution of a variable, that is, of how much of the distribution's weight lies in its tails rather than near its center. A high value of kurtosis indicates fatter tails and often the presence of outliers. Like skewness, kurtosis describes the shape of the distribution.
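The effect of heavy tails on kurtosis can be illustrated with synthetic samples. The sketch below relies on the pandas .kurt() method, which reports excess kurtosis (a normal distribution scores about 0); the Laplace distribution is used here only as a convenient heavy-tailed example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Reference sample: normal distribution, excess kurtosis near 0.
normal_sample = pd.Series(rng.normal(size=100_000))

# Heavy-tailed sample: the Laplace distribution has a theoretical excess
# kurtosis of 3.
heavy_tailed = pd.Series(rng.laplace(size=100_000))

print(f"Normal sample kurtosis:  {normal_sample.kurt():.2f}")
print(f"Laplace sample kurtosis: {heavy_tailed.kurt():.2f}")
```

Because the measure is driven by values far from the mean, a handful of extreme observations can change it substantially.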
Exercise 2.10: Plotting a Histogram
Let's plot the histogram for the eq_primary feature using the Seaborn library. This exercise is a continuation of Exercise 2.09, Plotting a Pie Chart:
- Use plt.figure() to initiate the plotting:
plt.figure(figsize=(10,7))
- sns.distplot() is the primary command that we will use to plot the histogram. The first parameter is the one-dimensional data over which to plot the histogram, while the bins parameter defines the bin edges, and therefore the number and size of the bins. Note that sns.distplot() is deprecated as of seaborn 0.11; in newer versions, sns.histplot() with stat='density' and kde=True produces a similar plot. Use this as follows:
sns.distplot(data.eq_primary.dropna(), \
             bins=np.linspace(0,10,21))
- Display the plot using plt.show():
plt.show()
The output will be as follows:
Figure 2.26: Histogram for the example primary feature
The plot gives us a normed (or normalized) histogram, which means that the area under the bars of the histogram equals unity. Additionally, the line over the histogram is the kernel density estimate, which gives us an idea of what the probability distribution for the variable would look like.
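The claim that the area under a normed histogram equals one can be verified numerically. This sketch uses NumPy only and a randomly generated sample in place of eq_primary; the bin edges match those used in the exercise:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical magnitudes standing in for data.eq_primary.dropna().
values = rng.normal(loc=6.5, scale=0.8, size=1_000)

# Same bin edges as in the exercise: 20 bins of width 0.5 between 0 and 10.
bin_edges = np.linspace(0, 10, 21)

# density=True rescales the counts so that bar height times bar width,
# summed over all bars, equals one.
density, edges = np.histogram(values, bins=bin_edges, density=True)
area = np.sum(density * np.diff(edges))

print(f"Number of bins: {len(density)}")
print(f"Area under the histogram: {area:.6f}")
```

Because the bars are scaled to a total area of one, their heights are densities rather than counts and can exceed 1 when the bins are narrow.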
Note
To access the source code for this specific section, please refer to https://packt.live/2BwZrdj.
You can also run this example online at https://packt.live/3fMSxj2. You must execute the entire Notebook in order to get the desired result.
From the plot, we can see that the values of eq_primary lie mostly between 5 and 8, which means that most earthquakes had a magnitude with a moderate to high value, with barely any earthquakes having a low or very high magnitude.
Exercise 2.11: Computing Skew and Kurtosis
Let's calculate the skew and kurtosis values for all of the features in the dataset using the core pandas functions available to us. This exercise is a continuation of Exercise 2.10, Plotting a Histogram:
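- Use the .skew() DataFrame method to calculate the skew for all numeric features and then sort the values in ascending order. Passing numeric_only=True skips the object-type columns, for which skew is not defined:
data.skew(numeric_only=True).sort_values()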
The output will be as follows:
Figure 2.27: Skew values for all the features in the dataset
- Use the .kurt() DataFrame method to calculate the kurtosis for all numeric features:
data.kurt(numeric_only=True)
The output will be as follows:
Figure 2.28: Kurtosis values for all the features in the dataset
Here, we can see that the kurtosis values for some variables deviate significantly from 0. Because pandas reports excess kurtosis, for which a normal distribution scores 0, large positive values mean that these columns have heavy tails. The values at the tail end of these variables (which indicate the number of people dead, injured, and the monetary value of damage) may, in our case, be outliers that need special attention. Larger values might, in fact, indicate an additional force that added to the devastation caused by an earthquake, that is, a tsunami.
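One possible way to turn these observations into a shortlist of columns to inspect is sketched below. The data is synthetic, the column names are hypothetical, and the skew_limit and kurt_limit thresholds are assumed rules of thumb rather than values prescribed by this chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data: one roughly symmetric column and one heavy-tailed column.
df = pd.DataFrame({
    "magnitude": rng.normal(6.5, 0.8, size=5_000),
    "deaths": rng.lognormal(mean=2.0, sigma=1.5, size=5_000),
})

# Collect both shape statistics side by side.
summary = pd.DataFrame({
    "skew": df.skew(numeric_only=True),
    "kurtosis": df.kurt(numeric_only=True),
})

# Assumed thresholds for "large" values; adjust them to the problem at hand.
skew_limit, kurt_limit = 1.0, 3.0
flagged = summary[
    (summary["skew"].abs() > skew_limit) | (summary["kurtosis"] > kurt_limit)
]
print(flagged)
```

A flagged column is not necessarily problematic; it is simply a candidate for a closer look at its histogram and its extreme values.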
Note
To access the source code for this specific section, please refer to https://packt.live/2Yklmh0.
You can also run this example online at https://packt.live/37PcMdj. You must execute the entire Notebook in order to get the desired result.
Activity 2.02: Representing the Distribution of Values Visually
In this activity, we will implement what we learned in the previous section by creating different plots such as histograms and pie charts. Furthermore, we will calculate the skew and kurtosis for the features of the dataset. Here, we will use the same dataset we used in Activity 2.01: Summary Statistics and Missing Values, that is, House Prices: Advanced Regression Techniques. We'll use different types of plots to visually represent the distribution of values for this dataset. This activity is a continuation of Activity 2.01: Summary Statistics and Missing Values.
The steps to be performed are as follows:
- Plot a histogram using Matplotlib for the target variable, SalePrice.
The output will be as follows:
Figure 2.29: Histogram for the target variable
- Find the number of unique values within each column having an object type.
- Create a DataFrame representing the number of occurrences for each categorical value in the HouseStyle column.
- Plot a pie chart representing these counts.
The output will be as follows:
Figure 2.30: Pie chart representing the counts
- Find the number of unique values within each column having a number type.
- Plot a histogram using seaborn for the LotArea variable.
The output will be as follows:
Figure 2.31: Histogram for the LotArea variable
- Calculate the skew and kurtosis values for the values in each column.
The output for skew values will be:
Figure 2.32: Skew values for each column
The output for kurtosis values will be:
Figure 2.33: Kurtosis values for each column
Note
The solution for this activity can be found via this link.
We have seen how to look into the nature of data in more detail, in particular, by beginning to understand the distribution of the data using histograms or density plots, relative counts of data using pie charts, as well as inspecting the skew and kurtosis of the variables as a first step to finding potentially problematic data, outliers, and so on.
By now, you should be comfortable handling various statistical measures of data such as summary statistics, counts, and the distribution of values. Using tools such as histograms and density plots, you can explore the shape of datasets, and augment that understanding by calculating statistics such as skew and kurtosis. You should also be developing some intuition for flags that warrant further investigation, such as large skew or kurtosis values.