*Chapter 1*: Introduction to Data Analysis

Before we can begin our hands-on introduction to data analysis with `pandas`, we need to learn about the fundamentals of data analysis. Those who have ever looked at the documentation for a software library know how overwhelming it can be if you have no clue what you are looking for. Therefore, it is essential that we master not only the coding aspect but also the thought process and workflow required to analyze data, which will prove the most useful in augmenting our skill set in the future.

Much like the scientific method, data science has some common workflows that we can follow when we want to conduct an analysis and present the results. The backbone of this process is **statistics**, which gives us ways to describe our data, make predictions, and also draw conclusions about it. Since prior knowledge of statistics is not a prerequisite, this chapter will give us exposure to the statistical concepts we will use throughout this book, as well as areas for further exploration.

After covering the fundamentals, we will get our Python environment set up for the remainder of this book. Python is a powerful language, and its uses go way beyond data science: building web applications, software, and web scraping, to name a few. In order to work effectively across projects, we need to learn how to make **virtual environments**, which will isolate each project's dependencies. Finally, we will learn how to work with Jupyter Notebooks in order to follow along with the text.

The following topics will be covered in this chapter:

- The fundamentals of data analysis
- Statistical foundations
- Setting up a virtual environment

# Chapter materials

All the files for this book are on GitHub at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition. While having a GitHub account isn't necessary to work through this book, it is a good idea to create one, as it will serve as a portfolio for any data/coding projects. In addition, working with Git will provide a version control system and make collaboration easy.

Tip

Check out this article to learn some Git basics: https://www.freecodecamp.org/news/learn-the-basics-of-git-in-under-10-minutes-da548267cc91/.

In order to get a local copy of the files, we have a few options (ordered from least useful to most useful):

- Download the ZIP file and extract the files locally.
- Clone the repository without forking it.
- Fork the repository and then clone it.

This book includes exercises for every chapter; therefore, for those who want to keep a copy of their solutions along with the original content on GitHub, it is highly recommended to **fork** the repository and **clone** the forked version. When we fork a repository, GitHub will make a repository under our own profile with the latest version of the original. Then, whenever we make changes to our version, we can push the changes back up. Note that if we simply clone, we don't get this benefit.

The relevant buttons for initiating this process are circled in the following screenshot:

Important note

The cloning process will copy the files to the current working directory in a folder called `Hands-On-Data-Analysis-with-Pandas-2nd-edition`. To make a folder to put this repository in, we can use `mkdir my_folder && cd my_folder`. This will create a new folder (directory) called `my_folder` and then change the current directory to that folder, after which we can clone the repository. We can chain these two commands (and any number of commands) together by adding `&&` in between them. This can be thought of as *and then* (provided the first command succeeds).

This repository has folders for each chapter. This chapter's materials can be found at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_01. While the bulk of this chapter doesn't involve any coding, feel free to follow along in the `introduction_to_data_analysis.ipynb` notebook on the GitHub website until we set up our environment toward the end of the chapter. After we do so, we will use the `check_your_environment.ipynb` notebook to get familiar with Jupyter Notebooks and to run some checks to make sure that everything is set up properly for the rest of this book.

Since the code that's used to generate the content in these notebooks is not the main focus of this chapter, the majority of it has been separated into the `visual_aids` package, which is used to create visuals for explaining concepts throughout the book, and the `check_environment.py` file. If you choose to inspect these files, don't be overwhelmed; everything that's relevant to data science will be covered in this book.

Every chapter includes exercises; however, for this chapter only, there is an `exercises.ipynb` notebook with code to generate some initial data. Knowledge of basic Python will be necessary to complete these exercises. For those who would like to review the basics, make sure to run through the `python_101.ipynb` notebook, included in the materials for this chapter, for a crash course. The official Python tutorial is a good place to start for a more formal introduction: https://docs.python.org/3/tutorial/index.html.

# The fundamentals of data analysis

Data analysis is a highly iterative process involving collection, preparation (wrangling), **exploratory data analysis** (**EDA**), and drawing conclusions. During an analysis, we will frequently revisit each of these steps. The following diagram depicts a generalized workflow:

Over the next few sections, we will get an overview of each of these steps, starting with data collection. In practice, this process is heavily skewed toward the data preparation side. Surveys have found that although data scientists enjoy the data preparation side of their job the least, it makes up 80% of their work (https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/). This data preparation step is where `pandas` really shines.

## Data collection

Data collection is the natural first step for any data analysis—we can't analyze data we don't have. In reality, our analysis can begin even before we have the data. When we decide what we want to investigate or analyze, we have to think about what kind of data we can collect that will be useful for our analysis. While data can come from anywhere, we will explore the following sources throughout this book:

- Web scraping to extract data from a website's HTML (often with Python packages such as `selenium`, `requests`, `scrapy`, and `beautifulsoup`)
- **Application programming interfaces** (**APIs**) for web services from which we can collect data with HTTP requests (perhaps using `cURL` or the `requests` Python package); see the short sketch after this list
- Databases (data can be extracted with SQL or another database-querying language)
- Internet resources that provide data for download, such as government websites or Yahoo! Finance
- Log files
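The following is a minimal sketch of the API approach using the `requests` package; the endpoint URL and query parameters here are hypothetical placeholders, and a real API will document its own endpoints and any authentication requirements:

```python
import requests

# Request data from a (hypothetical) web service endpoint
response = requests.get(
    'https://api.example.com/v1/daily-data',  # placeholder URL
    params={'start_date': '2021-01-01', 'end_date': '2021-01-31'},
)
response.raise_for_status()  # raise an error if the request failed
data = response.json()       # parse the JSON payload into Python objects
```
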
Important note

*Chapter 2*, *Working with Pandas DataFrames*, will give us the skills we need to work with the aforementioned data sources. *Chapter 12*, *The Road Ahead*, provides numerous resources for finding data sources.

We are surrounded by data, so the possibilities are limitless. It is important, however, to make sure that we are collecting data that will help us draw conclusions. For example, if we are trying to determine whether hot chocolate sales are higher when the temperature is lower, we should collect data on the amount of hot chocolate sold and the temperatures each day. While it might be interesting to see how far people traveled to get the hot chocolate, it's not relevant to our analysis.

Don't worry too much about finding the perfect data before beginning an analysis. Odds are, there will always be something we want to add/remove from the initial dataset, reformat, merge with other data, or change in some way. This is where data wrangling comes into play.

## Data wrangling

**Data wrangling** is the process of preparing the data and getting it into a format that can be used for analysis. The unfortunate reality of data is that it is often dirty, meaning that it requires cleaning (preparation) before it can be used. The following are some issues we may encounter with our data:

- **Human errors**: Data is recorded (or even collected) incorrectly, such as putting `100` instead of `1000`, or typos. In addition, there may be multiple versions of the same entry recorded, such as `New York City`, `NYC`, and `nyc`.
- **Computer error**: Perhaps we weren't recording entries for a while (missing data).
- **Unexpected values**: Maybe whoever was recording the data decided to use a question mark for a missing value in a numeric column, so now all the entries in the column will be treated as text instead of numeric values (see the sketch after this list).
- **Incomplete information**: Think of a survey with optional questions; not everyone will answer them, so we will have missing data, but not due to computer or human error.
- **Resolution**: The data may have been collected per second, while we need hourly data for our analysis.
- **Relevance of the fields**: Often, data is collected or generated as a product of some process rather than explicitly for our analysis. In order to get it to a usable state, we will have to clean it up.
- **Format of the data**: Data may be recorded in a format that isn't conducive to analysis, which will require us to reshape it.
- **Misconfigurations in the data-recording process**: Data coming from sources such as misconfigured trackers and/or webhooks may be missing fields or passed in the wrong order.

Most of these data quality issues can be remedied, but some cannot, such as when the data is collected daily and we need it at an hourly resolution. It is our responsibility to carefully examine our data and handle any issues so that our analysis doesn't get distorted. We will cover this process in depth in *Chapter 3*, *Data Wrangling with Pandas*, and *Chapter 4*, *Aggregating Pandas DataFrames*.

Once we have performed an initial cleaning of the data, we are ready for EDA. Note that during EDA, we may need some additional data wrangling: these two steps are highly intertwined.

## Exploratory data analysis

During EDA, we use visualizations and summary statistics to get a better understanding of the data. Since the human brain excels at picking out visual patterns, data visualization is essential to any analysis. In fact, some characteristics of the data can only be observed in a plot. Depending on our data, we may create plots to see how a variable of interest has evolved over time, compare how many observations belong to each category, find outliers, look at distributions of continuous and discrete variables, and much more. In *Chapter 5*, *Visualizing Data with Pandas and Matplotlib*, and *Chapter 6*, *Plotting with Seaborn and Customization Techniques*, we will learn how to create these plots for both EDA and presentation.

Important note

Data visualizations are very powerful; unfortunately, they can often be misleading. One common issue stems from the scale of the *y*-axis because most plotting tools will zoom in by default to show the pattern up close. It would be difficult for software to know what the appropriate axis limits are for every possible plot; therefore, it is our job to properly adjust the axes before presenting our results. You can read about some more ways that plots can be misleading at https://venngage.com/blog/misleading-graphs/.

In the workflow diagram we saw earlier (*Figure 1.2*), EDA and data wrangling shared a box. This is because they are closely tied:

- Data needs to be prepped before EDA.
- Visualizations that are created during EDA may indicate the need for additional data cleaning.
- Data wrangling uses summary statistics to look for potential data issues, while EDA uses them to understand the data. Improper cleaning will distort the findings when we're conducting EDA. In addition, data wrangling skills will be required to get summary statistics across subsets of the data.

When calculating summary statistics, we must keep the type of data we collected in mind. Data can be **quantitative** (measurable quantities) or **categorical** (descriptions, groupings, or categories). Within these classes of data, we have further subdivisions that let us know what types of operations we can perform on them.

For example, categorical data can be **nominal**, where we assign a numeric value to each level of the category, such as `on = 1`/`off = 0`. Note that the fact that `on` is greater than `off` is meaningless because we arbitrarily chose those numbers to represent the states `on` and `off`. When there is a ranking among the categories, they are **ordinal**, meaning that we can order the levels (for instance, we can have `low < medium < high`).

Quantitative data can use an **interval scale** or a **ratio scale**. The interval scale includes things such as temperature. We can measure temperatures in Celsius and compare the temperatures of two cities, but it doesn't mean anything to say one city is twice as hot as the other. Therefore, interval scale values can be meaningfully compared using addition/subtraction, but not multiplication/division. The ratio scale, then, are those values that can be meaningfully compared with ratios (using multiplication and division). Examples of the ratio scale include prices, sizes, and counts.

When we complete our EDA, we can decide on the next steps by drawing conclusions.

## Drawing conclusions

After we have collected the data for our analysis, cleaned it up, and performed some thorough EDA, it is time to draw conclusions. This is where we summarize our findings from EDA and decide the next steps:

- Did we notice any patterns or relationships when visualizing the data?
- Does it look like we can make accurate predictions from our data? Does it make sense to move to modeling the data?
- Should we handle missing data points? How?
- How is the data distributed?
- Does the data help us answer the questions we have or give insight into the problem we are investigating?
- Do we need to collect new or additional data?

If we decide to model the data, this falls under machine learning and statistics. While not technically data analysis, it is usually the next step, and we will cover it in *Chapter 9*, *Getting Started with Machine Learning in Python*, and *Chapter 10*, *Making Better Predictions – Optimizing Models*. In addition, we will see how this entire process will work in practice in *Chapter 11*, *Machine Learning Anomaly Detection*. As a reference, in the *Machine learning workflow* section in the *Appendix*, there is a workflow diagram depicting the full process from data analysis to machine learning. *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*, and *Chapter 8*, *Rule-Based Anomaly Detection*, will focus on drawing conclusions from data analysis, rather than building models.

The next section will be a review of statistics; those with knowledge of statistics can skip ahead to the *Setting up a virtual environment* section.

# Statistical foundations

When we want to make observations about the data we are analyzing, we often, if not always, turn to statistics in some fashion. The data we have is referred to as the **sample**, which was observed from (and is a subset of) the **population**. Two broad categories of statistics are descriptive and inferential statistics. With **descriptive statistics**, as the name implies, we are looking to *describe* the sample. **Inferential statistics** involves using the sample statistics to *infer*, or deduce, something about the population, such as the underlying distribution.

Important note

Sample statistics are used as **estimators** of the population parameters, meaning that we have to quantify their bias and variance. There is a multitude of methods for this; some will make assumptions on the shape of the distribution (parametric) and others won't (non-parametric). This is all well beyond the scope of this book, but it is good to be aware of.

Often, the goal of an analysis is to create a story for the data; unfortunately, it is very easy to misuse statistics. It's the subject of a famous quote, popularized by Mark Twain and attributed to Benjamin Disraeli: "There are three kinds of lies: lies, damned lies, and statistics."

This is especially true of inferential statistics, which is used in many scientific studies and papers to show the significance of the researchers' findings. This is a more advanced topic and, since this isn't a statistics book, we will only briefly touch upon some of the tools and principles behind inferential statistics, which can be pursued further. We will focus on descriptive statistics to help explain the data we are analyzing.

## Sampling

There's an important thing to remember before we attempt any analysis: our sample must be a **random sample** that is representative of the population. This means that the data must be sampled without bias (for example, if we are asking people whether they like a certain sports team, we can't only ask fans of the team) and that we should have (ideally) members of all distinct groups from the population in our sample (in the sports team example, we can't just ask men).

When we discuss machine learning in *Chapter 9*, *Getting Started with Machine Learning in Python*, we will need to sample our data, which will be a sample to begin with. This is called **resampling**. Depending on the data, we will have to pick a different method of sampling. Often, our best bet is a **simple random sample**: we use a random number generator to pick rows at random. When we have distinct groups in the data, we want our sample to be a **stratified random sample**, which will preserve the proportion of the groups in the data. In some cases, we don't have enough data for the aforementioned sampling strategies, so we may turn to random sampling with replacement (**bootstrapping**); this is called a **bootstrap sample**. Note that our underlying sample needs to have been a random sample or we risk increasing the bias of the estimator (we could pick certain rows more often because they are in the data more often if it was a convenience sample, while in the true population these rows aren't as prevalent). We will see an example of bootstrapping in *Chapter 8*, *Rule-Based Anomaly Detection*.
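Here is a minimal sketch of these sampling strategies with a hypothetical `pandas` dataframe (the data and group labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b', 'b'], 'value': range(6)})

# Simple random sample of 3 rows (without replacement)
simple = df.sample(n=3, random_state=0)

# Bootstrap sample: same size as the original data, drawn with replacement
bootstrap = df.sample(frac=1, replace=True, random_state=0)

# Stratified random sample: take 50% of the rows within each group
stratified = df.groupby('group', group_keys=False).apply(
    lambda group: group.sample(frac=0.5, random_state=0)
)
```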

Important note

A thorough discussion of the theory behind bootstrapping and its consequences is well beyond the scope of this book, but watch this video for a primer: https://www.youtube.com/watch?v=gcPIyeqymOU.

You can read more about sampling methods, along with their strengths and weaknesses, at https://www.khanacademy.org/math/statistics-probability/designing-studies/sampling-methods-stats/a/sampling-methods-review.

## Descriptive statistics

We will begin our discussion of descriptive statistics with **univariate statistics**; univariate simply means that these statistics are calculated from one (**uni**) variable. Everything in this section can be extended to the whole dataset, but the statistics will be calculated per variable we are recording (meaning that if we had 100 observations of speed and distance pairs, we could calculate the averages across the dataset, which would give us the average speed and average distance statistics).

Descriptive statistics are used to describe and/or summarize the data we are working with. We can start our summarization of the data with a measure of **central tendency**, which describes the value around which most of the data is centered, and a measure of **spread** or **dispersion**, which indicates how far apart the values are.

### Measures of central tendency

Measures of central tendency describe the center of our distribution of data. There are three common statistics that are used as measures of center: mean, median, and mode. Each has its own strengths, depending on the data we are working with.

#### Mean

Perhaps the most common statistic for summarizing data is the average, or **mean**. The population mean is denoted by *μ* (the Greek letter *mu*), and the sample mean is written as *x̄* (pronounced *X-bar*). The sample mean is calculated by summing all the values and dividing by the count of values; for example, the mean of the numbers 0, 1, 1, 2, and 9 is 2.6 (`(0 + 1 + 1 + 2 + 9)/5`):
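$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$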

We use *x*i to represent the *i*th observation of the variable *X*. Note how the variable as a whole is represented with a capital letter, while the specific observation is lowercase. *Σ* (the Greek capital letter *sigma*) is used to represent a summation, which, in the equation for the mean, goes from 1 to *n*, the number of observations.

One important thing to note about the mean is that it is very sensitive to **outliers** (values created by a different generative process than our distribution). In the previous example, we were dealing with only five values; nevertheless, the 9 is much larger than the other numbers and pulled the mean higher than all but the 9. In cases where we suspect outliers to be present in our data, we may want to instead use the median as our measure of central tendency.

#### Median

Unlike the mean, the **median** is robust to outliers. Consider income in the US; the top 1% is much higher than the rest of the population, so this will skew the mean to be higher and distort the perception of the average person's income. However, the median will be more representative of the average income because it is the 50th percentile of our data; this means that 50% of the values are greater than the median and 50% are less than the median.

Tip

The *i*th percentile is the value at which *i*% of the observations are less than that value, so the 99th percentile is the value in *X* where 99% of the *x*'s are less than it.

The median is calculated by taking the middle value from an ordered list of values; in cases where we have an even number of values, we take the mean of the middle two values. If we take the numbers 0, 1, 1, 2, and 9 again, our median is 1. Notice that the mean and median for this dataset are different; however, depending on the distribution of the data, they may be the same.

#### Mode

The **mode** is the most common value in the data (if we, once again, have the numbers 0, 1, 1, 2, and 9, then 1 is the mode). In practice, we will often hear things such as *the distribution is bimodal or multimodal* (as opposed to unimodal) in cases where the distribution has two or more most popular values. This doesn't necessarily mean that each of them occurred the same amount of times, but rather, they are more common than the other values by a significant amount. As shown in the following plots, a unimodal distribution has only one mode (at **0**), a bimodal distribution has two (at **-2** and **3**), and a multimodal distribution has many (at **-2**, **0.4**, and **3**):

Understanding the concept of the mode comes in handy when describing continuous distributions; however, most of the time when we're describing our continuous data, we will use either the mean or the median as our measure of central tendency. When working with categorical data, on the other hand, we will typically use the mode.
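As a quick check of all three measures on the small dataset from before (a minimal sketch using Python's built-in `statistics` module; `pandas` provides the same calculations as methods on a `Series`):

```python
import statistics

data = [0, 1, 1, 2, 9]

print(statistics.mean(data))    # 2.6 -> pulled up by the outlier (9)
print(statistics.median(data))  # 1   -> robust to the outlier
print(statistics.mode(data))    # 1   -> the most common value
```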

### Measures of spread

Knowing where the center of the distribution is only gets us partially to being able to summarize the distribution of our data—we need to know how values fall around the center and how far apart they are. Measures of spread tell us how the data is dispersed; this will indicate how thin (low dispersion) or wide (very spread out) our distribution is. As with measures of central tendency, we have several ways to describe the spread of a distribution, and which one we choose will depend on the situation and the data.

#### Range

The **range** is the distance between the smallest value (**minimum**) and the largest value (**maximum**). The units of the range will be the same units as our data. Therefore, unless two distributions of data are in the same units and measuring the same thing, we can't compare their ranges and say one is more dispersed than the other:
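$$range = max(X) - min(X)$$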

Just from the definition of the range, we can see why it wouldn't always be the best way to measure the spread of our data. It gives us upper and lower bounds on what we have in the data; however, if we have any outliers in our data, the range will be rendered useless.

Another problem with the range is that it doesn't tell us how the data is dispersed around its center; it really only tells us how dispersed the entire dataset is. This brings us to the variance.

#### Variance

The **variance** describes how far apart observations are spread out from their average value (the mean). The population variance is denoted as *σ²* (pronounced *sigma-squared*), and the sample variance is written as *s²*. It is calculated as the average squared distance from the mean. Note that the distances must be squared so that distances below the mean don't cancel out those above the mean.

If we want the sample variance to be an unbiased estimator of the population variance, we divide by *n - 1* instead of *n* to account for using the sample mean instead of the population mean; this is called Bessel's correction (https://en.wikipedia.org/wiki/Bessel%27s_correction). Most statistical tools will give us the sample variance by default, since it is *very* rare that we would have data for the entire population:
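$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$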

The variance gives us a statistic with *squared* units. This means that if we started with data on income in dollars ($), then our variance would be in dollars squared ($²). This isn't really useful when we're trying to see how this describes the data; we can use the **magnitude** (size) itself to see how spread out something is (large values = large spread), but beyond that, we need a measure of spread with units that are the same as our data. For this purpose, we use the standard deviation.

#### Standard deviation

We can use the **standard deviation** to see how far from the mean data points are *on average*. A small standard deviation means that values are close to the mean, while a large standard deviation means that values are dispersed more widely. This is tied to how we would imagine the distribution curve: the smaller the standard deviation, the thinner the peak of the curve (**0.5**); the larger the standard deviation, the wider the peak of the curve (**2**):

The standard deviation is simply the square root of the variance. By performing this operation, we get a statistic in units that we can make sense of again ($ for our income example):
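$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$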

Note that the population standard deviation is represented as *σ*, and the sample standard deviation is denoted as *s*.

#### Coefficient of variation

When we moved from variance to standard deviation, we were looking to get to units that made sense; however, if we then want to compare the level of dispersion of one dataset to another, we would need to have the same units once again. One way around this is to calculate the **coefficient of variation** (**CV**), which is unitless. The CV is the ratio of the standard deviation to the mean:
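$$CV = \frac{s}{\bar{x}}$$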

We will use this metric in *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*; since the CV is unitless, we can use it to compare the volatility of different assets.

#### Interquartile range

So far, other than the range, we have discussed mean-based measures of dispersion; now, we will look at how we can describe the spread with the median as our measure of central tendency. As mentioned earlier, the median is the 50th percentile or the 2nd **quartile** (Q2). Percentiles and quartiles are both **quantiles**—values that divide data into equal groups each containing the same percentage of the total data. Percentiles divide the data into 100 parts, while quartiles do so into four (25%, 50%, 75%, and 100%).

Since quantiles neatly divide up our data, and we know how much of the data goes in each section, they are a perfect candidate for helping us quantify the spread of our data. One common measure for this is the **interquartile range** (**IQR**), which is the distance between the 3rd and 1st quartiles:
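$$IQR = Q_3 - Q_1$$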

The IQR gives us the spread of data around the median *and* quantifies how much dispersion we have in the middle 50% of our distribution. It can also be useful when checking the data for outliers, which we will cover in *Chapter 8*, *Rule-Based Anomaly Detection*. In addition, the IQR can be used to calculate a unitless measure of dispersion, which we will discuss next.

#### Quartile coefficient of dispersion

Just like we had the coefficient of variation when using the mean as our measure of central tendency, we have the **quartile coefficient of dispersion** when using the median as our measure of center. This statistic is also unitless, so it can be used to compare datasets. It is calculated by dividing the **semi-quartile range** (half the IQR) by the **midhinge** (midpoint between the first and third quartiles):
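$$QCD = \frac{\frac{Q_3 - Q_1}{2}}{\frac{Q_1 + Q_3}{2}} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$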

We will see this metric again in *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*, when we assess stock volatility. For now, let's take a look at how we can use measures of central tendency and dispersion to summarize our data.

### Summarizing data

We have seen many examples of descriptive statistics that we can use to summarize our data by its center and dispersion; in practice, looking at the **5-number summary** and visualizing the distribution prove to be helpful first steps before diving into some of the other aforementioned metrics. The 5-number summary, as its name indicates, provides five descriptive statistics that summarize our data:
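1. **Minimum** (*Q*0): the smallest value, or the 0th percentile
2. **First quartile** (*Q*1): the 25th percentile
3. **Median** (*Q*2): the 50th percentile
4. **Third quartile** (*Q*3): the 75th percentile
5. **Maximum** (*Q*4): the largest value, or the 100th percentile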

A **box plot** (or box and whisker plot) is a visual representation of the 5-number summary. The median is denoted by a thick line in the box. The top of the box is Q3 and the bottom of the box is Q1. Lines (whiskers) extend from both sides of the box boundaries toward the minimum and maximum. Based on the convention our plotting tool uses, though, they may only extend to a certain statistic; any values beyond these statistics are marked as outliers (using points). For this book in general, the lower bound of the whiskers will be **Q1 − 1.5 * IQR** and the upper bound will be **Q3 + 1.5 * IQR**, which is called the **Tukey box plot**:

While the box plot is a great tool for getting an initial understanding of the distribution, we don't get to see how things are distributed inside each of the quartiles. For this purpose, we turn to **histograms** for **discrete** variables (for instance, the number of people or books) and **kernel density estimates** (**KDEs**) for **continuous** variables (for instance, heights or time). There is nothing stopping us from using KDEs on discrete variables, but it is easy to confuse people that way. Histograms work for both discrete and continuous variables; however, in both cases, we must keep in mind that the number of bins we choose to divide the data into can easily change the shape of the distribution we see.

To make a histogram, a certain number of equal-width bins are created, and then bars with heights for the number of values we have in each bin are added. The following plot is a histogram with 10 bins, showing the three measures of central tendency for the same data that was used to generate the box plot in *Figure 1.6*:

Important note

In practice, we need to play around with the number of bins to find the best value. However, we have to be careful as this can misrepresent the shape of the distribution.

KDEs are similar to histograms, except rather than creating bins for the data, they draw a smoothed curve, which is an estimate of the distribution's **probability density function** (**PDF**). The PDF is for continuous variables and tells us how probability is distributed over the values. Higher values for the PDF indicate higher likelihoods:

When the distribution starts to get a little lopsided with long tails on one side, the mean measure of center can easily get pulled to that side. Distributions that aren't symmetric have some **skew** to them. A **left (negative) skewed distribution** has a long tail on the left-hand side; a **right (positive) skewed distribution** has a long tail on the right-hand side. In the presence of negative skew, the mean will be less than the median, while the reverse happens with a positive skew. When there is no skew, both will be equal:

Important note

There is also another statistic called **kurtosis**, which compares the density of the center of the distribution with the density at the tails. Both skewness and kurtosis can be calculated with the SciPy package.
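For example (a minimal sketch; the right-skewed sample here is simulated):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Simulate a right-skewed sample (an exponential distribution has a long right tail)
data = np.random.default_rng(seed=0).exponential(size=1_000)

print(skew(data))      # positive for a right (positive) skew
print(kurtosis(data))  # excess kurtosis relative to the normal distribution by default
```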

Each column in our data is a **random variable**, because every time we observe it, we get a value according to the underlying distribution—it's not static. When we are interested in the probability of getting a value of *x* or less, we use the **cumulative distribution function** (**CDF**), which is the integral (area under the curve) of the PDF:
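$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt$$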

The probability of the random variable *X* being less than or equal to the specific value of *x* is denoted as *P(X ≤ x)*. With a continuous variable, the probability of getting exactly *x* is 0. This is because the probability will be the integral of the PDF from *x* to *x* (area under a curve with zero width), which is 0:
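$$P(X = x) = \int_{x}^{x} f(t)\,dt = 0$$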

In order to visualize this, we can find an estimate of the CDF from the sample, called the **empirical cumulative distribution function** (**ECDF**). Since this is cumulative, at the point where the value on the *x*-axis is equal to *x*, the *y* value is the cumulative probability of *P(X ≤ x)*. Let's visualize **P(X ≤ 50)**, **P(X = 50)**, and **P(X > 50)** as an example:

In addition to examining the distribution of our data, we may find the need to utilize probability distributions for uses such as simulation (discussed in *Chapter 8*, *Rule-Based Anomaly Detection*) or hypothesis testing (see the *Inferential statistics* section); let's take a look at a few distributions that we are likely to come across.

### Common distributions

While there are many probability distributions, each with specific use cases, there are some that we will come across often. The **Gaussian**, or **normal**, looks like a bell curve and is parameterized by its mean (*μ*) and standard deviation (*σ*). The **standard normal** (*Z*) has a mean of 0 and a standard deviation of 1. Many things in nature happen to follow the normal distribution, such as heights. Note that testing whether a distribution is normal is not trivial—check the *Further reading* section for more information.

The **Poisson distribution** is a discrete distribution that is often used to model arrivals. The time between arrivals can be modeled with the **exponential distribution**. Both are defined by their mean, lambda (*λ*). The **uniform distribution** places equal likelihood on each value within its bounds. We often use this for random number generation. When we generate a random number to simulate a single success/failure outcome, it is called a **Bernoulli trial**. This is parameterized by the probability of success (*p*). When we run the same experiment multiple times (*n*), the total number of successes is then a **binomial** random variable. Both the Bernoulli and binomial distributions are discrete.

We can visualize both discrete and continuous distributions; however, discrete distributions give us a **probability mass function** (**PMF**) instead of a PDF:

We will use some of these distributions in *Chapter 8*, *Rule-Based Anomaly Detection*, when we simulate some login attempt data for anomaly detection.

### Scaling data

In order to compare variables from different distributions, we would have to **scale** the data, which we could do with the range by using **min-max scaling**. We take *each* data point, subtract the minimum of the dataset, then divide by the range. This **normalizes** our data (scales it to the range [0, 1]):
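$$x_{scaled} = \frac{x - min(X)}{range(X)} = \frac{x - min(X)}{max(X) - min(X)}$$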

This isn't the only way to scale data; we can also use the mean and standard deviation. In this case, we would subtract the mean from each observation and then divide by the standard deviation to **standardize** the data. This gives us what is known as a **Z-score**:
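$$z_i = \frac{x_i - \bar{x}}{s}$$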

We are left with a normalized distribution with a mean of 0 and a standard deviation (and variance) of 1. The Z-score tells us how many standard deviations from the mean each observation is; the mean has a Z-score of 0, while an observation of 0.5 standard deviations below the mean will have a Z-score of -0.5.

There are, of course, additional ways to scale our data, and the one we end up choosing will be dependent on our data and what we are trying to do with it. By keeping the measures of central tendency and measures of dispersion in mind, you will be able to identify how the scaling of data is being done in any other methods you come across.
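Here is a minimal sketch of both approaches with NumPy (the values are made up):

```python
import numpy as np

x = np.array([0, 1, 1, 2, 9])

# Min-max scaling: normalize the values to the range [0, 1]
min_max_scaled = (x - x.min()) / (x.max() - x.min())

# Standardization: Z-scores using the sample standard deviation (ddof=1)
z_scores = (x - x.mean()) / x.std(ddof=1)
```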

### Quantifying relationships between variables

In the previous sections, we were dealing with univariate statistics and were only able to say something about the variable we were looking at. With multivariate statistics, we seek to quantify relationships between variables and attempt to make predictions for future behavior.

The **covariance** is a statistic for quantifying the relationship between variables by showing how one variable changes with respect to another (also referred to as their joint variance):
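$$cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$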

Important note

*E[X]* is a new notation for us. It is read as *the expected value of X* or *the expectation of X*, and it is calculated by summing all the possible values of *X* multiplied by their probability—it's the long-run average of *X*.

The magnitude of the covariance isn't easy to interpret, but its sign tells us whether the variables are positively or negatively correlated. However, we would also like to quantify how *strong* the relationship is between the variables, which brings us to correlation. **Correlation** tells us how variables change together both in direction (same or opposite) and magnitude (strength of the relationship). To find the correlation, we calculate the **Pearson correlation coefficient**, symbolized by *ρ* (the Greek letter *rho*), by dividing the covariance by the product of the standard deviations of the variables:
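$$\rho_{X, Y} = \frac{cov(X, Y)}{s_X s_Y}$$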

This normalizes the covariance and results in a statistic bounded between -1 and 1, making it easy to describe both the direction of the correlation (sign) and the strength of it (magnitude). Correlations of 1 are said to be perfect positive (linear) correlations, while those of -1 are perfect negative correlations. Values near 0 aren't correlated. If correlation coefficients are near 1 in absolute value, then the variables are said to be strongly correlated; those closer to 0.5 are said to be weakly correlated.
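In practice, we could estimate this with NumPy (a minimal sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)  # y moves with x, plus some noise

# np.corrcoef returns the correlation matrix; entry [0, 1] is the x-y coefficient
print(np.corrcoef(x, y)[0, 1])
```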

Let's look at some examples using scatter plots. In the leftmost subplot of *Figure 1.12* (**ρ = 0.11**), we see that there is no correlation between the variables: they appear to be random noise with no pattern. The next plot with **ρ = -0.52** has a weak negative correlation: we can see that the variables appear to move together with the *x* variable increasing, while the *y* variable decreases, but there is still a bit of randomness. In the third plot from the left (**ρ = 0.87**), there is a strong positive correlation: *x* and *y* are increasing together. The rightmost plot with **ρ = -0.99** has a near-perfect negative correlation: as *x* increases, *y* decreases. We can also see how the points form a line:

To quickly eyeball the strength and direction of the relationship between two variables (and see whether there even seems to be one), we will often use scatter plots rather than calculating the exact correlation coefficient. This is for a couple of reasons:

- It's easier to find patterns in visualizations, but it's more work to arrive at the same conclusion by looking at numbers and tables.
- We might see that the variables seem related, but they may not be *linearly* related. Looking at a visual representation will make it easy to see if our data is actually quadratic, exponential, logarithmic, or some other non-linear function.

Both of the following plots depict data with strong positive correlations, but it's pretty obvious when looking at the scatter plots that these are not linear. The one on the left is logarithmic, while the one on the right is exponential:

It's very important to remember that while we may find a correlation between *X* and *Y*, it doesn't mean that *X causes Y* or that *Y causes X*. There could be some *Z* that actually causes both; perhaps *X* causes some intermediary event that causes *Y*, or it is actually just a coincidence. Keep in mind that we often don't have enough information to report causation—*correlation does not imply causation*.

Tip

Be sure to check out Tyler Vigen's *Spurious Correlations* blog (https://www.tylervigen.com/spurious-correlations) for some interesting correlations.

### Pitfalls of summary statistics

There is a very interesting dataset illustrating how careful we must be when only using summary statistics and correlation coefficients to describe our data. It also shows us that plotting is not optional. **Anscombe's quartet** is a collection of four different datasets that have identical summary statistics and correlation coefficients, but when plotted, it is obvious they are not similar:

Notice that each of the plots in *Figure 1.14* has an identical best-fit line defined by the equation **y = 0.50x + 3.00**. In the next section, we will discuss, at a high level, how this line is created and what it means.

Important note

Summary statistics are very helpful when we're getting to know the data, but be wary of relying exclusively on them. Remember, statistics can be misleading; be sure to also plot the data before drawing any conclusions or proceeding with the analysis. You can read more about Anscombe's quartet at https://en.wikipedia.org/wiki/Anscombe%27s_quartet. Also, be sure to check out the **Datasaurus Dozen**, which are 13 datasets that also have the same summary statistics, at https://www.autodeskresearch.com/publications/samestats.

## Prediction and forecasting

Say our favorite ice cream shop has asked us to help predict how many ice creams they can expect to sell on a given day. They are convinced that the temperature outside has a strong influence on their sales, so they have collected data on the number of ice creams sold at a given temperature. We agree to help them, and the first thing we do is make a scatter plot of the data they collected:

We can observe an upward trend in the scatter plot: more ice creams are sold at higher temperatures. In order to help out the ice cream shop, though, we need to find a way to make predictions from this data. We can use a technique called **regression** to model the relationship between temperature and ice cream sales with an equation. Using this equation, we will be able to **predict** ice cream sales at a given temperature.

Important note

Remember that correlation does not imply causation. People may buy ice cream when it is warmer, but warmer temperatures don't necessarily cause people to buy ice cream.

In *Chapter 9*, *Getting Started with Machine Learning in Python*, we will go over regression in depth, so this discussion will be a high-level overview. There are many types of regression that will yield a different type of equation, such as linear (which we will use for this example) and logistic. Our first step will be to identify the **dependent variable**, which is the quantity we want to predict (ice cream sales), and the variables we will use to predict it, which are called **independent variables**. While we can have many independent variables, our ice cream sales example only has one: temperature. Therefore, we will use simple linear regression to model the relationship as a line:

The regression line in the previous scatter plot yields the following equation for the relationship:
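Working backward from the two predictions quoted in this section (24.54 ice creams at 35°C and 39.54 at 45°C), the fitted line is approximately:

$$ice\ cream\ sales \approx 1.5 \times temperature - 27.96$$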

Suppose that today the temperature is 35°C—we would plug that in for *temperature* in the equation. The result predicts that the ice cream shop will sell 24.54 ice creams. This prediction is along the red line in the previous plot. Note that the ice cream shop can't actually sell fractions of ice cream.

Before leaving the model in the hands of the ice cream shop, it's important to discuss the difference between the dotted and solid portions of the regression line that we obtained. When we make predictions using the solid portion of the line, we are using **interpolation**, meaning that we will be predicting ice cream sales for temperatures the regression was created on. On the other hand, if we try to predict how many ice creams will be sold at 45°C, it is called **extrapolation** (the dotted portion of the line), since we didn't have any temperatures this high when we ran the regression. Extrapolation can be very dangerous as many trends don't continue indefinitely. People may decide not to leave their houses because it is so hot. This means that instead of selling the predicted 39.54 ice creams, they would sell zero.

When working with time series, our terminology is a little different: we often look to **forecast** future values based on past values. Forecasting is a type of prediction for time series. Before we try to model the time series, however, we will often use a process called **time series decomposition** to split the time series into components, which can be combined in an additive or multiplicative fashion and may be used as parts of a model.

The **trend** component describes the behavior of the time series in the **long term** without accounting for seasonal or cyclical effects. Using the trend, we can make broad statements about the time series in the long run, such as *the population of Earth is increasing* or *the value of a stock is stagnating*. The **seasonality** component explains the systematic and calendar-related movements of a time series. For example, the number of ice cream trucks on the streets of New York City is high in the summer and drops to nothing in the winter; this pattern repeats every year, regardless of whether the actual amount each summer is the same. Lastly, the **cyclical** component accounts for anything else unexplained or irregular with the time series; this could be something such as a hurricane driving the number of ice cream trucks down in the **short term** because it isn't safe to be outside. This component is difficult to anticipate with a forecast due to its unexpected nature.

We can use Python to **decompose** the time series into trend, seasonality, and **noise** or **residuals**. The cyclical component is captured in the noise (random, unpredictable data); after we remove the trend and seasonality from the time series, what we are left with is the residual:
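One common way to do this in Python is with the `statsmodels` package; the following is a minimal sketch in which the daily sales series is simulated rather than real:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Simulate two years of daily ice cream sales with a yearly cycle plus noise
dates = pd.date_range('2018-01-01', periods=730, freq='D')
rng = np.random.default_rng(seed=0)
sales = pd.Series(
    20 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365) + rng.normal(size=730),
    index=dates,
)

# Split the series into trend, seasonal, and residual components (additive model)
result = seasonal_decompose(sales, model='additive', period=365)
result.plot()  # plots each extracted component
```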

When building models to forecast time series, some common methods include exponential smoothing and ARIMA-family models. **ARIMA** stands for **autoregressive** (**AR**), **integrated** (**I**), **moving average** (**MA**). **Autoregressive** models take advantage of the fact that an observation at time *t* is *correlated* to a previous observation, for example, at time *t - 1*. In *Chapter 5*, *Visualizing Data with Pandas and Matplotlib*, we will look at some techniques for determining whether a time series is autoregressive; note that not all time series are. The **integrated** component concerns the **differenced** data, or the *change* in the data from one time to another. For example, if we were concerned with a **lag** (distance between times) of 1, the differenced data would be the value at time *t* subtracted by the value at time *t - 1*. Lastly, the **moving average** component uses a sliding window to average the last *x* observations, where *x* is the length of the sliding window. If, for example, we have a 3-period moving average, by the time we have all of the data up to time 5, our moving average calculation only uses time periods 3, 4, and 5 to forecast time 6. We will build an ARIMA model in *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*.
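For instance, a 3-period moving average can be computed directly with `pandas` (a minimal sketch with made-up values):

```python
import pandas as pd

sales = pd.Series([3, 4, 6, 5, 8])

# Each entry averages the current and previous two observations;
# the first two entries are NaN because a full 3-period window isn't available yet
print(sales.rolling(window=3).mean())
```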

The moving average puts equal weight on each time period in the past involved in the calculation. In practice, this isn't always a realistic expectation of our data. Sometimes, *all* past values are important, but they *vary* in their influence on future data points. For these cases, we can use **exponential smoothing**, which allows us to put more weight on more recent values and less weight on values further away from what we are predicting.

Note that we aren't limited to predicting numbers; in fact, depending on the data, our predictions could be categorical in nature—things such as determining which flavor of ice cream will sell the most on a given day or whether an email is spam or not. This type of prediction will be introduced in *Chapter 9*, *Getting Started with Machine Learning in Python*.

## Inferential statistics

As mentioned earlier, inferential statistics deals with inferring or deducing things from the sample data we have in order to make statements about the population as a whole. When we're looking to state our conclusions, we have to be mindful of whether we conducted an observational study or an experiment. With an **observational study**, the independent variable is not under the control of the researchers, and so we are *observing* those taking part in our study (think about studies on smoking—we can't force people to smoke). The fact that we can't control the independent variable means that we *cannot* conclude causation.

With an **experiment**, we are able to directly influence the independent variable and randomly assign subjects to the control and test groups, such as A/B tests (for anything from website redesigns to ad copy). Note that the control group doesn't receive treatment; they can be given a placebo (depending on what the study is). The ideal setup for this is **double-blind**, where the researchers administering the treatment don't know which treatment is the placebo and also don't know which subject belongs to which group.

Important note

We can often find reference to Bayesian inference and frequentist inference. These are based on two different ways of approaching probability. Frequentist statistics focuses on the frequency of the event, while Bayesian statistics uses a degree of belief when determining the probability of an event. We will see an example of Bayesian statistics in *Chapter 11*, *Machine Learning Anomaly Detection*. You can read more about how these methods differ at https://www.probabilisticworld.com/frequentist-bayesian-approaches-inferential-statistics/.

Inferential statistics gives us tools to translate our understanding of the sample data to a statement about the population. Remember that the sample statistics we discussed earlier are estimators for the population parameters. Our estimators need **confidence intervals**, which provide a point estimate and a margin of error around it. This is the range that the true population parameter will be in at a certain **confidence level**. At the 95% confidence level, 95% of the confidence intervals that are calculated from random samples of the population contain the true population parameter. Frequently, 95% is chosen for the confidence level and other purposes in statistics, although 90% and 99% are also common; the higher the confidence level, the wider the interval.

Hypothesis tests allow us to test whether the true population parameter is less than, greater than, or not equal to some value at a certain **significance level** (called **alpha**). The process of performing a hypothesis test starts with stating our initial assumption or **null hypothesis**: for example, *the true population mean is 0*. We pick a level of statistical significance, usually 5%, which is the probability of rejecting the null hypothesis when it is true. Then, we calculate the critical value for the test statistic, which will depend on the amount of data we have and the type of statistic (such as the mean of one population or the proportion of votes for a candidate) we are testing. The critical value is compared to the test statistic from our data and, based on the result, we either reject or fail to reject the null hypothesis. Hypothesis tests are closely related to confidence intervals. The significance level is equivalent to 1 minus the confidence level. This means that a result is statistically significant if the null hypothesis value is not in the confidence interval.
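As a small illustration of what a hypothesis test looks like in code (a minimal sketch using SciPy; the sample is simulated, and the null hypothesis is that the true population mean is 0):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.5, scale=1, size=30)  # drawn from a population whose true mean is 0.5

# Two-sided one-sample t-test against the null hypothesis that the mean is 0
t_statistic, p_value = stats.ttest_1samp(sample, popmean=0)

# Reject the null hypothesis if the p-value is below our significance level (alpha)
print(p_value < 0.05)
```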

Important note

There are many things we have to be aware of when picking the method to calculate a confidence interval or the proper test statistic for a hypothesis test. This is beyond the scope of this book, but check out the link in the *Further reading* section at the end of this chapter for more information. Also, be sure to look at some of the mishaps with the p-values used in hypothesis testing, such as p-hacking, at https://en.wikipedia.org/wiki/Misuse_of_p-values.

Now that we have an overview of statistics and data analysis, we are ready to get started with the Python portion of this book. Let's start by setting up a virtual environment.

# Setting up a virtual environment

This book was written using Python 3.7.3, but the code should work for Python 3.7.1+, which is available on all major operating systems. In this section, we will go over how to set up the virtual environment in order to follow along with this book. If Python isn't already installed on your computer, read through the following sections on virtual environments first, and then decide whether to install Anaconda, since it will also install Python. To install Python without Anaconda, download it from https://www.python.org/downloads/, and then follow the *venv* section instead of the *conda* section.

Important note

To check whether Python is already installed, run `where python3` from the command line on Windows or `which python3` from the command line on Linux/macOS. If this returns nothing, try running it with just `python` (instead of `python3`). If Python is installed, check the version by running `python3 --version`. Note that if `python3` works, then you should use that throughout the book (and conversely, use `python` if `python3` doesn't work).

## Virtual environments

Most of the time, when we want to install software on our computer, we simply download it, but the nature of programming languages where packages are constantly being updated and rely on specific versions of others means this can cause issues. We could be working on a project one day where we need a certain version of a Python package (say 0.9.1), but the next day be working on an analysis where we need the most recent version of that same package to access some newer functionality (1.1.0). Sounds like there wouldn't be an issue, right? Well, what happens if this update causes a breaking change to the first project or another package in our project that relies on this one? This is a common enough problem that a solution already exists to prevent this from being an issue: virtual environments.

A **virtual environment** allows us to create separate environments for each of our projects. Each of our environments will only have the packages that it needs installed. This makes it easy to share our environment with others, have multiple versions of the same package installed on our machine for different projects without interfering with each other, and avoid unexpected side effects from installing packages that update or have dependencies on others. It's good practice to make a dedicated virtual environment for any projects we work on.

We will discuss two common ways to achieve this setup, and you can decide which fits best. Note that all the code in this section will be executed on the command line.

### venv

Python 3 comes with the `venv` module, which will create a virtual environment in the location of our choice. The process of setting up and using a development environment is as follows (after Python is installed):

- Create a folder for the project.
- Use `venv` to create an environment in this folder.
- Activate the environment.
- Install Python packages in the environment with `pip`.
- Deactivate the environment when finished.

In practice, we will create environments for each project we work on, so our first step will be to create a directory for all of our project files. For this, we can use the `mkdir` command. Once this has been created, we will change our current directory to the newly created one using the `cd` command. Since we already obtained the project files (from the instructions in the *Chapter materials* section), the following is for reference only. To make a new directory and move to that directory, we can use the following command:

$ mkdir my_project && cd my_project

Tip

`cd <path>` changes the current directory to the path specified in `<path>`, which can be an **absolute** (full) or **relative** (how to get there from the current directory) path.

Before moving on, use `cd` to navigate to the directory containing this book's repository. Note that the path will depend on where it was cloned/downloaded:

$ cd path/to/Hands-On-Data-Analysis-with-Pandas-2nd-edition

Since there are slight differences between operating systems for the remaining steps, we will go over Windows and Linux/macOS separately. Note that if you have both Python 2 and Python 3, make sure you use `python3` and not `python` in the following commands.

#### Windows

To create our environment for this book, we will use the `venv` module from the standard library. Note that we must provide a name for our environment (`book_env`). Remember, if your Windows setup has `python` associated with Python 3, then use `python` instead of `python3` in the following command:

C:\...> python3 -m venv book_env

Now, we have a folder for our virtual environment named `book_env` inside the repository folder that we cloned/downloaded earlier. In order to use the environment, we need to activate it:

C:\...> %cd%\book_env\Scripts\activate.bat

Tip

Windows replaces `%cd%` with the path to the current directory. This saves us from having to type the full path up to the `book_env` part.

Note that after we activate the virtual environment, we can see `(book_env)` in front of our prompt on the command line; this lets us know we are in the environment:

(book_env) C:\...>

When we are finished using the environment, we simply deactivate it:

(book_env) C:\...> deactivate

Any packages that are installed in the environment don't exist outside the environment. Note that we no longer have `(book_env)` in front of our prompt on the command line. You can read more about `venv` in the Python documentation at https://docs.python.org/3/library/venv.html.

Now that the virtual environment is created, activate it and then head to the *Installing the required Python packages* section for the next step.

#### Linux/macOS

To create our environment for this book, we will use the `venv` module from the standard library. Note that we must provide a name for our environment (`book_env`):

$ python3 -m venv book_env

Now, we have a folder for our virtual environment named `book_env` inside the repository folder we cloned/downloaded earlier. In order to use the environment, we need to activate it:

$ source book_env/bin/activate

Note that after we activate the virtual environment, we can see `(book_env)` in front of our prompt on the command line; this lets us know we are in the environment:

(book_env) $

When we are finished using the environment, we simply deactivate it:

(book_env) $ deactivate

Any packages that are installed in the environment don't exist outside the environment. Note that we no longer have `(book_env)` in front of our prompt on the command line. You can read more about `venv` in the Python documentation at https://docs.python.org/3/library/venv.html.

Now that the virtual environment is created, activate it and then head to the *Installing the required Python packages* section for the next step.

### conda

Anaconda provides a way to set up a Python environment specifically for data science. It includes some of the packages we will use in this book, along with several others that may be necessary for tasks that aren't covered in this book (and also deals with dependencies outside of Python that might be tricky to install otherwise). Anaconda uses `conda` as the environment and package manager instead of `pip`, although packages can still be installed with `pip` (as long as the `pip` installed by Anaconda is called). Note that some packages may not be available with `conda`, in which case we will have to use `pip`. Consult this page in the `conda` documentation for a comparison of commands used with `conda`, `pip`, and `venv`: https://conda.io/projects/conda/en/latest/commands.html#conda-vs-pip-vs-virtualenv-commands.

Important note

Be warned that Anaconda is a very large install (although the Miniconda version is much lighter). Those who use Python for purposes aside from data science may prefer the `venv` method we discussed earlier in order to have more control over what gets installed.

Anaconda can also be packaged with the Spyder **integrated development environment** (**IDE**) and Jupyter Notebooks, which we will discuss later. Note that we can use Jupyter with the `venv` option as well.

You can read more about Anaconda and how to install it at the following pages in their official documentation:

- **Windows**: https://docs.anaconda.com/anaconda/install/windows/
- **macOS**: https://docs.anaconda.com/anaconda/install/mac-os/
- **Linux**: https://docs.anaconda.com/anaconda/install/linux/
- **User guide**: https://docs.anaconda.com/anaconda/user-guide/

Once you have installed either Anaconda or Miniconda, confirm that it is properly installed by running `conda -V` on the command line to display the version. Note that on Windows, all `conda` commands need to be run in **Anaconda Prompt** (as opposed to **Command Prompt**).

To create a new `conda` environment for this book, called `book_env`, run the following:

(base) $ conda create --name book_env

Running `conda env list` will show all the `conda` environments on the system, which will now include `book_env`. The current active environment will have an asterisk (`*`) next to it; by default, `base` will be active until we activate another environment:

(base) $ conda env list
# conda environments:
#
base      *  /miniconda3
book_env     /miniconda3/envs/book_env

To activate the `book_env` environment, we run the following command:

(base) $ conda activate book_env

Note that after we activate the virtual environment, we can see `(book_env)` in front of our prompt on the command line; this lets us know we are in the environment:

(book_env) $

When we are finished using the environment, we deactivate it:

(book_env) $ conda deactivate

Any packages that are installed in the environment don't exist outside the environment. Note that we no longer have `(book_env)` in front of our prompt on the command line. You can read more about how to use `conda` to manage virtual environments at https://www.freecodecamp.org/news/why-you-need-python-environments-and-how-to-manage-them-with-conda-85f155f4353c/.

In the next section, we will install the Python packages required for following along with this book, so be sure to activate the virtual environment now.

## Installing the required Python packages

We can do a lot with the Python standard library; however, we will often find the need to install and use an outside package to extend functionality. The `requirements.txt` file in the repository contains all the packages we need to install to work through this book. It will be in our current directory, but it can also be found at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/blob/master/requirements.txt. This file can be used to install a bunch of packages at once with the `-r` flag in the call to `pip3 install` and has the advantage of being easy to share.
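Requirements files follow the standard `pip` format of one package per line, optionally pinned to a specific version. The snippet below is purely a hypothetical illustration of that format (the version numbers are made up); the actual packages and pins are in the repository's `requirements.txt`:

```
pandas==1.0.1
matplotlib==3.1.3
```

A file like this can also be generated from an activated environment with `pip3 freeze > requirements.txt`, which records every installed package at its current version.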

Before installing anything, be sure to activate the virtual environment that you created with either `venv` or `conda`. Be advised that if the environment is not activated before running the following command, the packages will be installed outside the environment:

(book_env) $ pip3 install -r requirements.txt

Tip

If you encounter any issues, report them at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/issues.

## Why pandas?

When it comes to data science in Python, the `pandas` library is pretty much ubiquitous. It is built on top of the NumPy library, which allows us to perform mathematical operations on arrays of single-type data efficiently. Pandas expands this to **dataframes**, which can be thought of as tables of data. We will get a more formal introduction to dataframes in *Chapter 2*, *Working with Pandas DataFrames*.
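As a tiny preview of what that looks like (the column names and values below are invented purely for illustration), a dataframe can be built directly from a dictionary, where each key becomes a column:

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column in the table.
df = pd.DataFrame({
    'temp_c': [21.5, 23.0, 19.8],
    'humidity': [40, 35, 50]
})

print(df.shape)  # (3, 2) -> 3 rows and 2 columns
```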

Aside from efficient operations, `pandas` also provides **wrappers** around the `matplotlib` plotting library, making it very easy to create a variety of plots without needing to write many lines of `matplotlib` code. We can always tweak our plots using `matplotlib`, but for quickly visualizing our data, we only need one line of code in `pandas`. We will explore this functionality in *Chapter 5*, *Visualizing Data with Pandas and Matplotlib*, and *Chapter 6*, *Plotting with Seaborn and Customization Techniques*.
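As a rough sketch of that convenience (again using made-up data, and only a minimal example rather than anything from the book's notebooks), a single `pandas` call produces a plot that would otherwise take several lines of `matplotlib` setup:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data purely for illustration.
df = pd.DataFrame({'temp_c': [21.5, 23.0, 19.8], 'humidity': [40, 35, 50]})

# One line in pandas: draw a line plot of every numeric column.
df.plot(title='Hypothetical readings')

# We can still fine-tune the result with matplotlib afterward.
plt.ylabel('value')
plt.show()
```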

Important note

Wrapper functions wrap around code from another library, obscuring some of its complexity and leaving us with a simpler interface for repeating that functionality. This is a core principle of **object-oriented programming** (**OOP**) called **abstraction**, which reduces complexity and the duplication of code. We will create our own wrapper functions throughout this book.
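To make this concrete, here is a minimal sketch of a wrapper (the function name and defaults are invented for this example, not taken from the book's code) that hides repetitive `matplotlib` boilerplate behind a simpler interface:

```python
import matplotlib.pyplot as plt

def quick_hist(data, title, bins=10):
    """Hypothetical wrapper hiding repeated matplotlib calls behind one function."""
    fig, ax = plt.subplots()
    ax.hist(data, bins=bins)  # the wrapped matplotlib functionality
    ax.set(title=title, xlabel='value', ylabel='frequency')
    return ax

# Callers work with the simpler interface instead of the matplotlib details.
quick_hist([1, 2, 2, 3, 3, 3, 4], title='Example distribution')
plt.show()
```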

In addition to `pandas`, this book makes use of Jupyter Notebooks. While you are free to choose not to use them, it's important to be familiar with Jupyter Notebooks as they are very common in the data world. As an introduction, we will use a Jupyter Notebook to validate our setup in the next section.

## Jupyter Notebooks

Each chapter of this book includes Jupyter Notebooks for following along. Jupyter Notebooks are omnipresent in Python data science because they make it very easy to write and test code in more of a discovery environment compared to writing a program. We can execute one block of code at a time and have the results printed to the notebook, directly beneath the code that generated it. In addition, we can use **Markdown** to add text explanations to our work. Jupyter Notebooks can be easily packaged up and shared; they can be pushed to GitHub (where they will be rendered), converted into HTML or PDF, sent to someone else, or presented.

### Launching JupyterLab

JupyterLab is an IDE that allows us to create Jupyter Notebooks and Python scripts, interact with the terminal, create text documents, reference documentation, and much more from a clean web interface on our local machine. There are lots of keyboard shortcuts to master before really becoming a power user, but the interface is pretty intuitive. When we created our environment, we installed everything we needed to run JupyterLab, so let's take a quick tour of the IDE and make sure that our environment is set up properly. First, we activate our environment, and then we launch JupyterLab:

(book_env) $ jupyter lab

This will then launch a window in the default browser with JupyterLab. We will be greeted with the **Launcher** tab and the **File Browser** pane to the left:

Using the **File Browser** pane, double-click on the **ch_01** folder, which contains the Jupyter Notebook that we will use to validate our setup.

### Validating the virtual environment

Open the `checking_your_setup.ipynb` notebook in the **ch_01** folder, as shown in the following screenshot:

Important note

The **kernel** is the process that runs and introspects our code in a Jupyter Notebook. Note that we aren't limited to running Python—we can run kernels for R, Julia, Scala, and other languages as well. By default, we will be running Python using the IPython kernel. We will learn a little more about IPython throughout the book.

Click on the code cell indicated in the previous screenshot and run it by clicking the play (▶) button. If everything shows up in green, the environment is all set up. However, if this isn't the case, run the following command from the virtual environment to create a special kernel with the `book_env` virtual environment for use with Jupyter:

(book_env) $ ipython kernel install --user --name=book_env

This adds an additional option in the **Launcher** tab, and we can now switch to the `book_env` kernel from a Jupyter Notebook as well:

It's important to note that Jupyter Notebooks will retain the values we assign to variables while the kernel is running, and the results in the **Out[#]** cells will be saved when we save the file. Closing the file doesn't stop the kernel and neither does closing the JupyterLab tab in the browser.
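For instance (a tiny illustration, assuming a running Python kernel), a variable assigned in one cell remains available to any cell run afterward until the kernel is shut down or restarted:

```python
# Cell 1
x = 42

# Cell 2, run at any later point while the same kernel is alive
x + 1  # evaluates to 43 because the kernel still remembers x
```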

### Closing JupyterLab

Closing the browser with JupyterLab in it doesn't stop JupyterLab or the kernels it is running (we also won't get the command-line interface back). To shut down JupyterLab entirely, we need to hit *Ctrl* + *C* (which is a keyboard interrupt signal that lets JupyterLab know we want to shut it down) a couple of times in the terminal until we get the prompt back:

...
[I 17:36:53.166 LabApp] Interrupted...
[I 17:36:53.168 LabApp] Shutting down 1 kernel
[I 17:36:53.770 LabApp] Kernel shutdown: a38e1[...]b44f
(book_env) $

For more information about Jupyter, including a tutorial, check out http://jupyter.org/. Learn more about JupyterLab at https://jupyterlab.readthedocs.io/en/stable/.

# Summary

In this chapter, we learned about the main processes in conducting data analysis: data collection, data wrangling, EDA, and drawing conclusions. We followed that up with an overview of descriptive statistics and learned how to describe the central tendency and spread of our data; how to summarize it both numerically and visually using the 5-number summary, box plots, histograms, and kernel density estimates; how to scale our data; and how to quantify relationships between variables in our dataset.

We got an introduction to prediction and time series analysis. Then, we had a very brief overview of some core topics in inferential statistics that can be explored after mastering the contents of this book. Note that while all the examples in this chapter were of one or two variables, real-life data is often high-dimensional. *Chapter 10*, *Making Better Predictions – Optimizing Models*, will touch on some ways to address this. Lastly, we set up our virtual environment for this book and learned how to work with Jupyter Notebooks.

Now that we have built a strong foundation, we will start working with data in Python in the next chapter.

# Exercises

Run through the `introduction_to_data_analysis.ipynb` notebook for a review of this chapter's content, review the `python_101.ipynb` notebook (if needed), and then complete the following exercises to practice working with JupyterLab and calculating summary statistics in Python:

1. Explore the JupyterLab interface and look at some of the shortcuts that are available. Don't worry about memorizing them for now (eventually, they will become second nature and save you a lot of time); just get comfortable using Jupyter Notebooks.
2. Is all data normally distributed? Explain why or why not.
3. When would it make more sense to use the median instead of the mean for the measure of center?
4. Run the code in the first cell of the `exercises.ipynb` notebook. It will give you a list of 100 values to work with for the rest of the exercises in this chapter. Be sure to treat these values as a sample of the population.
5. Using the data from *exercise 4*, calculate the following statistics without importing anything from the `statistics` module in the standard library (https://docs.python.org/3/library/statistics.html), and then confirm your results match up to those that are obtained when using the `statistics` module (where possible):

   a) Mean

   b) Median

   c) Mode (hint: check out the `Counter` class in the `collections` module of the standard library at https://docs.python.org/3/library/collections.html#collections.Counter)

   d) Sample variance

   e) Sample standard deviation

6. Using the data from *exercise 4*, calculate the following statistics using the functions in the `statistics` module where appropriate:

   a) Range

   b) Coefficient of variation

   c) Interquartile range

   d) Quartile coefficient of dispersion

7. Scale the data created in *exercise 4* using the following strategies:

   a) Min-max scaling (normalizing)

   b) Standardizing

8. Using the scaled data from *exercise 7*, calculate the following:

   a) The covariance between the standardized and normalized data

   b) The Pearson correlation coefficient between the standardized and normalized data (this is actually 1, but due to rounding along the way, the result will be slightly less)

# Further reading

The following are some resources that you can use to become more familiar with Jupyter:

- *Jupyter Notebook Basics*: https://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb
- *JupyterLab introduction*: https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906
- *Learning Markdown to make your Jupyter Notebooks presentation-ready*: https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed
- *28 Jupyter Notebook Tips, Tricks, and Shortcuts*: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

Some resources for learning more advanced concepts of statistics (that we won't cover here) and carefully applying them are as follows:

- *A Gentle Introduction to Normality Tests in Python*: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
- *How Hypothesis Tests Work: Confidence Intervals and Confidence Levels*: https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-confidence-intervals-levels/
- *Intro to Inferential Statistics (Making Predictions with Data) on Udacity*: https://www.udacity.com/course/intro-to-inferential-statistics--ud201
- *Lesson 4: Confidence Intervals (Penn State Elementary Statistics)*: https://online.stat.psu.edu/stat200/lesson/4
- *Seeing Theory: A visual introduction to probability and statistics*: https://seeing-theory.brown.edu/index.html
- *Statistics Done Wrong: The Woefully Complete Guide by Alex Reinhart*: https://www.statisticsdonewrong.com/
- *Survey Sampling Methods*: https://stattrek.com/survey-research/sampling-methods.aspx