Chapter 8: Outlier Detection Using Statistical Methods

In addition to missing data, as discussed in Chapter 7, Handling Missing Data, a common data issue you may face is the presence of outliers. Outliers can be point outliers, collective outliers, or contextual outliers. A point outlier is a single data point that deviates from the rest of the population, sometimes referred to as a global outlier. Collective outliers are groups of observations that, taken together, differ from the population and don't follow the expected pattern. Lastly, contextual outliers are observations considered outliers only under a particular condition or context, such as deviation from neighboring data points. Note that with contextual outliers, the same observation may not be considered an outlier if the context changes.

In this chapter, you will be introduced to a handful of practical statistical techniques that cover parametric and non-parametric methods. In Chapter 14, Outlier Detection...

Technical requirements

You can download the Jupyter Notebooks and needed datasets from the GitHub repository:

Throughout the chapter, you will be using a dataset from the Numenta Anomaly Benchmark (NAB), which provides outlier detection benchmark datasets. For more information about NAB, please visit their GitHub repository here: https://github.com/numenta/NAB.

The New York Taxi dataset captures the number of NYC taxi passengers at specific timestamps. The data contains known anomalies that are provided to evaluate the performance of our outlier detectors. The dataset contains 10,320 records from July 1, 2014, to January 31, 2015. The observations are captured at 30-minute intervals, which translates to freq = '30T...
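As a rough sketch (not the book's recipe code), the dataset can be loaded with pandas. The file path below is an assumption; adjust it to wherever you saved nyc_taxi.csv from the NAB repository:

```python
import pandas as pd

# Hypothetical path; adjust to where you saved nyc_taxi.csv from NAB
file_path = "datasets/nyc_taxi.csv"

# The NAB file has two columns: timestamp and value
df = pd.read_csv(file_path, index_col="timestamp", parse_dates=True)

print(df.shape)        # expect (10320, 1)
df = df.asfreq("30T")  # declare the 30-minute frequency explicitly
```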

Understanding outliers

The presence of outliers requires further investigation before you hastily decide how to handle them. First, you will need to detect and spot their existence, which is what this chapter is all about. Domain knowledge can be instrumental in determining whether the identified points are true outliers, what impact they have on your analysis, and how you should deal with them.

Outliers can indicate bad data arising from random variation in the process (known as noise), data entry errors, faulty sensors, or a flawed experiment; they can also be the result of natural variation. Outliers are usually undesirable if they are artifacts of bad data. On the other hand, if outliers are a natural part of the process, you may need to rethink removing them and opt to keep these data points. In such circumstances, you can rely on non-parametric statistical methods that make no assumptions about the underlying distribution.

Generally, outliers can cause side effects when building...

Resampling time series data

A typical transformation performed on time series data is resampling, which involves changing the frequency or level of granularity of the data.

Usually, you will have limited control over how the time series is generated in terms of frequency. For example, the data can be generated and stored in small intervals, such as milliseconds, minutes, or hours. In some cases, the data can be in larger intervals, such as daily, weekly, or monthly.

The need to resample a time series is driven by the nature of your analysis and the level of granularity you need the data at. For instance, you may have daily data, but your analysis requires it to be weekly, so you will need to resample; this process is known as downsampling. When downsampling, you need to provide an aggregation function, such as mean, sum, min, or max, to name a few. On the other hand, some situations require you to resample your data from daily to hourly...
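A minimal sketch of both directions, continuing from the loading example above with df assumed to hold the 30-minute observations:

```python
# Downsample the 30-minute observations to a daily mean; the choice
# of aggregation (mean, sum, min, max, ...) depends on your analysis
daily = df["value"].resample("D").mean()

# Downsample further to weekly totals
weekly = df["value"].resample("W").sum()

# Upsampling (e.g., daily back to hourly) creates gaps that must be
# filled; forward-filling the last known value is one common choice
hourly = daily.resample("H").ffill()
```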

Detecting outliers using visualizations

There are two general approaches for using statistical techniques to detect outliers: parametric and non-parametric methods. Parametric methods assume you know the underlying distribution of the data, for example, that the data follows a normal distribution. Non-parametric methods, on the other hand, make no such assumptions.

Histograms and box plots are basic non-parametric techniques that can provide insight into the distribution of the data and the presence of outliers. More specifically, box plots, also known as box and whisker plots, provide a five-number summary: the minimum, first quartile (25th percentile), median (50th percentile), third quartile (75th percentile), and the maximum. Implementations differ in how far the whiskers extend; for example, the whiskers can extend to the minimum and maximum values. In most statistical software, including Python's matplotlib and seaborn libraries, the whiskers...
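A minimal sketch of both plots, assuming daily is the daily average series from the resampling sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Histogram: shows the shape of the distribution and its tails
sns.histplot(daily, ax=ax1)
ax1.set_title("Histogram")

# Box plot: five-number summary; points drawn beyond the whiskers
# are candidate outliers
sns.boxplot(x=daily, ax=ax2)
ax2.set_title("Box plot")

plt.show()
```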

Detecting outliers using the Tukey method

This recipe builds on the previous recipe, Detecting outliers using visualizations. In Figure 8.5, the box plot showed the quartiles with whiskers extending to the upper and lower fences. These boundaries, or fences, were calculated using the Tukey method.

Let's expand on Figure 8.5 with additional information on the other components:

Figure 8.10 – Box plot for the daily average taxi passengers data

Visualizations are great for giving you a high-level perspective on the data you are working with, such as the overall distribution and potential outliers. Ultimately, though, you want to identify outliers programmatically so you can isolate them for further investigation and analysis. This recipe will teach you how to calculate the interquartile range (IQR) and define points that fall outside the lower and upper Tukey fences.

How to do it...

Most statistical methods allow you to spot extreme values beyond a certain threshold...
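A minimal sketch of the Tukey fence calculation, assuming daily is a pandas Series of daily average passenger counts and using the conventional 1.5 × IQR multiplier:

```python
# Tukey method: fences sit 1.5 * IQR beyond the first/third quartiles
q1 = daily.quantile(0.25)
q3 = daily.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers
outliers = daily[(daily < lower_fence) | (daily > upper_fence)]
print(outliers)
```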

Detecting outliers using a z-score

The z-score is a common transformation for standardizing data, and it is useful when you want to compare different datasets. For example, it is easier to compare two data points from two different datasets relative to their distributions. This is possible because the z-score standardizes the data to be centered around a zero mean, with units representing standard deviations away from the mean. For example, in our dataset, the unit is daily taxi passengers (in thousands). Once you apply the z-score transformation, you are no longer dealing with the number of passengers; instead, the units represent standard deviations, which tell us how far an observation is from the mean. Here is the formula for the z-score:

$$z = \frac{x - \mu}{\sigma}$$

Where $x$ is a data point (an observation), $\mu$ (mu) is the mean of the dataset, and $\sigma$ (sigma) is the standard deviation of the dataset.
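A minimal sketch of the computation, assuming daily is the series used in the earlier sketches; the cutoff of 3 standard deviations is a common convention, not a fixed rule:

```python
# Standardize: zero mean, units of standard deviation
mean = daily.mean()
std = daily.std()
zscores = (daily - mean) / std

# Flag observations more than 3 standard deviations from the mean
outliers = daily[zscores.abs() > 3]
print(outliers)
```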

Keep in mind that the z-score is a lossless transformation, which...

Detecting outliers using a modified z-score

In the Detecting outliers using a z-score recipe, you saw how simple and intuitive the method is. But it has one major drawback: it assumes your data is normally distributed.

But what if your data is not normally distributed? Luckily, there is a modified version of the z-score that works with non-normal data. The main difference between the regular z-score and the modified z-score is that the mean is replaced with the median:

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}}$$

Where $\tilde{x}$ (x tilde) is the median of the dataset, and MAD is the median absolute deviation of the dataset:

$$\text{MAD} = \text{median}\left(\left|x_i - \tilde{x}\right|\right)$$

The 0.6745 value is the standard deviation unit that corresponds to the 75th percentile (Q3) in a Gaussian distribution, and it is used as a normalization factor; in other words, it rescales MAD to approximate the standard deviation. This way, the units you obtain from this method are measured in standard deviations, similar to how you would interpret the regular...
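A minimal sketch, again assuming daily from the earlier examples; the 3.5 cutoff follows the threshold commonly recommended by Iglewicz and Hoaglin, but treat it as tunable:

```python
# Modified z-score: robust to non-normal data because it uses the
# median and MAD instead of the mean and standard deviation
median = daily.median()
mad = (daily - median).abs().median()   # median absolute deviation

modified_z = 0.6745 * (daily - median) / mad

# Flag observations whose modified z-score exceeds 3.5 in magnitude
outliers = daily[modified_z.abs() > 3.5]
print(outliers)
```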
