You're reading from The Data Visualization Workshop

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800568846
Edition: 1st Edition
Languages: Python
Tools: Jupyter
Concepts: Data Visualization
Authors (2): Mario Döbler, Tim Großmann

Overview of Statistics

Statistics is a combination of the analysis, collection, interpretation, and representation of numerical data. Probability is a measure of the likelihood that an event will occur and is quantified as a number between 0 and 1.

A probability distribution is a function that assigns a probability to every possible event, and it is frequently used in statistical analysis. The higher the probability, the more likely the event. There are two types of probability distributions: discrete and continuous.

A discrete probability distribution shows all the values that a random variable can take, together with their probability. The following diagram illustrates an example of a discrete probability distribution. If we have a six-sided die, we can roll each number between 1 and 6. We have six events that can occur based on the number that's rolled. There is an equal probability of rolling any of the numbers, and the individual probability of any of the six events occurring is 1/6:

Figure 1.3: Discrete probability distribution for die rolls
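The die example can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are our own, and `Fraction` is used so that the probabilities stay exact instead of being rounded floats.

```python
from fractions import Fraction

# Each face of a fair six-sided die has the same probability, 1/6.
faces = range(1, 7)
die_distribution = {face: Fraction(1, 6) for face in faces}

# The probabilities of a discrete distribution sum to exactly 1.
total_probability = sum(die_distribution.values())

# The probability of a compound event (here: rolling an even number)
# is the sum of the probabilities of its outcomes.
p_even = sum(p for face, p in die_distribution.items() if face % 2 == 0)

print(die_distribution[3])  # 1/6
print(total_probability)    # 1
print(p_even)               # 1/2
```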

A continuous probability distribution describes how probability is spread over the possible values of a continuous random variable. Because such a variable can take infinitely many values, probabilities are assigned to ranges of values rather than to single values. The following diagram provides an example of a continuous probability distribution. This example illustrates the distribution of the time needed to drive home. In most cases, around 60 minutes is needed, but sometimes less time is needed because there is no traffic, and sometimes much more time is needed because of traffic jams:

Figure 1.4: Continuous probability distribution for the time taken to reach home
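To make the idea of probabilities over ranges concrete, here is a small sketch using Python's standard library. The normal distribution and its parameters (mean 60 minutes, standard deviation 10 minutes) are assumptions chosen purely for illustration. The text does not say which distribution the drive-home times follow, and the real one is probably skewed toward longer times.

```python
from statistics import NormalDist

# Assumption for illustration only: travel time is modelled as a normal
# distribution with a mean of 60 minutes and a standard deviation of 10.
travel_time = NormalDist(mu=60, sigma=10)

# The probability of a range is the difference of the cumulative
# distribution function (CDF) at its two endpoints.
p_50_to_70 = travel_time.cdf(70) - travel_time.cdf(50)

# The density at a single point is not a probability; it only tells us
# how concentrated the distribution is around that value.
density_at_60 = travel_time.pdf(60)

print(f"P(50 <= time <= 70) = {p_50_to_70:.2f}")  # about 0.68
print(f"density at 60 min   = {density_at_60:.4f}")
```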

Measures of Central Tendency

Measures of central tendency are often called averages and describe central or typical values for a probability distribution. We are going to discuss three kinds of averages in this chapter:

Mean: The arithmetic average is computed by summing up all measurements and dividing the sum by the number of observations. The mean is calculated as follows:

Figure 1.5: Formula for mean

Median: This is the middle value of the ordered dataset. If there is an even number of observations, the median is the average of the two middle values. The median is less sensitive to outliers than the mean. Outliers are values that lie far away from the rest of the data.
Mode: Our last measure of central tendency, the mode is defined as the most frequent value. There may be more than one mode in cases where multiple values are equally frequent.

For example, a die was rolled 10 times, and we got the following numbers: 4, 5, 4, 3, 4, 2, 1, 1, 2, and 1.

The mean is calculated by summing all the values and dividing the sum by the number of observations: (4+5+4+3+4+2+1+1+2+1)/10 = 27/10 = 2.7.

To calculate the median, the die rolls have to be ordered according to their values. The ordered values are as follows: 1, 1, 1, 2, 2, 3, 4, 4, 4, 5. Since we have an even number of die rolls, we need to take the average of the two middle values. The average of the two middle values is (2+3)/2=2.5.

The modes are 1 and 4 since they are the two most frequent events.
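The three results above can be checked with Python's built-in `statistics` module. This is a minimal sketch; `multimode` requires Python 3.8 or later, and the variable names are our own.

```python
from statistics import mean, median, multimode

# The ten die rolls from the example.
rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

roll_mean = mean(rolls)      # sum of values divided by their count
roll_median = median(rolls)  # average of the two middle values (even count)

# multimode returns every value that shares the highest frequency,
# in order of first appearance, so we sort for a stable display.
roll_modes = sorted(multimode(rolls))

print(roll_mean)    # 2.7
print(roll_median)  # 2.5
print(roll_modes)   # [1, 4]
```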

Measures of Dispersion

Dispersion, also called variability, is the extent to which a probability distribution is stretched or squeezed.

The different measures of dispersion are as follows:

Variance: The variance is the expected value of the squared deviation from the mean. It describes how far a set of numbers is spread out from their mean. Variance is calculated as follows:

Figure 1.6: Formula for variance

Standard deviation: This is the square root of the variance.
Range: This is the difference between the largest and smallest values in a dataset.
Interquartile range: Also called the midspread or middle 50%, this is the difference between the 75th and 25th percentiles, or between the upper and lower quartiles.
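As a rough sketch, the four measures can be computed for the die rolls from the previous section with the standard library. Two choices here are assumptions rather than something the text specifies: the population formulas (`pvariance`, `pstdev`, which divide by the number of observations) and the `"inclusive"` quartile method. Other quartile conventions give slightly different interquartile ranges.

```python
from statistics import pstdev, pvariance, quantiles

# The ten die rolls used in the central tendency example.
rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

# Population variance: mean of the squared deviations from the mean.
variance = pvariance(rolls)

# Standard deviation: square root of the variance.
std_dev = pstdev(rolls)

# Range: largest value minus smallest value.
value_range = max(rolls) - min(rolls)

# Interquartile range: 75th percentile minus 25th percentile.
# quantiles(..., n=4) returns the three quartile cut points.
q1, _, q3 = quantiles(rolls, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance = {variance:.2f}")  # 2.01
print(f"std dev  = {std_dev:.2f}")   # 1.42
print(f"range    = {value_range}")   # 4
print(f"IQR      = {iqr}")           # 2.75
```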

Correlation

The measures we have discussed so far only considered single variables. In contrast, correlation describes the statistical relationship between two variables:

In a positive correlation, both variables move in the same direction.
In a negative correlation, the variables move in opposite directions.
In zero correlation, the variables show no linear relationship.
Note
One thing you should be aware of is that correlation does not imply causation. Correlation describes the relationship between two or more variables, while causation describes how one event is caused by another. For example, consider a scenario in which ice cream sales are correlated with the number of drowning deaths. But that doesn't mean that ice cream consumption causes drowning. There could be a third variable, say temperature, that may be responsible for this correlation. Higher temperatures may cause an increase in both ice cream sales and more people engaging in swimming, which may be the real reason for the increase in deaths due to drowning.
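One common way to put a number on correlation is the Pearson correlation coefficient, which ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). The text does not name a specific coefficient, so using Pearson here is our assumption. The data below is made up purely for illustration, and `pearson_correlation` is a hypothetical helper, not a library function.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical values, invented for this sketch.
temperature = [18, 22, 25, 30, 33]
ice_cream_sales = [2 * t + 10 for t in temperature]  # rises with temperature
heating_cost = [100 - 3 * t for t in temperature]    # falls with temperature

positive = pearson_correlation(temperature, ice_cream_sales)
negative = pearson_correlation(temperature, heating_cost)

print(f"{positive:.2f}")  # 1.00
print(f"{negative:.2f}")  # -1.00
```

A coefficient close to +1 or -1 only says that the two series move together. As the note above stresses, it says nothing about whether one causes the other.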

Example

Suppose you want to find a decent apartment to rent that is not too expensive compared to other apartments you've found. The other apartments (all in the same locality) that you found on a website are priced as follows: $700, $850, $1,500, and $750 per month. Let's calculate some statistical measures to help us make a decision:

The mean is ($700 + $850 + $1,500 + $750) / 4 = $950.
The median is ($750 + $850) / 2 = $800.
The standard deviation is approximately $322.10 (using the population formula, which divides by the number of observations).
The range is $1,500 - $700 = $800.

As an exercise, you can try to calculate the variance as well. Note that, of the values above, the median ($800) is the better measure of a typical rent in this case because it is less sensitive to outliers (here, the $1,500 apartment). Given that all the apartments are in the same locality, the apartment costing $1,500 is clearly priced much higher than the others. A simple statistical analysis has helped us narrow down our choices.
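The apartment figures can be reproduced with a short script. This is a sketch under one assumption: the standard deviation is computed with the population formula (`pstdev`, dividing by the number of observations). The sample formula (`stdev`) would give a larger value.

```python
from statistics import mean, median, pstdev

# Monthly rents of the four apartments, in dollars.
rents = [700, 850, 1500, 750]

rent_mean = mean(rents)
rent_median = median(rents)
rent_std = pstdev(rents)  # population standard deviation (assumption)
rent_range = max(rents) - min(rents)

print(f"mean   = ${rent_mean:,.2f}")    # $950.00
print(f"median = ${rent_median:,.2f}")  # $800.00
print(f"stdev  = ${rent_std:,.2f}")     # $322.10
print(f"range  = ${rent_range:,.2f}")   # $800.00
```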

Types of Data

It is important to understand what kind of data you are dealing with so that you can select both the right statistical measure and the right visualization. We categorize data as categorical/qualitative and numerical/quantitative. Categorical data describes characteristics, for example, the color of an object or a person's gender. We can further divide categorical data into nominal and ordinal data. In contrast to nominal data, ordinal data has an order.

Numerical data can be divided into discrete and continuous data. We speak of discrete data if the data can only have certain values, whereas continuous data can take any value (sometimes limited to a range).

Another aspect to consider is whether the data has a temporal domain – in other words, is it bound to time or does it change over time? If the data is bound to a location, it might be interesting to show the spatial relationship, so you should keep that in mind as well. The following flowchart classifies the various data types:

Figure 1.7: Classification of types of data

Summary Statistics

In real-world applications, we often encounter enormous datasets. Therefore, summary statistics are used to summarize important aspects of data. They are necessary to communicate large amounts of information in a compact and simple way.

We have already covered measures of central tendency and dispersion, which are both summary statistics. It is important to know that measures of central tendency show a center point in a set of data values, whereas measures of dispersion show how much the data varies.

The following table gives an overview of which measure of central tendency is best suited to a particular type of data:

Figure 1.8: Best suited measures of central tendency for different types of data

In the next section, we will learn about the NumPy library and implement a few exercises using it.

Authors (2)

Mario Döbler

Mario Döbler is a Ph.D. student with a focus on deep learning at the University of Stuttgart. He previously interned at the Bosch Center for Artificial Intelligence in Silicon Valley in the field of deep learning, where he used state-of-the-art algorithms to develop cutting-edge products. In his master's thesis, he dedicated himself to applying deep learning to medical data to drive medical applications.

Tim Großmann

Tim Großmann is a computer scientist with an interest in diverse topics, ranging from AI and IoT to security. He previously worked in the field of big data engineering at the Bosch Center for Artificial Intelligence in Silicon Valley. In addition, he worked on an Eclipse project for IoT device abstractions in Singapore. He is highly involved in several open-source projects and actively speaks at tech meetups and conferences about his projects and experiences.