Mastering Python Data Visualization

Chapter 1. A Conceptual Framework for Data Visualization

The existence of the Internet and social media in modern times has led to an abundance of data, and data sizes are growing beyond imagination. How and when did this begin?

A decade ago, a new way of doing business evolved: corporations began collecting, combining, and crunching large amounts of data from sources throughout the enterprise. Their goal was to use a high volume of data to improve the decision-making process. Around that same time, corporations like Amazon, Yahoo, and Google, which handled large amounts of data, made significant headway. Those milestones led to the creation of several technologies supporting big data. We will not get into the details of big data, but will explore why many organizations have changed their ways, using similar ideas for better decision-making.

How exactly are these large amounts of data used for making better decisions? We will get to that eventually, but first let us try to understand the difference between data, information, and knowledge, and how they all relate to data visualization. One may wonder why we are talking about data, information, and knowledge at all. There is a storyline that connects how we start, what we start with, how all these things benefit the business, and the role of visualization. We will determine the required conceptual framework for data visualization by briefly reviewing the steps involved.

In this chapter, we will cover the following topics:

  • The difference between data, information, knowledge, and insight

  • The transformation of information into knowledge, and further, to insight

  • Collecting, processing, and organizing data

  • The history of data visualization

  • How does visualizing data help decision-making?

  • Visualization plots

 

Data, information, knowledge, and insight


The terms data, information, and knowledge are used extensively in the context of computer science. There are many definitions of these terms, often conflicting and inconsistent. Before we dive into these definitions, we will understand how these terms are related to visualization. The primary objective of data visualization is to gain insight (hidden truth) into the data or information. The whole discussion about data, knowledge, and insight in this book is within the context of computer science, and not psychology or cognitive science. For the cognitive context, one may refer to https://www.ucsf.edu/news/2014/05/114321/converting-data-knowledge-insight-and-action.

Data

The term data implies a premise from which one may draw conclusions. Though data and information appear to be interrelated in a certain context, data actually refers to discrete, objective facts in a digital form. Data is the basic building block that, when organized and arranged in different ways, leads to information that is useful in answering questions about the business.

Data can be something very simple, yet voluminous and unorganized. This discrete data cannot be used to make decisions on its own because it has no meaning and, more importantly, because there is no structure or relationship within it. The process by which data is collected, transmitted, and stored varies widely with the types of data and storage methods. Data comes in many forms; some notable forms are listed as follows:

  • CSV files

  • Database tables

  • Document formats (Excel, PDF, Word, and so on)

  • HTML files

  • JSON files

  • Text files

  • XML files

Information

Information is processed data presented as an answer to a business question. Data becomes information when we add a relationship or an association. The association is accomplished by providing a context or background to the data. The background is helpful because it allows us to answer questions about the data.

For example, let us assume that the data given for a basketball player includes height, weight, position, college, date of birth, draft pick, draft round, NBA debut, and recruiting rank. The answer to the question, "Who is the first draft pick with a height of more than six feet who plays in the point guard position?" is information.

Similarly, each player's score is one piece of data. The answer to the question "Who has the highest points per game this year, and what is his average?" is "LeBron James, 27.47", which is also information.
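
As a small sketch of this idea, the following snippet turns raw rows of data into information by asking questions of it. The player records here are invented purely for illustration:

import pandas as pd

# Raw data: discrete, objective facts about players (hypothetical values).
players = pd.DataFrame({
    "name": ["A. Guard", "B. Forward", "C. Center"],
    "height_in": [75, 80, 84],   # heights in inches
    "position": ["PG", "SF", "C"],
    "draft_pick": [1, 12, 3],
    "ppg": [27.4, 18.2, 21.9],   # points per game
})

# Information: the answer to a question asked of the data.
tall_pg_first_pick = players[(players.draft_pick == 1) &
                             (players.height_in > 72) &   # taller than six feet
                             (players.position == "PG")]
print(tall_pg_first_pick["name"].tolist())

# Information: who has the highest points per game?
print(players.loc[players.ppg.idxmax(), ["name", "ppg"]])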

Knowledge

Knowledge emerges when humans interpret and organize information and use that to drive decision-making. Knowledge is the data, information, and the skills acquired through experience. Knowledge comprises the ability to make the appropriate decision as well as the skills to execute it.

The essential ingredient—connecting the data—allows us to understand the relative importance of each piece of information. By comparing results from the past and by recognizing patterns, we don't have to build a solution to a problem from scratch. The following diagram summarizes the concepts of data, information, and knowledge:

Knowledge changes incrementally, particularly when information is rearranged or reorganized or when some computing algorithm changes. Knowledge is like an arrow pointing to the results of an algorithm that depends on past information derived from data. In many instances, knowledge is also gained by visually interacting with the results. Insight, on the other hand, opens the way to the future.

Data analysis and insight

Before we dive into the definition of insight and how it relates to business, let us see how the idea of capturing insight ever began. For over a decade, organizations have been struggling to make sense of all the data and information they have, particularly with the exploding data size. They all realized the importance of data analysis (also known as data analytics or analytics) in order to arrive at an optimal or realistic business decision based on existing data and information.

Analytics hinges upon mathematical algorithms to determine the relationships within data that can yield insight. One simple way to understand insight is by considering an analogy: when data lacks structure and proper alignment with the business, converting it to a more structured form and aligning it more closely to the business goals gives a clearer and deeper understanding. Insight is that "eureka" moment when a breakthrough result emerges. One should not confuse the terms Analytics and Business Intelligence: Analytics has predictive capabilities, while Business Intelligence provides results based on the analysis of historical data.

Analytics is usually applicable to a broader spectrum of data and, for this reason, it is very common for data collaboration to happen internally and/or externally. In some business paradigms, the collaboration happens only internally, across an extensive collection of datasets, but in most other cases an external connection helps in connecting the dots or completing the puzzle. Two of the most common sources of external data are social media and the consumer base.

Later in this chapter, we refer to real-life business stories that achieved some remarkable results by applying analytics to gain insight and drive business value, improve decision-making, and understand their customers better.

 

The transformation of data


By now we know what data is, but now the question is: what is the purpose of collecting data? Data is useful for describing a physical or social phenomenon and to further answer questions about that phenomenon. For this reason, it is important to ensure that the data is not faulty, inaccurate, or incomplete; otherwise, the responses based on that data will also not be accurate or complete.

There are different categories of data, some of which are past performance data, experimental data, and benchmark data. Past performance data and experimental data are pretty self-explanatory. Benchmark data, on the other hand, is data that compares the characteristics of two different items or products to a standard measure. Data gets transformed into information, is processed further, and is then used for answering questions. It is apparent, therefore, that our next step is to achieve that transformation.

Transforming data into information

Data is collected and stored in several different forms depending on the content and its significance. For instance, if the data is about playoff basketball games, then it will be in a text and video format. Another example is the temperature recordings from all the cities of a country, collected and made accessible via different formats. The transformation from data to information involves collection, processing, and organization of data as shown in the following diagram:

The collected data needs some processing and organizing, which later may or may not have a structure, model, or a pattern. However, this process at least gives us an organized way of finding answers to questions about the data. The process could be a simple sorting based on the total points scored by basketball players or a sorting based on the names of the city and state.

The transformation from data to information could also be a little more than just sorting such as statistical modeling or a computational algorithm. It is this transformation from data to information that is really important and enables the data to be queried, accessed, and manipulated. In some cases, when there is a vast and divergent amount of data, the transformation may involve processing methods such as filtering, aggregating, applying correlation, scaling and normalizing, and classifying.
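
As a minimal sketch of these processing methods, the following snippet (with invented temperature readings) applies filtering, aggregation, and min-max scaling using pandas:

import pandas as pd

# Hypothetical raw temperature readings for two cities.
data = pd.DataFrame({
    "city": ["Davis", "Davis", "Fresno", "Fresno"],
    "temp_f": [61.0, 75.0, 58.0, 88.0],
})

# Filtering: keep only the readings above 60 degrees Fahrenheit.
warm = data[data.temp_f > 60]

# Aggregating: the mean temperature per city.
means = data.groupby("city")["temp_f"].mean()

# Scaling/normalizing: map the readings onto the range [0, 1].
span = data.temp_f.max() - data.temp_f.min()
data["temp_scaled"] = (data.temp_f - data.temp_f.min()) / span

print(warm, means, data, sep="\n\n")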

Data collection

Data collection is a time-consuming process, so businesses are looking for better ways to automate data capture. However, manual data collection is still prevalent for many processes. Modern automated data collection uses input devices such as sensors. For instance, underwater coral reefs are monitored via sensors; agriculture is another area where sensors are used for monitoring soil properties, controlling irrigation, and choosing fertilization methods.

Another way to collect data automatically is by scanning documents and log files, which is a form of server-side data collection. Manual processes include data collection via web-based methods that get stored in the database, which can then be transformed into information. Nowadays, web-based collaborative environments are benefiting from improved communication and sharing of data.

Traditional visualization and visual analytic tools are typically designed for a single user interacting with a visualization application on a single machine. Extending these tools to include support for collaboration has clearly come a long way towards increasing the scope and applicability of visualizations in the real world.

Data preprocessing

Today, data is highly susceptible to noise and inconsistency due to its size and its likely origin from multiple, heterogeneous sources and types. There are several data preprocessing techniques, such as data cleaning, data integration, data reduction, and data transformation. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges and combines data from multiple sources into a coherent store, most commonly a data warehouse. Data reduction can reduce data size by, for instance, merging, aggregating, and eliminating redundant features. Data transformations may be applied where data is scaled to fall within a smaller range, thus improving the accuracy and efficiency of processing and visualizing it. The transformation cycle of data is shown in the following diagram:

Anomaly detection is the identification of unusual data that does not fall into an expected behavior or pattern in the collected data. Anomalies are also known as outliers or noise; for example, in signal data, a particular signal that is unusual is considered noise, and in transaction data, an outlier is a fraudulent transaction. Accurate data collection is essential for maintaining the integrity of data. While anomalies are often a nuisance, outliers can also be significantly valuable—for instance, when one wants to find fraudulent insurance claims.
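
A common rule of thumb for flagging outliers is the 1.5 x IQR fence. The following minimal sketch, using invented transaction amounts, illustrates it with NumPy:

import numpy as np

# Hypothetical transaction amounts; 9999.0 is a planted anomaly.
amounts = np.array([12.5, 18.0, 11.2, 14.7, 9999.0, 16.3, 13.1])

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Values outside the fences are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)  # -> [ 9999.]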

Data processing

Data processing is a significant step in the transformation process. It is imperative that the focus be on data quality. Some processing steps that help in preparing data for analyzing and understanding it better are dependency modeling and clustering. There are other processing techniques, but we will limit our discussion here to these two popular methods.

Dependency modeling is the fundamental principle of modeling data to determine the nature and structure of the representation. This process searches for relationships between the data elements; for example, a department store might gather data on the purchasing habits of its customers. This process helps the department store deduce the information about frequent purchases.

Clustering is the task of discovering groups in the data that have, in some way or another, a "similar pattern", without using known structures in the data.
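
As a minimal sketch of clustering (assuming scikit-learn is installed; the customer figures are invented), k-means can discover the groups without being given any labels:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [visits per month, average basket in $].
customers = np.array([
    [2, 15], [3, 18], [2, 12],       # occasional shoppers, small baskets
    [12, 95], [14, 110], [11, 87],   # frequent shoppers, large baskets
])

# Discover two groups without using any known structure in the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)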

Organizing data

Database management systems allow users to store data in a structured format. However, such databases are often too large to fit into memory. There are two ways of structuring data:

  • Storing large data in disks in a structured format like tables, trees, or graphs

  • Storing data in memory using data structure formats for faster access

A data structure comprises a set of different formats for structuring data to be able to store and access it. The general data structure types are arrays, files, tables, trees, lists, maps, and so on. Any data structure is designed to organize the data to suit a specific purpose so that it can be stored, accessed, and manipulated at runtime. A data structure may be selected or designed to store data for the purpose of working on it with various algorithms for faster access.
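
A tiny illustration of how the choice of structure serves the access pattern: the same records stored as a list support ordered traversal, while a map (dict) supports fast lookup by key.

# The same records organized two ways; the choice depends on access pattern.
rows = [("ca", "Sacramento"), ("ny", "Albany"), ("tx", "Austin")]

# A list preserves sequence, but lookup by key is a linear scan: O(n).
capital = next(city for state, city in rows if state == "ny")

# A dict gives O(1) average-time lookup by key.
by_state = dict(rows)
assert by_state["ny"] == capital == "Albany"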

Data that is collected, processed, and organized to be stored efficiently is much easier to understand, which leads to information that can be better understood.

Getting datasets

For readers who do not have access to organizational data, there are plenty of resources on the Internet with rich datasets from several different sources.

Transforming information into knowledge

Information is quantifiable and measurable; it has a shape and can be accessed, generated, stored, distributed, searched for, compressed, and duplicated. It is quantifiable by the volume or amount of information.

Information transforms into knowledge by the application of discrete algorithms, and knowledge is expected to be more qualitative than information. In some problem domains, knowledge continues to go through an evolving cycle. This evolution happens particularly when the data changes in real time.

Knowledge is like the recipe that lets you make bread out of information—in this analogy, the ingredients of flour and yeast. Another way to look at knowledge is as the combination of data and information, to which experience and expert opinion are added to aid decision making. Knowledge is not merely a result of filtering or algorithms.

What are the steps involved in this transformation, and how does the change happen? Naturally, it cannot happen by itself. Though the word information is subject to different interpretations based on the definition, we will explore it further within the context of computing.

A simple analogy illustrates the difference between information and knowledge: the course materials for a particular course provide the students with the necessary information about the concepts, and the teacher later helps the students understand the concepts through discussions. This helps the students in gaining knowledge about the course. By a similar process, something needs to be done to transform information into knowledge. The following diagram shows the transformation from information to knowledge:

As illustrated in the figure, information, when aggregated and run through discrete algorithms, is transformed into knowledge. The information needs to be aggregated to obtain broader knowledge. The knowledge obtained by this transformation helps in answering questions about the data or information, such as: in which quarter did the company have the maximum revenue from sales? How much has advertising driven the sales? How many new products have been released this year?
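
A minimal sketch of this kind of aggregation, using invented monthly sales figures, shows how a quarterly question is answered from monthly data:

import pandas as pd

# Hypothetical monthly revenue figures for one year.
sales = pd.DataFrame({
    "month": pd.date_range("2014-01-01", periods=12, freq="MS"),
    "revenue": [90, 85, 100, 110, 120, 115, 95, 90, 130, 140, 150, 160],
})

# Aggregate months into quarters, then ask the question.
by_quarter = sales.groupby(sales.month.dt.quarter)["revenue"].sum()
print(by_quarter.idxmax(), by_quarter.max())  # -> 4 450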

Transforming knowledge into insight

In the traditional system, information is processed and then analyzed to generate reports. Ever since the Internet came into existence, processed information has become readily and continuously available, and social media has emerged as a new way of conducting business.

Organizations have been using external data to gain insights via data analysis. For example, user sentiment measured from consumers' tweets on Twitter is used to follow opinions about product brands. In some cases, there is a higher percentage of users giving a positive message on social media about a new product, say an iPhone or a tablet computer. The analytical tool can provide numerical evidence of that sentiment, and this is where data visualization plays a significant role.

Another example illustrates this transformation: Netflix ran a competition, concluded in 2009, for the best collaborative filtering algorithm to predict user ratings for films based on previous ratings. The winning team, BellKor's Pragmatic Chaos, achieved a 10.05 percent improvement in predicting user ratings, which increased the business value for Netflix.

Transforming knowledge into insight is achieved using collaboration and analytics, as shown in the preceding diagram. Insight implies seeing the solution and realizing what needs to be done. Acquiring data and information is easy, and organizations have known methods to achieve that, but gaining insight is very hard. Achieving insight requires new and creative thinking and the ability to connect the dots. In addition to creative thinking, data analysis and data visualization play a big role in achieving insight. Data visualization is considered both an art and a science.

 

Data visualization history


Visualization has its roots in a long historical tradition of representing information using primitive paintings and maps on walls, tables of numbers, and paintings on clay. However, these were not known as visualization or data visualization. Data visualization is a newer term; it expresses the idea that visualization involves more than just representing data in a graphical form. The information behind the data should be revealed through an intuitive representation and a good display; the graphic should inherently aid viewers in seeing the structure of the data.

Visualization before computers

In early Babylonian times, pictures were drawn on clay, and in later periods they were rendered on papyrus. The goal of those paintings and maps was to provide the viewer with a qualitative understanding of the information. We also know that understanding pictures is a natural human instinct, as a visual presentation of information is perceived with greater ease. This section includes only partial details about the history of visualization; for elaborate details and examples, readers may turn to dedicated histories of the subject.

Minard's Russian campaign (1812)

Charles Minard was a civil engineer working in Paris. He summarized Napoleon's Russian campaign of 1812—the march on Moscow—in a figurative map. This map is a simple picture that is both a visual timeline and a geographic map, depicting the size and direction of the army, the temperature, and landmarks and locations. Prof. Edward Tufte famously described this picture as possibly the best statistical graphic ever drawn.

The wedge starts out thick on the left-hand side, where the army begins the campaign at the Polish border with 422,000 men, and becomes narrower as the army pushes deeper into Russia and the temperature drops. This visualization condenses a number of different numeric and geographic facts into one image: where the army was reduced, the reasons for the reduction, and, subsequently, its retreat.

The Cholera epidemics in London (1831-1855)

In October 1831, the first case of Asiatic cholera occurred in Great Britain, and over 52,000 people died in the epidemic. Subsequently, in 1848-1849 and 1853-1854, more cholera epidemics produced large death tolls.

In 1855, Dr. John Snow produced a map showing the deaths due to cholera clustered around the Broad Street pump in London. This map was a landmark graphic discovery, but unfortunately, it was devised at the end of that period. His map showed the location of each of the deceased, which provided the insight for his conclusion that the source of the outbreak could be localized to contaminated water from a pump on Broad Street. Around that time, the use of graphs became important in economic and state planning.

Statistical graphics (1850-1915)

By the mid-19th century, a rapid growth of visualization had been established throughout Europe. In 1863, one page of Galton's multivariate weather chart of Europe showed barometric pressure, wind direction, rain, and temperature for the month of December 1861 (source: The life, letters and labors of Francis Galton, Cambridge University Press).

During this period, statistical graphics became mainstream and many textbooks were written on the subject. These textbooks contained detailed descriptions of the graphic method, discussing frequencies and the effects of the choice of scales and baselines on the visual estimation of differences and ratios. They also contained historical diagrams in which two or more time series could be shown on a single chart for comparative views of their histories.

Later developments in data visualization

In the year 1962, John W. Tukey issued a call for the recognition of data analysis as a legitimate branch of statistics; shortly afterwards, he began the invention of a wide variety of new, simple, and effective graphic displays under the rubric Exploratory Data Analysis (EDA), which was followed by Exploratory Spatial Data Analysis (ESDA). Tukey later wrote a book titled Exploratory Data Analysis in 1977. There are a number of tools that are useful for EDA with graphical techniques, which are listed as follows:

  • Box-and-whisker plot (box plot)

  • Histogram

  • Multivari chart (from candlestick charts)

  • Run-sequence plot

  • Pareto chart (named after Vilfredo Pareto)

  • Scatter plot

  • Multidimensional scaling

  • Targeted projection pursuit

Visualization in scientific computing is emerging as an important computer-based field, with the goal to improve the understanding of data and to make quick real-time decisions. Today, the ability of medical doctors to diagnose ailments is dependent upon vision. For example, in hip-replacement surgeries, custom hips can now be fabricated before surgical procedures. Accurate measurements can be made prior to surgery using non-invasive 3D imaging thereby reducing the number of post-operative body rejections from 30 percent to a mere 5 percent (source: http://bonesmart.org/hip/hip-implants-specialized-and-custom-fitted-options/).

Visualization of the human brain structure and function in 3D is a research frontier of far-reaching importance. Few advances have transformed the fields of neuroscience and brain-imaging technology, like the ability to see inside and read the brain of a living human. For continued progress in brain research, it will be necessary to integrate structural and functional information at many levels of abstraction.

The rise in hardware performance already allows us to analyze DNA sequences and represent them visually. Future advances in computing promise even brighter progress in medicine and other scientific areas.

 

How does visualization help decision-making?


There is a variety of ways to represent data visually. However, there are only a few ways in which one can portray the data in a manner that allows one to see something visually and observe new patterns. Data visualization is not as easy as it seems; it is an art and requires a great deal of practice and experience. (Just like painting a picture—one cannot be a master painter from day one, it takes a lot of practice.)

Human perception plays an important role in the field of data visualization. A pair of healthy human eyes has a total field of view of approximately 200 degrees horizontally (about 120 degrees of which are shared by both eyes). About one quarter of the human brain is involved in visual processing, more than for any other sense. Among hearing, seeing, and smelling, vision dominates human sensory processing—estimated at sixty percent (http://contemplatingmadness.tumblr.com/post/27478393311/10-limits-to-human-perception-and-how-they-shape).

Effective visualization helps us in analyzing and understanding data. Author Stephen Few described the following eight types of quantitative messages (conveyed via visualization) that may help us understand or communicate a set of data (source: https://www.perceptualedge.com/articles/ie/the_right_graph.pdf):

  • Time-series

  • Ranking

  • Part-to-whole

  • Deviation

  • Frequency distribution

  • Correlation

  • Nominal comparison

  • Geographic or geospatial

Scientists have mapped the human genome, and this is one of the reasons why we are faced with the challenges of transforming knowledge into a visual representation for better understanding. In other words, we may have to find new ways to visually present the human genome so that it is not difficult for a common person to understand.

Where does visualization fit in?

It is important to note that data visualization is not scientific visualization. Scientific visualization deals with the data that has an inherent physical structure, such as air molecules flowing over an aircraft wing. Information visualization, on the other hand, deals with abstract data, and helps in solving problems involving large datasets. One of the challenges is to ensure that the data is clean and subsequently, to reduce the dimensions so that unnecessary information is discarded.

Visualization can be used wherever we see increased knowledge or value of data. That can be determined by doing more data analysis and running through algorithms. The data analysis might vary from the simplest form to a more complicated one.

Sometimes, there is value in looking at data beyond the mean, median, or total, because these measurements only capture what may seem obvious. Sometimes, aggregates or values around a region hide the interesting details that need special focus. One classic example is "Anscombe's quartet", which comprises four datasets that have nearly identical simple statistical properties yet appear very different when graphed. For more on this, one can refer to https://en.wikipedia.org/wiki/Anscombe%27s_quartet.
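
The quartet is easy to verify in a few lines. As a sketch, seaborn ships a loader for it (the dataset is fetched from seaborn's online data repository, so an internet connection is assumed):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")  # columns: dataset, x, y

# Nearly identical summary statistics across the four datasets...
print(df.groupby("dataset").agg(["mean", "var"]))

# ...yet plotting reveals four very different shapes.
sns.lmplot(x="x", y="y", col="dataset", col_wrap=2, data=df, ci=None)
plt.show()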

Datasets that lend themselves well to visualization can take different forms, but some paint a clearer picture than others. In some cases, it is necessary to analyze the data several times to arrive at a much better understanding of the visualization, as shown in the preceding diagram.

A good visualization is not just a static picture that one can look at, like an exhibit in a museum. It is something that allows us to drill down and find out more about changes in the data: overview first, then zoom and filter, then view details on demand, seeing the results incrementally, as described by Ben Shneiderman in http://www.mat.ucsb.edu/~g.legrady/academic/courses/11w259/schneiderman.pdf. Sometimes, it is much harder to display everything on a single display and on a single scale, and only with experience does one come to understand these visualization methods better. Summarizing further, visualization is useful in both organizing and making sense of data, particularly when it is in abundance.

Interactive visualization is emerging as a new form of communication, which allows users to analyze the information in order to construct their own, new understanding of the data.

Data visualization today

While many areas of computing aim to replace human judgment with automation, visualization systems are unique and are explicitly designed not to replace humans. In fact, they are designed to keep the humans actively involved in the whole process; why is that?

Data visualization is an art, driven by data and yet created by humans with the help of various computing tools. Just as a painter creates a picture using tools and materials such as brushes and colors, a data visualizer creates visualizations with the help of computing tools. Visualization can be aesthetically pleasing and can help in making things clear; sometimes, it may lack one or both of those qualities, depending on who creates it.

Today, there are over thirty different visual representations of data, each with a reason to represent data in that specific way. As visualization methods progress, we have much more than just bar graphs and pie charts. Despite the many benefits of data visualization, these are often undermined by a lack of understanding and, in some cases, by dashboards so cluttered that they become too cumbersome to read.

There are many ways to present data, but only a handful of those make sense in most cases; this will be explained in detail in later sections of this chapter. Before that discussion, let us take a look at a list of some important things that make a good visualization.

What is a good visualization?

Good visualization helps users explore and understand data, providing value and deep insights. It is effective, visually appealing, scalable, and easy to understand (good visualization does not have to be too complicated). Visualization is a central tool for finding patterns and trends in data through research and analysis, with which one can answer questions about the data.

The main principle behind an effective visualization is to identify the main point that you want to make, recognize the level and background of your audience, accurately represent the data, and then create a clear presentation that conveys the message to that audience.

Example: The following representations have been created with a small sample data source that shows the percentage of women and men conferred with degrees in ten different disciplines for the years from 1970-2012 (womens-undergrad-degrees.csv and mens-undergrad-degrees.csv from http://www.knapdata.com/python/):

The full data source available at http://nces.ed.gov/programs/digest/d11/tables/dt11_290.asp maintains the complete set of data.

One simple way is to represent all of them on one scale, although there is no relationship between the numbers across the different disciplines. Let us analyze and see if this representation makes sense, and if it doesn't, then what else do we need? Are there other representations?

For one thing, all the data about the different disciplines is displayed on one screen, which makes comparison convenient. However, if we need to get the information for the year 2000, there is no straightforward way. Unless there is an interactive mode of display, similar to a financial stock chart, there is no easy way to determine the degrees conferred in multiple disciplines for the year 2000. Another confusing part of these plots is that the percentages don't add up to 100 percent. On the other hand, the percentages of conferred degrees within one discipline for men and women do add up to 100 percent; for instance, the percentages of degrees conferred in the Health Professions discipline for men and women are 15.2 percent and 84.8 percent, respectively.

Can we represent these through other visualization methods? One can create bubble charts for each year, have an interactive visualization with year selection, and also have a play button that transitions the bubbles for each year.

This visualization better suits the data that we are looking at. We can also use the same slider with the original plot and make it interactive by highlighting the data for the selected year. It is a good habit to visualize the data in several different ways to see whether some displays make more sense than others. We may have to scale the values on a logarithmic scale if there is a very large range of numerical values (for example, from 20 to 200,000).
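
A minimal sketch of that logarithmic scaling, with invented values spanning four orders of magnitude, shows why it helps:

import matplotlib.pyplot as plt

# Hypothetical values spanning several orders of magnitude.
labels = ["A", "B", "C", "D"]
values = [20, 450, 8000, 200000]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(labels, values)
ax1.set_title("Linear scale: small values vanish")

ax2.bar(labels, values)
ax2.set_yscale("log")  # a log scale keeps every bar readable
ax2.set_title("Logarithmic scale")

plt.show()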

One can write a program in Python to accomplish this bubble chart. Alternatives include JavaScript using D3.js and R using RStudio. It is left to the reader to explore other visualization options.

Google Motion Chart can be used to produce a similar interactive visualization; a working example resembling this bubble chart is shown at developers.google.com/chart/interactive/docs/gallery/motionchart?csw=1#Example. The bubble chart shown here covers only three years, but you can create another one for all the years.

Data visualization is a process that has to be used after data analysis. We also noticed earlier that data transformation, data analysis, and data visualization are done several times; why is that so? We all know the famous quote: "Knowledge is having the right answer; intelligence is asking the right question." Data analysis helps us understand the data better and therefore be in a position to respond to questions about the data. However, when the data is represented visually in several different ways, new questions emerge, and this is one of the reasons why analysis and visualization are repeated.

Visualization of data is one of the primary tools for data exploration, and it almost always precedes or inspires data analysis. There are many tools to display data visually, but fewer tools to do the analysis. Programming languages like Julia, R, and Python rank highly for performing data analysis, but for visualization, JavaScript-based D3.js has much greater potential to generate interactive data visualizations.

Between R and Python, R is often considered the more difficult language to learn, while Python is much easier; this is debated on Quora, and one may judge the arguments for oneself (https://www.quora.com/Which-is-better-for-data-analysis-R-or-Python). Today there are numerous Python tools for statistical modeling and data analysis, which makes it an attractive choice for data science.

 

Visualization plots


One of the reasons why we perform visualization is to confirm our knowledge of data. However, if the data is not well understood, you may not frame the right questions about the data.

When creating visualizations, the first step is to be clear on the question to be answered. In other words, how is visualization going to help? There is another challenge that follows this—knowing the right plotting method. Some visualization methods are as follows:

  • Bar graph and pie chart

  • Box plot

  • Bubble chart

  • Histogram

  • Kernel Density Estimation (KDE) plot

  • Line and surface plot

  • Network graph plot

  • Scatter plot

  • Tree map

  • Violin plot

In the course of identifying the message that the visualization should convey, it makes sense to look at the following questions:

  • How many variables are we dealing with, and what are we trying to plot?

  • What do the x axis and y axis refer to? (For 3D, z axis as well.)

  • Are the data sizes normalized and does the size of data points mean anything?

  • Are we using the right choices of colors?

  • For time series data, are we trying to identify a trend or a correlation?

If there are too many variables, it makes sense to draw multiple instances of the same plot on different subsets of data. This technique is called lattice or trellis plotting. It allows a viewer to quickly extract a large amount of information about complex data.

Consider a subset of student data that has an unusual mixture of information about (gender, sleep, tv, exercise, computer, gpa) and (height, momheight, dadheight). The units for computer, tv, sleep, and exercise are hours, the heights are in inches, and gpa is measured on a scale of 4.0.

The preceding data is an example that has more variables than usual, and therefore, it makes sense to do a trellis plot to visualize and see the relationship between these variables.

Gender is used to distinguish the data points. Among the first set of variables (sleep, tv, exercise, computer, and gpa), there are 10 possible pairings: (sleep, tv), (sleep, exercise), (sleep, computer), (sleep, gpa), (tv, exercise), (tv, computer), (tv, gpa), (exercise, computer), (exercise, gpa), and (computer, gpa); the second set yields another two, (height, momheight) and (height, dadheight). Following are all the combinations except (sleep, tv) and (tv, exercise).

Our goal is to find which combinations of variables can be used to make some sense of this data, and to see whether any of these variables have a meaningful impact. Since the data is about students, gpa may be the key variable that drives the relevance of the other variables. The preceding image depicts scatter plots which show that a greater number of female students have a higher gpa than the male students, and that a greater number of male students spend more time on the computer while achieving a similar range of gpa values. Although all the scatter plots are shown here, the intent is to find out which data plays a more significant role, and what sense we can make of this data.

A greater number of blue dots high up (for gpa on the y axis) shows that there are more female students with a higher gpa (this data was collected from UC Davis).

The data can be downloaded from http://www.knapdata.com/python/ucdavis.csv.
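
Before focusing on a single pairing, the full trellis of pairings listed above can be sketched with seaborn's pairplot. This is a minimal sketch, assuming the ucdavis.csv file has been downloaded locally and that its column names match those listed earlier:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

students = pd.read_csv("ucdavis.csv")  # local copy of the file above

# One scatter panel per pair of variables, with points colored by gender.
sns.pairplot(students, hue="gender",
             vars=["sleep", "tv", "exercise", "computer", "gpa"])
plt.show()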

One can use the seaborn package to display a scatter plot with very few lines of code; the following example shows a scatter plot of gpa along the x axis against the time students spend on the computer:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Adjust the path to wherever you saved the downloaded file.
students = pd.read_csv("/Users/kvenkatr/Downloads/ucdavis.csv")

# One panel, with points colored by gender.
# Note: newer seaborn releases renamed the size parameter to height.
g = sns.FacetGrid(students, hue="gender", palette="Set1", size=6)
g.map(plt.scatter, "gpa", "computer", s=250, linewidth=0.65,
      edgecolor="white")
g.add_legend()
plt.show()

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

These plots were generated using the matplotlib, pandas, and seaborn library packages. Seaborn is a statistical data visualization library based on matplotlib, created by Michael Waskom from Stanford University. Further details about these libraries will be discussed in the following chapters.

There are many useful classes in the Seaborn library. In particular, the FacetGrid class comes in handy when we need to visualize the distribution of a variable or the relationship between multiple variables separately within subsets of data. FacetGrid can be drawn with up to three dimensions, that is, row, column and hue. These library packages and their functions will be described in later chapters.

Bar graphs and pie charts

When do we choose bar graphs and pie charts? They are among the oldest visualization methods: a pie chart is best used to compare the parts of a whole, whereas bar graphs compare quantities between different groups to show patterns.

Bar graphs, histograms, and pie charts help us compare different data samples, categorize them, and determine the distribution of data values across that sample. Bar graphs come in several different styles varying from single, multiple, and stacked.

Bar graphs

Bar graphs are especially effective when you have numerical data that splits nicely into different categories, so you can quickly see trends within your data.

Bar graphs are useful when comparing data across categories. Some notable examples include the following:

  • Volume of jeans in different sizes

  • World population change in the past two decades

  • Percent of spending by department

In addition to this, consider the following:

  • Add color to bars for more impact: Showing revenue performance with bars is informative, but adding color to reveal the profits adds visual insight. However, if there are too many bars, colors might make the graph look clumsy.

  • Include multiple bar charts on a dashboard: This helps the viewer to quickly compare related information instead of flipping through a bunch of spreadsheets or slides to answer a question.

  • Put bars on both sides of an axis: Plotting both positive and negative data points along a continuous axis is an effective way to spot trends.

  • Use stacked bars or side-by-side bars: Displaying related data on top of or next to each other gives depth to your analysis and addresses multiple questions at once.

These plots can be achieved with fewer than 12 lines of Python code, and more examples will be discussed in the later chapters.

With bar graphs, each column represents a group defined by a specific category; with histograms, each column represents a group defined by a quantitative variable. With bar graphs, the x axis does not have a low-end or a high-end value, because the labels on the x axis are categorical and not quantitative. On the other hand, in a histogram, there is going to be a range of values. The following bar graph shows the statistics of Oscar winners and nominees in the US from 2000-2009:

The following Python code uses matplotlib to display bar graphs for a small data sample from the movies (This may not necessarily be a real example, but gives an idea of plotting and comparing):

import numpy as np
import matplotlib.pyplot as plt

N = 7
winnersplot = (142.6, 125.3, 62.0, 81.0, 145.6, 319.4, 178.1)

ind = np.arange(N)  # the x locations for the groups
width = 0.35        # the width of the bars

fig, ax = plt.subplots()
winners = ax.bar(ind, winnersplot, width, color='#ffad00')

nomineesplot = (109.4, 94.8, 60.7, 44.6, 116.9, 262.5, 102.0)
nominees = ax.bar(ind + width, nomineesplot, width, color='#9b3c38')

# add some text for labels, title and axes ticks
ax.set_xticks(ind + width)
ax.set_xticklabels(('Best Picture', 'Director', 'Best Actor',
                    'Best Actress', 'Editing', 'Visual Effects',
                    'Cinematography'))

ax.legend((winners[0], nominees[0]),
          ('Academy Award Winners', 'Academy Award Nominees'))

def autolabel(rects):
    # attach a dollar-figure label above each bar
    for rect in rects:
        height = rect.get_height()
        hcap = "$" + str(height) + "M"
        ax.text(rect.get_x() + rect.get_width() / 2., height, hcap,
                ha='center', va='bottom', rotation="vertical")

autolabel(winners)
autolabel(nominees)

plt.show()
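
To contrast with the categorical bars above, a histogram bins a quantitative variable, so the x axis is a numeric range rather than a set of labels. A minimal sketch with synthetic data:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic heights drawn from a normal distribution.
heights = np.random.normal(loc=67, scale=3, size=300)

plt.hist(heights, bins=15, color='#ffad00', edgecolor='white')
plt.xlabel("Height (inches)")
plt.ylabel("Frequency")
plt.show()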

Pie charts

When it comes to pie charts, one should really consider answering two questions: "Do the parts make up a meaningful whole?" and "Is there sufficient real estate to represent them in a circular view?". There are critics who come down hard on pie charts, and one of the main reasons is that when there are numerous categories, it becomes very hard to judge the proportions and compare those categories to gain any insight (source: https://www.quora.com/How-and-why-are-pie-charts-considered-evil-by-data-visualization-experts).

Pie charts are useful for showing proportions on a single space or across a map. Some notable examples include the following:

  • Response categories from a survey

  • Top five company market shares in a specific technology (in this case, one can quickly know which companies have a major share in the market)

In addition to this, consider the following:

  • Limit pie wedges to eight: If there are more than eight proportions to represent, consider a bar graph. Due to the limited real estate, it is difficult to meaningfully represent and interpret the pieces.

  • Overlay pie charts on maps: Pie charts can easily be spread across a map to highlight geographical trends. (The wedges should be limited here too.)

Consider the following code for a simple pie chart comparing how the admission intake is distributed among several disciplines:

import matplotlib.pyplot as plt

labels = ('Computer Science', 'Foreign Languages',
          'Analytical Chemistry', 'Education', 'Humanities',
          'Physics', 'Biology', 'Math and Statistics', 'Engineering')

sizes = [21, 4, 7, 7, 8, 9, 10, 15, 19]
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral',
          'red', 'purple', '#f280de', 'orange', 'green']
explode = (0, 0, 0, 0, 0, 0, 0, 0, 0.1)  # pull the last wedge out slightly

plt.pie(sizes, explode=explode, labels=labels,
        autopct='%1.1f%%', colors=colors)
plt.axis('equal')  # keep the pie circular
plt.show()

The following pie chart example shows the university admission intake in some chosen top-study areas:

Box plots

Box plots are also known as box-and-whisker plots. This is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. The following diagram shows how a box plot can be read:

A box plot is a quick way of examining one or more sets of data graphically, and it takes up little space while conveying five summary values at once. One example of this usage: if the same exam is given to two or more classes, a box plot can tell when most students in one class did better than most students in the other class (a minimal sketch of this comparison appears after the list below). Another example: if more people eat burgers, the median will be higher, or the top whisker could be longer than the bottom one. In such cases, a box plot gives a good overview of the data distribution.

Before we try to understand when to use box plots, here is a definition that one needs to understand. An outlier in a collection of data values is an observation that lies at an abnormal distance from other values.

Box plots are most useful in showing the distribution of a set of data. Some notable examples are as follows:

  • Identifying outliers in the data

  • Determining how the data is skewed towards either end

In addition to this, consider the following:

  • Hide the points within the box: Focus attention on the outliers

  • Compare across distributions: Box plots are good for quickly comparing distributions between data sets
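
Here is that exam-scores comparison as a minimal matplotlib sketch; the scores are synthetic:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic exam scores for two classes taking the same exam.
rng = np.random.default_rng(0)
class_a = rng.normal(72, 8, 40)
class_b = rng.normal(80, 6, 40)

# One box per class: medians, quartiles, whiskers, and outliers at a glance.
plt.boxplot([class_a, class_b], labels=["Class A", "Class B"])
plt.ylabel("Exam score")
plt.show()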

Scatter plots and bubble charts

A scatter plot is a type of visualization method for displaying two variables; the pattern of the plotted points can graphically show relationships. A scatter plot is a visualization of the relationship between two variables measured on the same set of individuals. A bubble chart, on the other hand, displays three dimensions of data: each entity, with its triplet (a, b, c) of associated data, is plotted as a disk that expresses two of the three variables through its xy location and the third through its size.

Scatter plots

The data is usually displayed as a collection of points and is often used to reveal various kinds of correlations. For instance, a positive correlation is noticed when an increase in one variable is accompanied by an increase in the other. The student record data shown earlier has various scatter plots that show the correlations among the variables.

In the following example, we compare the heights of students with the height of their mother to determine if there is any positive correlation. The data can be downloaded from http://www.knapdata.com/python/ucdavis.csv.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Adjust the path to wherever you saved the downloaded file.
students = pd.read_csv("/Users/Macbook/python/data/ucdavis.csv")

# Note: newer seaborn releases renamed the size parameter to height.
g = sns.FacetGrid(students, palette="Set1", size=7)
g.map(plt.scatter, "momheight", "height", s=140, linewidth=.7,
      edgecolor="#ffad40", color="#ff8000")
g.set_axis_labels("Mothers Height", "Students Height")
plt.show()

We demonstrate this example using the seaborn package, but one can also accomplish it using only matplotlib, which will be shown in the following section. The scatter plot for the preceding code is depicted as follows:

Scatter plots are most useful for investigating the relationship between two different variables. Some notable examples are as follows:

  • The likelihood of having skin cancer at different ages in males versus females

  • The correlation between the IQ test score and GPA

In addition to this, consider the following:

  • Add a trend line or line of best-fit (if the relation is linear): Adding a trend line can show the correlation among the data values

  • Use informative mark types: Informative mark types should be used if the story to be revealed is about data that can be visually enhanced with relevant shapes and colors

Bubble charts

The following example shows how one can use a color map as a third dimension, which may indicate the volume of sales or any appropriate indicator that drives the profit:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")

# Adjust the path to wherever you saved the data file.
mov = pd.read_csv("/Users/MacBook/python/data/2014_gross.csv")

x = mov.ProductionCost
y = mov.WorldGross
z = mov.WorldGross  # drives both bubble size and color

# Note: newer matplotlib releases use matplotlib.colormaps['RdYlBu'] instead.
cm = plt.cm.get_cmap('RdYlBu')
fig, ax = plt.subplots(figsize=(12, 10))

sc = ax.scatter(x, y, s=z * 3, c=z, cmap=cm, linewidth=0.2, alpha=0.5)
ax.grid()
fig.colorbar(sc)

ax.set_xlabel('Production Cost', fontsize=14)
ax.set_ylabel('Gross Profits', fontsize=14)

plt.show()

The following scatter plot is the result of the example using color map:

Bubble charts are extremely useful for comparing relationships between data in three numeric-data dimensions: the x axis data, the y axis data, and the data represented by the bubble size. Bubble charts are like XY scatter plots, except that each point on the scatter plot has an additional data value associated with it that is represented by the size of the circle or "bubble" centered on the XY point. Another example of a bubble chart is shown here (without the python code, to demonstrate a different style):

In the preceding display, the bubble chart shows the Life Expectancy versus Gross Domestic Product per Capita around different continents.

Bubble charts are most useful for showing the concentration of data along two axes with a third data element being the significance value measured. Some notable examples are as follows:

  • The production cost of movies versus the gross profit made, with significance measured along a color scale, as shown in the example

In addition to this, consider the following:

  • Adding color and shape significance: By varying the size and color, the data points can be transformed into a visualization that clearly answers some questions

  • Make it interactive: If there are too many data points, bubble charts could get cluttered, so group them on the time axis or categories, and visualize them interactively

KDE plots

Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable: kernels centered at the observed data points are averaged to create a smooth approximation. KDE plots are closely related to histograms, but can be endowed with smoothness and continuity through a concept called the kernel.

The kernel of a Probability Density Function (PDF) is the form of the PDF in which any factors that are not functions of any of the variables in the domain are omitted. We will focus only on the visualization aspect of it; for more theory, one may refer to books on statistics.

There are several different Python libraries that can be used to produce KDE plots at various depths and levels, including matplotlib, SciPy, scikit-learn, and seaborn. Following are two examples of KDE plots. There will be more examples in later chapters, wherever necessary, to demonstrate other ways of displaying KDE plots.

In the following example, we use the seaborn package to display the distribution of a random dataset of size 250 (generated using numpy.random) in a few simple lines:

from numpy.random import randn
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_palette("hls")
mpl.rc("figure", figsize=(10, 6))

data = randn(250)  # 250 samples from a standard normal distribution

plt.title("KDE Demonstration using Seaborn and Matplotlib", fontsize=20)
# Note: newer seaborn releases replace distplot with displot/histplot.
sns.distplot(data, color='#ff8000')
plt.show()

In the second example, we demonstrate the probability density function using SciPy and NumPy. First we use norm() from SciPy to create two sets of normal distribution samples, then use hstack() from NumPy to stack them horizontally, and finally apply gaussian_kde() from SciPy. The resulting KDE plot is produced by the following code:

from scipy.stats import gaussian_kde, norm
from numpy import linspace, hstack
import matplotlib.pyplot as plt

# Two overlapping normal samples form a bimodal distribution.
sample1 = norm.rvs(loc=-1.0, scale=1, size=320)
sample2 = norm.rvs(loc=2.0, scale=0.6, size=320)
sample = hstack([sample1, sample2])

probDensityFun = gaussian_kde(sample)

plt.title("KDE Demonstration using Scipy and Numpy", fontsize=20)
x = linspace(-5, 5, 200)
plt.plot(x, probDensityFun(x), 'r')
# Note: older matplotlib used normed=1 instead of density=True.
plt.hist(sample, density=True, alpha=0.45, color='purple')
plt.show()

Other visualization methods, such as line and surface plots, network graph plots, tree maps, heat maps, radar or spider charts, and violin plots, will be discussed in the next few chapters.

 

Summary


The examples shown so far are meant to give you an idea of how one should think and plan before making a presentation. The most important stage is the process of data familiarization and preparation for visualization. Whether one gets the data first or shapes the desired story first is mainly influenced by the intended outcome. It is like the "chicken and egg" situation—does the data come first, or the focus? Initially, it may not be clear what data one needs, but in most cases, after a few iterations, things become clear as long as there are no errors in the data.

Improve the quality of the data by doing some cleanup, reducing the dimensions (if required), and filling any gaps. Unless the data is good, the effort one puts into presenting it visually will be wasted. After a reasonable understanding of the data is achieved, it makes sense to determine what kind of visualization may be appropriate. In some cases, it is better to display the data in several different ways to see the story clearly.
