Since this is the first chapter, it seems prudent to start out with a simple explanation of just what data visualization is, followed by a quick overview of various generally accepted data visualization concepts.
From there, we will proceed by pointing out the specific challenges that big data brings to the practice of visualizing data, and then finally we will tee up a number of approaches for successfully creating valuable visualizations using big data sources.
After completing this chapter, the reader will be ready to start on the practical big data visualization examples covered in this book's subsequent chapters. Each chapter focuses on a specific big data visualization topic, using a specific trending tool or technology thought to be well suited to that particular topic or challenge (note that other tools or technologies may be available).
We'll break down this first chapter into:
An explanation of data visualization
Conventional data visualization concepts
Challenges of big data visualization
Approaches to big data visualization
So what is data visualization? Simply put, think of the two words: data, meaning information or numbers, and visualization, meaning picturing; that is, picturing the information, as shown in the following figure:
Perhaps a simplistic example that can be used to define data visualization is the practice of striking lines between stars in the night sky to create an image.
Imagine certain stars as the data points you are interested in (among the billions of other stars that are visible in the sky) and connecting them in a certain order to create a picture to help one visualize the constellation.
Voila! Data visualization!
Nowadays, data visualization is regarded by many disciplines as the modern equivalent of visual communication.
Okay, so then what is the chief objective of visual communication, or of visualizing your data?
The main point (although there are other goals and objectives) when leveraging data visualization is to make something complex appear simple (or in our star example earlier, perhaps to make a data pattern more visible to a somewhat untrained eye).
Communicating a particular point or simplifying the complexities of mountains of data does not require the use of data visualization, but today's world all but demands it. That is, most readers of this book would agree that scanning numerous worksheets, spreadsheets, or reports is mundane and tedious at best, while looking at charts and graphs is typically much easier on the eyes. Additionally, we humans are able to process even very large amounts of data much more quickly when the data is presented graphically. Data visualization is therefore a way to convey concepts in a universal manner, allowing your audience or target to quickly get your point.
Other motives for using data visualization include:
To explain the data or put the data in context (for example, highlighting demographic statistics)
To solve a specific problem (for example, identifying problem areas within a particular business model)
To explore the data to reach a better understanding or add clarity (for example, what period of time does this data span?)
To highlight or illustrate otherwise invisible data (such as isolating outliers residing in the data)
To predict, for example, potential sales volumes (perhaps based upon seasonality sales statistics)
With computers, technology, and the corporate business landscape changing so rapidly today (and all indications are that it will continue to change at an even faster pace in the future), what can be considered the future of the art of data visualization?
As per Data Visualization: The future of data visualization, Towler, 2015:
"Data visualization is entering a new era. Emerging sources of intelligence, theoretical developments, and advances in multidimensional imaging are reshaping the potential value that analytics and insights can provide, with visualization playing a key role."
With big data getting bigger (and bigger!), it is safe to assume that the use of data visualization will only continue to grow, to evolve, and to be of outstanding value. In addition, how one approaches the process and practice of data visualization will need to grow and evolve as well.
Let's start out this section by clarifying what we mean when we say conventional.
In the context of this book, when I say conventional, I am referring to the ideas and methods that have been used with some level of success within the industry over time (for data visualization).
Although it seems that every day new technologies and practices are being discovered, developed, and deployed, providing new and different options for performing ever more ingenious real-time (or near real-time) data visualization, understanding the basic concepts of visualizing data is still essential.
To that point, gaining an understanding of just how to go about choosing the correct or most effective visualization method is essential.
To make that choice, one typically needs to establish:
The size and volume of the data to be visualized.
The data's cardinality and context.
What is it you are trying to communicate? What is the point that you want to communicate?
Who is your audience? Who will consume this information?
What kind or type of visual might best convey your message to your audience?
Realistically, the approach taken or method used is sometimes based solely upon your time and budget.
Based on the preceding particulars (and perhaps others), the most common visualization methods/types, which you are most likely already familiar with, include:
Line, bar, pie, area, flow, and bubble charts
Data series or a combination of charts
Venn diagrams, data flow diagrams, and entity relationship (ER) diagrams
As I've mentioned earlier, as and when needs arise, newer or lesser-known options are becoming more mainstream.
These include the following:
Each of the earlier mentioned data visualization types/methods speaks to a particular scenario or target audience better than others; it all depends. Learning to make the appropriate choice comes from experience as well as (sometimes) a bit of trial and error.
Due to the popularity of data visualization, there exist many formal training options, (classroom and online) and new and unique training curriculums are becoming available every day.
Coursework includes topics such as:
Channeling an audience
Determining informational hierarchies
Sketching and wire framing
Defining a narrative
We're assuming that you have some background with the topic of data visualization, and therefore the earlier deliberations were just enough to refresh your memory and whet your appetite for the real purpose of this book.
Let's take a pause here to define big data.
Phrases such as a large assemblage of data, datasets so large or complex that traditional data processing applications are inadequate, and data about every aspect of our lives have all been used to define or refer to big data.
In 2001, then Gartner analyst Doug Laney introduced the 3Vs concept (refer to the following link: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf). The 3Vs, according to Laney, are volume, variety, and velocity, and together they make up the dimensionality of big data: volume (the measurable amount of data), variety (the number of types of data), and velocity (the speed of processing or dealing with that data).
With this concept in mind, all aspects of big data become increasingly challenging and as these dimensions increase or expand they will also encumber the ability to effectively visualize the data.
Look at the following figure and remember that Excel is not a tool to determine whether your data qualifies as big data:
If your data is too big for Microsoft Excel, it still doesn't necessarily qualify as big data. In fact, gigabytes of data are still manageable with various techniques and enterprise (and even open source) tools, especially with the lower cost of storage today. It is important to be able to realistically size the data that you will be using in an analytic or visualization project before selecting an approach or technology (keeping in mind expected data growth rates).
As the following figure illustrates, the aforementioned Volume, Variety, and Velocity have and will continue to lift Big Data into the future:
Let's take a moment to further examine the Vs.
Volume involves determining or calculating how much of something there is, or in the case of big data, how much of something there will be. Here is a thought provoking example:
How fast does moon dust pile up?
As written by Megan Gannon in 2014, (http://www.space.com/23694-moon-dust-mystery-apollo-data.html), a revisited trove of data from NASA's Apollo missions more than 40 years ago is helping scientists answer a lingering lunar question: how fast does moon dust build up? The answer: it would take 1,000 years for a layer of moon dust about a millimeter (0.04 inches) thick to accumulate (big data accumulates much quicker than moon dust!).
With every click of a mouse, big data grows to petabytes (1,024 terabytes) or even exabytes (1,024 petabytes), consisting of billions to trillions of records generated from millions of people and machines.
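To put those unit names in perspective, here is a quick back-of-the-envelope sizing sketch in Python. The record count and average record size are invented assumptions for illustration, not measurements from any real system:

```python
# Rough sizing arithmetic: how quickly record counts translate into
# big data scale. All figures here are illustrative assumptions.

KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB
PB = 1024 * TB  # a petabyte is 1,024 terabytes

def storage_needed(records, bytes_per_record):
    """Return total bytes for a given record count and average record size."""
    return records * bytes_per_record

# Assume one billion 1 KB records (a hypothetical clickstream):
total = storage_needed(1_000_000_000, 1 * KB)
print(total / TB)  # roughly how many terabytes that is
```

A billion modest records already approaches a terabyte; scale the record count or record size up by a few orders of magnitude and the petabyte range arrives quickly.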
Although it's been reported (for example, you can refer to the following link: http://blog.sqlauthority.com/2013/07/21/sql-server-what-is-the-maximum-relational-database-size-supported-by-single-instance/) that structured or relational database technology could support applications capable of scaling up to 1 petabyte of storage, it doesn't take a lot of thought to understand that volume of that kind won't be easy to handle capably, and the accumulation rate of big data isn't slowing any time soon.
It's a case of big, bigger, and (we haven't even approached determining) biggest yet!
Velocity is the rate or pace at which something is occurring. The measured velocity experience can and usually does change over time. Velocities directly affect outcomes.
Previously, we lived and worked in a batch environment, meaning we formulate a question (perhaps what is our most popular product?), submit the question (to the information technology group), and wait--perhaps after the nightly sales are processed (maybe 24 hours later), and finally, we receive an answer. This is a business model that doesn't hold up now with the many new sources of data (such as social media or mobile applications), which record and capture data in real time, all of the time. The answers to the questions asked may actually change within a 24-hour period (such is the case with trending now information that you may have observed when you are online).
Given the industry hot topics such as Internet of Things (IoT), it is safe to say that these pace expectations will only quicken.
Thinking back to our previous mention of relational databases, it is generally accepted that relational databases are considered to be highly structured, although they may contain text in some of their fields.
Data today (and especially when we talk about big data) comes from many kinds of data sources, and the level in which that data is structured varies greatly from data source to data source. In fact, the growing trend is for data to continue to lose structure and to continue to add hundreds (or more?) of new formats and structures (formats that go beyond pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, pdf, flash, and so on) all of the time.
The process of categorization helps us to gain an understanding of the data source.
The industry commonly categorizes big data this way, into the two groups (structured and unstructured), but the categorizing doesn't stop there.
Some simple research reveals some interesting new terms for subcategorizing these two types of data varieties:
Structured data includes subcategories such as created, provoked, transactional, compiled, and experimental, while unstructured data includes subcategories such as captured and user-generated (just to name a few of the currently trending terms for categorizing the types of big data; you may be familiar with, or be able to find, more).
It's worth taking some time here to speak about these various data formats (varieties) to help underscore for the reader the challenges of dealing with the numerous big data varieties:
Created data: This is the data being created for a purpose; such as focus group surveys or asking website users to establish an account on the site (rather than allowing anonymous access).
Provoked data: This is described as data received after some form of provoking, perhaps such as providing someone with the opportunity to express the individual's personal view on a topic, such as customers filling out product review forms.
Transactional data: This is data that is described as database transactions, for example, the record of a sales transaction.
Compiled data: This is data described as information collected (or compiled) on a particular topic such as credit scores.
Experimental data: This is described as when someone experiments with data and/or sources of data to explore potential new insights. For example, combining or relating sales transactions to marketing and promotional information to determine a (potential) correlation.
Captured data: This is the data created passively due to a person's behavior (like when you enter a search term on Google, perhaps the creepiest data of all!).
User-generated data: This is the data generated every second by individuals, such as from Twitter, Facebook, YouTube, and so on (compared to captured data, this is data you willingly create or put out there).
To sum up, big data comes with no common or expected format and the time required to impose a structure on the data has proven to be no longer worth it.
In addition to what we mentioned earlier, there are additional challenging areas that big data brings to the table especially to the task of data visualization, for example, the ability to effectively deal with data quality, outliers, and to display results in a meaningful way, to name a few.
Again, it's worth quickly visiting each of these topics here now.
The value of almost anything and everything is directly proportional to its level of quality, and higher quality equals higher value.
Data is no different. Data (any data) can only prove to be a valuable instrument if its quality is certain.
The general areas of data quality include:
Consistency (across sources)
The quality of data can be affected by the way it is entered, stored, and managed. The process of addressing data quality (referred to most often as quality assurance, or data quality assurance (DQA)) requires a routine and regular review and evaluation of the data, and performing ongoing processes termed profiling and scrubbing (this is vital even if the data is stored in multiple disparate systems, making these processes difficult).
Effective profiling and scrubbing of data necessitates the use of flexible, efficient techniques capable of handling complex quality issues hidden deep in the depths of very large and ever accumulating (big data) datasets.
With the complexities of big data (and its levels of volume, velocity, and variety), it should be easy for one to recognize how problematic and restrictive the DQA process is and will continue to become.
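To make the idea of scrubbing concrete, here is a minimal sketch in Python. The field names, the junk-marker list, and the cleanup rules are invented for illustration; a real DQA process would be far more extensive:

```python
# A minimal data-scrubbing sketch: trim whitespace and blank out
# known junk markers so that downstream analysis sees consistent values.

def scrub_record(record):
    """Trim whitespace and treat common junk markers as missing values."""
    junk = {"n/a", "na", "none", "null", "-", ""}
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value.lower() in junk:
                value = None  # normalize junk markers to "missing"
        cleaned[field] = value
    return cleaned

def scrub(records):
    return [scrub_record(r) for r in records]

rows = [
    {"city": "  Boston ", "state": "MA"},
    {"city": "N/A", "state": "ma "},
]
print(scrub(rows))
```

Even this toy version hints at the scale problem: each rule must run against every record, so the cost of scrubbing grows in step with the volume of the data.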
The following is a simple figure introducing the concept of an outlier, that is, one lonesome red dot separated from the group:
As per Sham Mustafa, founder and CEO of data scientist marketplace Correlation One:
"Anyone who is trying to interpret data needs to care about outliers. It doesn't matter if the data is financial, sociological, medical, or even qualitative. Any analysis of that data or information must consider the presence and effect of outliers. Outliers (data that is "distant" from the rest of the data) indicating variabilities or errors - need to be identified and dealt with."
For clarification, you might accept the notion that an outlier is an observation point that is distant or vastly different from the other observations (or data points) in a set of data.
Once identified, generally accepted methods for dealing with these outliers may be (simply?) moving them to another file or replacing the outliers with other, more reasonable or appropriate values. This way of outlier processing is perhaps not such a complicated process, but it is one that must be seriously thought out, and rethought, before introducing any process to identify and address outliers in a petabyte or more of data.
Another point to consider is: are the outliers you identify in your data an indicator that the data itself is bad or faulty, or are the outliers random variations caused by new and interesting points or characteristics within your data?
Either way, the presence of outliers in your data will require a valid and (especially in the case of big data) a robust method for dealing with them.
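One common, simple way to flag outliers is the interquartile-range (IQR) rule; the following is a generic Python sketch of that idea, with invented sample values, and is not the method any particular tool prescribes:

```python
import statistics

# Flag outliers with the IQR rule: values lying more than k * IQR
# outside the middle two quartiles are treated as outliers.

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond Q1 or Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [12, 13, 12, 14, 13, 15, 12, 98]  # 98 is the lonesome outlier
print(find_outliers(data))  # → [98]
```

The rule itself is cheap, but note that it still requires a pass over every value; applying even a method this simple to a petabyte of data is where the engineering challenge lies.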
Rather than words or text, the following diagram clearly demonstrates the power of a visualization when conveying information:
A picture is worth a thousand words and Seeing is believing are just two adages that elucidate the powers of data visualization.
As per Millman/Miller Data Visualization: Getting Value from Information 2014:
"The whole point of data visualization is to provide a visual experience."
Successfully conducting business today requires that organizations tap into all the available data stores finding and analyzing relevant information very quickly, looking for indications and insights.
Data visualization is a key technique permitting individuals to perform analysis, identify key trends or events, and make more confident decisions much more quickly. In fact, data visualization has been referred to as the visual representation of business intelligence, and industry research analyst Lyndsay Wise said in a 2013 article:
"Even though there is plenty that users can accomplish now using data visualization, the reality is that we are just at the tip of the iceberg in terms of how people will be using this technology in the future."
Refer to the following link for more information:
The idea of establishing and improving the quality levels of big data might also be classified as the fourth V: veracity. Data that is disparate, large, multiformatted, and quick to accumulate and/or change (also known as big data) causes uncertainty and doubt (can I trust this data?). The uncertainty that comes with big data may cause perhaps valuable data to be excluded or overlooked.
As we've already mentioned, big data visualization forces a rethinking of the massive amounts of both structured and unstructured data (at great velocity) and unstructured data will always contain a certain amount of uncertain and imprecise data. Social media data, for example, is characteristically uncertain.
A method for dealing with big data veracity is assigning a veracity grade or veracity score to specific datasets, so as to avoid making decisions based on analysis of uncertain and imprecise big data.
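A veracity score could be as simple as measuring the completeness of the fields you care about. The following Python sketch does exactly that; the field names, the sample records, and the 0-100 scale are all invented assumptions, since real scoring schemes would be tailored to the data source:

```python
# A toy veracity score: the percentage of required fields that are
# actually filled in across a dataset. Purely illustrative.

def veracity_score(records, required_fields):
    """Score a dataset 0-100 based on completeness of required fields."""
    if not records:
        return 0.0
    filled = 0
    total = len(records) * len(required_fields)
    for record in records:
        for field in required_fields:
            if record.get(field) not in (None, ""):
                filled += 1
    return round(100.0 * filled / total, 1)

tweets = [
    {"user": "a", "text": "hello", "geo": None},
    {"user": "b", "text": "", "geo": "42.36,-71.06"},
]
print(veracity_score(tweets, ["user", "text", "geo"]))  # → 66.7
```

A dataset scoring below some agreed threshold might then be excluded from a visualization, or at least flagged for the audience as lower-confidence data.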
Although big data may well offer businesses exponentially more opportunities for visualizing their data into actionable insights, it also increases the required effort and expertise to do so (successfully and effectively).
Again, the same challenges are presented; accessing the level of detail needed from perhaps unimaginable volumes of data, in an ever-growing variety of different formats, all at a very high speed, is noticeably difficult.
A meaningful display requires you to pay attention to various proven practice philosophies; these concepts include (but are not limited to):
The proper arrangement of related information
Appropriately using color(s)
Correctly defining decimal placements
Limiting the use of 3D effects or ornate gauge designs
The reader should take note that this book is not intending to cover all of the fundamental data visualization techniques, but is focusing on the challenges of big data visualization practices and it is assumed that the reader has general knowledge of and experience with the process of data visualization. However, one who may be interested in the topic should perhaps take some time to review the idea of the Data-Ink Ratio introduced by Edward Tufte. Tufte does an excellent job in introducing and explaining this concept in the best-selling book The Visual Display of Quantitative Information, Edward R. Tufte, January 2001.
Without context, data is meaningless and the same applies to visual displays (or visualizations) of that data.
For example, data sourced from social media may present entirely different insights depending on user demographics (that is, age group, sex, or income bracket), platform (that is, Facebook or Twitter), or audience (those who intend to consume the visualizations).
Acquiring a proper understanding (establishing a context) of the data takes significant domain expertise as well as the ability to properly analyze the data; big data certainly complicates these practices with its seemingly endless number of formats and varieties of both structured and unstructured data.
Even if you are able to assign the appropriate context to your data, the usability or value of the data will be (at least) reduced if the data is not timely. The effort and expense required to source, understand, and visualize data is squandered if the results are stale, obsolete, or potentially invalid by the time the data is available to the intended consumers. For example, when a state government agency is preparing a budget request for the governor, the most up-to-date census figures are vital; without accuracy here, the funds requested may fall short of the actual needs.
The challenge of speedily crunching numbers exists within any data analysis, but when considering the varieties and volumes of data involved in big data projects, it becomes even more evident.
It may (or may not) be evident to the reader that too much information displayed in one place can cause the viewer to experience what is referred to as sensory overload. Simple restrictions such as real estate (the available viewing space on a web page or monitor) can (and most likely will) be detrimental to the value of a visualization that tries to depict too many data points or metrics.
In addition, complicated or intricate visuals, or those that attempt to aggregate or otherwise source a large number of data sources, will most likely be hindered by slow performance. In other words, the more data you need to process to create or refresh your visualization, the longer the wait time will most likely be, which increases audience frustration and reduces the usability and value of the visualization.
Beyond the earlier mentioned pitfalls, when dealing with big data, even creating a simple bar graph visualization can be overwhelmingly difficult since attempting to plot points for analysis with extremely large amounts of information or a large variety of categories of information simply won't work.
Visualizations of data should be used to uncover trends and spot outliers much quicker than using worksheets or reports containing columns and rows of numbers and text, but these opportunities will be lost if care is not taken to address the mentioned challenges.
Users can leverage visualizations such as a column chart, for example, to see where sales may be headed or to identify topics that need attention at a glance or glimpse. But imagine trying to churn through and chart twenty billion records of data! Even if the data could be processed into a visualization, anyone trying to view that number of plots within a single visualization will have a very difficult time just viewing so many data points.
Thankfully, there are various approaches (or strategies) that have come to exist and can be used for preparing effective big data visualizations as well as addressing the hindrances we've mentioned (variety, velocity, volume, and veracity).
Some of the examples include:
You can change the type of the visualization, for example, switching from a column graph to a line chart can allow you to handle more data points within the visualization.
You can use higher-level clustering. In other words, you can create larger, broader stroke groupings of the data to be represented in the visualization (with perhaps linked subcharts or popups allowing a selected grouping to be broken out into subgroupings) rather than trying to visualize an excessive number of groups.
You can remove outliers from the visualization. Outliers typically represent less than 5 percent of a data source, but when you're working with massive amounts of data, even viewing that 5 percent of the data is challenging. Outliers can be removed and, if appropriate, presented in a separate data visualization.
You can consider capping, which means setting a threshold for the data you will allow into your visualization. This cuts down on the range of data, making for a smaller, more focused image.
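The capping and higher-level clustering strategies above can be sketched in a few lines of Python; the sales figures, threshold, and bucket size here are invented purely for illustration:

```python
from collections import Counter

# Two of the strategies in miniature: capping (drop points above a
# threshold) and higher-level clustering (bucket points into broad groups).

def cap(points, threshold):
    """Capping: keep only points at or below a chosen threshold."""
    return [p for p in points if p <= threshold]

def cluster_counts(points, bucket_size):
    """Clustering: count points per broad bucket instead of plotting each."""
    return Counter((p // bucket_size) * bucket_size for p in points)

sales = [12, 48, 51, 340, 77, 95, 1200, 63]
print(cap(sales, 100))             # drops 340 and 1200 from the plot
print(dict(cluster_counts(sales, 50)))
```

The clustered counts could then feed a chart with a handful of bars instead of one plot point per record, with the capped or dropped values broken out into a linked subchart if needed.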
These strategies (and others) help, but aren't really sufficient when it comes to working with big data.
The remaining chapters of this book are outlined later in this chapter; in them, I will provide practical approaches and solutions (with examples) to consider for successful big data visualization.
When it comes to the topic of big data, simple data visualization tools with their basic features become somewhat inadequate. The concepts and models necessary to efficiently and effectively visualize big data can be daunting, but are not unobtainable.
Using workable approaches (studied in the following chapters of this book) the reader will review some of the most popular (or currently trending) tools, such as:
This is done in an effort to meet the challenges of big data visualization and support better decision making.
It is expected that our reading audience would be data analysts or those having at least basic knowledge of data analysis and visualization and now are interested in learning about the various alternatives for big data visualization in order to make their analysis more useful, more valuable, and hopefully have some fun doing it!
Readers holding some knowledge of big data platform tools (such as Hadoop) and having exposure to programming languages (such as perhaps R or Python) will make the most of the remaining chapters, but all should benefit.
We've already touched on the 3Vs (plus veracity), which include the challenges of both the storing of the large and ever-growing amounts (volumes) of data as well as being able to rapidly (with velocity) access, manipulate, and manage that data.
Chapter 2, Access, Speed, and Storage with Hadoop, of this book will expound on this topic and introduce Hadoop as the game changing technology to use for this purpose.
Dealing with expanding data sizes may lead to perpetually expanding a machine's resources to cover the expanding size of the data. Typically, this is a short-lived solution.
When dealing with data too large to handle with a single machine's memory (that is, big data), a common approach is to sample the data; that is, you construct a smaller dataset from the full dataset that you feel is reasonably representative of it. Using Hadoop, you have the ability to run many exploratory data analysis tasks on full datasets, without sampling, with the results efficiently returned to your machine or laptop.
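The sampling idea itself is straightforward; here is a minimal Python sketch using a synthetic stand-in dataset (the dataset size, sample size, and fixed seed are assumptions chosen only to keep the example small and repeatable):

```python
import random

# Sampling in miniature: draw a small, hopefully representative subset
# from a dataset that would not fit in one machine's memory.

random.seed(42)  # fixed seed so the example is repeatable
full_dataset = range(10_000_000)  # synthetic stand-in for a big data source

sample = random.sample(full_dataset, k=1_000)  # 1,000 records, no repeats
print(len(sample))
```

The catch, of course, is that every conclusion drawn from the sample rests on the assumption that it really is representative; Hadoop's appeal is precisely that it lets you skip that assumption and work with the full dataset.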
Hadoop removes the restrictions and limitations that hardware levies on the storage of big data by providing the ability to streamline data (from every touch point in any organizational data source, whether the data is structured or unstructured) for your needs across clusters of computers (which means this solution is basically infinitely scalable) using simple programming models.
The Hadoop online product documentation points out:
"Data which was previously too expensive to store, can now be stored and made available for analysis to improve business insights at 1/10 to 1/50 the cost on a per terabyte basis."
Refer to the following link for more information www.mapr.com/why-hadoop/game-changer2016.
We'll cover working examples to demonstrate solutions for effectively storing and accessing big data, but the reader should take note that Hadoop also works well with smaller amounts of data (as well as the infinitely large amounts), so you can be sure that any example used in this book will not have to be reworked based upon the actual size (or actual volume) of data you may be using in your future analysis projects.
In an effort to paint a complete picture here (and we'll do this throughout all of the chapters), we will also take some time and consider the how and why of non-Hadoop (or alternate) solutions to the examples given--and considering how well they may compare to a Hadoop solution.
When it comes to performing data analytics, facts can be stupid and stubborn things. They can provide us with the business intelligence metrics we long for, but without predictive analytics based on contextual interpretation, we may find ourselves using skewed quantitative analysis that produces less-than-desirable results.
The appropriate use of context in analytics makes all the difference in achieving optimal results, according to a [email protected] staff article, which is available at https://onlinebusiness.american.edu/how-do-we-use-data-for-good-add-context/.
In Chapter 3, Context - Understanding Your Data Using R, of this book, the importance of gaining an understanding of the data you are working with and specifically, the challenges of establishing or adding context to big data will be covered with working examples demonstrating solutions for effectively addressing the issues that are presented.
Adding context to data requires manipulation of that data to review and perhaps reformat, adding calculations, aggregations, or additional columns or re-ordering, and so on.
In Chapter 3, Context - Understanding Your Data Using R, we will introduce the R programming tool as the choice for performing this type of processing and manipulating your data.
R is a language and environment very focused on statistical computing.
R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and so on) and graphical techniques, and it is highly extensible. You can refer to more information on this at www.r-project.org/about.html.
Beyond the perhaps more sophisticated modeling techniques such as performing a time-series analysis, R also supports the need for performing simple tasks such as creating a summary table, which can be used to determine data groupings.
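In R, a summary table of this kind comes from functions such as table(); purely for illustration (and in Python rather than R, to keep the sketch self-contained), the same idea looks like this, with invented region labels:

```python
from collections import Counter

# A frequency (summary) table over a categorical column, analogous in
# spirit to R's table() function. The region labels are invented.

regions = ["East", "West", "East", "North", "West", "East"]
summary = Counter(regions)
print(summary.most_common())  # → [('East', 3), ('West', 2), ('North', 1)]
```

A table like this is often the first step toward establishing context: it reveals the groupings present in the data before any chart is drawn.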
One thing to keep in mind is that R preserves everything in machine memory.
This can become a problem if you are working with big data (even with the introduction of the low resource costs of today).
With R, sampling is a popular method for dealing with big data. In Chapter 3, Context - Understanding Your Data Using R, our focus is on gaining context of data, so sampling is acceptable.
R is great for manipulating and cleaning data, producing probability statistics, as well as actually creating visualizations with data, so it's a good choice for establishing a context for your data.
It has been said that beauty is in the eye of the beholder, and the same can be said when trying to define data quality. What this means is that if the data meets your level of expectations or, at least, the minimal requirements of a particular project, then it has some form or level of quality.
Data can have acceptable quality even if there are known complications with it. These complications can be overcome with processes we'll discuss later or, if appropriate, simply overlooked.
Even though your data may contain acceptable complications, make no mistake: any data visualization based upon that data will only prove to be a valuable tool if the quality of the data is assured at the required level. When using large volumes of data, however, it can become extremely difficult to address the quality of the data.
There are many examples of the effects of poor data quality, such as the following, which was written in an article by Sean Jackson (http://www.actian.com/about-us/blog/never-underestimate-importance-good-data-quality/):
"A business professional could not understand why response rates to campaigns and activities were so low. Nor why they couldn't really use analytics to get any competitive advantage. A quick investigation of their data and systems soon showed that a large section of the data they were using was either out-of-date, badly formatted, or just erroneous."
Data quality solutions must enable you to clean, manage, and make reliable data available across your organization.
Chapter 4, Addressing Big Data Quality, of this book offers working examples demonstrating solutions for effectively assessing and improving the level of quality of big data sources.
Typically, the first step in determining the quality of your data is performing a process referred to as profiling the data (mentioned earlier in this chapter). This is sort of an overall auditing process that helps you examine and determine whether your existing data sources meet the quality expectations or perhaps standards of your intended use or purpose.
Profiling is vitally important in that it can help you identify concerns within the data that, if attended to up front (before actually creating a data visualization), will save valuable time that would otherwise be spent processing and reprocessing poor-quality data later. More importantly, it can save you from creating and presenting a visualization that contains an inaccurate view of the data.
Data profiling becomes even more critical when working with perhaps unstructured raw data sources (or data that is a mix of structured and unstructured data) that do not have referential integrity or any other quality controls. In addition, single source (data sourced from only a single place) and multisource data (a dataset that is sourced from more than one place) will most likely have additional opportunities for data concerns.
Concerns found in single sources are typically intensified when multiple sources need to be integrated into one dataset for a project. Each source may contain data concerns, but in addition, the same data in different data sources may be represented differently, overlap, or contradict.
Typical profiling tasks include the following:
Identifying fields/columns within the data
Listing field/column attributes and statistics such as column lengths and value distribution percentages
Reviewing field/column value distributions
Reporting of value statistics such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for date and time columns
Identifying all the distinct values in the data
Identifying patterns and pattern distributions within the data
The goal of these tasks (and others) is to (as the name implies) establish your data's profile by determining the characteristics, relationships, and patterns within the data and, hopefully, produce a clearer view of its content and quality, that is, the data profile.
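Several of the profiling tasks listed above can be sketched in just a few lines. The following is a hedged illustration in Python with a made-up miniature dataset (real profiling would run over an actual big data source), covering column identification, value statistics, distinct values, and pattern distributions:

```python
import statistics
from collections import Counter

# A tiny stand-in dataset; field names and values are hypothetical
rows = [
    {"id": "A-101", "amount": 250.0},
    {"id": "A-102", "amount": 975.5},
    {"id": "B-201", "amount": 250.0},
    {"id": "B-202", "amount": 410.25},
]

# Identify the fields/columns present in the data
columns = sorted({key for row in rows for key in row})
print("columns:", columns)

# Value statistics for a numeric column: min, max, average, standard deviation
amounts = [row["amount"] for row in rows]
print("min/max/avg/std:", min(amounts), max(amounts),
      statistics.mean(amounts), round(statistics.stdev(amounts), 2))

# Distinct values in the data
print("distinct amounts:", sorted(set(amounts)))

# Pattern distribution: reduce each id to its shape, e.g. 'A-101' -> 'X-999'
def pattern(value: str) -> str:
    return "".join("9" if ch.isdigit() else ("X" if ch.isalpha() else ch)
                   for ch in value)

print("id patterns:", Counter(pattern(row["id"]) for row in rows))
```

The pattern step is often the most revealing: a column expected to hold one shape of identifier that profiles to several distinct patterns is an immediate quality concern.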
After profiling, one would most likely proceed with performing some form of scrubbing (also sometimes referred to as cleansing or in some cases preparing) of the data (to improve its quality, also mentioned earlier in this chapter).
The processes of cleansing data may be somewhat or even entirely different depending upon the data's intended use. Because of this, defining what constitutes an error is the critical first step, to be performed before any processing of the data. Even what is done to resolve the defined errors may differ, again based upon the data's intended use.
During the process of cleansing or scrubbing your data, you would perform tasks such as perhaps reformatting fields or adding missing values, and so on.
Generally, scrubbing is made up of the following efforts:
Defining and determining errors within the data--what do you consider an error?
Searching and identifying error instances--once an error is defined, where do they exist in your data?
Correction of the errors--remove them or update them to acceptable values.
Documenting error instances and error types--or labeling (how the error was determined and what was done to resolve it).
Updating the entry mechanism to avoid future errors--create a process to make sure future occurrences of this type are dealt with.
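The first four of these efforts can be sketched together. The following is a minimal Python illustration, assuming a hypothetical rule that an `age` field must be an integer between 0 and 120 (the records and the resolution chosen--setting bad values to `None`--are made up for demonstration):

```python
# Hypothetical input records; two violate the defined rule
records = [
    {"name": "Ann", "age": "34"},
    {"name": "Bob", "age": "thirty"},   # badly formatted
    {"name": "Cary", "age": "-5"},      # out of range
]

error_log = []  # document each error instance, its type, and the resolution

def scrub_age(record):
    raw = record["age"]
    try:
        age = int(raw)                  # search: is the value numeric at all?
    except ValueError:
        error_log.append((record["name"], raw, "not numeric -> set to None"))
        record["age"] = None            # correct: replace with an accepted value
        return record
    if not 0 <= age <= 120:             # search: is the value in the defined range?
        error_log.append((record["name"], raw, "out of range -> set to None"))
        record["age"] = None
    else:
        record["age"] = age
    return record

cleaned = [scrub_age(dict(r)) for r in records]
print(cleaned)
print(error_log)
```

The final effort--updating the entry mechanism--happens outside the scrubbing code itself: the error log produced here is exactly the evidence you would take back to the owners of the data-entry process.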
In Chapter 4, Addressing Big Data Quality, we've elected to continue (from the previous chapter) to leverage the R programming language to accomplish some of the profiling work, and also to introduce and use the open source Data Manager utility for manipulating our data and addressing its quality.
Data Manager is an excellent utility, available as a library of Java code, aimed at data synchronization work: moving data between different locations and different databases.
Data visualization is the practice of organizing and displaying data, manually or otherwise, in a pictorial or graphic format in an attempt to enable your audience to:
See the results of your analysis efforts more clearly
Simplify the complexities within the data you are using
Understand and grasp a point that you are using the data to make
This concept of using pictures--typography, color, contrast, and shape--to communicate or understand data is not new and has been around for literally centuries, from the manual creation of maps and graphs in the 17th century to the invention of the pie chart in the early 1800s.
Today, computers can be used to process large amounts of data lightning fast, making visualizations tremendously more valuable. Going forward, we can expect the data visualization process to continue to evolve, perhaps becoming more of a mixture of art and science than a number-crunching technology.
An exciting example of this evolution is how the industry has moved past generating and publishing charts and graphs for an audience to review and deliberate on, to an expectation of interactive visualizations.
With interactive visualization, we can take the concept of data visualization much further by using technology to allow the audience to interact with the data. The user gains the self-service ability to drill down into the generated pictures, charts, and graphs (to access more specific details) and, interactively in real time (or near real time), to change what data is displayed (perhaps a different time frame or event) and how it is processed and/or presented (maybe selecting a bar graph rather than a pie chart).
This allows visualizations to be much more effective and personalized.
In Chapter 5, Displaying Results with D3, we will go through the topic of displaying the results of analysis on big data in a typical web browser using Data Driven Documents (D3), in a variety of examples. D3 lets you apply prebuilt data visualizations to your datasets.
Data Driven Documents is referred to within the open source community as D3.
These library components give you excellent tools for big data visualization and a data-driven approach to DOM manipulation. D3's functional style allows the reuse of library code modules that you (or others) have already built, adding pretty much any particular features you need or want (or don't want). This creates a means that can become as powerful as you want (or have the time to make) it to be, giving a unique style to your data visualizations and making them interactive--exactly how you want or need them to be.
As discussed earlier in this chapter, big data is collected and accumulated daily--in fact, minute by minute--and organizations rely on this information for a variety of reasons.
Various types of reporting formats are utilized on this data, including data dashboards.
As with everything, there are various apprehensions as to the most accurate definition of what a data dashboard is.
For example, A. Chiang writes:
"A dashboard is a visual display of the most important information needed to achieve one or more objectives; consolidated and arranged on a single screen so the information can be monitored at a glance."
Refer to the following link for more information: http://www.dashboardinsight.com/articles/digital-dashboards/fundamentals/what-is-a-dashboard.aspx.
Whatever the definition, any well-designed and well-constructed dashboard can supply timely, important information for its audience to use in decision making.
It is critical that dashboards present data in a relevant, concise, and well-thought-out manner (not just as a collection of visual representations in a workbook or spreadsheet). In addition, dashboards must have a supporting infrastructure capable of refreshing them in a well-timed manner, including some form of data quality assurance (DQA). Making decisions based upon a dashboard with incorrectly presented, stale, or simply incorrect data can lead to disaster.
Chapter 6, Dashboard for Big Data - Tableau, of this book examines the topic of effective dashboarding and includes working examples demonstrating solutions for effectively presenting results based upon your big data analysis in a real-time dashboard format using Tableau.
Tableau is categorized as business intelligence software designed to help people see and understand data; more than just a code library, Tableau is considered to be a suite or a family of interactive data visualization products.
Tableau's structure allows us to combine multiple views of data from multiple sources into a single, highly effective dashboard that can provide data consumers with much richer insights. Tableau also works with a variety of data formats (both structured and unstructured) and can handle the volumes of big data--literally petabytes or terabytes, millions or billions of rows--turning that big data into valuable visualizations for targeted audiences.
To address the velocity of today's big data world, you can use Tableau to connect directly to local and cloud data sources, or just import your data for fast in-memory (more on in-memory later in this book) performance.
Another goal of Tableau is self-service analytics (which we mentioned earlier in this chapter and will talk more about later on), where a user can have a dialog with selected data, asking questions (in real time, not in batch mode) using easy point-and-click analytics to mine big data intuitively and effectively, discovering understandings and opportunities that may exist within the dataset or datasets.
Some of the more exciting abilities Tableau offers include:
Real-time drag-and-drop cluster analysis
Cross data source joining
Powerful data connectors
Real-time territory or region data exploration
In Chapter 7, Dealing with Outliers Using Python, we will dive into outliers.
As was defined earlier in this chapter, an outlier is an observation point that is distant or vastly different from the other observed data points within the data.
Although outliers typically represent (only) about 1 to 5 percent of your data, when you're working with big data, investigating, or even just viewing, 1 to 5 percent of that data is rather difficult.
Outliers, you see, can be determined to be noninfluential or very influential to the point you are trying to make with your data visualization.
The act or process of making this determination is critically important to your analysis, but it is also very problematic when dealing with the larger volumes, many varieties, and velocities of big data. For example, a fundamental step in making this determination is called sizing your samples: the mathematical process of calculating the percentage of outliers relative to the size of the data sample--no simple task when the data is in terabytes or petabytes!
Identifying and removing outliers can be tremendously complicated and there are many differences in opinions as to how to go about determining the percentage of outliers that exist in your dataset as well as determining their effect on the data and deciding what to do with them. It is, however, generally accepted that an automated process can be created that can facilitate at least the identification of outliers, possibly even through the use of visualization.
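As a simple illustration of such an automated identification pass, the sketch below flags any observation more than two standard deviations from the mean. The numbers are made up, and the two-standard-deviation threshold is only one of many possible rules; on true big data volumes this logic would have to run over samples or in a distributed fashion:

```python
import statistics

# Hypothetical sample of observations containing two extreme values
observations = [52, 55, 49, 51, 53, 50, 48, 54, 250, 47, 51, -200]

mean = statistics.mean(observations)
stdev = statistics.stdev(observations)

# Flag any point more than two standard deviations from the mean
outliers = [x for x in observations if abs(x - mean) > 2 * stdev]
print("outliers:", outliers)

# Size the sample: the percentage of outliers relative to the sample size
print("outlier percentage:",
      round(100 * len(outliers) / len(observations), 1))
```

Note that identification is only the first step; whether the flagged points are influential to your visualization, and whether to cap, remove, or keep them, remains an adjudication decision.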
Carrying on, all the approaches for the investigation and adjudication of outliers, such as sorting, capping, graphing, and so on, require manipulating and processing the data using a tool that is feature-rich and robust.
This chapter offers working examples demonstrating solutions for effectively and efficiently identifying and dealing with big data outliers (as well as some other dataset anomalies) using Python.
Python is a scripting language that is extremely easy to learn and incredibly readable, since its coding syntax so closely resembles the English language.
According to the article, The 9 most in-demand programming languages of 2016, by Bouwkamp, available at http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016, Python is listed among the most in-demand programming languages (at the time of writing).
Born as far back as 1989 and created by Guido van Rossum, Python is actually very simple in nature, but it is also considered by the industry to be extremely powerful and fast, and it can be run in almost any environment.
As per www.python.org:
"Open sourced (and free!), Python is part of the winning formula for productivity, software quality, and maintainability at many companies and institutions around the world."
There is a growing interest within the industry in using the Python language for data analysis, and even for big data analysis. It is an exceptional choice for the data scientist's typical day-to-day activities, as it provides libraries--in fact, a standard library, plus packages focusing specifically on big data, such as Pydoop and SciPy--to accomplish almost anything you need or want to do with the data you have or are accumulating, including:
Building websites and web pages
Accessing and manipulating data
Building predictive and explanatory models
Evaluating models on additional data
Integrating models into production systems
As a final note here, Python's standard library is very extensive, offering a wide range of built-in modules that provide access to system functionalities, as well as standardized solutions to many problems that occur in everyday programming, making it an obvious choice for dealing with big data outliers and related processing.
In Chapter 8, Big Data Operational Intelligence with Splunk, of this book, we concentrate on big data Operational Intelligence.
Operational intelligence (OI) is a type of analytics that attempts to deliver visibility and insight from (usually machine-generated) operational or event data. Queries are run against streaming data feeds in real time, producing analytic results as operational instructions that can be immediately acted upon by an organization through manual or automated actions (a clear example of turning datasets into value!).
Sophisticated OI systems also provide the ability to associate metadata with certain metrics, process steps, channels, and so on, found within data. With this ability, it becomes easy to acquire additional related information. For example, machine-generated operational data is typically full of unique identifiers and result or status codes. These codes or identifiers may be efficient for processing and storage, but are not always easily interpreted by human beings. To make this data more readable (and therefore more valuable), we can associate additional, more user-friendly information with the data results--possibly in the form of a status or event description, or perhaps a product or machine name.
Once there is an understanding of the challenges of applying basic analytics and visualization techniques to operational big data, the value of that data can be better or more quickly realized. In this chapter, we offer working examples demonstrating solutions for the valuing of operational or event big data with operational intelligence using Splunk.
So, what is Splunk? H. Klein says:
"Splunk started out as a kind of "Google for Log files". It does a lot more... It stores all your logs and provides very fast search capabilities roughly in the same way Google does for the internet..." -- https://helgeklein.com/blog/2014/09/splunk-work/
Splunk software is a great tool to help unlock hidden value in machine generated, operational data (as well as other types of data). With Splunk, you can collect, index, search, analyze, and visualize all your data in one place, providing an integrated method to organize and extract real-time insights from massive amounts of (big data) machine data from virtually anywhere.
Splunk stores data in flat files, assigning indexes to the files; it doesn't require any database software running in the background to make this happen. (The Splunk components that create and manage these indexes are called indexers.) Splunk can index any type of time-series data (data with timestamps), making it an optimal choice for big data OI solutions. During indexing, Splunk breaks the data into events based on the timestamps it identifies.
Although simple search terms will work (for example, a machine ID), Splunk also offers its own Search Processing Language (SPL). SPL (think of it as somewhat like SQL) is an extremely powerful tool for searching enormous amounts of big data and performing statistical operations on what is relevant within a specific context.
There are multiple versions of Splunk, including a free version that is pretty much fully functional.
In this chapter, we offered an explanation of just what the term data visualization means and discussed industry-accepted conventional visualization concepts.
In addition, we introduced the challenges of working with big data and outlined the topics and technologies that the rest of this book will present.
In the next chapter, we address big data volume and velocity using Hadoop.