Practical Data Analysis Using Jupyter Notebook

By Marc Wintjen

About this book

Data literacy is the ability to read, analyze, work with, and argue using data. Data analysis is the process of cleaning and modeling your data to discover useful information. This book combines these two concepts by sharing proven techniques and hands-on examples so that you can learn how to communicate effectively using data.

After introducing you to the basics of data analysis using Jupyter Notebook and Python, the book will take you through the fundamentals of data. Packed with practical examples, this guide will teach you how to clean, wrangle, analyze, and visualize data to gain useful insights, and you'll discover how to answer questions using data with easy-to-follow steps.

Later chapters teach you about storytelling with data using charts, such as histograms and scatter plots. As you advance, you'll understand how to work with unstructured data using natural language processing (NLP) techniques to perform sentiment analysis. All the knowledge you gain will help you discover key patterns and trends in data using real-world examples. In addition to this, you will learn how to handle data of varying complexity to perform efficient data analysis using modern Python libraries.

By the end of this book, you'll have gained the practical skills you need to analyze data with confidence.

Publication date:
June 2020
Publisher
Packt
Pages
322
ISBN
9781838826031

 
Fundamentals of Data Analysis

Welcome and thank you for reading my book. I'm excited to share my passion for data and I hope to provide the resources and insights to fast-track your journey into data analysis. My goal is to educate, mentor, and coach you throughout this book on the techniques used to become a top-notch data analyst. During this process, you will get hands-on experience using the latest open source technologies available such as Jupyter Notebook and Python. We will stay within that technology ecosystem throughout this book to avoid confusion. However, you can be confident the concepts and skills learned are transferable across open source and vendor solutions with a focus on all things data.

In this chapter, we will cover the following:

  • The evolution of data analysis and why it is important
  • What makes a good data analyst?
  • Understanding data types and why they are important
  • Data classifications and data attributes explained
  • Understanding data literacy
 

The evolution of data analysis and why it is important

To begin, we should define what data is. You will find varying definitions but I would define data as the digital persistence of facts, knowledge, and information consolidated for reference or analysis. The focus of my definition should be the word persistence because digital facts remain even after the computers used to create them are powered down and they are retrievable for future use. Rather than focus on the formal definition, let's discuss the world of data and how it impacts our daily lives. Whether you are reading a review to decide which product to buy or viewing the price of a stock, consuming information has become significantly easier to allow you to make informed data-driven decisions.

Data has been entangled into products and services across every industry from farming to smartphones. For example, America's Grow-a-Row, a New Jersey farm to food bank charity, donated over 1.5 million pounds of fresh produce to feed people in need throughout the region each year, according to their annual report. America's Grow-a-Row has thousands of volunteers and uses data to maximize production yields during the harvest season.

As the demand for being a consumer of data has increased, so has the supply side, which is characterized as the producer of data. Producing data has increased in scale as the technology innovations have evolved. I'll discuss this in more detail shortly, but this large scale consumption and production can be summarized as big data. A National Institute of Standards and Technology report defined big data as consisting of extensive datasets—primarily in the characteristics of volume, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis.

This explosion of big data is characterized by the 3Vs, which are Volume, Velocity, and Variety, and has become a widely accepted concept among data professionals:

  • Volume is based on the quantity of data that is stored in any format such as image files, movies, and database transactions, which are measured in gigabytes, terabytes, or even zettabytes. To give context, you can store hundreds of thousands of songs or pictures on one terabyte of storage space. Even more amazing than the figures is how much it costs you. Google Drive, for example, offers up to 5 TB (terabytes) of storage for free according to their support site.
  • Velocity is the speed at which data is generated. This process covers how data is both produced and consumed. For example, batch processing is how data feeds are sent between systems where blocks of records or bundles of files are sent and received. Modern velocity approaches are real time, streams of data where the data flow is in a constant state of movement.
  • Variety is all of the different formats that data can be stored in, including text, image, database tables, and files. This variety has created both challenges and opportunities for analysis because of the different technologies and techniques required to work with the data.
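To put the Volume example in numbers, here is a rough back-of-the-envelope sketch. The 4 MB average file size is an assumption made for illustration, not a figure from this book, and the calculation uses decimal (SI) units:

```python
# Assumption: an average song or photo is about 4 MB.
BYTES_PER_MB = 10**6    # decimal (SI) megabyte
BYTES_PER_TB = 10**12   # decimal (SI) terabyte

avg_file_mb = 4
files_per_tb = BYTES_PER_TB // (avg_file_mb * BYTES_PER_MB)

print(f"{files_per_tb:,} files fit in one terabyte")  # 250,000
```

Under that assumption, one terabyte holds 250,000 files, which is in line with the "hundreds of thousands" estimate.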

Understanding the 3Vs is important for data analysis because you must become good at being both a consumer and producer of data. Simple questions such as how your data is stored, when a file was produced, where a database table is located, and in what format you should store the output of your analysis can all be addressed by understanding the 3Vs.

There is some debate, with which I disagree, that the 3Vs should be expanded to include Value, Visualization, and Veracity. No worries, we will cover these concepts throughout this book.

This leads us to a formal definition of data analysis: a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making, as stated in Review of business intelligence through data analysis.

Xia, B. S., & Gong, P. (2015). Review of business intelligence through data analysis. Benchmarking, 21(2), 300-311. doi:10.1108/BIJ-08-2012-0050

What I like about this definition is the focus on solving problems using data without the focus on which technologies are used. To make this possible there have been some significant technological milestones, the introduction of new concepts, and people who have broken down the barriers.

To showcase the evolution of data analysis, I compiled a few tables of key events from 1945 until 2018 that I feel are the most influential. These events range from the work of innovators such as Dr. E.F. Codd, who created the concept of a database, to the launch of the iPhone, which spawned the mobile analytics industry.

The data behind the following diagram was collected from multiple sources, centralized in one place as a table of columns and rows, and then visualized as a dendrogram chart. I posted the CSV file in the GitHub repository for reference: https://github.com/PacktPublishing/python-data-analysis-beginners-guide. Organizing the information and conforming the data in one place made the data visualization easier to produce and enables further analysis:

That process of collecting, formatting, and storing data in this readable format demonstrates the first step of becoming a producer of data. To make this information easier to consume, I summarize these events by decades in the following table:

| Decade | Count of Milestones |
|--------|---------------------|
| 1940s  | 2                   |
| 1950s  | 2                   |
| 1960s  | 1                   |
| 1970s  | 2                   |
| 1980s  | 5                   |
| 1990s  | 9                   |
| 2000s  | 14                  |
| 2010s  | 7                   |

From the preceding summary table, you can see that the majority of these milestone events occurred in the 1990s and 2000s. What is insightful about this analysis is that recent innovations have removed the barriers to entry for individuals to work with data. Before the 1990s, the high purchasing costs of hardware and software restricted the field of data analysis to a relatively limited number of careers. Also, the costs associated with access to the underlying data for analysis were great. Working in the field typically required higher education and a specialized career, such as software programmer or actuary.
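The "majority" claim can be checked with a few lines of Python. This is a minimal sketch that simply re-keys the counts from the summary table; the variable names are my own:

```python
# Milestone counts per decade, copied from the summary table.
milestones_by_decade = {
    "1940s": 2, "1950s": 2, "1960s": 1, "1970s": 2,
    "1980s": 5, "1990s": 9, "2000s": 14, "2010s": 7,
}

total = sum(milestones_by_decade.values())
peak_decades = milestones_by_decade["1990s"] + milestones_by_decade["2000s"]
share = peak_decades / total

print(f"Total milestones: {total}")
print(f"1990s and 2000s combined: {peak_decades} ({share:.0%})")
```

The two decades account for 23 of the 42 milestones, or roughly 55 percent.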

A visual way to look at this same data would be a trend bar chart, as shown in the following diagram. In this example, the height of the bars represents the same information as in the preceding table and the Count of Milestone events is on the left or the y axis. What is nice about this visual representation of the data is that it is a faster way for the consumer to see the upward pattern of where most events occur without scanning through the results found in the preceding diagram or table:
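If you want to reproduce the idea without a charting library, a rough text version of a bar chart can be sketched as follows. Note that this sketch draws horizontal bars rather than the vertical bars described above, and the helper name `text_bar_chart` is hypothetical:

```python
# Milestone counts per decade, copied from the summary table.
milestones_by_decade = {
    "1940s": 2, "1950s": 2, "1960s": 1, "1970s": 2,
    "1980s": 5, "1990s": 9, "2000s": 14, "2010s": 7,
}

def text_bar_chart(counts):
    """Return one line per label, with a bar of '#' characters sized by the count."""
    width = max(len(label) for label in counts)
    return [
        f"{label:<{width}} | {'#' * value} {value}"
        for label, value in counts.items()
    ]

chart_lines = text_bar_chart(milestones_by_decade)
print("\n".join(chart_lines))
```

Even in this crude form, the longer bars for the 1990s and 2000s stand out faster than they do when scanning the numbers in the table.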

The evolution of data analysis is important to understand because now you know some of the pioneers who opened doors for opportunities and careers working with data, along with key technology breakthroughs, significantly reducing the time to make decisions regarding data both as consumers and producers.

 

What makes a good data analyst?

I will now break down the contributing factors that make up a good data analyst. From my experience, a good data analyst must be eager to learn and continue to ask questions throughout the process of working with data. The focus of those questions will vary based on the audience who are consuming the results. To be an expert in the field of data analysis, excellent communication skills are required so you can understand how to translate raw data into insights that can impact change in a positive way. To make it easier to remember, use the following acronyms to help to improve your data analyst skills.

Know Your Data (KYD)

Knowing your data is all about understanding the source technology that was used to create the data along with the business requirements and rules used to store it. Do research ahead of time to understand what the business is all about and how the data is used. For example, if you are working with a sales team, learn what drives their team's success. Do they have daily, monthly, or quarterly sales quotas? Do they do reporting for month-end/quarter-end that goes to senior management and has to be accurate because it has financial impacts on the company? Learning more about the source data by asking questions about how it will be consumed will help focus your analysis when you have to deliver results.


KYD is also about data lineage, which is understanding how the data was originally sourced including the technologies used along with the transformations that occurred before, during, and afterward. Refer back to the 3Vs so you can effectively communicate the responses from common questions about the data such as where this data is sourced from or who is responsible for maintaining the data source.

Voice of the Customer (VOC)

The concept of VOC is nothing new and has been taught at universities for years as a well-known concept applied in sales, marketing, and many other business operations. VOC is the concept of understanding customer needs by learning from or listening to their needs before, during, and after they use a company's product or service. The relevance of this concept remains important today and should be applied to every data project that you participate in. This process is where you should interview the consumers of the data analysis results before even looking at the data. If you are working with business users, listen to what their needs are by writing down the specific business questions they are trying to answer.

Schedule a working session with them where you can engage in a dialog. Make sure you focus on their current pain points such as the time to curate all of the data used to make decisions. Does it take three days to complete the process every month? If you can deliver an automated data product or a dashboard that can reduce that time down to a few mouse clicks, your data analysis skills will make you look like a hero to your business users.

During a tech talk at a local university, I was asked the difference between KYD and VOC. I explained that both are important and focused on communicating and learning more about the subject area or business. The key differences are prepared versus present. KYD is all about doing your homework ahead of time to be prepared before talking to experts. VOC is all about listening to the needs of your business or consumers regarding the data.

Always Be Agile (ABA)

The agile methodology has become commonplace in the industry for the Software Development Life Cycle (SDLC) of application, web, and mobile development. One of the reasons the agile project management process is successful is that it creates an interactive communication line between the business and technical teams to iteratively deliver business value through the use of data and usable features.

The agile process involves creating stories with a common theme where a development team completes tasks in 2-3 week sprints. In that process, it is important to understand the what and the why for each story including the business value/the problem you are trying to solve.

The agile approach has ceremonies where the developers and business sponsors come together to capture requirements and then deliver incremental value. That improvement in value could be anything from a new dataset available for access to a new feature added to an app.

See the following diagram for a nice visual representation of these concepts. Notice how these concepts are not linear and should require multiple iterations, which help to improve the communication between all people involved in the data analysis before, during, and after delivery of results:

Finally, I believe the most important trait of a good data analyst is a passion for working with data. If your passion can be fueled by continuously learning about all things data, it becomes a lifelong and fulfilling journey.

 

Understanding data types and their significance

As we have uncovered with the 3Vs, data comes in all shapes and sizes, so let's break down some key data types and better understand why they are important. To begin, let's classify data in general terms of unstructured, semi-structured, and structured.

Unstructured data

The concept behind unstructured data, which is textual in nature, has been around since the 1990s and includes the following examples: the body of an email message, tweets, books, health records, and images. A simple example of unstructured data would be an email message body that is classified as free text. Free text may have some obvious structure that a human can identify such as free space to break up paragraphs, dates, and phone numbers, but having a computer identify those elements would require programming to classify any data elements as such. What makes free text challenging for data analysis is its inconsistent nature, especially when trying to work with multiple examples.

When working with unstructured data, there will be inconsistencies because of the nature of free text including misspellings, the different classification of dates, and so on. Always have a peer review of the workflow or code used to curate the data.
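To illustrate the programming needed to pull data elements out of free text, here is a minimal sketch using Python's re module. The email body is hypothetical, and the two patterns only cover the specific date and US phone formats shown, so real-world text would need broader rules:

```python
import re

# Hypothetical email body used only for illustration.
email_body = """Hi team,

The review meeting moved from 08/19/2000 to 8/21/2000.
Call me at 212-555-1212 or my home line, 212 555-4567, if that is a problem.
"""

# Dates written as M/D/YYYY or MM/DD/YYYY.
date_pattern = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
# US phone numbers with a dash or a space after the area code.
phone_pattern = re.compile(r"\b\d{3}[- ]\d{3}-\d{4}\b")

dates = date_pattern.findall(email_body)
phones = phone_pattern.findall(email_body)

print(dates)   # ['08/19/2000', '8/21/2000']
print(phones)  # ['212-555-1212', '212 555-4567']
```

Notice that the extracted values are still inconsistent (a leading zero in one date, a space instead of a dash in one phone number), which is exactly the kind of detail a peer review should catch.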

Semi-structured data

Next, we have semi-structured data, which is similar to unstructured data. The key difference is the addition of tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:

{
  "First_Name": "John",
  "Last_Name": "Doe",
  "Age": 42,
  "Home_Address": {
    "Address_1": "123 Main Street",
    "Address_2": [],
    "City": "New York",
    "State": "NY",
    "Zip_Code": "10021"
  },
  "Phone_Number": [
    {
      "Type": "cell",
      "Number": "212-555-1212"
    },
    {
      "Type": "home",
      "Number": "212 555-4567"
    }
  ],
  "Children": [],
  "Spouse": "yes"
}

This JSON-formatted code allows for free text elements such as a street address, a phone number, and age, but now has tags created to identify those fields and values, which is a concept called key-value pairs. This key-value pair concept allows for the classification of data with a structure for analysis such as filtering, but still has the flexibility to change the elements as necessary to support the unstructured/free text. The biggest advantage of semi-structured data is the flexibility to change the underlying schema of how the data is stored. The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).
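As a minimal sketch of how key-value pairs support filtering, the following Python code loads a shortened copy of the JSON record with the standard json module and reads values by key. The variable names are my own:

```python
import json

# A shortened copy of the JSON record shown earlier.
record_json = """
{
  "First_Name": "John",
  "Last_Name": "Doe",
  "Age": 42,
  "Home_Address": {"City": "New York", "State": "NY", "Zip_Code": "10021"},
  "Phone_Number": [
    {"Type": "cell", "Number": "212-555-1212"},
    {"Type": "home", "Number": "212 555-4567"}
  ]
}
"""

record = json.loads(record_json)

# Nested keys are reached one level at a time.
city = record["Home_Address"]["City"]

# Filtering: keep only the phone numbers tagged as "cell".
cell_numbers = [p["Number"] for p in record["Phone_Number"] if p["Type"] == "cell"]

print(city)          # New York
print(cell_numbers)  # ['212-555-1212']
```

The keys give the analysis something to hold on to, even though the values themselves remain free text.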

The disadvantage to semi-structured data is that you may still find inconsistencies with data values depending on how the data was captured. Ideally, the burden on consistency is moved to the User Interface (UI), which would have coded standards and business rules such as required fields to increase the quality but, as a data analyst who practices KYD, you should validate that during the project.
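One way to practice that validation is sketched below. The list of required fields and the target phone format are assumptions made for illustration, and the Age key has been left out of the sample record on purpose so that the check has something to find:

```python
import re

# Sample record based on the earlier JSON, with "Age" deliberately omitted.
record = {
    "First_Name": "John",
    "Last_Name": "Doe",
    "Phone_Number": [
        {"Type": "cell", "Number": "212-555-1212"},
        {"Type": "home", "Number": "212 555-4567"},
    ],
    "Spouse": "yes",
}

# Assumption: these are the fields the business treats as required.
REQUIRED_FIELDS = ["First_Name", "Last_Name", "Age"]
missing = [field for field in REQUIRED_FIELDS if field not in record]

def normalize_phone(number):
    """Keep only the digits, then format 10-digit numbers as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", number)
    if len(digits) != 10:
        return None  # flag anything that is not a 10-digit number
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

normalized = [normalize_phone(p["Number"]) for p in record["Phone_Number"]]

print(missing)     # ['Age']
print(normalized)  # ['212-555-1212', '212-555-4567']
```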

Structured data

Finally, we have structured data, which is the most common type found in databases and data created from applications (apps or software) and code. The biggest benefit with structured data is consistency and relatively high quality between each record, especially when stored in the same database table. The conformity of data and structure is the foundation for analysis, which allows both the producers and consumers of structured data to come to the same results. The topic of databases, or Database Management Systems (DBMS) and Relational Database Management Systems (RDBMS), is vast and will not be covered here, but having some understanding will help you to become a better data analyst.

The following diagram is a basic Entity-Relationship (ER) diagram of three tables that would be found in a database:

In this example, each entity would represent physical tables stored in the database, named car, part, and car_part_bridge. The relationship between the car and part is defined by the table called car_part_bridge, which can be classified by multiple names such as bridge, junction, mapping, or link table. The name of each field in the table would be on the left such as part_id, name, or description found in the part table.

The pk label next to the car_id and part_id field names helps to identify the primary keys for each table. This allows for one field to uniquely identify each record found in the table. If a primary key in one table exists in another table, it would be called a foreign key, which is the foundation of how the relationship between the tables is defined and ultimately joined together.

Finally, the text aligned on the right side next to the field name labeled as int or text is the data type for each field. We will cover that concept next and you should now feel comfortable with the concepts for identifying and classifying data.
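To make the primary key and foreign key relationship concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table names and the part fields follow the description above, but the car table's name column and all of the sample rows are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE car  (car_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE part (part_id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    -- The bridge table holds foreign keys that point at both primary keys.
    CREATE TABLE car_part_bridge (
        car_id  INTEGER REFERENCES car(car_id),
        part_id INTEGER REFERENCES part(part_id)
    );
    INSERT INTO car  VALUES (1, 'Sedan'), (2, 'Pickup');
    INSERT INTO part VALUES (10, 'Wheel', 'Alloy wheel'), (20, 'Tow hitch', 'Rear hitch');
    INSERT INTO car_part_bridge VALUES (1, 10), (2, 10), (2, 20);
""")

# Join car to part through the bridge table.
rows = conn.execute("""
    SELECT car.name, part.name
    FROM car
    JOIN car_part_bridge AS b ON b.car_id = car.car_id
    JOIN part ON part.part_id = b.part_id
    ORDER BY car.car_id, part.part_id
""").fetchall()
conn.close()

print(rows)  # [('Sedan', 'Wheel'), ('Pickup', 'Wheel'), ('Pickup', 'Tow hitch')]
```

The bridge table lets one part belong to many cars and one car have many parts, which is why it is also called a junction or mapping table.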

Common data types

Data types are a well-known concept in programming languages and are found in many different technologies. I have simplified the definition to the details of the data that is stored and its intended usage. A data type will also create consistency for each data value as it's stored on disk or in memory.

Data types will vary depending on the software and/or database used to create the structure. Hence, we won't be covering all the different types across all of the different coding languages but let's walk through a few examples:

| Common data type | Common short name | Sample value | Example usage |
|---|---|---|---|
| Integers | int | 1235 | Counting occurrences, summing values, or the average of values such as sum(hits) |
| Booleans | bit | TRUE | Conditional testing such as if sales > 1,000, true else false |
| Geospatial | float or spatial | 40.229290, -74.936707 | Geo analytics based on latitude and longitude |
| Characters/string | char | A | Tagging, binning, or grouping data |
| Floating-point numbers | float or double | 2.1234 | Sales, cost analysis, or stock price |
| Alphanumeric strings | blob or varchar | United States | Tagging, binning, encoding, or grouping data |
| Time | time, timestamp, date | 8/19/2000 | Time-series analysis or year-over-year comparison |

Technologies change and legacy systems will offer opportunities to see data types that may not be common. The best advice when dealing with new data types is to validate the source systems that are created by speaking to an SME (Subject Matter Expert) or system administrator, or to ask for documentation that includes the active version used to persist the data.

In the preceding table, I've created a summary of some common data types. Getting comfortable understanding the differences between data types is important because it determines what type of analysis can be performed on each data value. Numeric data types such as integer (int), floating-point numbers (float), or double are used for mathematical calculations of values such as the sum of sales, count of apples, or the average price of a stock. Ideally, the source system of record should enforce the data type, but there can be and usually are exceptions.
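A small sketch shows why the data type matters for calculations. The sales values below are hypothetical and arrive as text, as they often do when read from a file:

```python
# Hypothetical sales values as they might arrive from a source file: all text.
raw_sales = ["100", "250.50", "75"]

# Joining strings concatenates them instead of adding them.
concatenated = "".join(raw_sales)

# Converting to a numeric type first makes the arithmetic meaningful.
sales = [float(value) for value in raw_sales]
total_sales = sum(sales)
average_sale = total_sales / len(sales)

print(concatenated)            # 100250.5075
print(total_sales)             # 425.5
print(round(average_sale, 2))  # 141.83
```

The same three values produce either a meaningless string or a usable total, depending only on the data type.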

As you evolve your data analysis skills, helping to resolve data type issues or offer suggestions to improve them will make the quality and accuracy of reporting better throughout the organization.

String data types that are defined in the preceding table as characters (char) and alphanumeric strings (varchar or blob) can be represented as text such as a word or full sentence. Time is a special data type that can be represented and stored in multiple ways such as 12 PM EST or a date such as 08/19/2000. Consider geospatial coordinates such as latitude and longitude, which can be stored in multiple data types depending on the source system.
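As an illustration of time being stored in multiple ways, this sketch parses the same calendar date from three text representations using the standard datetime module. The three format strings are assumptions about how a source system might store the value:

```python
from datetime import datetime

# The same calendar date written three ways, each paired with its format.
samples = [
    ("08/19/2000", "%m/%d/%Y"),
    ("2000-08-19", "%Y-%m-%d"),
    ("08/19/00", "%m/%d/%y"),
]

parsed = [datetime.strptime(text, fmt).date() for text, fmt in samples]
all_same = len(set(parsed)) == 1

print(parsed[0])  # 2000-08-19
print(all_same)   # True
```

Once the values are converted to a real date type, they compare as equal even though the original text did not match.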

The goal of this chapter is to introduce you to the concept of data types and future chapters will give direct, hands-on experience of working with them. The reason why data types are important is to avoid incomplete or inaccurate information when presenting facts and insights from analysis. Invalid or inconsistent data types also restrict the ability to create accurate charts or data visualizations. Finally, good data analysis is about having confidence and trust that your conclusions are complete with defined data types that support your analysis.

 

Data classifications and data attributes explained

Now that we understand more about data types and why they are important, let's break down the different classifications of data and understand the different data attribute types. To begin with a visual, let's summarize all of the possible combinations in the following summary diagram:

In the preceding diagram, the boxes directly below data have the three methods to classify data, which are continuous, categorical, or discrete.

Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities. The bottom boxes in this diagram are examples so you can easily find them for reference. Continuous data examples include a stock price, weight in pounds, and time.

Categorical (descriptive) data will have values as a string data type. Categorical data is qualified so it would describe something specific such as a person, place, or thing. Some examples include a country of origin, a month of the year, the different types of trees, and your family designation.

Just because data is defined as categorical, don't assume the values are all alike or consistent. A month can be stored as 1, 2, 3; Jan, Feb, Mar; or January, February, March, or in any combination. You will learn more about how to clean and conform your data for consistent analysis in Chapter 7, Exploring Cleaning, Refining, and Blending Datasets.
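As a small preview of that kind of conforming, here is a minimal sketch that maps the month representations mentioned above to a single month number. The lookup approach and the `month_number` helper are my own illustration, not the method used in Chapter 7:

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Map full names, three-letter abbreviations, and numbers to 1-12.
LOOKUP = {}
for number, name in enumerate(MONTHS, start=1):
    LOOKUP[name.lower()] = number
    LOOKUP[name[:3].lower()] = number
    LOOKUP[str(number)] = number

def month_number(value):
    """Return 1-12 for a recognized month value, or None if it is not recognized."""
    return LOOKUP.get(str(value).strip().lower())

mixed = [1, "Feb", "March", "2", "jan", "Smarch"]
normalized = [month_number(v) for v in mixed]

print(normalized)  # [1, 2, 3, 2, 1, None]
```

Returning None for unrecognized values keeps bad entries visible instead of silently guessing.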

A discrete data type can be either continuous or categorical depending on how it's used for analysis. Examples include the number of employees in a company. You must have an integer/whole number representing the count for each employee, because you can never have partial results such as half an employee. Discrete data is continuous in nature because of its numeric properties but also has limits that make it similar to categorical. Another example would be the numbers on a roulette wheel. There is a limit of whole numbers available on the wheel from 1 to 36, 0, or 00 that a player can bet on, plus the numbers can be categorized as red, black, or green depending on the value.

If only two discrete values exist, such as yes/no or true/false or 1/0, it can also be classified as binary.

Data attributes

Now that we understand how to classify data, let's break down the attribute types available to better understand how you can use them for analysis. The easiest method to break down types is to start with how you plan on using the data values for analysis:

  • Nominal data is defined as data where you can distinguish between different values but not necessarily order them. It is qualitative in nature, so think of nominal data as labels or names, such as stocks or bonds, where math cannot be performed on them because they are string values. With nominal values, you cannot determine whether stocks or bonds are better or worse without additional information.
  • Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative using labels or names but now the values will have a natural or defined sequence. Similar to nominal data, ordinal data can be counted but not calculated with all statistical methods.

An example is assigning 1 = low, 2 = medium, and 3 = high values. This has a natural sequence but the difference between low and high cannot be quantified by itself. The data assigned to low and high values could be arbitrary or have additional business rules behind it.

Another common example of ordinal data is natural hierarchies such as state, county, and city, or grandfather, father, and son. The relationships between these values are well defined and commonly understood without any additional information to support them. So, within that hierarchy, a son will have a father, but the father cannot be the son of his own son.

  • Interval data is like ordinal data, but the distance between data points is uniform. Weight on a scale in pounds is a good example because the differences between the values from 5 to 10, 10 to 15, and 20 to 25 are all the same. Note that not every arithmetic operation can be performed on interval data, so understanding the context of the data and how it should be used becomes important.

Temperature is a good example to demonstrate this paradigm. You can record hourly values and even provide a daily average, but summing the values per day or week would not provide accurate information for analysis. See the following diagram, which provides an hourly temperature for a specific day. Notice the x axis breaks out the hours and the y axis provides the average, which is labeled Avg Temperature, in Fahrenheit. The values between each hour must be an average or mean because an accumulation of temperature would provide misleading results and inaccurate analysis:

  • Ratio data allows for all arithmetic operations including sum, average, median, mode, multiplication, and division. The data types of integer and float discussed earlier are classified as ratio data attributes, which in turn are also numeric/quantitative. Also, time could be classified as ratio data; however, I decided to further break down this attribute because of how often it is used for data analysis.
Note that there are advanced statistical details about ratio data attributes that are not covered in this book, such as having an absolute or true zero, so I encourage you to learn more about the subject.
  • Time data attributes are a rich subject that you will come across regularly during your data analysis journey. Time data covers both date and time or any combination, for example, the time as HH:MM AM/PM, such as 12:03 AM; the year as YYYY, such as 1980; a timestamp represented as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22; or even a date as MM/DD/YY, such as 08/19/00. What's important to recognize when dealing with time data is to identify the intervals between each value so you can accurately measure the difference between them.
It is common during many data analysis projects that you find gaps in the sequence of time data values. For example, you are given a dataset with a range from 08/01/2019 to 08/31/2019, but only 25 distinct date values exist versus the 31 days in that range. There are various reasons for this occurrence, including system outages where log data was lost. How to handle those data gaps will vary depending on the type of analysis you have to perform, including the need to fill in missing results. We will cover some examples in Chapter 7, Exploring Cleaning, Refining, and Blending Datasets.
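Finding such gaps can be sketched with the standard datetime module. The range below matches the August 2019 example, which spans 31 calendar days, while the six outage dates are hypothetical values chosen so that 25 distinct dates remain:

```python
from datetime import date, timedelta

start, end = date(2019, 8, 1), date(2019, 8, 31)

# Every calendar day the dataset is expected to cover.
expected = [start + timedelta(days=n) for n in range((end - start).days + 1)]

# Hypothetical outage: assume no log data arrived on these six days.
outage_days = {date(2019, 8, d) for d in (10, 11, 12, 20, 21, 27)}
observed = [d for d in expected if d not in outage_days]

# The gaps are the expected dates that never appear in the observed data.
missing = sorted(set(expected) - set(observed))

print(len(expected), len(observed), len(missing))  # 31 25 6
print(missing[0])                                  # 2019-08-10
```

Listing the missing dates, rather than only counting them, makes it easier to ask the data owner whether an outage or a business rule explains each gap.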
 

Understanding data literacy

Data literacy is defined by Rahul Bhargava and Catherine D'Ignazio as the ability to read, work with, analyze, and argue with data. Throughout this chapter, I have pointed out how data comes in all shapes and sizes, so creating a common framework to communicate about data between different audiences becomes an important skill to master.

Data literacy becomes a common denominator for answering data questions between two or more people with different skills or experience. For example, if a sales manager wants to verify the data behind a chart in a quarterly report, having them fluent in the language of data will save time. Time is saved by asking direct questions about the data types and data attributes with the engineering team versus searching for those details aimlessly.

Let's break down the concepts of data literacy to help to identify how it can be applied to your personal and professional life.

Reading data

What does it mean to read data? Reading data is consuming information, and that information can be in any format including a chart, a table, code, or the body of an email.

Reading data may not necessarily provide the consumer with all of the answers to their questions. Having domain expertise may be required to understand how, when, and why a dataset was created to allow the consumer to fully interpret the underlying dataset.

For example, you are a data analyst and your colleague sends a file attachment to your email with the subject line as FYI and no additional information in the body of the message. We now know from the What makes a good data analyst? section that we should start asking questions about the file attachment:

  • What methods were used to create the file (human or machine)?
  • What system(s) and workflow were used to create the file?
  • Who created the file and when was it created?
  • How often does this file refresh and is it manual or automated?

Asking these questions helps you to understand the concept of data lineage, which can identify the process of how a dataset was created. This will ensure reading the data will result in understanding all aspects to focus on making decisions from it confidently.

Working with data

I define working with data as the person or system that creates a dataset using any technology. The technologies used to create data vary widely and could be as simple as someone typing rows and columns in a spreadsheet or as involved as a software developer using loops and functions in Python code to create a pipe-delimited file.
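As a minimal sketch of the Python case, the following code uses a loop and the standard csv module to produce pipe-delimited output. The field names and rows are hypothetical, and an in-memory buffer stands in for a real file:

```python
import csv
import io

# Hypothetical rows; the field names are illustrative only.
header = ["product", "customer", "quarter", "sales"]
rows = [
    ["Widget", "Acme", "Q1", 1200],
    ["Gadget", "Globex", "Q1", 950],
]

# The buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerow(header)
for row in rows:
    writer.writerow(row)

output = buffer.getvalue()
print(output)
```

Using the csv module rather than joining strings by hand means values that happen to contain the delimiter are quoted automatically.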

Since writing data varies by expertise and job function, a key takeaway from a data literacy perspective is that the producer of data should be conscious of how it will be consumed. Ideally, the producer should document the details of how, when, and where the data was created to include the frequency of how often it is refreshed. Publishing this information democratizes the metadata (data about the data) to improve the communication between anyone reading and working with the data.

For example, if you have a timestamp field in your dataset, is it using UTC (Coordinated Universal Time) or EST (Eastern Standard Time)? By documenting assumptions and the reasons why the data is stored in a specific format, the person or team working with the data becomes a good data citizen and improves communication for analysis.
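To see why the time zone matters, here is a minimal sketch that reads the same undocumented timestamp under both assumptions. The date and time are made up, and EST is modeled as a fixed UTC-5 offset, which deliberately ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# A timestamp as it might appear in a file, with no time zone documented.
naive = datetime(2020, 1, 15, 9, 0, 0)

utc = timezone.utc
est = timezone(timedelta(hours=-5), "EST")  # fixed offset, no daylight saving

# Attach each assumed time zone to the same wall-clock reading.
as_utc = naive.replace(tzinfo=utc)
as_est = naive.replace(tzinfo=est)

# Convert both to UTC to measure how far apart the two readings are.
gap = as_est.astimezone(utc) - as_utc

print(as_utc.isoformat())  # 2020-01-15T09:00:00+00:00
print(as_est.isoformat())  # 2020-01-15T09:00:00-05:00
print(gap)                 # 5:00:00
```

The two interpretations differ by five hours, which is enough to move an event into a different business day. Storing the offset with the value, as isoformat() does here, removes the guesswork.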

Analyzing data

Analyzing data begins with modeling and structuring it to answer business questions. Data modeling is a vast topic but for data literacy purposes, it can be boiled down to dimensions and measures. Dimensions are distinct nouns such as a person, place, or thing, and measures are verbs based on actions and then aggregated (sum, count, min, max, and average).

The foundation for building any data visualization and charts is rooted in data modeling and most modern tech solutions have it built in so you may be already modeling data without even realizing it.
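To make the distinction between dimensions and measures concrete, the following sketch groups a handful of sales by a dimension and aggregates the amount as a measure. The records and the aggregate function are illustrative assumptions, written with the Python standard library only.

```python
from collections import defaultdict

# (product, customer, amount): two dimensions and one measure per sale.
# Illustrative values only.
sales = [
    ("Product 1", "Customer A", 1000.00),
    ("Product 1", "Customer A", 2000.00),
    ("Product 1", "Customer A", 6000.00),
    ("Product 1", "Customer B", 1000.00),
    ("Product 1", "Customer B", 500.00),
    ("Product 2", "Customer A", 1000.00),
]


def aggregate(rows, dimension_index):
    """Group rows by one dimension and aggregate the amount measure."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension_index]].append(row[2])
    return {
        key: {
            "sum": sum(values),
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "average": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


by_product = aggregate(sales, 0)   # dimension: product
by_customer = aggregate(sales, 1)  # dimension: customer

print(by_product["Product 1"]["sum"])        # 10500.0
print(by_customer["Customer B"]["average"])  # 750.0
```

The same measure gives different answers depending on which dimension you group by, which is why classifying each field before charting it is worth the effort.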

One quick solution to help to classify how the data should be used for analysis would be a data dictionary, which is defined as a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.

You might be able to find a data dictionary in the help pages of source systems or from GitHub repositories. If you don't receive one from the creator of the file, you can create one for yourself and use it to ask questions about the data including assumed data types, data quality, and identifying data gaps.

Creating a data dictionary also helps to validate assumptions and is an aid to frame questions about the data when communicating with others. The easiest method to create a data dictionary would be to transpose the first few rows of the source data so the rows turn into columns. If your data has a header row, then the first row turns into a list of all fields available. Let's walk through an example of how to create your own data dictionary from data. Here, we have a source Sales table representing Product and Customer sales by quarter:

Product

Customer

Quarter 1

Quarter 2

Quarter 3

Quarter 4

Product 1

Customer A

$ 1,000.00

$ 2,000.00

$ 6,000.00

Product 1

Customer B

$ 1,000.00

$ 500.00

Product 2

Customer A

$ 1,000.00

Product 2

Customer C

$ 2,000.00

$ 2,500.00

$ 5,000.00

Product 3

Customer A

$ 1,000.00

$ 2,000.00

Product 4

Customer B

$ 1,000.00

$ 3,000.00

Product 5

Customer A

$ 1,000.00

In the following table, I have transposed the preceding source table to create a new table for analysis, which creates an initial data dictionary. The first column on the left becomes a list of all of the fields available from the source table. The columns labeled Record 1 to Record 3 in the header row now hold the first three rows of the source table as sample data, retaining the integrity of each row. The last two columns on the right, labeled Estimated Data Type and Dimension or Measure, were added to help to define the use of this data for analysis. Understanding the data type and classifying each field as a dimension or measure will help to determine what type of analysis we can perform and how each field can be used in data visualizations:

Field Name | Record 1   | Record 2   | Record 3   | Estimated Data Type | Dimension or Measure
Product    | Product 1  | Product 1  | Product 2  | varchar             | Dimension
Customer   | Customer A | Customer B | Customer A | varchar             | Dimension
Quarter 1  | $ 1,000.00 |            |            | float               | Measure
Quarter 2  | $ 2,000.00 | $ 1,000.00 | $ 1,000.00 | float               | Measure
Quarter 3  | $ 6,000.00 | $ 500.00   |            | float               | Measure
Quarter 4  |            |            |            | float               | Measure

Using this technique can help you to ask the following questions about the data to ensure you understand the results:

  • What year does this dataset represent or is it an accumulation of multiple years?
  • Does each quarter represent a calendar year or fiscal year?
  • Was Product 5 first introduced in Quarter 4, because there are no prior sales for that product by any customer in Quarter 1 to Quarter 3?
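The transposing technique described in this section can be sketched in a few lines of Python. This is one possible approach rather than a definitive one: the pipe-delimited layout, the looks_numeric and estimate_type helpers, and the rule that a numeric field is a measure are all assumptions made for illustration. The three records are the sample records from the data dictionary table.

```python
import csv
import io

# First three records of the Sales table as pipe-delimited text.
# An empty cell means no sales amount was recorded for that quarter.
source = """Product|Customer|Quarter 1|Quarter 2|Quarter 3|Quarter 4
Product 1|Customer A|$ 1,000.00|$ 2,000.00|$ 6,000.00|
Product 1|Customer B||$ 1,000.00|$ 500.00|
Product 2|Customer A||$ 1,000.00||
"""

rows = list(csv.reader(io.StringIO(source), delimiter="|"))
header, records = rows[0], rows[1:]


def looks_numeric(value):
    """Return True if the value parses as a number once currency marks are removed."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


def estimate_type(samples):
    """Guess a data type from the non-empty samples (a rough heuristic)."""
    filled = [s for s in samples if s.strip() != ""]
    if not filled:
        return "unknown"  # nothing to inspect, so leave it for the analyst
    return "float" if all(looks_numeric(s) for s in filled) else "varchar"


# zip(*records) transposes rows into columns: one tuple of samples per field.
data_dictionary = []
for field, samples in zip(header, zip(*records)):
    data_type = estimate_type(samples)
    if data_type == "unknown":
        role = "unknown"
    else:
        role = "Measure" if data_type == "float" else "Dimension"
    data_dictionary.append((field, *samples, data_type, role))

for entry in data_dictionary:
    print(" | ".join(entry))
```

Quarter 4 comes back as unknown because none of the sample records has a value for it. Classifying it as a float measure, as in the table, is a judgment the analyst makes by analogy with the other quarters, and it is exactly the kind of assumption worth confirming with the producer of the file.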

Arguing about the data

Finally, let's talk about how and why we should argue about data. Challenging and defending the numbers in charts or data tables helps to build credibility and is actually done in many cases behind the scenes. For example, most data engineering teams put in various checks and balances such as alerts during ingestion to avoid missing information. Additional checks would also include rules to look into log files for anomalies or errors in the processing of data.
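As a rough illustration of such a check, the sketch below scans a batch of records for missing required values and collects an alert message for each gap. The check_records function, the field names, and the sample batch are hypothetical; a real pipeline would route the alerts to its own logging or monitoring tool.

```python
def check_records(records, required_fields):
    """Return a list of alert messages for missing required values."""
    alerts = []
    for record_number, record in enumerate(records, start=1):
        for field in required_fields:
            value = record.get(field)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or str(value).strip() == "":
                alerts.append(f"record {record_number}: missing {field}")
    return alerts


# Made-up records standing in for a freshly ingested batch.
batch = [
    {"Product": "Product 1", "Customer": "Customer A"},
    {"Product": "Product 2", "Customer": ""},
    {"Customer": "Customer C"},
]

alerts = check_records(batch, ["Product", "Customer"])
for alert in alerts:
    print(alert)
# record 2: missing Customer
# record 3: missing Product
```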

From a consumer's perspective, trust and verify is a good approach. For example, when looking at a chart published in a credible news article, you can assume the data behind the story is accurate, but you should also verify the accuracy of the source data. The first thing to ask would be: does the chart cite a publicly available source for its underlying dataset? The website fivethirtyeight.com is really good at providing access to the raw data and details of the methodologies used to create the analysis and charts found in its news stories. Exposing the underlying dataset and the process used to collect it to the public opens up conversations about the how, what, and why behind the data and is a good method to disprove misinformation.

As a data analyst and creator of data outputs, you should welcome the chance to defend your work with open arms. Having documentation such as a data dictionary and a GitHub repository, and documenting the methodology used to produce the data, will build trust with the audience and reduce the time it takes them to make data-driven decisions.

Hopefully, you now see the importance of data literacy and how it can be used to improve all aspects of communication of data between consumers and producers. With any language, practice will lead to improvement, so I invite you to explore some useful free datasets to improve your data literacy.

Here are a few to get started:

Let's begin with the Kaggle site, which was created to help companies host data science competitions to solve complex problems using data. Improve your skills in reading and working with data by exploring these datasets and walking through the concepts learned in this chapter, such as identifying the data type for each field and confirming a data dictionary exists.

Next is the supporting data from FiveThirtyEight, which is a data journalism site providing analytic content from sports to politics. What I like about their process is the offer of transparency behind the news stories published by exposing open GitHub links to their source data and discussions about their methodology behind the data.

Another important source of open data is The World Bank, which offers a plethora of options to consume or produce data across the world to help improve life through data. Most of the datasets are licensed under a Creative Commons license, which governs the terms of how and when the data can be used, and making them freely available opens up opportunities to blend public and private data together with significant time savings.

 

Summary

Let's look back at what we learned in this chapter and the skills obtained before we move forward. First, we covered a brief history of data analysis and the technological evolution of data by paying homage to the people and milestone events that made working with data possible using modern tools and techniques. We walked through an example of how to summarize these events using a data visual trend chart that showed how recent technology innovations have transformed the data industry.

We focused on why data has become important to make decisions from both a consumer and producer perspective by discussing the concepts for identifying and classifying data using structured, semi-structured, and unstructured examples and the 3Vs of big data: Volume, Velocity, and Variety.

We answered the question of what makes a good data analyst using the techniques of KYD, VOC, and ABA.

Then, we went deeper into understanding data types by walking through the differences between numbers (integer and float) versus strings (text, time, dates, and coordinates). This included breaking down data classifications (continuous, categorical, and discrete) and understanding data attribute types.

We wrapped up this chapter by introducing the concept of data literacy and its importance to the consumers and producers of data by improving communication between them.

In our next chapter, we will get more hands-on by installing and setting up an environment for data analysis and begin the journey of applying the concepts learned about data.

 

About the Author

  • Marc Wintjen

    Marc Wintjen is a Risk Analytics Architect at Bloomberg LP with over 20 years of professional experience. An evangelist for data literacy, he's known as the Data Mensch for helping others make data-driven decisions. His passion for all things data has evolved from SQL and Data Warehousing to Big Data Analytics and Data Visualizations.
