Welcome and thank you for reading my book. I'm excited to share my passion for data and I hope to provide the resources and insights to fast-track your journey into data analysis. My goal is to educate, mentor, and coach you throughout this book on the techniques used to become a top-notch data analyst. During this process, you will get hands-on experience using the latest open source technologies available such as Jupyter Notebook and Python. We will stay within that technology ecosystem throughout this book to avoid confusion. However, you can be confident the concepts and skills learned are transferable across open source and vendor solutions with a focus on all things data.
In this chapter, we will cover the following:
- The evolution of data analysis and why it is important
- What makes a good data analyst?
- Understanding data types and why they are important
- Data classifications and data attributes explained
- Understanding data literacy
The evolution of data analysis and why it is important
To begin, we should define what data is. You will find varying definitions, but I would define data as the digital persistence of facts, knowledge, and information consolidated for reference or analysis. The focus of my definition should be the word persistence because digital facts remain even after the computers used to create them are powered down, and they are retrievable for future use. Rather than focus on the formal definition, let's discuss the world of data and how it impacts our daily lives. Whether you are reading a review to decide which product to buy or viewing the price of a stock, consuming information has become significantly easier, allowing you to make informed, data-driven decisions.
Data has become entangled with products and services across every industry, from farming to smartphones. For example, America's Grow-a-Row, a New Jersey farm-to-food-bank charity, donates over 1.5 million pounds of fresh produce each year to feed people in need throughout the region, according to its annual report. America's Grow-a-Row has thousands of volunteers and uses data to maximize production yields during the harvest season.
As the demand to consume data has increased, so has the supply side: the producers of data. Data production has increased in scale as technology has evolved. I'll discuss this in more detail shortly, but this large-scale consumption and production can be summarized as big data. A National Institute of Standards and Technology report defined big data as consisting of extensive datasets—primarily in the characteristics of volume, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis.
This explosion of big data is characterized by the 3Vs, which are Volume, Velocity, and Variety, and has become a widely accepted concept among data professionals:
- Volume is based on the quantity of data that is stored in any format such as image files, movies, and database transactions, which are measured in gigabytes, terabytes, or even zettabytes. To give context, you can store hundreds of thousands of songs or pictures on one terabyte of storage space (a quick back-of-the-envelope check follows this list). Even more amazing than the figures is how little it costs you. Google Drive, for example, offers gigabytes of storage for free and supports individual files of up to 5 TB (terabytes) in size, according to their support site.
- Velocity is the speed at which data is generated. This covers how data is both produced and consumed. For example, in batch processing, blocks of records or bundles of files are sent and received between systems. Modern approaches favor real-time streams of data, where the data flow is in a constant state of movement.
- Variety is all of the different formats that data can be stored in, including text, image, database tables, and files. This variety has created both challenges and opportunities for analysis because of the different technologies and techniques required to work with the data.
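To put the Volume numbers in perspective, here is a quick back-of-the-envelope check in Python. The average song file size of roughly 4 MB is an assumption for illustration:

```python
# Back-of-the-envelope check: how many ~4 MB songs fit on 1 TB?
# The 4 MB average file size is an assumption for illustration.
terabyte_in_mb = 1_000_000   # 1 TB = 1,000,000 MB (decimal units)
avg_song_size_mb = 4

songs_per_terabyte = terabyte_in_mb // avg_song_size_mb
print(f"Approximate songs per terabyte: {songs_per_terabyte:,}")  # 250,000
```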
Understanding the 3Vs is important for data analysis because you must become good at being both a consumer and producer of data. The simple questions of how your data is stored, when a file was produced, where a database table is located, and in what format you should store the output of your analysis can all be addressed by understanding the 3Vs.
This leads us to a formal definition of data analysis, which is defined as a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making, as stated in Review of business intelligence through data analysis.
What I like about this definition is the focus on solving problems using data without the focus on which technologies are used. To make this possible there have been some significant technological milestones, the introduction of new concepts, and people who have broken down the barriers.
To showcase the evolution of data analysis, I compiled a few tables of the key events from 1945 to 2018 that I feel are the most influential. The following table comprises innovators and milestones ranging from Dr. E.F. Codd, who created the relational model that underpins modern databases, to the launch of the iPhone, which spawned the mobile analytics industry.
The data behind the following diagram was collected from multiple sources, centralized in one place as a table of columns and rows, and then visualized using a dendrogram chart. I posted the CSV file in the GitHub repository for reference: https://github.com/PacktPublishing/python-data-analysis-beginners-guide. Organizing and conforming the data in one place made the data visualization easier to produce and enables further analysis:
That process of collecting, formatting, and storing data in this readable format demonstrates the first step of becoming a producer of data. To make this information easier to consume, I summarize these events by decades in the following table:
| Decade | Count of Milestones |
| --- | --- |
| 1940s | 2 |
| 1950s | 2 |
| 1960s | 1 |
| 1970s | 2 |
| 1980s | 5 |
| 1990s | 9 |
| 2000s | 14 |
| 2010s | 7 |
From the preceding summary table, you can see that the majority of these milestone events occurred in the 1990s and 2000s. What is insightful about this analysis is that recent innovations have removed the barriers to entry for individuals working with data. Before the 1990s, the high purchasing costs of hardware and software restricted the field of data analysis to a relatively limited number of careers, and the costs associated with accessing the underlying data were also great. Working with data typically required higher education and a specialized career such as software programming or actuarial science.
A visual way to look at this same data would be a trend bar chart, as shown in the following diagram. In this example, the height of the bars represents the same information as in the preceding table, with the Count of Milestone events on the left, or the y axis. What is nice about this visual representation is that it is a faster way for the consumer to see the upward pattern of where most events occur, without scanning through the results found in the preceding table:
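If you would like to reproduce the summary and chart yourself, the following is a minimal pandas/matplotlib sketch. The CSV filename and the Year column name are assumptions for illustration; adjust them to match the file in the GitHub repository:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the milestone events (file and column names are assumptions;
# adjust them to match the CSV in the book's GitHub repository).
events = pd.read_csv("data_analysis_milestones.csv")

# Derive the decade from the year, for example 1972 -> "1970s".
events["Decade"] = (events["Year"] // 10 * 10).astype(str) + "s"

# Count milestones per decade, mirroring the preceding summary table.
summary = events.groupby("Decade").size().rename("Count of Milestones")
print(summary)

# Visualize the same counts as a trend bar chart.
summary.plot(kind="bar", ylabel="Count of Milestone events", rot=0)
plt.show()
```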
The evolution of data analysis is important to understand because now you know some of the pioneers who opened doors for opportunities and careers working with data, along with the key technology breakthroughs that have significantly reduced the time it takes to make decisions with data, both as consumers and producers.
What makes a good data analyst?
I will now break down the contributing factors that make a good data analyst. From my experience, a good data analyst must be eager to learn and must continue to ask questions throughout the process of working with data. The focus of those questions will vary based on the audience consuming the results. To be an expert in the field of data analysis, excellent communication skills are required so you can translate raw data into insights that impact change in a positive way. To make this easier to remember, use the following acronyms to help improve your data analyst skills.
Know Your Data (KYD)
Knowing your data is all about understanding the source technology that was used to create the data along with the business requirements and rules used to store it. Do research ahead of time to understand what the business is all about and how the data is used. For example, if you are working with a sales team, learn what drives their team's success. Do they have daily, monthly, or quarterly sales quotas? Do they do reporting for month-end/quarter-end that goes to senior management and has to be accurate because it has financial impacts on the company? Learning more about the source data by asking questions about how it will be consumed will help focus your analysis when you have to deliver results.
KYD is also about data lineage, which is understanding how the data was originally sourced including the technologies used along with the transformations that occurred before, during, and afterward. Refer back to the 3Vs so you can effectively communicate the responses from common questions about the data such as where this data is sourced from or who is responsible for maintaining the data source.
Voice of the Customer (VOC)
The concept of VOC is nothing new and has been taught at universities for years as a well-known concept applied in sales, marketing, and many other business operations. VOC is about understanding customer needs by learning from or listening to customers before, during, and after they use a company's product or service. This concept remains relevant today and should be applied to every data project that you participate in. This is the stage where you should interview the consumers of the data analysis results before even looking at the data. If you are working with business users, listen to their needs and write down the specific business questions they are trying to answer.
Schedule a working session with them where you can engage in a dialog. Make sure you focus on their current pain points such as the time to curate all of the data used to make decisions. Does it take three days to complete the process every month? If you can deliver an automated data product or a dashboard that can reduce that time down to a few mouse clicks, your data analysis skills will make you look like a hero to your business users.
Always Be Agile (ABA)
The agile methodology has become commonplace in the industry for the Software Development Life Cycle (SDLC) of application, web, and mobile development. One of the reasons the agile project management process succeeds is that it creates an interactive communication line between the business and technical teams to iteratively deliver business value through the use of data and usable features.
The agile process involves creating stories with a common theme where a development team completes tasks in 2-3 week sprints. In that process, it is important to understand the what and the why for each story including the business value/the problem you are trying to solve.
The agile approach has ceremonies where the developers and business sponsors come together to capture requirements and then deliver incremental value. That improvement in value could be anything from a new dataset available for access to a new feature added to an app.
See the following diagram for a nice visual representation of these concepts. Notice how these concepts are not linear and should require multiple iterations, which help to improve the communication between all people involved in the data analysis before, during, and after delivery of results:
Finally, I believe the most important trait of a good data analyst is a passion for working with data. If your passion can be fueled by continuously learning about all things data, it becomes a lifelong and fulfilling journey.
Understanding data types and their significance
As we have uncovered with the 3Vs, data comes in all shapes and sizes, so let's break down some key data types and better understand why they are important. To begin, let's classify data in general terms of unstructured, semi-structured, and structured.
Unstructured data
The concept behind unstructured data, which is textual in nature, has been around since the 1990s and includes the following examples: the body of an email message, tweets, books, health records, and images. A simple example of unstructured data would be an email message body that is classified as free text. Free text may have some obvious structure that a human can identify such as free space to break up paragraphs, dates, and phone numbers, but having a computer identify those elements would require programming to classify any data elements as such. What makes free text challenging for data analysis is its inconsistent nature, especially when trying to work with multiple examples.
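To see why free text requires programming before a computer can identify those elements, here is a minimal sketch using Python's re module. The patterns are simplified assumptions for illustration and would need hardening for real-world text:

```python
import re

# A free-text email body: readable to a human, but a computer needs
# explicit patterns to pick out the dates and phone numbers.
email_body = """Hi team,
Let's meet on 08/19/2000 to review the results.
You can reach me at 212-555-1212 with questions."""

# Simplified patterns for illustration; real-world text needs more care.
date_pattern = r"\d{2}/\d{2}/\d{4}"
phone_pattern = r"\d{3}-\d{3}-\d{4}"

print(re.findall(date_pattern, email_body))   # ['08/19/2000']
print(re.findall(phone_pattern, email_body))  # ['212-555-1212']
```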
Semi-structured data
Next, we have semi-structured data, which is similar to unstructured, however, the key difference is the addition of tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:
```json
{
  "First_Name": "John",
  "Last_Name": "Doe",
  "Age": 42,
  "Home_Address": {
    "Address_1": "123 Main Street",
    "Address_2": [],
    "City": "New York",
    "State": "NY",
    "Zip_Code": "10021"
  },
  "Phone_Number": [
    {
      "Type": "cell",
      "Number": "212-555-1212"
    },
    {
      "Type": "home",
      "Number": "212 555-4567"
    }
  ],
  "Children": [],
  "Spouse": "yes"
}
```
This JSON-formatted code allows for free-text elements such as a street address, a phone number, and an age, but now has tags created to identify those fields and values, a concept called key-value pairs. Key-value pairs allow for the classification of data with enough structure for analysis, such as filtering, while keeping the flexibility to change the elements as necessary to support the unstructured/free-text content. The biggest advantage of semi-structured data is the flexibility to change the underlying schema of how the data is stored. The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).
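As a quick illustration of key-value pairs in practice, the following sketch parses a trimmed-down version of the preceding record with Python's built-in json module and filters on a tag:

```python
import json

# A trimmed-down version of the semi-structured record above.
record_json = """
{"First_Name": "John", "Last_Name": "Doe", "Age": 42,
 "Phone_Number": [{"Type": "cell", "Number": "212-555-1212"},
                  {"Type": "home", "Number": "212 555-4567"}]}
"""

record = json.loads(record_json)  # parse the string into a Python dict

# The tags (keys) give us structure to filter on, unlike free text.
print(record["First_Name"], record["Last_Name"], record["Age"])
cell_numbers = [p["Number"] for p in record["Phone_Number"] if p["Type"] == "cell"]
print(cell_numbers)  # ['212-555-1212']
```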
The disadvantage to semi-structured data is that you may still find inconsistencies with data values depending on how the data was captured. Ideally, the burden on consistency is moved to the User Interface (UI), which would have coded standards and business rules such as required fields to increase the quality but, as a data analyst who practices KYD, you should validate that during the project.
Structured data
Finally, we have structured data, which is the most common type found in databases and in data created by applications (apps or software) and code. The biggest benefit of structured data is the consistency and relatively high quality across records, especially when they are stored in the same database table. This conformity of data and structure is the foundation for analysis, which allows both the producers and consumers of structured data to arrive at the same results. The topic of databases, or Database Management Systems (DBMS) and Relational Database Management Systems (RDBMS), is vast and will not be covered here, but having some understanding will help you to become a better data analyst.
The following diagram is a basic Entity-Relationship (ER) diagram of three tables that would be found in a database:

In this example, each entity would represent physical tables stored in the database, named car, part, and car_part_bridge. The relationship between the car and part is defined by the table called car_part_bridge, which can be classified by multiple names such as bridge, junction, mapping, or link table. The name of each field in the table would be on the left such as part_id, name, or description found in the part table.
The pk label next to the car_id and part_id field names helps to identify the primary key for each table. A primary key allows one field to uniquely identify each record found in the table. If a primary key from one table exists in another table, it is called a foreign key, which is the foundation of how the relationship between the tables is defined and ultimately how they are joined together.
Finally, the text aligned on the right side next to each field name, labeled int or text, is the data type for that field. We will cover that concept next, and you should now feel comfortable with the concepts for identifying and classifying data.
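To make the primary and foreign key concepts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and field names follow the preceding diagram; the car table's extra columns and all data values are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Primary keys uniquely identify each record; the bridge table holds
# foreign keys that define the relationship between car and part.
cur.executescript("""
CREATE TABLE car  (car_id  INTEGER PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE part (part_id INTEGER PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE car_part_bridge (
    car_id  INTEGER REFERENCES car(car_id),
    part_id INTEGER REFERENCES part(part_id)
);
INSERT INTO car  VALUES (1, 'Sedan',  'Four-door sedan');
INSERT INTO part VALUES (1, 'Engine', 'V6 engine');
INSERT INTO car_part_bridge VALUES (1, 1);
""")

# Join the tables through the bridge to relate cars to their parts.
for row in cur.execute("""
    SELECT car.name, part.name
    FROM car
    JOIN car_part_bridge USING (car_id)
    JOIN part USING (part_id)
"""):
    print(row)  # ('Sedan', 'Engine')
```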
Common data types
Data types are a well-known concept in programming languages and are found in many different technologies. I have simplified the definition as: the details of the data that is stored and its intended usage. A data type also creates consistency for each data value as it's stored on disk or in memory.
Data types will vary depending on the software and/or database used to create the structure. Hence, we won't be covering all the different types across all of the different coding languages but let's walk through a few examples:
| Common data type | Common short name | Sample value | Example usage |
| --- | --- | --- | --- |
| Integers | int | 1235 | Counting occurrences, summing values, or the average of values such as sum(hits) |
| Booleans | bit | TRUE | Conditional testing such as if sales > 1,000, true else false |
| Geospatial | float or spatial | 40.229290, -74.936707 | Geo analytics based on latitude and longitude |
| Characters/string | char | A | Tagging, binning, or grouping data |
| Floating-point numbers | float or double | 2.1234 | Sales, cost analysis, or stock price |
| Alphanumeric strings | blob or varchar | United States | Tagging, binning, encoding, or grouping data |
| Time | time, timestamp, date | 8/19/2000 | Time-series analysis or year-over-year comparison |
In the preceding table, I've created a summary of some common data types. Getting comfortable with the differences between data types is important because they determine what type of analysis can be performed on each data value. Numeric data types such as integer (int), floating-point numbers (float), or double are used for mathematical calculations such as the sum of sales, the count of apples, or the average price of a stock. Ideally, the source system of record should enforce the data type, but there can be, and usually are, exceptions.
String data types, defined in the preceding table as characters (char) and alphanumeric strings (varchar or blob), can represent text such as a word or a full sentence. Time is a special data type that can be represented and stored in multiple ways, such as a time like 12 PM EST or a date like 08/19/2000. Consider also geospatial coordinates such as latitude and longitude, which can be stored in multiple data types depending on the source system.
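As a quick illustration, here is how a few of these data types look in Python; keep in mind that the exact type names vary across databases and programming languages:

```python
from datetime import date

# Sample values echoing the preceding table, mapped to Python types.
samples = {
    "integer": 1235,                        # counting and summing
    "boolean": True,                        # conditional testing
    "float": 2.1234,                        # prices and measurements
    "string": "United States",              # tagging and grouping
    "date": date(2000, 8, 19),              # time-series analysis
    "geospatial": (40.229290, -74.936707),  # latitude/longitude pair
}

for name, value in samples.items():
    print(f"{name:>10}: {value!r} -> {type(value).__name__}")
```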
The goal of this chapter is to introduce you to the concept of data types; future chapters will give you direct, hands-on experience of working with them. Data types are important because they help you to avoid presenting incomplete or inaccurate facts and insights from your analysis. Invalid or inconsistent data types also restrict your ability to create accurate charts and data visualizations. Finally, good data analysis is about having confidence and trust that your conclusions are complete and supported by well-defined data types.
Data classifications and data attributes explained
Now that we understand more about data types and why they are important, let's break down the different classifications of data and understand the different data attribute types. To begin with a visual, let's summarize all of the possible combinations in the following summary diagram:
In the preceding diagram, the boxes directly below data have the three methods to classify data, which are continuous, categorical, or discrete.
Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities. The bottom boxes in this diagram are examples so you can easily find them for reference. Continuous data examples include a stock price, weight in pounds, and time.
Categorical (descriptive) data will have values stored as a string data type. Categorical data is qualified, so it describes something specific such as a person, place, or thing. Some examples include a country of origin, a month of the year, the different types of trees, and your family designation.
A discrete data type can be either continuous or categorical depending on how it's used for analysis. Examples include the number of employees in a company. You must have an integer/whole number representing the count for each employee, because you can never have partial results such as half an employee. Discrete data is continuous in nature because of its numeric properties but also has limits that make it similar to categorical. Another example would be the numbers on a roulette wheel. There is a limit of whole numbers available on the wheel from 1 to 36, 0, or 00 that a player can bet on, plus the numbers can be categorized as red, black, or green depending on the value.
Data attributes
Now that we understand how to classify data, let's break down the attribute types available to better understand how you can use them for analysis. The easiest method to break down types is to start with how you plan on using the data values for analysis:
- Nominal data is defined as data where you can distinguish between different values but not necessarily order them. It is qualitative in nature, so think of nominal data as labels or names, such as stocks or bonds, where math cannot be performed because the values are strings. With nominal values, you cannot determine whether the word stocks or bonds is better or worse without additional information.
- Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative using labels or names but now the values will have a natural or defined sequence. Similar to nominal data, ordinal data can be counted but not calculated with all statistical methods.
An example is assigning the values 1 = low, 2 = medium, and 3 = high. This has a natural sequence, but the difference between low and high cannot be quantified by itself; the data assigned to the low and high values could be arbitrary or have additional business rules behind it (see the pandas sketch after this list).
Another common example of ordinal data is natural hierarchies such as state, county, and city, or grandfather, father, and son. The relationships between these values are well defined and commonly understood without any additional information to support them. So, in this hierarchy, a son has a father, but a father cannot be a son.
- Interval data is like ordinal data, but the distance between data points is uniform. Weight on a scale in pounds is a good example because the difference between the values from 5 to 10, 10 to 15, and 20 to 25 are all the same. Note that not every arithmetic operation can be performed on interval data so understanding the context of the data and how it should be used becomes important.
Temperature is a good example to demonstrate this paradigm. You can record hourly values and even provide a daily average, but summing the values per day or week would not provide accurate information for analysis. See the following diagram, which provides an hourly temperature for a specific day. Notice the x axis breaks out the hours and the y axis provides the average, which is labeled Avg Temperature, in Fahrenheit. The values between each hour must be an average or mean because an accumulation of temperature would provide misleading results and inaccurate analysis:
- Ratio data allows for all arithmetic operations, including sum, average, median, mode, multiplication, and division. The integer and float data types discussed earlier are classified as ratio data attributes, which in turn are also numeric/quantitative. Time could also be classified as ratio data; however, I decided to break down this attribute further because of how often it is used in data analysis.
- Time data attributes are a rich subject that you will come across regularly during your data analysis journey. Time data covers dates, times, or any combination: for example, the time as HH:MM AM/PM, such as 12:03 AM; the year as YYYY, such as 1980; a timestamp represented as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22; or even a date as MM/DD/YY, such as 08/19/00. What's important when dealing with time data is to identify the intervals between values so you can accurately measure the differences between them.
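Here is the pandas sketch mentioned under ordinal data, showing how an ordered categorical captures the low/medium/high sequence; the sample values are made up for illustration:

```python
import pandas as pd

# The low/medium/high labels from the ordinal example above.
ratings = pd.Series(["low", "high", "medium", "low", "high"])

# An ordered categorical captures the natural sequence of the labels,
# so comparisons and sorting work, even though the distance between
# values is still undefined.
ordered = ratings.astype(
    pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
)

print(ordered.sort_values().tolist())  # ['low', 'low', 'medium', 'high', 'high']
print((ordered > "low").tolist())      # [False, True, True, False, True]
```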
Understanding data literacy
Data literacy is defined by Rahul Bhargava and Catherine D'Ignazio as the ability to read, work with, analyze, and argue with data. Throughout this chapter, I have pointed out how data comes in all shapes and sizes, so creating a common framework for communicating about data between different audiences becomes an important skill to master.
Data literacy becomes a common denominator for answering data questions between two or more people with different skills or experience. For example, if a sales manager wants to verify the data behind a chart in a quarterly report, having them fluent in the language of data will save time. Time is saved by asking direct questions about the data types and data attributes with the engineering team versus searching for those details aimlessly.
Let's break down the concepts of data literacy to help to identify how it can be applied to your personal and professional life.
Reading data
What does it mean to read data? Reading data is consuming information, and that information can be in any format including a chart, a table, code, or the body of an email.
Reading data may not necessarily provide the consumer with all of the answers to their questions. Having domain expertise may be required to understand how, when, and why a dataset was created to allow the consumer to fully interpret the underlying dataset.
For example, you are a data analyst and your colleague sends a file attachment to your email with the subject line as FYI and no additional information in the body of the message. We now know from the What makes a good data analyst? section that we should start asking questions about the file attachment:
- What methods were used to create the file (human or machine)?
- What system(s) and workflow were used to create the file?
- Who created the file and when was it created?
- How often does this file refresh and is it manual or automated?
Asking these questions helps you to understand the concept of data lineage, which identifies the process by which a dataset was created. This ensures that reading the data results in understanding all of its aspects, so you can confidently make decisions from it.
Working with data
I define working with data as a person or system creating a dataset using any technology. The technologies used to create data vary widely, from someone typing rows and columns into a spreadsheet to a software developer using loops and functions in Python code to create a pipe-delimited file.
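As a minimal sketch of that second case, the following Python code uses a loop and the built-in csv module to write a pipe-delimited file; the filename and data values are assumptions for illustration:

```python
import csv

# A few made-up rows of data to write out.
rows = [
    ["product", "quarter", "sales"],
    ["Product 1", "Q1", 1000],
    ["Product 2", "Q1", 2000],
]

# The csv module handles the delimiter and quoting for us.
with open("sales.psv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    for row in rows:            # loop over each record
        writer.writerow(row)    # write it as a pipe-delimited line
```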
Since writing data varies by expertise and job function, a key takeaway from a data literacy perspective is that the producer of data should be conscious of how it will be consumed. Ideally, the producer should document the details of how, when, and where the data was created, including how often it is refreshed. Publishing this information democratizes the metadata (data about the data) and improves communication between anyone reading and working with the data.
For example, if you have a timestamp field in your dataset, is it using UTC (Coordinated Universal Time) or EST (Eastern Standard Time)? By including such assumptions and the reasons why the data is stored in a specific format, the people working with the data become good data citizens who improve communication for analysis.
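As a quick illustration of why this matters, the following sketch contrasts a naive timestamp with a timezone-aware UTC timestamp using Python's datetime module:

```python
from datetime import datetime, timezone

# A naive timestamp carries no timezone; consumers must guess.
naive = datetime(2000, 8, 19, 14, 32, 22)
print(naive.isoformat())   # 2000-08-19T14:32:22

# A timezone-aware UTC timestamp makes the assumption explicit.
aware = datetime(2000, 8, 19, 14, 32, 22, tzinfo=timezone.utc)
print(aware.isoformat())   # 2000-08-19T14:32:22+00:00
```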
Analyzing data
Analyzing data begins with modeling and structuring it to answer business questions. Data modeling is a vast topic but for data literacy purposes, it can be boiled down to dimensions and measures. Dimensions are distinct nouns such as a person, place, or thing, and measures are verbs based on actions and then aggregated (sum, count, min, max, and average).
The foundation for building any data visualization or chart is rooted in data modeling, and most modern tech solutions have it built in, so you may already be modeling data without even realizing it.
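As a minimal sketch of dimensions and measures in pandas, the following groups a made-up sales measure by a product dimension and aggregates it:

```python
import pandas as pd

# Product is a dimension (a noun we group by); sales is a measure
# (a value we aggregate). Column names are made up for illustration.
sales = pd.DataFrame({
    "product": ["Product 1", "Product 1", "Product 2"],
    "sales":   [1000, 2000, 1000],
})

# Aggregate the measure by the dimension.
print(sales.groupby("product")["sales"].agg(["sum", "count", "mean"]))
```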
One quick solution to help to classify how the data should be used for analysis would be a data dictionary, which is defined as a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.
You might be able to find a data dictionary in the help pages of source systems or from GitHub repositories. If you don't receive one from the creator of the file, you can create one for yourself and use it to ask questions about the data including assumed data types, data quality, and identifying data gaps.
Creating a data dictionary also helps to validate assumptions and is an aid to framing questions about the data when communicating with others. The easiest method to create a data dictionary is to transpose the first few rows of the source data so the rows turn into columns. If your data has a header row, then that row turns into a list of all of the fields available. Let's walk through an example of how to create your own data dictionary. Here, we have a source Sales table representing Product and Customer sales by quarter:
| Product | Customer | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
| --- | --- | --- | --- | --- | --- |
| Product 1 | Customer A | $ 1,000.00 | $ 2,000.00 | $ 6,000.00 | |
| Product 1 | Customer B | $ 1,000.00 | $ 500.00 | | |
| Product 2 | Customer A | $ 1,000.00 | | | |
| Product 2 | Customer C | $ 2,000.00 | $ 2,500.00 | $ 5,000.00 | |
| Product 3 | Customer A | $ 1,000.00 | $ 2,000.00 | | |
| Product 4 | Customer B | $ 1,000.00 | $ 3,000.00 | | |
| Product 5 | Customer A | | | | $ 1,000.00 |
In the following table, I have transposed the preceding source table to create a new table for analysis, which serves as an initial data dictionary. The first column on the left becomes a list of all of the fields available from the source table. Record 1 to Record 3 in the header row are sample records that retain the integrity of each row from the source table. The last two columns on the right, labeled Estimated Data Type and Dimension or Measure, were added to help to define how this data can be used for analysis. Understanding the data type and classifying each field as a dimension or measure will help to determine what type of analysis we can perform and how each field can be used in data visualizations:
| Field Name | Record 1 | Record 2 | Record 3 | Estimated Data Type | Dimension or Measure |
| --- | --- | --- | --- | --- | --- |
| Product | Product 1 | Product 1 | Product 2 | varchar | Dimension |
| Customer | Customer A | Customer B | Customer A | varchar | Dimension |
| Quarter 1 | $ 1,000.00 | $ 1,000.00 | $ 1,000.00 | float | Measure |
| Quarter 2 | $ 2,000.00 | $ 500.00 | | float | Measure |
| Quarter 3 | $ 6,000.00 | | | float | Measure |
| Quarter 4 | | | | float | Measure |
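If the source table were loaded into a pandas DataFrame, the transpose technique is a one-liner, as this minimal sketch shows (values copied from the preceding tables):

```python
import pandas as pd

# A slice of the source Sales table (values copied from above).
sales = pd.DataFrame({
    "Product":   ["Product 1", "Product 1", "Product 2"],
    "Customer":  ["Customer A", "Customer B", "Customer A"],
    "Quarter 1": [1000.00, 1000.00, 1000.00],
    "Quarter 2": [2000.00, 500.00, None],
    "Quarter 3": [6000.00, None, None],
    "Quarter 4": [None, None, None],
})

# Transpose the first few rows: fields become rows, records become columns.
data_dictionary = sales.head(3).T
data_dictionary.columns = [f"Record {i + 1}" for i in range(3)]
print(data_dictionary)
```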
Using this technique can help you to ask the following questions about the data to ensure you understand the results:
- What year does this dataset represent or is it an accumulation of multiple years?
- Does each quarter represent a calendar year or fiscal year?
- Was Product 5 first introduced in Quarter 4, because there are no prior sales for that product by any customer in Quarter 1 to Quarter 3?
Arguing about the data
Finally, let's talk about how and why we should argue about data. Challenging and defending the numbers in charts or data tables helps to build credibility and is actually done in many cases behind the scenes. For example, most data engineering teams put in various checks and balances such as alerts during ingestion to avoid missing information. Additional checks would also include rules to look into log files for anomalies or errors in the processing of data.
From a consumer's perspective, trust and verify is a good approach. For example, when looking at a chart published in a credible news article, you can assume the data behind the story is accurate, but you should also verify the accuracy of the source data. The first thing to ask would be: does the chart include a link to a publicly available source dataset? The website fivethirtyeight.com is really good at providing access to the raw data and the details of the methodologies used to create the analysis and charts found in its news stories. Exposing the underlying dataset, and the process used to collect it, to the public opens up conversations about the how, what, and why behind the data and is a good method for disproving misinformation.
As a data analyst and creator of data outputs, you should welcome the need to defend your work with open arms. Having documentation such as a data dictionary and a GitHub repository, and documenting the methodology used to produce the data, will build trust with your audience and reduce the time it takes for them to make data-driven decisions.
Hopefully, you now see the importance of data literacy and how it can be used to improve all aspects of communication of data between consumers and producers. With any language, practice will lead to improvement, so I invite you to explore some useful free datasets to improve your data literacy.
Here are a few to get started:
- Kaggle: https://www.kaggle.com/datasets
- FiveThirtyEight: https://data.fivethirtyeight.com/
- The World Bank: https://data.worldbank.org/
Let's begin with the Kaggle site, which was created to help companies host data science competitions to solve complex problems using data. Improve your reading and working-with-data literacy skills by exploring these datasets and applying the concepts learned in this chapter, such as identifying the data type for each field and confirming whether a data dictionary exists.
Next is the supporting data from FiveThirtyEight, which is a data journalism site providing analytic content from sports to politics. What I like about their process is the transparency behind the published news stories: they expose open GitHub links to their source data and discuss the methodology behind it.
Another important source of open data is The World Bank, which offers a plethora of options to consume or produce data across the world to help improve lives through data. Most of the datasets are licensed under a Creative Commons license, which governs the terms of how and when the data can be used; making them freely available opens up opportunities to blend public and private data together with significant time savings.
Summary
Let's look back at what we learned in this chapter and the skills obtained before we move forward. First, we covered a brief history of data analysis and the technological evolution of data by paying homage to the people and milestone events that made working with data possible using modern tools and techniques. We walked through an example of how to summarize these events using a data visual trend chart that showed how recent technology innovations have transformed the data industry.
We focused on why data has become important to make decisions from both a consumer and producer perspective by discussing the concepts for identifying and classifying data using structured, semi-structured, and unstructured examples and the 3Vs of big data: Volume, Velocity, and Variety.
We answered the question of what makes a good data analyst using the techniques of KYD, VOC, and ABA.
Then, we went deeper into understanding data types by walking through the differences between numbers (integer and float) and strings (text, time, dates, and coordinates). This included breaking down data classifications (continuous, categorical, and discrete) and understanding data attribute types.
We wrapped up this chapter by introducing the concept of data literacy and its importance to the consumers and producers of data in improving communication between them.
In our next chapter, we will get more hands-on by installing and setting up an environment for data analysis, and so begin the journey of applying the concepts we have learned about data.
Further reading
Here are some links that you can refer to for gathering more information about the following topics:
- America's Grow-a-Row: https://www.americasgrowarow.org/wp-content/uploads/2019/09/AGAR-2018-Annual-Report.pdf
- NIST Big Data Interoperability Framework: https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-1.pdf
- Google Drive FAQ: https://support.google.com/drive/answer/37603?hl=en
- Python Data Analysis for Beginners Guide GitHub repository: https://github.com/mwintjen/Python_Data_Analysis_Beginners_Guide
- Dimensional Modeling Techniques by Dr. Ralph Kimball from his book, The Data Warehouse Toolkit: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
- IBM Dictionary of Computing: http://portal.acm.org/citation.cfm?id=541721
- Kaggle datasets: https://www.kaggle.com/datasets
- FiveThirtyEight datasets: https://data.fivethirtyeight.com/
- The World Bank Data sources: https://data.worldbank.org/
- The Creative Commons license information: https://creativecommons.org/
- The Data Literacy Project site: https://thedataliteracyproject.org/