EXPLORATORY DATA ANALYSIS (EDA)
According to Wikipedia, EDA involves analyzing datasets to summarize their main characteristics, often with visual methods. EDA also involves searching through data to detect patterns (if there are any) and anomalies, and, in some cases, to test hypotheses regarding the distribution of the data.
EDA represents the initial phase of data analysis, whereby data is explored in order to determine its primary characteristics. This phase also involves detecting patterns (if any) as well as any outstanding issues pertaining to the data. The purpose of EDA is to obtain an understanding of the semantics of the data without performing a deep assessment of its nature. The analysis is often performed through data visualization in order to produce a summary of the data's most important characteristics. The four types of EDA are listed here:
• univariate nongraphical
• multivariate nongraphical
• univariate graphical
• multivariate graphical
In brief, the two primary methods for data analysis are qualitative techniques and quantitative techniques.
As an example of exploratory data analysis, consider the plethora of cell phones that customers can purchase for various needs (work, home, minors, and so forth). Visualizing the data in an associated dataset can reveal the top ten (or top three) most popular cell phones, and the analysis can also be broken down by state (or province) and country.
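The following sketch, written with pandas, shows one way to perform this kind of exploratory query; the file name phone_sales.csv and the column names phone_model and country are hypothetical placeholders for an actual dataset.

import pandas as pd

# Hypothetical dataset: one row per cell phone purchase,
# with (at least) the columns phone_model and country.
df = pd.read_csv("phone_sales.csv")

# Top ten most popular cell phone models overall:
top_ten = df["phone_model"].value_counts().head(10)
print(top_ten)

# Top three cell phone models within each country:
top_three_by_country = (
    df.groupby("country")["phone_model"]
      .value_counts()
      .groupby(level=0)
      .head(3)
)
print(top_three_by_country)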
An example of quantitative data analysis involves measuring (quantifying) data, which can be gathered from physical devices, surveys, or activities such as downloading applications from a Web page.
Common visualization techniques used in EDA include histograms, line graphs, bar charts, box plots, and multivariate charts.
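As a brief illustration, the following sketch uses pandas and Matplotlib to produce a histogram and a box plot; the dataset and its price and brand columns are hypothetical.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset containing a numeric price column and a brand column.
df = pd.read_csv("phone_sales.csv")

# Histogram: the distribution of a single numeric variable.
df["price"].plot(kind="hist", bins=20, title="Price Distribution")
plt.show()

# Box plot: compare the distribution of price across brands.
df.boxplot(column="price", by="brand")
plt.show()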
What Is Data Quality?
According to Wikipedia, data quality refers to “the state of qualitative or quantitative pieces of information” (Wikipedia, 2022). Furthermore, high data quality refers to data whose quality meets the various needs of an organization. In particular, performing data cleaning tasks assists in achieving high data quality.
When companies label their data, they obviously strive for a high quality of labeled data, and yet the quality can be adversely affected in various ways, some of which are as follows:
• inaccurate methodology for labeling data
• insufficient data accuracy
• insufficient attention to data management
The cumulative effect of the preceding (and other) types of errors can be significant, to the extent that models underperform in a production environment. In addition to the technical aspects, underperforming models can have an adverse effect on business revenue.
Related to data quality is data quality assurance, which typically involves data cleaning tasks that are discussed later in this chapter. After those tasks are completed, the data is analyzed to detect potential inconsistencies, and then a decision is made regarding how to resolve those inconsistencies. Another aspect to consider: the aggregation of additional data sources, especially heterogeneous sources of data, can introduce challenges with respect to ensuring data quality. Other concepts related to data quality include data stewardship and data governance, both of which are discussed in multiple online articles.
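As a concrete (and deliberately simple) illustration, the following sketch shows several basic data quality checks in pandas; the file name employees.csv, the column name age, and the age range are hypothetical.

import pandas as pd

# Hypothetical employee dataset.
df = pd.read_csv("employees.csv")

# Count the missing values in each column.
print(df.isnull().sum())

# Count the rows that are exact duplicates of an earlier row.
print(df.duplicated().sum())

# Display rows whose age value falls outside a plausible range
# (the range itself is a hypothetical business rule).
print(df[(df["age"] < 18) | (df["age"] > 99)])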
Data-Centric AI or Model-Centric AI?
A model-centric approach focuses primarily on enhancing the performance of a given model, while data is considered secondary in importance. In fact, during the past ten years or so, the emphasis of AI has been a model-centric approach. Note that during this time span some very powerful models and architectures have been developed, such as the CNN model for image classification in 2012, as well as the enormously influential (especially in NLP) models based on the transformer architecture that was developed in 2017.
By contrast, a data-centric approach concentrates on improving data, which relies on several factors, such as the quality of labels for the data as well as obtaining accurate data for training a model.
Given the importance of high-quality data with respect to training a model, it stands to reason that using a data-centric approach instead of a model-centric approach can result in higher quality AI models. While data quality and model effectiveness are both important, keep in mind that the data-centric approach is becoming increasingly strategic in the machine learning world. More information can be found on the AI Multiple site: https://research.aimultiple.com/data-centric-ai/
The Data Cleaning and Data Wrangling Steps
The next step often involves data cleaning in order to find and correct errors in the dataset, such as missing data, duplicate data, or invalid data. This task also involves data consistency, which pertains to updating different representations of the same value in a consistent manner. As a simple example, suppose that a Web page contains a form with an input field whose valid input is either Y or N, but users are able to enter Yes, Ys, or ys as text input. Obviously, these values correspond to the value Y, and they must all be converted to the same value in order to achieve data consistency.
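A minimal sketch of this consistency fix is shown here, using pandas; the column name answer is a hypothetical placeholder for the form's input field.

import pandas as pd

# Hypothetical user responses that should all reduce to Y or N.
df = pd.DataFrame({"answer": ["Y", "Yes", "Ys", "ys", "N", "no"]})

# Map the various user-entered spellings to the canonical values.
replacements = {"Yes": "Y", "Ys": "Y", "ys": "Y", "no": "N", "No": "N"}
df["answer"] = df["answer"].replace(replacements)

print(df["answer"].unique())   # ['Y' 'N']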
Finally, data wrangling can be performed after the data cleaning task is completed. Although interpretations of data wrangling do vary, in this book the term refers to transforming datasets into different formats as well as combining two or more datasets. Hence, data wrangling does not examine the individual data values to determine whether or not they are valid: this step is performed during data cleaning.
Keep in mind that sometimes it’s worthwhile to perform another data cleaning step after the data wrangling step. For example, suppose that two CSV files contain employee-related data, and you merge these CSV files into a third CSV file. The newly created CSV file might contain duplicate rows, and some apparent duplicates might actually refer to two different people with the same name (such as John Smith), which obviously needs to be resolved.
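The following sketch combines two hypothetical employee CSV files with pandas and then performs a follow-up cleaning step; the file and column names are placeholders.

import pandas as pd

# Hypothetical employee files that share the same column layout.
df1 = pd.read_csv("employees_site_a.csv")
df2 = pd.read_csv("employees_site_b.csv")

# Data wrangling: combine the two datasets into a single dataset.
merged = pd.concat([df1, df2], ignore_index=True)

# Follow-up data cleaning: drop rows that are exact duplicates.
# Two different people named John Smith are not exact duplicates
# if another column (such as employee_id) differs, so such rows
# still require manual resolution.
deduped = merged.drop_duplicates()
deduped.to_csv("employees_merged.csv", index=False)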
ELT and ETL
ELT is an acronym for extract, load, and transform, which is a pipeline-based approach for managing data. Another pipeline-based approach is called ETL (extract, transform, load), which is actually more popular than ELT. However, ELT has the following advantages over ETL:
• ELT requires less computational time.
• ELT is well-suited for processing large datasets.
• ELT is more cost effective than ETL.
ELT involves (1) extracting data from one or more sources, (2) loading the raw data into a data warehouse, and (3) transforming the data inside the warehouse into a suitable format. The transformed data then becomes available for additional analysis.
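The following sketch illustrates the ELT sequence using Python, pandas, and SQLite (standing in for a data warehouse); the file names, table names, and SQL query are hypothetical.

import sqlite3
import pandas as pd

# Extract: read raw data from a hypothetical source file.
raw = pd.read_csv("sales_raw.csv")

# Load: copy the raw data, as-is, into the warehouse.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales_raw", conn, if_exists="replace", index=False)

# Transform: reshape the data inside the warehouse using SQL.
conn.execute("DROP TABLE IF EXISTS sales_by_region")
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM sales_raw
    GROUP BY region
""")
conn.commit()
conn.close()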