DEALING WITH DATA: WHAT CAN GO WRONG?
In a perfect world, all datasets are in pristine condition, with no extreme values, no missing values, and no erroneous values. Every feature value is captured correctly, with no chance for any confusion. Moreover, no conversion is required between date formats, currency values, or languages, because a single universal standard defines the correct formats and acceptable values for every possible set of data values.
However, you cannot rely on the scenarios in the previous paragraph, which is the reason for the techniques discussed in this chapter. Even after you manage to create a wonderfully clean and robust dataset, other issues can arise, such as data drift, which is described in the next section.
In fact, the task of cleaning data is not necessarily complete even after a machine learning model is deployed to a production environment. For instance, an online system that gathers terabytes or petabytes of data on a daily basis can collect skewed values that in turn adversely affect the performance of the model. Such adverse effects can be revealed through changes in the metrics associated with the production model.
Datasets
In simple terms, a dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point,” and each column is called a “feature.” A dataset can be a CSV (comma-separated values) file, a TSV (tab-separated values) file, an Excel spreadsheet, a table in an RDBMS, a document in a NoSQL database, the output from a Web service, and so forth.
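For instance, here is a minimal sketch that uses the Pandas library to load and inspect a CSV-based dataset (the file name products.csv is hypothetical):

# Minimal sketch: load a CSV dataset with Pandas and inspect its
# rows (data points) and columns (features).
# "products.csv" is a hypothetical file name used for illustration.
import pandas as pd

df = pd.read_csv("products.csv")

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # the feature names
print(df.head(3))  # the first three data points
print(df.dtypes)   # the data type of each feature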
Note that a static dataset consists of fixed data. For example, a CSV file that contains the states of the United States is a static dataset. A slightly different example is a product table that contains information about the products that customers can buy from a company. Such a table is static if no new products are added to it. Discontinued products are probably maintained as historical data that can appear in product-related reports.
By contrast, a dynamic dataset consists of data that changes over a period of time. Simple examples include housing prices, stock prices, and time-based data from IoT devices.
A dataset can vary from very small (perhaps a few features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain for a particular dataset, then you might struggle to determine its most important features. In this situation, you consult a “domain expert” who understands the importance of the features, their interdependencies (if any), and whether or not the data values for the features are valid. In addition, dimensionality reduction algorithms, such as PCA (Principal Component Analysis), can help you determine the most important features.
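As a minimal sketch of dimensionality reduction (using the built-in Iris dataset from scikit-learn rather than any dataset from this chapter), the following snippet applies PCA and reports how much variance each principal component captures:

# Minimal sketch: reduce a 4-feature dataset to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The Iris dataset: 150 data points, 4 numeric features.
X = load_iris().data

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project the data onto the two strongest principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component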
Before delving into topics such as data preprocessing and data types, let’s take a brief detour to introduce the concept of feature importance, which is the topic of the next section.
As you will see, someone needs to analyze a dataset to determine which features are the most important and which can be safely ignored when training a model with that dataset. A dataset can contain various data types, such as:
• audio data
• image data
• numeric data
• text-based data
• video data
• combinations of the above
In this book, we’ll only consider datasets that contain columns with numeric or text-based data types, which can be further classified as follows:
• nominal (string-based or numeric)
• ordinal (ordered values)
• categorical (enumeration)
• interval (positive/negative values)
• ratio (nonnegative values)
The next section contains brief descriptions of the data types that are in the preceding bullet list.
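In the meantime, here is a minimal sketch (with hypothetical column names and values) that shows how several of these classifications can be represented in a Pandas DataFrame, including an ordered categorical column for ordinal data:

import pandas as pd

# Hypothetical data illustrating several of the preceding classifications.
df = pd.DataFrame({
    "color":  ["red", "green", "blue"],      # nominal (no inherent order)
    "size":   ["small", "large", "medium"],  # ordinal (ordered values)
    "temp_c": [-5.0, 12.5, 30.0],            # interval (negative values allowed)
    "weight": [1.2, 0.8, 3.4],               # ratio (nonnegative values)
})

# Encode "size" as an ordered categorical so that comparisons
# respect the ordering small < medium < large.
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df.dtypes)
print(df["size"].min())  # small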