WHERE DO WE FIND DATA?
Data resides in many locations, with different formats, languages, and currencies. An important task involves finding the sources of relevant data and then aggregating that data in a meaningful fashion. Some examples of sources of data are as follows:
• CSV/TSV files
• RBDMS tables
• NoSQL tables
• Web Services
The following subsections briefly describe some of the details that are involved with each of the items in the preceding bullet list.
Working With CSV Files
A CSV file (comma-separated values) or TSV file (tab-separated values) is a common source of data, and other delimiters (semi-colons, “#” symbols, and so forth) can also appear in a text file with data. Moreover, you might need to combine multiple CSV files into a single file that contains the data to perform an accurate analysis.
As a simple example, the following snippet displays a portion of the titanic.csv dataset:
survived,pclass,sex,age,sibsp,parch,fare,embarked,
class,who,adult_male,deck,embark_town,alive,alone 0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,
no,False 1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,
Cherbourg,yes,False 1,3,female,26.0,0,0,7.925,S,Third,woman,False,,
Southampton,yes,True
As you can see, there are many columns (also called “features”) in the preceding set of data. When you perform machine learning, you need to determine which of those columns provide meaningful data. Notice the survived attribute: this is known as the target feature, which contains the values that you are trying to predict correctly. The prediction of who survives is based on identifying the columns (features) that are relevant in making such a prediction.
For example, the sex, age, and class features are most likely relevant for determining whether or not a passenger survived the fate of the Titanic. How do you know if you have selected all the relevant features, and only the relevant features?
There are two main techniques for doing so. In some datasets it’s possible to visually inspect the features of a dataset in order to determine the most important features. Loosely speaking, when you “eyeball” the data to determine the set of relevant features, that’s called feature selection. This approach can be viable when there is a relatively small number of features in the dataset (i.e., ten or fewer features).
On the other hand, it’s very difficult to visually determine the relevant features in a dataset that contains 5,000 columns. Fortunately, you can use an algorithm such as PCA (Principal Component Analysis) to determine which features are significant. The use of such an algorithm (and there are others as well) is called feature extraction.
Moreover, it’s important to enlist the aid of a so-called domain expert (which might be you) who can assist in determining the most important features of a dataset, and also determine if there are any missing features that are important in the selection of features.
Working With RDBMS Data
An RDBMS (relational database management system) stores data in a structured manner by utilizing database tables whose structure is defined by you. For example, suppose that you have an online store that sells products, and you want to keep track of customers, purchase orders, and inventory.
One approach involves defining a customer’s table, which has the following (simplified) type of structure:
CREATE TABLE customers ( cust_id INTEGER, first_name VARCHAR(20), last_name VARCHAR(20), home_address VARCHAR(20), city VARCHAR(20), state VARCHAR(20), zip_code VARCHAR(10));
Next, you can use SQL (structured query language) statements in order to insert data into the customers table, as shown here:
INSERT INTO customers VALUES (1000,'John','Smith','123 Main St',
'Fremont','CA','94123'); INSERT INTO customers VALUES (2000,'Jane','Jones','456 Front St',
'Fremont','CA','95015');
In a real application you obviously need real data, which you can gather from a Web registration page that enables users to register for your Web application (we’ll skip those details).
If you use an RDBMS such as MySQL, you can define a database and database tables, such as the customers table described previously. The following SQL statement displays the structure of the customers table that was defined previously:
mysql> DESCRIBE customers; +---------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------+----------+------+-----+---------+-------+ | cust_id | int | YES | | NULL | | | name | char(30) | YES | | NULL | | | address | char(30) | YES | | NULL | | | email | char(30) | YES | | NULL | | +---------+----------+------+-----+---------+-------+ 4 rows in set (0.03 sec)
After manually inserting data with a SQL INSERT statement, you can also select the data from the customers table via a SQL SELECT statement, as shown here (the simulated data is different from the previous data):

In simplified terms, an RDBMS involves the following tasks:
• Define the relevant tables
• Insert meaningful data into the tables
• Select useful data from the tables
One way to insert data involves programmatically loading data from CSV files into the database tables. An RDBMS provides many useful features, which includes exporting the data from all the tables in a database, and the export file can be a single SQL file that contains all the SQL statements that are required for creating the relevant tables and inserting existing data (i.e., data that you already inserted) into those tables.
You also need a purchase orders table to keep track of which customers have made purchases from your store. An example of the structure of a purchase orders table is shown here:
mysql> DESCRIBE purch_orders; +-----------+------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+------+------+-----+---------+-------+ | cust_id | int | YES | | NULL | | | purch_id | int | YES | | NULL | | | line_item | int | YES | | NULL | | +-----------+------+------+-----+---------+-------+ 3 rows in set (0.01 sec)
Notice that each row in the purch_orders table contains a cust_id and a purch_id column: that’s because a purchase order is associated with a customer, and a customer can place one or more purchase orders. In database parlance, the customers table has a one-to-many relationship with the purchase orders table, and every row in the latter table must have an associated row in the customers table (and those that do not are called “orphans”).
In fact, there is also a one-to-many relationship between the purchase orders table and the item_desc table, where the latter contains information about each product that was purchased in a given purchase order. Note that each row in a purchase order is called a line item.
Working With NoSQL Data
A NoSQL database is useful when the data that you manage does not have a fixed structure. Examples of popular NoSQL databases are MongoDB and Cassandra.
Instead of defining a fixed structure for tables, you can populate a NoSQL database dynamically with documents, where documents belong to a collection instead of a table. Obviously, documents can have different lengths and contain different text, which can be conveniently stored and accessed in a collection in a NoSQL database.