Data Literacy With Python
A Comprehensive Guide to Understanding and Analyzing Data with Python

Product type: Paperback
Published in: Jul 2024
Publisher: Mercury_Learning
ISBN-13: 9781836640097
Length: 271 pages
Edition: 1st Edition
Authors (2): Mercury Learning and Information, Oswald Campesato

Table of Contents (9):
Preface
Chapter 1: Working With Data
Chapter 2: Outlier and Anomaly Detection
Chapter 3: Cleaning Datasets
Chapter 4: Introduction to Statistics
Chapter 5: Matplotlib and Seaborn
Appendix A: Introduction to Python
Appendix B: Introduction to Pandas
Index

WHERE DO WE FIND DATA?

Data resides in many locations, with different formats, languages, and currencies. An important task involves finding the sources of relevant data and then aggregating that data in a meaningful fashion. Some examples of sources of data are as follows:

CSV/TSV files

RDBMS tables

NoSQL tables

Web Services

The following subsections briefly describe each of these data sources.

Working With CSV Files

A CSV file (comma-separated values) or TSV file (tab-separated values) is a common source of data, and other delimiters (semi-colons, “#” symbols, and so forth) can also appear in a text file with data. Moreover, you might need to combine multiple CSV files into a single file that contains the data to perform an accurate analysis.
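The reading and combining steps described above can be sketched with pandas. This is an illustration, not an example from the book; the sample data is defined inline rather than read from disk, and the column subset is hypothetical:

```python
# Sketch: reading delimited data with pandas and combining two sources
# into a single DataFrame. Inline StringIO stands in for real files.
import io
import pandas as pd

csv_data = io.StringIO(
    "survived,pclass,sex,age\n"
    "0,3,male,22.0\n"
    "1,1,female,38.0\n"
)
tsv_data = io.StringIO("survived\tpclass\tsex\tage\n1\t3\tfemale\t26.0\n")

df1 = pd.read_csv(csv_data)            # comma-separated (the default)
df2 = pd.read_csv(tsv_data, sep="\t")  # other delimiters via the sep parameter
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.shape)  # (3, 4)
```

The same `sep` parameter handles semicolons, "#" symbols, and other delimiters mentioned above.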

As a simple example, the following snippet displays a portion of the titanic.csv dataset:

survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True

As you can see, there are many columns (also called “features”) in the preceding set of data. When you perform machine learning, you need to determine which of those columns provide meaningful data. Notice the survived attribute: this is known as the target feature, which contains the values that you are trying to predict correctly. The prediction of who survives is based on identifying the columns (features) that are relevant in making such a prediction.
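As an illustration (not from the book), separating the target feature from the candidate predictor features with pandas might look like this; the miniature dataset below recreates a few titanic.csv rows inline:

```python
# Sketch: splitting a DataFrame into the target feature (survived)
# and the candidate predictor features.
import pandas as pd

df = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

y = df["survived"]               # the target feature: values to predict
X = df.drop(columns="survived")  # the candidate predictor features
print(X.columns.tolist())  # ['pclass', 'sex', 'age']
```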

For example, the sex, age, and class features are most likely relevant for determining whether or not a passenger survived the sinking of the Titanic. How do you know if you have selected all of the relevant features, and only the relevant features?

There are two main techniques for doing so. In some datasets it’s possible to visually inspect the features of a dataset in order to determine the most important features. Loosely speaking, when you “eyeball” the data to determine the set of relevant features, that’s called feature selection. This approach can be viable when there is a relatively small number of features in the dataset (i.e., ten or fewer features).

On the other hand, it’s very difficult to visually determine the relevant features in a dataset that contains 5,000 columns. Fortunately, you can use an algorithm such as PCA (Principal Component Analysis) to determine which features are significant. The use of such an algorithm (and there are others as well) is called feature extraction.
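A minimal sketch of PCA-based feature extraction, assuming scikit-learn and synthetic data (this example is not from the book):

```python
# Sketch: fitting PCA and inspecting how much variance each component
# explains. Components with a large explained-variance ratio point to
# the significant directions in the data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 0] = 3 * X[:, 1] + rng.normal(scale=0.1, size=100)  # correlated pair

pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)  # sorted from largest to smallest
```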

Moreover, it’s important to enlist the aid of a so-called domain expert (which might be you) who can assist in determining the most important features of a dataset, and also determine if there are any missing features that are important in the selection of features.

Working With RDBMS Data

An RDBMS (relational database management system) stores data in a structured manner by utilizing database tables whose structure is defined by you. For example, suppose that you have an online store that sells products, and you want to keep track of customers, purchase orders, and inventory.

One approach involves defining a customer’s table, which has the following (simplified) type of structure:

CREATE TABLE customers (
cust_id INTEGER,
first_name VARCHAR(20),
last_name VARCHAR(20),
home_address VARCHAR(20),
city VARCHAR(20),
state VARCHAR(20),
zip_code VARCHAR(10));

Next, you can use SQL (structured query language) statements in order to insert data into the customers table, as shown here:

INSERT INTO customers
VALUES (1000,'John','Smith','123 Main St',
'Fremont','CA','94123');

INSERT INTO customers
VALUES (2000,'Jane','Jones','456 Front St',
'Fremont','CA','95015');

In a real application you obviously need real data, which you can gather from a registration page that enables users to sign up for your Web application (we'll skip those details).

If you use an RDBMS such as MySQL, you can define a database and database tables, such as the customers table described previously. The following SQL statement displays the structure of the customers table that was defined previously:

mysql> DESCRIBE customers;
+--------------+-------------+------+-----+---------+-------+
| Field        | Type        | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| cust_id      | int         | YES  |     | NULL    |       |
| first_name   | varchar(20) | YES  |     | NULL    |       |
| last_name    | varchar(20) | YES  |     | NULL    |       |
| home_address | varchar(20) | YES  |     | NULL    |       |
| city         | varchar(20) | YES  |     | NULL    |       |
| state        | varchar(20) | YES  |     | NULL    |       |
| zip_code     | varchar(10) | YES  |     | NULL    |       |
+--------------+-------------+------+-----+---------+-------+
7 rows in set (0.03 sec)

After manually inserting data with a SQL INSERT statement, you can also select the data from the customers table via a SQL SELECT statement (the simulated data in the output differs from the previous data):

[Screenshot in the original: output of a SELECT statement against the customers table]

In simplified terms, an RDBMS involves the following tasks:

Define the relevant tables

Insert meaningful data into the tables

Select useful data from the tables

One way to insert data involves programmatically loading data from CSV files into the database tables. An RDBMS also provides many other useful features, including the ability to export all of the tables in a database as a single SQL file that contains the statements required to recreate those tables and re-insert their existing data.
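The CSV-loading approach mentioned above can be sketched in Python. This illustration uses the built-in sqlite3 module rather than MySQL, and a simplified three-column version of the customers table:

```python
# Sketch: programmatically loading rows from CSV data into a database
# table. An in-memory StringIO stands in for a real CSV file.
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (cust_id INTEGER, first_name TEXT, last_name TEXT)"
)

csv_file = io.StringIO(
    "cust_id,first_name,last_name\n1000,John,Smith\n2000,Jane,Jones\n"
)
reader = csv.DictReader(csv_file)
rows = [(int(r["cust_id"]), r["first_name"], r["last_name"]) for r in reader]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # 2
```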

You also need a purchase orders table to keep track of which customers have made purchases from your store. An example of the structure of a purchase orders table is shown here:

mysql> DESCRIBE purch_orders;
+-----------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| cust_id   | int  | YES  |     | NULL    |       |
| purch_id  | int  | YES  |     | NULL    |       |
| line_item | int  | YES  |     | NULL    |       |
+-----------+------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Notice that each row in the purch_orders table contains a cust_id and a purch_id column: that’s because a purchase order is associated with a customer, and a customer can place one or more purchase orders. In database parlance, the customers table has a one-to-many relationship with the purchase orders table, and every row in the latter table must have an associated row in the customers table (and those that do not are called “orphans”).

In fact, there is also a one-to-many relationship between the purchase orders table and the item_desc table, where the latter contains information about each product that was purchased in a given purchase order. Note that each row in a purchase order is called a line item.
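The one-to-many relationship described above can be illustrated with a JOIN. This sketch (not from the book) again uses sqlite3, with made-up rows for the customers and purch_orders tables:

```python
# Sketch: one customer row matching many purchase-order rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (cust_id INTEGER, first_name TEXT);
CREATE TABLE purch_orders (cust_id INTEGER, purch_id INTEGER, line_item INTEGER);
INSERT INTO customers VALUES (1000, 'John'), (2000, 'Jane');
INSERT INTO purch_orders VALUES (1000, 1, 1), (1000, 2, 1), (2000, 3, 1);
""")

# Count each customer's purchase orders via the shared cust_id column.
result = conn.execute("""
    SELECT c.first_name, COUNT(*) AS num_orders
    FROM customers c JOIN purch_orders p ON c.cust_id = p.cust_id
    GROUP BY c.cust_id ORDER BY c.cust_id
""").fetchall()
print(result)  # [('John', 2), ('Jane', 1)]
```

A purch_orders row whose cust_id matches no customers row would be an "orphan" and would silently drop out of this JOIN.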

Working With NoSQL Data

A NoSQL database is useful when the data that you manage does not have a fixed structure. Examples of popular NoSQL databases are MongoDB and Cassandra.

Instead of defining a fixed structure for tables, you can populate a NoSQL database dynamically with documents, where documents belong to a collection instead of a table. Obviously, documents can have different lengths and contain different text, which can be conveniently stored and accessed in a collection in a NoSQL database.
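The document model described above can be illustrated with plain Python dicts. In MongoDB (for example, via the pymongo library) each document is a JSON-like object in a collection, and documents in the same collection need not share a fixed set of fields; this sketch only mimics that idea in pure Python:

```python
# Sketch: a "collection" of schemaless documents, modeled as a list
# of dicts with differing fields.
customers = [
    {"cust_id": 1000, "first_name": "John", "last_name": "Smith"},
    {"cust_id": 2000, "first_name": "Jane", "orders": [1, 2]},
]

# No schema change is needed to add a document with brand-new fields:
customers.append({"cust_id": 3000, "nickname": "Ozzie"})

# A simple "query": find the documents that have an orders field.
with_orders = [doc for doc in customers if "orders" in doc]
print(len(with_orders))  # 1
```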
