You're reading from The Data Wrangling Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd Edition
Authors (3):
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.

8. RDBMS and SQL

Overview

This chapter introduces the basics of using an RDBMS: querying a database from Python and loading the results into a pandas DataFrame. It explains core database concepts, including creating, manipulating, and controlling databases, and shows how to transform tables into pandas DataFrames. By the end of this chapter, you will have learned some basic SQL commands. This knowledge will make you adept at adding, updating, retrieving, and deleting data from databases, another valuable skill in a budding data wrangling expert's repertoire.

Introduction

This chapter of our data journey focuses on Relational Database Management Systems (RDBMSes) and Structured Query Language (SQL). In the previous chapter, we stored data in a file and read it back. In this chapter, we will read structured data, design access to the data, and create query interfaces for databases.

For years, the RDBMS format has been the conventional way to store data. An RDBMS is one of the safest ways to store, manage, and retrieve data. It is backed by a solid mathematical foundation (relational algebra and calculus) and exposes an efficient and intuitive declarative language – SQL – for easy interaction. Almost every language has a rich set of libraries to interact with different RDBMS, and the tricks and methods of using them are well tested and well understood.

Scaling an RDBMS is a well-understood task, and there is a pool of well-trained, experienced professionals (DBAs, or database administrators) to do this job.

So, it...

Refresher of RDBMS and SQL

An RDBMS is a piece of software that manages data (presented to the end user in tabular form) on physical hard disks and is built on Codd's relational model. Most of the databases that we encounter today are RDBMSes. In recent years, there has been a huge industry shift toward a newer kind of database management system, called NoSQL (MongoDB, CouchDB, Riak, and so on). Although these systems follow some of the rules of an RDBMS in certain respects, in most cases they reject or modify them.

How Is an RDBMS Structured?

The RDBMS structure consists of three main elements, namely the storage engine, the query engine, and log management. Here is a diagram that demonstrates the structure of an RDBMS:

Figure 8.2: RDBMS structure

The following are the main concepts of any RDBMS structure:

  • Storage engine: This is the part of the RDBMS that is responsible for storing data in an efficient way and also retrieving it,...

Relation Mapping in Databases

We have been working with a single table, altering it and reading back the data. However, the real power of an RDBMS comes from handling relationships among different objects (tables). In this section, we are going to create a new table called comments and link it with the user table in a 1:N relationship, meaning that one user can have multiple comments. We do this by adding the user table's primary key as a foreign key in the comments table.
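The linking step described above can be sketched with Python's built-in sqlite3 module. This is a minimal sketch rather than the chapter's exact code: it uses an in-memory database, and the column names (email as the primary key of user, user_id and comments in the comments table) are assumptions made for illustration.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained (assumption:
# a real project would connect to a database file on disk instead).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# SQLite only enforces foreign keys when this pragma is switched on.
cursor.execute("PRAGMA foreign_keys = 1")

# Parent table: each user is identified by a unique email (assumed schema).
cursor.execute("""
    CREATE TABLE user (
        email TEXT PRIMARY KEY,
        first_name TEXT
    )
""")

# Child table: user_id stores the parent's primary key as a foreign key,
# so one user row can be referenced by many comment rows (1:N).
cursor.execute("""
    CREATE TABLE comments (
        user_id TEXT,
        comments TEXT,
        FOREIGN KEY (user_id) REFERENCES user (email)
    )
""")

cursor.execute("INSERT INTO user VALUES ('bob@example.com', 'Bob')")
cursor.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [("bob@example.com", "first comment"), ("bob@example.com", "second comment")],
)
conn.commit()

# One user, many comments.
comment_count = cursor.execute(
    "SELECT COUNT(*) FROM comments WHERE user_id = 'bob@example.com'"
).fetchone()[0]
print(comment_count)  # 2

# A comment that points at a non-existent user is rejected by the engine.
try:
    cursor.execute("INSERT INTO comments VALUES ('nobody@example.com', 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```

The foreign key is what turns two independent tables into a parent and a child: the engine itself refuses comment rows whose user_id does not match an existing user.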

When we link two tables, we need to tell the database engine what to do if a parent row that has many children in the other table is deleted. As we can see in the following diagram, we are asking what happens at the place of the question marks when we delete row1 of the user table:

Figure 8.6: Illustration of relations

In a non-RDBMS situation, this situation can quickly become difficult...
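With an RDBMS, the answer to the question above can be declared on the foreign key itself. The sketch below shows one possible policy, ON DELETE CASCADE, which removes the child rows together with the parent; SQLite also supports other actions such as SET NULL and RESTRICT. The schema and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
# Foreign key enforcement is off by default in SQLite.
cursor.execute("PRAGMA foreign_keys = 1")

cursor.execute("CREATE TABLE user (email TEXT PRIMARY KEY)")

# ON DELETE CASCADE: deleting a user also deletes that user's comments.
cursor.execute("""
    CREATE TABLE comments (
        user_id TEXT,
        comments TEXT,
        FOREIGN KEY (user_id) REFERENCES user (email) ON DELETE CASCADE
    )
""")

# Hypothetical sample data: Bob has two comments, Tom has one.
cursor.execute("INSERT INTO user VALUES ('bob@example.com')")
cursor.execute("INSERT INTO user VALUES ('tom@example.com')")
cursor.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("bob@example.com", "comment 1"),
        ("bob@example.com", "comment 2"),
        ("tom@example.com", "comment 3"),
    ],
)

# Delete the parent row; the engine removes the dependent child rows.
cursor.execute("DELETE FROM user WHERE email = 'bob@example.com'")
conn.commit()

remaining = cursor.execute("SELECT user_id, comments FROM comments").fetchall()
print(remaining)  # [('tom@example.com', 'comment 3')]
```

Because the rule lives in the schema, application code never has to remember to clean up orphaned comments by hand.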

Joins

Now, we will learn how to exploit the relationship we just built. This means that if we have the primary key from one table, we can recover all the data needed from that table and also all the linked rows from the child table. To achieve this, we will use something called a join.

A join is a way to retrieve linked rows from two tables using a primary key/foreign key relation between them. There are many types of join, including INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS, and they are used in different situations. However, most of the time, in simple 1:N relations, we end up using an INNER join. In Chapter 1, Introduction to Data Wrangling with Python, we learned about sets. We can view an INNER join as an intersection of two sets. The following diagram illustrates the concept:

Figure 8.7: A diagram representing the intersection join

Here, A represents one table, and B represents another. The meaning of having...
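The intersection picture can be checked with a small experiment. The sketch below, which assumes the same illustrative user and comments schema and a hypothetical user Tom who has written no comments, compares an INNER join with a LEFT OUTER join: the INNER join keeps only rows whose key appears in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("CREATE TABLE user (email TEXT PRIMARY KEY, first_name TEXT)")
cursor.execute("CREATE TABLE comments (user_id TEXT, comments TEXT)")

cursor.executemany(
    "INSERT INTO user VALUES (?, ?)",
    [
        ("bob@example.com", "Bob"),
        ("tom@example.com", "Tom"),  # Tom has no comments
    ],
)
cursor.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("bob@example.com", "comment 1"),
        ("bob@example.com", "comment 2"),
    ],
)

# INNER JOIN: only users whose email also appears in comments.user_id.
inner_rows = cursor.execute("""
    SELECT user.email, comments.comments
    FROM user
    INNER JOIN comments ON comments.user_id = user.email
    ORDER BY comments.comments
""").fetchall()

# LEFT OUTER JOIN: every user, with NULL (None) where no comment matches.
left_rows = cursor.execute("""
    SELECT user.email, comments.comments
    FROM user
    LEFT OUTER JOIN comments ON comments.user_id = user.email
    ORDER BY user.email, comments.comments
""").fetchall()

print(inner_rows)
print(left_rows)
```

Tom is absent from inner_rows because his email is not in the intersection of the two key sets, while left_rows still lists him with None in place of a comment.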

Retrieving Specific Columns from a JOIN Query

In the previous exercise, we saw that we can use a JOIN to fetch the related rows from two tables. However, if we look at the results, we will see that it returned all the columns, thus combining both tables. This is not very concise. What if we only want to see the emails and the related comments, and not all the data?

There is some nice shorthand code that lets us do this:

import sqlite3
with sqlite3.connect("../lesson.db") as conn:
    cursor = conn.cursor()
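    # SQLite enforces foreign keys only when this pragma is enabled
    cursor.execute("PRAGMA foreign_keys = 1")
    # comments.* restricts the result to the columns of the comments table
    sql = """
    SELECT comments.* FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email='bob@example.com'
    """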
    rows = cursor.execute(sql)
    for...
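A self-contained variant of the same idea is sketched below. It names individual columns instead of using a table-wide wildcard, so that only the email and the comment text come back. The in-memory database, the age column, the AS aliases, and the sample rows are assumptions made for illustration and are not taken from lesson.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("PRAGMA foreign_keys = 1")

# Assumed schema, mirroring the user/comments relationship in the text.
cursor.execute(
    "CREATE TABLE user (email TEXT PRIMARY KEY, first_name TEXT, age INTEGER)"
)
cursor.execute("""
    CREATE TABLE comments (
        user_id TEXT,
        comments TEXT,
        FOREIGN KEY (user_id) REFERENCES user (email)
    )
""")
cursor.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("bob@example.com", "Bob", 31), ("tom@example.com", "Tom", 42)],
)
cursor.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("bob@example.com", "first comment"),
        ("bob@example.com", "second comment"),
        ("tom@example.com", "another comment"),
    ],
)

# Name only the columns we need; the ? placeholder avoids building the
# SQL string by hand from user-supplied values.
sql = """
    SELECT user.email AS email, comments.comments AS comment
    FROM comments
    JOIN user ON comments.user_id = user.email
    WHERE user.email = ?
    ORDER BY comment
"""
cursor.execute(sql, ("bob@example.com",))
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
for row in rows:
    print(row)
```

Listing columns explicitly keeps the result narrow: first_name and age never leave the database, even though the user table takes part in the join.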

Summary

We have come to the end of the database chapter. We have learned how to connect to SQLite using Python. We have brushed up on the basics of relational databases and how to open and close a database. We then learned how to export data from this relational database into pandas DataFrames.

In the next chapter, we will perform data wrangling on datasets that are used in business use cases. We will use different types of datasets and then clean and process the data in a meaningful way. We will be able to apply all the skills and tricks we have learned so far in this book to process data and get valuable insights from it.
