Reader small image

You're reading from  Cracking the Data Science Interview

Product typeBook
Published inFeb 2024
PublisherPackt
ISBN-139781805120506
Edition1st Edition
Concepts
Right arrow
Authors (2):
Leondra R. Gonzalez
Leondra R. Gonzalez
author image
Leondra R. Gonzalez

Leondra R. Gonzalez is a data scientist at Microsoft and Chief Data Officer for tech startup CulTRUE, with 10 years of experience in tech, entertainment, and advertising. During her academic career, she has completed educational opportunities with Google, Amazon, NBC, and AT&T.
Read more about Leondra R. Gonzalez

Aaren Stubberfield
Aaren Stubberfield
author image
Aaren Stubberfield

Aaren Stubberfield is a senior data scientist for Microsoft's digital advertising business and the author of three popular courses on Datacamp. He graduated with an MS in Predictive Analytics and has over 10 years of experience in various data science and analytical roles focused on finding insights for business-related questions.
Read more about Aaren Stubberfield

View More author details
Right arrow

Querying Databases with SQL

In this chapter, you’ll learn the essential aspects of databases, starting with a broad overview, then diving deep into the fundamental language of SQL, exploring crucial concepts such as subqueries, JOIN, CASE WHEN, window functions, aggregations, and how to tackle complex queries.

Our goal is to provide you with the knowledge and tools necessary to tackle any database-related question in a technical interview effectively. This is crucial for those preparing for a technical interview because understanding databases is a foundational skill for data scientists; equipped with the knowledge shared in these chapters, you’ll be ready to face any database question confidently and proficiently.

So, in this chapter, we will cover the following topics:

  • Introducing relational databases
  • Mastering SQL basics
  • Aggregating with GROUP BY and HAVING
  • Creating fields with CASE WHEN
  • Analyzing subqueries and CTEs
  • Merging tables...

Introducing relational databases

A database is a critical component in data-driven businesses and organizations, and data scientists need to understand its structure, functions, and underlying language. This section aims to introduce you to relational databases, focusing on the common language of SQL.

A relational database (also known as a SQL database) is a type of database that organizes data into tables, where each table has rows and columns. Each table represents a specific entity type, such as Customers or Products. Much like DataFrames, each row represents a unique record (or records), and each column represents a field (or attribute) of the data. This relational model introduced a standard way to represent and query data independent of any specific application. You can see an example of a relational database in Figure 5.1:

Figure 5.1: Relational data model example – star schema

Figure 5.1: Relational data model example – star schema

What makes relational databases so powerful is their ability...

Mastering SQL basics

As a data scientist, mastering the basics of SQL is crucial. Luckily for you, the basics are pretty easy to grasp, even for non-technical learners. This is because, at this stage, SQL generally reads like English sentences. To get you started, this section focuses on three fundamental components of SQL: the SELECT, WHERE, and ORDER BY statements.

The SELECT statement

The SELECT statement is the foundation of any SQL query and is used to retrieve data from a database. The general syntax is as follows:

SELECT column1, column2, ..., columnN
FROM table_name;

The syntax lists the different columns you want to return, separated by a comma. Since databases hold numerous tables, the query code specifies which table to select the columns using the FROM statement. Lastly, the semi-colon (;) is used to mark the end of a query.

Note

It is standard to create a new line of the query for each main clause (which is capitalized). Here, we started a new line once...

Aggregating data with GROUP BY and HAVING

Aggregation is a concept with which you should already be familiar thanks to the discussion of Python using pandas in Chapter 3. Just like in Python, aggregation in SQL is about summarizing or grouping data in a way that makes it more useful, understandable, and manageable. GROUP BY and HAVING are two crucial components in SQL that help accomplish this.

The GROUP BY statement

Much like how grouping is performed in Python using pandas, the GROUP BY statement in SQL is used with aggregate functions (such as COUNT, SUM, AVG, MAX, and MIN) to group the result set by one or more columns. Thus, using GROUP BY should be familiar to you! The syntax is as follows:

SELECT column1, column2, columnN aggregate_function(columnX)
FROM table
GROUP BY columns(s);

Aggregate values are best managed by using aliases. An alias is simply a nickname for a calculated or aggregated field or temporary table. Simply use the term AS, like so:

SELECT column1...

Creating fields with CASE WHEN

The CASE WHEN statement is a straightforward technique for creating new fields using conditional logic. It allows you to specify multiple conditions and define actions or outcomes for each condition. The CASE WHEN statement is often used to transform data, create calculated columns, or perform conditional aggregations. The syntax of the CASE WHEN statement is as follows:

CASE WHEN
condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
ELSE else_result
END As alias;

Here is an example where we create a new field that will detail if a student passed or failed, based on their scores:

SELECT student_id, student_name, exam_score,
CASE WHEN exam_score >= 60 THEN 'Pass'
ELSE 'Fail'
END AS result
FROM students;

This query creates a new field called result, and populates it with "Pass" when the student scored at least 60 on their exa; otherwise, it populates it "Fail". The results...

Analyzing subqueries and CTEs

SQL subqueries, also known as nested queries or inner queries, are queries that are embedded within the context of another SQL query. They are powerful tools for performing complex data manipulations that require one or more intermediary steps – that is, they are used for performing data manipulation operations that require multiple steps or depend on the result of an intermediary query.

That might sound complex and, indeed, subqueries can easily get very complicated. But once you know the rules of engagement, you’ll soon see that they’re very doable. Before implementing a subquery, ask yourself the following:

  • Where am I starting?
  • Where am I going?

If you can answer these two questions, you’ve won half the battle. The other half is determining what steps need to take place to get from point A (existing data) to point B (desired data). In this section, we will learn how to navigate multi-step queries with...

Merging tables with joins

SQL joins are used to combine rows from two or more tables based on a related column between them, providing a complete view of the data. We previously hinted at these related columns as primary keys and foreign keys.

As a refresher, a primary key is a column (or a combination of columns) in a database table that uniquely identifies each row in that table. A foreign key, on the other hand, is a column or a combination of columns in a table that establishes a link or a relationship to the primary key of another table.

As we dive into SQL joins, we will put our knowledge of primary and foreign keys to work!

Note

When discussing SQL joins, we will mostly focus on joining two tables to simplify the concepts. Traditionally, two joined tables are referred to as the left table and the right table.

Inner joins

INNER JOIN selects records that have matching values in both tables. Figure 5.8 best demonstrates the logic of this join type:

...

Calculating window functions

SQL window functions are an additional tool in your toolkit. Unlike aggregate functions, which return a single result per group of rows, window functions return a single result for each row, based on the context of that row within a window of related rows.

OVER, ORDER BY, PARTITION, and SET

Window functions have the following basic syntax:

<function> (<expression>)
OVER (
[PARTITION BY <expression_list>]
[ORDER BY <expression_list>] [ROWS|RANGE <frame specification>])

There are a few key concepts to understand here, so let’s break them down:

  • The OVER keyword is what differentiates a window function from a regular function; once you see it, you know you’re in window function land. The OVER clause defines the window or subset of rows within a query result set that the window function operates on. In short, it provides a way to partition the result set into logical groups and allows the window...

Approaching complex queries

Writing complex SQL queries can be a challenging task, especially when dealing with multiple tables, complex filtering conditions, and intricate calculations. However, by following a step-by-step approach, you can break down the problem into smaller, manageable parts and gradually build up to the final query.

Here are some systematic guidelines on how to approach complex queries:

  • Step 1 – define the objective: Begin by clearly defining the objective of your query. What specific information are you trying to retrieve or calculate? What is the desired output?
  • Step 2 – identify the tables: Determine which tables contain the necessary data for your objective and identify their respective keys. This helps us identify our starting point. If multiple tables are involved, consider the relationships between them and how they should be joined. Determine the key(s) in each table.
  • Step 3 – determine the filtering criteria: Identify...

Summary

In this chapter, we learned the basics of databases and SQL, which are topics that many data scientists encounter in interviews. In fact, as a data scientist, you will almost certainly be quizzed on this topic during interviews. We touched on basic querying concepts, subqueries, joins, window functions, evaluation order, aggregation, filtration, and how to approach complex problems. However, SQL is yet another topic that commands an entire book on its own.

Rest assured that in most cases, there is more than one way to solve a problem, but there are often limited optimal ways to do so. Thus, be sure to spend adequate time practicing the concepts discussed in this chapter. Try not to memorize queries; instead, familiarize yourself with the common use cases explained in this chapter. Follow the aforementioned steps to break down complex problems, but be aware that the order of these steps is not set in stone. Once mastered, you will be able to identify the right query for any...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Cracking the Data Science Interview
Published in: Feb 2024Publisher: PacktISBN-13: 9781805120506
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Leondra R. Gonzalez

Leondra R. Gonzalez is a data scientist at Microsoft and Chief Data Officer for tech startup CulTRUE, with 10 years of experience in tech, entertainment, and advertising. During her academic career, she has completed educational opportunities with Google, Amazon, NBC, and AT&T.
Read more about Leondra R. Gonzalez

author image
Aaren Stubberfield

Aaren Stubberfield is a senior data scientist for Microsoft's digital advertising business and the author of three popular courses on Datacamp. He graduated with an MS in Predictive Analytics and has over 10 years of experience in various data science and analytical roles focused on finding insights for business-related questions.
Read more about Aaren Stubberfield