You're reading from Cracking the Data Science Interview

Product typeBook

Published inFeb 2024

PublisherPackt

ISBN-139781805120506

Edition1st Edition

Concepts

Data Science

Authors (2):

Leondra R. Gonzalez

Aaren Stubberfield

View More author details

Querying Databases with SQL

In this chapter, you’ll learn the essential aspects of databases, starting with a broad overview, then diving deep into the fundamental language of SQL, exploring crucial concepts such as subqueries, JOIN, CASE WHEN, window functions, aggregations, and how to tackle complex queries.

Our goal is to provide you with the knowledge and tools necessary to tackle any database-related question in a technical interview effectively. This is crucial for those preparing for a technical interview because understanding databases is a foundational skill for data scientists; equipped with the knowledge shared in these chapters, you’ll be ready to face any database question confidently and proficiently.

So, in this chapter, we will cover the following topics:

Introducing relational databases
Mastering SQL basics
Aggregating with GROUP BY and HAVING
Creating fields with CASE WHEN
Analyzing subqueries and CTEs
Merging tables...

Introducing relational databases

A database is a critical component in data-driven businesses and organizations, and data scientists need to understand its structure, functions, and underlying language. This section aims to introduce you to relational databases, focusing on the common language of SQL.

A relational database (also known as a SQL database) is a type of database that organizes data into tables, where each table has rows and columns. Each table represents a specific entity type, such as Customers or Products. Much like DataFrames, each row represents a unique record (or records), and each column represents a field (or attribute) of the data. This relational model introduced a standard way to represent and query data independent of any specific application. You can see an example of a relational database in Figure 5.1:

Figure 5.1: Relational data model example – star schema

What makes relational databases so powerful is their ability...

Mastering SQL basics

As a data scientist, mastering the basics of SQL is crucial. Luckily for you, the basics are pretty easy to grasp, even for non-technical learners. This is because, at this stage, SQL generally reads like English sentences. To get you started, this section focuses on three fundamental components of SQL: the SELECT, WHERE, and ORDER BY statements.

The SELECT statement

The SELECT statement is the foundation of any SQL query and is used to retrieve data from a database. The general syntax is as follows:

SELECT column1, column2, ..., columnN
FROM table_name;

The syntax lists the different columns you want to return, separated by a comma. Since databases hold numerous tables, the query code specifies which table to select the columns using the FROM statement. Lastly, the semi-colon (;) is used to mark the end of a query.

Note

It is standard to create a new line of the query for each main clause (which is capitalized). Here, we started a new line once...

Aggregating data with GROUP BY and HAVING

Aggregation is a concept with which you should already be familiar thanks to the discussion of Python using pandas in Chapter 3. Just like in Python, aggregation in SQL is about summarizing or grouping data in a way that makes it more useful, understandable, and manageable. GROUP BY and HAVING are two crucial components in SQL that help accomplish this.

The GROUP BY statement

Much like how grouping is performed in Python using pandas, the GROUP BY statement in SQL is used with aggregate functions (such as COUNT, SUM, AVG, MAX, and MIN) to group the result set by one or more columns. Thus, using GROUP BY should be familiar to you! The syntax is as follows:

SELECT column1, column2, columnN aggregate_function(columnX)
FROM table
GROUP BY columns(s);

Aggregate values are best managed by using aliases. An alias is simply a nickname for a calculated or aggregated field or temporary table. Simply use the term AS, like so:

SELECT column1...

Creating fields with CASE WHEN

The CASE WHEN statement is a straightforward technique for creating new fields using conditional logic. It allows you to specify multiple conditions and define actions or outcomes for each condition. The CASE WHEN statement is often used to transform data, create calculated columns, or perform conditional aggregations. The syntax of the CASE WHEN statement is as follows:

CASE WHEN
condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
ELSE else_result
END As alias;

Here is an example where we create a new field that will detail if a student passed or failed, based on their scores:

SELECT student_id, student_name, exam_score,
CASE WHEN exam_score >= 60 THEN 'Pass'
ELSE 'Fail'
END AS result
FROM students;

This query creates a new field called result, and populates it with "Pass" when the student scored at least 60 on their exa; otherwise, it populates it "Fail". The results...

Analyzing subqueries and CTEs

SQL subqueries, also known as nested queries or inner queries, are queries that are embedded within the context of another SQL query. They are powerful tools for performing complex data manipulations that require one or more intermediary steps – that is, they are used for performing data manipulation operations that require multiple steps or depend on the result of an intermediary query.

That might sound complex and, indeed, subqueries can easily get very complicated. But once you know the rules of engagement, you’ll soon see that they’re very doable. Before implementing a subquery, ask yourself the following:

Where am I starting?
Where am I going?

If you can answer these two questions, you’ve won half the battle. The other half is determining what steps need to take place to get from point A (existing data) to point B (desired data). In this section, we will learn how to navigate multi-step queries with...

Merging tables with joins

SQL joins are used to combine rows from two or more tables based on a related column between them, providing a complete view of the data. We previously hinted at these related columns as primary keys and foreign keys.

As a refresher, a primary key is a column (or a combination of columns) in a database table that uniquely identifies each row in that table. A foreign key, on the other hand, is a column or a combination of columns in a table that establishes a link or a relationship to the primary key of another table.

As we dive into SQL joins, we will put our knowledge of primary and foreign keys to work!

Note

When discussing SQL joins, we will mostly focus on joining two tables to simplify the concepts. Traditionally, two joined tables are referred to as the left table and the right table.

Inner joins

INNER JOIN selects records that have matching values in both tables. Figure 5.8 best demonstrates the logic of this join type:

...

Calculating window functions

SQL window functions are an additional tool in your toolkit. Unlike aggregate functions, which return a single result per group of rows, window functions return a single result for each row, based on the context of that row within a window of related rows.

OVER, ORDER BY, PARTITION, and SET

Window functions have the following basic syntax:

<function> (<expression>)
OVER (
[PARTITION BY <expression_list>]
[ORDER BY <expression_list>] [ROWS|RANGE <frame specification>])

There are a few key concepts to understand here, so let’s break them down:

The OVER keyword is what differentiates a window function from a regular function; once you see it, you know you’re in window function land. The OVER clause defines the window or subset of rows within a query result set that the window function operates on. In short, it provides a way to partition the result set into logical groups and allows the window...

Approaching complex queries

Writing complex SQL queries can be a challenging task, especially when dealing with multiple tables, complex filtering conditions, and intricate calculations. However, by following a step-by-step approach, you can break down the problem into smaller, manageable parts and gradually build up to the final query.

Here are some systematic guidelines on how to approach complex queries:

Step 1 – define the objective: Begin by clearly defining the objective of your query. What specific information are you trying to retrieve or calculate? What is the desired output?
Step 2 – identify the tables: Determine which tables contain the necessary data for your objective and identify their respective keys. This helps us identify our starting point. If multiple tables are involved, consider the relationships between them and how they should be joined. Determine the key(s) in each table.
Step 3 – determine the filtering criteria: Identify...

Summary

In this chapter, we learned the basics of databases and SQL, which are topics that many data scientists encounter in interviews. In fact, as a data scientist, you will almost certainly be quizzed on this topic during interviews. We touched on basic querying concepts, subqueries, joins, window functions, evaluation order, aggregation, filtration, and how to approach complex problems. However, SQL is yet another topic that commands an entire book on its own.

Rest assured that in most cases, there is more than one way to solve a problem, but there are often limited optimal ways to do so. Thus, be sure to spend adequate time practicing the concepts discussed in this chapter. Try not to memorize queries; instead, familiarize yourself with the common use cases explained in this chapter. Follow the aforementioned steps to break down complex problems, but be aware that the order of these steps is not set in stone. Once mastered, you will be able to identify the right query for any...

The rest of the chapter is locked

You have been reading a chapter from

Cracking the Data Science Interview

Published in: Feb 2024Publisher: PacktISBN-13: 9781805120506

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Leondra R. Gonzalez

Leondra R. Gonzalez is a data scientist at Microsoft and Chief Data Officer for tech startup CulTRUE, with 10 years of experience in tech, entertainment, and advertising. During her academic career, she has completed educational opportunities with Google, Amazon, NBC, and AT&T.
Read more about Leondra R. Gonzalez

Aaren Stubberfield

Aaren Stubberfield is a senior data scientist for Microsoft's digital advertising business and the author of three popular courses on Datacamp. He graduated with an MS in Predictive Analytics and has over 10 years of experience in various data science and analytical roles focused on finding insights for business-related questions.
Read more about Aaren Stubberfield

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages