Mining Data with Probability and Statistics

In this chapter, you will be introduced to the vital world of statistics, which serves as the foundation of applied data science. An understanding of these concepts is crucial for drawing meaningful conclusions and making informed decisions and predictions from data. This knowledge is not just an intellectual exercise; it equips you with essential tools to excel in advanced data science interviews by allowing you to uncover hidden insights within datasets.

This chapter will guide you through the essential aspects of classical statistics, including the analysis of populations and samples, measures of central tendency and variability, and the intriguing realms of probability and conditional probability. You’ll also explore probability distributions, the central limit theorem (CLT), experimental design, hypothesis testing, and confidence intervals. This chapter concludes with a focus on regression and correlation, giving you comprehensive...

Describing data with descriptive statistics

Descriptive statistics are values that summarize the characteristics of a dataset. Before working on a project, data scientists use descriptive statistics to better understand the dataset they are working with. Think of it like exploring a treasure chest of information, with descriptive statistics as your guide to finding important details.

In your technical interview, you will be expected to understand and use descriptive statistics. In this section, we will look at how to measure the central tendency of our dataset, then explore measures of variability, which describe how dispersed, or spread out, our data is.
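
To make this concrete, here is a minimal sketch of how these descriptive statistics are commonly computed with pandas. The home prices below are made-up, illustrative numbers, not real data:

```python
import pandas as pd

# Hypothetical dataset of home prices (in thousands of dollars)
prices = pd.Series([250, 310, 295, 410, 1200, 330, 285, 365, 295, 505])

# One-call summary of central tendency and variability
print(prices.describe())   # count, mean, std, min, quartiles, max

# Individual descriptive statistics
print("Mean:  ", prices.mean())
print("Median:", prices.median())
print("Std:   ", prices.std())
print("Range: ", prices.max() - prices.min())
```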

Measuring central tendency

We are exposed to measures of centrality every day. For instance, if you live in the US, you might have heard that home prices in the state of California are, on average, higher than in the state of Ohio. Of course, this doesn’t mean that every home in California is more expensive...
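
As a quick illustration of why the choice of measure matters, the following sketch uses invented price figures (not real housing data) to show how a single extreme value pulls the mean away from the median:

```python
import numpy as np

# Hypothetical samples of home prices (in thousands) for two states
california = np.array([850, 920, 780, 1100, 4500, 990, 870])  # one extreme outlier
ohio       = np.array([210, 240, 195, 260, 230, 250, 225])

for name, prices in [("California", california), ("Ohio", ohio)]:
    print(f"{name}: mean = {prices.mean():.0f}, median = {np.median(prices):.0f}")

# The outlier pulls California's mean well above its median,
# which is why the median is often preferred for skewed data such as prices.
```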

Introducing populations and samples

Statistics is the art of extracting meaningful insights from data, and it all begins with a thorough understanding of populations and samples. In this section, we will explore the fundamental concepts that underpin statistical analysis by distinguishing between populations and samples.

Understanding these concepts is important because they form the basis for generalizing observations from a subset of data to a larger group. By investigating the intricacies of populations and samples, you will gain the necessary tools to make sound inferences and draw reliable conclusions from the data you encounter. So, let’s embark on this enlightening journey and uncover the foundations of statistical analysis.

Defining populations and samples

In the realm of statistics, a population refers to the entire group of individuals, objects, or events that we are interested in studying. For instance, if we wanted to research the average height of all adults...
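
A small sketch of this idea, using simulated heights rather than real measurements: the sample’s statistics approximate, but rarely equal, the population’s.

```python
import numpy as np

rng = np.random.default_rng(42)

# Treat these 100,000 simulated adult heights (in cm) as the full population
population = rng.normal(loc=170, scale=10, size=100_000)

# In practice, we can usually only measure a sample drawn from that population
sample = rng.choice(population, size=500, replace=False)

print("Population mean:", population.mean().round(2))
print("Sample mean:    ", sample.mean().round(2))  # close, but not identical
```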

Understanding the Central Limit Theorem (CLT)

Now that we’ve learned about sampling, it’s time to introduce one of the most important concepts in classical statistics – the Central Limit Theorem (CLT).

The CLT

Measuring the center of data is not as simple as just calculating the mean, median, or mode. The CLT states that regardless of the original population distribution’s shape, when we repeatedly take samples from that population and each sample is sufficiently large, the distribution of the sample means will approximate a normal distribution. This approximation becomes more accurate as the size of each sample becomes larger. This theorem plays a crucial role in measuring centrality by allowing us to make reliable estimates using these measures. In turn, the CLT enables us to estimate the population mean with greater accuracy, making the mean a powerful tool for summarizing data. It also indirectly influences the estimation of the median and...
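
The following sketch simulates the CLT with NumPy: it draws repeated samples from a deliberately skewed (exponential) population and shows that the sample means cluster around the population mean with a spread close to the theoretical standard error. The population size, sample size, and number of repetitions are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A strongly right-skewed population (exponential), far from normal
population = rng.exponential(scale=2.0, size=1_000_000)

# Repeatedly draw samples of size 50 and record each sample's mean
sample_means = np.array(
    [rng.choice(population, size=50).mean() for _ in range(5_000)]
)

# Per the CLT, these means are approximately normally distributed,
# centered on the population mean with standard error sigma / sqrt(n)
print("Population mean:      ", population.mean().round(3))
print("Mean of sample means: ", sample_means.mean().round(3))
print("Std of sample means:  ", sample_means.std().round(3))
print("Theoretical std error:", (population.std() / np.sqrt(50)).round(3))
```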

Shaping data with sampling distributions

If you’ve ever taken an introductory statistics course, you were probably taught that theoretical distributions (such as the ones we will discuss in this section) are a way to describe the central tendency and variability of a given numeric variable. Depending on the situation, it’s often more appropriate to use one distribution over another. Although this is an accurate summary of probability distributions, it’s important to understand why we use them and how you should think about them in a data science context (rather than the social sciences context in which traditional introductory statistics courses are often taught).

Probability distributions

Probability distributions are fundamental concepts in statistics and probability theory that describe the likelihood of various outcomes in a random experiment or process. In the world of data science, these distributions play a crucial role in modeling and...
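
As a brief sketch of how common distributions are evaluated in Python, the following uses scipy.stats with arbitrary example parameters chosen purely for illustration:

```python
from scipy import stats

# Binomial: probability of exactly 7 heads in 10 fair coin flips
print(stats.binom.pmf(k=7, n=10, p=0.5))

# Normal: probability that a standard normal value falls below 1.96
print(stats.norm.cdf(1.96))

# Poisson: probability of at most 2 events when the mean rate is 4
print(stats.poisson.cdf(2, mu=4))
```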

Testing hypotheses

In this section, we will review hypothesis testing, which is a statistical method that’s used to make inferences about population parameters based on sample data. It involves formulating two competing hypotheses – the null hypothesis (H₀) and the alternative hypothesis (Hₐ) – and then using sample data to determine which hypothesis is more likely to be true.

The null hypothesis, or what I like to call “business as usual,” is the default assumption or status quo for any given scenario. It’s also often considered the “least interesting” scenario. For example, if I want to test whether changing my sneakers makes me a better runner, the null hypothesis is that the sneakers do not affect my running ability – that there is no significant difference, effect, or relationship between the variables. Oftentimes, researchers are interested in rejecting the null hypothesis.

The alternative hypothesis is the opposite...
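
To tie this back to the sneaker example, here is a minimal sketch of a two-sample t-test using scipy.stats. The run times are simulated rather than real data, and the 0.05 significance level is just a common convention:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical 5K run times (minutes): old sneakers vs. new sneakers
old_shoes = rng.normal(loc=25.0, scale=1.5, size=30)
new_shoes = rng.normal(loc=24.3, scale=1.5, size=30)

# H0: the mean run time is the same with either pair of sneakers
# Ha: the mean run times differ
t_stat, p_value = stats.ttest_ind(old_shoes, new_shoes)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the sneakers appear to make a difference")
else:
    print("Fail to reject H0: no significant difference detected")
```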

Understanding Type I and Type II errors

In hypothesis testing, there is always a chance of making errors:

  • A Type I error occurs when we reject the null hypothesis when it is true (this is also known as a false positive)
  • A Type II error occurs when we fail to reject the null hypothesis when it is false (this is also known as a false negative):
Figure 8.7: Type I error vs. Type II error

Understanding the nuances and implications of Type I and Type II errors is fundamental to hypothesis testing. In Figure 8.7, we see that a Type I error occurs at the intersection of the null hypothesis being true and the action of rejecting the null hypothesis. This is similar to a pregnancy test coming back positive when the woman is not, in fact, pregnant (also known as a false positive result).

Similarly, a Type II error occurs when the null hypothesis is false, but we incorrectly fail to reject it. This is like having a pregnancy test that tells...
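
One way to build intuition for Type I errors is to simulate many experiments in which the null hypothesis is actually true and count how often a standard test rejects it. The sketch below, with arbitrary sample sizes, should produce a false positive rate close to the chosen significance level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000

# Both groups come from the same distribution, so the null hypothesis is true
# and every rejection is a Type I error (false positive)
false_positives = 0
for _ in range(n_experiments):
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_positives += 1

# The observed false-positive rate should be close to alpha (about 0.05)
print("Type I error rate:", false_positives / n_experiments)
```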

Summary

In this chapter, we dove into the core fundamentals of data mining with statistics, which are often assessed during data science interviews. We reviewed the basics of probability, how to describe data using different measures of centrality and variability, how to estimate variables with population sampling, the relevance of the CLT and the assumption of normality, and probability distributions and hypothesis testing. By learning these principles, you will be able to identify and describe relevant data statistics and make testable hypotheses. You will also avoid being fooled by misused statistics that distort your understanding of data.

Be aware that some interviewers will ask theoretical questions while others will want you to work out the solution to a problem. In either case, statistics is the backbone of many machine learning algorithms and experimental designs, which are prominent in data science across all industries.

In the next chapter, we will...
