You're reading from Cracking the Data Science Interview

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781805120506
Edition: 1st Edition
Authors (2):
Leondra R. Gonzalez

Leondra R. Gonzalez is a data scientist at Microsoft and Chief Data Officer for tech startup CulTRUE, with 10 years of experience in tech, entertainment, and advertising. During her academic career, she has completed educational opportunities with Google, Amazon, NBC, and AT&T.

Aaren Stubberfield

Aaren Stubberfield is a senior data scientist for Microsoft's digital advertising business and the author of three popular courses on Datacamp. He graduated with an MS in Predictive Analytics and has over 10 years of experience in various data science and analytical roles focused on finding insights for business-related questions.


Programming with Python

Starting with this chapter, we transition into preparing you for the technical portion of data science job interviews. This second part of the book is therefore best used as a study and quick-reference guide, so feel free to skip or review chapters according to your study needs.

In each of the following chapters, we will review key concepts and provide sample problems. It is therefore important that you are at least familiar with introductory programming concepts, preferably including functional programming. These include, but are not limited to, syntax, data types, variables and assignment, control flow, and packages such as pandas and numpy for data wrangling.

By the end of this chapter in particular, you will have a handle on expected Python questions within a data science interview, and know how to tackle them logically. Additionally, you will be more comfortable and confident with thinking through...

Using variables, data types, and data structures

In Python, variables are the building blocks of any code. A variable is simply a name that refers to a value (an object) of some given type. For example, if I set a variable called x equal to 10, the variable x now holds that value (until it is changed). In short, variables are used to store data. Unlike some other programming languages, such as Java, Python does not require you to declare a variable's type explicitly. The type is determined automatically from the value you assign (although you can, and should, convert between data types as needed). There are several built-in data types in Python. Here are some common ones:

  • Numeric types: Python has several numeric data types, including int (integers), float (floating-point numbers), and complex (complex numbers). Numeric variables are used to store numerical data:
    • Integers represent whole numbers without any fractional or decimal part. They can be positive...
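As a minimal sketch of the points above (the variable names are illustrative, not taken from the book), the following shows that a name takes on the type of whatever value is assigned to it, and how the three numeric types and explicit conversions behave:

```python
# The type belongs to the value; the name x simply refers to it.
x = 10
type_before = type(x).__name__    # "int"

x = 10.5                          # rebinding x to a float needs no declaration
type_after = type(x).__name__     # "float"

# The three built-in numeric types
whole = 42                        # int
fraction = 3.14                   # float
z = 2 + 3j                        # complex

# Explicit conversion between types
as_int = int("7")                 # string to int
as_float = float(whole)           # int to float
truncated = int(3.99)             # int() drops the fractional part

print(type_before, type_after)
print(type(whole).__name__, type(fraction).__name__, type(z).__name__)
print(as_int, as_float, truncated)
```

Note that int() truncates toward zero rather than rounding, which is a common source of off-by-one surprises in interview problems.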

Indexing in Python

To access values within a data object, we use indexing. Indexing is the process of accessing individual elements within a data structure. In this case, the data structure is a list, but as you will soon learn, indexing is applicable to many data structures.

Note

Each element or item within an ordered data structure is assigned a unique index or position, starting from a specific value. In Python, this value is 0. This means that the first position in any Python sequence (such as a list, tuple, or string) is located at index 0, followed by the second position, which is located at index 1, and so on.

Indexing allows you to retrieve or manipulate specific elements within the data structure by specifying their index. It provides a way to refer to elements individually rather than accessing the entire data structure as a whole.

The basic syntax for indexing a list or tuple in Python is as follows:

list_or_tuple_name[index_position]

The list_or_tuple_name object is the name of the list...
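To make the zero-based indexing syntax concrete, here is a short sketch using a hypothetical list and tuple (the names fruits and point are assumptions for illustration). It also shows negative indexing, which counts from the end of the sequence:

```python
fruits = ["apple", "banana", "cherry"]   # a list
point = (3, 4)                           # a tuple

first = fruits[0]        # the first element lives at index 0
second = fruits[1]       # the second element lives at index 1
last = fruits[-1]        # negative indexes count back from the end
x_coord = point[0]       # tuples are indexed the same way

# Lists are mutable, so an index can also be used to replace an element.
fruits[1] = "blueberry"

# Tuples are immutable, so assigning to an index raises a TypeError.
try:
    point[0] = 5
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

print(first, second, last, x_coord)
print(fruits, tuple_is_immutable)
```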

Using string operations

String operations are very common when working with Python and text data. Therefore, this section will review how to initialize a string, string indexing/slicing, and some common string methods.

Note

We will not review string regular expressions, as this is a large topic with significant depth. Check out Mastering Python Regular Expressions by Félix López and Víctor Romero for more instruction on this topic.

Initializing a string

Python allows for string initialization (creation) in several ways. Two ways include single quotes ('') and double quotes (""):

# Single quotes
s = 'Hello, World!'
print(s)  # prints: Hello, World!
# Double quotes
s = "Hello, World!"
print(s)  # prints: Hello, World!

Single and double quotes are basically interchangeable. The only difference comes into play when you have a quote mark (single or double) inside a string. For example, one common scenario is...
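The following sketch illustrates the quoting point with example strings of our own choosing, and then previews the string indexing, slicing, and methods this section promises to review. All of the calls used are standard str methods:

```python
# A single quote inside a double-quoted string needs no escaping, and vice versa.
s1 = "It's a beautiful day"
s2 = 'She said, "Hello!"'

# The alternative is to escape the inner quote with a backslash.
s3 = 'It\'s a beautiful day'
same_text = (s1 == s3)

# Indexing and slicing work on strings just as they do on lists.
s = "Hello, World!"
first_char = s[0]        # character at index 0
word = s[7:12]           # characters at indexes 7 through 11

# A few common string methods (each returns a new string or list)
upper = s.upper()
replaced = s.replace("World", "Python")
parts = s.split(", ")

print(s2)
print(same_text, first_char, word)
print(upper, replaced, parts)
```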

Using Python control statements, loops, and list comprehensions

Control statements are used for various tasks. For example, they are used to filter data based on certain conditions, perform a calculation on each item in a list, iterate through rows in a dataframe, and more. Additionally, list comprehensions are widely used in data science because they are concise and readable. They are often used in data cleaning and preprocessing tasks, feature engineering, and more.

Control statements in Python allow you to control the flow of your program’s execution based on certain conditions or loops. The main types of control statements are conditional statements (such as if, elif, and else) and loop statements (such as for and while).

Meanwhile, list comprehensions are a shorthand approach to writing loop statements. More specifically, they are a shorter, more concise syntax for creating a list based on the values of an existing list (or any other iterable).
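As a rough illustration of how these pieces relate, the sketch below builds the same list twice, once with a for loop plus an if statement and once with a list comprehension. The data and the label helper are hypothetical examples, not code from the book:

```python
numbers = [1, 2, 3, 4, 5, 6]

# A for loop with a conditional: keep the squares of the even numbers.
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n ** 2)

# The same logic as a list comprehension.
even_squares = [n ** 2 for n in numbers if n % 2 == 0]


def label(n):
    """Bucket a number using if / elif / else (illustrative helper)."""
    if n < 3:
        return "low"
    elif n < 5:
        return "medium"
    else:
        return "high"


labels = [label(n) for n in numbers]

# A while loop repeats until its condition becomes false.
total = 0
i = 0
while i < len(numbers):
    total += numbers[i]
    i += 1

print(even_squares_loop, even_squares)
print(labels, total)
```

The comprehension is usually preferred when the loop body only builds a list; a regular loop tends to read better once the body needs several statements.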

Conditional statements...

Using user-defined functions

Sometimes, you may need to create your own function to perform very specific operations. This is common in the data science world, especially as it relates to data cleaning, preprocessing, and modeling activities.

In this section, we will discuss user-defined functions, which are functions created by the programmer to perform specific tasks. They are not unlike mathematical functions, which (usually) take some inputs and (often) produce some outputs. User-defined functions are designed to take 0 or more inputs, do some specific computation(s) (we’ll just call it stuff), and produce an output.

This process is especially helpful when performing repeated tasks. In fact, the rule of thumb is to write a function whenever you have to do a task more than once. In more advanced cases, user-defined functions also help with code reusability, organization, readability, and maintainability.
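For instance, a small data-cleaning step that would otherwise be repeated can be wrapped in a function. The sketch below is one possible example; the function name clean_name and its default parameter are our own assumptions:

```python
def clean_name(raw_name, default="unknown"):
    """Return a trimmed, title-cased name, or `default` if the input is empty."""
    if raw_name is None:
        return default
    cleaned = raw_name.strip()        # remove leading and trailing whitespace
    if not cleaned:                   # an empty string counts as missing
        return default
    return cleaned.title()


# The function is defined once and reused for every value.
raw_names = ["  ada lovelace", "GRACE HOPPER  ", "", None]
cleaned_names = [clean_name(name) for name in raw_names]

print(cleaned_names)
```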

Breaking down the user-defined function syntax

When used effectively, user...

Handling files in Python

In Python, the built-in open function is used to open a file, and it returns a file object. Once a file is opened, you can read its contents using the read method. However, an important aspect of managing files is ensuring they are closed after use so that the computational resources they hold are released. One way to accomplish this is by using context managers.

A context manager is an object that manages the context of a block of code, typically used with a with statement. It is particularly useful for setting up and tearing down computational resources, such as opening and closing files. In short, the with keyword automatically closes the file once the nested block of code finishes executing (even if an error occurs inside it), which is more concise and reduces the risk of a file not being properly closed.

The syntax to open files using context managers is as follows:

with open(<file_name.csv>) as file_object:
    # Code block...
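A runnable sketch of this pattern follows. It assumes a throwaway file named example.csv created in a temporary directory, so the file name, path, and contents are illustrative only:

```python
import os
import tempfile

# Create a scratch location so the example does not touch real files.
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "example.csv")

# Write a small file; the file is closed when the with block ends.
with open(file_path, "w", encoding="utf-8") as file_object:
    file_object.write("name,score\nada,90\ngrace,95\n")

# Read the whole file back using the read method.
with open(file_path, "r", encoding="utf-8") as file_object:
    contents = file_object.read()

# Outside the block, the file object reports that it has been closed.
closed_after_block = file_object.closed

lines = contents.splitlines()
print(lines, closed_after_block)
```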

Wrangling data with pandas

Data wrangling is one of the most important topics in data science interviews. For starters, data is rarely presented in an analysis-ready format, so it must be preprocessed and checked for quality issues before modeling. As a result, data scientists can spend upward of 80% of their time cleaning and wrangling data [1].

Furthermore, data wrangling skills demonstrate your comfort and fluency with computer programming. The ability to use functions, loops, indexing, aggregation, filtering, and calculations will serve you well in your data science journey, enabling you to complete work quickly and efficiently. These skills are also fundamental for extract, transform, load (ETL) activities, querying data, data modeling, descriptive statistics, reporting, and a host of other data tasks.

In this section, we will review several common data wrangling challenges, including handling missing data, filtering data, merging, and aggregating...
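The sketch below walks through those four tasks on two tiny, made-up dataframes. The column names and values are assumptions for illustration, and filling missing amounts with 0.0 is only one of several reasonable strategies:

```python
import pandas as pd

# Hypothetical example data
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [20.0, None, 35.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", "West", "East"],
})

# Handling missing data: replace the missing amount with 0.0.
orders["amount"] = orders["amount"].fillna(0.0)

# Filtering: keep only the rows where the amount exceeds 25.
large_orders = orders[orders["amount"] > 25]

# Merging: attach each customer's region to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregating: total order amount per region.
region_totals = merged.groupby("region")["amount"].sum()

print(large_orders)
print(region_totals)
```

Whether to fill, drop, or impute missing values depends on the problem; in an interview it is worth stating that choice and its trade-offs out loud.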

Summary

In this chapter, we covered many Python programming fundamentals you would need for your technical interview. First, we covered Python variable data types and string operations, including string indexing. Afterward, we reviewed Python list comprehensions and control statements, including loops. Then we focused on some aspects of Python classes, indexing, merging, sorting, data aggregation, and handling missing data.

It is incredibly important to be proficient in data wrangling and manipulation, which make up a large part of data science interviews and assessments. That emphasis is proportional to how much of a data science job these tasks occupy.

In the next chapter, we will move our focus from Python fundamentals to data visualization and storytelling.

References

  • [1] A Comparative Study of Data Cleaning Tools by Chen, Z., Oni, S., Hoban, S., & Jademi, O., from International Journal of Data Warehousing and Mining (IJDWM) (2019).