You're reading from  Cracking the Data Science Interview

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781805120506
Edition: 1st Edition
Authors (2):
Leondra R. Gonzalez

Leondra R. Gonzalez is a data scientist at Microsoft and Chief Data Officer for tech startup CulTRUE, with 10 years of experience in tech, entertainment, and advertising. During her academic career, she has completed educational opportunities with Google, Amazon, NBC, and AT&T.

Aaren Stubberfield

Aaren Stubberfield is a senior data scientist for Microsoft's digital advertising business and the author of three popular courses on Datacamp. He graduated with an MS in Predictive Analytics and has over 10 years of experience in various data science and analytical roles focused on finding insights for business-related questions.

Scripting with Shell and Bash Commands in Linux

In this brief chapter, we’ll delve into shell and Bash scripting with Linux, covering basic navigation, control statements, functions, data processing and pipelines, and database operations. Additionally, you’ll learn how to leverage the cron command for task scheduling and, importantly, how to run Python programs from the command line.

Although you are unlikely to be tested on Linux commands during a data science interview, learning them will better prepare you to use data science-adjacent technologies that rely on the command line. In this chapter, we will cover the following topics:

  • Introduction to operating systems
  • Navigating system directories
  • File and directory manipulation
  • Scripting with Bash
  • Introducing control statements
  • Creating functions
  • Processing data and pipelines
  • Using cron

Introducing operating systems

An operating system (OS) is a software program that acts as an intermediary between computer hardware and user applications. You’re probably familiar with Windows, Android, and iOS, which are all different types of operating systems with their own unique features and applications.

Linux is an open-source OS known for its Unix-like architecture, allowing users to configure and modify the system according to their specific needs. Like other Unix-like systems, it arranges files and directories in a hierarchical structure. The root directory is at the very top of this hierarchy, denoted by a forward slash (/).

The root directory is the top-level directory in an OS filesystem’s tree-like hierarchy and is the starting point for all other directories and files. For example, if you see a file path such as /home/user/file.txt, the leading forward slash indicates an absolute path, that is, a location specified starting from the root directory. That location...
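
To make the path structure concrete, the short sketch below splits the example path into its directory and file parts and then moves to the root directory. It assumes a POSIX shell; the path is only treated as text, so it does not need to exist on your machine.

```shell
path='/home/user/file.txt'     # example path from the text; need not exist
parent=$(dirname "$path")      # directory part of the path
name=$(basename "$path")       # final component of the path

cd / || exit 1                 # change to the root directory
here=$(pwd)                    # print working directory

echo "parent: $parent"
echo "name:   $name"
echo "cwd:    $here"
```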

File and directory manipulation

Managing files and directories is a fundamental skill when working in a Unix-based environment. As a data scientist, you’ll frequently need to create, delete, move, and copy files and directories, and depending on the systems you use, these commands may become part of your daily work. In technical interviews, however, these topics come up only occasionally, so we will quickly review just a few core operations here.

The following list will explain these operations and discuss how to manipulate file and directory contents:

  • Creating files: To create a new file, use the touch command followed by the name of the file you want to create. For instance, to create a file named analysis.py, you would use the following command:
    touch analysis.py
  • Creating directories: To create a new directory, use the mkdir command. For example, to create a directory named new_data, use the following:
    mkdir new_data...
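
The core operations named above (create, copy, move, and delete) can be tried safely in a throwaway directory. The following is a minimal sketch, assuming a POSIX shell with mktemp available; raw.csv, backup.csv, and archive are hypothetical names chosen for illustration.

```shell
# Work in a temporary directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch analysis.py                          # create an empty file
mkdir new_data                             # create a directory
touch new_data/raw.csv                     # hypothetical data file

cp new_data/raw.csv new_data/backup.csv    # copy a file
mkdir archive
mv new_data/backup.csv archive/            # move the copy into another directory
rm analysis.py                             # delete a file

ls -R                                      # show the resulting tree
```

Deleting from the shell is permanent, so it is worth double-checking a path before running rm, especially with the -r option, which removes a directory together with its contents.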

Scripting with Bash

Bash (Bourne Again SHell) is one specific shell implementation that has gained widespread popularity and is the default shell for many Linux distributions. Bash scripts can automate repetitive tasks, handle file and text manipulation, control job scheduling, and much more.

Note

While Bash is a specific shell, the term “shell” is more generic and encompasses other shell implementations.

A Bash script is a plain text file that contains a series of commands. These scripts can be used to automate entire workflows and complex processes that you’d otherwise have to perform command by command on the command line.

To create a Bash script, write it in a text editor and save it under any name with the .sh extension. For example, you might name your script script.sh. You can also use Vim like so:

Figure 6.6: Creating a Bash script

In Figure 6.6, we are creating a Bash script using vi, and then...
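
As a non-interactive alternative to an editor, a script file can also be written with a here-document and then run. This is a minimal sketch, not a reproduction of the figure; the script contents and the temporary directory are assumptions made for illustration, and it presumes bash is installed.

```shell
# Create the script in a temporary directory (illustrative location).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write script.sh with a quoted here-document so nothing is expanded early.
cat > script.sh <<'EOF'
#!/bin/bash
echo "Hello from script.sh"
EOF

# Run the script by passing it to the interpreter explicitly.
output=$(bash script.sh)
echo "$output"

# Mark it executable; afterwards it can also be started as ./script.sh,
# in which case the shebang line selects the interpreter.
chmod +x script.sh
```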

Introducing control statements

Control statements, including conditional statements and loops, are an integral part of shell scripting, allowing you to incorporate decision-making and repetitive tasks in your scripts. As a data scientist, you might use control statements when automating data preprocessing, running different analyses based on certain conditions, or when building complex pipelines. This section will introduce the most commonly used control statements in Bash scripting.

Just like other programming languages, Bash provides conditional statements to control the flow of execution. The most common conditional statements in Bash are if, if-else, and if-elif-else.

Let’s take a look at a simple if statement:

#!/bin/bash
x=10
# -gt is the numeric "greater than" test; quoting $x guards against empty values
if [ "$x" -gt 5 ]
then
  echo "x is greater than 5"
fi

In this script, if the value of x is greater than 5, the message x is greater than 5 is printed to the console.
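
The if-elif-else form and loops follow the same pattern. The sketch below combines the two; the sample values and the count variable are assumptions chosen for illustration rather than taken from the chapter.

```shell
#!/bin/bash
count=0                          # how many values exceed the threshold
for x in 3 5 10; do              # sample values, chosen for illustration
  if [ "$x" -gt 5 ]; then
    echo "$x is greater than 5"
    count=$((count + 1))         # arithmetic expansion increments the counter
  elif [ "$x" -eq 5 ]; then
    echo "$x is equal to 5"
  else
    echo "$x is less than 5"
  fi
done
echo "Values greater than 5: $count"
```

Each branch is tested in order, and only the first one whose condition succeeds is executed.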

As you can see, control statements are often paired...

Creating functions

Functions in Bash are blocks of reusable code that perform a certain action. They help structure scripts and avoid repetitive code, making scripts easier to maintain and debug. In data science, you might use Bash functions to perform recurring tasks such as loading data, processing files, or managing resources.

A function in Bash is declared with the following syntax:

function_name() {
  # Code here
}

function_name is the name of the function, which you’ll use to call it. The code inside the curly braces {} is the body of the function.

Here’s an example of a function that prints a greeting:

greet() {
  echo "Hello, $1"
}

This greet function prints “Hello” followed by the first argument passed to it. The $1 part is a special variable that refers to the first argument.

Once a function is defined, it can be called by its name. For example, to call the greet function, you would write the following...
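
A minimal sketch of defining and then calling greet is shown here; the argument values and the message variable are illustrative.

```shell
#!/bin/bash
greet() {
  echo "Hello, $1"
}

# Call the function by name and pass the argument after it.
greet "World"

# Command substitution captures what the function prints.
message=$(greet "data scientist")
echo "$message"
```

Note that arguments are separated by spaces rather than wrapped in parentheses, so a value containing spaces must be quoted to arrive as a single $1.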

Processing data and pipelines

As a data scientist, you often need to handle and process large datasets. Bash provides powerful tools for data processing and creating pipelines, which are sequences of processes chained by their standard streams. This allows the output of one command to be passed as input to the next. Several commands in Bash are incredibly useful for data processing. Here are a few examples:

  • cat: Concatenates and displays the content of files.
  • cut: Removes sections from lines of files.
  • sort: Sorts lines in text files.
  • uniq: Removes adjacent duplicate lines, which is why its input is usually sorted first.
  • head filename and tail filename: These commands output the first and last 10 lines of a file, respectively. You can specify the number of lines by adding -n, as in head -n 20 filename.

Here’s an example of using cat, sort, and uniq to display the unique lines in a file:

cat filename | sort | uniq

The cat command outputs the contents of the file. The pipe (|)...
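
A slightly longer pipeline can combine several of the commands listed above. The sketch below counts how often each value appears in one column of a small CSV file; the data and the variable names are made up for illustration.

```shell
# Build a small sample CSV (hypothetical data) in a temporary file.
datafile=$(mktemp)
printf '%s\n' \
  'id,city' \
  '1,Paris' \
  '2,Lima' \
  '3,Paris' \
  '4,Oslo' \
  '5,Paris' \
  '6,Lima' > "$datafile"

# Skip the header row, keep the second column, count each distinct value,
# sort the counts in descending order, and keep the two most frequent.
top_cities=$(tail -n +2 "$datafile" | cut -d',' -f2 | sort | uniq -c | sort -rn | head -n 2)
echo "$top_cities"

rm "$datafile"                   # clean up the temporary file
```

Each stage reads the previous stage's output, so no intermediate files are needed. The first sort is what makes identical values adjacent so that uniq -c can count them.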

Using cron

cron is a job scheduler in Unix-like operating systems that allows users to schedule tasks (called cron jobs) to run automatically at specific times or on specific days. As a data scientist, you might use cron to automate tasks such as retrieving data, cleaning data, or running scripts at regular intervals.

The crontab (cron table) command allows you to create, edit, manage, and remove cron jobs. Here’s an example of how you might use the crontab command to view your current cron jobs:

crontab -l

The -l option tells crontab to list the current user’s cron jobs.

To edit your cron jobs, you would use the -e option:

crontab -e

This command opens the current user’s crontab file in the default text editor. If no crontab file exists for the user, this command creates one.

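A cron job is defined by a line in the crontab file. Each line consists of six fields, five time fields (minute, hour, day of month, month, and day of week) followed by the command to run: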

*     *     *   ...
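
As an illustration of a complete entry, the sketch below assembles a crontab line that would run a Python script every day at 02:00. The interpreter, script, and log paths are hypothetical, and the line is only printed here, not installed.

```shell
# Five time fields: minute, hour, day of month, month, day of week.
schedule='0 2 * * *'

# Hypothetical paths; replace with your own interpreter, script, and log file.
job='/usr/bin/python3 /home/user/analysis.py >> /home/user/analysis.log 2>&1'

# A crontab entry is the schedule followed by the command.
cron_line="$schedule $job"
echo "$cron_line"

# To install it, run `crontab -e` and paste the printed line into the file.
```

Redirecting both standard output and standard error to a log file, as the hypothetical command does, is a common way to keep a record of what a scheduled job printed.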

Summary

In this chapter, we covered a broad range of topics related to basic shell and Bash scripting and command-line operations for data scientists.

We began with an overview of navigating the file and directory structure of a local computer or a virtual machine from the command line, explaining the basic commands for directory navigation. Then, we moved on to file and directory manipulation. In the subsequent sections, we delved into Bash scripting topics, discussing control statements and the use of Bash functions to create reusable pieces of code. We highlighted data processing and pipelines, demonstrating how to chain commands together to process text data. We also covered cron jobs for scheduling tasks and provided an overview of their syntax.

Gaining fluency in Bash scripts and basic shell commands will prepare you to engage with a variety of other CLI technologies commonly used in data science, such as interfacing with cloud providers (e.g., AWS, Azure...
