
Shell Workflows, and Data Acquisition and Massaging

In this chapter, we're going to work on an actual dataset and do some basic analysis. We'll learn how to download files straight from the command line, determine what type of files they are, and parse the data using a number of commands. We'll also cover how to perform non-interactive detached processing and review some common terminal multiplexers, which let us prettify the command line as well as organize detached processing.

In this chapter, we'll cover the following topics:

  • How to download a dataset using the command line
  • Using built-in tools to inspect the data and its type
  • How to perform a word count in bash
  • Analyzing a dataset with some simple commands
  • Detached processing
  • Terminal multiplexers

Download the data

Now that we have an understanding of the command line, let's do something cool with it! Say we had a couple of datasets full of book reviews from Amazon, and we wanted to view only the reviews about Packt Publishing. First, let's go ahead and grab the data (if you are using the Docker container, the data is located in /data):

curl -O https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz && curl -O https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz

curl displays a progress meter for each file as it downloads.

We're introducing a couple of new commands and features here to download the files. First, we call the curl command to fetch each file. You can run curl --help to view all of the options available, or man curl, but we wanted to download a remote file and save it as the original...
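For reference, -O is simply shorthand for saving under the remote filename; the equivalent long form names the output file explicitly with -o (shown here for the first file only):

curl -o amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz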

Using the file command

Once the data is done downloading, let's take a look and see what we've got. Go ahead and run ls -al amazon* to make sure the files actually downloaded.

If you have anything else in this directory named amazon, that will show up as well. Now that the files are downloaded, let's introduce a new command called file. Go ahead and run file amazon*:
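The exact wording depends on your system's magic database, but the output should look roughly like this:

amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz: gzip compressed data
amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz: gzip compressed data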

Wow, without any parameters set, the file command was able to figure out that these are compressed archives. You'll use the file command a lot to determine the type of files you're working with. Let's decompress the files so we can work with them. This might take a little bit, depending on the speed of your system.

To do so, run the following:

zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz >> amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv...
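If you'd rather not type the command once per archive, a small loop (a sketch using the same zcat approach, not from the original text) decompresses both:

for f in amazon_reviews_us_Digital_Ebook_Purchase_v1_0*.tsv.gz; do
  zcat "$f" > "${f%.gz}"   # strip the .gz suffix for the output filename
done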

Performing a word count

Now that we have some data to work with, let's combine the two files into a single file. To do so, perform the following:

cat *.tsv > reviews.tsv

There is no output to see here; cat silently writes the concatenated data to reviews.tsv.
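To double-check that the concatenation worked, a quick size listing (our own sanity check, not part of the original walkthrough) does the trick:

ls -lh reviews.tsv   # should be roughly the combined size of both inputs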

Excellent. Let's say we wanted to count how many words or lines are in this file. Let's introduce the wc command. wc is short for (you guessed it) word count. Let's quickly man wc to see the options available:
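The man page lists quite a few flags; the ones that matter for us boil down to these:

wc -l reviews.tsv   # count lines
wc -w reviews.tsv   # count words
wc -c reviews.tsv   # count bytes
wc -m reviews.tsv   # count characters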

Looks like wc can count the lines and also the words of a file. Let's see how many lines our file actually has:

wc -l reviews.tsv

wc prints the line count followed by the filename.

That's a lot of lines! What about words? Run the following:

wc -w reviews.tsv
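You can also ask for both counts in one shot; wc always prints them in line, word order regardless of how you order the flags:

wc -lw reviews.tsv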

This looks like a great dataset to use. It's not big data by any means, but there's a lot of cool stuff we...

Introduction to cut

Let's break the command down before you run it. The cut command removes sections from each line of a file. The -d parameter sets the field delimiter (a tab, since we're working with a TSV, or tab-separated values, file), and the -f parameter tells cut which fields we're interested in. Since product_title is the sixth field in our file, we started with that:

cut -d$'\t' -f 6,8,13,14 reviews.tsv | more
Unlike most programs, cut numbers fields starting at 1 instead of 0.
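A quick toy example (not from the dataset) makes the 1-based numbering obvious; cut's default delimiter is already a tab, so no -d is needed:

printf 'one\ttwo\tthree\n' | cut -f 2   # prints "two", the second field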

Piping into more lets us page through the trimmed output one screen at a time.

Much better! Let's go ahead and save this as a new file:

cut -d$'\t' -f 6,8,13,14 reviews.tsv > stripped_reviews.tsv

Again, the redirect means nothing is printed to the terminal; the four selected fields are written to stripped_reviews.tsv.
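To spot-check the new file (our own quick look, not in the original excerpt), peek at the first few lines:

head -n 3 stripped_reviews.tsv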

Let's see how many times the word Packt shows up in this dataset:

grep -io packt stripped_reviews.tsv | wc -l

Since grep -o prints each match on its own line, the command outputs a single number: the total count of Packt mentions in the file.
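A related trick worth knowing: grep -c counts matching lines rather than individual occurrences, which is handy when a single review mentions Packt several times:

grep -ic packt stripped_reviews.tsv   # number of lines containing "packt", case-insensitively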

Let's...

Detached processing

Detached processing runs a command in the background. This means that terminal control returns immediately to the shell while the detached process runs in the background. With job control, these backgrounded processes can be resumed in the foreground or killed directly.
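In bash, the job-control builtins look like this (a quick reference, assuming an interactive shell):

jobs      # list background jobs and their job numbers
fg %1     # bring job 1 back to the foreground
bg %1     # resume a stopped job in the background
kill %1   # terminate job 1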

How to background a process

Remember when we used the double ampersand to conditionally execute two commands, one after another? By using a single ampersand, you can fork a process into the background and let it run. Let's run a command that saves its output to a new file while it runs in the background:

cat all_reviews.csv | awk -F "," '{print $4}' | grep -i Packt > background_words.txt &

This will take...

Summary

In this chapter, we only scratched the surface of what we can do with the command line. We were able to download a dataset, save it, inspect the file type, and perform some simple analytics. The word count example is considered the "Hello, World" of data science, and we saw just how easy it is to perform in bash.

We then took our shell setup to the next level by using terminal multiplexers and background processes. Think of it like using an IDE, but for the command line. It will make working with bash a lot easier.

Being able to control processes and workflows will improve productivity. Detached processing ensures programs can complete without interruption, while the terminal multiplexer maximizes the use of screen real estate and also provides a detached processing environment: a double win.

In the next chapter, we...
