
Loops, Functions, and String Processing

Sometimes, magic one-liners are insufficient for manipulating data. Loops and conditionals enable us to iterate over data in interesting ways, without being restricted to a tool's default behavior.

Bash views non-binary files and streams as collections of characters. We commonly think of these characters as groups of strings separated by some kind of whitespace. It makes sense that some of the most useful and common tools in the command-line universe are the ones that search and manipulate these strings.

The following topics will be covered in this chapter:

  • for loops
  • while loops
  • File test conditionals
  • Numeric comparisons
  • String case statements
  • Using regular expressions and grep to search and filter
  • String transformations using awk, sed, and tr
  • Sorting lists of strings with sort and uniq

Along the way, we'll see how we can pipe the results of one program...

Once, twice, three times a lady loops

Few command-line tools have implicit looping or conditionals built into them; most simply operate on each line of an input stream and then terminate. The shell provides just enough control flow and conditionals to solve many complex problems, making up for any deficiencies command-line tools have when operating on data.

The almighty for loop is a common loop idiom; however, bash's for loop might feel a little unfamiliar to users of more traditional languages. The for loop allows you to iterate over a list of words, assigning each one to a variable for processing. For example (pun intended):
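As a minimal sketch (the list of words here is purely illustrative), a bash for loop walks a list and assigns each word to the loop variable in turn:

for fruit in apple banana cherry; do
    echo "I like ${fruit}"
done

This prints one line per word: I like apple, I like banana, I like cherry.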

Often, we want a more traditional range of numbers in our for loops. The POSIX method of generating a number range is to use the seq command, as in $(seq 1 1 5), which will generate numbers from 1 (the first argument) to 5...
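As a rough sketch of how seq plugs into a for loop (using the 1-to-5 range from the example above):

for i in $(seq 1 1 5); do
    echo "Number ${i}"
done

Note that bash also offers brace expansion, as in {1..5}, which produces the same range without spawning a separate seq process.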

It's the end of the world as we know it while and until

Let's explore two more options for assisting with iteration. The while construct allows for the repetitive execution of a list or set of commands as long as the command that controls the while loop exits successfully. Let's see an example:

Let's say I wanted to print the "hello!" string four times in a script—no more and no less. We can do so with the following:
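A minimal sketch of such a script, using the i variable and the while [ $i -lt 4 ] test discussed below:

#!/bin/bash
# Print "hello!" exactly four times
i="0"
while [ "$i" -lt 4 ]; do
    echo "hello!"
    i=$((i + 1))
done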

Let's save and run this script to see what happens.

Don't forget to chmod +x these scripts to make them executable.

Executing the script produces the following:
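With the sketch above, that's simply four lines:

 hello!
 hello!
 hello!
 hello!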

Notice that, in the script, we created a variable called i="0". This sets the i variable to zero. Do you see the while [ $i -lt 4 ] block? This allows the loop to keep running as long as the i variable is less than the integer 4. Go ahead and play around with this...

The simple case

Frequently, string comparison is done using the test operator, [. This is ill-advised in bash, as there's a much more convenient format for string comparison, using the case statement. Here's a simple example:


testcase() {
  # "for VAR" with no list iterates over the function's arguments ("$@")
  for VAR; do
    case "${VAR}" in
      '') echo "empty";;
      a) echo "a";;
      b) echo "b";;
      c) echo "c";;
      *) echo "not a, b, c";;
    esac
  done
}
testcase '' foo a bar b c d

The testcase function lets us test the case statement by wrapping it in a for loop that assigns each function argument to the VAR variable, then executes the case statement. With the foo a bar b c d arguments, we can expect the following output:

empty
not a, b, c
a
not a, b, c
b
c
not a, b, c

Pay no heed to the magician redirecting your attention

Looping is great for working over sequences of data in an iterative fashion, but sometimes, when you're doing all that work, you get lots of irrelevant output. Enter our little magician: the output redirection operator, >. This operator directs output to a specified file or file descriptor. We've talked about file descriptors before: they are integers that the OS uses to identify a file handle that has been opened, and by default three are opened for every process: stdin, stdout, and stderr. The default file descriptors, denoted by fd#, are fd0 for standard input, fd1 for standard output, and fd2 for standard error. The > operator, by default, redirects stdout (the equivalent of 1>) unless it's preceded by an integer file descriptor. Let's see some examples of output redirection, before we get lost...
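Here are a few illustrative sketches (the filenames are placeholders):

# Redirect stdout (fd1) to a file, overwriting it; stderr still reaches the terminal
ls /etc > listing.txt

# Redirect only stderr (fd2) to a file
ls /does/not/exist 2> errors.txt

# Append stdout to a file instead of overwriting it
echo "another line" >> listing.txt

# Discard all output by pointing fd1 at /dev/null and fd2 at fd1
ls /etc /does/not/exist > /dev/null 2>&1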

Regular expressions and grep

One key task you will face over and over is matching particular patterns of text. The match might be as simple as finding one instance of a specific string in a body of text, or it could be much more complicated. A great tool for matching text is the language of regular expressions. A regular expression is an abstract way of expressing certain types of string-matching patterns.

Contrary to popular belief, regular expressions can't match everything you might want to match. They're limited to certain types of matches, and depending on the particular flavor of regular expression implementation, they could have a little more or a little less power. As an academic exercise, one might try to characterize exactly what you can match and what you can't. It's a very interesting endeavor that cuts to the very core of theoretical computer science...
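As a quick sketch of grep in action (logfile.txt is a placeholder name):

# Print lines containing the literal string "error"
grep 'error' logfile.txt

# Case-insensitive extended regular expression: lines containing "error" or "warn"
grep -Ei 'error|warn' logfile.txt

# Anchors: ^ matches the start of a line and $ the end, so this finds empty lines
grep '^$' logfile.txt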

awk, sed, and tr

In this section, we will be looking at awk, sed, and tr.
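Before diving into each tool, here is a hedged taste of the kind of one-line transformations sed and tr perform (input.txt is a placeholder):

# sed: substitute the first occurrence of "cat" with "dog" on every line
sed 's/cat/dog/' input.txt

# tr: translate lowercase characters to uppercase on stdin
tr 'a-z' 'A-Z' < input.txt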

awk

awk (including the GNU implementation, gawk) is designed for streaming text processing, data extraction, and reporting. An awk program is structured as a set of patterns to be matched and actions to take when those patterns are matched:

pattern {action}
pattern {action}
pattern {action}

For each record (usually each line of text passed to awk), each pattern is tested to see whether the record matches, and if so, the action is taken. Additionally, each record is automatically split into a list of fields by a delimiter (any run of whitespace by default). The default action, if none is given, is to print the record. The default pattern is...
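A minimal sketch of this pattern/action structure (the file name and field positions are illustrative):

# Print the second field of every record (line) that starts with "user"
awk '/^user/ { print $2 }' accounts.txt

# A pattern with no action prints matching records in full (the default action)
awk '$3 > 100' accounts.txt

# An action with no pattern runs for every record
awk '{ print NF, $0 }' accounts.txt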

Summary

In this chapter, we covered the breadth of bash's control structures and dove into input/output redirection. These features can be leveraged to enhance your command-line functions and enable small scripts that loop over data, without having to resort to a full-fledged programming language for simple processing tasks.

We also looked at a lot of ways to slice and dice characters and strings. While many use cases may be covered using string manipulation alone, often we'll want to delve a little deeper into the data represented by these streams to extract useful information.

In the next chapter, we'll look at doing this by using the command line and data streams as a database.
