Data Science at the Command Line and Setting It Up
"In the beginning... was the command line" Years ago, we didn't have fancy frameworks that handled our distributed computing for us, or applications that could read files intelligently and give us accurate results. If we did, it was very expensive or only worked for a small problem set, very few people had access to this technology, and it was mostly proprietary.
For newcomers to the world of data science, you might have used the command line for a small number of things. Maybe you moved a file from one place to another using mv, or read a file using cat. Or you might have never used the command line at all, or at least not for data science. In this book, we hope to show you a number of tools and ways you can perform some everyday tasks that you can do locally, without using today's buzzword framework.
We created this book for the folks who have little to no experience with the command line, and perform a lot of data extraction, modelling, parsing, and analyzing. This doesn't mean that if you do have a lot of command-line experience (a lot of DevOps and systems folks do), you shouldn't read this book. In fact, you might pick up a couple commands and techniques that you haven't used before.
In this chapter, we will cover the following topics:
- The history of the command line
- Language-focused shells
- Why use the command line?
We will also walk through the setup and configuration of the command line with the following operating systems:
- Windows 10
- Mac OS X
- Ubuntu Linux
If you are running a different operating system, we suggest obtaining an instance from a cloud provider or using the Docker container that's provided in this book.
History of the command line
Since the very first electronic machines, people have strived to communicate with them the same way that we humans talk to each other. But since natural-language processing was beyond the technological grasp of early computer systems, engineers relatively quickly replaced the punch cards, dials, and knobs of early computing machines with teletypes: typewriter-like machines that enabled keyed input and textual output to a display. Teletypes were replaced fairly quickly with video monitors, enabling a world of graphical displays. A novelty of the time, teletypes served a function that was missing in graphical environments, and thus terminal emulators were born for serving as the modern interface to the command line. The programs behind the terminals started out as an ingrained part of the computer itself: resident monitor programs that were able to start a job, detect when it was done, and clean up.
As computers grew in complexity, so did the programs controlling them. Resident monitors gave way to operating systems that were able to share time between multiple jobs. In the early 1960s, Louis Pouzin had the brilliant idea to use the commands being fed to the computer as a kind of program, a shell around the operating system.
"After having written dozens of commands for CTSS, I reached the stage where I felt that commands should be usable as building blocks for writing more commands, just like subroutine libraries. Hence, I wrote RUNCOM, a sort of shell that drives the execution of command scripts, with argument substitution. The tool became instantly popular, as it became possible to go home in the evening and leaving long runcoms to execute overnight."
Scripting in this way, and the reuse of tooling, would become an ingrained trope in the exciting new world of programmable computing. Pouzin's concepts for a programmable shell made their way into the design and philosophy of Multics in the 1960s and its Bell Labs successor, Unix.
In the Bell System Technical Journal from 1978, Doug McIlroy wrote the following regarding the Unix system:
- Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.
- Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
- Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
This is the core of the Unix philosophy and the key tenets that make the command line not just a way to launch programs or list files, but a powerful group of community-built tools that can work together to process data in a clean, simple manner. In fact, McIlroy follows up with this great example of how this had led to success with data processing, even back in 1978:
Having access to simple yet powerful components, programmers needed an easy way to construct, reuse, and execute more complicated commands and scripts to do the processing specific to their needs. Enter the early fully-featured command line shell: the Bourne shell. Developed by Stephen Bourne (also at Bell Labs) in the late 1970s for Unix's System 7, the Bourne shell was designed from the start with programmers like us in mind: it had all the scripting tools needed to put the community-developed single-purpose tools to good use. It was the right tool, in the right place, at the right time; almost all Unix systems today are based upon System 7 and nearly all still include the original Bourne shell as an option. In this book, we will use a descendant of the venerable Bourne shell, known as Bash, which is a rewrite of the Bourne shell released in 1989 for the GNU project that incorporated the best features of the Bourne shell itself along with several of its earlier spinoffs.
We don't want to BaSH other shells, but...
In this book, we decided to focus on using the Bourne-again shell (bash) for multiple reasons. First, it's the most popular shell and you can find it everywhere. In fact, for the majority of Linux distributions, bash is the default shell. It's a great first shell to learn and very easy to work with. There's a number of examples and resources available to help you with bash if you ever get stuck. It's also safe to say that since it's so popular, you can find it on almost any system available today. From a bare-metal installation in a data center to an instance running in the cloud, bash is there, installed, and waiting for input.
There are a number of other shells you can choose from, such as the Z shell (zsh). The Z shell is fairly new (and by new I mean released in 1990, which is new in shell land) and provides a number of powerful features. Other notable shells are tcsh, ksh, and fish. The C Shell (tcsh), the Korn Shell (ksh), and the Friendly Interactive Shell (fish) are still widely used today. FreeBSD has made tcsh its default shell for the root user and ksh is still used for a lot of Solaris operating systems. Fish is also a great starter shell with a lot of features to help the user navigate the shell without feeling lost.
While these shells are still very powerful and stable, we will be focusing on using bash, as we want to focus on consistency across multiple platforms and help you learn a very active and popular shell that's been around for 30 years.