Jupyter as a Data Laboratory: Part 1

Marijn van Vliet

May 18th, 2016

This is part one of a two-part piece on Jupyter, a computing platform used by many scientists to perform their data analysis and modeling. This first part will help you understand what Jupyter is, and the second part will cover why it represents a leap forward in scientific computing.

Jupyter: a data laboratory

If you think that scientists, famous for being careful and precise, always produce well-documented, well-tested, and beautiful code, you are dead wrong. More often than not, a scientist's local code folder is an ill-organized heap of horrible spaghetti code that will give any seasoned software developer nightmares. But the scientist will sleep soundly. That is because usually, the sort of programming that scientists do is a lot different from software development. They tend to write programming code for a whole different purpose, with a whole different mindset, and with a whole different approach to computing. If you have never done scientific computing before—by which I mean you have never used your computer to analyze measurement data or to "do science"—then leave your TDD, SCRUM, agile, and so on at the door and come join me for a little excursion into Jupyter.

The programming language is your user interface

Over the years, programmers have created applications to cover most computing needs of most users. In domains such as content creation, communication, and entertainment, chances are good that someone already wrote an application that does what you want to do. If you're lucky, there's even a friendly GUI to help guide you through the process. But in science, the point is usually to try something that nobody has done before. Hence, any application used for data analysis needs to be flexible. The application has to enable the user to do, well, anything imaginable with a dataset, and that is where the GUI paradigm breaks down. Instead of presenting the user with a list of available options, it becomes more efficient to just ask the user what needs to be accomplished. When driven to the extreme, you end up dropping the whole concept of an application and working directly with a programming language.

So it is understandable that when you start Jupyter, you are staring at a mostly blank screen with a blinking cursor. Realize that behind that blinking cursor sits the considerable computational power of your computer (most likely a multicore processor, gigabytes of RAM, and terabytes of storage space), awaiting your command. In many domains, a programming language is used to create an application, which in turn presents you with an interface to do the operation you wanted to do in the first place. In scientific computing, however, the programming language is your interface.

The ingredients of a data laboratory

I think of Jupyter as a data laboratory. The heart of a data laboratory is a REPL (a read-eval-print loop): you enter lines of programming code, they are executed immediately, and the result is displayed on the screen. The REPL can be regarded as a workbench, and loading a chunk of data into working memory can be regarded as placing a sample on it, ready to be examined. Jupyter offers several advanced REPL environments, most notably IPython, which runs in your terminal and also ships with its own tricked-out terminal to display inline graphics and offer easier copy-paste. However, the most powerful REPL that Jupyter offers runs in your browser, allowing you to use multiple programming languages at the same time and embed inline markdown, images, videos, and basically anything the browser can render.
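To make the read-eval-print idea concrete, here is a toy sketch of such a loop in plain Python. It is not how IPython or Jupyter are implemented; the function name tiny_repl and the sample values are made up for illustration, and the only point is to show the three steps in order.

```python
def tiny_repl(lines, namespace=None):
    """Run source lines through a read-eval-print cycle and collect what would be shown."""
    namespace = {} if namespace is None else namespace
    outputs = []
    for line in lines:                        # read: take the next line of input
        try:
            result = eval(line, namespace)    # eval: expressions yield a value
        except SyntaxError:
            exec(line, namespace)             # statements (e.g. assignments) yield none
            result = None
        if result is not None:
            outputs.append(repr(result))      # print: keep the displayed form
    return outputs


# Placing a "sample" on the workbench and examining it, one line at a time.
session = tiny_repl([
    "sample = [2.0, 4.0, 9.0]",
    "len(sample)",
    "sum(sample) / len(sample)",
])
for shown in session:
    print(shown)
```

The namespace dictionary plays the role of working memory: whatever the first line loads stays available to every later line, which is what makes the workbench metaphor hold.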

The REPL allows access to the underlying programming language. Since the language acts as our primary user interface, it needs to get out of our way as much as possible. This generally means it should be high-level with terse syntax and not be too picky about correctness. And of course, it must support an interpreted mode to allow a quick back-and-forth between a line of code and the result of the computation. Jupyter supports a multitude of programming languages and ships with Python by default, which fulfills the above requirements nicely.

In order to work with the data efficiently (for example, to get it onto your workbench in the first place), you'll want software libraries (which can be regarded as shelves that hold various tools like saws, magnifiers, and pipettes). Over the years, scientists have contributed a lot of useful libraries to the Python ecosystem, so you can have your pick of favorite tools. Since the APIs that are exposed by these libraries are as much a part of the user interface as the programming language, a lot of thought gets put into them.
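As a small illustration of reaching for tools from the shelf, the sketch below loads a tiny dataset and summarizes it using only Python's standard library (csv, io, and statistics), so that it runs anywhere. The measurements and column names are invented for the example; in practice a scientist would more likely pick up a dedicated library such as NumPy or pandas for the same job.

```python
import csv
import io
import statistics

# Hypothetical measurements, embedded as text so the example needs no file on disk.
raw = io.StringIO("subject,reaction_ms\na,310\nb,295\nc,342\nd,301\n")

# "Getting the data onto the workbench": parse the rows into memory.
rows = list(csv.DictReader(raw))
reaction_times = [float(row["reaction_ms"]) for row in rows]

# Two tools from the shelf: the mean and the sample standard deviation.
mean_rt = statistics.mean(reaction_times)
spread_rt = statistics.stdev(reaction_times)

print(f"n={len(reaction_times)} mean={mean_rt:.1f} ms sd={spread_rt:.1f} ms")
```

Notice how little of this is plumbing: each library call reads almost like the sentence describing the step, which is the kind of API design the paragraph above is talking about.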

While executing single lines or blocks of code to interactively examine your data is essential, the final ingredient of the data laboratory is the text editor. The editor should be intimately connected to the REPL and allow for a seamless transmission of text between the two. The typical workflow is to first try a step of the data analysis live in the REPL and, when it seems to work, write it down into a growing analysis script. More complicated algorithms are written in the editor first in an iterative fashion, testing the implementation by executing the code in the REPL. Jupyter's notebook environment is notable in this regard, as it blends the REPL and the editor together.
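The REPL-to-script workflow might look roughly like this. The one-liner in the comment stands for something tried live at the prompt; once it seems to work, it gets a name and moves into the growing analysis script. The function names and the data are hypothetical and only serve to show the pattern.

```python
# Tried interactively first (what one might type at the prompt):
#   >>> [x - sum(data) / len(data) for x in data]
# Once it looks right, it is written down in the script as a named step.

def remove_mean(data):
    """Subtract the mean from every value."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]


def analysis(data):
    """A growing analysis script: each step was first checked by hand in the REPL."""
    centered = remove_mean(data)
    return max(abs(x) for x in centered)   # largest deviation from the mean


print(analysis([1.0, 2.0, 6.0]))
```

In a notebook, both halves live in the same place: the cell you experimented in simply becomes the cell you keep.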

Go check it out

If you are interested in learning more about Jupyter, I recommend installing it and checking out this wonderful collection of interesting Jupyter notebooks.

About the author

Marijn van Vliet is a postdoctoral researcher at the Department of Neuroscience and Biomedical Engineering of Aalto University in Finland. He received his PhD in biomedical sciences in 2015.
