Containerized Data Science with Docker

Darwin Corn

July 04th, 2016

So, you're itching to begin your journey into data science but you aren't sure where to start. Well, I'm glad you’ve found this post since I will give the details in a step-by-step fashion as to how I circumvented the unnecessarily large technological barrier to entry and got my feet wet, so to speak.

Containerization in general and Docker in particular have taken the IT world by storm in the last couple of years by making LXC containers more than just VM alternatives for the enterprising sysadmin. Even if you're coming at this post from a world devoid of IT, the odds are good that you've heard of Docker and their cute whale mascot. Of course, now that Microsoft is on board, the containerization bandwagon and a consortium of bickering stakeholders have formed, so you know that container tech is here to stay. I know, FreeBSD has had the concept of 'jails' for almost two decades now. But thanks to Docker, container tech is now usable across the big three of Linux, Windows and Mac (if a bit hack-y in the case of the latter two), and today we're going to use its positives in an exploration into the world of data science.

Now that I have your interest piqued, you're wondering where the two intersect. Well, if you're like me, you've looked at the footprint of R-studio and the nightmare maze of dependencies of IPython and “noped” right out of there. Thanks to containers, these problems are solved! With Docker, you can limit the amount of memory available to the container, and the way containers are constructed ensures that you never have to deal with troubleshooting broken dependencies on update ever again.

So let's install Docker, which is as straightforward as using your package manager in Linux, or downloading Docker Toolbox if you're using a Mac or Windows PC, and running the installer. The instructions that follow will be tailored to a Linux installation, but are easily adapted to Windows or Mac as well. On those two platforms, you can even bypass these CLI commands and use Kitematic, or so I hear.

Now that you have Docker installed, let's look at some use cases for how to use it to facilitate our journey into data science. First, we are going to pull the Jupyter Notebook container so that you can work with that language-agnostic tool.

# docker run --rm -it -p 8888:8888 -v "$(pwd):/notebooks" jupyter/notebook

The -v "$(pwd):/notebooks" flag will mount the current directory to the /notebooks directory in the container, allowing you to save your work outside the container. This will be important because you’ll be using the container as a temporary working environment. The --rm flag ensures that the container is destroyed when it exits. If you rerun that command to get back to work after turning off your computer for instance, the container will be replaced with an entirely new one. That flag allows it access to the folder on the local filesystem, ensuring that your work survives the casually disposable nature of development containers. Now go ahead and navigate to http://localhost:8888, and let's get to work.

You did bring a dataset to analyze in a notebook, right? The actual nuts and bolts of data science are beyond the scope of this post, but for a quick intro to data and learning materials, I've found Kaggle to be a great resource. While we're at it, you should look at that other issue I mentioned previously—that of the application footprint. Recently a friend of mine convinced me to use R, and I was enjoying working with the language until I got my hands on some real data and immediately felt the pain of an application not designed for endpoint use. I ran a regression and it locked up my computer for minutes! Fortunately, you can use a container to isolate it and only feed it limited resources to keep the rest of the computer happy.

# docker run -m 1g -ti --rm r-base

This command will drop you into an interactive R CLI that should keep even the leanest of modern computers humming along without a hiccup. Of course, you can also use the -c and --blkio-weight flags to restrict access to the CPU and HDD resources respectively, if limiting it to the GB of RAM wasn't enough.

So, a program installation and a command or two (or a couple of clicks in the Kitematic GUI), and we're off and running using data science with none of the typical headaches.

About the Author

Darwin Corn is a systems analyst for the Consumer Direct Care Network. He is a mid-level professional with diverse experience in the information technology world.