Running Your Spark Job Executors In Docker Containers

Bernardo Gomez Palacio

May 27th, 2016

The following post showcases a Dockerized Apache Spark application running in a Mesos cluster. In our example, the Spark Driver as well as the Spark Executors will be running in a Docker image based on Ubuntu with the addition of the SciPy Python packages. If you are already familiar with the reasons for using Docker as well as Apache Mesos, feel free to skip the next section and jump right to the post, but if not, please carry on.

Rational

Today, it’s pretty common to find engineers and data scientist who need to run big data workloads in a shared infrastructure. In addition, the infrastructure can potentially be used not only for such workloads, but also for other important services required for business operations.

A very common way to solve such problems is to virtualize the infrastructure and statically partition it in such a way that each development or business group in the company has its own resources to deploy and run their applications on. Hopefully, the maintainers of such infrastructure and services have a DevOps mentality and have automated, and continuously work on automating, the configuration and software provisioning tasks on such infrastructure.

The problem is, as Benjamin Hindman backed by the studies done at the University of California, Berkeley, points out, static partitioning can be highly inefficient on the utilization of such infrastructure. This has prompted the development of Resource Schedulers that abstract CPU, memory, storage, and other computer resources away from machines, either physically or virtually, to enable the execution of applications across the infrastructure to achieve a higher utilization factor, among other things.

The concept of sharing infrastructure resources is not new for applications that entail the analysis of large datasets, in most cases, through algorithms that favor parallelization of workloads. Today, the most common frameworks to develop such applications are Hadoop Map Reduce and Apache Spark. In the case of Apache Spark, it can be deployed in clusters managed by Resource Schedulers such as Hadoop YARN or Apache Mesos. Now, since different applications are running inside a shared infrastructure, it’s common to find applications that have different sets of requirements across the software packages and versions of such packages they depend on to function. As an operation engineer, or infrastructure manager, you can force your users to a predefine a set of software libraries, along with their versions, that the infrastructure supports. Hopefully, if you follow that path you also establish a procedure to upgrade such software libraries and add new ones.

This tends to require an investment in time and might be frustrating to engineers and data scientist that are constantly installing new packages and libraries to facilitate their work. When you decide to upgrade, you might as well have to refactor some applications that might have been running for a long time but have heavy dependencies on previous versions of the packages that are part of the upgrade. All in all, it’s not simple.

Linux Containers, and specially Docker, offer an abstraction such that software can be packaged into lightweight images that can be executed as containers. The containers are executed with some level of isolation, and such isolation is mainly provided by cgroups. Each image can define the type of operating system that it requires along with the software packages. This provides a fantastic mechanism to pass the burden of maintaining the software packages and libraries out of infrastructure management and operations to the owners of the applications. With this, the infrastructure and operations teams can run multiple isolated applications that can potentially have conflicting software libraries within the same infrastructure. Apache Spark can leverage this as long as it’s deployed with an Apache Mesos cluster that supports Docker.

In the next sections, we will review how we can run Apache Spark applications within Docker containers.

Tutorial

For this post, we will use a CentOS 7.2 minimal image running on VirtualBox. However, in this tutorial, we will not include the instructions to obtain such a CentOS image, make it available in VirtualBox, or configure its network interfaces.

Additionally, we will be using a single node to keep this exercise as simple as possible. We can later explore deploying a similar setup in a set of nodes in the cloud; but for the sake of simplicity and time, our single node will be running the following services:

  • A Mesos master
  • A Mesos slave
  • A Zookeeper instance
  • A Docker daemon

Step 1: The Mesos Cluster

To install Apache Mesos in your cluster, I suggest you follow the Mesosphere getting started guidelines. Since we are using CentOS 7.2, we first installed the Mesosphere YUM repository as follows:

# Add the repository
    sudo rpm -Uvh http://repos.mesosphere.com/el/7/noarch/RPMS/
      mesosphere-el-repo-7-1.noarch.rpm

We then install Apache Mesos and the Apache Zookeeper packages.

sudo yum -y install mesos mesosphere-zookeeper

Once the packages are installed, we need to configure Zookeeper as well as the Mesos master and slave.

Zookeeper

For Zookeeper, we need to create a Zookeeper Node Identity. We do this by setting the numerical identifying inside the /var/lib/zookeeper/myid file.

echo "1" > /var/lib/zookeeper/myid

Since by default Zookeeper binds to all interfaces and exposes its services through port 2181, we do not need to change the /etc/zookeeper/conf/zoo.cfg file. Refer to the Mesosphere getting started guidelines if you have a Zookeeper ensemble, more than one node running Zookeeper. After that, we can start the Zookeeper service:

sudo service zookeeper restart

Mesos master and slave

Before we start to describe the Mesos configuration, we must note that the location of the Mesos configuration files that we will talk about now is specific to Mesosphere's Mesos package. If you don't have a strong reason to build your own Mesos packages, I suggest you use the ones that Mesosphere kindly provides.

We need to tell the Mesos master and slave about the connection string they can use to reach Zookeeper, including their namespace. By default, Zookeeper will bind to all interfaces; you might want to change this behavior. In our case, we will make sure that the IP address that we want to use to connect to Zookeeper can be resolved within the containers. The nodes public interface IP 192.168.99.100, and to do this, we do the following:

echo "zk://192.168.99.100:2181/mesos" > /etc/mesos/zk

Now, since in our setup we have several network interfaces associated with the node that will be running the Mesos master, we will pick an interface that will be reachable within the Docker containers that will eventually be running the Spark Driver and Spark Executors. Knowing that the IP address that we want to bind to is 192.168.99.100, we do the following:

echo "192.168.99.100" > /etc/mesos-master/ip

We do a similar thing for the Mesos slave. Again, consider that in our example the Mesos slave is running on the same node as the Mesos master and we will bind it to the same network interface.

echo "192.168.99.100" > /etc/mesos-slave/ip
echo "192.168.99.100" > /etc/mesos-slave/hostname

The IP defines the IP address that the Mesos slave will bind to and the hostname defines the hostname that the slave will use to report its availability, and therefore, it is the value that the Mesos frameworks, in our case Apache Spark, will use to connect to it.

Let’s start the services:

systemctl start mesos-master
systemctl start mesos-slave

By default, the Mesos master will bind to port 5050 and the Mesos slave to port 5051. Let’s confirm this, assuming that you have installed the net-utils package:

netstat -pleno | grep -E "5050|5051"
    tcp        0      0 192.168.99.100:5050     0.0.0.0:*   LISTEN    
  0          127336     22205/mesos-master   off (0.00/0/0)
    tcp        0      0 192.168.99.100:5051     0.0.0.0:*   LISTEN    
  0          127453     22242/mesos-slave    off (0.00/0/0)

Let’s run a test:

MASTER=$(mesos-resolve cat /etc/mesos/zk) \
LIBPROCESSIP=192.168.99.100 \
mesos-execute --master=$MASTER \
                  --name="cluster-test" \
                  --command="echo 'Hello World' &&  sleep 5 && echo 'Good Bye'"

Step 2: Installing Docker

We followed the Docker documentation on installing Docker in CentOS. I suggest that you do the same. In a nutshell, we executed the following:

sudo yum update
sudo tee /etc/yum.repos.d/docker.repo <<-'EOF'
[dockerrepo]
name=Docker Repository
baseurl=https://yum.dockerproject.org/repo/main/centos/$releasever/
enabled=1
gpgcheck=1
gpgkey=https://yum.dockerproject.org/gpg
EOF
sudo yum install docker-engine
sudo service docker start

If the preceding code succeeded, you should be able to do a docker ps as well as a docker search ipython/scipystack successfully.

Step 3: Creating a Spark image

Let’s create the Dockerfile that will be used by the Spark Driver and Spark Executor. For our example, we will consider that the Docker image should provide the SciPy stack along with additional Python libraries. So, in a nutshell, the Docker image must have the following features:

  • The version of libmesos should be compatible with the version of the Mesos master and slave. For example, /usr/lib/libmesos-0.26.0.so
  • It should have a valid JDK
  • It should have the SciPy stack as well as Python packages that we want
  • It should have a version of Spark, we will choose 1.6.0.

The Dockerfile below will provide the requirements that we mention above. Note that installing Mesos through the Mesosphere RPMs will install Open JDK, in this case version 1.7.

Dockerfile:

# Version 0.1
FROM ipython/scipystack
MAINTAINER Bernardo Gomez Palacio "bernardo.gomezpalacio@gmail.com"
ENV REFRESHEDAT 2015-03-19

ENV DEBIANFRONTEND noninteractive

RUN apt-get update
RUN apt-get dist-upgrade -y

# Setup
RUN sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
RUN export OSDISTRO=$(lsbrelease -is | tr '[:upper:]' '[:lower:]') && \
export OSCODENAME=$(lsbrelease -cs) && \
echo "deb http://repos.mesosphere.io/${OSDISTRO} ${OSCODENAME} main" | \
tee /etc/apt/sources.list.d/mesosphere.list &&\
apt-get -y update

RUN apt-get -y install mesos

RUN apt-get install -y python libnss3 curl

RUN curl http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz \
| tar -xzC /opt && \
mv /opt/spark* /opt/spark

RUN apt-get clean

# Fix pypspark six error.
RUN pip2 install -U six
RUN pip2 install msgpack-python
RUN pip2 install avro

COPY spark-conf/* /opt/spark/conf/
COPY scripts /scripts

ENV SPARKHOME /opt/spark

ENTRYPOINT ["/scripts/run.sh"]

Let’s explain some very important files that will be available in the Docker image according to the Dockerfile mentioned earlier:

The spark-conf/spark-env.sh, as mentioned in the Spark docs, will be used to set the location of the Mesos libmesos.so:

export MESOSNATIVEJAVALIBRARY=${MESOSNATIVEJAVALIBRARY:-/usr/lib/libmesos.so}
export SPARKLOCALIP=${SPARKLOCALIP:-"127.0.0.1"}
export SPARKPUBLICDNS=${SPARKPUBLICDNS:-"127.0.0.1"}

The spark-conf/spark-defaults.conf serves as the definition of the default configuration for our Spark jobs within the container, the contents are as follows:

spark.master  SPARKMASTER
spark.mesos.mesosExecutor.cores   MESOSEXECUTORCORE
spark.mesos.executor.docker.image SPARKIMAGE
spark.mesos.executor.home /opt/spark
spark.driver.host CURRENTIP
spark.executor.extraClassPath /opt/spark/custom/lib/*
spark.driver.extraClassPath   /opt/spark/custom/lib/*

Note that the use of environment variables such as SPARKMASTER and SPARKIMAGE are critical since this will allow us to customize how the Spark application interacts with the Mesos Docker integration.

We have Docker's entry point script. The script, showcased below, will populate the spark-defaults.conf file.

Now, let’s define the Dockerfile entry point such that it lets us define some basic options that will get passed to the Spark command, for example, spark-shell, spark-submit or pyspark:

#!/bin/bash

SPARKMASTER=${SPARKMASTER:-local}
MESOSEXECUTORCORE=${MESOSEXECUTORCORE:-0.1}
SPARKIMAGE=${SPARKIMAGE:-sparkmesos:lastet}
CURRENTIP=$(hostname -i)

sed -i 's;SPARKMASTER;'$SPARKMASTER';g' /opt/spark/conf/spark-defaults.conf
sed -i 's;MESOSEXECUTORCORE;'$MESOSEXECUTORCORE';g' /opt/spark/conf/spark-defaults.conf
sed -i 's;SPARKIMAGE;'$SPARKIMAGE';g' /opt/spark/conf/spark-defaults.conf
sed -i 's;CURRENTIP;'$CURRENTIP';g' /opt/spark/conf/spark-defaults.conf

export SPARKLOCALIP=${SPARKLOCALIP:-${CURRENTIP:-"127.0.0.1"}}
export SPARKPUBLICDNS=${SPARKPUBLICDNS:-${CURRENTIP:-"127.0.0.1"}}


if [ $ADDITIONALVOLUMES ];
then
echo "spark.mesos.executor.docker.volumes: $ADDITIONALVOLUMES" >> /opt/spark/conf/spark-defaults.conf
fi

exec "$@"

Let’s build the image so we can start using it.

docker build -t sparkmesos . && \
docker tag -f sparkmesos:latest sparkmesos:latest

Step 4: Running a Spark application with Docker.

Now that the image is built, we just need to run it. We will call the PySpark application:

docker run -it --rm \
-e SPARKMASTER="mesos://zk://192.168.99.100:2181/mesos" \
-e SPARKIMAGE="sparkmesos:latest" \
-e PYSPARKDRIVERPYTHON=ipython2 \
sparkmesos:latest /opt/spark/bin/pyspark

To make sure that SciPy is working, let's write the following to the PySpark shell:

from scipy import special, optimize
import numpy as np

f = lambda x: -special.jv(3, x)
sol = optimize.minimize(f, 1.0)
x = np.linspace(0, 10, 5000)
x

Now, let’s try to calculate PI as an example:

docker run -it --rm \
-e SPARKMASTER="mesos://zk://192.168.99.100:2181/mesos" \
-e SPARKIMAGE="sparkmesos:latest" \
-e PYSPARKDRIVERPYTHON=ipython2 \
sparkmesos:latest /opt/spark/bin/spark-submit --driver-memory 500M \
--executor-memory 500M \
/opt/spark/examples/src/main/python/pi.py 10

Conclusion and further notes

Although we were able to run a Spark application within a Docker container leveraging Apache Mesos, there is more work to do. We need to explore containerized Spark applications that spread across multiple nodes along with providing a mechanism that enables network port mapping.

References

About the Author

Bernardo Gomez Palacio is a consulting member of technical staff, Big Data Cloud Services at Oracle Cloud. He is an electronic systems engineer but has worked for more than 12 years developing software and more than 6 years on DevOps. Currently, his work is that of developing infrastructure to aid the creation and deployment of big data applications. He is a supporter of open source software and has a particular interest in Apache Mesos, Apache Spark, Distributed File Systems, and Docker Containerization & Networking. His opinions are his own and do not reflect the opinions of his employer.