Home

Data

Mastering Mesos

By Dipa Dubhashi , Akhil Das

Book

eBook $47.99 $32.99

Print $60.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $47.99 $32.99

Print $60.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Apache Mesos is open source cluster management software that provides efficient resource isolations and resource sharing distributed applications or frameworks. This book will take you on a journey to enhance your knowledge from amateur to master level, showing you how to improve the efficiency, management, and development of Mesos clusters. The architecture is quite complex and this book will explore the difficulties and complexities of working with Mesos. We begin by introducing Mesos, explaining its architecture and functionality. Next, we provide a comprehensive overview of Mesos features and advanced topics such as high availability, fault tolerance, scaling, and efficiency. Furthermore, you will learn to set up multi-node Mesos clusters on private and public clouds. We will also introduce several Mesos-based scheduling and management frameworks or applications to enable the easy deployment, discovery, load balancing, and failure handling of long-running services. Next, you will find out how a Mesos cluster can be easily set up and monitored using the standard deployment and configuration management tools. This advanced guide will show you how to deploy important big data processing frameworks such as Hadoop, Spark, and Storm on Mesos and big data storage frameworks such as Cassandra, Elasticsearch, and Kafka.

Publication date:: May 2016
Publisher: Packt
Pages: 352
ISBN: 9781785886249
Download code from GitHub

Chapter 1. Introducing Mesos

Apache Mesos is open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computer resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be easily built and run effectively. It is referred to as a metascheduler (scheduler of schedulers) and a "distributed systems kernel/distributed datacenter OS".

It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, makes this one of the most exciting projects.

We will cover the following topics in this chapter:

Introduction to the datacenter OS and architecture of Mesos
Introduction to frameworks
Attributes, resources and resource scheduling, allocation, and isolation
Monitoring and APIs provided by Mesos
Mesos in production

Introduction to the datacenter OS and architecture of Mesos

Over the past decade, datacenters have graduated from packing multiple applications into a single server box to having large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyperscale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy.

Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on the PC, one each for processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities being managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced.

Another way of looking at the situation is comparing running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources.

Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support.

The idea behind a datacenter OS is that of intelligent software that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements. In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure.

The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multilatency, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources.

From a developer's perspective, this will allow them to easily and safely build distributed applications without restricting them to a bunch of specialized tools, each catering to a specific set of requirements. For instance, let's consider the case of Data Science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how the resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change.

Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues.

In a world where distributed computing becomes the norm, the datacenter OS is a boon in disguise. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications as all applications would be capable of running on any available resources from any machine, even if there are other applications already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability even for noninfrastructural professionals.

Examples of some organizations administering their hyperscale datacenters via the datacenter OS are Google with the Borg (and next generation Omega) systems. The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and flexibility to support data sharing across applications to saving team effort, time, and money while launching and managing interoperable cluster applications.

It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way with a number of leading companies, such as Apple, Twitter, Netflix, and AirBnB among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core.

The architecture of Mesos

Mesos is an open-source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making it available as a single reservoir of highly available resources for all the different frameworks to utilize. For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM) and another 5-node cluster (4 CPUs and 16 GB RAM), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB RAM and 180 CPUs, where multiple distributed applications can be run. Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per-framework.

Some of the important features of Mesos are:

Scalability: It can elastically scale to over 50,000 nodes
Resource isolation: This is achieved through Linux/Docker containers
Efficiency: This is achieved through CPU and memory-aware resource scheduling across multiple frameworks
High availability: This is through Apache ZooKeeper
Monitoring Interface: A Web UI for monitoring the cluster state

Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are:

Different frame data replication works can independently devise methods to address their data locality, fault-tolerance, and other such needs
It simplifies the Mesos codebase and allows it to be scalable, flexible, robust, and agile

Mesos' architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation is shown to achieve a good degree of data locality for each framework sharing the same cluster.

An alternative architecture would implement a global scheduler that took framework requirements, organizational priorities, and resource availability as inputs and provided a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators.

Introduction to frameworks

A Mesos framework sits between Mesos and the application and acts as a layer to manage task scheduling and execution. As its implementation is application-specific, the term is often used to refer to the application itself. Earlier, a Mesos framework could interact with the Mesos API using only the libmesos C++ library, due to which other language bindings were developed for Java, Scala, Python, and Go among others that leveraged libmesos heavily. Since v0.19.0, the changes made to the HTTP-based protocol enabled developers to develop frameworks using the language they wanted without having to rely on the C++ code. A framework consists of two components: a) Scheduler and b) Executor.

Scheduler is responsible for making decisions on the resource offers made to it and tracking the current state of the cluster. Communication with the Mesos master is handled by the SchedulerDriver module, which registers the framework with the master, launches tasks, and passes messages to other components.

The second component, Executor, is responsible, as its name suggests, for the execution of tasks on slave nodes. Communication with the slaves is handled by the ExecutorDriver module, which is also responsible for sending status updates to the scheduler.

The Mesos API, discussed later in this chapter, allows programmers to develop their own custom frameworks that can run on top of Mesos. Some other features of frameworks, such as authentication, authorization, and user management, will be discussed at length in Chapter 6, Mesos Frameworks.

Frameworks built on Mesos

A list of some of the services and frameworks built on Mesos is given here. This list is not exhaustive, and support for new frameworks is added almost every day. You can also refer to http://mesos.apache.org/documentation/latest/frameworks/ apart from the following list:

Long-running services

Aurora: This is a service scheduler that runs on top of Mesos, enabling you to run long-running services that take advantage of the scalability, fault-tolerance, and resource isolation of Mesos.
Marathon: This is a private PaaS built on Mesos. It automatically handles hardware or software failures and ensures that an app is "always on".
Singularity: This is a scheduler (the HTTP API and web interface) for running Mesos tasks, such as long-running processes, one-off tasks, and scheduled jobs.
SSSP: This is a simple web application that provides a "Megaupload" white label to store and share files in S3.

Big data processing

Cray Chapel is a productive parallel programming language. The Chapel Mesos scheduler lets you run Chapel programs on Mesos.
Dark is a Python clone of Spark, a MapReduce-like framework written in Python and running on Mesos.
Exelixi is a distributed framework used to run genetic algorithms at scale.
Hadoop Running Hadoop on Mesos distributes MapReduce jobs efficiently across an entire cluster.
Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations—for example, matrix, graph, and network algorithms.
MPI is a message-passing system designed to function on a wide variety of parallel computers.
Spark is a fast and general-purpose cluster computing system that makes parallel jobs easy to write.
Storm is a distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop does for batch processing.

Batch scheduling

Chronos is a distributed job scheduler that supports complex job topologies. It can be used as a more fault-tolerant replacement for cron.
Jenkins is a continuous integration server. The Mesos-Jenkins plugin allows it to dynamically launch workers on a Mesos cluster, depending on the workload.
JobServer is a distributed job scheduler and processor that allows developers to build custom batch processing Tasklets using a point and click Web UI.

Data storage

Cassandra is a performant and highly available distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Elasticsearch is a distributed search engine. Mesos makes it easy for it to run and scale.

The attributes and resources of Mesos

Mesos describes the slave nodes present in the cluster by the following two methods:

Attributes

Attributes are used to describe certain additional information regarding the slave node, such as its OS version, whether it has a particular type of hardware, and so on. They are expressed as key-value pairs with support for three different value types—scalar, range, and text—that are sent along with the offers to frameworks. Take a look at the following code:

attributes : attribute ( ";" attribute )*

attribute : text ":" ( scalar | range | text )

Resources

Mesos can manage three different types of resources: scalars, ranges, and sets. These are used to represent the different resources that a Mesos slave has to offer. For example, a scalar resource type could be used to represent the amount of CPU on a slave. Each resource is identified by a key string, as follows:

resources : resource ( ";" resource )*

resource : key ":" ( scalar | range | set )

key : text ( "(" resourceRole ")" )?

resourceRole : text | "*"

Predefined uses and conventions

The Mesos master predefines how it handles the following list of resources:

cpus
mem
disk
ports

In particular, a slave without the cpu and mem resources will never have its resources advertised to any frameworks. Also, the master's user interface interprets the scalars in mem and disk in terms of MB. For example, the value 15000 is displayed as 14.65GB.

Examples

Here are some examples of configuring the Mesos slaves:

resources='cpus:24;mem:24576;disk:409600;ports:[21000-24000];bugs:{a,b,c}'
attributes='rack:abc;zone:west;os:centos5;level:10;keys:[1000-1500]'

In this case, we have three different types of resources, scalars, a range, and a set. They are called cpus, mem, and disk, and the range type is ports.

A scalar called cpus with the value 24
A scalar called mem with the value 24576
A scalar called disk with the value 409600
A range called ports with values 21000 through 24000 (inclusive)
A set called bugs with the values a, b, and c

In the case of attributes, we will end up with three attributes:

A rack attribute with the text value abc
A zone attribute with the text value west
An os attribute with the text value centos5
A level attribute with the scalar value 10
A keys attribute with range values 1000 through 1500 (inclusive)

Two-level scheduling

Mesos has a two-level scheduling mechanism to allocate resources to and launch tasks on different frameworks. In the first level, the master process that manages slave processes running on each node in the Mesos cluster determines the free resources available on each node, groups them, and offers them to different frameworks based on organizational policies, such as priority or fair sharing. Organizations have the ability to define their own sharing policies via a custom allocation module as well.

In the second level, each framework's scheduler component that is registered as a client with the master accepts or rejects the resource offer made depending on the framework's requirements. If the offer is accepted, the framework's scheduler sends information regarding the tasks that need to be executed and the number of resources that each task requires to the Mesos master. The master transfers the tasks to the corresponding slaves, which assign the necessary resources to the framework's executor component, which manages the execution of all the required tasks in containers. When the tasks are completed, the containers are dismantled, and the resources are freed up for use by other tasks.

The following diagram and explanation from the Apache Mesos documentation (http://mesos.apache.org/documentation/latest/architecture/) explains this concept in more detail:

Let's have a look at the pointers mentioned in the preceding diagram:

1: Slave 1 reports to the master that it has four CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that Framework 1 should be offered all the available resources.
2: The master sends a resource offer describing these resources to Framework 1.
3: The framework's scheduler replies to the master with information about two tasks to run on the slave using two CPUs and 1 GB RAM for the first task and one CPU and 2 GB RAM for the second task.
4: The master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks. As one CPU and 1 GB of RAM are still free, the allocation module may now offer them to Framework 2. In addition, this resource offers process repeats when tasks finish and new resources become free.

Mesos also provides frameworks with the ability to reject resource offers. A framework can reject the offers that do not meet its requirements. This allows frameworks to support a wide variety of complex resource constraints while keeping Mesos simple at the same time. A policy called delay scheduling, in which frameworks wait for a finite time to get access to the nodes storing their input data, gives a fair level of data locality albeit with a slight latency tradeoff.

If the framework constraints are complex, it is possible that a framework might need to wait before it receives a suitable resource offer that meets its requirements. To tackle this, Mesos allows frameworks to set filters specifying the criteria that they will use to always reject certain resources. A framework can set a filter stating that it can run only on nodes with at least 32 GB of RAM space free, for example. This allows it to bypass the rejection process, minimizes communication overheads, and thus reduces overall latency.

Resource allocation

The resource allocation module contains the policy that the Mesos master uses to determine the type and quantity of resource offers that need to be made to each framework. Organizations can customize it to implement their own allocation policy—for example, fair sharing, priority, and so on—which allows for fine-grained resource sharing. Custom allocation modules can be developed to address specific needs.

The resource allocation module is responsible for making sure that resources are shared in a fair manner among competing frameworks. The choice of algorithm used to determine the sharing policy has a great bearing on the efficiency of a cluster manager. One of the most popular allocation algorithms, max-min fairness, and its weighted derivative are described in the following section.

Max-min fair share algorithm

Imagine a set of sources (1, 2, ..., m) that has resource demands x₁, x₂, ..., x_m. Let the total number of resources be R. We will initially give R/m of the resource to each of the m sources. Now, starting with the source with the least demand, we will compare the allocation to the actual demand. If initial allocation (R/m) is more than the demand requirements of source 1, we will redistribute the excess resources equally among the remaining sources. We will then compare the new allocation to the actual demand of the source with the second-lowest demand and continue the process as before. The process ends when each source gets allocated resources that are less than or equal to its actual demand. If any source gets allocated resources less than what it actually needs, the algorithm ensures that no other source can get more resources than such a source. Such an allocation is called a max-min fair share allocation because it maximizes the minimum share of sources whose demands are not met.

Consider the following example:

How to compute the max-min fair allocation for a set of four sources, S1, S2, S3, and S4, with demands 2, 2.5, 4, and 5, respectively, when the resource has an overall capacity of 10.

Following the methodology described earlier, to solve this, we will tentatively divide the resource into four portions of size 2.5 each. Next, we will compare this allocation with the actual demand of the source with the least demand (in this case, S1). As the allocation is greater than the actual demand, the excess 0.5 is divided equally among the remaining three sources, S2, S3, and S4, giving them 2.666 each. Continuing the process, we will note that the new allocation is greater than the actual demand of source S2. The excess 0.166 is again divided evenly among the remaining two sources S3 and S4, giving them 2.666 + 0.084 = 2.75 each. The allocation for each of the sources is now less than or equal to the actual demand, so the process is stopped here. The final allocation is, therefore, S1 – 2, S2 – 2.5, S3 – 2.75, and S4 – 2.75.

This works well in a homogenous environment—that is, one where resource requirements are fairly proportional between different competing users, such as a Hadoop cluster. However, scheduling resources across frameworks with heterogeneous resource demands poses a more complex challenge. What is a suitable fair share allocation policy if user A runs tasks that require two CPUs and 8 GB RAM each and user B runs tasks that require four CPUs and 2 GB RAM each? As can be noted, user A's tasks are RAM-heavy, while user B's tasks are CPU-heavy. How, then, should a set of combined RAM and CPU resources be distributed between the two users?

The latter scenario is a common one faced by Mesos, designed as it is to manage resources primarily in a heterogeneous environment. To address this, Mesos has the Dominant Resource Fairness algorithm (DRF) as its default resource allocation policy, which is far more suitable for heterogeneous environments. The algorithm and its role in efficient resource allocation will be discussed in more detail in the next chapter.

Resource isolation

One of the key requirements of a cluster manager is to ensure that the allocation of resources to a particular framework does not have an impact on any active running jobs of some other framework. Provision for isolation mechanisms on slaves to compartmentalize different tasks is thus a key feature of Mesos. Containers are leveraged for resource isolation with a pluggable architecture. The Mesos slave uses the Containerizer API to provide an isolated environment to run a framework's executor and its corresponding tasks. The Containerizer API's objective is to support a wide range of implementations, which implies that custom containerizers and isolators can be developed. When a slave process starts, the containerizer to be used to launch containers and a set of isolators to enforce the resource constraints can be specified.

The Mesos Containerizer API provides a resource isolation of framework executors using Linux-specific functionality, such as control groups and namespaces. It also provides basic support for POSIX systems (only resource usage reporting and not actual isolation). This important topic will be explored at length in subsequent chapters.

Mesos also provides network isolation at a container level to prevent a single framework from capturing all the available network bandwidth or ports. This is not supported by default, however, and additional dependencies need to be installed and configured in order to activate this feature.

Monitoring in Mesos

In this section, we will take a look at the different metrics that Mesos provides to monitor the various components.

Monitoring provided by Mesos

Mesos master and slave nodes provide rich data that enables resource utilization monitoring and anomaly detection. The information includes details about available resources, used resources, registered frameworks, active slaves, and task state. This can be used to create automated alerts and develop a cluster health monitoring dashboard. More details can be found here:

http://mesos.apache.org/documentation/latest/monitoring/.

Network statistics for each active container are published through the /monitor/statistics.json endpoint on the slave.

Types of metrics

Mesos provides two different kinds of metrics: counters and gauges. These can be explained as follows:

Counters: This is used to measure discrete events, such as the number of finished tasks or invalid status updates. The values are always whole numbers.
Gauges: This is used to check the snapshot of a particular metric, such as the number of active frameworks or running tasks at a particular time.

The Mesos API

Mesos provides an API to allow developers to build custom frameworks that can run on top of the underlying distributed infrastructure. The detailed steps involved in developing bespoke frameworks leveraging this API and the new HTTP API will be explored in detail in Chapter 6, Mesos Frameworks.

Messages

Mesos implements an actor-style message-passing programming model to enable nonblocking communication between different Mesos components and leverages protocol buffers for the same. For example, a scheduler needs to tell the executor to utilize a certain number of resources, an executor needs to provide status updates to the scheduler regarding the tasks that are executed, and so on. Protocol buffers provide the required flexible message delivery mechanism to enable this communication by allowing developers to define custom formats and protocols that can be used across different languages. For more details regarding the messages that are passed between different Mesos components, refer to https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto

API details

A brief description of the different APIs and methods that Mesos provides is provided in the following section:

Executor API

A brief description of the Executor API is given below. For more details, visit http://mesos.apache.org/api/latest/java/org/apache/mesos/Executor.html.

registered: This can be registered via the following code:
```
void registered(ExecutorDriver driver,
  ExecutorInfo executorInfo,
  FrameworkInfo frameworkInfo,
  SlaveInfo slaveInfo)
```
This code is invoked once the executor driver is able to successfully connect with Mesos. In particular, a scheduler can pass some data to its executors through the ExecutorInfo.getData() field.
The following are the parameters:
- driver: This is the executor driver that was registered and connected to the Mesos cluster
- executorInfo: This describes information about the executor that was registered
- frameworkInfo: This describes the framework that was registered
- slaveInfo: This describes the slave that will be used to launch the tasks for this executor
reregistered: This can be reregistered as follows:
```
void reregistered(ExecutorDriver driver,
  SlaveInfo slaveInfo)
```
This code is invoked when the executor reregisters with a restarted slave.
The following are the parameters:
- driver: This is the executor driver that was reregistered with the Mesos master
- slaveInfo: This describes the slave that will be used to launch the tasks for this executor
disconnected: This can be disconnected via the following code:
```
void disconnected(ExecutorDriver driver)
```
The preceding code is invoked when the executor gets "disconnected" from the slave—for example, when the slave is restarted due to an upgrade).
The following is the parameter:
- driver: This is the executor driver that was disconnected.
launchTask: Take a look at the following code:
```
void launchTask(ExecutorDriver driver,
  TaskInfo task)
```
The preceding code is invoked when a task is launched on this executor (initiated via SchedulerDriver.launchTasks(java.util.Collection<OfferID>, java.util.Collection<TaskInfo>, Filters). Note that this task can be realized with a thread, a process, or some simple computation; however, no other callbacks will be invoked on this executor until this callback returns.
The following are the parameters:
- driver: This is the executor driver that launched the task
- task: This describes the task that was launched
killTask: Run the following code:
```
void killTask(ExecutorDriver driver,
  TaskID taskId)
```
This is invoked when a task running within this executor is killed via SchedulerDriver.killTask (TaskID). Note that no status update will be sent on behalf of the executor, and the executor is responsible for creating a new TaskStatus protobuf message (that is, with TASK_KILLED) and invoking ExecutorDriver.sendStatusUpdate (TaskStatus).
The following are the parameters:
- driver: This is the executor driver that owned the task that was killed
- taskId: This is the ID of the task that was killed
frameworkMessage: Run the following code:
```
void frameworkMessage(ExecutorDriver driver,
  byte[] data)
```
This is invoked when a framework message arrives for this executor. These messages are the best effort; do not expect a framework message to be retransmitted in any reliable fashion.
The following are the parameters:
- driver: This is the executor driver that received the message
- data: This is the message payload
shutdown: Execute the following code:
```
void shutdown(ExecutorDriver driver)
```
This is invoked when the executor terminates all of its currently running tasks. Note that after Mesos determines that an executor has terminated, any tasks that the executor did not send Terminal status updates for (for example, TASK_KILLED, TASK_FINISHED, TASK_FAILED, and so on), and a TASK_LOST status update will be created.
The following is the parameter:
- driver: This is the executor driver that should terminate.
error: Run the following:
```
void error(ExecutorDriver driver,
  java.lang.String message)
```
The previous code is invoked when a fatal error occurs with the executor and/or executor driver. The driver will be aborted BEFORE invoking this callback.
The following are the parameters:
- driver: This is the executor driver that was aborted due to this error
- message: This is the error message

The Executor Driver API

A brief description of the Executor Driver API is given below. For more details, visit http://mesos.apache.org/api/latest/java/org/apache/mesos/ExecutorDriver.html.

start: Run the following line:
```
Status start()
```
The preceding code starts the executor driver. This needs to be called before any other driver calls are made.
The state of the driver after the call is returned.
stop: Run the following line:
```
Status stop()
```
This stops the executor driver.
The state of the driver after the call is the return.
abort: Run the following line:
```
Status abort()
```
This aborts the driver so that no more callbacks can be made to the executor. The semantics of abort and stop are deliberately separated so that the code can detect an aborted driver (via the return status of join(); refer to the following section) and instantiate and start another driver if desired (from within the same process, although this functionality is currently not supported for executors).
The state of the driver after the call is the return.
join: Run the following:
```
Status join()
```
This waits for the driver to be stopped or aborted, possibly blocking the current thread indefinitely. The return status of this function can be used to determine whether the driver was aborted (take a look at mesos.proto for a description of status).
The state of the driver after the call is the return.
run: Take a look at the following line of code:
```
Status run()
```
This starts and immediately joins (that is, blocks) the driver.
The state of the driver after the call is the return.
sendStatusUpdate: Here's the code to execute:
```
Status sendStatusUpdate(TaskStatus status)
```
This sends a status update to the framework scheduler, retrying as necessary until an acknowledgement is received or the executor is terminated (in which case, a TASK_LOST status update will be sent). Take a look at Scheduler.statusUpdate(org.apache.mesos.SchedulerDriver, TaskStatus) for more information about status update acknowledgements.
The following is the parameter:
- status: This is the status update to send.
The state of the driver after the call is the return.
sendFrameworkMessage: Run the following code:
```
Status sendFrameworkMessage(byte[] data)
```
This sends a message to the framework scheduler. These messages are sent on a best effort basis and should not be expected to be retransmitted in any reliable fashion.
The parameters are as follows:
- data: This is the message payload.
The state of the driver after the call is the return.

The Scheduler API

A brief description of the Scheduler API is given below. For more details, visit http://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html.

registered: This can be registered via the following code:
```
void registered(SchedulerDriver driver,
  FrameworkID frameworkId,
  MasterInfo masterInfo)
```
The preceding is invoked when the scheduler successfully registers with a Mesos master. A unique ID (generated by the master) is used to distinguish this framework from others, and MasterInfo with the IP and port of the current master are provided as arguments.
The following are the parameters:
- driver: This is the scheduler driver that was registered
- FrameworkID: This is the FrameworkID generated by the master
- MasterInfo: This is the information about the current master, including the IP and port.
reregistered: The preceding code can be reregistered as follows:
```
void reregistered(SchedulerDriver driver,
  MasterInfo masterInfo)
```
The preceding code is invoked when the scheduler reregisters with a newly elected Mesos master. This is only called when the scheduler is previously registered. MasterInfo containing the updated information about the elected master is provided as an argument.
The parameters are as follows:
- driver: This is the driver that was reregistered
- MasterInfo: This is the updated information about the elected master
resourceOffers: Execute the following code:
```
void resourceOffers(SchedulerDriver driver,
  java.util.List<Offer> offers)
```
The preceding code is invoked when resources are offered to this framework. A single offer will only contain resources from a single slave. Resources associated with an offer will not be reoffered to this framework until either; (a) this framework rejects these resources (refer to SchedulerDriver.launchTasks(java.util.Collection<OfferID>, java.util.Collection<TaskInfo>, Filters)), or (b) these resources are rescinded (refer to offerRescinded(org.apache.mesos.SchedulerDriver, OfferID)). Note that resources may be concurrently offered to more than one framework at a time, depending on the allocator being used. In this case, the first framework to launch tasks using these resources will be able to use them, while the other frameworks will have these resources rescinded. (Alternatively, if a framework has already launched tasks with these resources, these tasks will fail with a TASK_LOST status and a message saying as much).
The following are the parameters:
- driver: This is the driver that was used to run this scheduler
- offers: These are the resources offered to this framework
offerRescinded: Run the following code:
```
void offerRescinded(SchedulerDriver driver,
  OfferID offerId)
```
This is invoked when an offer is no longer valid (for example, the slave is lost or another framework is used resources in the offer). If, for whatever reason, an offer is never rescinded (for example, a dropped message, failing over framework, and so on), a framework that attempts to launch tasks using an invalid offer will receive a TASK_LOST status update for these tasks (take a look at resourceOffers(org.apache.mesos.SchedulerDriver, java.util.List<Offer>)).
The following are the parameters:
- driver: This is the driver that was used to run this scheduler
- offerID: This is the ID of the offer that was rescinded
statusUpdate: Take a look at the following code:
```
void statusUpdate(SchedulerDriver driver,
  TaskStatus status)
```
The preceding code is invoked when the status of a task changes (for example, a slave is lost, so the task is lost; a task is finished, and an executor sends a status update saying so; and so on). If, for whatever reason, the scheduler is aborted during this callback or the process exits, then another status update will be delivered. (Note, however, that this is currently not true if the slave sending the status update is lost or fails during this time.)
The parameters are as follows:
- driver: This is the driver that was used to run this scheduler
- status: This is the status update, which includes the task ID and status
frameworkMessage: Take a look at the following code:
```
void frameworkMessage(SchedulerDriver driver,
  ExecutorID executorId,
  SlaveID slaveId,
  byte[] data)
```
The preceding code is invoked when an executor sends a message. These messages are sent on a best effort basis and should not be expected to be retransmitted in any reliable fashion.
The parameters are as follows:
- driver: This is the driver that received the message
- ExecutorID: This is the ID of the executor that sent the message
- SlaveID: This is the ID of the slave that launched the executor
- data: This is the message payload
disconnected: Run the following:
```
void disconnected(SchedulerDriver driver)
```
This is invoked when the scheduler becomes disconnected from the master (for example, the master fails and another takes over).
The following is the parameter:
- driver: This is the driver that was used to run this scheduler
slaveLost: Execute the following code:
```
void slaveLost(SchedulerDriver driver,
  SlaveID slaveId)
```
This is invoked when a slave is determined unreachable (for example, machine failure or network partition). Most frameworks need to reschedule any tasks launched on this slave on a new slave.
The following are the parameters:
- driver: This is the driver that was used to run this scheduler
- SlaveID: This is the ID of the slave that was lost
executorLost: Run the following:
```
void executorLost(SchedulerDriver driver,
  ExecutorID executorId,
  SlaveID slaveId,
  int status)
```
The preceding is invoked when an executor is exited or terminated. Note that any running task will have the TASK_LOST status update automatically generated.
The following are the parameters:
- driver: This is the driver that was used to run this scheduler
- ExecutorID: This is the ID of the executor that was lost
- slaveID: This is the ID of the slave that launched the executor
- status: This is the exit status of the executor
error: Run the following code:
```
void error(SchedulerDriver driver,
  java.lang.String message)
```
The preceding is invoked when there is an unrecoverable error in the scheduler or driver. The driver will be aborted before invoking this callback.
The following are the parameters:
- driver: This is the driver that was used to run this scheduler
- message: This is the error message

The Scheduler Driver API

A brief description of the Scheduler Driver API is given below. For more details, visit http://mesos.apache.org/api/latest/java/org/apache/mesos/SchedulerDriver.html

start: Run the following code:
```
Status start()
```
This starts the scheduler driver. It needs to be called before any other driver calls are made.
The preceding returns the state of the driver after the call.
stop: Execute the following code:
```
Status stop(boolean failover)
```
This stops the scheduler driver. If the failover flag is set to false, it is expected that this framework will never reconnect to Mesos. So, Mesos will unregister the framework and shut down all its tasks and executors. If failover is true, all executors and tasks will remain running (for some framework-specific failover timeout), allowing the scheduler to reconnect (possibly in the same process or from a different process—for example, on a different machine).
The following is the parameter:
- failover: This is whether framework failover is expected
This returns the state of the driver after the call.
Stop: Run the following line:
```
Status stop()
```
This stops the scheduler driver assuming no failover. This will cause Mesos to unregister the framework and shut down all its tasks and executors.
This returns the state of the driver after the call.
abort: Execute the following code:
```
Status abort()
```
This aborts the driver so that no more callbacks can be made to the scheduler. The semantics of abort and stop are deliberately separated so that code can detect an aborted driver (via the return status of join(); refer to the following section) and instantiate and start another driver if desired from within the same process.
This returns the state of the driver after the call.
join: Run the following:
```
Status join()
```
This waits for the driver to be stopped or aborted, possibly blocking the current thread indefinitely. The return status of this function can be used to determine whether the driver was aborted (take a look at mesos.proto for a description of Status).
This returns the state of the driver after the call.
run: Execute the following:
```
Status run()
```
This starts and immediately joins (that is, blocks) the driver.
It returns the state of the driver after the call.
requestResources: Take a look at the following:
```
Status requestResources(java.util.Collection<Request> requests)
```
This requests resources from Mesos (take a look at mesos.proto for a description of Request and how, for example, to request resources from specific slaves). Any resources available are offered to the framework via the Scheduler.resourceOffers(org.apache.mesos.SchedulerDriver, java.util.List<Offer>) callback asynchronously.
The following is the parameter:
- requests: These are the resource requests.
It returns the state of the driver after the call.
launchTasks: Use the following code:
```
Status launchTasks(java.util.Collection<OfferID> offerIds,
  java.util.Collection<TaskInfo> tasks,
  Filters filters)
```
The preceding code launches the given set of tasks on a set of offers. Resources from offers are aggregated when more than one is provided. Note that all the offers must belong to the same slave. Any resources remaining (that is, not used by the tasks or their executors) will be considered declined. The specified filters are applied on all unused resources (take a look at mesos.proto for a description of Filters). Invoking this function with an empty collection of tasks declines offers in their entirety (refer to declineOffer(OfferID, Filters)).
The following are the parameters:
- offerIds: This is the collection of offer IDs
- tasks: This is the collection of tasks to be launched
- filters: This is the filters to set for any remaining resources.
It returns the state of the driver after the call.
killTask: Execute the following code:
```
Status killTask(TaskID taskId)
```
This kills the specified task. Note that attempting to kill a task is currently not reliable. If, for example, a scheduler fails over while it attempts to kill a task, it will need to retry in the future. Likewise, if unregistered/disconnected, the request will be dropped (these semantics may be changed in the future).
The following is the parameter:
- taskId: This is the ID of the task to be killed
It returns the state of the driver after the call.
declineOffer: Run the following code:
```
Status declineOffer(OfferID offerId,
  Filters filters)
```
This declines an offer in its entirety and applies the specified filters on the resources (take a look at mesos.proto for a description of Filters). Note that this can be done at any time, and it is not necessary to do this within the Scheduler.resourceOffers(org.apache.mesos.SchedulerDriver, java.util.List<Offer>) callback.
The following are the parameters:
- offerId: This is the ID of the offer to be declined
- filters: These are the filters to be set for any remaining resources
It returns the state of the driver after the call.
reviveOffers: Execute the following:
```
Status reviveOffers()
```
This removes all the filters previously set by the framework (via launchTasks(java.util.Collection<OfferID>, java.util.Collection<TaskInfo>, Filters)). This enables the framework to receive offers from these filtered slaves.
It returns the state of the driver after the call.
sendFrameworkMessage: Take a look at the following:
```
Status sendFrameworkMessage(ExecutorID executorId,
  SlaveID slaveId,
  byte[] data)
```
This sends a message from the framework to one of its executors. These messages are sent on a best effort basis and should not be expected to be retransmitted in any reliable fashion.
The parameters are:
- executorId: This is the ID of the executor to send the message to
- slaveId: This is the ID of the slave that runs the executor
- data: This is the message
It returns the state of the driver after the call.
reconcileTasks: Take a look at the following code:
```
Status reconcileTasks(java.util.Collection<TaskStatus> statuses)
```
This allows the framework to query the status for nonterminal tasks. This causes the master to send back the latest task status for each task in statuses if possible. Tasks that are no longer known will result in a TASK_LOST update. If statuses is empty, the master will send the latest status for each task currently known.
The following are the parameters:
- statuses: This is the collection of nonterminal TaskStatus protobuf messages to reconcile.
It returns the state of the driver after the call.

Mesos in production

Mesos is in production at several companies such as Apple, Twitter, and HubSpot and has even been used by start-ups such as Mattermark and Sigmoid. This broad appeal is a validation of Mesos' tremendous utility. Apple, for example, powers its consumer-facing, mission–critical, popular Siri application through a large Mesos cluster (allegedly spanning tens of thousands of nodes). One such case study (published on the Mesosphere website) is discussed here.

Case study on HubSpot

Following case study on HubSpot can be found here https://mesosphere.com/mesos-case-study-hubspot/. An excerpt from this link is given below:

HubSpot uses Apache Mesos to run a mixture of web services, long-running processes, and scheduled jobs that comprise their SaaS application. Mesos allows HubSpot to dynamically deploy services, which in turn reduces developer friction and time to deploy, increases reliability, achieves better resource utilization, and reduces hardware costs.

Mesos provides the core infrastructure to build a next-generation deployment system similar to what Heroku provides as a product. On top of Mesos, HubSpot built their own scheduler that is capable of executing both long-running services and scheduled jobs and is the interface through which the development team can view the state of their applications inside the cloud. Building a scheduler framework enables HubSpot to better understand the core concepts inside Mesos, be comfortable with failure modes, and customize user experience.

The cluster environment

Over 150 services run inside Mesos at HubSpot. HubSpot utilizes many hundreds of servers inside Amazon EC2, and the Mesos cluster comprises about 30% of these resources and is aggressively ramping up as more and more services are migrated to Mesos. As Mesos can easily handle large or small server footprints, hundreds of smaller servers are replaced with dozens of larger ones.

Benefits

Mesos provides numerous benefits to both the development team and the company. At HubSpot, developers own the operation of their applications. With Mesos, developers can deploy services faster and with less maintenance. Here are some of the other benefits:

Developers get immediate access to cluster resources, whether it be to scale or introduce new services.
Developers no longer need to understand the process of requisitioning hardware or servers, and it is easier to scale up the resource requirements inside Mesos than it is to recreate servers with more or less CPUs and memory.
Hardware failures are more transparent to developers as services are automatically replaced when tasks are lost or they fail. In other words, developers are no longer paged because of a simple hardware failure.
Scheduled tasks (cron jobs) are now exposed via a web interface and are not tied to a single server, which may fail at any time, taking the cron job with it.

Mesos also simplifies the technology stack required to requisition hardware and manage it from an operations perspective. HubSpot can standardize server footprints and simplify the base image upon which Mesos slaves are executed.

Lastly, resource utilization is improved, which directly corresponds with reducing costs. Services, which previously ran on overprovisioned hardware now use the exact amount of resources requested.

Additionally, the QA environment runs at 50% of its previous capacity as the HubSpot scheduler ensures that services are restarted when they fail. This means that it is no longer necessary to run multiple copies of services inside QA for high availability.

Challenges

A core challenge behind adoption is introducing a new deployment technology to a group of 100 engineers who are responsible for managing their applications on a daily basis. HubSpot mitigated this challenge by building a UI around Mesos and utilizing Mesos to make the deployment process as simple and rewarding as possible.

Looking ahead

HubSpot sees Mesos as a core technology behind future migrations into other datacenters. As both a virtualization and deployment technology, Mesos has proven to be a rewarding path forward. Additionally, HubSpot hopes to eventually leverage Mesos to dynamically scale out processes based on load, shrink and grow the cluster size relative to demand, and assist developers with resource estimation.

Tip

Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look. The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Mesos. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Summary

In this chapter, we introduced Mesos, dived deep into its architecture, and discussed some important topics, such as frameworks, resource allocation, and resource isolation. We also discussed the two-level scheduling approach that Mesos employs and provided a detailed overview of its API. The HubSpot case study at the end was to show how it is used in production and that it is ready for prime time. The objective was to explain what Mesos is and why it is required and provide a high-level overview of how it works.

In the next chapter, we will deep dive into its important features and understand how it contributes to scaling, efficiency, high availability, and extendibility.

About the Authors

Dipa Dubhashi

Dipa Dubhashi is an alumnus of the prestigious Indian Institute of Technology and heads product management at Sigmoid. His prior experience includes consulting with ZS Associates besides founding his own start-up. Dipa specializes in envisioning enterprise big data products, developing their roadmaps, and managing their development to solve customer use cases across multiple industries. He advises several leading start-ups as well as Fortune 500 companies about architecting and implementing their next-generation big data solutions. Dipa has also developed a course on Apache Spark for a leading online education portal and is a regular speaker at big data meetups and conferences.
Browse publications by this author
Akhil Das

Akhil Das is a senior software developer at Sigmoid primarily focusing on distributed computing, real-time analytics, performance optimization, and application scaling problems using a wide variety of technologies such as Apache Spark and Mesos, among others. He contributes actively to the Apache Spark project and is a regular speaker at big data conferences and meetups, MesosCon 2015 being the most recent one.
Browse publications by this author

obsolete software and system installation method. no help to correct this.

Very good book, great structure, lots of code examples, good hands on to build up a mesos environment from scratch!