In this article by, Dharmesh Kakadia, author of the book Apache Mesos Essentials, explains how Mesos works internally in detail. We will start off with cluster scheduling and fairness concepts, understanding the Mesos architecture, and we will move on towards resource isolation and fault tolerance implementation in Mesos. In this article, we will cover the following topics:

The Mesos architecture
Resource allocation

(For more resources related to this topic, see here.)

The Mesos architecture

Modern organizations have a lot of different kinds of applications for different business needs. Modern applications are distributed and they are deployed across commodity hardware. Organizations today run different applications in siloed environments, where separate clusters are created for different applications. This static partitioning of cluster leads to low utilization, and all the applications will duplicate the effort of dealing with distributed infrastructures. Not only is this a wasted effort, but it also undermines the fact that distributed systems are hard to build and maintain. This is challenging for both developers and operators. For developers, it is a challenge to build applications that scale elastically and can handle faults that are inevitable in large-scale environment. Operators, on the other hand, have to manage and scale all of these applications individually in siloed environments.

The preceding situation is like trying to develop applications without having an operating system and managing all the devices in a computer. Mesos solves the problems mentioned earlier by providing a data center kernel. Mesos provides a higher-level abstraction to develop applications that treat distributed infrastructure just like a large computer. Mesos abstracts the hardware infrastructure away from the applications from the physical infrastructure.

Mesos makes developers more productive by providing an SDK to easily write data center scale applications. Now, developers can focus on their application logic and do not have to worry about the infrastructure that runs it. Mesos SDK provides primitives to build large-scale distributed systems, such as resource allocation, deployment, and monitoring isolation. They only need to know and implement what resources are needed, and not how they get the resources. Mesos allows you to treat the data center just as a computer.
Mesos makes the infrastructure operations easier by providing elastic infrastructure. Mesos aggregates all the resources in a single shared pool of resources and avoids static partitioning. This makes it easy to manage and increases the utilization.

The data center kernel has to provide resource allocation, isolation, and fault tolerance in a scalable, robust, and extensible way. We will discuss how Mesos fulfills these requirements, as well as some other important considerations of modern data center kernel:

Scalability: The kernel should be scalable in terms of the number of machines and number of applications. As the number of machines and applications increase, the response time of the scheduler should remain acceptable.
Flexibility: The kernel should support a wide range of applications. It should also support diverse frameworks currently running on the cluster and future frameworks as well. The framework should also be able to cope up with the heterogeneity in the hardware, as most clusters are built over time and have a variety of hardware running.
Maintainability: The kernel would be one of the very important pieces of modern infrastructure. As the requirements evolve, the scheduler should be able to accommodate new requirements.
Utilization and dynamism: The kernel should adapt to the changes in resource requirements and available hardware resources and utilize resources in an optimal manner.
Fairness: The kernel should be fair in allocating resources to the different users and/or frameworks. We will see what it means to be fair in detail in the next section.

understanding-mesos-internals-img-0

The design philosophy behind Mesos was to define a minimal interface to enable efficient resource sharing across frameworks and defer the task scheduling and execution to the frameworks. This allows the frameworks to implement diverse approaches toward scheduling and fault tolerance. It also makes the Mesos core simple, and the frameworks and core can evolve independently. The preceding figure shows the overall architecture (http://mesos.apache.org/documentation/latest/mesos-architecture) of a Mesos cluster. It has the following entities:

The Mesos masters
The Mesos slaves
Frameworks
Communication
Auxiliary services

We will describe each of these entities and their role, followed by how Mesos implements different requirements of the data center kernel.

The Mesos slave

The Mesos slaves are responsible for executing tasks from frameworks using the resources they have. The slave has to provide proper isolation while running multiple tasks. The isolation mechanism should also make sure that the tasks get resources that they are promised, and not more or less.

The resources on slaves that are managed by Mesos can be described using slave resources and slave attributes. Resources are elements of slaves that can be consumed by a task, while we use attributes to tag slaves with some information. Slave resources are managed by the Mesos master and are allocated to different frameworks. Attributes identify something about the node, such as the slave having a specific OS or software version, it's part of a particular network, or it has a particular hardware, and so on. The attributes are simple key-value pairs of strings that are passed along with the offers to frameworks. Since attributes cannot be consumed by a running task, they will always be offered for that slave. Mesos doesn't understand the slave attribute, and interpretation of the attributes is left to the frameworks. More information about resource and attributes in Mesos can be found at https://mesos.apache.org/documentation/attributes-resources.

A Mesos resource or attribute can be described as one of the following types:

Scalar values are floating numbers
Range values are a range of scalar values; they are represented as [minValue-maxValue]
Set values are arbitrary strings
Text values are arbitrary strings; they are applicable only to attributes

Names of the resources can be an arbitrary string consisting of alphabets, numbers, "-", "/", ".", "-". The Mesos master handles the cpus, mem, disk, and ports resources in a special way. A slave without the cpus and mem resources will not be advertised to the frameworks. The mem and disk scalars are interpreted in MB. The ports resource is represented as ranges. The list of resources a slave has to offer to various frameworks can be specified as the resources flag. Resources and attributes are separated by a semicolon. For example:

--resources='cpus:30;mem:122880;disk:921600;ports:[21000-29000];bugs:{a,b,c}'
--attributes='rack:rack-2;datacenter:europe;os:ubuntuv14.4'

This slave offers 30 cpus, 102 GB mem, 900 GB disk, ports from 21000 to 29000, and have bugs a, b, and c. The slave has three attributes: rack with value rack-2, datacenter with value europe, and os with value ubuntu14.4.

Mesos does not yet provide direct support for GPUs, but does support custom resource types. This means that if we specify gpu(*):8 as part of --resources, then it will be part of the resource that offers to frameworks. Frameworks can use it just like other resources. Once some of the GPU resources are in use by a task, only the remaining resources will be offered. Mesos does not yet have support for GPU isolation, but it can be extended by implementing a custom isolator. Alternately, we can also specify which slaves have GPUs using attributes, such as --attributes="hasGpu:true".

The Mesos master

The Mesos master is primarily responsible for allocating resources to different frameworks and managing the task life cycle for them. The Mesos master implements fine-grained resource sharing using resource offers. The Mesos master acts as a resource broker for frameworks using pluggable policies. The master decides to offer cluster resources to frameworks in the form of resource offers based on them.

Resources offer represents a unit of allocation in the Mesos world. It's a vector of resource available on a node. An offer represents some resources available on a slave being offered to a particular framework.

Frameworks

Distributed applications that run on top of Mesos are called frameworks. Frameworks implement the domain requirements using the general resource allocation API of Mesos. A typical framework wants to run a number of tasks. Tasks are the consumers of resources and they do not have to be the same. A framework in Mesos consists of two components: a framework scheduler and executors. Framework schedulers are responsible for coordinating the execution. An executor provides the ability to control the task execution. Executors can realize a task execution in many ways. An executor can choose to run multiple tasks, by spawning multiple threads, in an executor, or it can run one task in each executor. Apart from the life cycle and task management-related functions, the Mesos framework API also provides functions to communicate with framework schedulers and executors.

Communication

Mesos currently uses an HTTP-like wire protocol to communicate with the Mesos components. Mesos uses the libprocess library to implement the communication that is located in 3rdparty/libprocess. The libprocess library provides asynchronous communication with processes. The communication primitives have an actor message passing, such as semantics. The libprocess messages are immutable, which makes parallelizing the libprocess internals easier. Mesos communication happens along with the following APIs:

Scheduler API: This is used to communicate with the framework scheduler and master. The internal communication is intended to be used only by the SchedulerDriver API.
Executor API: This is used to communicate with an executor and the Mesos slave.
Internal API: This is used to communicate with the Mesos master and slave.
Operator API: This is the API exposed by Mesos for operators and is used by web UI, among other things. Unlike most Mesos API, the operator API is a synchronous API.

To send a message, the actor does an HTTP POST request. The path is composed by the name of the actor followed by the name of the message. The User-Agent field is set to "libprocess/…" to distinguish from the normal HTTP requests. The message data is passed as the body of the HTTP request. Mesos uses protocol buffers to serialize all the messages (defined in src/messages/messages.proto). The parsing and interpretation of the message is left to the receiving actor.

Here is an example header of a message sent to master to register the framework by scheduler(1) running at 10.0.1.7:53523 address:

POST /master/mesos.internal.RegisterFrameworkMessage HTTP/1.1
User-Agent: libprocess/scheduler(1)@10.0.1.7:53523

The reply message header from the master that acknowledges the framework registration might look like this:

POST /scheduler(1)/mesos.internal.FrameworkRegisteredMessage HTTP/1.1
User-Agent: libprocess/master@10.0.1.7:5050

At the time of writing, there is a very early discussion about rewiring the Mesos Scheduler API and Executor API as a pure HTTP API (https://issues.apache.org/jira/browse/MESOS-2288). This will make the API standard and integration with Mesos for various tools much easier without the need to be dependent on native libmesos. Also, there is an ongoing effort to convert all the internal messages into a standardized JSON or protocol buffer format (https://issues.apache.org/jira/browse/MESOS-1127).

Auxiliary services

Apart from the preceding main components, a Mesos cluster also needs some auxiliary services. These services are not part of Mesos itself, and are not strictly required, but they form a basis for operating the Mesos cluster in production environments. These services include, but are not limited to, the following:

Shared filesystem: Mesos provides a view of the data center as a single computer and allows developers to develop for the data center scale application. With this unified view of resources, clusters need a shared filesystem to truly make the data center a computer. HDFS, NFS (Network File System), and Cloud-based storage options, such as S3, are popular among various Mesos deployments.
Consensus service: Mesos uses a consensus service to be resilient in face of failure. Consensus services, such as ZooKeeper or etcd, provide a reliable leader election in a distributed environment.
Service fabric: Mesos enables users to run a number of frameworks on unified computing resources. With a large number of applications and services running, it's important for users to be able to connect to them in a seamless manner. For example, how do users connect to Hive running on Mesos? How does the Ruby on Rails application discover and connect to the MongoDB database instances when one or both of them are running on Mesos? How is the website traffic routed to web servers running on Mesos? Answering these questions mainly requires service discovery and load balancing mechanisms, but also things such as IP/port management and security infrastructure. We are collectively referring to these services that connect frameworks to other frameworks and users as service fabric.
Operational services: Operational services are essential in managing operational aspects of Mesos. Mesos deployments and upgrades, monitoring cluster health and alerting when human intervention is required, logging, and security are all part of the operational services that play a very important role in a Mesos cluster.

Resource allocation

As a data center kernel, Mesos serves a large variety of workloads and no single scheduler will be able to satisfy the needs of all different frameworks. For example, the way in which a real-time processing framework schedules its tasks will be very different from how a long running service will schedule its task, which, in turn, will be very different from how a batch processing framework would like to use its resources. This observation leads to a very important design decision in Mesos: separation of resource allocation and task scheduling. Resource allocation is all about deciding who gets what resources, and it is the responsibility of the Mesos master. Task scheduling, on the other hand, is all about how to use the resources. This is decided by various framework schedulers according to their own needs. Another way to understand this would be that Mesos handles coarse-grain resource allocation across frameworks, and then each framework does fine-grain job scheduling via appropriate job ordering to achieve its needs.

The Mesos master gets information on the available resources from the Mesos slaves, and based on resource policies, the Mesos master offers these resources to different frameworks. Different frameworks can choose to accept or reject the offer. If the framework accepts a resource offer, the framework allocates the corresponding resources to the framework, and then the framework is free to use them to launch tasks. The following image shows the high-level flow of Mesos resource allocation:

understanding-mesos-internals-img-1

Mesos two level scheduler

Here is the typical flow of events for one framework in Mesos:

The framework scheduler registers itself with the Mesos master.
The Mesos master receives the resource offers from slaves. It invokes the allocation module and decides which frameworks should receive the resource offers.
The framework scheduler receives the resource offers from the Mesos master.
On receiving the resource offers, the framework scheduler inspects the offer to decide whether it's suitable. If it finds it satisfactory, the framework scheduler accepts the offer and replies to the master with the list of executors that should be run on the slave, utilizing the accepted resource offers. Alternatively, the framework can reject the offer and wait for a better offer.
The slave allocates the requested resources and launches the task executors. The executor is launched on slave nodes and runs the framework's tasks. It is up to the framework scheduler to accept or reject the resource offers. Here is an example of events that can happen when allocating resources.
The framework scheduler gets notified about the task's completion or failure. The framework scheduler will continue receiving the resource offers and task reports and launch tasks as it sees fit.
The framework unregisters with the Mesos master and will not receive any further resource offers. Note that this is optional and a long running service, and meta-framework will not unregister during the normal operation.

Because of this design, Mesos is also known as a two-level scheduler. Mesos' two-level scheduler design makes it simpler and more scalable, as the resource allocation process does not need to know how scheduling happens. This makes the Mesos core more stable and scalable. Frameworks and Mesos are not tied to each other and each can iterate independently. Also, this makes porting frameworks easier.

The choice of a two-level scheduler means that the scheduler does not have a global knowledge about resource utilization and the resource allocation decisions can be nonoptimal. One potential concern could be about the preferences that the frameworks have about the kind of resources needed for execution. Data locality, special hardware, and security constraints can be a few of the constraints on which tasks can run. In the Mesos realm, these preferences are not explicitly specified by a framework to the Mesos master, instead the framework rejects all the offers that do not meet its constraints.

The Mesos scheduler

Mesos was the first cluster scheduler to allow the sharing of resources to multiple frameworks. Mesos resource allocation is based on online Dominant Resource Fairness (DRF) called HierarchicalDRF. In a world of single resource static partitioning, fairness is easy to define. DRF extends this concept of fairness to multi-resource settings without the need for static partitioning. Resource utilization and fairness are equally important, and often conflicting, goals for a cluster scheduler. The fairness of resource allocation is important in a shared environment, such as data centers, to ensure that all the users/processes of the cluster get nearly an equal amount of resources.

Min-max fairness provides a well-known mechanism to share a single resource among multiple users. Min-max fairness algorithm maximizes the minimum resources allocated to a user. In its simplest form, it allocates 1/Nth of the resource to each of the users. The weighted min-max fairness algorithm can also support priorities and reservations. Min-max resource fairness has been a basis for many well-known schedulers in operating systems and distributed frameworks, such as Hadoop's fair scheduler (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html), capacity scheduler (https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html), Quincy scheduler (http://dl.acm.org/citation.cfm?id=1629601), and so on. However, it falls short when the cluster has multiple types of resources, such as the CPU, memory, disk, and network. When jobs in a distributed environment use different combinations of these resources to achieve the outcome, the fairness has to be redefined. For example, the two requests <1 CPU, 3 GB> and <3 CPU, 1 GB> come to the scheduler. How do they compare and what is the fair allocation?

DRF generalizes the min-max algorithm for multiple resources. A user's dominant resource is the resource for which the user has a biggest share. For example, if the total resources are <8 CPU, 5 GB>, then for the user allocation of <2 CPU, 1 GB>, the user's dominant share is maximumOf(2/8,1/5) means CPU. A user's dominant share is the fraction of the dominant resource that's allocated to the user. In our example, it would be 25 percent (2/8). DRF applies the min-max algorithm to the dominant share of each user. It has many provable properties:

Strategy proofness: A user cannot gain any advantage by lying about the demands.
Sharing incentive: DRF has a minimum allocation guarantee for each user, and no user will prefer exclusive partitioned cluster of size 1/N over DRF allocation.
Single resource fairness: In case of only one resource, DRF is equivalent to the min-max algorithm.
Envy free: Every user prefers his allocation over any other allocation of other users. This also means that the users with the same requests get equivalent allocations.
Bottleneck fairness: When one resource becomes a bottleneck, and each user has a dominant demand for it, DRF is equivalent to min-max.
Monotonicity: Adding resources or removing users can only increase the allocation of the remaining users.
Pareto efficiency: The allocation achieved by DRF will be pareto efficient, so it would be impossible to make a better allocation for any user without making allocation for some other user worse.

We will not further discuss DRF but will encourage you to refer to the DRF paper for more details at http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf.

Mesos uses role specified in FrameworkInfo for resource allocation decision. A role can be per user or per framework or can be shared by multiple users and frameworks. If it's not set, Mesos will set it to the current user that runs the framework scheduler. An optimization is to use deny resource offers from particular slaves for a specified time period. Mesos can revoke tasks allocation killing those tasks. Before killing a task, Mesos gives the framework a grace period to clean up. Mesos asks the executor to kill the task, but if it does not oblige the request, it will kill the executor and all of its tasks.

Weighted DRF

DRF calculates each role's dominant share and allocates the available resources to the user with the smallest dominant share. In practice, an organization rarely wants to assign resources in a complete fair manner. Most organizations want to allocate resources in a weighted manner, such as 50 percent resources to ads team, 30 percent to QA, and 20 percent to R&D teams. To satisfy this command functionality, Mesos implements weighted DRF, where masters can be configured with weights for different roles. When weights are specified, a client's DRF share will be divided by the weight. For example, a role that has a weight of two will be offered twice as many resources as a role with weight of one.

Mesos can be configured to use weighted DRF using the --weights and --roles flags on the master startup. The --weights flag expects a list of role/weight pairs in the form of role1=weight1 and role2=weight2. Weights do not need to be integers.

We must provide weights for each role that appear in --roles on the master startup.

Reservation

One of the other most asked questions for requirement is the ability to reserve resources. For example, persistent or stateful services, such as memcache, or a database running on Mesos, would need a reservation mechanism to avoid being negatively affected on restart. Without reservation, memcache is not guaranteed to get a resource offer from the slave, which has all the data and would incur significant time in initialization and downtime for the service. Reservation can also be used to limit the resource per role.

Reservation provides guaranteed resources for roles, but improper usage might lead to resource fragmentation and lower utilization of resources.

Note that all the reservation requests go through a Mesos authorization mechanism to ensure that the operator or framework requesting the operation has the proper privileges. Reservation privileges are specified to the Mesos master through ACL along with the rest of the ACL configuration. Mesos supports the following two kinds of reservation:

Static reservation
Dynamic reservation

Static reservation

In static reservation, resources are reserved for a particular role. The restart of the slave after removing the checkpointed state is required to change static reservation. Static reservation is thus typically managed by operators using the --resources flag on the slave. The flag expects a list of name(role):value for different resources. If a resource is assigned to role A, then only frameworks with role A are eligible to get an offer for that resource.

Any resources that do not include a role or resources that are not included in the --resources flag will be included in the default role (default *). For example, --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096" specifies that the slave has 8 CPUs and 4096 MB memory reserved for "ads" role and has 4 CPUs and 2048 MB memory unreserved.

Nonuniform static reservation across slaves can quickly become difficult to manage.

Dynamic reservation

Dynamic reservation allows operators and frameworks to manage reservation more dynamically. Frameworks can use dynamic reservations to reserve offered resources, allowing those resources to only be reoffered to the same framework.

At the time of writing, dynamic reservation is still being actively developed and is targeted toward the next release of Mesos (https://issues.apache.org/jira/browse/MESOS-2018).

When asked for a reservation, Mesos will try to convert the unreserved resources to reserved resources. On the other hand, during the unreserve operation, the previously reserved resources are returned to the unreserved pool of resources.

To support dynamic reservation, Mesos allows a sequence of Offer::Operations to be performed as a response to accepting resource offers. A framework manages reservation by sending Offer::Operations::Reserve and Offer::Operations::Unreserve as part of these operations, when receiving resource offers. For example, consider the framework that receives the following resource offer with 32 CPUs and 65536 MB memory:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "cpus",
      "type" : "SCALAR",
      "scalar" : { "value" : 32 },
      "role" : "*",
    },
    {
      "name" : "mem",
      "type" : "SCALAR",
      "scalar" : { "value" : 65536 },
      "role" : "*",
    }
  ]
}

The framework can decide to reserve 8 CPUs and 4096 MB memory by sending the Operation::Reserve message with resources field with the desired resources state:

[
  {
    "type" : Offer::Operation::RESERVE,
    "resources" : [
      {
        "name" : "cpus",
        "type" : "SCALAR",
        "scalar" : { "value" : 8 },
        "role" : <framework_role>,
        "reservation" : {
          "framework_id" : <framework_id>,
          "principal" : <framework_principal>
        }
      }
      {
        "name" : "mem",
        "type" : "SCALAR",
        "scalar" : { "value" : 4096 },
        "role" : <framework_role>,
        "reservation" : {
          "framework_id" : <framework_id>,
          "principal" : <framework_principal>
        }
      }
    ]
  }
]

After a successful execution, the framework will receive resource offers with reservation. The next offer from the slave might look as follows:

{
  "id" : <offer_id>,
  "framework_id" : <framework_id>,
  "slave_id" : <slave_id>,
  "hostname" : <hostname>,
  "resources" : [
    {
      "name" : "cpus",
      "type" : "SCALAR",
      "scalar" : { "value" : 8 },
      "role" : <framework_role>,
      "reservation" : {
        "framework_id" : <framework_id>,
        "principal" : <framework_principal>
      }
    },
    {
      "name" : "mem",
      "type" : "SCALAR",
      "scalar" : { "value" : 4096 },
      "role" : <framework_role>,
      "reservation" : {
        "framework_id" : <framework_id>,
        "principal" : <framework_principal>
      }
    },
    {
      "name" : "cpus",
      "type" : "SCALAR",
      "scalar" : { "value" : 24 },
      "role" : "*",
    },
    {
      "name" : "mem",
      "type" : "SCALAR",
      "scalar" : { "value" : 61440 },
      "role" : "*",
    }
  ]
}

As shown, the framework has 8 CPUs and 4096 MB memory reserved resources and 24 CPUs and 61440 MB memory underserved in the resource offer. The unreserve operation is similar. The framework on receiving the resource offer can send the unreserve operation message, and subsequent offers will not have reserved resources.

The operators can use/reserve and/unreserve HTTP endpoints of the operator API to manage the reservation. The operator API allows operators to change the reservation specified when the slave starts. For example, the following command will reserve 4 CPUs and 4096 MB memory on slave1 for role1 with the operator authentication principal ops:

ubuntu@master:~ $ curl -d slaveId=slave1 -d resources="{
         {
           "name" : "cpus",
           "type" : "SCALAR",
           "scalar" : { "value" : 4 },
           "role" : "role1",
           "reservation" : {
             "principal" : "ops"
           }
         },
         {
           "name" : "mem",
           "type" : "SCALAR",
           "scalar" : { "value" : 4096 },
           "role" : "role1",
           "reservation" : {
             "principal" : "ops"
           }
         },
       }"
       -X POST http://master:5050/master/reserve

Before we end this discussion on resource allocation, it would be important to note that the Mesos community continues to innovate on the resource allocation front by incorporating interesting ideas, such as oversubscription (https://issues.apache.org/jira/browse/MESOS-354), from academic literature and other systems.

Summary

In this article, we looked at the Mesos architecture in detail and learned how Mesos deals with resource allocation, resource isolation, and fault tolerance. We also saw the various ways in which we can extend Mesos.