Hadoop Cluster Deployment

Hadoop Cluster Deployment: Construct a modern Hadoop data platform effortlessly and gain insights into how to manage clusters efficiently

By Danil Zburivsky

Product Details

Publication date : Nov 25, 2013
Length : 126 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781783281718
Vendor : Apache


Chapter 1. Setting Up Hadoop Cluster – from Hardware to Distribution

Hadoop is a free and open source distributed storage and computational platform. It was created to allow storing and processing large amounts of data using clusters of commodity hardware. In the last couple of years, Hadoop has become the de facto standard for big data projects. In this chapter, we will cover the following topics:

  • Choosing Hadoop cluster hardware

  • Hadoop distributions

  • Choosing OS for the Hadoop cluster

This chapter will give an overview of the Hadoop philosophy when it comes to choosing and configuring hardware for the cluster. We will also review the different Hadoop distributions, the number of which is growing every year. This chapter will explain the similarities and differences between those distributions.

For you, as a Hadoop administrator or an architect, the practical part of cluster implementation starts with making decisions on what kind of hardware to use and how much of it you will need. There are some essential questions that need to be answered before you can place your hardware order, roll up your sleeves, and start setting things up. Among them are questions related to cluster design: how much data will the cluster need to store? What are the projections for the data growth rate? What will be the main data access pattern? Will the cluster be used mostly for predefined, scheduled tasks, or will it be a multitenant environment used for exploratory data analysis? Hadoop's architecture and data access model allow great flexibility. It can accommodate different types of workload, such as batch processing of huge amounts of data or supporting real-time analytics with projects like Impala.

At the same time, some clusters are better suited for specific types of work and hence it is important to arrive at the hardware specification phase with the ideas about cluster design and purpose in mind. When dealing with clusters of hundreds of servers, initial decisions about hardware and general layout will have a significant influence on a cluster's performance, stability, and associated costs.

Choosing Hadoop cluster hardware


Hadoop is a scalable, shared-nothing clustered system for massively parallel data processing. The whole concept of Hadoop is that a single node doesn't play a significant role in the overall cluster reliability and performance. This design assumption leads to choosing hardware that can efficiently process small (relative to the total data size) amounts of data on a single node and doesn't require much reliability and redundancy on the hardware level. As you may already know, several types of servers comprise a Hadoop cluster. There are master nodes, such as the NameNode, Secondary NameNode, and JobTracker, and worker nodes called DataNodes. In addition to the core Hadoop components, it is a common practice to deploy several auxiliary servers, such as Gateways, a Hue server, and a Hive Metastore. A typical Hadoop cluster can look like the following diagram:

Typical Hadoop cluster layout

The roles that these types of servers play in a cluster are different, and so are the requirements for hardware specification and reliability of these nodes. We will first discuss different hardware configurations for DataNodes and then talk about typical setups for the NameNode and JobTracker.

Choosing the DataNode hardware

The DataNode is the main worker node in a Hadoop cluster and it plays two main roles: it stores pieces of HDFS data and executes MapReduce tasks. The DataNode is Hadoop's primary storage and computational resource. One may think that since DataNodes play such an important role in a cluster, you should use the best hardware available for them. This is not entirely true. Hadoop was designed with the idea that DataNodes are "disposable workers": servers that are fast enough to do useful work as part of the cluster, but cheap enough to be easily replaced if they fail. The frequency of hardware failures in large clusters was probably one of the most important considerations the core Hadoop developers had in mind. Hadoop addresses this issue by moving the redundancy implementation from the cluster hardware to the cluster software itself.

Note

Hadoop provides redundancy on many levels. Each DataNode stores only some of the blocks for the HDFS files, and those blocks are replicated multiple times to different nodes, so in the event of a single server failure, data remains accessible. The cluster can even tolerate the failure of multiple nodes, depending on the configuration you choose. Hadoop goes beyond that and allows you to specify which servers reside on which racks; it tries to store copies of data on separate racks, thus significantly increasing the probability that your data remains accessible even if a whole rack goes down (though this is not a strict guarantee). This design means that there is no reason to invest in RAID controllers for Hadoop DataNodes.

Instead of using RAID for local disks, a setup known as JBOD (Just a Bunch Of Disks) is the preferred choice. It provides better performance for Hadoop workloads and reduces hardware costs. You don't have to worry about individual disk failures, since redundancy is provided by HDFS.
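For reference, the replication behavior described in the note above is controlled by a single property in hdfs-site.xml. A fragment showing the default value of 3 (which you would normally leave as-is) might look like this:

```xml
<!-- hdfs-site.xml: HDFS block replication factor; the default is 3 -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```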

Storing data is the first role that the DataNode plays. The second role is to serve as a data processing node and execute custom MapReduce code. MapReduce jobs are split into lots of separate tasks, which are executed in parallel on multiple DataNodes; for a job to produce logically consistent results, all subtasks must be completed.

This means that Hadoop has to provide redundancy not only on the storage layer, but also on the computational layer. Hadoop achieves this by retrying failed tasks on different nodes, without interrupting the whole job. It also keeps track of nodes that have an abnormally high rate of failures or have been responding slower than others; eventually, such nodes can be blacklisted and excluded from the cluster.

So, what should the hardware for a typical DataNode look like? Ideally, a DataNode should be a balanced system with a reasonable amount of disk storage and processing power. Defining a "balanced system" and a "reasonable amount of storage" is not as simple a task as it may sound. There are many factors that come into play when you are trying to spec out an optimal and scalable Hadoop cluster. Two of the most important considerations are total cluster storage capacity and cluster storage density, and these parameters are tightly related. Total cluster storage capacity is relatively simple to estimate; it answers the question of how much data can be put into the cluster. The following is a list of steps that you can take to estimate the required capacity for your cluster:

  1. Identify data sources: List all known data sources and decide whether a full or partial initial data import will be required. You should reserve 15-20 percent of your total cluster capacity, or even more, to accommodate new data sources or unplanned data growth.

  2. Estimate the data growth rate: Each identified data source will have a data ingestion rate associated with it. For example, if you are planning to do daily exports from your OLTP database, you can easily estimate how much data this source will produce over the course of a week, month, year, and so on. You will need to do some test exports to get an accurate number.

  3. Multiply your estimated storage requirements by the replication factor: So far, we have talked about usable storage capacity. Hadoop achieves redundancy on the HDFS level by copying data blocks several times and placing them on different nodes in the cluster. By default, each block is replicated three times. You can adjust this parameter by either increasing or decreasing the replication factor; setting it to 1 eliminates the cluster's data redundancy and should not be done. So, to get the raw cluster storage capacity, you need to multiply your estimates by the replication factor. If you estimated that you need 300 TB of usable storage this year and you are planning to use a replication factor of 3, your raw capacity will be 900 TB.

  4. Factor in MapReduce temporary files and system data: MapReduce tasks produce intermediate data that is passed from the map execution phase to the reduce phase. This temporary data doesn't reside on HDFS, but you need to allocate about 25-30 percent of total server disk capacity for temporary files. Additionally, you will need separate disk volumes for the operating system, but storage requirements for the OS are usually insignificant.

Identifying the total usable and raw cluster storage capacity is the first step in nailing down hardware specifications for the DataNode. In further discussions, we will mean raw capacity when referring to the cluster's total available storage, since this is what's important from the hardware perspective. Another important metric is storage density, which is the total cluster storage capacity divided by the number of DataNodes in the cluster. Generally, you have two choices: either deploy lots of servers with low storage density, or use fewer servers with higher storage density. We will review both options and outline the pros and cons of each.
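The estimation steps above can be turned into a quick back-of-the-envelope sizing script. All the input figures below are illustrative assumptions, not recommendations:

```python
import math

# Illustrative inputs for a cluster capacity estimate.
usable_tb = 300        # usable storage needed this year (steps 1 and 2)
growth_reserve = 0.20  # 15-20 percent headroom for new sources and growth
replication = 3        # default HDFS replication factor (step 3)
temp_share = 0.25      # ~25-30 percent of disk for MapReduce temp files (step 4)
node_density_tb = 12   # storage density, e.g. 6 x 2 TB drives per DataNode

hdfs_raw_tb = usable_tb * (1 + growth_reserve) * replication  # 1080 TB
total_disk_tb = hdfs_raw_tb / (1 - temp_share)                # 1440 TB
datanodes = math.ceil(total_disk_tb / node_density_tb)        # 120 nodes

print(f"HDFS raw capacity: {hdfs_raw_tb:.0f} TB")
print(f"Total disk needed: {total_disk_tb:.0f} TB")
print(f"DataNodes at {node_density_tb} TB per node: {datanodes}")
```

Dividing the total disk requirement by the per-node density, as in the last step, is exactly the storage-density trade-off discussed next: fewer, denser nodes or more, lighter ones.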

Low storage density cluster

Historically, Hadoop clusters were deployed on servers with reasonably low storage density. This allowed scaling clusters to petabytes of storage capacity using the low-capacity hard drives available on the market at that time. While hard drive capacity has increased significantly over the last several years, using a large low-density cluster is still a valid option for many. Cost is the main reason you would want to go this route. Individual Hadoop node performance is driven not only by storage capacity, but by the balance between RAM, CPU, and disks. Having lots of storage on every DataNode, but not enough RAM and CPU resources to process all the data, will not be beneficial in most cases.

It is always hard to give specific recommendations about the Hadoop cluster hardware. A balanced setup will depend on the cluster workload, as well as the allocated budget. New hardware appears on the market all the time, so any considerations should be adjusted accordingly. To illustrate hardware selection logic for a low density cluster, we will use the following example:

Let's assume we have picked up a server with 6 HDD slots. If we choose reasonably priced 2 TB hard drives, it will give us 12 TB of raw capacity per server.

Note

There is little reason to choose faster 15,000 rpm drives for your cluster. Sequential read/write performance matters much more for a Hadoop cluster than random access speed. 7,200 rpm drives are the preferred choice in most cases.

For a low density server, our main aim is to keep the cost low enough to be able to afford a large number of machines. Two 4-core CPUs match this requirement and will give reasonable processing power. Each map or reduce task will utilize a single CPU core, but since some time will be spent waiting on IO, it is OK to oversubscribe CPU cores. With 8 cores available, we can configure about 12 map/reduce slots per node.
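In MRv1, the per-node slot counts described above are configured on each TaskTracker in mapred-site.xml. A split of the 12 slots into 8 map and 4 reduce slots might look like the following (the 8/4 split itself is an illustrative assumption; tune it to your workload):

```xml
<!-- mapred-site.xml: 12 total slots on an 8-core node, ~1.5x oversubscribed -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```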

Each task will require 2 to 4 GB of RAM. 36 GB of RAM is a reasonable choice for this type of server, though going with 48 GB would be ideal. Note that we are trying to balance the different components. It's of little use to significantly increase the amount of RAM in this configuration, because you will not be able to schedule enough tasks on one node to properly utilize it.
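A quick sanity check of this RAM sizing, using the slot count and per-task range from the example above, shows why more memory would simply sit idle:

```python
# RAM demand from task slots on the low-density DataNode example.
slots = 12               # configured map/reduce slots per node
min_gb, max_gb = 2, 4    # per-task RAM range from the text

low = slots * min_gb     # 24 GB if every task uses 2 GB
high = slots * max_gb    # 48 GB if every task uses 4 GB
print(f"task RAM demand: {low}-{high} GB, plus OS and Hadoop daemon overhead")
```

With at most 48 GB demanded by tasks, RAM beyond that cannot be consumed without also adding slots, and adding slots would outrun the 8 available cores.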

Let's say you are planning to store 500 TB of data in your cluster. With the default replication factor of 3, this will result in 1500 TB of raw capacity. If you use the low density DataNode configuration with 12 TB per server, you will need 125 servers to satisfy this requirement. If you double the required capacity, you will need 250 servers in your cluster. Managing a large number of servers has lots of challenges of its own. You will need to consider whether there is enough physical room in your data center to host additional racks. Additional power consumption and air conditioning also present significant challenges as the number of servers grows. To address these problems, you can increase the storage capacity of an individual server, as well as tune up other hardware specs.

High storage density cluster

Many companies are looking into building smaller Hadoop clusters with more storage and computational power per server. Besides addressing the issues mentioned above, such clusters can be a better fit for workloads where huge amounts of storage are not a priority. Such workloads are computationally intensive and include machine learning, exploratory analytics, and other problems.

The logic behind choosing and balancing hardware components for a high density cluster is the same as for a lower density one. As an example of such a configuration, we will choose a server with 16 x 2 TB hard drives or 24 x 1 TB hard drives. Having more lower-capacity disks per server is preferable, because it will provide better IO throughput and better fault tolerance. To increase the computational power of the individual machine, we will use 16 CPU cores and 96 GB of RAM.

NameNode and JobTracker hardware configuration

Hadoop implements a centralized coordination model, where a node (or a group of nodes) coordinates tasks among the servers that comprise the cluster. The server responsible for HDFS coordination is called the NameNode and the server responsible for dispatching MapReduce jobs is called the JobTracker. Actually, NameNode and JobTracker are just separate Hadoop processes, but due to their critical roles, in almost all cases these services run on dedicated machines.

The NameNode hardware

NameNode is critical to HDFS availability. It stores all the filesystem metadata: which blocks comprise which files, on which DataNodes these blocks can be found, how many free blocks are available, and which servers can host them. Without the NameNode, data in HDFS is almost completely useless. The data is actually still there, but without the NameNode you will not be able to reconstruct files from data blocks, nor will you be able to upload new data. For a long time, the NameNode was a single point of failure, which was less than ideal for a system that advertises high fault tolerance and redundancy of most components and processes. This was addressed with the introduction of the NameNode High Availability setup in Apache Hadoop 2.0.0, but the hardware requirements for the NameNode are still very different from those outlined for the DataNode in the previous section. Let's start with the memory estimates for the NameNode. The NameNode has to store all HDFS metadata, including the file and directory structures and block allocations, in memory. This may sound like wasteful usage of RAM, but the NameNode has to guarantee fast access to files on hundreds or thousands of machines, so using hard drives for accessing this information would be too slow. According to the Apache Hadoop documentation, each HDFS block will occupy approximately 250 bytes of RAM on the NameNode, plus an additional 250 bytes will be required for each file and directory. Let's say you have 5,000 files with an average of 20 GB per file. If you use the default HDFS block size of 64 MB and a replication factor of 3, your NameNode will need to hold information about roughly 5 million block replicas, which will require 5 million x 250 bytes plus filesystem overhead, or roughly 1.5 GB of RAM. This is not as much as you may have imagined, but in most cases a Hadoop cluster has many more files in total, and since each file will consist of at least one block, memory usage on the NameNode will be much higher.
There is no penalty for having more RAM on the NameNode than your cluster requires at the moment, so overprovisioning is fine. Systems with 64-96 GB of RAM are a good choice for the NameNode server.
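The memory estimate above can be sketched as a short calculation. The 250-byte figures come from the text; the file count and sizes are the example's assumptions:

```python
# Rough NameNode heap estimate for the example above:
# ~250 bytes per block replica, plus ~250 bytes per file/directory object.
BYTES_PER_OBJECT = 250

files = 5_000
avg_file_gb = 20
block_mb = 64
replication = 3

blocks_per_file = (avg_file_gb * 1024) // block_mb      # 320 blocks per file
block_replicas = files * blocks_per_file * replication  # 4.8 million replicas
objects = block_replicas + files                        # plus per-file metadata

heap_gb = objects * BYTES_PER_OBJECT / 1024**3
print(f"block replicas: {block_replicas:,}")
print(f"approximate heap: {heap_gb:.2f} GB before JVM and filesystem overhead")
```

The raw figure comes out just above 1.1 GB; adding filesystem and JVM overhead brings it to the roughly 1.5 GB quoted in the text.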

To guarantee the persistency of filesystem metadata, the NameNode has to keep a copy of its memory structures on disk as well. For this, the NameNode maintains a file called the editlog, which captures all changes that happen to HDFS, such as the creation of new files and directories or replication factor changes. This is very similar to the redo logfiles that most relational databases use. In addition to the editlog, the NameNode maintains a full snapshot of the current HDFS metadata state in an fsimage file. In the case of a restart or server crash, the NameNode will take the latest fsimage and apply all the changes from the editlog file that need to be applied to restore a valid point-in-time state of the filesystem.

Unlike traditional database systems, the NameNode delegates the task of periodically applying changes from the editlog to the fsimage to a separate server called the Secondary NameNode. This is done to keep the editlog file size under control, because changes that are already applied to the fsimage are no longer required in the logfile, and also to minimize the recovery time. Since these files mirror data structures that the NameNode keeps in memory, disk space requirements for them are normally pretty low. The fsimage will not grow bigger than the amount of RAM you allocated for the NameNode, and the editlog will be rotated once it reaches 64 MB, by default. This means that you can keep the disk space requirements for the NameNode in the 500 GB range. Using RAID on the NameNode makes a lot of sense, because it protects critical data from individual disk crashes. Besides serving filesystem requests from HDFS clients, the NameNode also has to process heartbeat messages from all DataNodes in the cluster. This type of workload requires significant CPU resources, so it's a good idea to provision 8-16 CPU cores for the NameNode, depending on the planned cluster size.

In this book, we will focus on setting up NameNode HA, which will require Primary and Standby NameNodes to be identical in terms of hardware. More details on how to achieve high availability for NameNode will be provided in Chapter 2, Installing and Configuring Hadoop.

The JobTracker hardware

Besides the NameNode and Secondary NameNode, there is another master server in the Hadoop cluster called the JobTracker. Conceptually, it plays a similar role for the MapReduce framework as the NameNode does for HDFS. The JobTracker is responsible for submitting user jobs to TaskTrackers, which are services running on each DataNode. TaskTrackers send periodic heartbeat messages to the JobTracker, reporting the current status of running jobs, available map/reduce slots, and so on. Additionally, the JobTracker keeps a history of the last executed jobs (the number is configurable) in memory and provides access to Hadoop-specific or user-defined counters associated with the jobs. While RAM availability is critical to the JobTracker, its memory footprint is normally smaller than that of the NameNode. Having 24-48 GB of RAM for mid- and large-size clusters is a reasonable estimate. You can revise this number if your cluster will be a multitenant environment with thousands of users. By default, the JobTracker doesn't save any state information to disk and uses persistent storage only for logging purposes. This means that the total disk requirements for this service are minimal. Just like the NameNode, the JobTracker needs to be able to process huge amounts of heartbeat information from TaskTrackers, accept and dispatch incoming user jobs, and apply job scheduling algorithms to utilize the cluster most efficiently. These are highly CPU-intensive tasks, so make sure you invest in fast multi-core processors, similar to what you would pick for the NameNode.

All three types of master nodes are critical to Hadoop cluster availability. If you lose the NameNode server, you will lose access to HDFS data. Issues with the Secondary NameNode will not cause an immediate outage, but will delay the filesystem checkpointing process. Similarly, a crash of the JobTracker will cause all running MapReduce jobs to abort, and no new jobs will be able to run. All these consequences call for a different approach to master hardware selection than the one we discussed for DataNodes. Using RAID arrays for critical data volumes, redundant network and power supplies, and potentially higher-grade, enterprise-level hardware components is the preferred choice.

Gateway and other auxiliary services

Gateway servers are the clients' access points to the Hadoop cluster. Interaction with data in HDFS requires connectivity between the client program and all nodes inside the cluster, which is not always practical from a network design and security perspective. Gateways are usually deployed outside of the primary cluster subnet and are used for data imports and other user programs. Additional infrastructure components and different shells can be deployed on standalone servers, or combined with other services. Hardware requirements for these optional services are obviously much lower than those for cluster nodes, and often you can deploy gateways on virtual machines. 4-8 CPU cores and 16-24 GB of RAM is a reasonable configuration for a Gateway node.

Network considerations

In a Hadoop cluster, the network is as important a component as CPU, disks, or RAM. HDFS relies on network communication to update the NameNode on the current filesystem status, as well as to receive and send data blocks to clients. MapReduce jobs also use the network for status messages, but additionally consume bandwidth when a file block has to be read from a DataNode that is not local to the current TaskTracker, and when intermediate data is sent from mappers to reducers. In short, there is a lot of network activity going on in a Hadoop cluster. As of now, there are two main choices when it comes to network hardware: a 1 GbE network is cheap, but rather limited in throughput, while a 10 GbE network can significantly increase the cost of a large Hadoop deployment. Like every other component of the cluster, the network choice will depend on the intended cluster layout.

For larger clusters, we came up with generally lower-spec machines, with fewer disks, less RAM, and fewer CPU cores per node, assuming that a large number of such servers will provide enough capacity. For the smaller cluster, we chose high-end servers. We can use the same arguments when it comes to choosing which network architecture to apply.

For clusters with multiple less powerful nodes, installing 10 GbE makes little sense for two reasons. First of all, it will increase the total cost of building the cluster significantly and you may not be able to utilize all the available network capacity. For example, with six disks per DataNode, you should be able to achieve about 420 MB/sec of local write throughput, which is less than the network bandwidth. This means that the cluster bottleneck will shift from the network to the disks' IO capacity. On the other hand, a smaller cluster of fast servers with lots of storage will most probably choke on a 1 GbE network and most of the server's available resources will be wasted. Since such clusters are normally smaller, a 10 GbE network hardware will not have as big of an impact on the budget as for a larger setup.
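The disk-versus-network arithmetic above can be checked with a few lines. The 70 MB/s per-drive figure is an assumption chosen to be consistent with the 420 MB/s total quoted in the text:

```python
# Compare aggregate local disk throughput to NIC bandwidth on a DataNode.
disks = 6
mb_per_disk = 70                 # assumed sequential write rate per 7200 rpm drive
disk_mb_s = disks * mb_per_disk  # 420 MB/s aggregate local write throughput

nic_1gbe_mb_s = 1_000 / 8        # 1 GbE is roughly 125 MB/s
nic_10gbe_mb_s = 10_000 / 8      # 10 GbE is roughly 1250 MB/s

# On 1 GbE the network is the bottleneck; on 10 GbE the disks are,
# so the extra 10 GbE capacity on a low-density node would go unused.
assert nic_1gbe_mb_s < disk_mb_s < nic_10gbe_mb_s
print(f"disks: {disk_mb_s} MB/s, 1 GbE: {nic_1gbe_mb_s:.0f} MB/s, "
      f"10 GbE: {nic_10gbe_mb_s:.0f} MB/s")
```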

Tip

Most modern servers come with several network controllers. You can use NIC bonding to increase network throughput.

Hadoop hardware summary

Let's summarize the possible Hadoop hardware configurations required for different types of clusters.

DataNode for a low storage density cluster:

  • Storage: 6-8 2 TB hard drives per server, JBOD setup, no RAID
  • CPU: 8 CPU cores
  • RAM: 32-48 GB per node
  • Network: 1 GbE interfaces; bonding of several NICs for higher throughput is possible

DataNode for a high storage density cluster:

  • Storage: 16-24 1 TB hard drives per server, JBOD setup, no RAID
  • CPU: 16 CPU cores
  • RAM: 64-96 GB per node
  • Network: 10 GbE network interface

NameNode and Standby NameNode:

  • Storage: Low disk space requirements: 500 GB should be enough in most cases. RAID 10 or RAID 5 for fsimage and editlog; network-attached storage to hold a copy of these files
  • CPU: 8-16 CPU cores, depending on the cluster size
  • RAM: 64-96 GB
  • Network: 1 GbE or 10 GbE interfaces; bonding of several NICs for higher throughput is possible

JobTracker:

  • Storage: Low disk space requirements: 500 GB should be enough in most cases for logfiles and the job's state information
  • CPU: 8-16 CPU cores, depending on the cluster size
  • RAM: 64-96 GB
  • Network: 1 GbE or 10 GbE interfaces; bonding of several NICs for higher throughput is possible

Hadoop distributions


Hadoop comes in many flavors: there are multiple versions and distributions available from a number of companies. There are several key players in this area today, and we will discuss the options they provide.

Hadoop versions

Hadoop's release versioning system is, to say the least, confusing. There are several branches with different stable versions available, and it is important to understand what features each branch provides (or excludes). As of now, the following Hadoop versions are available: 0.23, 1.0, and 2.0. Surprisingly, higher versions do not always include all the features of the lower versions. For example, 0.23 includes NameNode High Availability and NameNode Federation, but drops support for the traditional MapReduce framework (MRv1) in favor of the new YARN framework (MRv2).

MRv2 is compatible with MRv1 at the API level, but the daemon setup, configuration, and concepts are different. Version 1.0 still includes MRv1, but lacks the NameNode HA and Federation features, which many consider critical for production usage. Version 2.0 is actually based on 0.23 and has the same feature set, but will be used for future development and releases. One of the reasons that Hadoop release versions seem not to follow straightforward logic is that Hadoop is still a relatively new technology, and many features that are highly desirable to some users can introduce instability; sometimes they require significant changes in code and approach, as in the case of YARN. This leads to lots of different code branches with different stable release versions, and lots of confusion for the end user. Since the purpose of this book is to guide you through planning and implementing a production Hadoop cluster, we will focus on stable Hadoop versions that provide proven solutions, such as MRv1, but that also include the important availability features for the NameNode. As you can see, this narrows down the choice of a Hadoop release version right away.

Choosing Hadoop distribution

Apache Hadoop is not the only distribution available. Several other companies maintain their own forks of the project, both free and proprietary. You can probably already see why this would make sense: streamlining the release process for Hadoop and combining features from several Hadoop branches makes it much easier for the end user to implement a cluster. One of the most popular non-Apache distributions of Hadoop is the Cloudera Hadoop Distribution, or CDH.

Cloudera Hadoop distribution

Cloudera is a company that provides commercial support, professional services, and advanced tools for Hadoop. Their CDH distribution is free and open source, under the same Apache 2.0 license. What makes CDH appealing to the end user is that there are fewer code branches, version numbers are aligned, and critical bug fixes are backported to older versions. At this time, the latest major CDH release is CDH4, which combines features from the Apache 2.0 and 1.0 releases. It includes NameNode HA and Federation and supports both MRv1 and MRv2, which none of the Apache releases does at the moment. Another valuable feature that CDH provides is the integration of different Hadoop ecosystem projects. HDFS and MapReduce are the core components of Hadoop, but over time many new projects were built on top of them. These projects make Hadoop more user-friendly, speed up development cycles, make it easy to build multitier MapReduce jobs, and so on.

One of the projects available in CDH that is gaining a lot of attention is Impala, which allows running real-time queries on Hadoop, bypassing the MapReduce layer completely and accessing data directly from HDFS. Having dozens of ecosystem components, each with its own compatibility requirements, and a variety of Apache Hadoop branches does not make integration an easy task. CDH solves this problem for you by providing core Hadoop and most of the popular ecosystem projects, compatible and tested with each other, in one distribution. This is a big advantage for the user, and it has made CDH the most popular Hadoop distribution at the moment (according to Google Trends). In addition to CDH, Cloudera also distributes Cloudera Manager, a web-based management tool to provision, configure, and monitor your Hadoop cluster. Cloudera Manager comes in both free and paid enterprise versions.

Hortonworks Hadoop distribution

Another popular Hadoop distribution is the Hortonworks Data Platform (HDP), by Hortonworks. Similar to Cloudera, Hortonworks provides a pre-packaged distribution of core and ecosystem Hadoop projects, as well as commercial support and services for it. As of now, the latest stable version of HDP is 1.2, while HDP 2.0 is in the Alpha stage; they are based on Apache Hadoop 1.0 and 2.0 respectively. HDP 1.2 provides several features that are not included in the CDH or Apache distributions. Hortonworks implemented NameNode HA on Hadoop 1.0, not by backporting JournalNodes and Quorum-based storage from Apache Hadoop 2.0, but rather by implementing cold cluster failover based on Linux HA solutions. HDP also includes HCatalog, a service that provides an integration point for projects like Pig and Hive. Hortonworks is betting on integrating Hadoop with traditional BI tools, an area that draws lots of interest from existing and potential Hadoop users. HDP includes an ODBC driver for Hive, which is claimed to be compatible with most existing BI tools. Another unique HDP feature is its availability on the Windows platform. Bringing Hadoop to the Windows world will have a big impact on the platform's adoption rate and could make HDP the leading distribution for this operating system, but unfortunately it is still in alpha and can't be recommended for production usage at the moment. When it comes to cluster management and monitoring, HDP includes Apache Ambari, a web-based tool similar to Cloudera Manager, but 100 percent free and open source with no distinction between free and enterprise versions.

MapR

While Cloudera and Hortonworks provide the most popular Hadoop distributions, they are not the only companies that use Hadoop as a foundation for their products, and several other projects should be mentioned here. MapR is a company that provides a Hadoop-based platform. There are several different versions of their product: M3 is a free version with limited features, while M5 and M7 are enterprise-level commercial editions. MapR takes a different approach from Cloudera and Hortonworks. Their software is not free, but it has some features that can be appealing to enterprise users. The major difference between the MapR platform and Apache Hadoop is that instead of HDFS, a proprietary filesystem called MapR-FS is used. MapR-FS is implemented in C++ and provides lower latency and higher concurrency of access than the Java-based HDFS. It is compatible with Hadoop at the API level, but it is a completely different implementation. Other MapR-FS features include the ability to mount the Hadoop cluster as an NFS volume, cluster-wide snapshots, and cluster mirroring. Obviously, all these features rely on the MapR-FS implementation.

As you can see, the modern Hadoop landscape is far from plain, and there are many options to choose from. The list narrows quickly once you consider the requirements for a production cluster. A production Hadoop version needs to be stable and well tested. It needs to include important components, such as NameNode HA and the proven MRv1 framework. For you, as a Hadoop administrator, it is important to be able to easily install Hadoop on multiple nodes without having to handpick the required components and worry about compatibility. These requirements will quickly draw your attention to distributions like CDH or HDP. The rest of this book will focus on the CDH distribution, as it is the most popular choice for production installations right now; it also provides a rich feature set and good stability. It is worth mentioning that Hadoop 2 got its first GA release while this book was in progress. Hadoop 2 brings many new features, such as NameNode High Availability, which were previously available only in CDH.

Choosing OS for the Hadoop cluster


Choosing an operating system for your future Hadoop cluster is a relatively simple task. Hadoop core and its ecosystem components are all written in Java, with a few exceptions. While Java code itself is cross-platform, Hadoop currently runs only on Linux-like systems. The reason is that too many design decisions were made with Linux in mind, making the code surrounding the core Hadoop components, such as the start/stop scripts and the permissions model, dependent on the Linux environment.

When it comes to Linux, Hadoop is fairly indifferent to the specific flavor and runs well on different varieties of the OS: Red Hat, CentOS, Debian, Ubuntu, SUSE, and Fedora. None of these distributions has specific requirements for running Hadoop. In general, nothing prevents Hadoop from working successfully on any other POSIX-style OS, such as Solaris or BSD, if you make sure that all dependencies are resolved properly and all supporting shell scripts work. Still, most production installations of Hadoop run on Linux, and this is the OS we will focus on in our further discussions. Specifically, the examples in this book will focus on CentOS, since it is one of the popular choices for production systems, along with its twin, Red Hat.

Apache Hadoop provides source tarballs, as well as RPM and DEB packages, for stable releases; currently, this is the 1.0 branch. Building Hadoop from source, while still an option, is not recommended for most users, since it requires experience in assembling big Java-based projects and careful dependency resolution. Both the Cloudera and Hortonworks distributions provide an easy way to set up a repository on your servers and install all required packages from there.
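As an illustration, on a CentOS node a CDH install boils down to adding the vendor's yum repository and pulling packages from it. The sketch below follows the general CDH 4 pattern at the time of writing; treat the URL and package names as examples and take the current ones from Cloudera's documentation:

```shell
# Download and install the package that sets up Cloudera's yum
# repository (the URL follows the CDH 4 one-click-install layout
# for RHEL/CentOS 6 and may change between releases):
wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

# With the repository in place, individual Hadoop daemons are
# installed as ordinary packages, for example:
sudo yum install hadoop-hdfs-namenode
sudo yum install hadoop-hdfs-datanode
```

The important point is that the repository resolves inter-package dependencies for you, which is exactly the handpicking work you want to avoid on a multi-node cluster.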

Tip

There is no strict requirement to run the same operating system across all Hadoop nodes, but common sense suggests that the less the node configurations deviate from each other, the easier the cluster is to administer and manage.

Summary


Building a production Hadoop cluster is a complex task with many steps involved. One of the often-overlooked steps in planning the cluster is outlining what kind of workload the future cluster will handle. As you have seen in this chapter, understanding what type of cluster you are building is important for proper sizing and choosing the right hardware configuration. Hadoop was originally designed for commodity hardware, but now it is being adopted by companies whose use cases are different from web giants like Yahoo! and Facebook. Such companies have different goals and resources and should plan their Hadoop cluster accordingly. It is not uncommon to see smaller clusters with more powerful nodes being built to save real estate in the data centers, as well as to keep power consumption under control.

Hadoop is constantly evolving, with new features being added all the time and important new ecosystem projects emerging. Very often, these changes affect the core Hadoop components, and new versions may not always be compatible with the old ones. There are several distributions of Hadoop that an end user can choose from, all providing a good level of integration between the components and even some additional features. It is often tempting to choose the latest and most feature-rich version of Hadoop, but from a reliability perspective, it's better to go with a version that has seen some production burn-in time and is stable enough. This will save you from unpleasant surprises. In the next chapter, we will dive into the details of installing and configuring the core Hadoop components. Roll up your sleeves and get ready to get your hands dirty!


Key benefits

  • Choose the hardware and Hadoop distribution that best suits your needs
  • Get more value out of your Hadoop cluster with Hive, Impala, and Sqoop
  • Learn useful tips for performance optimization and security

Description

Big Data is the hottest trend in the IT industry at the moment. Companies are realizing the value of collecting, retaining, and analyzing as much data as possible. They are therefore rushing to implement the next generation of data platforms, and Hadoop is the centerpiece of these platforms. This practical guide is filled with examples that show you how to successfully build a data platform using Hadoop. Step-by-step instructions explain how to install, configure, and tie all major Hadoop components together, allowing you to avoid common pitfalls, follow best practices, and go beyond the basics when building a Hadoop cluster.

This book will walk you through the process of building a Hadoop cluster from the ground up. Using practical examples and command samples, you will be able to get a cluster up and running in no time, while gaining a deep understanding of how the various Hadoop components work and interact with each other. You will learn how to pick the right hardware for different types of Hadoop clusters and about the differences between the various Hadoop distributions. By the end of this book, you will be able to install and configure several of the most popular Hadoop ecosystem projects, including Hive, Impala, and Sqoop, and you will also get a sneak peek into the pros and cons of running Hadoop in the cloud.

What you will learn

  • Choose the optimal hardware configuration for your Hadoop cluster
  • Decipher the differences between various Hadoop versions and distributions
  • Make your cluster crash-proof with NameNode High Availability
  • Learn tips and tricks for JobTracker, TaskTracker, and DataNodes
  • Discover the most important Hadoop ecosystem projects
  • Get more value out of your cluster by using SQL with Hive and real-time query processing with Impala
  • Set up a proper permissions model for your cluster
  • Secure Hadoop with Kerberos
  • Deploy a Hadoop cluster in a cloud environment


Table of Contents

13 Chapters

  • Hadoop Cluster Deployment
  • Credits
  • About the Author
  • About the Reviewers
  • www.PacktPub.com
  • Preface
  • Setting Up Hadoop Cluster – from Hardware to Distribution
  • Installing and Configuring Hadoop
  • Configuring the Hadoop Ecosystem
  • Securing Hadoop Installation
  • Monitoring Hadoop Cluster
  • Deploying Hadoop to the Cloud
  • Index

