Chapter 12. Security

In this chapter, we will cover the following recipes:

  • Introduction

  • Encrypting disk using LUKS

  • Configuring Hadoop users

  • HDFS encryption at Rest

  • Configuring SSL in Hadoop

  • In-transit encryption

  • Enabling service level authorization

  • Securing ZooKeeper

  • Configuring auditing

  • Configuring Kerberos server

  • Configuring and enabling Kerberos for Hadoop

Introduction


In this chapter, we will configure the Hadoop cluster to run in secure mode and enable authentication, authorization, and encryption of data in transit. By default, Hadoop runs in non-secure mode, with no access control on data blocks or on service-level access. All the Hadoop daemons can be run as a single user, hadoop, without any concern for security or for which daemon accesses what.

In addition to this, it is important to encrypt the disks and HDFS data at rest, and also to enable Kerberos for the authentication of service access. By default, an HDFS block can be accessed by any map or reduce task, but when Kerberos is enabled, all such access is verified.

Note

Each directory, whether on HDFS or on local disk, must have the right permissions and should allow only the permissions necessary to run the service, and no more. Refer to the following link for the recommended permissions on each directory in Hadoop:

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SecureMode...
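As an illustration of the kind of layout that document recommends, the HDFS metadata and block directories are typically owner-only, while the configuration directory is readable but root-owned. The data paths below are examples, not taken from this book; substitute your own directories:

    # Example only; adjust paths and users to your own layout
    chown -R hdfs:hadoop /data/dfs/nn /data/dfs/dn
    chmod 700 /data/dfs/nn /data/dfs/dn
    chown -R root:hadoop /opt/cluster/hadoop/etc/hadoop
    chmod 755 /opt/cluster/hadoop/etc/hadoop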

Encrypting disk using LUKS


Before we even start with Hadoop, it is important to secure the system at the operating system and network levels. Users are expected to have prior knowledge of securing Linux and networks; in this recipe, we will only look at disk encryption.

It is good practice to encrypt the data disks so that, even if they are stolen, the data is safe. The entire disk can be encrypted, or just the disk where critical data resides.

Getting ready

To step through the recipes in this chapter, make sure you have at least one node with CentOS 6 or above installed. It does not matter which flavor of Linux you choose, as long as you are comfortable with it. Users must have prior knowledge of Linux installation and basic commands. The same settings apply to all the nodes in the cluster.

How to do it...

  1. Connect to a node that will later be used to install Hadoop, or whose data disk will be configured for a Namenode or Datanode. We are using the nn1.cluster1.com node.

  2. Make sure you switch to...
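The remaining steps follow the standard LUKS workflow with cryptsetup. A minimal sketch is shown below, assuming the disk to be encrypted is /dev/sdb and it will be mounted at /data; both the device and the mount point are placeholders, so substitute your own:

    # Initialize LUKS on the raw device (this destroys any existing data)
    cryptsetup luksFormat /dev/sdb

    # Open the encrypted device; it appears as /dev/mapper/secure_data
    cryptsetup luksOpen /dev/sdb secure_data

    # Create a filesystem on the mapped device and mount it
    mkfs.ext4 /dev/mapper/secure_data
    mkdir -p /data
    mount /dev/mapper/secure_data /data

The passphrase supplied to luksFormat is needed every time the device is opened, so plan how it will be provided (manually or via a protected key file) before the Datanode is expected to start automatically.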

Configuring Hadoop users


In this recipe, we will configure users to run Hadoop services so as to have better control of access by daemons.

In all the recipes so far, we have configured all services/daemons, whether HDFS, YARN, or Hive, to run as the user hadoop. This is not the right practice for production clusters, as it makes it difficult to control services in a fine-grained manner.

It is recommended to segregate services to run as different users, for example, HDFS daemons as hdfs:hadoop, YARN daemons as yarn:hadoop, and other services such as Hive or HBase with their own respective users.
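As a rough sketch of what this separation looks like at the operating system level (the user and group names follow the common hdfs/yarn/mapred convention, and the data paths are examples only):

    # Create a common hadoop group and one user per service
    groupadd hadoop
    useradd -g hadoop hdfs     # Namenode, Datanode, JournalNode
    useradd -g hadoop yarn     # ResourceManager, NodeManager
    useradd -g hadoop mapred   # MapReduce JobHistory server

    # Hand the on-disk directories over to the owning service user
    chown -R hdfs:hadoop /data/dfs
    chown -R yarn:hadoop /data/yarn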

Getting ready

To step through the recipe in this section, we need a Hadoop cluster already configured, and it is assumed that users are aware of Hadoop installation and configuration. Refer to Chapter 1, Hadoop Architecture and Deployment, for the installation and configuration of a Hadoop cluster. In this recipe, we are just separating the daemons to run as different users, rather than them all...

HDFS encryption at Rest


In this recipe, we will look at transparent HDFS encryption, that is, encryption of data at rest. A typical use case is a cluster shared within a company, where a financial team and other groups use HDFS to store critical data.

The concept involves a Key Management Server (KMS), which provides the keys, and encryption zones, which secure data using those keys. Access to the data requires the key, and data in an encryption zone cannot be moved to a non-encrypted zone without the proper key.

Getting ready

To step through the recipe in this section, we need a Hadoop cluster with at least HDFS configured. The changes can be made on one node and the modified files then copied across all nodes in the cluster.

How to do it...

  1. Connect to the master node in the cluster; we are using the nn1.cluster1.com node.

  2. Switch to user hadoop or root and make all the changes, as shown in the following steps.

  3. Edit the file /opt/cluster/hadoop/etc/hadoop/core-site.xml and enable the KMS store by adding the following...
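As a minimal sketch of the kind of property this step adds (the KMS host kms.cluster1.com and port 16000 are assumptions for illustration, not values from the book):

    <!-- Sketch only: point HDFS at the KMS key provider -->
    <property>
        <name>hadoop.security.key.provider.path</name>
        <value>kms://http@kms.cluster1.com:16000/kms</value>
    </property>

Once the KMS is reachable, keys and encryption zones are created with the standard tools, along the lines of hadoop key create mykey followed by hdfs crypto -createZone -keyName mykey -path /secure.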

Configuring SSL in Hadoop


In this recipe, we will configure SSL for Hadoop services. We can configure SSL for Web UI, WebHDFS, YARN, shuffle phase, RPC, and so on. The important components for enabling SSL are certificates, keystore, and truststore. These must individually be kept secure and safe.

SSL can be configured one-way or two-way; the preferred method is one-way SSL, in which the client validates the server's identity. Two-way SSL increases latency and involves configuration overhead.

Getting ready

To complete this recipe, the user must have a running cluster with HDFS and YARN set up. Users can refer to Chapter 1, Hadoop Architecture and Deployment, for installation details.

The assumption here is that the user is very familiar with HDFS concepts and its layout, and also with how SSL works, with experience of creating SSL certificates. For this recipe, we will be using self-signed certificates, but for production it is recommended to use a proper CA-signed certificate...
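For reference, a self-signed keystore and truststore of the kind used here can be produced with the JDK keytool. The alias, file names, and passwords below are placeholders rather than values mandated by the book:

    # Generate a key pair and self-signed certificate in a keystore
    keytool -genkeypair -alias nn1.cluster1.com -keyalg RSA -keysize 2048 \
            -dname "CN=nn1.cluster1.com" -keystore keystore.jks \
            -storepass StorePass123 -keypass StorePass123

    # Export the certificate and import it into a truststore used by clients
    keytool -exportcert -alias nn1.cluster1.com -keystore keystore.jks \
            -storepass StorePass123 -file nn1.crt
    keytool -importcert -alias nn1.cluster1.com -file nn1.crt \
            -keystore truststore.jks -storepass TrustPass123 -noprompt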

In-transit encryption


In this recipe, we will configure in-transit encryption to secure the transfer of data between nodes during the shuffle phase. The mapper output is consumed by reducers, which can run on different nodes, so to protect the transfer channel, we secure the communication between mappers and reducers. We will also be securing the RPC communication channel, although this induces a slight overhead and should be set up only if it is absolutely necessary.

Getting ready

To complete this recipe, the user must have completed the previous Configuring SSL in Hadoop recipe. We will be extending the configuration already set up in that recipe by adding a few more options.

Note

It is recommended that users explore SSL and learn more about ciphers to understand their security and performance implications.

How to do it...

  1. Connect to the nn1.cluster1.com master node and switch to user hadoop.

  2. To enable RPC privacy, edit core-site.xml to add the following lines on each node in the cluster:

    <...
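The standard Hadoop property for RPC privacy looks roughly like the following sketch (the exact lines used in this recipe may differ); block data transfer and the shuffle phase have their own switches (dfs.encrypt.data.transfer in hdfs-site.xml and mapreduce.shuffle.ssl.enabled in mapred-site.xml):

    <!-- Sketch only: protect RPC traffic between daemons and clients -->
    <property>
        <name>hadoop.rpc.protection</name>
        <!-- one of: authentication, integrity, privacy -->
        <value>privacy</value>
    </property>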

Enabling service level authorization


In this recipe, we will look at service level authorization, which is a mechanism to ensure that clients connecting to Hadoop services have the right permissions and authorization to access them. This is more of a global control compared to control at the job queue level; it governs, for example, which users can submit jobs to the cluster, or which Datanodes can connect to the Namenode, based on the Datanode service user.

Service level authorization checks are performed before any other checks, such as file permissions or permissions on sub-queues.

Getting ready

For this recipe, you will need a running cluster with HDFS and YARN configured, and it is good to have a basic understanding of Linux users and permissions.

How to do it...

  1. Connect to the nn1.cluster1.com master node and switch to user hadoop.

  2. All the configuration goes into the hadoop-policy.xml file on each node in the cluster.

  3. Firstly, allow all users to connect as DFSclient using the following configuration...
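The standard way to express this in hadoop-policy.xml is an ACL value of *, which means all users and groups; the sketch below also shows how a more restrictive ACL for Datanode registration might look (the hdfs user here is an assumption):

    <!-- Sketch only: allow every user to talk to HDFS via the client protocol -->
    <property>
        <name>security.client.protocol.acl</name>
        <value>*</value>
    </property>

    <!-- Restrict which service users may register Datanodes with the Namenode -->
    <property>
        <name>security.datanode.protocol.acl</name>
        <value>hdfs</value>
    </property>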

Securing ZooKeeper


Another important component to secure is ZooKeeper, as it plays a central role in the Hadoop cluster. The nodes contributing to the quorum should communicate over a secure channel and should be safeguarded against any clear-text exchanges.

In this recipe, we will configure ZooKeeper to run in secure mode by enabling SSL. The ZooKeeper build used for this secure connection must support Netty, and we will enable Netty in the existing ZooKeeper setup from Chapter 11, Troubleshooting, Diagnostics, and Best Practices.

Getting ready

Make sure that the user has completed the ZooKeeper configuration recipe in Chapter 4, High Availability. We will be using the existing ZooKeeper cluster and adding the configuration for securing it. Also, the user must have completed the Configuring SSL in Hadoop recipe, as we will be using the existing keystore file and truststore for this recipe.

How to do it...

  1. Connect to the nn1.cluster1.com Namenode and switch to user hadoop.

  2. We...
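The remaining steps switch ZooKeeper to the Netty connection factory and point it at the keystore and truststore created earlier. The sketch below assumes a ZooKeeper release (3.5 or later) that supports Netty-based SSL; the keystore paths and passwords are placeholders:

    # zoo.cfg: listen for TLS clients on a dedicated secure port
    secureClientPort=2281

    # Server JVM flags (for example via SERVER_JVMFLAGS in zookeeper-env.sh):
    # use the Netty connection factory and supply the key material
    SERVER_JVMFLAGS="-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory \
     -Dzookeeper.ssl.keyStore.location=/opt/cluster/security/keystore.jks \
     -Dzookeeper.ssl.keyStore.password=StorePass123 \
     -Dzookeeper.ssl.trustStore.location=/opt/cluster/security/truststore.jks \
     -Dzookeeper.ssl.trustStore.password=TrustPass123"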

Configuring auditing


In this recipe, we will touch upon auditing in Hadoop, which is important for keeping track of who did what and when. All users must be held accountable for their actions, and to make that possible, we need to track user activity by enabling audit logs. There are two audit logs, one for users and one for services, which help answer important questions such as: Who touched my files? Is data being accessed from protected IPs?

Getting ready

For this recipe, you will again need a running cluster with HDFS and YARN. Users must have completed the Configuring multi-node cluster recipe.

How to do it...

  1. Connect to the nn1.cluster1.com master node and switch to user hadoop.

  2. The file where these changes will be made is log4j.properties.

  3. The categories that control audit logging are log4j.category.SecurityLogger for the services and, for each of HDFS, Mapred, and YARN, separate audit logger categories under log4j.logger.org.apache.hadoop.

  4. To enable audits for...
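To give an idea of what these entries look like, the stock Hadoop log4j.properties ships the audit categories wired to a NullAppender, so enabling auditing is mostly a matter of pointing them at a real appender. A sketch using the appender names defined in the default file:

    # Service-level security events (SecurityLogger) go to the security log
    hadoop.security.logger=INFO,RFAS
    log4j.category.SecurityLogger=${hadoop.security.logger}

    # HDFS audit events (who accessed which path) go to hdfs-audit.log
    hdfs.audit.logger=INFO,RFAAUDIT
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}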

Configuring Kerberos server


In this recipe, we will configure a Kerberos server and look at some of the fundamental components of Kerberos, which are important for understanding how it works and for laying the foundation for setting up Kerberos for Hadoop. Refer to the following diagram, which explains the working of Kerberos:

Kerberos consists of two main components: the authentication server (AS) and the key distribution center (KDC), which has the ticket granting server (TGS) as a subcomponent. The clients, which could be users, hosts, or services, are called principals; they authenticate to the AS and, on success, are granted a ticket-granting ticket (TGT), which is a token used to access other services in the respective realm (domain).

The password is never sent over the wire; the response carrying the TGT is encrypted using a key derived from the client's password. The TGT is cached by the client and can be used to connect to any service or host within the realm, or across realms if a trust relationship is configured.

KDC is the middleman between clients and services...
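The recipe then walks through setting up the server itself. A condensed sketch of that setup on CentOS is shown below; the realm name CLUSTER1.COM is an assumption derived from the cluster1.com domain used throughout this book:

    # Install the KDC and admin server packages
    yum install -y krb5-server krb5-libs krb5-workstation

    # After defining the realm in /etc/krb5.conf and kdc.conf,
    # create the Kerberos database (-s writes the stash file)
    kdb5_util create -s -r CLUSTER1.COM

    # Create an administrative principal and start the services
    kadmin.local -q "addprinc admin/admin@CLUSTER1.COM"
    service krb5kdc start
    service kadmin start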

Configuring and enabling Kerberos for Hadoop


In this recipe, we will be configuring Kerberos for the Hadoop cluster and enabling the authentication of services using tokens. Each service and user must have its principal created and exported to keytab files. These keytab files must be readable by the Hadoop daemons so that they can obtain their credentials and perform operations.

It is assumed that the user has completed the previous Configuring Kerberos server recipe and is comfortable using Kerberos.

Getting ready

Make sure that the user has a running multi-node cluster with HDFS and YARN fully functional, and a Kerberos server set up.

How to do it...

  1. First, make sure that all the nodes are in time sync and that DNS is fully set up.

  2. On each node in the cluster, install the Kerberos workstation packages using the following commands:

    # yum install -y krb5-libs krb5-workstation
    
  3. Connect to the KDC server rep.cluster1.com and create a host key for each host in the cluster, as shown in the following...
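A sketch of what those commands look like for a single node follows; the realm CLUSTER1.COM and the keytab path are assumptions for illustration:

    # On the KDC: create host and service principals with random keys
    kadmin.local -q "addprinc -randkey host/nn1.cluster1.com@CLUSTER1.COM"
    kadmin.local -q "addprinc -randkey hdfs/nn1.cluster1.com@CLUSTER1.COM"
    kadmin.local -q "addprinc -randkey HTTP/nn1.cluster1.com@CLUSTER1.COM"

    # Export the keys into a keytab that will be copied to the node
    kadmin.local -q "ktadd -k /etc/security/keytabs/hdfs.keytab hdfs/nn1.cluster1.com@CLUSTER1.COM HTTP/nn1.cluster1.com@CLUSTER1.COM"

    # On the node: restrict the keytab to the service user and verify it
    chown hdfs:hadoop /etc/security/keytabs/hdfs.keytab
    chmod 400 /etc/security/keytabs/hdfs.keytab
    klist -kt /etc/security/keytabs/hdfs.keytab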
