Chapter 12. Security

In this chapter, we will cover the following recipes:

  • Introduction

  • Encrypting disk using LUKS

  • Configuring Hadoop users

  • HDFS encryption at Rest

  • Configuring SSL in Hadoop

  • In-transit encryption

  • Enabling service level authorization

  • Securing ZooKeeper

  • Configuring auditing

  • Configuring Kerberos server

  • Configuring and enabling Kerberos for Hadoop

Introduction


In this chapter, we will configure the Hadoop cluster to run in secure mode and enable authentication, authorization, and encryption of data in transit. By default, Hadoop runs in non-secure mode, with no access control on data blocks or on service-level access. All the Hadoop daemons can be run as a single user, hadoop, without any concern for security or for which daemon accesses what.

In addition to this, it is important to encrypt the disks and HDFS data at rest, and also to enable Kerberos for the authentication of service access. By default, an HDFS block can be accessed by any map or reduce task, but when Kerberos is enabled, all such access is verified.

Note

Each directory, whether on HDFS or on local disk, must have the right permissions and should allow only the permissions necessary to run the service, and no more. Refer to the following link for the recommended permissions on each directory in Hadoop:

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SecureMode...
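As an illustration of the kind of layout that document recommends, the HDFS metadata and block directories are typically owner-only, while the configuration directory is readable but root-owned. The data paths below are examples, not taken from this book; substitute your own directories:

    # Example only; adjust paths and users to your own layout
    chown -R hdfs:hadoop /data/dfs/nn /data/dfs/dn
    chmod 700 /data/dfs/nn /data/dfs/dn
    chown -R root:hadoop /opt/cluster/hadoop/etc/hadoop
    chmod 755 /opt/cluster/hadoop/etc/hadoop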

Encrypting disk using LUKS


Before we even start with Hadoop, it is important to secure the system at the operating system and network levels. Users are expected to have prior knowledge of securing Linux and networks; in this recipe, we will only look at disk encryption.

It is good practice to encrypt the data disks so that, even if they are stolen, the data is safe. The entire disk can be encrypted, or just the disk where critical data resides.

Getting ready

To step through the recipes in this chapter, make sure you have at least one node with CentOS 6 or above installed. It does not matter which flavor of Linux you choose, as long as you are comfortable with it. Users must have prior knowledge of Linux installation and basic commands. The same settings apply to all the nodes in the cluster.

How to do it...

  1. Connect to a node that will later be used to install Hadoop, or whose data disk will be configured for a Namenode or Datanode. We are using the nn1.cluster1.com node.

  2. Make sure you switch to...
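The remaining steps follow the standard LUKS workflow with cryptsetup. A minimal sketch is shown below, assuming the disk to be encrypted is /dev/sdb and it will be mounted at /data; both the device and the mount point are placeholders, so substitute your own:

    # Initialize LUKS on the raw device (this destroys any existing data)
    cryptsetup luksFormat /dev/sdb

    # Open the encrypted device; it appears as /dev/mapper/secure_data
    cryptsetup luksOpen /dev/sdb secure_data

    # Create a filesystem on the mapped device and mount it
    mkfs.ext4 /dev/mapper/secure_data
    mkdir -p /data
    mount /dev/mapper/secure_data /data

The passphrase supplied to luksFormat is needed every time the device is opened, so plan how it will be provided (manually or via a protected key file) before the Datanode is expected to start automatically.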

Configuring Hadoop users


In this recipe, we will configure users to run Hadoop services so as to have better control of access by daemons.

In all the recipes so far, we have configured all services/daemons, whether HDFS, YARN, or Hive, to run as the user hadoop. This is not the right practice for production clusters, as it makes it difficult to control services in a fine-grained manner.

It is recommended to segregate services to run as different users, for example, HDFS daemons as hdfs:hadoop, YARN daemons as yarn:hadoop, and other services such as Hive or HBase with their own respective users.
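As a rough sketch of what this separation looks like at the operating system level (the user and group names follow the common hdfs/yarn/mapred convention, and the data paths are examples only):

    # Create a common hadoop group and one user per service
    groupadd hadoop
    useradd -g hadoop hdfs     # Namenode, Datanode, JournalNode
    useradd -g hadoop yarn     # ResourceManager, NodeManager
    useradd -g hadoop mapred   # MapReduce JobHistory server

    # Hand the on-disk directories over to the owning service user
    chown -R hdfs:hadoop /data/dfs
    chown -R yarn:hadoop /data/yarn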

Getting ready

To step through the recipe in this section, we need a Hadoop cluster already configured, and it is assumed that users are aware of Hadoop installation and configuration. Refer to Chapter 1, Hadoop Architecture and Deployment, for the installation and configuration of a Hadoop cluster. In this recipe, we are just separating the daemons to run as different users, rather than them all...

HDFS encryption at Rest


In this recipe, we will look at transparent HDFS encryption, that is, encryption of data at rest. A typical use case is a cluster shared within a company, where a financial team and other groups use HDFS to store critical data.

The concept involves a Key Management Server (KMS), which provides the keys, and encryption zones, which secure data using those keys. Access to the data requires the key, and data in an encryption zone cannot be moved to a non-encrypted zone without the proper key.

Getting ready

To step through the recipe in this section, we need a Hadoop cluster with at least HDFS configured. The changes can be made on one node and the modified files then copied across all nodes in the cluster.

How to do it...

  1. Connect to the master node in the cluster; we are using the nn1.cluster1.com node.

  2. Switch to user hadoop or root and make all the changes, as shown in the following steps.

  3. Edit the file /opt/cluster/hadoop/etc/hadoop/core-site.xml and enable the KMS store by adding the following...
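As a minimal sketch of the kind of property this step adds (the KMS host kms.cluster1.com and port 16000 are assumptions for illustration, not values from the book):

    <!-- Sketch only: point HDFS at the KMS key provider -->
    <property>
        <name>hadoop.security.key.provider.path</name>
        <value>kms://http@kms.cluster1.com:16000/kms</value>
    </property>

Once the KMS is reachable, keys and encryption zones are created with the standard tools, along the lines of hadoop key create mykey followed by hdfs crypto -createZone -keyName mykey -path /secure.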

Configuring SSL in Hadoop


In this recipe, we will configure SSL for Hadoop services. We can configure SSL for Web UI, WebHDFS, YARN, shuffle phase, RPC, and so on. The important components for enabling SSL are certificates, keystore, and truststore. These must individually be kept secure and safe.

SSL can be configured one-way or two-way; the preferred method is one-way SSL, in which the client validates the server's identity. Two-way SSL increases latency and involves configuration overhead.

Getting ready

To complete this recipe, the user must have a running cluster with HDFS and YARN set up. Users can refer to Chapter 1, Hadoop Architecture and Deployment, for installation details.

The assumption here is that the user is very familiar with HDFS concepts and its layout, and also with how SSL works, with experience of creating SSL certificates. For this recipe, we will be using self-signed certificates, but for production it is recommended to use a proper CA-signed certificate...
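For reference, a self-signed keystore and truststore of the kind used here can be produced with the JDK keytool. The alias, file names, and passwords below are placeholders rather than values mandated by the book:

    # Generate a key pair and self-signed certificate in a keystore
    keytool -genkeypair -alias nn1.cluster1.com -keyalg RSA -keysize 2048 \
            -dname "CN=nn1.cluster1.com" -keystore keystore.jks \
            -storepass StorePass123 -keypass StorePass123

    # Export the certificate and import it into a truststore used by clients
    keytool -exportcert -alias nn1.cluster1.com -keystore keystore.jks \
            -storepass StorePass123 -file nn1.crt
    keytool -importcert -alias nn1.cluster1.com -file nn1.crt \
            -keystore truststore.jks -storepass TrustPass123 -noprompt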

In-transit encryption


In this recipe, we will configure in-transit encryption to secure the transfer of data between nodes during the shuffle phase. The mapper output is consumed by reducers, which can run on different nodes, so to protect the transfer channel, we secure the communication between mappers and reducers. We will also be securing the RPC communication channel, although this induces a slight overhead and should be set up only if it is absolutely necessary.

Getting ready

To complete this recipe, the user must have completed the previous Configuring SSL in Hadoop recipe. We will be extending the configuration already set up in that recipe by adding a few more options.

Note

It is recommended that users explore SSL and learn more about ciphers to understand their security and performance implications.

How to do it...

  1. Connect to the nn1.cluster1.com master node and switch to user hadoop.

  2. To enable RPC privacy, edit core-site.xml to add the following lines on each node in the cluster:

    <...
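The standard Hadoop property for RPC privacy looks roughly like the following sketch (the exact lines used in this recipe may differ); block data transfer and the shuffle phase have their own switches (dfs.encrypt.data.transfer in hdfs-site.xml and mapreduce.shuffle.ssl.enabled in mapred-site.xml):

    <!-- Sketch only: protect RPC traffic between daemons and clients -->
    <property>
        <name>hadoop.rpc.protection</name>
        <!-- one of: authentication, integrity, privacy -->
        <value>privacy</value>
    </property>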

Enabling service level authorization


In this recipe, we will look at service level authorization, which is a mechanism to ensure that clients connecting to Hadoop services have the right permissions and authorization to access them. This is more of a global control compared to control at the job queue level; it governs, for example, which users can submit jobs to the cluster, or which Datanodes can connect to the Namenode, based on the Datanode service user.

Service level authorization checks are performed before any other checks, such as file permissions or permissions on sub-queues.

Getting ready

For this recipe, you will need a running cluster with HDFS and YARN configured, and it is good to have a basic understanding of Linux users and permissions.

How to do it...

  1. Connect to the nn1.cluster1.com master node and switch to user hadoop.

  2. All the configuration goes into the hadoop-policy.xml file on each node in the cluster.

  3. Firstly, allow all users to connect as DFSclient using the following configuration...
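The standard way to express this in hadoop-policy.xml is an ACL value of *, which means all users and groups; the sketch below also shows how a more restrictive ACL for Datanode registration might look (the hdfs user here is an assumption):

    <!-- Sketch only: allow every user to talk to HDFS via the client protocol -->
    <property>
        <name>security.client.protocol.acl</name>
        <value>*</value>
    </property>

    <!-- Restrict which service users may register Datanodes with the Namenode -->
    <property>
        <name>security.datanode.protocol.acl</name>
        <value>hdfs</value>
    </property>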

Securing ZooKeeper


Another important component to secure is ZooKeeper, as it plays a central role in the Hadoop cluster. The nodes contributing to the quorum should communicate over a secure channel and should be safeguarded against any clear-text exchanges.

In this recipe, we will configure ZooKeeper to run in secure mode by enabling SSL. The ZooKeeper build used for this secure connection must support Netty, and we will enable Netty in the existing ZooKeeper setup from Chapter 11, Troubleshooting, Diagnostics, and Best Practices.

Getting ready

Make sure that the user has completed the ZooKeeper configuration recipe in Chapter 4, High Availability. We will be using the existing ZooKeeper cluster and adding the configuration for securing it. Also, the user must have completed the Configuring SSL in Hadoop recipe, as we will be using the existing keystore file and truststore for this recipe.

How to do it...

  1. Connect to the nn1.cluster1.com Namenode and switch to user hadoop.

  2. We...
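The remaining steps switch ZooKeeper to the Netty connection factory and point it at the keystore and truststore created earlier. The sketch below assumes a ZooKeeper release (3.5 or later) that supports Netty-based SSL; the keystore paths and passwords are placeholders:

    # zoo.cfg: listen for TLS clients on a dedicated secure port
    secureClientPort=2281

    # Server JVM flags (for example via SERVER_JVMFLAGS in zookeeper-env.sh):
    # use the Netty connection factory and supply the key material
    SERVER_JVMFLAGS="-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory \
     -Dzookeeper.ssl.keyStore.location=/opt/cluster/security/keystore.jks \
     -Dzookeeper.ssl.keyStore.password=StorePass123 \
     -Dzookeeper.ssl.trustStore.location=/opt/cluster/security/truststore.jks \
     -Dzookeeper.ssl.trustStore.password=TrustPass123"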

Configuring auditing


In this recipe, we will touch upon auditing in Hadoop, which is important for keeping track of who did what and when. All users must be held accountable for their actions, and to make that possible, we need to track user activity by enabling audit logs. There are two audit logs, one for users and one for services, which help answer important questions such as: Who touched my files? Is data being accessed from protected IPs?

Getting ready

For this recipe, you will again need a running cluster with HDFS and YARN. Users must have completed the Configuring multi-node cluster recipe.

How to do it...

  1. Connect to the nn1.cluster1.com master node and switch to user hadoop.

  2. The file where these changes will be made is log4j.properties.

  3. The categories that control audit logging are log4j.category.SecurityLogger for the services and, for each of HDFS, Mapred, and YARN, separate audit logger categories under log4j.logger.org.apache.hadoop.

  4. To enable audits for...
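To give an idea of what these entries look like, the stock Hadoop log4j.properties ships the audit categories wired to a NullAppender, so enabling auditing is mostly a matter of pointing them at a real appender. A sketch using the appender names defined in the default file:

    # Service-level security events (SecurityLogger) go to the security log
    hadoop.security.logger=INFO,RFAS
    log4j.category.SecurityLogger=${hadoop.security.logger}

    # HDFS audit events (who accessed which path) go to hdfs-audit.log
    hdfs.audit.logger=INFO,RFAAUDIT
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}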

Configuring Kerberos server


In this recipe, we will configure a Kerberos server and look at some of the fundamental components of Kerberos, which are important for understanding how it works and for laying the foundation for setting up Kerberos for Hadoop. Refer to the following diagram, which explains the working of Kerberos:

Kerberos consists of two main components: the authentication server (AS) and the key distribution center (KDC), which has the ticket granting server (TGS) as a subcomponent. The clients, which could be users, hosts, or services, are called principals; they authenticate to the AS and, on success, are granted a ticket-granting ticket (TGT), which is a token used to access other services in the respective realm (domain).

The password is never sent over the wire; the response carrying the TGT is encrypted using a key derived from the client's password. The TGT is cached by the client and can be used to connect to any service or host within the realm, or across realms if a trust relationship is configured.

KDC is the middleman between clients and services...
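The recipe then walks through setting up the server itself. A condensed sketch of that setup on CentOS is shown below; the realm name CLUSTER1.COM is an assumption derived from the cluster1.com domain used throughout this book:

    # Install the KDC and admin server packages
    yum install -y krb5-server krb5-libs krb5-workstation

    # After defining the realm in /etc/krb5.conf and kdc.conf,
    # create the Kerberos database (-s writes the stash file)
    kdb5_util create -s -r CLUSTER1.COM

    # Create an administrative principal and start the services
    kadmin.local -q "addprinc admin/admin@CLUSTER1.COM"
    service krb5kdc start
    service kadmin start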

Configuring and enabling Kerberos for Hadoop


In this recipe, we will be configuring Kerberos for the Hadoop cluster and enabling the authentication of services using tokens. Each service and user must have its principal created and exported to keytab files. These keytab files must be readable by the Hadoop daemons so that they can obtain their credentials and perform operations.

It is assumed that the user has completed the previous Configuring Kerberos server recipe and is comfortable using Kerberos.

Getting ready

Make sure that the user has a running multi-node cluster with HDFS and YARN fully functional, and a Kerberos server set up.

How to do it...

  1. First, make sure that all the nodes are in time sync and that DNS is fully set up.

  2. On each node in the cluster, install the Kerberos workstation packages using the following commands:

    # yum install -y krb5-libs krb5-workstation
    
  3. Connect to the KDC server rep.cluster1.com and create a host key for each host in the cluster, as shown in the following...
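A sketch of what those commands look like for a single node follows; the realm CLUSTER1.COM and the keytab path are assumptions for illustration:

    # On the KDC: create host and service principals with random keys
    kadmin.local -q "addprinc -randkey host/nn1.cluster1.com@CLUSTER1.COM"
    kadmin.local -q "addprinc -randkey hdfs/nn1.cluster1.com@CLUSTER1.COM"
    kadmin.local -q "addprinc -randkey HTTP/nn1.cluster1.com@CLUSTER1.COM"

    # Export the keys into a keytab that will be copied to the node
    kadmin.local -q "ktadd -k /etc/security/keytabs/hdfs.keytab hdfs/nn1.cluster1.com@CLUSTER1.COM HTTP/nn1.cluster1.com@CLUSTER1.COM"

    # On the node: restrict the keytab to the service user and verify it
    chown hdfs:hadoop /etc/security/keytabs/hdfs.keytab
    chmod 400 /etc/security/keytabs/hdfs.keytab
    klist -kt /etc/security/keytabs/hdfs.keytab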
