
You're reading from Securing Hadoop

Product type: Book
Published in: Nov 2013
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783285259
Edition: 1st Edition

Author: Sudheesh Narayanan

Sudheesh Narayanan is a Technology Strategist and Big Data Practitioner with expertise in technology consulting and implementing Big Data solutions. With over 15 years of IT experience in Information Management, Business Intelligence, Big Data & Analytics, and Cloud & J2EE application development, he provided his expertise in architecting, designing, and developing Big Data products, Cloud management platforms, and highly scalable platform services. His expertise in Big Data includes Hadoop and its ecosystem components, NoSQL databases (MongoDB, Cassandra, and HBase), Text Analytics (GATE and OpenNLP), Machine Learning (Mahout, Weka, and R), and Complex Event Processing. Sudheesh is currently working with Genpact as the Assistant Vice President and Chief Architect – Big Data, with focus on driving innovation and building Intellectual Property assets, frameworks, and solutions. Prior to Genpact, he was the co-inventor and Chief Architect of the Infosys BigDataEdge product.

Appendix A. Solutions Available for Securing Hadoop

This appendix provides an overview of the commercial and open source technologies available to address the various security aspects, and shows how they fit into the reference architecture for securing enterprise Big Data assets.

Hadoop distribution with enhanced security support


The Intel Distribution for Apache Hadoop software provides enhanced security features built into the Hadoop distribution. Some of the key features of Intel's distribution are:

  • It provides an integrated data encryption feature for sensitive data. The encryption is based on OpenSSL 1.0.1c, which is optimized for Intel AES-NI.

  • Apart from encryption, Intel's distribution supports out-of-the-box compression capabilities.

  • Sensitive data is never exposed either in motion or at rest. Thus, the distribution can be used to ingest encrypted data into the Hadoop ecosystem and process it in encrypted form. Encryption keys are integrated using the Java keystore functionality.

  • Intel's Manager for Apache Hadoop Software provides deployment, management, monitoring, alerting, and security features.

  • It provides a feature for managing user access to data and services using Kerberos by creating access control lists (ACLs) and limiting user access to data sets and services.

  • Deployment and setup of the secure Hadoop cluster is automated and integrated with key management systems.

    Note

    More details on Intel's Hadoop Distribution are available at https://hadoop.intel.com.
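The Java keystore functionality mentioned above can be sketched as follows. This is a minimal, generic example of storing and retrieving a symmetric key with a JCEKS keystore; the file name, alias, and password are placeholders, not Intel's actual key layout.

```java
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.io.*;
import java.security.KeyStore;

public class KeystoreDemo {
    public static void main(String[] args) throws Exception {
        char[] storePass = "changeit".toCharArray();

        // Generate a 128-bit AES data-encryption key.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        // Store it in a JCEKS keystore (the default JKS type cannot hold secret keys).
        KeyStore ks = KeyStore.getInstance("JCEKS");
        ks.load(null, storePass);
        ks.setEntry("datakey", new KeyStore.SecretKeyEntry(key),
                new KeyStore.PasswordProtection(storePass));
        try (OutputStream out = new FileOutputStream("hdfs-keys.jceks")) {
            ks.store(out, storePass);
        }

        // Load it back, as an encryption job would at runtime.
        KeyStore loaded = KeyStore.getInstance("JCEKS");
        try (InputStream in = new FileInputStream("hdfs-keys.jceks")) {
            loaded.load(in, storePass);
        }
        SecretKey recovered = (SecretKey) loaded.getKey("datakey", storePass);
        System.out.println(recovered.getAlgorithm()); // prints AES
    }
}
```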

Automation of a secured Hadoop cluster deployment


Let us have a look at some of the most important tools.

Cloudera Manager

Cloudera Manager is one of the most popular Hadoop management and deployment tools. Some of the key features of Cloudera Manager with respect to securing a Hadoop cluster are:

  • Cloudera Manager automates the entire Hadoop cluster setup, including the automated setup of a secure Hadoop cluster with Kerberos. It sets up the keytab files on all the slave nodes and updates the Hadoop configuration with the required keytab locations and service principal details. Cloudera Manager updates the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, oozie-site.xml, hue.ini, and taskcontroller.cfg) without any manual intervention.

  • It supports role-based administration, where read-only administrators monitor the cluster while others can change the deployments.

  • It enables administrators to configure alerts specific to user activity and access. This can be leveraged for security incident and event monitoring.

  • Cloudera can send events to enterprise SIEM tools about security incidents in Hadoop using SNMP.

  • It can integrate user credentials with LDAP and Active Directory.

    Note

    More details on Cloudera Manager are available at the following URL: http://www.cloudera.com/content/cloudera/en/products/cloudera-manager.html.
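For reference, the Kerberos-related settings that such a tool writes into core-site.xml and hdfs-site.xml look like the following. This is a generic illustration: the principal and keytab path are placeholders, and a real deployment involves many more properties.

```xml
<!-- core-site.xml: switch authentication from "simple" to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: NameNode service principal and keytab location -->
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
```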

Zettaset

Zettaset (http://www.zettaset.com/) offers Zettaset Orchestrator, a product that provides seamless secured Hadoop deployment and management. Zettaset doesn't provide its own Hadoop distribution, but works with all distributions such as Cloudera, Hortonworks, and Apache Hadoop. Some of the key features of Zettaset Orchestrator are:

  • It provides an automated deployment of a secured Hadoop cluster

  • It hardens the entire Hadoop deployment from an enterprise perspective to address policy, compliance, access control, and risk management within the Hadoop cluster environment

  • It integrates seamlessly with an existing enterprise security policy framework using LDAP and Active Directory (AD)

  • It provides centralized configuration management, logging, and auditing

  • It provides role-based access controls (RBACs) and enables Kerberos to be seamlessly integrated with the rest of the ecosystem

All other platform management tools, such as Ambari and Greenplum Hadoop Deployment Manager, need manual setup to establish a secured Hadoop cluster. The keytab files, service principals, and configuration files have to be manually deployed on all nodes.

Different Hadoop data encryption options


Let us have a look at the various options available.

Dataguise for Hadoop

Dataguise (DG) for Hadoop provides symmetric-key-based encryption of data. One of the key features of Dataguise is its ability to identify and encrypt sensitive data. It supports both encryption and masking techniques for sensitive data protection, and it enables encryption of data ingested with the Hadoop API, Sqoop, and Flume. Thus, it can be used to encrypt data moving in and out of the Hadoop ecosystem. Administrators can schedule data scans within the Hadoop ecosystem at regular intervals to detect sensitive data and encrypt or mask it. More details on Dataguise are available at http://dataguise.com/products/dghadoop.html.

Gazzang zNcrypt

Gazzang zNcrypt provides transparent block-level encryption along with the ability to manage the keys used for encryption. zNcrypt acts like a virtual filesystem that intercepts any application-layer request to access files, and it encrypts each block as it is written to disk. zNcrypt leverages Intel AES-NI hardware encryption acceleration for maximum performance in the cryptographic process. It also provides role-based access control and policy-based management of the encryption keys, which can be used to implement multiple classification levels of security in a secured Hadoop cluster.

eCryptfs for Hadoop

eCryptfs is a stacked cryptographic Linux filesystem. It stores cryptographic metadata in the header of each file written, so when encrypted files are copied between hosts, they can be decrypted with the proper key from the Linux kernel keyring. We can set up a secured Hadoop cluster with eCryptfs on each node. This ensures that data can be transparently shared between nodes, while all data is encrypted before being written to disk.

More information on eCryptfs is available at https://launchpad.net/ecryptfs.

Securing the Hadoop ecosystem with Project Rhino


Project Rhino aims to provide an integrated end-to-end data security view of the Hadoop ecosystem.

It provides the following key features:

  • Hadoop crypto codec framework and crypto codec implementation to provide block-level encryption support for data stored in Hadoop

  • Key distribution and management support so that MapReduce can decrypt the block and execute the program as required

  • Enhancing the security features of HBase by introducing cell-level authentication for HBase, and providing transparent encryption for HBase tables stored in Hadoop

  • Standardized audit logging framework and log formats for easy audit trail analysis

    Note

    More details on project Rhino are available at https://github.com/intel-hadoop/project-rhino/.

Mapping of security technologies with the reference architecture


We have looked at various commercial and open source tools that help in securing the Big Data platform. This section maps these technologies to the overall reference architecture and shows where each of them fits.

Infrastructure security

Physical security needs to be enforced manually. However, unauthorized access to a distributed cluster is prevented by deploying Kerberos security in the cluster. Kerberos ensures that services and users confirm their identity with the KDC before they are given access to the infrastructure services. Project Rhino aims to extend this further by providing a token-based authentication framework.
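As a concrete anchor, cluster nodes locate the KDC through the /etc/krb5.conf client configuration; a minimal fragment (the realm and host names are placeholders) looks like this:

```ini
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
```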

OS and filesystem security

Filesystem security is enforced by providing a secured virtualization layer on the existing OS filesystem using file encryption techniques. Files written to the disk are encrypted, and files read from the disk are decrypted on the fly. These features are provided by the eCryptfs and zNcrypt tools. SELinux also provides significant protection by hardening the OS.
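The encrypt-on-write, decrypt-on-read pattern can be illustrated with a small sketch. This is a conceptual example only, not how eCryptfs or zNcrypt are implemented; the fixed key and IV are placeholders, whereas real tools derive and manage keys per file or per policy.

```java
import javax.crypto.*;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.*;

public class TransparentCrypto {
    // Placeholder key material; a real tool fetches keys from a key manager.
    static final SecretKeySpec KEY = new SecretKeySpec("0123456789abcdef".getBytes(), "AES");
    static final IvParameterSpec IV = new IvParameterSpec(new byte[16]);

    static Cipher cipher(int mode) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(mode, KEY, IV);
        return c;
    }

    // Write path: plaintext in, ciphertext lands on disk.
    static void write(File f, byte[] data) throws Exception {
        try (OutputStream out = new CipherOutputStream(
                new FileOutputStream(f), cipher(Cipher.ENCRYPT_MODE))) {
            out.write(data);
        }
    }

    // Read path: ciphertext on disk, plaintext returned to the caller.
    static byte[] read(File f) throws Exception {
        try (InputStream in = new CipherInputStream(
                new FileInputStream(f), cipher(Cipher.DECRYPT_MODE))) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = new File("block.enc");
        write(f, "sensitive record".getBytes());
        System.out.println(new String(read(f))); // prints: sensitive record
    }
}
```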

Application security

Tools such as Sentry and HUE provide a platform for secured access to Hadoop. They integrate with LDAP to provide seamless enterprise integration.

Network perimeter security

One of the common techniques to ensure perimeter security in Hadoop is to isolate the Hadoop cluster from the rest of the enterprise network. However, users still need to access the cluster, so tools such as Knox and HttpFS provide a proxy layer through which end users can remotely connect to the Hadoop cluster, submit jobs, and access the filesystem.
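As an illustration of the HttpFS proxy layer, a remote client talks to the gateway over the WebHDFS REST API. The sketch below only builds the request URL; the host, port, and user are placeholders, and a Kerberos-enabled gateway would authenticate via SPNEGO rather than the user.name parameter.

```java
public class HttpFsUrl {
    // Build the WebHDFS LISTSTATUS URL for a given HttpFS gateway and HDFS path.
    static String listStatusUrl(String host, int port, String path, String user) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=LISTSTATUS&user.name=%s",
                host, port, path, user);
    }

    public static void main(String[] args) {
        String url = listStatusUrl("httpfs.example.com", 14000, "/user/alice", "alice");
        System.out.println(url);
        // An HTTP client (e.g. java.net.http.HttpClient) would GET this URL;
        // the gateway forwards the request to HDFS on the user's behalf.
    }
}
```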

Data masking and encryption

To protect data in motion and at rest, encryption and masking techniques are deployed. Tools such as IBM Optim and Dataguise provide large-scale data masking for enterprise data. To protect data at rest in Hadoop, we deploy block-level encryption. Intel's distribution supports the encryption and compression of files, and Project Rhino enables block-level encryption similar to Dataguise and Gazzang.

Authentication and authorization

While authentication and authorization in Hadoop have matured significantly, tools such as Zettaset Orchestrator and Project Rhino enable integration with enterprise systems for authentication and authorization.

Audit logging, security policies, and procedures

Common security audit logging for user access to the Hadoop cluster is enabled by tools such as Cloudera Manager, which can also generate alerts and events based on configured organizational policies. Similarly, Intel's Manager and Zettaset Orchestrator enforce security policies in the cluster as per organizational policies.

Security Incident and Event Monitoring

Detecting security incidents and monitoring events in a Big Data platform is essential. Tools such as the open source OSSEC and IBM Guardium enable a secured Hadoop cluster to detect security incidents and provide easy integration with enterprise SIEM tools.

