This section will focus on providing an overview of the various commercial and open source technologies that are available to address the various security aspects, and how they fit into the reference architecture of securing enterprise Big Data assets.
The Intel Distribution of Apache Hadoop software provides enhanced security features on top of a standard Hadoop distribution. Some of the key features of Intel's distribution are:
It provides an integrated data encryption feature for sensitive data. The encryption is based on OpenSSL 1.0.1c, which is optimized for Intel AES-NI.
Apart from encryption, Intel's distribution supports out-of-the-box compression, so files can be compressed and encrypted together.
Sensitive data is never exposed either in motion or at rest. Encrypted data can thus be ingested into the Hadoop ecosystem and processed without being exposed in the clear. Encryption keys are integrated using the Java keystore functionality.
Intel's Manager for Apache Hadoop Software provides deployment, management, monitoring, alerting, and security features.
It provides a feature for managing user access to data and services using Kerberos by creating access control lists (ACLs) and limiting user access to data sets and services.
Deployment and setup of the secure Hadoop cluster is automated and integrated with key management systems.
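The ACL-based control that Intel's Manager applies, limiting authenticated Kerberos principals to specific datasets and services, can be sketched as follows. This is a minimal illustration of the concept; the class, principal, and resource names are hypothetical and not part of Intel's product API.

```python
# Minimal sketch of ACL-based access control: a Kerberos-authenticated
# principal may only touch datasets/services listed on its ACL.
# All names here are illustrative, not Intel's actual API.

class AccessControlList:
    def __init__(self):
        # principal -> set of resources (datasets or services) it may access
        self._entries = {}

    def grant(self, principal, resource):
        self._entries.setdefault(principal, set()).add(resource)

    def is_allowed(self, principal, resource):
        return resource in self._entries.get(principal, set())

acl = AccessControlList()
acl.grant("analyst@EXAMPLE.COM", "hdfs:/data/sales")
acl.grant("analyst@EXAMPLE.COM", "service:hive")

allowed = acl.is_allowed("analyst@EXAMPLE.COM", "hdfs:/data/sales")
denied = acl.is_allowed("analyst@EXAMPLE.COM", "hdfs:/data/hr")
```

The point of the sketch is that the check happens per principal and per resource, which is what lets an administrator scope each user down to only the data sets and services they need.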
Note
More details on Intel's Hadoop Distribution are available at https://hadoop.intel.com.
Let us have a look at some of the most important tools.
Cloudera Manager is another popular Hadoop management and deployment tool. Some of the key features of Cloudera Manager with respect to securing a Hadoop cluster are:
Cloudera Manager automates the entire Hadoop cluster setup and enables an automated setup of a secure Hadoop cluster with Kerberos. Cloudera Manager automatically sets up the keytab files on all the slave nodes, and updates the Hadoop configuration with the required keytab locations and service principal details. It updates the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, oozie-site.xml, hue.ini, and taskcontroller.cfg) without any manual intervention.
It supports role-based administration, where read-only administrators monitor the cluster while others can change the deployment.
It enables administrators to configure alerts specific to user activity and access. This can be leveraged for security incident and event monitoring.
Cloudera can send events to enterprise SIEM tools about security incidents in Hadoop using SNMP.
It can integrate user credentials with Active Directory using LDAP.
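To make concrete what Cloudera Manager automates, the sketch below generates the kind of Kerberos-related properties that a deployment tool writes into core-site.xml. The property names hadoop.security.authentication and hadoop.security.authorization are genuine Hadoop settings; the helper function itself is illustrative and not part of Cloudera Manager.

```python
# Sketch of what a deployment tool writes into core-site.xml to turn on
# Kerberos security: hadoop.security.authentication/authorization are
# real Hadoop properties; the builder function is illustrative.
import xml.etree.ElementTree as ET

def build_core_site(properties):
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

core_site = build_core_site({
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",
})
```

A management tool repeats this kind of rendering for every node and every configuration file listed above, which is exactly the tedious, error-prone work that manual setups get wrong.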
Note
More details on Cloudera Manager are available at the following URL: http://www.cloudera.com/content/cloudera/en/products/cloudera-manager.html.
Zettaset (http://www.zettaset.com/) offers a product, Zettaset Orchestrator, that provides seamless, secured Hadoop deployment and management. Zettaset doesn't provide its own Hadoop distribution, but works with distributions such as Cloudera, Hortonworks, and Apache Hadoop. Some of the key features of the Zettaset Orchestrator are:
It provides an automated deployment of a secured Hadoop cluster
It hardens the entire Hadoop deployment from an enterprise perspective to address policy, compliance, access control, and risk management within the Hadoop cluster environment
It integrates seamlessly with an existing enterprise security policy framework using LDAP and Active Directory (AD)
It provides centralized configuration management, logging, and auditing
It provides role-based access controls (RBACs) and enables Kerberos to be seamlessly integrated with the rest of the ecosystem
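The role-based model that Zettaset Orchestrator layers onto the cluster can be illustrated with a small sketch: users inherit permissions through roles, much as group memberships would be pulled from LDAP or AD. The role and permission names below are made up for illustration.

```python
# Tiny RBAC sketch: users map to roles (as they might come from LDAP/AD
# group membership), and roles map to permissions. Names are illustrative.
ROLE_PERMISSIONS = {
    "hdfs-admin": {"read", "write", "configure"},
    "analyst": {"read"},
}

USER_ROLES = {
    "alice": ["hdfs-admin"],
    "bob": ["analyst"],
}

def permissions_for(user):
    perms = set()
    for role in USER_ROLES.get(user, []):
        perms |= ROLE_PERMISSIONS.get(role, set())
    return perms

alice_can_write = "write" in permissions_for("alice")
bob_can_write = "write" in permissions_for("bob")
```

Contrast this with the plain ACL shown earlier: here permissions are attached to roles rather than individual users, so adding a user to a directory group is all that is needed to grant access.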
All other platform management tools, such as Ambari and Greenplum Hadoop Deployment Manager, need manual setup to establish a secured Hadoop cluster. The keytab files, service principals, and configuration files have to be manually deployed on all nodes.
Let us have a look at the various options available.
Dataguise (DG) for Hadoop provides symmetric-key-based encryption of data. One of the key features of Dataguise is its ability to identify and encrypt sensitive data. It supports both encryption and masking techniques for sensitive data protection. It enables encryption of data through the Hadoop API, Sqoop, and Flume, so it can be used to encrypt data moving in and out of the Hadoop ecosystem. Administrators can schedule data scans within the Hadoop ecosystem at regular intervals to detect sensitive data and encrypt or mask it. More details on Dataguise are available at http://dataguise.com/products/dghadoop.html.
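Dataguise's detection and masking engine is proprietary, but the general technique of scanning for sensitive fields and masking them can be sketched with a regular expression. The SSN pattern and the masking format below are illustrative only, not Dataguise's implementation.

```python
# Sketch of detect-and-mask for sensitive data: find values that look
# like US Social Security numbers and mask all but the last 4 digits.
# The pattern and masking format are illustrative, not Dataguise's.
import re

SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def mask_ssns(text):
    # Keep only the last group so the record stays partially usable
    # (e.g. for joins or support lookups) without exposing the full value.
    return SSN_PATTERN.sub(lambda m: "***-**-" + m.group(3), text)

masked = mask_ssns("customer 42, ssn 123-45-6789, balance 100.00")
```

Unlike encryption, masking is one-way: the original value cannot be recovered, which is why masking is preferred for analytics copies while encryption is used where the original must be restorable.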
Gazzang zNcrypt provides transparent block-level encryption along with the ability to manage the keys used for encryption. zNcrypt acts like a virtual filesystem that intercepts any application-layer request to access files, and encrypts each block as it is written to the disk. zNcrypt leverages Intel AES-NI hardware encryption acceleration for maximum performance in the cryptographic process. It also provides role-based access control and policy-based management of the encryption keys. This can be used to implement multiple classification levels of security in a secured Hadoop cluster.
eCryptfs is a cryptographic stacked Linux filesystem. eCryptfs stores cryptographic metadata in the header of each file written. When encrypted files are copied between hosts, they can only be decrypted on a host that holds the proper key in its Linux kernel keyring. We can set up a secured Hadoop cluster with eCryptfs on each node; this ensures that data can be transparently shared between nodes while all data is encrypted before being written to the disk.
More information on eCryptfs is available at the following link: https://launchpad.net/ecryptfs.
Project Rhino aims to provide an integrated, end-to-end data security view of the Hadoop ecosystem.
It provides the following key features:
Hadoop crypto codec framework and crypto codec implementation to provide block-level encryption support for data stored in Hadoop
Key distribution and management support so that MapReduce can decrypt the block and execute the program as required
Enhancing the security features of HBase by introducing cell-level authentication for HBase, and providing transparent encryption for HBase tables stored in Hadoop
Standardized audit logging framework and log formats for easy audit trail analysis
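A standardized audit logging framework, as Rhino proposes, in practice means every component emits records with the same structured fields so that audit trails can be correlated mechanically. The sketch below shows one such record; the field names are hypothetical, not Rhino's actual schema.

```python
# Sketch of a standardized, machine-parseable audit record in the spirit
# of Rhino's unified audit logging. Field names are hypothetical.
import json
from datetime import datetime, timezone

def audit_record(user, action, resource, allowed):
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    }, sort_keys=True)

record = audit_record("analyst@EXAMPLE.COM", "READ", "/data/sales", True)
parsed = json.loads(record)
```

Because every component emits the same fields, a downstream SIEM tool can filter by user or resource across HDFS, HBase, and MapReduce logs without per-component parsers.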
Note
More details on project Rhino are available at https://github.com/intel-hadoop/project-rhino/.
We looked at the various commercial and open source tools that enable securing the Big Data platform. This section provides the mapping of these various technologies and how they fit into the overall reference architecture.
Physical security needs to be enforced manually. However, unauthorized access to a distributed cluster is avoided by deploying Kerberos security in the cluster. Kerberos ensures that the services and users confirm their identity with the KDC before they are provided access to the infrastructure services. Project Rhino aims to extend this further by providing the token-based authentication framework.
Filesystem security is enforced by providing a secured virtualization layer on the existing OS filesystem using file encryption techniques. Files written to the disk are encrypted, while files read from the disk are decrypted on the fly. These features are provided by the eCryptfs and zNcrypt tools. SELinux also provides significant protection by hardening the OS.
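The encrypt-on-write, decrypt-on-read interception that eCryptfs and zNcrypt perform can be illustrated with a small wrapper. Note that the XOR "cipher" here is a toy stand-in for a real algorithm such as AES, used only to keep the sketch self-contained; it offers no actual protection.

```python
# Illustration of transparent encryption as done by eCryptfs/zNcrypt: a
# wrapper encrypts on write and decrypts on read, so callers never see
# ciphertext. The XOR transform is a TOY stand-in for a real cipher
# such as AES -- do not use it for actual data protection.
import io

class EncryptedFile:
    def __init__(self, backing, key):
        self._backing = backing  # e.g. an on-disk file object
        self._key = key

    def _xor(self, data):
        # Reversible toy transform standing in for encrypt/decrypt.
        return bytes(b ^ self._key[i % len(self._key)]
                     for i, b in enumerate(data))

    def write(self, plaintext):
        self._backing.write(self._xor(plaintext))  # stored encrypted

    def read(self):
        self._backing.seek(0)
        return self._xor(self._backing.read())  # decrypted on the fly

disk = io.BytesIO()
f = EncryptedFile(disk, key=b"secret")
f.write(b"salary,100000")
on_disk = disk.getvalue()   # what an attacker reading the disk sees
recovered = f.read()        # what the application sees
```

The application above never handles ciphertext, which is the whole point of a stacked or virtual filesystem layer: Hadoop daemons run unmodified while the bytes on disk stay encrypted.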
Tools such as Sentry and HUE provide a platform for secured access to Hadoop. They integrate with LDAP to provide seamless enterprise integration.
One of the common techniques to ensure perimeter security in Hadoop is isolation of the Hadoop cluster from the rest of the enterprise. However, users still need to access the cluster; tools such as Knox and HttpFS provide a proxy layer for end users to remotely connect to the Hadoop cluster, submit jobs, and access the filesystem.
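HttpFS exposes HDFS through the WebHDFS REST API over plain HTTP, so a client outside the cluster perimeter only needs to construct URLs like the one built below. The /webhdfs/v1 prefix and the op and user.name query parameters are part of the real WebHDFS API; the host, port, and user are placeholders.

```python
# Sketch of how a remote client reaches HDFS through an HttpFS proxy.
# The /webhdfs/v1 prefix and op/user.name parameters follow the WebHDFS
# REST API; host, port, and user below are placeholders.
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user):
    query = urlencode({"op": op, "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("httpfs.example.com", 14000,
                  "/data/sales", "LISTSTATUS", "analyst")
```

Because only the proxy host is reachable from outside, the firewall can keep every NameNode and DataNode port closed to end users while still allowing filesystem access.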
To protect data in motion and at rest, encryption and masking techniques are deployed. Tools such as IBM Optim and Dataguise provide large-scale data masking for enterprise data. To protect data at rest in Hadoop, we deploy block-level encryption. Intel's distribution supports the encryption and compression of files. Project Rhino enables block-level encryption similar to Dataguise and Gazzang.
While authentication and authorization have matured significantly, tools such as Zettaset Orchestrator and Project Rhino enable integration with enterprise systems for authentication and authorization.
Common security audit logging for user access to the Hadoop cluster is enabled by tools such as Cloudera Manager, which can also generate alerts and events based on configured organizational policies. Similarly, Intel's Manager and Zettaset Orchestrator enforce security policies in the cluster as per organizational policies.