
You're reading from Securing Hadoop

Product type: Book
Published in: Nov 2013
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783285259
Edition: 1st Edition

Author: Sudheesh Narayanan

Sudheesh Narayanan is a Technology Strategist and Big Data Practitioner with expertise in technology consulting and implementing Big Data solutions. With over 15 years of IT experience in Information Management, Business Intelligence, Big Data & Analytics, and Cloud & J2EE application development, he provided his expertise in architecting, designing, and developing Big Data products, Cloud management platforms, and highly scalable platform services. His expertise in Big Data includes Hadoop and its ecosystem components, NoSQL databases (MongoDB, Cassandra, and HBase), Text Analytics (GATE and OpenNLP), Machine Learning (Mahout, Weka, and R), and Complex Event Processing. Sudheesh is currently working with Genpact as the Assistant Vice President and Chief Architect – Big Data, with focus on driving innovation and building Intellectual Property assets, frameworks, and solutions. Prior to Genpact, he was the co-inventor and Chief Architect of the Infosys BigDataEdge product.

Chapter 4. Securing the Hadoop Ecosystem

In Chapter 3, Setting Up a Secured Hadoop Cluster, we looked at how to set up Kerberos authentication for the HDFS and MapReduce components within a secured Hadoop cluster. But in our secured Big Data journey, this is only half the work. The Hadoop ecosystem consists of various components such as Hive, Oozie, and HBase, and all of these components need to be secured as well. In this chapter, we will look at each of the ecosystem components, the security challenges specific to each of them, and how to set up secured authentication and user authorization for each one.

Each ecosystem component has its own security challenges and needs to be configured uniquely, based on its architecture, to be secured. Each of these components is accessed either directly by end users or by a backend service that in turn accesses the Hadoop core components (HDFS and MapReduce).

The following are the topics that we'll be covering in this chapter:

  • Configuring...

Configuring Kerberos for Hadoop ecosystem components


The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured.
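Before securing the individual components, recall that all of them rely on the underlying cluster already running in secure mode. As a minimal sketch, the core-site.xml settings below enable Kerberos authentication and service-level authorization cluster-wide; the property names are the standard Hadoop security settings, and any host or realm values would be specific to your environment.

```xml
<!-- core-site.xml: switch the cluster from simple (trusted) auth to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- Enable service-level authorization checks for RPC callers -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

Every ecosystem component discussed in this section assumes these settings are already in place, as covered in Chapter 3.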

Securing Hive

Hive provides the ability to run SQL queries over data stored in HDFS. Its query engine converts each Hive query submitted by the user into a pipeline of MapReduce jobs, which are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce jobs are then presented back to the user or stored in HDFS. The following figure shows the high-level interaction of a business user working with Hive to run Hive queries on Hadoop:

There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows:

  • The user can directly run the Hive queries using Command Line Interface (CLI). The CLI connects to the Hive metastore using...
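To secure the metastore path described above, Hive's SASL/Kerberos support has to be switched on in hive-site.xml. The following is a sketch of the relevant standard properties; the principal, keytab path, and realm shown are placeholders, and the _HOST token is expanded by Hive to the local hostname at runtime.

```xml
<!-- hive-site.xml: require Kerberos (SASL) for metastore connections -->
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<!-- Service principal for the metastore; _HOST is resolved at runtime -->
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@EXAMPLE.COM</value>
</property>
<!-- Keytab path is environment-specific; shown here as a placeholder -->
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
```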

Best practices for securing the Hadoop ecosystem components


We have looked at the various Hadoop ecosystem components and understood how to set up secured authentication for each of them. In this section, let us summarize the best practices:

  • All services that are running within the Hadoop ecosystem need to be authenticated with the KDC. This ensures that no rogue process can carry out malicious activity.

  • It is a best practice to store the KDC credentials in an LDAP store, so that the credentials and authorizations can be centrally managed.

  • The keytab file needs to be secured, and only the user for whom the file is created should be provided with read access to the file.

  • Whenever a Java client is accessing the service, client authentication should be done by the service using the RPC authentication mechanism.

  • Whenever a service user impersonates an end user, the service process has to be fully secured with Kerberos, and also the host running...
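The keytab guidance above can be sketched as the following shell commands. The filename and ownership here are illustrative: a real keytab would be generated with kadmin's ktadd and owned by the service account rather than created with touch.

```shell
# Stand-in for a keytab produced by kadmin's ktadd for the hive service user
touch hive.keytab

# Owner-only, read-only access: no group or other user can read the keytab
chmod 400 hive.keytab

# Confirm the resulting mode is 400 (r-- --- ---)
stat -c '%a' hive.keytab
```

In practice you would also chown the file to the service user (for example, `chown hive:hive hive.keytab`) so that the read permission applies to the right account.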

Summary


In this chapter, we looked at the steps that need to be adopted to secure the various Hadoop ecosystem components. At a high level, the process involves creating a Kerberos principal for each component and then securing the keytab file under the service user's home directory. If the service has to impersonate the end user, the service principal is configured as a superuser in Hadoop. Each ecosystem component has specific configurations that need to be updated to support secured authentication with Kerberos. Some components, such as Sqoop and Sqoop2, leave certain security holes when used in production, so they have to be used with caution and deployed with additional security measures.

In the next chapter, we will look at how to integrate the authentication and authorization of these ecosystem components with the Enterprise Identity Management systems.
