Chapter 2. Exploring HDFS

In this chapter, we'll take a look at the following recipes:

  • Loading data from a local machine to HDFS

  • Exporting HDFS data to a local machine

  • Changing the replication factor of an existing file in HDFS

  • Setting the HDFS block size for all the files in a cluster

  • Setting the HDFS block size for a specific file in a cluster

  • Enabling transparent encryption for HDFS

  • Importing data from another Hadoop cluster

  • Recycling deleted data from trash to HDFS

  • Saving compressed data in HDFS

Introduction


In the previous chapter, we discussed the installation and configuration details of a Hadoop cluster. In this chapter, we are going to explore the details of HDFS. As we know, Hadoop has two important components:

  • Storage: This includes HDFS

  • Processing: This includes MapReduce

HDFS takes care of the storage part of Hadoop. So, let's explore the internals of HDFS through various recipes.

Loading data from a local machine to HDFS


In this recipe, we are going to load data from a local machine's disk to HDFS.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways to copy data from the local machine to HDFS:

  • Using the copyFromLocal command

    • To copy the file to HDFS, let's first create a directory on HDFS and then copy the file. Here are the commands to do this:

      hadoop fs -mkdir /mydir1
      hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /mydir1
      
  • Using the put command

    • We will first create the directory, and then put the local file in HDFS:

      hadoop fs -mkdir /mydir2
      hadoop fs -put /usr/local/hadoop/LICENSE.txt /mydir2
      

You can validate that the files have been copied to the correct folders by listing the files:

hadoop fs -ls /mydir1
hadoop fs -ls /mydir2
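
Both commands also accept multiple source paths and entire directories. As a minimal sketch (the /mydir3 target is just an illustrative name, and the source path assumes a standard /usr/local/hadoop installation), you could copy the whole Hadoop configuration directory in one call:

hadoop fs -mkdir /mydir3
hadoop fs -put /usr/local/hadoop/etc/hadoop /mydir3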

How it works...

When you use the HDFS copyFromLocal or put command, the following...

Exporting HDFS data to a local machine


In this recipe, we are going to export/copy data from HDFS to the local machine.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways in which you can export data from HDFS to the local machine:

  • Using the copyToLocal command, run the following:

    hadoop fs -copyToLocal /mydir1/LICENSE.txt /home/ubuntu
    
  • Using the get command, run the following:

    hadoop fs -get /mydir1/LICENSE.txt /home/ubuntu
    
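If the HDFS directory you are exporting contains many part files, it can be handy to merge them into a single local file while copying. Here is a minimal sketch using the getmerge command (the local output filename is just an example):

hadoop fs -getmerge /mydir1 /home/ubuntu/mydir1-merged.txt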

How it works...

When you use the HDFS copyToLocal or get command, the following things occur:

  1. First of all, the client contacts the NameNode because it needs to access a specific file in HDFS.

  2. The NameNode then checks whether such a file exists in its FSImage. If the file is not present, an error is returned to the client.

  3. If the file exists, the NameNode checks the metadata for its blocks and their replica placements on the DataNodes.

  4. NameNode...

Changing the replication factor of an existing file in HDFS


In this recipe, we are going to take a look at how to change the replication factor of a file in HDFS. The default replication factor is 3.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Sometimes, there might be a need to increase or decrease the replication factor of a specific file in HDFS. In this case, we'll use the setrep command.

This is how you can use the command:

hadoop fs -setrep [-R] [-w] <noOfReplicas> <path> ...

In this command, the path can be either a file or a directory; if it's a directory, the command recursively sets the replication factor for all the files under it.

  • The -w flag makes the command wait until the replication is complete

  • The -R flag is accepted for backward compatibility and has no effect

First, let's check the replication factor of the file we copied to HDFS in the previous recipe:

hadoop fs -ls /mydir1/LICENSE.txt
-rw-r--r--   3 ubuntu supergroup      15429 2015...
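
In the preceding listing, the second column (3) is the file's replication factor. As a minimal sketch, you could then lower it to 2 and wait for the change to complete (the target value of 2 is only an example):

hadoop fs -setrep -w 2 /mydir1/LICENSE.txt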

Setting the HDFS block size for all the files in a cluster


In this recipe, we are going to take a look at how to set a block size at the cluster level.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

The HDFS block size can be configured for all the files in the cluster or for a single file. To change the block size at the cluster level, we need to modify the hdfs-site.xml file.

By default, the HDFS block size is 128MB. If we want to modify this, we need to update this property, as shown in the following code. This property changes the default block size to 64MB:

<property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <description>HDFS Block size</description>
</property>
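
Note that dfs.block.size is the older property name; on Hadoop 2.x and later, the preferred name is dfs.blocksize, which also accepts size-suffixed values. Here is a minimal sketch of the same setting in that form:

<property>
    <name>dfs.blocksize</name>
    <value>64m</value>
    <description>HDFS Block size of 64MB</description>
</property>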

If you have a multi-node Hadoop cluster, you should update this file on all the nodes, that is, on the NameNode and the DataNodes. Make sure you save these changes and restart the HDFS daemons:

/usr/local/hadoop...

Setting the HDFS block size for a specific file in a cluster


In this recipe, we are going to take a look at how to set the block size for a specific file only.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

In the previous recipe, we learned how to change the block size at the cluster level. But this is not always required. HDFS provides us with the facility to set the block size for a single file as well. The following command copies a file called myfile to HDFS, setting the block size to 1MB:

hadoop fs -Ddfs.block.size=1048576 -put /home/ubuntu/myfile /

Once the file is copied, you can verify that the block size is set to 1MB and that the file has been broken into chunks of that size:

hdfs fsck -blocks /myfile
    Connecting to namenode via http://localhost:50070/fsck?ugi=ubuntu&blocks=1&path=%2Fmyfile
    FSCK started by ubuntu (auth:SIMPLE) from /127.0.0.1 for path /myfile at Thu Oct 29 14:58:00 UTC 2015
    .Status: HEALTHY
    Total size: ...
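
Besides fsck, a quicker way to check a single file's block size is the stat command. Here is a small check on the file copied above (%o prints the block size in bytes, %r the replication factor, and %n the filename):

hadoop fs -stat "%o %r %n" /myfile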

Enabling transparent encryption for HDFS


When handling sensitive data, it is always important to consider security measures. Hadoop allows us to encrypt sensitive data that's present in HDFS. In this recipe, we are going to see how to encrypt data in HDFS.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

For many applications that hold sensitive data, it is very important to adhere to standards such as PCI, HIPAA, FISMA, and so on. To enable this, HDFS provides a feature called encryption zones, in which we can create a directory so that data written to it is encrypted on write and decrypted on read.

To use this encryption facility, we first need to enable Hadoop Key Management Server (KMS):

/usr/local/hadoop/sbin/kms.sh start

This would start KMS in the Tomcat web server.

Next, we need to append the following properties to core-site.xml and hdfs-site.xml.

In core-site.xml, add the following property:

<property>
    <name>hadoop.security.key...
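
Once the key provider properties are fully configured in core-site.xml and hdfs-site.xml and the HDFS daemons have been restarted, the usual next steps are to create an encryption key and mark an empty directory as an encryption zone. Here is a minimal sketch (the key name mykey and the directory /secure are illustrative):

hadoop key create mykey
hadoop fs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure
hdfs crypto -listZones

Any file written to /secure is then transparently encrypted on disk and decrypted on read for authorized clients.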

Importing data from another Hadoop cluster


Sometimes, we may want to copy data from one HDFS cluster to another, whether for development, testing, or production migration. In this recipe, we will learn how to copy data from one HDFS cluster to another.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

Hadoop provides a utility called DistCp, which helps us copy data from one cluster to another. Using this utility is as simple as copying from one folder to another:

hadoop distcp hdfs://hadoopCluster1:9000/source hdfs://hadoopCluster2:9000/target

This would use a MapReduce job to copy data from one cluster to another. You can also specify multiple source files to be copied to the target. There are a couple of other options that we can also use:

  • -update: When we use DistCp with the update option, it will copy only those files from the source that are not part of the target or differ from the target.

  • -overwrite: When we use DistCp with the overwrite option...
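
For example, to re-run a copy and transfer only the files that are new or have changed since the previous run, add the -update flag to the same command:

hadoop distcp -update hdfs://hadoopCluster1:9000/source hdfs://hadoopCluster2:9000/target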

Recycling deleted data from trash to HDFS


In this recipe, we are going to see how to recover deleted data from the trash to HDFS.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

To recover accidentally deleted data from HDFS, we first need to enable the trash folder, which is not enabled by default in HDFS. This can be achieved by adding the following property to core-site.xml:

<property>
    <name>fs.trash.interval</name>
    <value>120</value>
</property>

Then, restart the HDFS daemons:

/usr/local/hadoop/sbin/stop-dfs.sh
/usr/local/hadoop/sbin/start-dfs.sh

This will set the deleted file retention to 120 minutes.

Now, let's try to delete a file from HDFS:

hadoop fs -rmr /LICENSE.txt
    15/10/30 10:26:26 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 0 minutes.
    Moved: 'hdfs://localhost:9000/LICENSE.txt' to trash at: hdfs://localhost:9000/user...
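
To restore the file before the retention interval expires, copy (or move) it back out of the trash directory. Here is a minimal sketch, assuming the default trash layout for the ubuntu user shown in the preceding output:

hadoop fs -ls /user/ubuntu/.Trash/Current
hadoop fs -cp /user/ubuntu/.Trash/Current/LICENSE.txt /LICENSE.txt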

Saving compressed data in HDFS


In this recipe, we are going to take a look at how to store and process compressed data in HDFS.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster.

How to do it...

It's always good to use compression while storing data in HDFS. HDFS supports various compression algorithms, such as LZO, bzip2, Snappy, gzip, and so on. Every algorithm has its own pros and cons when you consider the time taken to compress and decompress and the space efficiency. These days, people prefer Snappy compression as it aims for very high speed with a reasonable compression ratio.

We can easily store and process any number of files in HDFS. To store compressed data, we don't need to make any specific changes to the Hadoop cluster; you can simply copy compressed files into HDFS the same way you copy any other file. Here is an example of this:

hadoop fs -mkdir /compressed
hadoop fs -put file.bz2 /compressed

Now, we'll run a sample program to take a look at...
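
Before wiring the compressed file into a MapReduce job, you can quickly confirm that Hadoop picks the right codec from the file extension by printing the file's contents as text; here is a small check on the file copied above (the output is piped through head only to keep the console output short):

hadoop fs -text /compressed/file.bz2 | head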
