Hadoop Backup and Recovery Solutions

By Gaurav Barot, Chintan Mehta, Amij Patel

About this book

Hadoop offers distributed processing of large datasets across clusters and is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. It enables computing solutions that are scalable, cost-effective, flexible, and fault tolerant, so that very large datasets can be backed up and protected from hardware failures.

Starting off with the basics of Hadoop administration, this book becomes increasingly exciting with the best strategies of backing up distributed storage databases.

You will gradually learn about the backup and recovery principles, discover the common failure points in Hadoop, and learn facts about backing up Hive metadata. A deep dive into the interesting world of Apache HBase will show you different ways of backing up data and will compare them. Going forward, you'll learn the methods of defining recovery strategies for various causes of failure: failover recoveries, corruption, working drives, and metadata. Also covered are the concepts of Hadoop metrics and MapReduce. Finally, you'll explore troubleshooting strategies and techniques to resolve failures.

Publication date:
July 2015


Chapter 1. Knowing Hadoop and Clustering Basics

Today, we are living in the age of data. People are generating data in different ways: they take pictures, send e-mails, upload images, write blogs, comment on someone's blog or picture, change their status on social networking sites, tweet on Twitter, update details on LinkedIn, and so on. Just a couple of decades ago, a general belief was that 1 GB of disk storage would be more than enough for a personal computer. Nowadays, we use hard disks with capacities measured in terabytes. Data size has grown not only in personal space but also in professional services, where people have to deal with a humongous amount of data. Think of the data managed by players such as Google, Facebook, the New York Stock Exchange, Amazon, and many others. The list is endless. This situation challenged the way we traditionally stored, managed, and processed data. New technologies and platforms have emerged, which provide us with solutions to these challenges. But again, do you think a solution providing smart data storage and retrieval will be enough? Would you like to be in a situation where you log in to your Facebook or Instagram account and find that all your data was lost due to a hardware failure? Absolutely not!

It is important to not only store data but also process the data in order to generate information that is necessary. And it is equally important to back up and recover the data in the case of any type of failure. We need to have a sound plan and policy defined and implemented for data backup and recovery in the case of any failure.

Apache Hadoop is a platform that provides practical, cost-effective, and scalable infrastructure to store, manage, and process data. This book focuses on understanding backup and recovery needs and defining and determining backup and recovery strategies, as well as implementing them in order to ensure data and metadata backup and recovery after the occurrence of a failure in Hadoop. To begin with, in this chapter, we will discuss basic but important concepts of Hadoop, HDFS, daemons, and clustering. So, fasten your seatbelts and get ready to fly with me in the Hadoop territory!


Understanding the need for Hadoop

As mentioned at the beginning of the chapter, the enormous data growth first became a challenge for big players such as Yahoo!, Google, Amazon, and Facebook. They had to not only store this data but also process it effectively. When their existing tools and technologies were not enough to process this huge set of data, Google introduced the world to MapReduce in 2004. Prior to that, Google had also published a paper in 2003 on the Google File System (GFS), describing a pragmatic, scalable, distributed file system optimized to store huge datasets. It was built to support large-scale, data-intensive, distributed processing applications.

Both these systems were able to solve a major problem that many companies were facing at that time: to manage large datasets in an effective manner.

Doug Cutting, who developed Apache Lucene, grabbed the opportunity and led an initiative to develop an open source version of MapReduce called Hadoop. Right after this, Yahoo! and many others supported the initiative and efforts. Very soon, Hadoop was accepted as an open source framework for processing huge datasets in a distributed environment. The good part is that it ran on clusters of commodity computers and hardware. By 2008, big organizations such as Yahoo!, Facebook, The New York Times, and many others had started using Hadoop.


There is a common misconception that Hadoop is an acronym of a long name. However, that's not the case; it's a made-up name and not an acronym. Let's see what Doug Cutting has to say about the name:

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere—these are my naming criteria. Kids are good at generating such names. Google is a kid's term."

By now, you must be thinking that knowing the history of Hadoop is fine but what do you do with it? What is the actual use of Hadoop? What are the scenarios where Hadoop will be beneficial compared to other technologies? Well, if you are thinking of all these questions, then your neurons are fired up and you've started getting into the Hadoop space. (By the way, if you are not thinking about these questions, then you might already know what Hadoop does and how it performs those actions or you need a cup of strong coffee!)

For the next few minutes, assume that you are the owner of an online store. You have thousands of products listed on your store, ranging from apparel to electronic items and from home appliances to sports products. What items will you display on the home page? Also, once a user has an account on your store, what kind of products will you display to that user? Well, here are the choices:

  • Display the same products to all the users on the home page

  • Display different products to users accessing the home page based on the country and region they access the page from. (Your store is very popular and has visitors from around the globe, remember!)

  • Display the same products to the users after they log in. (Well, you must be thinking "do I really want to do this?" If so, it's time to start your own e-commerce store.)

  • Display products to users based on their purchase history, the products they normally search for, and the products bought by other users who have a similar purchase history.

Here, the first and third options will be very easy to implement, but they won't add much value for the users, will they? On the other hand, options 2 and 4 will give users the feeling that someone is taking care of their needs, and there will be a higher chance of products being sold. But the solution needs algorithms written in such a way that they analyze the data and give quality results. Having relevant data displayed on the store will result in happy and satisfied customers, returning visitors, and most importantly, more money!

You must be thinking: what relates Hadoop to the online store?

Apache Hadoop provides tools that solve a big problem mentioned earlier: managing large datasets and processing them effectively. Hadoop provides linear scalability by implementing a cluster of low-cost commodity hardware. The main principle behind this design is to bring computing logic to data rather than bringing huge amounts of data to computing logic, which can be time-consuming and inefficient in the case of huge datasets. Let's understand this with an example.

Hadoop uses a cluster for data storage and computation purposes. In Hadoop, parallel computation is performed by MapReduce. A developer provides the computation logic for data processing. Hadoop runs this computation logic on the machines, or nodes, where the data already exists, rather than sending the data to the machines where the computation logic resides. This results in a performance improvement, as you are no longer sending huge amounts of data across the network; instead, you send the processing/computation logic, which is far smaller than the data stored on the nodes.
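To get a feel for why this matters, here is a rough back-of-the-envelope sketch. The node count, per-node data size, and job size below are made-up illustrative numbers, not figures from any benchmark:

```python
# Hypothetical cluster: 10 nodes, each already holding 100 GB of a dataset,
# and a MapReduce job whose packaged computation logic is about 5 MB.
GB = 1024 ** 3
MB = 1024 ** 2

nodes = 10
data_per_node = 100 * GB        # data already resident on each node
job_code_size = 5 * MB          # the computation logic to ship

# Moving data to the computation: every node's data crosses the network.
bytes_data_to_code = nodes * data_per_node

# Moving computation to the data: only the job code is shipped to each node.
bytes_code_to_data = nodes * job_code_size

print(bytes_data_to_code // GB, "GB versus", bytes_code_to_data // MB, "MB")
```

Shipping the logic moves megabytes instead of moving a terabyte of data, which is the whole point of the design.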

To put all of this in a nutshell, Hadoop is an open source framework used to run and write distributed applications for processing huge amounts of data. However, there are a few fundamental differences between Hadoop and traditional distributed frameworks, as follows:

  • Low cost per byte: Hadoop's HDFS uses low-cost hardware storage and shares the cost of the network and computers it runs on with MapReduce. HDFS is open source software, which again reduces the cost of ownership. This cost advantage lets organizations store more data per dollar than traditional distributed systems.

  • High data reliability: As Hadoop's distributed system is running on commodity hardware, there can be very high chances of device failure and data loss. Hadoop has been architected keeping in mind this common but critical issue. Hadoop has been tested and has proven itself in multiple use cases and cluster sizes against such failures.

  • High throughput: Hadoop provides large throughput access to application data and is suitable for applications with large datasets.

  • Scalable: We can easily achieve linear scalability by adding more nodes to the Hadoop cluster.

The following image represents a Hadoop cluster that can be accessed by multiple clients. As demonstrated, a Hadoop cluster is a set of commodity machines. Data storage and data computing occur in this set of machines by HDFS and MapReduce, respectively. Different clients can submit a job for processing to this cluster, and the machines in the cluster jointly execute the job.

There are many projects developed by the Apache Software Foundation that make the job of HDFS and MapReduce easier. Let's touch on some of those projects.

Apache Hive

Hive is a data warehouse infrastructure built on Hadoop that uses its storage and execution model. It was initially developed by Facebook. It provides a query language that is similar to SQL and is known as the Hive query language (HQL). Using this language, we can analyze large datasets stored in file systems, such as HDFS. Hive also provides an indexing feature. It is designed to support ad hoc queries and easy data summarization as well as to analyze large volumes of data. You can get the full definition of the language at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

Apache Pig

The native language of Hadoop is Java. You can also write MapReduce in a scripting language such as Python. As you might already know, the MapReduce model requires you to translate your program into a series of map and reduce stages and to configure it properly. Pig is an abstraction layer built on top of Hadoop, which simplifies the authoring of MapReduce jobs. Instead of writing code in Java, developers write data processing jobs in a scripting language. Pig can execute a series of MapReduce jobs using Pig scripts.

Apache HBase

Apache HBase is an open source implementation of Google's Bigtable. It was developed as part of the Hadoop project and runs on top of HDFS. It provides a fault-tolerant way to store large quantities of sparse data. HBase is a nonrelational, column-oriented, multidimensional database that uses HDFS for storage. It offers a flexible data model with scale-out properties and a simple API. Many organizations use HBase, including Facebook, Twitter, Mozilla, Meetup, and Yahoo!.

Apache HCatalog

HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore. It provides a shared schema and data types for Hadoop tools. It solves the problem of tools not agreeing on the schema, data types, and how the data is stored. It enables interoperability across HBase, Hive, Pig, and MapReduce by providing one consistent data model and a shared schema for these tools. HCatalog achieves this by providing table abstraction. HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.

There are other Hadoop projects such as Sqoop, Flume, Oozie, Whirr, and ZooKeeper, which are part of the Hadoop ecosystem.

The following image gives an overview of the Hadoop ecosystem:


Understanding HDFS design

So far, in this chapter, we have referred to HDFS many times. It's now time to take a closer look at HDFS. This section talks about HDFS basics, design, and its daemons, such as NameNode and DataNode.

The Hadoop Distributed File System (HDFS) has been built to support high throughput, streaming reads and writes of huge files. There can be an argument from traditional SAN or NAS lovers that the same functions can be performed by SAN or NAS as well. They also offer centralized, low-latency access to large file systems. So, what's the purpose of having HDFS?

Let's talk about Facebook (that's what you like, isn't it?). There are hundreds of thousands of users on their laptops, desktops, tablets, or smartphones, trying to pull humongous amounts of data together from the centralized Facebook servers. Do you think traditional SAN or NAS will work effectively in this scenario? Well, your answer is correct. Absolutely not!

HDFS has been designed to handle these kinds of situations and requirements. The following are the goals of HDFS:

  • To store millions of files, where each file can be 1 GB or more and the overall file storage crosses petabytes.

  • To create clusters using commodity hardware rather than RAID to achieve the large storage mentioned in the previous point. Here, high availability and throughput are achieved through application-level replication.

  • To provide robustness by gracefully handling hardware failure.

HDFS has many similarities to a traditional file system, such as SAN or GFS. The following are some of them:

  • Files are stored in blocks. Metadata is available, which keeps track of filenames to block mapping.

  • It also supports the directory tree structure, like a traditional file system.

  • It works on a permission model. You can give different users different access rights to a file.

However, along with the similarities, there are differences as well. Here are some of the differences:

  • Very low storage cost per byte: HDFS uses commodity storage and shares the cost of the network it runs on with other systems of the Hadoop stack. Also, being open source, it reduces the total cost of ownership. Due to this cost benefit, it allows an organization to store the same amount of data at a very low cost compared to traditional NAS or SAN systems.

  • Block size for the data: Traditional file systems generally use around an 8 KB block size for data, whereas HDFS uses a much larger block size. By default, the block size in HDFS is 64 MB, but based on requirements, the admin can raise it to 1 GB or higher. Having a larger block size ensures that data can be read and written in large sequential operations. This improves performance as it minimizes drive seek operations, especially when performing large streaming I/O operations.

  • Data protection mechanisms: Traditional file systems use specialized data storage for data protection, while HDFS replicates each block to multiple machines in the cluster. By default, it replicates the data block to three nodes. This ensures data reliability and high availability. You will see how this can be achieved and how Hadoop retrieves data in the case of failure, later in this book.
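The effect of the larger block size and the three-way replication can be seen with a little arithmetic. The helper function below is a plain illustration, not part of any Hadoop API:

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=64, replication=3):
    """Return (block_count, raw_storage_mb) for a file stored in HDFS."""
    # The file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is replicated, so raw cluster storage is a multiple of the
    # logical file size (HDFS does not pad the last, partial block).
    raw_storage = file_size_mb * replication
    return blocks, raw_storage

# A 1 GB (1024 MB) file with the default 64 MB block size and replication 3:
print(hdfs_storage(1024))   # (16, 3072)
```

With an 8 KB block size, the same file would need over 130,000 blocks of metadata to track; the 64 MB default keeps that number small and the reads sequential.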

HDFS runs on commodity hardware. The cluster consists of different nodes. Each node stores a subset of data, and the data is aggregated to make a complete file system. The data is also replicated on three different nodes to provide fault tolerance. Since it works on the principle of moving computation to large data for processing, it provides high throughput and fits best for the applications where huge datasets are involved. HDFS has been built keeping the following goals in mind:

  • Handling hardware failure elegantly.

  • Streaming data access to the dataset.

  • Handling large datasets.

  • A simple coherency model with a principle of the write-once-read-many access model.

  • Move computation to data rather than moving data to computation for processing. This is very helpful in the case of large datasets.

Getting familiar with HDFS daemons

We have understood the design of HDFS. Now, let's talk about the daemons that play an important role in the overall HDFS architecture. Let's understand the function of the daemons with the following small conversation; let the members involved in the conversation introduce themselves:


User

"I'm the one who needs data to be read and written. But, I'm not a daemon."

Client

"Hi, there! People sit in front of me and ask me to read and write data. I'm also not an HDFS daemon."

NameNode

"There is only one of my kind here. I am the coordinator here. (Really? What about the secondary NameNode? Don't worry, we will cover this at the end of this section.)"

DataNodes

"We are the ones who actually store your data. We live in a group. Sometimes, we are in an army of thousands!"

Now, let's see different scenarios in order to understand the function of each of these.

Scenario 1 – writing data to the HDFS cluster

Here, a user is trying to write data into the Hadoop cluster. See how the client, NameNode, and DataNode are involved in writing the data:


User

"Hello Mr. Client, I want to write 2 GB of data. Can you please do it for me?"

Client

"Sure, Mr. User. I'll do it for you."

User

"And yes, please divide the data into 512 MB blocks and replicate it at three different locations, as you usually do."

Client

"With pleasure, sir."

Client (thinking)

Hmm, now I have to divide this big file into smaller blocks, as asked by Mr. User. But to write it to three different locations, I have to ask Mr. NameNode. He is such a knowledgeable person.

Client

"Hey, Mr. NameNode. I need your help. I have this big file, which I have divided into smaller blocks. But now I need to write it to three different locations. Can you please provide me the details of these locations?"

NameNode

"Sure, buddy. That's what I do. First of all, let me find three DataNodes for you."

NameNode (after finding DataNodes)

"Here you go, my friend! Take the addresses of the DataNodes where you can store the information. I have also sorted them by increasing distance from you."

Client

(Such a gentleman.) "Thanks a lot, Mr. NameNode. I appreciate all the efforts you've put in."

Client (to DataNode 1)

"Hello Mr. DataNode1. Can you please write this block of data on your disk? And yes, please take the list of the nodes as well, and ask them to replicate the data."

DataNode 1

"With pleasure, Mr. Client. Let me start storing the data, and while I'm doing that, I will also forward the data to the next DataNode."

DataNode 2

"Let me follow you, dear friend. I've started storing the data and forwarded it to the next node."

DataNode 3

"I'm the last guy who's storing the data. Now, the replication of the data is complete."

All DataNodes

"Mr. NameNode, we have completed the job. The data has been written and replicated successfully."

NameNode

"Mr. Client, your block has been successfully stored and replicated. Please repeat the same with the rest of the blocks too."

Client (after repeating it for all the blocks)

"All the blocks are written, Mr. NameNode. Please close the file. I truly appreciate your help."

NameNode

"OK. The case is closed from my end. Now, let me store all the metadata on my hard disk."

As you have seen from the preceding conversation, the following are the observations:

  • The client is responsible for dividing the file into smaller chunks or blocks.

  • The NameNode keeps the address of each DataNode and coordinates the data writing and replication process. It also stores the metadata of the file. There is only one NameNode per HDFS cluster.

  • The DataNode stores the blocks of data and takes care of the replication.
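The write path in the conversation above can be sketched as a toy simulation. The classes below are simplified stand-ins for the real daemons (no networking, no acknowledgements), and the block size is shrunk to a few bytes for demonstration:

```python
BLOCK_SIZE = 4  # bytes, tiny for demonstration (the HDFS default is 64 MB)

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}            # filename -> list of (block_id, targets)

    def allocate(self, filename, block_id, replication=3):
        # Real HDFS sorts candidates by network distance; we simply take
        # the first `replication` nodes in the list.
        targets = self.datanodes[:replication]
        self.metadata.setdefault(filename, []).append((block_id, targets))
        return targets

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data
        # Forward the block to the next node in the replication pipeline.
        if pipeline:
            pipeline[0].write(block_id, data, pipeline[1:])

def client_write(namenode, filename, data):
    # 1. The client splits the file into blocks.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # 2. The NameNode hands back the target DataNodes for the block.
        targets = namenode.allocate(filename, block_id)
        # 3. The first DataNode stores the block and forwards it down the pipeline.
        targets[0].write(block_id, block, targets[1:])

datanodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(datanodes)
client_write(nn, "file.txt", b"0123456789")  # 10 bytes -> 3 blocks
```

After the call, the first three DataNodes each hold a full replica of every block, while the fourth holds nothing, mirroring the three-way pipeline in the conversation.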

Since we started the discussion with the writing of data, it cannot be completed without discussing the reading process. We have the same members involved in this conversation too.

Scenario 2 – reading data from the HDFS cluster

You have seen how data is written. Now, let's talk about a scenario when the user wants to read data. It will be good to understand the role of the client, NameNode, and DataNode in this scenario.


User

"Hello, Mr. Client. Do you remember me? I asked you to store some data earlier."

Client

"Certainly, I do remember you, Mr. User."

User

"Good to know that. Now, I need the same data back. Can you please read this file for me?"

Client

"Certainly, I'll do it for you."

Client (to the NameNode)

"Hi, Mr. NameNode. Can you please provide the details of this file?"

NameNode

"Sure, Mr. Client. I have stored the metadata for this file. Let me retrieve it for you.

Here you go. You will need these two things to get the file back:

  • A list of all the blocks for this file

  • A list of all the DataNodes for each block

Use this information to download the blocks."

Client

"Thanks, Mr. NameNode. You're always helpful."

Client (to the nearest DataNode)

"Please give me block 1." (The process is repeated until all the blocks of the file have been retrieved.)

Client (after retrieving all the blocks for the file)

"Hi, Mr. User. Here's the file you needed."

User

"Thanks, Mr. Client."

The entire read and write process is displayed in the following image:

Now, you may wonder what happens when any of the DataNodes is down or the data is corrupted. This is handled by HDFS. It detects various types of faults and handles them elegantly. We will discuss fault tolerance in the later chapters of this book.

Now, you have understood the concepts of NameNode and DataNode. Let's do a quick recap of the same.

HDFS has a master/slave architecture. It consists of a single NameNode, a master server that manages the file system namespace and regulates client access to files. The DataNode is the slave, and there is one DataNode per node in the cluster. It is responsible for storing data on its local storage and for serving the client's read and write requests. A DataNode can also perform block creation, deletion, and replication based on commands received from the NameNode. The NameNode is responsible for file operations such as opening, closing, and renaming files.
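The read path recapped above can be sketched the same way as the write path: metadata lookup first, then block downloads. These classes are illustrative stand-ins, not the real Hadoop API:

```python
class NameNode:
    def __init__(self, metadata):
        # filename -> ordered list of (block_id, [DataNode names]),
        # with each block's locations sorted by distance from the client.
        self.metadata = metadata

    def get_block_locations(self, filename):
        return self.metadata[filename]

def client_read(namenode, datanodes, filename):
    data = b""
    # The NameNode returns the block list and the DataNodes for each block.
    for block_id, locations in namenode.get_block_locations(filename):
        # The client downloads each block from the nearest DataNode
        # (here, simply the first one listed).
        data += datanodes[locations[0]][block_id]
    return data

datanodes = {
    "dn0": {0: b"hello ", 1: b"world"},
    "dn1": {0: b"hello ", 1: b"world"},   # replicas of the same blocks
}
nn = NameNode({"file.txt": [(0, ["dn0", "dn1"]), (1, ["dn1", "dn0"])]})
print(client_read(nn, datanodes, "file.txt"))  # b'hello world'
```

Note that the file data itself never passes through the NameNode; the client fetches blocks directly from the DataNodes.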


You may have also heard of another daemon in HDFS called the secondary NameNode. From the name, it seems that it is a backup NameNode that takes charge in the case of a NameNode failure. But that's not the case. The secondary NameNode is responsible for internal housekeeping, which ensures that the NameNode does not run out of memory and that its startup time stays short. It periodically reads the file system change log (the edit log) and applies the changes to the fsimage file, so the NameNode is up to date when it reads this file during startup.
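The checkpoint the secondary NameNode performs can be modeled as merging a log of changes into a snapshot. The dict-based fsimage and tuple-based edit log below are, of course, simplifications of the real binary formats:

```python
fsimage = {"/data/a.txt": 1, "/data/b.txt": 2}    # path -> block count
edit_log = [
    ("create", "/data/c.txt", 3),
    ("delete", "/data/a.txt", None),
]

def checkpoint(fsimage, edit_log):
    """Apply every logged change to fsimage and empty the log."""
    for op, path, blocks in edit_log:
        if op == "create":
            fsimage[path] = blocks
        elif op == "delete":
            fsimage.pop(path, None)
    # A short edit log is what makes the next NameNode startup fast:
    # the NameNode only replays what accumulated since this checkpoint.
    edit_log.clear()
    return fsimage

checkpoint(fsimage, edit_log)
print(fsimage)   # {'/data/b.txt': 2, '/data/c.txt': 3}
```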

Many modern setups use the HDFS high availability feature, which replaces the secondary NameNode with an active and a standby NameNode.


Understanding the basics of Hadoop cluster

Until now, in this chapter, we have discussed the different individual components of Hadoop. In this section, we will discuss how those components come together to build the Hadoop cluster.

There are two main components of Hadoop:

  • HDFS: This is the storage component of Hadoop, which is optimized for high throughput. It works best while reading and writing large files. We have also discussed the daemons of HDFS—NameNode, DataNode, and the secondary NameNode.

  • MapReduce: This is a batch-based, distributed computing framework, which allows you to do parallel processing over large amounts of raw data. It allows you to process the data in the storage asset itself, which reduces the distance over which the data needs to be transmitted for processing. MapReduce is a two-step process, as follows:

    • The map stage: Here, the master node takes the input and divides it into smaller chunks (subproblems), which are similar in nature. These chunks are distributed to worker nodes. Each worker node can further divide its subproblem and distribute it to other nodes by following the same process, creating a tree structure. Each worker processes its subproblem and passes the result back to its parent node, so the master node eventually receives the results collected from all the nodes.

    • The reduce stage: The master node collects the answers to all the subproblems from the worker nodes and combines them to form a meaningful output. This is the answer to the problem that originally needed to be solved.
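The two stages can be sketched in a few lines of Python, using word count as the problem to be solved. This runs in a single process; in a real cluster, the map calls run in parallel on the worker nodes:

```python
from collections import defaultdict

def map_stage(chunk):
    # Each worker emits (key, value) pairs for its chunk of the input.
    return [(word, 1) for word in chunk.split()]

def reduce_stage(pairs):
    # The values for each key are combined into the final answer.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The "master" splits the input into chunks and hands one to each worker.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = []
for chunk in chunks:          # in a real cluster, these run in parallel
    intermediate.extend(map_stage(chunk))

result = reduce_stage(intermediate)
print(result["the"], result["fox"])  # 3 2
```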

Along with these components, a fully configured Hadoop cluster runs a set of daemons on different servers in the network. All these daemons perform specific functions. As discussed in the previous section, some of them exist only on one server, while others can be present on multiple servers. We have discussed the NameNode, DataNode, and secondary NameNode. On top of these three daemons, two other daemons are required in the Hadoop cluster, as follows:

  • JobTracker: This works as a connection between your application and Hadoop. The client machine submits the job to the JobTracker. The JobTracker gets information from the NameNode about the DataNodes that contain the blocks of the file to be processed. It then provides the code for the map computation to the TaskTrackers, which run on the same servers as the DataNodes. All the TaskTrackers run the code and store the intermediate results on their local servers. After the map process completes, the JobTracker starts the reduce tasks, which also run on TaskTracker nodes; these fetch the intermediate results from the map nodes and combine them in the final computation. The final results are written to HDFS and provided to the client.

  • TaskTracker: As discussed earlier, the JobTracker manages the execution of individual tasks on each slave node. In a Hadoop cluster, each slave node runs a DataNode and a TaskTracker daemon. The TaskTracker communicates with and receives instructions from the JobTracker. When the TaskTracker receives instructions for computation, it executes the map process, as discussed in the previous point. The TaskTracker runs both map and reduce tasks. During this process, it also monitors a task's progress and sends heartbeats and the task status back to the JobTracker. If a task fails or a node goes down, the JobTracker reschedules the task on another TaskTracker by consulting the NameNode.
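The JobTracker's data-local scheduling and its failure handling can be sketched as follows. The block names, node names, and the "first live holder" policy are made up for illustration; the real scheduler weighs rack locality and load as well:

```python
block_locations = {        # block -> nodes holding a replica (from the NameNode)
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
}

def schedule(block_locations, alive_nodes):
    """Assign each block's map task to the first live node holding a replica."""
    assignments = {}
    for block, holders in block_locations.items():
        for node in holders:
            if node in alive_nodes:
                assignments[block] = node
                break
    return assignments

alive = {"node1", "node2", "node3", "node4"}
print(schedule(block_locations, alive))    # blk_1 runs on node1, blk_2 on node2

# node1 stops sending heartbeats; its task moves to another replica holder.
alive.discard("node1")
print(schedule(block_locations, alive))    # blk_1 now runs on node2
```

Because every block has three replica holders, losing a node never strands a task; the JobTracker simply picks another node that already has the data.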


Here, JobTracker and TaskTracker are used in the Hadoop cluster. However, in the case of YARN, different daemons are used, such as ResourceManager and NodeManager.

The following image displays all the components of the Hadoop cluster:

The preceding image is a very simple representation of the components in the Hadoop cluster. These are only those components that we have discussed so far. A Hadoop cluster can have other components, such as HiveServer, HBase Master, Zookeeper, Oozie, and so on.

Hadoop clusters are used to increase the speed of data analysis in your application. If the data grows so much that it exceeds the processing power of the cluster, we can add cluster nodes to increase throughput. This way, the Hadoop cluster provides high scalability. Also, each piece of data is copied onto other nodes of the cluster, so in the case of the failure of any node, the data is still available.

While planning the capacity of a Hadoop cluster, one has to carefully go through the following options. An appropriate choice of each option is very important for a good cluster design:

  • Pick the right distribution and version of Hadoop

  • Design for high availability of the network and the NameNode

  • Select appropriate hardware for your masters: the NameNode, the secondary NameNode, and the JobTracker

  • Select appropriate hardware for your workers, for data storage and computation

  • Size the cluster properly, considering the number of requests to HBase, the number of parallel MapReduce jobs, the size of incoming data per day, and so on

  • Select the operating system

We have to determine the required components in the Hadoop cluster based on our requirements. Once the components are determined, we have to determine base infrastructure requirements. Based on the infrastructure requirements, we have to perform the hardware sizing exercise. After the sizing, we need to ensure load balancing and proper separation of racks/locations. Last but not least, we have to choose the right monitoring solution to look at what our cluster is actually doing.

We will discuss these points in later chapters of this book.



In this chapter, we discussed the Hadoop basics, the practical use of Hadoop, and the different projects of Hadoop. We also discussed the two major components of Hadoop: HDFS and MapReduce. We had a look at the daemons involved in the Hadoop cluster—NameNode, the secondary NameNode, DataNode, JobTracker, and TaskTracker—along with their functions. We also discussed the steps that one has to follow when designing the Hadoop cluster.

In the next chapter, we will talk about the essentials of backup and recovery of the Hadoop implementation.

About the Authors

  • Gaurav Barot

    Gaurav Barot is an experienced software architect and PMP-certified project manager with more than 12 years of experience. He has a unique combination of experience in enterprise resource planning, sales, education, and technology. He has served as an enterprise architect and project leader on projects in various domains, including healthcare, risk, insurance, and media, for customers in the UK, USA, Singapore, and India.

    Gaurav holds a bachelor's degree in IT engineering from Sardar Patel University and has completed his postgraduate studies in IT at Deakin University, Melbourne.

  • Chintan Mehta

    Chintan Mehta is a co-founder of KNOWARTH Technologies and heads the cloud/RIMS/DevOps team. He has rich, progressive experience in server administration of Linux, AWS Cloud, DevOps, RIMS, and on open source technologies. He is also an AWS Certified Solutions Architect. Chintan has authored MySQL 8 for Big Data, Mastering Apache Solr 7.x, MySQL 8 Administrator's Guide, and Hadoop Backup and Recovery Solutions. Also, he has reviewed Liferay Portal Performance Best Practices and Building Serverless Web Applications.

  • Amij Patel

    Amij Patel is a cofounder of KNOWARTH Technologies (www.knowarth.com) and leads the mobile, UI/UX, and e-commerce verticals. He is an out-of-the-box thinker with a proven track record of designing and delivering the best design solutions for enterprise applications and products.

    He has a lot of experience in the Web, portals, e-commerce, rich Internet applications, user interfaces, big data, and open source technologies. His passion is to make applications and products interactive and user friendly using the latest technologies. Amij has a unique ability—he can deliver or execute on any layer and technology from the stack.

    Throughout his career, he has been honored with awards for making valuable contributions to businesses and delivering excellence through different roles, such as a practice leader, architect, and team leader. He is a cofounder of various community groups, such as Ahmedabad JS and the Liferay UI developers' group. These are focused on sharing knowledge of UI technologies and upcoming trends with the broader community. Amij is respected as motivational, the one who leads by example, a change agent, and a proponent of empowerment and accountability.
