Reader small image

You're reading from  Modern Big Data Processing with Hadoop

Product typeBook
Published inMar 2018
Reading LevelIntermediate
PublisherPackt
ISBN-139781787122765
Edition1st Edition
Languages
Concepts
Right arrow
Authors (3):
V Naresh Kumar
V Naresh Kumar
author image
V Naresh Kumar

Naresh has more than a decade of professional experience in designing, implementing and running very large scale Internet Applications in Fortune Top 500 Companies. He is a Full Stack Architect with hands-on experience in domains like E-commerce, Web-hosting, Healthcare, Bigdata & Analytics, Data Streaming, Advertising and Databases. He believes in Opensource and contributes to them actively. He keeps himself up-to-date with emerging technologies starting from Linux Systems Internals to Frontend technologies. He studied in BITS-Pilani, Rajasthan with Dual Degree in Computer Science & Economics.
Read more about V Naresh Kumar

Manoj R Patil
Manoj R Patil
author image
Manoj R Patil

Manoj R Patil is the Chief Architect in Big Data at Compassites Software Solutions Pvt. Ltd. where he overlooks the overall platform architecture related to Big Data solutions, and he also has a hands-on contribution to some assignments. He has been working in the IT industry for the last 15 years. He started as a programmer and, on the way, acquired skills in architecting and designing solutions, managing projects keeping each stakeholder's interest in mind, and deploying and maintaining the solution on a cloud infrastructure. He has been working on the Pentaho-related stack for the last 5 years, providing solutions while working with employers and as a freelancer as well. Manoj has extensive experience in JavaEE, MySQL, various frameworks, and Business Intelligence, and is keen to pursue his interest in predictive analysis. He was also associated with TalentBeat, Inc. and Persistent Systems, and implemented interesting solutions in logistics, data masking, and data-intensive life sciences.
Read more about Manoj R Patil

Prashant Shindgikar
Prashant Shindgikar
author image
Prashant Shindgikar

Prashant Shindgikar is an accomplished big data Architect with over 20 years of experience in data analytics. He specializes in data innovation and resolving data challenges for major retail brands. He is a hands-on architect having an innovative approach to solving data problems. He provides thought leadership and pursues strategies for engagements with the senior executives on innovation in data processing and analytics. He presently works for a large USA-based retail company.
Read more about Prashant Shindgikar

View More author details
Right arrow

Evolution data architecture with Hadoop

Hadoop is a software that helps in scalable and distributed computing. Before Hadoop came into existence, there were many technologies that were used by the industry to take care of their data needs. Let's classify these storage mechanisms:

  • Hierarchical database
  • Network database
  • Relational database

Let's understand what these data architectures are.

Hierarchical database architecture

This model of storing Enterprise data was invented by IBM in the early 60s and was used in their applications. The basic concept of hierarchical databases is that the data is organized in the form of a rooted tree. The root node is the beginning of the tree and then all the children are linked only to one of its parent nodes. This is a very unique way of storing and retrieving things.

If you have some background in computer science, trees are one of the unique ways of storing data so that it has some relation with each other (like a parent and child relationship).

This picture illustrates how data is organized in a typical HDBMS:

As we can see, the root node is the organization itself and all the data associated with the organization follows a tree structure which depicts several relationships. These relationships can be understood like this:

  • Employee owns Laptop, Mobile phone, Workstation, and iMac
  • Employee belongs to organization
  • Many vendors supply different requirements:
    • Computer vendors supply iMac and Workstation
  • Catering is in both India and USA; two vendors, The Best Caterers and Bay Area Caterers, serve these

Even though we have expressed multiple types of relationships in this one gigantic data store, we can see that the data gets duplicated and also querying data for different types of needs becomes a challenge.

Let's take a simple question like: Which vendor supplied the iMac owned by Employee-391?

In order to do this, we need to traverse the tree and find information from two different sub-trees.

Network database architecture

The network database management system also has its roots in computer science: graph theory, where there are a vast and different types of nodes and relationships connect them together. There is no specific root node in this structure. It was invented in the early 70s:

As we can see, in this structure, there are a few core datasets and there are other datasets linked with the core datasets.

This is how we can understand it:

  • The main hospital is defined
  • It has many subhospitals
  • Subhospitals are in India and USA
  • The Indian hospital uses the data in patients
  • The USA hospital uses the data in patients
  • The patients store is linked to the main hospital
  • Employees belong to the hospital and are linked with other organizations

In this structure, depending upon the design we come up with, the data is represented as a network of elements.

Relational database architecture

This system was developed again in IBM in the early 80s and is considered one of the most reputed database systems to date. A few notable examples of the software that adopted this style are Oracle and MySQL.

In this model, data is stored in the form of records where each record in turn has several attributes. All the record collections are stored in a table. Relationships exist between the data attributes across tables. Sets of related tables are stored in a database.

Let's see a typical example of how this RDBMS table looks:

We are defining the following types of tables and relationships

Employees

  • The table consists of all the employee records
  • Each record is defined in terms of:
    • Employee unique identifier
    • Employee name
    • Employee date of birth
    • Employee address
    • Employee phone
    • Employee mobile

Devices

  • The table consists of all the devices that are owned by employees
  • Each ownership record is defined in terms of the following:
    • Device ownership identifier
    • Device model
    • Device manufacturer
    • Device ownership date
    • Device unique number
    • Employee ID

Department

A table consisting of all the departments in the organization:

  • Unique department ID
  • Unique department name

Department and employee mapping table

This is a special table that consists of only the relationships between the department and employee using their unique identifiers:

  • Unique department ID
  • Unique employee ID

Hadoop data architecture

So far, we have explored several types of data architectures that have been in use by Enterprises. In this section, we will understand how the data architecture is made in Hadoop.

Just to give a quick introduction, Hadoop has multiple components:

  • Data
  • Data management
  • Platform to run jobs on data

Data layer

This is the layer where all of the data is stored in the form of files. These files are internally split by the Hadoop system into multiple parts and replicated across the servers for high availability.

Since we are talking about the data stored in terms of files, it is very important to understand how these files are organized for better governance.

The next diagram shows how the data can be organized in one of the Hadoop storage layers. The content of the data can be in any form as Hadoop does not enforce them to be in a specific structure. So, we can safely store Blu-Ray™ Movies, CSV (Comma Separated Value) Files, AVRO Encoded Files, and so on inside this data layer.

You might be wondering why we are not using the word HDFS (Hadoop Distributed File System) here. It's because Hadoop is designed to run on top of any distributed file system.

Data management layer

This layer is responsible for keeping track of where the data is stored for a given file or path (in terms of servers, offsets, and so on). Since this is just a bookkeeping layer, it's very important that the contents of this layer are protected with high reliability and durability. Any corruption of the data in this layer will cause the entire data files to be lost forever.

In Hadoop terminology, this is also called NameNode.

Job execution layer

Once we have the data problem sorted out, next come the programs that read and write data. When we talk about data on a single server or a laptop, we are well aware where the data is and accordingly we can write programs that read and write data to the corresponding locations.

In a similar fashion, the Hadoop storage layer has made it very easy for applications to give file paths to read and write data to the storage as part of the computation. This is a very big win for the programming community as they need not worry about the underlying semantics about where the data is physically stored across the distributed Hadoop cluster.

Since Hadoop promotes the compute near the data model, which gives very high performance and throughput, the programs that were run can be scheduled and executed by the Hadoop engine closer to where the data is in the entire cluster. The entire transport of data and movement of the software execution is all taken care of by Hadoop.

So, end users of Hadoop see the system as a simple one with massive computing power and storage. This abstraction has won everyone’s requirements and has become the standard in big data computing today.

Previous PageNext Page
You have been reading a chapter from
Modern Big Data Processing with Hadoop
Published in: Mar 2018Publisher: PacktISBN-13: 9781787122765
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (3)

author image
V Naresh Kumar

Naresh has more than a decade of professional experience in designing, implementing and running very large scale Internet Applications in Fortune Top 500 Companies. He is a Full Stack Architect with hands-on experience in domains like E-commerce, Web-hosting, Healthcare, Bigdata & Analytics, Data Streaming, Advertising and Databases. He believes in Opensource and contributes to them actively. He keeps himself up-to-date with emerging technologies starting from Linux Systems Internals to Frontend technologies. He studied in BITS-Pilani, Rajasthan with Dual Degree in Computer Science & Economics.
Read more about V Naresh Kumar

author image
Manoj R Patil

Manoj R Patil is the Chief Architect in Big Data at Compassites Software Solutions Pvt. Ltd. where he overlooks the overall platform architecture related to Big Data solutions, and he also has a hands-on contribution to some assignments. He has been working in the IT industry for the last 15 years. He started as a programmer and, on the way, acquired skills in architecting and designing solutions, managing projects keeping each stakeholder's interest in mind, and deploying and maintaining the solution on a cloud infrastructure. He has been working on the Pentaho-related stack for the last 5 years, providing solutions while working with employers and as a freelancer as well. Manoj has extensive experience in JavaEE, MySQL, various frameworks, and Business Intelligence, and is keen to pursue his interest in predictive analysis. He was also associated with TalentBeat, Inc. and Persistent Systems, and implemented interesting solutions in logistics, data masking, and data-intensive life sciences.
Read more about Manoj R Patil

author image
Prashant Shindgikar

Prashant Shindgikar is an accomplished big data Architect with over 20 years of experience in data analytics. He specializes in data innovation and resolving data challenges for major retail brands. He is a hands-on architect having an innovative approach to solving data problems. He provides thought leadership and pursues strategies for engagements with the senior executives on innovation in data processing and analytics. He presently works for a large USA-based retail company.
Read more about Prashant Shindgikar