Everything an organization or an individual does leaves a digital footprint that can be captured and analyzed. Quite simply, big data refers to the use of this ever-increasing volume of data.
In this chapter, we will provide an overview of big data and Hadoop with the help of the following topics:
What is big data?
Big data technology evolution
Big data landscape
Careers in big data
The Hadoop jungle explained
The first, and the coolest, definition is "data that is so big it cannot be processed using traditional data processing tools and applications".
The second, and the most professionally accepted, definition is the famous 3V one: volume, variety, and velocity are the three defining properties or dimensions of big data. Volume refers to the amount of data, variety refers to the number of types of data, and velocity refers to the speed of data processing.
There is an ongoing debate on data volume: exactly what can be classified as big data? Unfortunately, there is no defined rule for classifying it. For a small company that deals with gigabytes of data, terabytes are considered big. For a large company that deals with terabytes of data, petabytes are considered big. In any case, though, we are talking about at least more than a terabyte of data.
The size of data is growing at an exponential rate for both individuals and organizations. Also, no one wants to throw away data, especially when the price of hard disks has been dropping consistently. In addition to a desire to store endless data, corporations also want to analyze it. The data is in different forms: structured data such as massive transactional data, semistructured data such as documents, and unstructured data such as images and videos.
Until around 5 years ago, organizations used to extract, transform, and load (ETL) large amounts of data in daily batches into data warehouses or data marts with business intelligence and analytics tools on top of that. Now, with more real-time data sources such as messages, social media, and transactions, the processing of data in daily batches does not add much value. Business is now much more competitive and increasingly online. The increased speed of data generation and business needs has increased data velocity to as fast as real time.
With memory and hard disks getting cheaper and machines getting faster every day, the expectation of real-time data analytics has never been greater than it is now.
But now businesses want to process heterogeneous types of data: anything from Excel tables and databases to pure text, pictures, videos, web data, GPS data, sensor data, documents, mobile messages, or even papers that can be scanned and transformed to electronic format.
Organizations must adapt to the new data formats and be able to process and analyze them.
Although it may feel like Hadoop is just a few years old going by the market buzz, in truth it is more than 10 years old. Around 2003, Doug Cutting and Mike Cafarella were working on an open source web-search project called Nutch, which ran into performance and scalability limits.
In 2003 and 2004, Google published its papers on the Google File System and MapReduce, respectively.
Nutch's developers started making changes based on the designs in Google's papers, which helped Nutch scale beyond a single machine. The solution worked to a large extent, but it still wasn't a web-scale solution.
In 2006, Yahoo hired Doug Cutting to improve its web-search solution. He liaised with Apache to develop an open source project and took the storage and processing parts out of Nutch to create a generic, reusable framework. Interestingly, Doug named the open source project Hadoop after his son's toy elephant; even he couldn't have imagined that the name would become the top buzzword in the world of information technology a few years later. Doug's team improved the Hadoop framework and released a truly web-scale solution in 2008.
Yahoo kept on migrating its applications onto Hadoop and by 2011 it was running its search engine on 42,000 nodes containing hundreds of petabytes of data storage.
Almost every large web giant, such as Facebook, Yahoo, Twitter, and eBay, is exploiting web data using Hadoop as its framework. According to an IDC survey, 63.3 percent of all organizations have either deployed or are planning to deploy Hadoop in the next year. The remaining 36.6 percent are planning to implement it as well, but may take more than a year to do so.
There are numerous Hadoop distribution companies, such as Cloudera and Hortonworks, that make the deployment and management of Hadoop a lot easier. They also provide software support and help in the integration of Hadoop into an organization's IT landscape.
IDC also forecasts that the big data technology and services market will grow at a 27 percent compound annual growth rate to $32.4 billion through 2017: about six times the growth rate of the overall information and communication technology market.
At the moment, there are dozens of start-ups with unique products, and they have secured hundreds of millions of dollars in venture capital.
However you define it, though, there's no denying that big data will be a huge market and will attract the majority of technology expenditure for an organization.
We will discuss the big data components responsible for functions such as storage, resource management, governance, processing, and analysis. Most of these big data components are packaged into an enterprise-grade-supported Hadoop distribution, which will be discussed later in more detail.
The data is stored either on a distributed on-premise filesystem, such as Hadoop Distributed Filesystem (HDFS), or a cloud-based system, such as Amazon S3. The data is also stored in NoSQL databases, such as HBase or Cassandra on Hadoop storage.
To move data into a big data ecosystem, there are data integration tools, such as Flume and Sqoop, or web-service interfaces, such as REST and SOAP for Amazon S3.
The term NoSQL is slightly misleading, as some NoSQL databases do support SQL-like operations. The term is now popularly interpreted as Not Only SQL.
Some other common characteristics are schema-free design, easy replication support, integrated caching, simple APIs, and eventual consistency (BASE rather than ACID).
In comparison to relational databases, NoSQL databases have superior performance and are more scalable when you need to handle the following:
Large volumes of structured, semistructured, and unstructured data
Quick iterative requirements involving frequent code changes
High growth business requiring efficient and scale-out architecture
There are mainly four types of NoSQL databases:
Document databases: They pair each key with a document. Documents can be simple or complex data structures and contain key-value pairs, key-array pairs, or even nested documents; examples of these databases are MongoDB and CouchDB.
Key-value databases: They are the simplest of all NoSQL-type databases. Every single item is stored as an attribute name (or key) and corresponding value. Some key-value stores, such as Redis, allow each value to have a data type such as integer, which may be useful. Document databases can also function as a key-value database, but they are more targeted at complex structures; examples of these databases are Riak and Voldemort.
Wide-column databases: They store rows with a potentially very large, sparse set of columns, grouped into column families; examples of these databases are HBase and Cassandra.
Graph databases: They store networks of entities and the relationships between them; an example of these databases is Neo4j.
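To make the key-value versus document distinction concrete, here is a small illustrative sketch using plain Python dicts as stand-ins for the stores (no real Riak or MongoDB client is involved; the keys and records are made up):

```python
# Illustrative sketch only: plain Python dicts standing in for real
# key-value and document stores.

# Key-value store: the value is an opaque blob per key; the store itself
# cannot query inside it.
kv_store = {
    "user:1001": '{"name": "Alice", "city": "London"}',  # value is a blob
}

# Document store: the value is a structured document that the database
# understands, so fields and nested structures can be queried directly.
doc_store = {
    "user:1001": {
        "name": "Alice",
        "city": "London",
        "orders": [  # nested array, typical of document databases
            {"id": 1, "total": 25.0},
            {"id": 2, "total": 40.0},
        ],
    },
}

# A key-value store can only fetch the whole blob by key...
blob = kv_store["user:1001"]

# ...whereas a document store can reach into the document's structure.
order_totals = [o["total"] for o in doc_store["user:1001"]["orders"]]
print(order_totals)  # [25.0, 40.0]
```

In a real deployment the same trade-off applies: key-value stores are faster and simpler, while document stores pay a little overhead for queryable structure.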
The market is flooded with dozens of NoSQL vendors, each claiming superiority over the others. Most databases of a given type follow similar architectures and development methodologies, so it is not uncommon for organizations to stick to just a few vendors.
Effective resource management is a must, especially when there are multiple applications running and competing for computing and data resources. Resource managers such as YARN and Mesos manage the allocation, deallocation, and efficient utilization of your compute and data resources.
There is also a collection of tools to manage the workflow, provisioning, and distributed coordination; for example, Oozie and Zookeeper.
Data governance is all about taking care of the data, ensuring that the metadata (information about data) is recorded properly and the data is only accessed by authorized people and systems. There is a collection of tools to manage metadata, authentication, authorization, security, privacy settings, and lineage. The tools used are Apache Falcon, HCatalog, Sentry, and Kerberos.
Batch computing is an efficient way to process large data volumes, as the data is collected over a period of time and then processed. The MapReduce, Pig, and Hive scripts are used for batch processing.
Typically, a batch process will have persistent-disk-based data storage for input, output, and intermediate results.
Examples include end-of-day risk metrics calculation and historical trade analytics.
Typically, a real-time process uses in-memory processing with continuous data input, and it doesn't necessarily need persistent storage for output and intermediate results.
Examples include live processing of tweets and stock prices, fraud detection, and system monitoring.
Data integration tools bring data in and out of Hadoop ecosystems. In addition to tools provided by Apache or Hadoop distributions, such as Flume and Sqoop, there are many premium vendors, such as Informatica, Talend, Syncsort, and Datameer, that offer a one-stop shop for the majority of data integration needs. As many data integration tools are user-interface-based and use fourth-generation languages, they are easy to use and will keep your Hadoop ecosystem free from complicated low-level MapReduce programs.
Machine learning is the development of models and algorithms that learn from input data and improve based on feedback. The program is driven by the input data and doesn't follow explicit instructions. The most popular suite of machine learning libraries is Mahout, but it's not uncommon to use Spark's MLlib library or custom Java-based MapReduce instead.
A few examples are speech recognition, anomaly or fraud detection, forecasting, and recommending products.
Hadoop distributions can be combined with different business intelligence and data visualization vendors that can connect to the underlying Hadoop platform to produce management and analytical reports.
Gartner predicted that roles directly related to big data would number around 4.4 million worldwide by 2015, covering anything from basic server maintenance to high-end data science innovation.
As more companies want to apply analytics to their big data, it looks likely that the number of roles will only increase going forward. 2015 is likely to see a marked increase in the number of people directly involved in big data, as the market has started maturing and more businesses now have proven business benefits.
The technical skills required are very diverse: server maintenance, low-level MapReduce programming, NoSQL database administration, data analytics, graphic design for visualizations, data integration, data science, machine learning, and business analysis. Even the non-technical roles, such as project management and front-office staff in finance, trading, marketing, and sales who analyze the results, will need retraining in the new set of analytics and visualization tools.
There are very few people with big data or Hadoop skills and the demand is very high, so they are generally paid above the market average. The pay packages of Hadoop roles will be better if:
The skills are new and scarce
The skills are comparably difficult to learn
Technical skills such as MapReduce are combined with business domain knowledge
The skills required are so diverse that you can choose a career in a subject that you are passionate about.
Some of the most popular job roles are:
Architects: They have the breadth and depth of knowledge across a wide range of Hadoop components, understand the big picture, recommend the hardware and software to be used, and lead a team of developers and analysts if required. Job role examples include big data architect, Hadoop architect, senior Hadoop developer, and lead Hadoop developer.
Analysts: They apply business knowledge with mathematical and statistical skills to analyze data using high-level languages such as Pig, Hive, SQL, and BI tools. Job role examples include business analysts and business intelligence developers.
Data scientists: They have the breadth and depth of knowledge across a wide range of analytical and visualization tools, possess business knowledge, understand the Hadoop ecosystem, know a bit of programming, but have expertise in mathematics and statistics and have excellent communication skills.
People with development skills in Java, C#, relational databases, and server administration will find it easier to learn Hadoop development, get hands-on with a few projects, and choose to either specialize in a tool or programming language. They can also learn more skills across a variety of different components and take on architect/technical lead roles.
People with analysis skills who already understand the business processes and integration with technology can learn a few high-level programming languages such as Pig, R, Python, or BI tools. Experience in BI tools and high-level programming works best with good business domain knowledge.
Although I am trying here to divide this into two simple career paths, development and analysis, in the real world there is significant overlap between all the mentioned job roles.
As long as you have excellent development and analysis skills and are also ready to learn mathematics and business (through either formal education or experience), there is nothing stopping you from becoming the "data scientist", claimed to be the sexiest job of the 21st century by Harvard Business Review.
For detailed knowledge of this subject, I recommend the book Hadoop: The Definitive Guide, Tom White, O'Reilly Media.
A rack is normally a collection of 10 or more nodes physically stored together and connected to the same network switch. So, network latency between any two nodes in a rack is lower than the latency between two nodes on different racks.
A cluster is a collection of racks as shown in the following figure:
A file from a non-Hadoop system will be broken into multiple blocks and each block will be replicated on HDFS in a distributed form. Hadoop works best with very large files, as it is designed for streaming or sequential data access rather than random access.
Blocks have a default size of 128 megabytes each since Hadoop 2.1.0. Newer versions of Hadoop are moving towards higher default block sizes. The block size can be configured depending on an application's needs.
A 350 megabyte file with a 128 megabyte block size consumes three blocks, but the third block does not consume a full 128 megabytes. The blocks fit well with the replication design, which allows HDFS to be fault tolerant and available on commodity hardware.
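The block arithmetic above can be sketched in a few lines. This is a simplified illustration, not real HDFS code; the helper name is made up, and sizes are in megabytes rather than the bytes HDFS actually tracks:

```python
def hdfs_block_layout(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file of the given size would occupy.

    Simplified sketch of HDFS block allocation: full blocks first, then
    one final block holding only the remainder.
    """
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only uses what it needs
    return blocks

# The 350 MB example from the text: two full 128 MB blocks plus one 94 MB block.
layout = hdfs_block_layout(350)
print(layout)       # [128, 128, 94]
print(len(layout))  # 3
```

Note that the partial last block consumes only 94 MB of disk, not a full 128 MB; the block size is an upper bound, not an allocation unit.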
As shown in the following figure, each block is replicated to multiple nodes. For example, Block 1 is stored on Node 1 and Node 2. Block 2 is stored on Node 1 and Node 3. And Block 3 is stored on Node 2 and Node 3. This allows for node failure without data loss. If Node 1 crashes, Node 2 still runs and has Block 1's data. In this example, we are only replicating data across two nodes, but you can set replication to be across many more nodes by changing Hadoop's configuration or even setting the replication factor for each individual file.
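The replication layout in this example can be modelled as a toy sketch; the block and node names mirror the figure, and nothing here is real HDFS code:

```python
# Toy model of HDFS-style replication with a factor of two, matching the
# figure: each block lives on exactly two nodes.
block_locations = {
    "block1": {"node1", "node2"},
    "block2": {"node1", "node3"},
    "block3": {"node2", "node3"},
}

def readable_blocks(failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return {
        block
        for block, nodes in block_locations.items()
        if nodes - failed_nodes  # at least one replica on a surviving node
    }

# If node1 crashes, every block is still readable from a surviving replica.
print(sorted(readable_blocks({"node1"})))  # ['block1', 'block2', 'block3']

# With only two replicas, losing two nodes can lose a block (block1 here),
# which is why the HDFS default replication factor is three.
print(sorted(readable_blocks({"node1", "node2"})))  # ['block2', 'block3']
```

In real HDFS, the NameNode would also notice the under-replicated blocks and schedule new copies on the remaining live DataNodes.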
MapReduce programs are of two types: a map transformation or a reduce transformation.
A MapReduce job executes as parallel tasks on different nodes: the map tasks run first, and then the results of the map tasks are processed by the reduce tasks.
An HDFS node is either a NameNode or DataNode.
A MapReduce node is either a JobTracker or a TaskTracker.
The client generally communicates with a JobTracker. However, it can also communicate with the NameNode or any of the DataNodes.
There is only one NameNode in the cluster. The DataNodes store all the distributed blocks, but the metadata for those files is stored on the NameNode. This means that the NameNode is a single point of failure, and therefore it deserves the best enterprise hardware for maximum reliability. As the NameNode keeps the entire filesystem metadata in memory, it is recommended to install as much RAM as possible.
An HDFS cluster has many DataNodes, which store the blocks of data. When a client requests to read a file from HDFS, it finds out from the NameNode which blocks make up that file and on which DataNodes those blocks are stored. The client can then read the blocks directly from the corresponding DataNodes. These DataNodes run on inexpensive commodity hardware, and replication is provided in the software layer instead of the hardware layer.
As with the NameNode, there is only one JobTracker in the cluster, and it manages all MapReduce jobs submitted by clients. It schedules the Map and Reduce tasks using TaskTrackers on the DataNodes themselves. Tasks are scheduled intelligently using the locations of data blocks via rack-awareness, a configuration that prefers processing a data block on the same rack where it is stored, if possible. Blocks are also replicated on a different rack to handle rack failures. The JobTracker monitors the progress of TaskTrackers and reschedules failed tasks on a different DataNode holding a replica.
MapReduce V2 introduced architectural changes to fix these limitations. This new architecture is also called Yet Another Resource Negotiator (YARN). However, YARN is not restricted to running MapReduce; other distributed applications can run on it as well.
MapReduce V1 programs are still supported on V2, as V1 is still widely used by organizations across the world.
YARN works on two fundamentals:
As shown in the following figure, when an application gets invoked by the client, an Application Master gets started on a NodeManager. The Application Master is then responsible for negotiating resources with the ResourceManager. These resources are assigned to containers on each slave node and the tasks are run in the containers.
The Hadoop architecture is now modified to support high availability of NameNode, which is a key requirement for any business-critical application. There are now two NameNodes, one active and one standby.
As shown in the following figure, there are JournalNodes; a basic setup with one active and one standby NameNode uses three of them. Only one of the NameNodes is active at a time. The active NameNode writes its edits to a majority of the JournalNodes, and the standby reads from them to keep itself in sync. If, for some reason, the active NameNode goes down, the standby NameNode takes over.
Hadoop has been improved further to provide extreme scalability. There are multiple NameNodes acting independently. Each NameNode has its own namespace and therefore has control over its own set of files. However, they share all of the DataNodes.
MapReduce V2, like V1, is aware of the topology of the network. When rack-awareness is configured for your cluster, Hadoop will always try to run a task on the node with the highest-bandwidth access to its data.
We know that big data is everywhere and that it needs to be processed and analyzed into something meaningful. But how do we process the data without breaking the bank?
It is enterprise grade but runs on commodity servers. Storing big data on traditional database storage is very expensive, generally in the order of $25,000 to $50,000 per terabyte per year. With Hadoop, however, the cost of storing data on commodity servers drops by 90 percent, to the order of $2,500 to $5,000 per terabyte per year.
It scales horizontally; data can be stored for longer if required, as there is no need to discard older data when you run out of disk space. We can simply add more machines and keep processing older data for long-term trend analytics. As it scales linearly, there is no performance impact as data grows, and there is never a need to migrate to a bigger platform.
It can process semistructured/unstructured data such as images, log files, videos, and documents with much better performance and lower cost, which wasn't the case earlier.
Processing has moved to data rather than large data moved to the computing engine. There is minimal data movement and processing logic can be written in a variety of programming languages such as Java, Scala, Python, Pig, and R.
But Hadoop is not just about HDFS and MapReduce. The little elephant has a lot of friends in the jungle who help in different functions of the end-to-end data life cycle. We will talk briefly on popular members of the jungle in subsequent sections.
HDFS is about unlimited, reliable, distributed storage. That means that, once you have established your storage and your business then grows, the additional storage can be handled by adding more inexpensive commodity servers or disks. Now, you may wonder whether these servers might fail occasionally, given that they are not high end. The answer is yes; in fact, on a large cluster of, say, 10,000 servers, at least one will fail every day. Still, HDFS is self-healing and extremely fault tolerant, as every piece of data is replicated on different nodes (the default is three). So, if one DataNode fails, either of the other two replicas can be used. Also, if a DataNode fails or a block gets corrupted or goes missing, its data is re-replicated to another live DataNode.
MapReduce is about moving the computation onto the data nodes in a distributed form. It can read data from HDFS as well as other data sources, such as mounted filesystems and databases. It is extremely fault tolerant: if any computing task on a node fails, Hadoop reassigns it to a node holding another copy of the data. Hadoop knows how to clean up the failed task and combine the results from the various MapReduce tasks.
The MapReduce framework processes data in two phases: Map and Reduce. Map is used to filter and sort the data and Reduce is used to aggregate the data. We will show you the famous word count example, which is the best way to understand the MapReduce framework:
The input data is split into multiple blocks in HDFS.
Each block is fed to Map for processing and the output is a key-value pair of word as key and count as the value.
The key-value pairs are sorted by key and partitioned to be fed to Reduce for processing. The key-value pair with the same keys will be processed with the same Reducer.
Reduce will sum the count of each word and output it as a key-count pair with word as key and count as the sum.
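The steps above can be sketched as a single-process Python analogue of the word-count flow. This is purely illustrative; real Hadoop runs the same phases as distributed tasks across many nodes, and the function names here are made up:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in an input block."""
    for word in block.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Sort/partition: group pairs by key so each reducer sees all
    values for one word together (the 'same key, same Reducer' rule)."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(count for _, count in pairs) for word, pairs in grouped}

# Two "blocks" of input, as if split by HDFS.
blocks = ["the quick brown fox", "the lazy dog and the fox"]
pairs = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Each map call corresponds to a Map task on one block, and each key group corresponds to the input of one Reduce task.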
Another example of MapReduce at work is finding the customers with the highest number of transactions for a certain period, as portrayed in the following figure:
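The same map/shuffle/reduce shape applies to this transaction example. Here is a hedged single-machine sketch; the customer names and records are made up for illustration, and `Counter` stands in for the distributed grouping and summing that Hadoop would perform:

```python
from collections import Counter

# Hypothetical transaction records for the example.
transactions = [
    {"customer": "acme", "amount": 120},
    {"customer": "globex", "amount": 75},
    {"customer": "acme", "amount": 60},
    {"customer": "initech", "amount": 200},
    {"customer": "acme", "amount": 30},
    {"customer": "globex", "amount": 45},
]

# Map: emit (customer, 1) per transaction; Reduce: sum per customer.
# Counter performs both steps in one pass on a single machine.
counts = Counter(t["customer"] for t in transactions)

# Pick the customers with the highest number of transactions.
top = counts.most_common(2)
print(top)  # [('acme', 3), ('globex', 2)]
```

On a real cluster, the mappers would scan transaction records block by block and the reducers would receive all pairs for one customer, exactly as in the word-count flow.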
HBase is a wide-column database, one of the NoSQL database types. The database rows are stored as key-value pairs and don't store null values. So, it is very suitable for sparse data in terms of storage efficiency. It is linearly scalable and provides real-time random read-write access to data stored on HDFS.
Please visit http://hbase.apache.org/ for more details.
Hive is a high-level declarative language for writing MapReduce programs, aimed at all the data-warehouse-loving people who like to write SQL queries against data in HDFS. It is a very good fit for structured data, and if you need to do a bit more than standard SQL, you can write your own user-defined functions in Java or Python and use them.
Please visit http://hive.apache.org/ for more details.
Pig is a high-level procedural language for writing MapReduce programs for data engineers coming from a procedural language background. It is also a good fit for structured data but can handle unstructured text data as well. If you need additional functions that are not available by default, you can write your own user-defined functions in Java or Python and use them.
Like Hive queries, Pig scripts are also translated into MapReduce jobs, and they are often almost as good as hand-written MapReduce jobs in terms of performance.
Please visit http://pig.apache.org/ for more details.
Zookeeper runs on a cluster of servers and makes cluster coordination fast and scalable. It is also fault tolerant because of its replication design.
Please visit http://zookeeper.apache.org/ for more details.
Oozie is a workflow scheduler system to manage a variety of Hadoop jobs such as Java MapReduce, MapReduce in other languages, Pig, Hive, and Sqoop. It can also manage system-specific jobs such as Java programs and shell scripts.
Like the other components, it is a scalable, reliable, and extensible system.
Please visit http://oozie.apache.org/ for more details.
Flume is a distributed, reliable, scalable, available service for efficiently collecting, aggregating, and moving large amounts of log data. As shown here, it has a simple and flexible architecture based on streaming data flows:
The incoming data can be log files, click streams, sensor devices, database logs, or social data streams, and outgoing data can be HDFS, S3, NoSQL, or Solr data.
The source accepts incoming data and writes it to the channel(s). The channel provides flow ordering and persists the data, and the sink removes data from the channel once it has been delivered to the outgoing data store.
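The source, channel, and sink flow just described can be mimicked with a small in-memory toy. The class names here are illustrative and are not the real Flume API; a real channel would persist events to disk or memory with transactional guarantees:

```python
from collections import deque

class Channel:
    """Buffers events in arrival order, like a Flume channel."""
    def __init__(self):
        self._queue = deque()
    def put(self, event):
        self._queue.append(event)   # preserve flow order
    def take(self):
        return self._queue.popleft() if self._queue else None

class Source:
    """Accepts incoming data and writes it to the channel."""
    def __init__(self, channel):
        self.channel = channel
    def receive(self, event):
        self.channel.put(event)

class Sink:
    """Drains the channel into a destination store (a plain list here,
    standing in for HDFS, S3, or a NoSQL store)."""
    def __init__(self, channel, store):
        self.channel = channel
        self.store = store
    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.store.append(event)  # event leaves the channel on delivery
            event = self.channel.take()

store = []
channel = Channel()
source = Source(channel)
sink = Sink(channel, store)

for line in ["log line 1", "log line 2", "log line 3"]:
    source.receive(line)
sink.drain()
print(store)  # ['log line 1', 'log line 2', 'log line 3']
```

The key property shown is that the channel decouples the source from the sink: the source can keep accepting events even while the sink is still draining earlier ones.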
Please visit http://flume.apache.org/ for more details.
Sqoop is a tool designed to transfer bulk data between Hadoop and structured datastores such as relational databases.
Please visit http://sqoop.apache.org/ for more details.
There are many distributions in the market, generally classified into two segments: on premise or in the cloud. It is quite important to know which route to go for, preferably during the early phase of the project. It is not uncommon to choose both for different use cases or to go for a hybrid approach.
Elasticity: In the cloud, you can add or delete computing and temporary data resources depending on your usage. So, you pay for only what you use, which is a major pro of the cloud. On premise, you need to procure resources for your peak usage and plan in advance for data growth.
Horizontal scalability: Whether on premise or in the cloud, you should be able to add more servers whenever you like. However, it is much quicker to do so in the cloud.
Cost: This is debatable, but going by plain calculation, you may save cost by opting for the cloud because you will make savings on staff, technical support, and data centers.
Performance: This is again debatable, but if most of your data resides on premise, you will get better performance by going on premise due to data locality.
Ease of use: Hadoop server administration on premise is hard, especially given the lack of skilled people. It is possibly easier to let your data servers be administered in the cloud by administrators who are experts at it.
Security: Although there is no strong reason to believe that on premise is more secure than the cloud, once data moves out of the local premises it becomes harder to ascertain the security of sensitive data such as customers' personal details.
Customizability: The cloud imposes certain limitations on configuration and custom code, as the software is kept generic across all clients, so it is much better to keep the installation on premise if there are complex requirements that need a high degree of customization.
Forrester Research published its report on leading Hadoop distribution providers in Q1 2014, as shown in the following figure. As of this writing, we are still waiting for the 2015 report to be released, but we don't expect to see any major changes.
The Hadoop Distributed Filesystem (HDFS)
The Hadoop MapReduce framework
Hadoop YARN, the resource management framework
Hadoop common: shared libraries and utilities
There are numerous open source Apache Hadoop components, such as Hive, Pig, HBase, and Zookeeper, each performing different functions.
If it's a free, open source collection of software, then why isn't everyone simply downloading, installing, and using it instead of different packaged distributions?
As Hadoop is an open source project, most of the vendors simply improve the existing code by adding new functionalities and combine it into their own distribution.
The majority of organizations choose to go with one of the vendor distributions, rather than the Apache open source distribution, because:
They provide technical support and consulting, which makes it easier to use the platforms for mission-critical and enterprise-grade tasks and scale out for more business cases
They supplement Apache tools with their own tools to address specific tasks
They package all version-compatible Hadoop components into an easy-to-install package
They deliver fixes and patches as soon as they discover them, which makes their solutions more stable
In addition, vendors also contribute the updated code to the open source repository, fostering the growth of the overall Hadoop community.
As shown in the following figure, most of the distributions will have core Apache Hadoop components, complemented by their own tools, to provide additional functionalities:
Cloudera: It is the oldest distribution with the highest market share. Its components are mostly free and open source, complemented by a few of its own tools if you upgrade to the paid service.
MapR: It has its own proprietary distributed filesystem (an alternative to HDFS). Some of its components are open source, but you get a lot more if you upgrade to the paid service. Its performance is slightly better than Cloudera's, and it has additional features such as mirroring and no single point of failure.
There are other premium distributions from IBM, Pivotal Software, Oracle, and Microsoft that generally combine Apache projects with their own software/hardware or partner with other vendors to provide all-inclusive data platforms.
It is also called Hadoop as a service. Today, many cloud providers offer Hadoop with a distribution of your choice. The main motivation to run Hadoop in the cloud is that you only pay for what you use.
Although the majority of Hadoop implementations are on premise, Hadoop in the cloud is also a good alternative for organizations who like to:
Lower the cost of innovation
Procure large-scale resources quickly
Have variable resource requirements
Run closer to the data already in the cloud
Simplify Hadoop operations
The public cloud market reached $67 billion in 2014 and will grow to $113 billion by 2018, according to Technology Business Research. This only proves that in future more and more data will be generated and stored in the cloud, which will make it easier to process in the cloud.
The top vendors that provide Hadoop as a service, sometimes with a choice of distribution such as Cloudera, MapR, or Hortonworks, are:
Amazon Web Services
You can literally run your own cluster, provided you have a credit or debit card with some balance left. The steps to get started in the cloud are quite simple:
Create an account with the cloud service provider.
Upload your data into the cloud.
Develop your data-processing application in a programming language of your choice.
Configure and launch your cluster.
Retrieve the output.
We will discuss this in more detail in a subsequent chapter.
In this chapter, we provided an overview of big data and the need to process it using Hadoop. We also covered the Hadoop architecture, its key components, and its distributions, and looked at how Hadoop fits into the larger big data landscape and the related career opportunities.
In the next chapter, we will look at big data use cases within the financial sector.