Big Data Analytics

By Venkat Ankam

About this book

Big Data Analytics aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce to Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark.

Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and the dataflow tool Apache NiFi, to analyze and visualize data.

Publication date: September 2016
Publisher: Packt
Pages: 326
ISBN: 9781785884696

 

Chapter 1. Big Data Analytics at a 10,000-Foot View

The goal of this book is to familiarize you with tools and techniques using Apache Spark, with a focus on Hadoop deployments and the tools used on the Hadoop platform. Most production implementations of Spark run on Hadoop clusters, and users experience many integration challenges with the wide variety of tools used with Spark and Hadoop. This book addresses the integration challenges faced with Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN), and explains the various tools used with Spark and Hadoop. It also discusses all the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR) and their integration with analytics components such as Jupyter, Zeppelin, Hive, and HBase, and with dataflow tools such as NiFi. A real-time example of a recommendation system using MLlib will help us understand data science techniques.

In this chapter, we will approach Big Data analytics from a broad perspective and try to understand what tools and techniques are used on the Apache Hadoop and Apache Spark platforms.

Big Data analytics is the process of analyzing Big Data to provide past, current, and future statistics and useful insights that can be used to make better business decisions.

Big Data analytics is broadly classified into two major categories, data analytics and data science, which are interconnected disciplines. This chapter will explain the differences between data analytics and data science. Current industry definitions for data analytics and data science vary according to their use cases, but let's try to understand what they accomplish.

Data analytics focuses on the collection and interpretation of data, typically with a focus on past and present statistics. Data science, on the other hand, focuses on the future by performing explorative analytics to provide recommendations based on models identified by past and present data.

Figure 1.1 explains the difference between data analytics and data science with respect to time and the value achieved. It also shows the typical questions asked and the tools and techniques used. Data analytics has two main types: descriptive analytics and diagnostic analytics. Data science also has two types: predictive analytics and prescriptive analytics. The following diagram explains data science and data analytics:

Figure 1.1: Data analytics versus data science

The following table explains the differences with respect to processes, tools, techniques, skill sets, and outputs:

                              Data analytics                          Data science
Perspective                   Looking backward                        Looking forward
Nature of work                Report and optimize                     Explore, discover, investigate, and visualize
Output                        Reports and dashboards                  Data product
Typical tools used            Hive, Impala, Spark SQL, and HBase      MLlib and Mahout
Typical techniques used       ETL and exploratory analytics           Predictive analytics and sentiment analytics
Typical skill set necessary   Data engineering, SQL, and programming  Statistics, machine learning, and programming

This chapter will cover the following topics:

  • Big Data analytics and the role of Hadoop and Spark

  • Big Data science and the role of Hadoop and Spark

  • Tools and techniques

  • Real-life use cases

 

Big Data analytics and the role of Hadoop and Spark


Conventional data analytics uses Relational Database Management Systems (RDBMS) to create data warehouses and data marts for analytics using business intelligence tools. RDBMS databases use the Schema-on-Write approach, which has many downsides.

Traditional data warehouses were designed to Extract, Transform, and Load (ETL) data in order to answer a set of predefined questions, which are directly related to user requirements. Predefined questions are answered using SQL queries. Once the data is transformed and loaded in a consumable format, it becomes easier for users to access it with a variety of tools and applications to generate reports and dashboards. However, creating data in a consumable format requires several steps, which are listed as follows:

  1. Deciding predefined questions.

  2. Identifying and collecting data from source systems.

  3. Creating ETL pipelines to load the data into the analytic database in a consumable format.
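The three steps above can be sketched with a minimal Schema-on-Write example (illustrative only; the table, columns, and question are hypothetical, and SQLite stands in for a full RDBMS):

```python
# A minimal Schema-on-Write sketch: the schema is fixed before loading,
# and only the predefined question can be answered easily.
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. The predefined question: "What are total sales per region?"
# 2. The schema is decided up front; data must fit it at load time.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# 3. The ETL pipeline loads transformed data into the consumable format.
rows = [("east", 100.0), ("west", 250.0), ("east", 50.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# The predefined question is now easy to answer...
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)  # {'east': 150.0, 'west': 250.0}

# ...but a new question about data outside the schema (say, product SKUs)
# would require a schema change and a new ETL pipeline.
```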

If new questions arise, new data sources must be identified and added, and new ETL pipelines created. This involves schema changes in databases, and the implementation effort typically ranges from one to six months. This is a big constraint and forces data analysts to operate within predefined boundaries only.

Transforming data into a consumable format generally results in losing raw/atomic data that might have insights or clues to the answers that we are looking for.

Processing structured and unstructured data is another challenge in traditional data warehousing systems. Storing and processing large binary images or videos effectively is always a challenge.

Big Data analytics does not use relational databases; instead, it typically uses the Schema-on-Read (SOR) approach on the Hadoop platform with Hive and HBase. This approach has many advantages. Figure 1.2 shows the Schema-on-Write and Schema-on-Read scenarios:

Figure 1.2: Schema-on-Write versus Schema-on-Read

The Schema-on-Read approach introduces flexibility and reusability to systems. The Schema-on-Read paradigm emphasizes storing data in a raw, unmodified format and applying a schema as needed, typically while the data is being read or processed. This allows considerably more flexibility in the amount and type of data that can be stored, and multiple schemas can be applied to the same raw data to ask a variety of questions. If new questions need to be answered, simply store the new data in a new HDFS directory and start answering them.
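A minimal sketch of the Schema-on-Read idea in plain Python (the records and field names are hypothetical; on a real cluster the raw lines would live in HDFS and be read by Hive or Spark):

```python
# Schema-on-Read: raw records are stored untouched, and a schema is
# applied only at read time, so the same data can answer new questions.
raw_lines = [                      # raw, unmodified records as landed
    "2016-09-01,mobile,click,US",
    "2016-09-01,web,purchase,DE",
    "2016-09-02,mobile,purchase,US",
]

def read_with_schema(lines, fields):
    """Apply a schema (an ordered list of field names) while reading."""
    return [dict(zip(fields, line.split(","))) for line in lines]

# Question 1: traffic by device, using one schema over the raw data.
events = read_with_schema(raw_lines, ["date", "device", "action", "country"])
by_device = {}
for e in events:
    by_device[e["device"]] = by_device.get(e["device"], 0) + 1
print(by_device)  # {'mobile': 2, 'web': 1}

# Question 2: purchases by country. A new question needs no reload,
# just a different read of the same raw lines.
purchases = [e["country"] for e in events if e["action"] == "purchase"]
print(sorted(purchases))  # ['DE', 'US']
```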

This approach also provides massive flexibility in how the data can be consumed, with multiple approaches and tools. For example, the same raw data can be analyzed with SQL analytics or with complex Python or R scripts in Spark. Because the data is not stored in the multiple layers needed for ETL, storage and data movement costs are reduced. Analytics can be performed on unstructured and semi-structured data sources along with structured data sources.

A typical Big Data analytics project life cycle

The life cycle of Big Data analytics using Big Data platforms such as Hadoop is similar to that of traditional data analytics projects. However, a major paradigm shift is the use of the Schema-on-Read approach for data analytics.

A Big Data analytics project involves the activities shown in Figure 1.3:

Figure 1.3: The Big Data analytics life cycle

Identifying the problem and outcomes

Clearly identify the business problem and the desired outcome of the project, so that it scopes what data is needed and what analytics can be performed. Some examples of business problems are: company sales are going down, customers are visiting the website but not buying products, customers are abandoning their shopping carts, and a sudden rise in support call volume. Some examples of project outcomes are: improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing support call volume by 50% by the next quarter while keeping customers happy.

Identifying the necessary data

Identify the quality, quantity, format, and sources of data. Data sources can be data warehouses (OLAP), application databases (OLTP), log files from servers, documents from the Internet, and data generated from sensors and network hubs. Identify all the internal and external data source requirements. Also, identify the data anonymization and re-identification requirements of data to remove or mask personally identifiable information (PII).

Data collection

Collect data from relational databases using the Sqoop tool, and stream data using Flume. Consider using Apache Kafka for reliable intermediate storage. Design the data collection process with fault tolerance scenarios in mind.

Preprocessing data and ETL

Data comes in different formats and there can be data quality issues. The preprocessing step converts the data to a needed format or cleanses inconsistent, invalid, or corrupt data. The performing analytics phase will be initiated once the data conforms to the needed format. Apache Hive, Apache Pig, and Spark SQL are great tools for preprocessing massive amounts of data.
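The cleansing step can be sketched in plain Python (the records and the validity rule are hypothetical; at scale this would be a Hive, Pig, or Spark SQL job):

```python
# Preprocessing sketch: drop records whose age field is missing or
# corrupt, and convert the survivors to a consistent, typed format.
raw_records = [
    "alice,34,NY",
    "bob,,CA",        # missing age -> invalid
    "carol,29,TX",
    "dave,abc,WA",    # corrupt age -> invalid
]

def clean(records):
    """Keep only records whose age field parses as an integer."""
    out = []
    for rec in records:
        name, age, state = rec.split(",")
        if age.isdigit():                  # cleanse invalid/corrupt rows
            out.append({"name": name, "age": int(age), "state": state})
    return out

cleaned = clean(raw_records)
print([r["name"] for r in cleaned])  # ['alice', 'carol']
```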

This step may not be needed in some projects if the data is already in a clean format or analytics are performed directly on the source data with the Schema-on-Read approach.

Performing analytics

Analytics are performed in order to answer business questions. This requires an understanding of data and relationships between data points. The types of analytics performed are descriptive and diagnostic analytics to present the past and current views on the data. This typically answers questions such as what happened and why it happened. In some cases, predictive analytics is performed to answer questions such as what would happen based on a hypothesis.
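A minimal sketch of descriptive and diagnostic analytics over hypothetical call center records, using only the Python standard library:

```python
# Descriptive analytics answers "what happened"; diagnostic analytics
# slices the same data to ask "why it happened".
from statistics import mean

calls = [
    {"wait_s": 30, "call_s": 240, "reason": "billing"},
    {"wait_s": 90, "call_s": 300, "reason": "outage"},
    {"wait_s": 60, "call_s": 180, "reason": "billing"},
]

# Descriptive: what happened? (average wait time in seconds)
avg_wait = mean(c["wait_s"] for c in calls)
print(avg_wait)

# Diagnostic: why did it happen? Group waits by call reason and find
# the reason with the worst average wait.
by_reason = {}
for c in calls:
    by_reason.setdefault(c["reason"], []).append(c["wait_s"])
worst = max(by_reason, key=lambda r: mean(by_reason[r]))
print(worst)  # prints: outage
```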

Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase are great tools for data analytics in batch processing mode. Real-time analytics tools such as Impala, Tez, Drill, and Spark SQL can be integrated with traditional business intelligence tools (Tableau, QlikView, and others) for interactive analytics.

Visualizing data

Data visualization is the presentation of analytics output in a pictorial or graphical format to understand the analysis better and make business decisions based on the data.

Typically, finished data is exported from Hadoop to RDBMS databases using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, and Excel are integrated directly with Hadoop. Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating with Hadoop and Spark components.

The role of Hadoop and Spark

Hadoop and Spark provide you with great flexibility in Big Data analytics:

  • Large-scale data preprocessing; massive datasets can be preprocessed with high performance

  • Exploring large and full datasets; the dataset size does not matter

  • Accelerating data-driven innovation by providing the Schema-on-Read approach

  • A variety of tools and APIs for data exploration

 

Big Data science and the role of Hadoop and Spark


Data science is all about the following two aspects:

  • Extracting deep meaning from the data

  • Creating data products

Extracting deep meaning from data means fetching the value using statistical algorithms. A data product is a software system whose core functionality depends on the application of statistical analysis and machine learning to the data. Google AdWords or Facebook's People You May Know are a couple of examples of data products.

A fundamental shift from data analytics to data science

A fundamental shift from data analytics to data science is due to the rising need for better predictions and creating better data products.

Let's consider an example use case that explains the difference between data analytics and data science.

Problem: A large telecoms company has multiple call centers that collect caller information and store it in databases and filesystems. The company has already implemented data analytics on the call center data, which provided the following insights:

  • Service availability

  • The average speed of answering, average hold time, average wait time, and average call time

  • The call abandon rate

  • The first call resolution rate and cost per call

  • Agent occupancy

Now, the telecoms company would like to reduce the customer churn, improve customer experience, improve service quality, and cross-sell and up-sell by understanding the customers in near real-time.

Solution: Analyze the customer's voice. The customer's voice has deeper insights than any other information. Convert all calls to text using tools such as CMU Sphinx and scale out on the Hadoop platform. Perform text analytics to derive insights from the data. To gain high accuracy in call-to-text conversion, create language and acoustic models that are suitable for the company, and retrain the models frequently as conditions change. Also, create models for text analytics using machine learning and natural language processing (NLP) to produce the following metrics, combined with the data analytics metrics:

  • Top reasons for customer churn

  • Customer sentiment analysis

  • Customer and problem segmentation

  • 360-degree view of the customer

Notice that the business requirement of this use case created a fundamental shift from data analytics to data science implementing machine learning and NLP algorithms. To implement this solution, new tools and techniques are used and a new role, data scientist, is needed.
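As a toy illustration of the sentiment metric above, here is a lexicon-based scorer in plain Python (the word lists are hypothetical; a production system would use trained NLP models over the call transcripts):

```python
# Toy lexicon-based sentiment: score a transcript as the count of
# positive words minus the count of negative words.
POSITIVE = {"great", "thanks", "resolved", "happy"}
NEGATIVE = {"cancel", "slow", "angry", "dropped"}

def sentiment(transcript):
    """Positive minus negative word counts; > 0 means positive tone."""
    words = transcript.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("agent was great issue resolved thanks"))  # 3
print(sentiment("service is slow i want to cancel"))       # -2
```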

A data scientist has a combination of multiple skill sets—statistics, software programming, and business expertise. Data scientists create data products and extract value from the data. Let's see how data scientists differ from other roles. This will help us in understanding roles and tasks performed in data science and data analytics projects.

Data scientists versus software engineers

The difference between the data scientist and software engineer roles is as follows:

  • Software engineers develop general-purpose software for applications based on business requirements

  • Data scientists don't develop application software, but they develop software to help them solve problems

  • Typically, software engineers use Java, C++, and C# programming languages

  • Data scientists tend to focus more on scripting languages such as Python and R

Data scientists versus data analysts

The difference between the data scientist and data analyst roles is as follows:

  • Data analysts perform descriptive and diagnostic analytics using SQL and scripting languages to create reports and dashboards.

  • Data scientists perform predictive and prescriptive analytics using statistical techniques and machine learning algorithms to find answers. They typically use tools such as Python, R, SPSS, SAS, MLlib, and GraphX.

Data scientists versus business analysts

The difference between the data scientist and business analyst roles is as follows:

  • Both have a business focus, so they may ask similar questions

  • Data scientists have the technical skills to find answers

A typical data science project life cycle

Let's learn how to approach and execute a typical data science project.

The typical data science project life cycle shown in Figure 1.4 is iterative, unlike the data analytics project life cycle shown in Figure 1.3. The phases of defining problems and outcomes and communicating results sit outside the iterations that improve the outcomes of the project. However, the overall project life cycle is also iterative: it needs to be revisited from time to time after production implementation.

Figure 1.4: A data science project life cycle

Defining the problems and outcomes and preprocessing the data are similar to the data analytics project explained in Figure 1.3. So, let's discuss the new steps required for data science projects.

Hypothesis and modeling

Given the problem, consider all the possible solutions that could match the desired outcome. This typically involves a hypothesis about the root cause of the problem. So, questions around the business problem arise, such as why customers are canceling the service, why support calls are increasing significantly, and why customers are abandoning shopping carts.

A hypothesis would identify the appropriate model given a deeper understanding of the data. This involves understanding the attributes of the data and their relationships and building the environment for the modeling by defining datasets for testing, training, and production. Create the appropriate model using machine learning algorithms such as logistic regression, k-means clustering, decision trees, or Naive Bayes.
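A minimal sketch of this setup in plain Python (the dataset and the one-feature threshold model are hypothetical stand-ins for MLlib algorithms): define training and test sets, then fit the model on the training set only:

```python
# Modeling environment sketch: split labeled data into training and test
# sets, then "fit" a trivial one-feature threshold classifier.
from statistics import mean

# (support_calls_per_month, churned?) -- hypothetical labeled data.
data = [(5, 0), (12, 0), (20, 1), (25, 1), (8, 0), (30, 1),
        (3, 0), (22, 1)]

train, test = data[:6], data[6:]      # define train and test datasets

# Fit: place the decision threshold midway between the class means.
churned = [x for x, y in train if y == 1]
stayed = [x for x, y in train if y == 0]
threshold = (mean(churned) + mean(stayed)) / 2

def predict(x):
    return 1 if x > threshold else 0

print([predict(x) for x, _ in test])  # [0, 1]
```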

Measuring the effectiveness

Execute the model by running the identified model against the datasets. Measure the effectiveness of the model by checking the results against the desired outcome. Use test data to verify the results and create metrics such as Mean Squared Error (MSE) to measure effectiveness.
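For example, MSE can be computed directly from its definition (the actual and predicted values here are hypothetical):

```python
# Mean Squared Error: the average of the squared differences between
# actual and predicted values; lower is better.
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
print(mse(actual, predicted))  # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
```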

Making improvements

Measurements will illustrate how much improvement is required. Consider what you might change. You can ask yourself the following questions:


  • Was the hypothesis around the root cause correct?

  • Would ingesting additional datasets provide better results?

  • Would other solutions provide better results?

Once you've implemented your improvements, test them again and compare them with the previous measurements in order to refine the solution further.

Communicating the results

Communication of the results is an important step in the data science project life cycle. The data scientist tells the story found within the data by correlating the story to business problems. Reports and dashboards are common tools to communicate the results.

The role of Hadoop and Spark

Apache Hadoop provides you with distributed storage and resource management, while Spark provides you with in-memory performance for data science applications. Hadoop and Spark have the following advantages for data science projects:

  • A wide range of applications and third-party packages

  • A machine learning algorithms library for easy usage

  • Spark integrations with deep learning libraries such as H2O and TensorFlow

  • Scala, Python, and R for interactive analytics using the shell

  • A unification feature—using SQL, machine learning, and streaming together

 

Tools and techniques


Let's take a look at different tools and techniques used in Hadoop and Spark for Big Data analytics.

While the Hadoop platform can be used for both storing and processing data, Spark is used for processing only, by reading data into memory.

The following is an overview of the tools and techniques used in typical Big Data analytics projects:

Data collection

Tools used:

  • Apache Flume for real-time data collection and aggregation
  • Apache Sqoop for data import and export from relational data stores and NoSQL databases
  • Apache Kafka for the publish-subscribe messaging system
  • General-purpose tools such as FTP/Copy

Techniques used: real-time data capture, export, import, message publishing, data APIs, and screen scraping.

Data storage and formats

Tools used:

  • HDFS: Primary storage of Hadoop
  • HBase: NoSQL database
  • Parquet: Columnar format
  • Avro: Serialization system on Hadoop
  • Sequence File: Binary key-value pairs
  • RC File: First columnar format in Hadoop
  • ORC File: Optimized RC File
  • XML and JSON: Standard data interchange formats
  • Compression formats: Gzip, Snappy, LZO, Bzip2, Deflate, and others
  • Unstructured: text, images, videos, and so on

Techniques used: data storage, data archival, data compression, data serialization, and schema evolution.

Data transformation and enrichment

Tools used:

  • MapReduce: Hadoop's processing framework
  • Spark: Compute engine
  • Hive: Data warehouse and querying
  • Pig: Data flow language
  • Python: Functional programming
  • Crunch, Cascading, Scalding, and Cascalog: Special MapReduce tools

Techniques used: data munging, filtering, joining, ETL, file format conversion, anonymization, and re-identification.

Data analytics

Tools used:

  • Hive: Data warehouse and querying
  • Pig: Data flow language
  • Tez: Alternative to MapReduce
  • Impala: Alternative to MapReduce
  • Drill: Alternative to MapReduce
  • Apache Storm: Real-time compute engine
  • Spark Core: Spark's core compute engine
  • Spark Streaming: Real-time compute engine
  • Spark SQL: SQL analytics
  • Solr: Search platform
  • Apache Zeppelin: Web-based notebook
  • Jupyter Notebooks
  • Databricks Cloud
  • Apache NiFi: Data flow
  • Spark-on-HBase connector
  • Programming languages: Java, Scala, and Python

Techniques used: Online Analytical Processing (OLAP), data mining, data visualization, complex event processing, real-time stream processing, full-text search, and interactive data analytics.

Data science

Tools used:

  • Python: Functional programming
  • R: Statistical computing language
  • Mahout: Hadoop's machine learning library
  • MLlib: Spark's machine learning library
  • GraphX and GraphFrames: Spark's graph processing framework and the DataFrame API applied to graphs

Techniques used: predictive analytics, sentiment analytics, text and natural language processing, network analytics, and cluster analytics.

 

Real-life use cases


Let's take a look at different kinds of use cases for Big Data analytics. Broadly, Big Data analytics use cases are classified into the following five categories:

  • Customer analytics: Data-driven customer insights are necessary to deepen relationships and improve revenue.

  • Operational analytics: Performance and high service quality are the keys to maintaining customers in any industry, from manufacturing to health services.

  • Data-driven products and services: New products and services that align with growing business demands.

  • Enterprise Data Warehouse (EDW) optimization: Early data adopters have warehouse architectures that are 20 years old. Businesses modernize EDW architectures in order to handle the data deluge.

  • Domain-specific solutions: Domain-specific solutions provide businesses with an effective way to implement new features or adhere to industry compliance.

The following are typical use cases of Big Data analytics, with the category of analytics applied to each:

Customer analytics

  • A 360-degree view of the customer (data analytics and data science)
  • Call center analytics (data analytics and data science)
  • Sentiment analytics (data science)
  • Recommendation engine, for example, the next best action (data science)

Operational analytics

  • Log analytics (data analytics)
  • Call center analytics (data analytics)
  • Unstructured data management (data analytics)
  • Document management (data analytics)
  • Network analytics (data analytics and data science)
  • Preventive maintenance (data science)
  • Geospatial data management (data analytics and data science)
  • IoT analytics (data analytics and data science)

Data-driven products and services

  • Metadata management (data analytics)
  • Operational data services (data analytics)
  • Data/Big Data environments (data analytics)
  • Data marketplaces (data analytics)
  • Third-party data management (data analytics)

EDW optimization

  • Data warehouse offload (data analytics)
  • Structured Big Data lake (data analytics)
  • Licensing cost mitigation (data analytics)
  • Cloud data architectures (data analytics)
  • Software assessments and migrations (data analytics)

Domain-specific solutions

  • Fraud and compliance (data analytics and data science)
  • Industry-specific domain models (data analytics)
  • Data sourcing and integration (data analytics)
  • Metrics and reporting solutions (data analytics)
  • Turnkey warehousing solutions (data analytics)

 

Summary


Big Data analytics with Hadoop and Spark is broadly classified into two major categories: data analytics and data science. While data analytics focuses on past and present statistics, data science focuses on future predictions. Data science projects are iterative in nature, whereas data analytics projects are not.

Apache Hadoop provides you with distributed storage and resource management and Spark provides you with in-memory performance for Big Data analytics. A variety of tools and techniques are used in Big Data analytics depending on the type of use cases and their feasibility.

The next chapter will help you get started with Hadoop and Spark.

About the Author

  • Venkat Ankam

    Venkat Ankam has over 18 years of IT experience and over 5 years in big data technologies, working with customers to design and develop scalable big data applications. Having worked with multiple clients globally, he has tremendous experience in big data analytics using Hadoop and Spark.

    He is a Cloudera Certified Hadoop Developer and Administrator and also a Databricks Certified Spark Developer. He is the founder and presenter of a few Hadoop and Spark meetup groups globally and loves to share knowledge with the community.

    Venkat has delivered hundreds of trainings, presentations, and white papers in the big data sphere. While this is his first attempt at writing a book, many more books are in the pipeline.

