The goal of this book is to familiarize you with the tools and techniques used with Apache Spark, with a focus on Hadoop deployments and the tools used on the Hadoop platform. Most production implementations of Spark run on Hadoop clusters, and users face many integration challenges with the wide variety of tools used alongside Spark and Hadoop. This book addresses the integration challenges posed by the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN), and explains the various tools used with Spark and Hadoop. It also discusses all the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR) and their integration with analytics components such as Jupyter, Zeppelin, Hive, and HBase, and with dataflow tools such as NiFi. A real-time recommendation system example using MLlib will help us understand data science techniques.
In this chapter, we will approach Big Data analytics from a broad perspective and try to understand what tools and techniques are used on the Apache Hadoop and Apache Spark platforms.
Big Data analytics is the process of analyzing Big Data to provide past, current, and future statistics and useful insights that can be used to make better business decisions.
Big Data analytics is broadly classified into two major categories, data analytics and data science, which are interconnected disciplines. This chapter will explain the differences between data analytics and data science. Current industry definitions for data analytics and data science vary according to their use cases, but let's try to understand what they accomplish.
Data analytics focuses on the collection and interpretation of data, typically with a focus on past and present statistics. Data science, on the other hand, focuses on the future by performing explorative analytics to provide recommendations based on models identified by past and present data.
Figure 1.1 explains the difference between data analytics and data science with respect to time and the value achieved. It also shows the typical questions asked and the tools and techniques used. Data analytics mainly comprises two types of analytics: descriptive analytics and diagnostic analytics. Data science comprises two other types: predictive analytics and prescriptive analytics. The following diagram illustrates data analytics and data science:

Figure 1.1: Data analytics versus data science
The following table explains the differences with respect to processes, tools, techniques, skill sets, and outputs:
| | Data analytics | Data science |
|---|---|---|
| Perspective | Looking backward | Looking forward |
| Nature of work | Report and optimize | Explore, discover, investigate, and visualize |
| Output | Reports and dashboards | Data product |
| Typical tools used | Hive, Impala, Spark SQL, and HBase | MLlib and Mahout |
| Typical techniques used | ETL and exploratory analytics | Predictive analytics and sentiment analytics |
| Typical skill set necessary | Data engineering, SQL, and programming | Statistics, machine learning, and programming |
This chapter will cover the following topics:
Big Data analytics and the role of Hadoop and Spark
Big Data science and the role of Hadoop and Spark
Tools and techniques
Real-life use cases
Conventional data analytics uses relational database management system (RDBMS) databases to create data warehouses and data marts for analytics with business intelligence tools. RDBMS databases use the Schema-on-Write approach, which has several downsides.
Traditional data warehouses were designed to Extract, Transform, and Load (ETL) data in order to answer a set of predefined questions, which are directly related to user requirements. Predefined questions are answered using SQL queries. Once the data is transformed and loaded in a consumable format, it becomes easier for users to access it with a variety of tools and applications to generate reports and dashboards. However, creating data in a consumable format requires several steps, which are listed as follows:
Deciding predefined questions.
Identifying and collecting data from source systems.
Creating ETL pipelines to load the data into the analytic database in a consumable format.
If new questions arise, new data sources need to be identified and added and new ETL pipelines created. This involves schema changes in databases, and the implementation effort typically ranges from one to six months. This is a significant constraint and forces the data analyst to operate within predefined boundaries only.

Transforming data into a consumable format generally means losing the raw, atomic data that might hold insights or clues to the answers we are looking for.

Processing structured and unstructured data together is another challenge in traditional data warehousing systems. Storing and processing large binary objects such as images or videos effectively is always a challenge.

Big Data analytics does not use relational databases; instead, it uses the Schema-on-Read (SOR) approach on the Hadoop platform, typically with Hive and HBase. This approach has many advantages. Figure 1.2 shows the Schema-on-Write and Schema-on-Read scenarios:

Figure 1.2: Schema-on-Write versus Schema-on-Read
The Schema-on-Read approach introduces flexibility and reusability to systems. The Schema-on-Read paradigm emphasizes storing the data in a raw, unmodified format and applying a schema to the data as needed, typically while it is being read or processed. This approach allows considerably more flexibility in the amount and type of data that can be stored. Multiple schemas can be applied to the same raw data to ask a variety of questions. If new questions need to be answered, simply acquire the new data, store it in a new HDFS directory, and start answering them.

This approach also provides great flexibility in how the data can be consumed, with multiple approaches and tools. For example, the same raw data can be analyzed with SQL analytics or with complex Python or R scripts in Spark. Because the data is not stored in multiple layers, as ETL requires, storage and data movement costs are reduced. Analytics can be performed on unstructured as well as structured data sources.
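As a concrete illustration, the following is a minimal PySpark sketch of the Schema-on-Read idea: raw files stay untouched in HDFS, a schema is applied only at read time, and the same data can then be queried with SQL or with the DataFrame API. The HDFS path, column names, and schema here are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# The raw files stay untouched in HDFS; a schema is applied only at read time.
clickstream_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("status_code", IntegerType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical HDFS directory holding raw, unmodified CSV logs.
raw_events = (spark.read
              .schema(clickstream_schema)
              .option("header", "true")
              .csv("hdfs:///data/raw/clickstream/"))

# The same raw data can be queried with SQL ...
raw_events.createOrReplaceTempView("clickstream")
spark.sql("""
    SELECT page, COUNT(*) AS errors
    FROM clickstream
    WHERE status_code >= 500
    GROUP BY page
    ORDER BY errors DESC
""").show()

# ... or with the DataFrame API, without ever rewriting the source files.
raw_events.groupBy("user_id").count().show()
```

If a new question arises, a different schema or a different query can be applied to the same raw files, or new raw data can simply be landed in another directory.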
The life cycle of Big Data analytics on Big Data platforms such as Hadoop is similar to that of traditional data analytics projects. However, a major paradigm shift is the use of the Schema-on-Read approach for data analytics.
A Big Data analytics project involves the activities shown in Figure 1.3:

Figure 1.3: The Big Data analytics life cycle
Identify the business problem and the desired outcome of the project clearly, so that it scopes what data is needed and what analytics can be performed. Some examples of business problems are company sales going down, customers visiting the website but not buying products, customers abandoning shopping carts, and a sudden rise in support call volume. Some examples of project outcomes are improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing support call volume by 50% by the next quarter while keeping customers happy.
Identify the quality, quantity, format, and sources of data. Data sources can be data warehouses (OLAP), application databases (OLTP), log files from servers, documents from the Internet, and data generated by sensors and network hubs. Identify all the internal and external data source requirements. Also, identify the data anonymization and re-identification requirements for removing or masking personally identifiable information (PII).
Collect data from relational databases using the Sqoop tool and stream data using Flume. Consider using Apache Kafka for reliable intermediate storage. Design the collection process with fault-tolerance scenarios in mind.
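As a hedged alternative for the Kafka path mentioned above, Spark Structured Streaming (rather than Flume) can also land raw events from a Kafka topic into HDFS with checkpointing for fault tolerance. In the following sketch, the broker address, topic name, and paths are hypothetical, and the job assumes the Spark Kafka connector package (spark-sql-kafka) is available on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest-example").getOrCreate()

# Subscribe to a hypothetical Kafka topic that buffers call center events.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "callcenter-events")
          .load())

# Kafka delivers the key and value as binary; cast the payload to strings.
raw_text = events.select(col("key").cast("string"),
                         col("value").cast("string"))

# Land the raw events in HDFS; the checkpoint lets the pipeline recover from failures.
query = (raw_text.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/callcenter/")
         .option("checkpointLocation", "hdfs:///checkpoints/callcenter/")
         .start())

query.awaitTermination()
```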
Data comes in different formats, and there can be data quality issues. The preprocessing step converts the data to the needed format and cleanses inconsistent, invalid, or corrupt data. The analytics phase begins once the data conforms to the needed format. Apache Hive, Apache Pig, and Spark SQL are great tools for preprocessing massive amounts of data.
This step may not be needed in some projects if the data is already in a clean format or analytics are performed directly on the source data with the Schema-on-Read approach.
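When cleansing is required, a short Spark job such as the following can convert a raw extract into a conformed format. This is a minimal sketch; the column names, timestamp format, and paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, trim

spark = SparkSession.builder.appName("preprocess-example").getOrCreate()

# Hypothetical raw extract with mixed formats and quality issues.
raw = spark.read.option("header", "true").csv("hdfs:///data/raw/support_calls/")

cleaned = (raw
           .dropDuplicates(["call_id"])                  # remove duplicate records
           .na.drop(subset=["call_id", "customer_id"])   # drop rows missing key fields
           .withColumn("call_duration", col("call_duration").cast("int"))
           .withColumn("call_start",
                       to_timestamp(col("call_start"), "yyyy-MM-dd HH:mm:ss"))
           .withColumn("agent_id", trim(col("agent_id")))
           .filter(col("call_duration") >= 0))           # discard invalid durations

# Persist the conformed data for the analytics phase.
cleaned.write.mode("overwrite").parquet("hdfs:///data/curated/support_calls/")
```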
Analytics are performed in order to answer business questions. This requires an understanding of the data and of the relationships between data points. The types of analytics performed are descriptive and diagnostic analytics, which present past and current views of the data. They typically answer questions such as what happened and why it happened. In some cases, predictive analytics is performed to answer questions such as what would happen based on a hypothesis.

Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase are great tools for data analytics in batch processing mode. Real-time analytics tools such as Impala, Tez, Drill, and Spark SQL can be integrated with traditional business intelligence tools (Tableau, QlikView, and others) for interactive analytics.
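As a small illustration of descriptive and diagnostic analytics with Spark SQL, the following sketch summarizes daily call volumes (what happened) and drills into abandoned calls (why it happened). The table, columns, and paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("descriptive-analytics-example").getOrCreate()

calls = spark.read.parquet("hdfs:///data/curated/support_calls/")
calls.createOrReplaceTempView("support_calls")

# What happened: daily call volume and average wait time per call reason.
spark.sql("""
    SELECT to_date(call_start) AS call_date,
           call_reason,
           COUNT(*)            AS call_volume,
           AVG(wait_seconds)   AS avg_wait_seconds
    FROM support_calls
    GROUP BY to_date(call_start), call_reason
    ORDER BY call_date, call_volume DESC
""").show()

# Why it happened: drill into the reasons behind abandoned calls.
spark.sql("""
    SELECT call_reason, COUNT(*) AS abandoned_calls
    FROM support_calls
    WHERE abandoned = true
    GROUP BY call_reason
    ORDER BY abandoned_calls DESC
""").show()
```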
Data visualization is the presentation of analytics output in a pictorial or graphical format to understand the analysis better and make business decisions based on the data.
Typically, finished data is exported from Hadoop to RDBMS databases using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, Excel, and so on are integrated directly with Hadoop. Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating with Hadoop and Spark components.
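In a web-based notebook, a small aggregated result can be pulled to the driver and plotted inline while the raw data stays in the cluster. The following is a minimal sketch assuming a hypothetical summary dataset and that pandas and matplotlib are available in the notebook environment.

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("visualization-example").getOrCreate()

# Pull a small, aggregated result set to the driver; raw data stays in the cluster.
daily_summary = spark.read.parquet("hdfs:///data/analytics/daily_summary/")
pdf = daily_summary.orderBy("call_date").toPandas()

# A simple trend chart rendered inline in a Jupyter or Zeppelin notebook.
pdf.plot(x="call_date", y="call_volume", kind="line", title="Daily call volume")
plt.show()
```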
Hadoop and Spark provide you with great flexibility in Big Data analytics:
Large-scale data preprocessing; massive datasets can be preprocessed with high performance
Exploring large and full datasets; the dataset size does not matter
Accelerating data-driven innovation by providing the Schema-on-Read approach
A variety of tools and APIs for data exploration
Data science is all about the following two aspects:
Extracting deep meaning from the data
Creating data products
Extracting deep meaning from data means deriving value using statistical algorithms. A data product is a software system whose core functionality depends on the application of statistical analysis and machine learning to the data. Google AdWords and Facebook's People You May Know are a couple of examples of data products.
A fundamental shift from data analytics to data science is due to the rising need for better predictions and creating better data products.
Let's consider an example use case that explains the difference between data analytics and data science.
Problem: A large telecoms company has multiple call centers that collect caller information and store it in databases and filesystems. The company has already implemented data analytics on the call center data, which provided the following insights:
Service availability
The average speed of answering, average hold time, average wait time, and average call time
The call abandon rate
The first call resolution rate and cost per call
Agent occupancy
Now, the telecoms company would like to reduce the customer churn, improve customer experience, improve service quality, and cross-sell and up-sell by understanding the customers in near real-time.
Solution: Analyze the customer's voice. The customer's voice holds deeper insights than any other information. Convert all calls to text using tools such as CMU Sphinx and scale out on the Hadoop platform. Perform text analytics to derive insights from the data; to achieve high accuracy in call-to-text conversion, create language and acoustic models that are suitable for the company and retrain them frequently as conditions change. Also, create models for text analytics using machine learning and natural language processing (NLP) and combine them with the data analytics metrics to produce the following:
Top reasons for customer churn
Customer sentiment analysis
Customer and problem segmentation
360-degree view of the customer
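To make the text analytics part more concrete, the following is a minimal MLlib pipeline sketch for classifying the sentiment of transcribed calls, which could feed metrics such as customer sentiment and churn reasons listed above. The dataset, columns, and labels are hypothetical, and a real solution would involve far more feature engineering and model tuning.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("call-sentiment-example").getOrCreate()

# Hypothetical table of transcribed calls with a labeled sentiment column
# (1.0 = negative, 0.0 = non-negative) produced by manual annotation.
transcripts = spark.read.parquet("hdfs:///data/curated/call_transcripts_labeled/")

tokenizer = Tokenizer(inputCol="transcript", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=10000)
idf = IDF(inputCol="tf", outputCol="features")
lr = LogisticRegression(labelCol="sentiment", featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, lr])

train, test = transcripts.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Score unseen calls; predictions can feed sentiment and churn-reason dashboards.
predictions = model.transform(test)
predictions.select("call_id", "prediction", "probability").show()
```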
Notice that the business requirement of this use case created a fundamental shift from data analytics to data science, implementing machine learning and NLP algorithms. Implementing this solution requires new tools and techniques, and a new role: the data scientist.

A data scientist has a combination of multiple skill sets: statistics, software programming, and business expertise. Data scientists create data products and extract value from the data. Let's see how data scientists differ from other roles. This will help us understand the roles and tasks performed in data science and data analytics projects.
The difference between the data scientist and software engineer roles is as follows:
Software engineers develop general-purpose software for applications based on business requirements
Data scientists don't develop application software, but they develop software to help them solve problems
Typically, software engineers use Java, C++, and C# programming languages
Data scientists tend to focus more on scripting languages such as Python and R
The difference between the data scientist and data analyst roles is as follows:
Data analysts perform descriptive and diagnostic analytics using SQL and scripting languages to create reports and dashboards.
Data scientists perform predictive and prescriptive analytics using statistical techniques and machine learning algorithms to find answers. They typically use tools such as Python, R, SPSS, SAS, MLlib, and GraphX.
Let's learn how to approach and execute a typical data science project.
The typical data science project life cycle shown in Figure 1.4 is iterative, whereas the data analytics project life cycle shown in Figure 1.3 is not. The phases of defining problems and outcomes and of communicating results are not part of the iterations used to improve the project's outcomes. However, the overall project life cycle is iterative, and it needs to be revisited from time to time after production implementation.

Figure 1.4: A data science project life cycle
The phases of defining problems and outcomes and of data preprocessing are similar to those of the data analytics project explained in Figure 1.3. So, let's discuss the new steps required for data science projects.
Given the problem, consider all the possible solutions that could match the desired outcome. This typically involves a hypothesis about the root cause of the problem. So, questions around the business problem arise, such as why customers are canceling the service, why support calls are increasing significantly, and why customers are abandoning shopping carts.
A hypothesis identifies the appropriate model, given a deeper understanding of the data. This involves understanding the attributes of the data and their relationships, and building the environment for modeling by defining datasets for training, testing, and production. Create the appropriate model using machine learning algorithms such as logistic regression, k-means clustering, decision trees, or Naive Bayes.

Execute the model by running it against the datasets. Measure the effectiveness of the model by checking the results against the desired outcome: use test data to verify the results and compute metrics such as the Mean Squared Error (MSE).
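The following hedged sketch combines the last two steps: it trains a simple regression model with MLlib on a training split, scores the held-out test split, and computes the MSE. The dataset, feature columns, and target column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("model-evaluation-example").getOrCreate()

# Hypothetical curated dataset with numeric features and a numeric target column.
data = spark.read.parquet("hdfs:///data/curated/customer_features/")

assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls"],
    outputCol="features")
dataset = assembler.transform(data).select("features", "churn_risk")

# Hold out a test set so effectiveness is measured on data the model has not seen.
train, test = dataset.randomSplit([0.7, 0.3], seed=42)

lr = LinearRegression(featuresCol="features", labelCol="churn_risk")
model = lr.fit(train)

# Mean Squared Error on the test set quantifies how far predictions are from outcomes.
predictions = model.transform(test)
evaluator = RegressionEvaluator(labelCol="churn_risk",
                                predictionCol="prediction",
                                metricName="mse")
print("Test MSE:", evaluator.evaluate(predictions))
```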
Measurements will indicate how much improvement is required. Consider what you might change by asking yourself the following questions:
Was the hypothesis around the root cause correct?
Would ingesting additional datasets provide better results?
Would other solutions provide better results?
Once you've implemented your improvements, test them again and compare them with the previous measurements in order to refine the solution further.
Apache Hadoop provides you with distributed storage and resource management, while Spark provides you with in-memory performance for data science applications. Hadoop and Spark have the following advantages for data science projects:
A machine learning algorithms library for easy usage
Spark integrations with deep learning libraries such as H2O and TensorFlow
Scala, Python, and R for interactive analytics using the shell
A unification feature—using SQL, machine learning, and streaming together
Let's take a look at different tools and techniques used in Hadoop and Spark for Big Data analytics.
While the Hadoop platform can be used both for storing and for processing data, Spark can be used only for processing, by reading data into memory.
The following is a tabular representation of the tools and techniques used in typical Big Data analytics projects:
Let's take a look at different kinds of use cases for Big Data analytics. Broadly, Big Data analytics use cases are classified into the following five categories:
Customer analytics: Data-driven customer insights are necessary to deepen relationships and improve revenue.
Operational analytics: Performance and high service quality are the keys to maintaining customers in any industry, from manufacturing to health services.
Data-driven products and services: New products and services that align with growing business demands.
Enterprise Data Warehouse (EDW) optimization: Early data adopters have warehouse architectures that are 20 years old. Businesses modernize EDW architectures in order to handle the data deluge.
Domain-specific solutions: Domain-specific solutions provide businesses with an effective way to implement new features or adhere to industry compliance.
The following table shows you typical use cases of Big Data analytics:
| Problem class | Use cases | Data analytics or data science? |
|---|---|---|
| Customer analytics | A 360-degree view of the customer | Data analytics and data science |
| | Call center analytics | Data analytics and data science |
| | Sentiment analytics | Data science |
| | Recommendation engine (for example, the next best action) | Data science |
| Operational analytics | Log analytics | Data analytics |
| | Call center analytics | Data analytics |
| | Unstructured data management | Data analytics |
| | Document management | Data analytics |
| | Network analytics | Data analytics and data science |
| | Preventive maintenance | Data science |
| | Geospatial data management | Data analytics and data science |
| | IoT analytics | Data analytics and data science |
| Data-driven products and services | Metadata management | Data analytics |
| | Operational data services | Data analytics |
| | Data/Big Data environments | Data analytics |
| | Data marketplaces | Data analytics |
| | Third-party data management | Data analytics |
| EDW optimization | Data warehouse offload | Data analytics |
| | Structured Big Data lake | Data analytics |
| | Licensing cost mitigation | Data analytics |
| | Cloud data architectures | Data analytics |
| | Software assessments and migrations | Data analytics |
| Domain-specific solutions | | Data analytics and data science |
| | Industry-specific domain models | Data analytics |
| | Data sourcing and integration | Data analytics |
| | Metrics and reporting solutions | Data analytics |
| | Turnkey warehousing solutions | Data analytics |
Big Data analytics with Hadoop and Spark is broadly classified into two major categories: data analytics and data science. Data analytics focuses on past and present statistics, while data science focuses on the future. Data science projects are iterative in nature, whereas data analytics projects are not.
Apache Hadoop provides you with distributed storage and resource management and Spark provides you with in-memory performance for Big Data analytics. A variety of tools and techniques are used in Big Data analytics depending on the type of use cases and their feasibility.
The next chapter will help you get started with Hadoop and Spark.