You're reading from Big Data Analytics with Hadoop 3

Product type: Book
Published in: May 2018
Publisher: Packt
ISBN-13: 9781788628846
Edition: 1st Edition
Tools: Hadoop
Concepts: Big Data
Author: Sridhar Alla

Chapter 8. Batch Analytics with Apache Flink

This chapter introduces Apache Flink and shows how to use it for big data analysis based on the batch processing model. We will look at the DataSet API, which provides easy-to-use methods for performing batch analysis on big data.

In this chapter, we will cover the following topics:

Introduction to Apache Flink
Installing Flink
Using the Scala shell
Using the Flink cluster UI
Batch Analytics using Flink

Introduction to Apache Flink

Flink is an open source framework for distributed stream processing, and has the following features:

It provides results that are accurate, even in the case of out-of-order or late-arriving data
It is stateful and fault tolerant, and can seamlessly recover from failures while maintaining an exactly-once application state
It performs at a large scale, running on thousands of nodes with very good throughput and latency characteristics

The following is a screenshot from the official documentation that shows how Apache Flink can be used:

Another way of viewing the Apache Flink framework is shown in the following screenshot:

All Flink programs are executed lazily: when the program’s main method runs, the data loading and transformations do not happen immediately. Instead, each operation is created and added to the program’s plan. The operations are only executed when execution is explicitly triggered by an execute() call on the execution environment. Whether...
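The deferred execution described above can be illustrated with a small sketch in plain Python. This is not Flink's API: the LazyPlan class and its map, filter, and execute methods are hypothetical names chosen only to show the idea that transformations are first recorded in a plan and run later, when execution is triggered.

```python
class LazyPlan:
    """Toy stand-in for a deferred execution plan (hypothetical, not Flink's API)."""

    def __init__(self, source):
        self._source = list(source)  # input records
        self._ops = []               # recorded operations, not yet run
        self.calls = 0               # how many times user functions have run

    def map(self, fn):
        # Record the transformation; nothing is computed here.
        self._ops.append(("map", fn))
        return self

    def filter(self, pred):
        # Record the predicate; nothing is computed here either.
        self._ops.append(("filter", pred))
        return self

    def execute(self):
        # Only now walk the recorded plan over the data.
        records = self._source
        for kind, fn in self._ops:
            if kind == "map":
                records = [self._call(fn, r) for r in records]
            else:
                records = [r for r in records if self._call(fn, r)]
        return records

    def _call(self, fn, record):
        self.calls += 1
        return fn(record)


plan = LazyPlan([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 20)
calls_before = plan.calls  # no user function has run yet; only the plan exists
result = plan.execute()    # the transformations run here
```

Building the plan costs nothing in this sketch; the counter stays at zero until execute() is called, which mirrors the behaviour the paragraph describes.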

Installing Flink

In this section, we will download and install Apache Flink.

Flink runs on Linux, OS X, and Windows. To be able to run Flink, the only requirement is having a working Java 7.x (or higher) installation. If you are using Windows, please take a look at the Flink on Windows guide at https://ci.apache.org/projects/flink/flink-docs-release-1.4/start/flink_on_windows.html, which describes how to run Flink on Windows for local setups.

You can check your version of Java by issuing the following command:

java -version

If you have Java 8, the output will look something like this:

java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
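If you want to check the requirement programmatically, the version string shown above can be parsed. The following is a minimal sketch in plain Python; the function name java_major_version is an assumption for illustration, and the sample string is simply the first line of the output shown above.

```python
import re


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_111" means Java 8.
    # Newer scheme: "11.0.2" means Java 11.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


sample = 'java version "1.8.0_111"'
major = java_major_version(sample)
# The text above states that Java 7.x or higher is required.
meets_requirement = major is not None and major >= 7
```

Note that java -version writes its output to standard error rather than standard output, so a script capturing it needs to read that stream.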

Downloading Flink

Download the Apache Flink binaries relevant to your platform at https://flink.apache.org/downloads.html:

Figure: Screenshot showing Apache Flink libraries

Download the Flink binary built for Hadoop version 2.8 by clicking on Download. You will see the download page in your browser, as...

Using the Flink cluster UI

Using the Flink cluster UI, you can understand and monitor what's running in your cluster and dig deeply into various jobs and tasks. You can monitor the job statuses, cancel jobs, or debug any problems with the jobs. By looking at logs, you can also diagnose problems with your code, and fix them.

The following is a list of Completed Jobs:

You can drill down into any particular job to see more details about the job's execution:

Figure: Drilling down into a particular job to see its execution

You can look at the Timeline of the job to get more details:

Figure: Screenshot showing the Timeline of a job

The following screenshot shows the Task Managers tab, which lists all of the task managers. This helps you understand the number and status of the task managers:

You can also check the Logs, as shown in the following screenshot:

The Metrics tab gives you details of the memory and CPU resources:

Figure: Screenshot showing details of the memory and CPU resources in the Metrics tab

You can also...

Batch analytics

Batch analytics in Apache Flink is quite similar to streaming analytics, in that Flink handles both types of analytics using the same APIs. This gives a lot of flexibility and allows code reuse across both types of analytics.

In this section, we will run some analytical jobs on our sample data, OnlineRetail.csv. We will also load cities.csv and temperature.csv to perform some join operations.
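To make the idea of a join followed by an aggregation concrete, here is a minimal sketch in plain Python. It does not use Flink, and the layouts of cities.csv and temperature.csv are not shown in this excerpt, so the rows, the column order, and the helper names hash_join and average_by_key are all assumptions made up for illustration.

```python
from collections import defaultdict

# Made-up rows: (city_id, city_name) and (city_id, temperature).
cities = [(1, "Boston"), (2, "Austin")]
temperatures = [(1, 20.0), (2, 31.5), (1, 22.0), (3, 15.0)]


def hash_join(left, right):
    """Inner-join two lists of tuples on their first field."""
    index = defaultdict(list)
    for row in left:
        index[row[0]].append(row)
    joined = []
    for row in right:
        # Rows without a matching key on the left side are dropped.
        for match in index.get(row[0], []):
            joined.append(match + row[1:])
    return joined


def average_by_key(rows, key_pos, value_pos):
    """Group rows by one field and average another field per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = totals[row[key_pos]]
        acc[0] += row[value_pos]
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


joined = hash_join(cities, temperatures)
avg_temp = average_by_key(joined, key_pos=1, value_pos=2)
```

The temperature reading for city 3 has no matching city row, so an inner join leaves it out of the result.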

Reading file

Flink comes with several built-in formats to create data sets from common file formats. Many of them have shortcut methods on the execution environment.

File-based

File-based sources can be read using the following APIs:

readTextFile(path)/TextInputFormat: Reads files line-wise and returns them as strings.
readTextFileWithValue(path)/TextValueInputFormat: Reads files line-wise and returns them as StringValues. StringValues are mutable strings.
readCsvFile(path)/CsvInputFormat: Parses files of comma (or another char) delimited...
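The difference between a line-wise read and a delimited read can be sketched in plain Python. This is not the Flink API: read_text_lines and read_csv_rows are hypothetical helpers, and the sample content is made up, since the real layout of OnlineRetail.csv is not shown in this excerpt.

```python
import csv
import io

# Made-up file content used only for this illustration.
raw = "InvoiceNo,Quantity\nA100,6\nA101,2\n"


def read_text_lines(text):
    """Line-wise read: every line comes back as one string."""
    return text.splitlines()


def read_csv_rows(text, delimiter=",", skip_header=True):
    """Delimited read: every line is split into fields and returned as a tuple."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [tuple(fields) for fields in reader]
    return rows[1:] if skip_header else rows


lines = read_text_lines(raw)
rows = read_csv_rows(raw)
```

In this sketch the fields stay as strings; converting them to numeric types would be a separate step.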

Summary

In this chapter, we discussed Apache Flink and how it can be used to perform batch analysis on large amounts of data. We explored Flink and its inner workings. We then loaded and analyzed data by performing transformation and aggregation operations, and finally explored how to perform join operations on big data.

In the next chapter, we will discuss real-time analytics using Apache Flink.


Author

Sridhar Alla is the co-founder and CTO of Blue Whale Consulting and is an expert at helping companies (big and small) define their vision for systems and capabilities that will allow them to establish a strategic execution plan to deal with the ever-growing data collected to support analytics and product teams. He is very experienced at dealing with all aspects of data collection, security, governance, and processing as part of end-to-end big data analytics and machine learning initiatives (including predictive modeling, deep learning, and ML automation). Sridhar is a published book author and an avid presenter at numerous conferences, including Strata, Hadoop World, and Spark Summit. He also has several patents filed with the US PTO on large-scale computing and distributed systems. He has over 18 years' experience writing code in Scala, Java, C, C++, Python, R, and Go, and has extensive hands-on knowledge of Spark, Flink, TensorFlow, Keras, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high-performance computing. Sridhar lives with his wife and daughter in New Jersey and, in his spare time, loves blogging and coaching organizations on next-generation advancements in technology and their alignment with business goals.