Chapter 2. Revisiting Java

This chapter serves as a refresher course on Java. In this chapter, we will discuss some concepts of Java that will be useful when creating applications in Apache Spark.

This book assumes that the reader is comfortable with the basics of Java. We will discuss useful Java concepts, focusing mainly on what is new in Java 8. In particular, we will touch upon topics such as:

  • Generics
  • Interfaces
  • Lambda expressions
  • Streams

Why use Java for Spark?


With the rise of multi-core CPUs, Java could not keep up with the design changes needed to utilize that extra available power, because of the complexity surrounding concurrency and immutability. We will discuss this in detail later. First, let's understand the importance and usability of Java in the Hadoop ecosystem. As MapReduce was gaining popularity, Google introduced a framework called FlumeJava that helped in pipelining multiple MapReduce jobs. FlumeJava consists of immutable parallel collections capable of performing lazily evaluated, optimized chained operations. That might sound eerily similar to what Apache Spark does, but even before Apache Spark and FlumeJava, there was Cascading, which built an abstraction over MapReduce to simplify the way MapReduce tasks are developed, tested, and run. All these frameworks were mainly Java implementations to simplify MapReduce pipelines, among other things.

These abstractions were simple in fact...

Generics


Generics were introduced in Java 1.5. Generics help the user create general-purpose code that has an abstract type in its definition. That abstract type can be replaced with any concrete type in the implementation.

For example, the List interface or its implementations, such as ArrayList, LinkedList, and so on, are defined with a generic type. Users can provide a concrete type, such as Integer, Long, or String, when instantiating the list:

List<Integer> list1 = new ArrayList<Integer>(); 
List<String> list2 = new ArrayList<String>(); 

Here, list1 is a list of integers and list2 is a list of strings. Since Java 7, the compiler can infer the type, so the preceding code can also be written as follows:

List<Integer> list1 = new ArrayList<>(); 
List<String> list2 = new ArrayList<>(); 

Another huge benefit of generics is that they bring compile-time safety. Let's create a list without the use of generics:

List list = new ArrayList<>();...
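
As a minimal self-contained sketch of this compile-time safety (the class and variable names here are illustrative), compare a raw list with a generic one:

import java.util.ArrayList; 
import java.util.List; 
 
public class GenericsSafety { 
   public static void main(String[] args) { 
      List rawList = new ArrayList(); 
      rawList.add(1); 
      rawList.add("two"); // compiles, but silently mixes types 
      // Integer i = (Integer) rawList.get(1); // fails at runtime with ClassCastException 
 
      List<Integer> typedList = new ArrayList<>(); 
      typedList.add(1); 
      // typedList.add("two"); // rejected by the compiler 
      Integer i = typedList.get(0); // no cast needed 
   } 
} 

With generics, the type mismatch is caught at compile time instead of surfacing as a runtime exception.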

Interfaces


Interfaces are reference types in Java. They are used to define contracts among classes. Any class that implements an interface has to adhere to the contract that the interface defines.

For example, here is a Car interface that consists of three abstract methods:

public interface Car { 
   void shape(); 
   void price(); 
   void color(); 
} 

Any class that implements this interface has to implement all the abstract methods of the interface, unless it is an abstract class. Interfaces can only be implemented by classes or extended by other interfaces; they cannot be instantiated.

Prior to Java 8, interfaces consisted only of abstract methods and final variables. In Java 8, interfaces may contain default and static methods as well.

Static method in an interface

A static method in an interface is similar to a static method in a class. Users cannot override static methods. So even if a class implements an interface, it cannot override the interface's static methods.

Like a static...
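
To illustrate, here is a minimal sketch (the Vehicle interface and its method names are purely illustrative) showing a static method alongside a default method in a Java 8 interface:

public interface Vehicle { 
   // Static method: belongs to the interface itself, invoked as Vehicle.category() 
   static String category() { 
      return "vehicle"; 
   } 
 
   // Default method: inherited by implementing classes, which may override it 
   default String description() { 
      return "a generic " + category(); 
   } 
} 
 
class Bike implements Vehicle { 
   // No need to implement description(); the default applies. 
} 

A call such as Vehicle.category() resolves on the interface itself, while new Bike().description() picks up the inherited default implementation.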

Lambda expressions


Lambda expressions are a brand new feature of Java. They were introduced in Java 8 and are a step towards facilitating functional programming in Java.

Lambda expressions let you define a method without declaring it, so you do not need a name for the method, a return type, and so on. Like anonymous inner classes, lambda expressions provide a way to pass behavior to functions; a lambda, however, is a much more concise way of writing the code.
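
For reference, the anonymous inner class version referenced below would look roughly like this (a sketch using java.io.FilenameFilter, whose accept(File, String) method is the contract being implemented; the class name MyFilter is illustrative):

import java.io.File; 
import java.io.FilenameFilter; 
 
public class MyFilter { 
   public static void main(String[] args) { 
      File dir = new File("src/main/java"); 
      // Anonymous inner class implementing FilenameFilter 
      dir.list(new FilenameFilter() { 
         @Override 
         public boolean accept(File dirname, String name) { 
            return name.endsWith("java"); 
         } 
      }); 
   } 
} 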

For example, the preceding anonymous inner class example can be converted to a lambda expression as follows:

import java.io.File; 
 
public class MyFilterImpl { 
   public static void main(String[] args) { 
      File dir = new File("src/main/java"); 
      dir.list((dirname, name) -> name.endsWith("java")); // Lambda expression 
   } 
} 

Note that the signature of the lambda expression exactly matches the signature of the accept method in the FilenameFilter interface.

Note

One of the huge differences between Lambda and anonymous inner classes...

Lexical scoping


Lexical scoping is also referred to as static scoping. Under lexical scoping, a variable is accessible in the scope in which it is defined, and that scope is determined at compile time.

Let us consider the following example:

public class LexicalScoping { 
   int a = 1; 
   // a has class level scope. So It will be available to be accessed 
   // throughout the class 
 
   public void sumAndPrint() { 
      int b = 1; 
      int c = a + b; 
      // b and c are local variables of method. These will be accessible 
      // inside the method only 
   } 
   // b and c are no longer accessible 
} 

Variable a will be available throughout the class (let's not consider the difference between static and non-static for now). However, variables b and c will be available inside the sumAndPrint method only.

Similarly, a variable defined inside a lambda expression is accessible only within that lambda. For example:

list.stream().map(n -> n*2 ); 

Here n is lexically scoped to...
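
As a minimal sketch of this scoping (the variable names are illustrative): n exists only inside the lambda body, while a variable from the enclosing scope can be read inside the lambda as long as it is effectively final:

import java.util.Arrays; 
import java.util.List; 
 
public class LambdaScope { 
   public static void main(String[] args) { 
      List<Integer> list = Arrays.asList(1, 2, 3); 
      int factor = 2; // effectively final, so the lambda may read it 
 
      list.stream() 
          .map(n -> n * factor) // n is visible only inside this lambda 
          .forEach(System.out::println); 
 
      // System.out.println(n); // would not compile: n is out of scope here 
   } 
} 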

Streams


A stream represents a collection of elements on which a chain of aggregate operations can be performed lazily. Streams have been optimized to cater to both sequential and parallel computation, keeping in mind the hardware capabilities of CPU cores. The Streams API was introduced in Java 8 to cater to the functional programming needs of developers. Streams are not Java collections; however, they can operate over collections by first converting them into streams. Some of the characteristics of streams that make them uniquely different from the Java collections API are:

  • Streams do not store elements. They only transfer values received from sources such as I/O channels, generating functions, and data structures (the collections API), and perform a set of pipelined computations on them.
  • Streams do not change the underlying data; they only process it and produce a new set of resultant data (see the sketch after this list). When a distinct() or sorted() method is called on a stream, the source data does not...
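
As a minimal sketch of this non-mutating behavior (the class and variable names are illustrative), distinct() and sorted() below produce new values while the source list is left untouched:

import java.util.Arrays; 
import java.util.List; 
import java.util.stream.Collectors; 
 
public class StreamNoMutation { 
   public static void main(String[] args) { 
      List<Integer> source = Arrays.asList(3, 1, 2, 3); 
 
      List<Integer> result = source.stream() 
            .distinct()                    // drops the duplicate 3 
            .sorted()                      // orders the remaining values 
            .collect(Collectors.toList()); // materializes a new list 
 
      System.out.println(source); // [3, 1, 2, 3] -- unchanged 
      System.out.println(result); // [1, 2, 3] 
   } 
} 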

Intermediate operations


Intermediate operations always return another stream and are evaluated lazily, only when a terminal operation is called. Lazy evaluation optimizes intermediate operations when multiple operations are chained together, as evaluation only takes place after the terminal operation. Another scenario where lazy evaluation tends to be useful is with infinite or large streams, where iterating over the entire stream may not be required or even possible, as with anyMatch(), findFirst(), and so on. In these scenarios, short circuiting (discussed in the next section) takes place and the terminal operation exits the flow as soon as the criteria are met, rather than iterating over all the elements.

Intermediate operations can be further subdivided into stateful and stateless operations. Stateful operations, such as sorted(), limit(), sequential(), and so on, preserve previously seen values, since they need them while processing the current record. For example...
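
As a minimal sketch of this laziness (the class and variable names are illustrative), the peek() call below runs only for the elements actually pulled through the pipeline by the short-circuiting terminal operation:

import java.util.Arrays; 
import java.util.List; 
import java.util.Optional; 
 
public class LazyStreams { 
   public static void main(String[] args) { 
      List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5); 
 
      // Nothing is evaluated until findFirst() is invoked; it then 
      // short-circuits after the first element that passes the filter. 
      Optional<Integer> first = numbers.stream() 
            .peek(n -> System.out.println("inspecting " + n)) // traces evaluation 
            .filter(n -> n % 2 == 0) 
            .findFirst(); 
 
      first.ifPresent(n -> System.out.println("found " + n)); 
      // Prints "inspecting 1", "inspecting 2", "found 2" -- 3, 4, 5 are never inspected. 
   } 
} 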

Terminal operations


Terminal operations act as the trigger point in a pipelined stream operation: they trigger execution. A terminal operation either returns void or a non-stream object, and once the pipelined operations have been executed on the stream, the stream becomes redundant. A terminal operation is always required for a pipelined stream operation to be executed.

Stream operations can be further classified as short-circuiting operations. An intermediate operation is said to be short-circuiting when an infinite input produces a finite stream, as in the case of the limit() method. Similarly, a terminal operation is short-circuiting when an infinite input may terminate in finite time, as is the case for methods such as anyMatch(), findAny(), and so on.
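
As a minimal sketch of both kinds of short circuiting (the class name is illustrative): Stream.iterate() below produces an infinite stream, limit() cuts it down to a finite one, and anyMatch() terminates in finite time:

import java.util.stream.Stream; 
 
public class ShortCircuiting { 
   public static void main(String[] args) { 
      // limit(): a short-circuiting intermediate operation that turns 
      // an infinite stream of even numbers into a finite one. 
      Stream.iterate(0, n -> n + 2) 
            .limit(5) 
            .forEach(System.out::println); // 0 2 4 6 8 
 
      // anyMatch(): a short-circuiting terminal operation that stops 
      // as soon as a match is found, even on an infinite stream. 
      boolean found = Stream.iterate(0, n -> n + 2) 
            .anyMatch(n -> n > 100); 
      System.out.println(found); // true 
   } 
} 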

Streams also support parallelism, but in the case of stateful operations, since each parallel operation can have its own state, a parallel stateful operation may have to undergo multiple passes to complete the...

Summary


This chapter dealt with various concepts of Java, particularly topics around generics and interfaces. We also discussed new features introduced in Java 8, such as lambda expressions and streams. The introduction of default methods in Java 8 was also covered in detail.

In the next chapter, we will discuss the installation of Spark and Scala. We will also discuss Scala's REPL and the Spark UI, along with Spark's REST API for working with Spark jobs, Spark's components, and various configuration parameters.
