Chapter 2. Revisiting Java

This chapter serves as a refresher course on Java. In this chapter, we will discuss some concepts of Java that will be useful when creating applications in Apache Spark.

This book assumes that the reader is comfortable with the basics of Java. We will discuss useful Java concepts, focusing mainly on what is new in Java 8. In particular, we will touch upon topics such as:

  • Generics
  • Interfaces
  • Lambda expressions
  • Streams

Why use Java for Spark?


With the rise of multi-core CPUs, Java could not keep up with the design changes needed to utilize that extra available power, because of the complexity surrounding concurrency and immutability. We will discuss this in detail later. First, let's understand the importance and usability of Java in the Hadoop ecosystem. As MapReduce was gaining popularity, Google introduced a framework called FlumeJava that helped in pipelining multiple MapReduce jobs. FlumeJava consists of immutable parallel collections capable of performing lazily evaluated, optimized chained operations. That might sound eerily similar to what Apache Spark does, but even before Apache Spark and FlumeJava, there was Cascading, which built an abstraction over MapReduce to simplify the way MapReduce tasks are developed, tested, and run. All these frameworks were mainly Java implementations to simplify MapReduce pipelines, among other things.

These abstractions were simple in fact...

Generics


Generics were introduced in Java 1.5. Generics help the user create general-purpose code that has an abstract type in its definition. That abstract type can be replaced with any concrete type in the implementation.

For example, the List interface or its implementations, such as ArrayList, LinkedList, and so on, are defined with a generic type. Users can provide a concrete type, such as Integer, Long, or String, when instantiating the list:

List<Integer> list1 = new ArrayList<Integer>(); 
List<String> list2 = new ArrayList<String>(); 

Here, list1 is a list of integers and list2 is a list of strings. Since Java 7, the compiler can infer the type, so the preceding code can also be written as follows:

List<Integer> list1 = new ArrayList<>(); 
List<String> list2 = new ArrayList<>(); 

Another huge benefit of generics is that they bring compile-time safety. Let's create a list without the use of generics:

List list = new ArrayList<>();...
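
As a minimal self-contained sketch of this compile-time safety (the class and variable names here are illustrative), compare a raw list with a generic one:

import java.util.ArrayList; 
import java.util.List; 
 
public class GenericsSafety { 
   public static void main(String[] args) { 
      List rawList = new ArrayList(); 
      rawList.add(1); 
      rawList.add("two"); // compiles, but silently mixes types 
      // Integer i = (Integer) rawList.get(1); // fails at runtime with ClassCastException 
 
      List<Integer> typedList = new ArrayList<>(); 
      typedList.add(1); 
      // typedList.add("two"); // rejected by the compiler 
      Integer i = typedList.get(0); // no cast needed 
   } 
} 

With generics, the type mismatch is caught at compile time instead of surfacing as a runtime exception.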

Interfaces


Interfaces are reference types in Java. They are used to define contracts among classes. Any class that implements an interface has to adhere to the contract that the interface defines.

For example, here is a Car interface that consists of three abstract methods:

public interface Car { 
   void shape(); 
   void price(); 
   void color(); 
} 

Any class that implements this interface has to implement all the abstract methods of the interface, unless it is an abstract class. Interfaces can only be implemented by classes or extended by other interfaces; they cannot be instantiated.

Prior to Java 8, interfaces consisted only of abstract methods and final variables. In Java 8, interfaces may contain default and static methods as well.

Static method in an interface

A static method in an interface is similar to a static method in a class. Users cannot override static methods. So even if a class implements an interface, it cannot override the interface's static methods.

Like a static...
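
To illustrate, here is a minimal sketch (the Vehicle interface and its method names are purely illustrative) showing a static method alongside a default method in a Java 8 interface:

public interface Vehicle { 
   // Static method: belongs to the interface itself, invoked as Vehicle.category() 
   static String category() { 
      return "vehicle"; 
   } 
 
   // Default method: inherited by implementing classes, which may override it 
   default String description() { 
      return "a generic " + category(); 
   } 
} 
 
class Bike implements Vehicle { 
   // No need to implement description(); the default applies. 
} 

A call such as Vehicle.category() resolves on the interface itself, while new Bike().description() picks up the inherited default implementation.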

Lambda expressions


Lambda expressions are a brand new feature of Java. They were introduced in Java 8 and are a step towards facilitating functional programming in Java.

Lambda expressions let you define a method without declaring it, so you do not need a name for the method, a return type, and so on. Like anonymous inner classes, lambda expressions provide a way to pass behavior to functions; a lambda, however, is a much more concise way of writing the code.
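
For reference, the anonymous inner class version referenced below would look roughly like this (a sketch using java.io.FilenameFilter, whose accept(File, String) method is the contract being implemented; the class name MyFilter is illustrative):

import java.io.File; 
import java.io.FilenameFilter; 
 
public class MyFilter { 
   public static void main(String[] args) { 
      File dir = new File("src/main/java"); 
      // Anonymous inner class implementing FilenameFilter 
      dir.list(new FilenameFilter() { 
         @Override 
         public boolean accept(File dirname, String name) { 
            return name.endsWith("java"); 
         } 
      }); 
   } 
} 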

For example, the preceding anonymous inner class example can be converted to a lambda expression as follows:

import java.io.File; 
 
public class MyFilterImpl { 
   public static void main(String[] args) { 
      File dir = new File("src/main/java"); 
      dir.list((dirname, name) -> name.endsWith("java")); // Lambda expression 
   } 
} 

Note that the signature of the lambda expression exactly matches the signature of the accept method in the FilenameFilter interface.

Note

One of the huge differences between Lambda and anonymous inner classes...

Lexical scoping


Lexical scoping is also referred to as static scoping. Under lexical scoping, a variable is accessible in the scope in which it is defined, and that scope is determined at compile time.

Let us consider the following example:

public class LexicalScoping { 
   int a = 1; 
   // a has class level scope. So It will be available to be accessed 
   // throughout the class 
 
   public void sumAndPrint() { 
      int b = 1; 
      int c = a + b; 
      // b and c are local variables of method. These will be accessible 
      // inside the method only 
   } 
   // b and c are no longer accessible 
} 

Variable a will be available throughout the class (let's not consider the difference between static and non-static for now). However, variables b and c will be available inside the sumAndPrint method only.

Similarly, a variable defined inside a lambda expression is accessible only within that lambda. For example:

list.stream().map(n -> n*2 ); 

Here n is lexically scoped to...
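
As a minimal sketch of this scoping (the variable names are illustrative): n exists only inside the lambda body, while a variable from the enclosing scope can be read inside the lambda as long as it is effectively final:

import java.util.Arrays; 
import java.util.List; 
 
public class LambdaScope { 
   public static void main(String[] args) { 
      List<Integer> list = Arrays.asList(1, 2, 3); 
      int factor = 2; // effectively final, so the lambda may read it 
 
      list.stream() 
          .map(n -> n * factor) // n is visible only inside this lambda 
          .forEach(System.out::println); 
 
      // System.out.println(n); // would not compile: n is out of scope here 
   } 
} 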

Streams


A stream represents a collection of elements on which a chain of aggregate operations can be performed lazily. Streams have been optimized to cater to both sequential and parallel computation, keeping in mind the hardware capabilities of CPU cores. The Streams API was introduced in Java 8 to cater to the functional programming needs of developers. Streams are not Java collections; however, they can operate over collections by first converting them into streams. Some of the characteristics of streams that make them uniquely different from the Java collections API are:

  • Streams do not store elements. They only transfer values received from sources such as I/O channels, generating functions, and data structures (the collections API), and perform a set of pipelined computations on them.
  • Streams do not change the underlying data; they only process it and produce a new set of resultant data (see the sketch after this list). When a distinct() or sorted() method is called on a stream, the source data does not...
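
As a minimal sketch of this non-mutating behavior (the class and variable names are illustrative), distinct() and sorted() below produce new values while the source list is left untouched:

import java.util.Arrays; 
import java.util.List; 
import java.util.stream.Collectors; 
 
public class StreamNoMutation { 
   public static void main(String[] args) { 
      List<Integer> source = Arrays.asList(3, 1, 2, 3); 
 
      List<Integer> result = source.stream() 
            .distinct()                    // drops the duplicate 3 
            .sorted()                      // orders the remaining values 
            .collect(Collectors.toList()); // materializes a new list 
 
      System.out.println(source); // [3, 1, 2, 3] -- unchanged 
      System.out.println(result); // [1, 2, 3] 
   } 
} 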

Intermediate operations


Intermediate operations always return another stream and are evaluated lazily, only when a terminal operation is called. Lazy evaluation optimizes intermediate operations when multiple operations are chained together, as evaluation only takes place after the terminal operation. Another scenario where lazy evaluation tends to be useful is with infinite or large streams, where iterating over the entire stream may not be required or even possible, as with anyMatch(), findFirst(), and so on. In these scenarios, short circuiting (discussed in the next section) takes place and the terminal operation exits the flow as soon as the criteria are met, rather than iterating over all the elements.

Intermediate operations can be further subdivided into stateful and stateless operations. Stateful operations, such as sorted(), limit(), sequential(), and so on, preserve previously seen values, since they need them while processing the current record. For example...
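
As a minimal sketch of this laziness (the class and variable names are illustrative), the peek() call below runs only for the elements actually pulled through the pipeline by the short-circuiting terminal operation:

import java.util.Arrays; 
import java.util.List; 
import java.util.Optional; 
 
public class LazyStreams { 
   public static void main(String[] args) { 
      List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5); 
 
      // Nothing is evaluated until findFirst() is invoked; it then 
      // short-circuits after the first element that passes the filter. 
      Optional<Integer> first = numbers.stream() 
            .peek(n -> System.out.println("inspecting " + n)) // traces evaluation 
            .filter(n -> n % 2 == 0) 
            .findFirst(); 
 
      first.ifPresent(n -> System.out.println("found " + n)); 
      // Prints "inspecting 1", "inspecting 2", "found 2" -- 3, 4, 5 are never inspected. 
   } 
} 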

Terminal operations


Terminal operations act as the trigger point in a pipelined stream operation: they trigger execution. A terminal operation either returns void or a non-stream object, and once the pipelined operations have been executed on the stream, the stream becomes redundant. A terminal operation is always required for a pipelined stream operation to be executed.

Stream operations can be further classified as short-circuiting operations. An intermediate operation is said to be short-circuiting when an infinite input produces a finite stream, as in the case of the limit() method. Similarly, a terminal operation is short-circuiting when an infinite input may terminate in finite time, as is the case for methods such as anyMatch(), findAny(), and so on.
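
As a minimal sketch of both kinds of short circuiting (the class name is illustrative): Stream.iterate() below produces an infinite stream, limit() cuts it down to a finite one, and anyMatch() terminates in finite time:

import java.util.stream.Stream; 
 
public class ShortCircuiting { 
   public static void main(String[] args) { 
      // limit(): a short-circuiting intermediate operation that turns 
      // an infinite stream of even numbers into a finite one. 
      Stream.iterate(0, n -> n + 2) 
            .limit(5) 
            .forEach(System.out::println); // 0 2 4 6 8 
 
      // anyMatch(): a short-circuiting terminal operation that stops 
      // as soon as a match is found, even on an infinite stream. 
      boolean found = Stream.iterate(0, n -> n + 2) 
            .anyMatch(n -> n > 100); 
      System.out.println(found); // true 
   } 
} 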

Streams also support parallelism, but in the case of stateful operations, since each parallel operation can have its own state, a parallel stateful operation may have to undergo multiple passes to complete the...

Summary


This chapter dealt with various concepts of Java, particularly topics around generics and interfaces. We also discussed new features introduced in Java 8, such as lambda expressions and streams. The introduction of default methods in Java 8 was also covered in detail.

In the next chapter, we will discuss the installation of Spark and Scala. We will also discuss Scala's REPL and the Spark UI, along with Spark's REST API for working with Spark jobs, Spark's components, and various configuration parameters.
