
Data Engineering with Scala and Spark

By Eric Tome, Rupam Bhattacharjee, David Radford
About this book
Most data engineers know that performance problems in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering work. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.

This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You’ll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments, using data engineering best practices, test-driven development, and CI/CD. You’ll also get to grips with the DataFrame, Dataset, and Spark SQL APIs, cover data profiling and quality in Scala, and learn techniques for orchestrating and performance-tuning your end-to-end pipelines to deliver data to your end users.

By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.
Publication date: January 2024
Publisher: Packt
Pages: 300
ISBN: 9781804612583

 

Scala Essentials for Data Engineers

Welcome to the world of data engineering with Scala. But why Scala? The following are some of the reasons for learning Scala:

  • Scala provides type safety
  • Big corporations such as Netflix and Airbnb have a lot of data pipelines written in Scala
  • Scala is native to Spark
  • Scala allows data engineers to adopt a software engineering mindset

Scala is a high-level general-purpose programming language that runs on a standard Java platform. It was created by Martin Odersky in 2001. The name Scala stands for scalable language, and it provides excellent support for both object-oriented and functional programming styles.

This chapter is meant as a quick introduction to concepts that the subsequent chapters build upon. Specifically, this chapter covers the following topics:

  • Understanding functional programming
  • Understanding objects, classes, and traits
  • Higher-order functions (HOFs)
  • Examples of HOFs from the Scala collection library
  • Understanding polymorphic functions
  • Variance
  • Option types
  • Collections
  • Pattern matching
  • Implicits in Scala
 

Technical requirements

This chapter is long and contains lots of examples to explain the concepts that are introduced. All of the examples are self-contained, and we encourage you to try them yourself as you move through the chapter. You will need a working Scala environment to run these examples.

You can choose to configure it by following the steps outlined in Chapter 2 or use an online Scala playground such as Scastie (https://scastie.scala-lang.org/). We will use Scala 2.12 as the language version.

 

Understanding functional programming

Functional programming is based on the principle that programs are constructed using only pure functions. A pure function has no side effects and only returns a result. Some examples of side effects are modifying a variable, modifying a data structure in place, and performing I/O. We can think of a pure function as behaving just like a regular algebraic function.

An example of a pure function is the length function on a string object. It only returns the length of the string and does nothing else, such as mutating a variable. Similarly, an integer addition function that takes two integers and returns an integer is a pure function.
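
To make the contrast concrete, here is a minimal sketch (the add and addAndLog names are ours, not from the book): the first function is pure, while the second mutates a variable and performs I/O in addition to returning a value:

// pure: the result depends only on the inputs, and nothing else changes
def add(x: Int, y: Int): Int = x + y

// impure: it mutates external state and performs I/O before returning a value
var callCount = 0
def addAndLog(x: Int, y: Int): Int = {
  callCount += 1                            // side effect: modifies a variable
  println(s"add called $callCount times")   // side effect: I/O
  x + y
}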

Two important aspects of functional programming are referential transparency (RT) and the substitution model. An expression is referentially transparent if all of its occurrences can be substituted by the result of the expression without altering the meaning of the program.

In the following example, Example 1.1, we set x and then use it to set r1 and r2, both of which have the same value:

scala> val x: String = "hello"
x: String = hello
scala> val r1 = x + " world!"
r1: String = hello world!
scala> val r2 = x + " world!"
r2: String = hello world!

Example 1.1

Now, if we replace x with the expression referenced by x, r1 and r2 will be the same. In other words, the expression hello is referentially transparent.

Example 1.2 shows the output from a Scala interpreter:

scala> val r1 = "hello" + " world!"
r1: String = hello world!
scala> val r2 = "hello" + " world!"
r2: String = hello world!

Example 1.2

Let’s now look at the following example, Example 1.3, where x is an instance of StringBuilder instead of String:

scala> val x = new StringBuilder("who")
x: StringBuilder = who
scala> val y = x.append(" am i?")
y: StringBuilder = who am i?
scala> val r1 = y.toString
r1: String = who am i?
scala> val r2 = y.toString
r2: String = who am i?

Example 1.3

If we substitute y with the expression it refers to (val y = x.append(" am i?")), r1 and r2 will no longer be equal:

scala> val x = new StringBuilder("who")
x: StringBuilder = who
scala> val r1 = x.append(" am i?").toString
r1: String = who am i?
scala> val r2 = x.append(" am i?").toString
r2: String = who am i? am i?

Example 1.4

So, the expression x.append(" am i?") is not referentially transparent.

One of the advantages of the functional programming style is it allows you to apply local reasoning without having to worry about whether it updates any globally accessible mutable state. Also, since no variable in the global scope is updated, it considerably simplifies building a multi-threaded application.

Another advantage is pure functions are also easier to test as they do not depend on any state apart from the inputs supplied, and they generate the same output for the same input values.

We won’t delve deep into functional programming as it is outside of the scope of this book. Please refer to the Further reading section for additional material on functional programming. In the rest of this chapter, we will provide a high-level tour of some of the important language features that the subsequent chapters build upon.

In this section, we looked at a very high-level introduction to functional programming. Starting with the next section, we will look at Scala language features that enable both functional and object-oriented programming styles.

 

Understanding objects, classes, and traits

In this section, we are going to look at classes, traits, and objects. If you have used Java before, then some of the topics covered in this section will look familiar. However, there are several differences too. For example, Scala provides singleton objects, which automatically create a class and a single instance of that class in one go. Another example is Scala has case classes, which provide great support for pattern matching, allow you to create instances without the new keyword, and provide a default toString implementation that is quite handy when printing to the console.

We will first look at classes, followed by objects, and then wrap this section up with a quick tour of traits.

Classes

A class is a blueprint for objects, which are instances of that class. For example, we can create a Point class using the following code:

class Point(val x: Int, val y: Int) {
  def add(that: Point): Point = new Point(x + that.x, y + that.y)
  override def toString: String = s"($x, $y)"
}

Example 1.5

The Point class has four members—two immutable variables, x and y, as well as two methods, add and toString. We can create instances of the Point class as follows:

scala> val p1 = new Point(1,1)
p1: Point = (1, 1)
scala> val p2 = new Point(2,3)
p2: Point = (2, 3)

Example 1.6

We can then create a new instance, p3, by adding p1 and p2, as follows:

scala> val p3 = p1 add p2
p3: Point = (3, 4)

Example 1.7

Scala supports the infix notation, characterized by the placement of operators between operands, and automatically converts p1 add p2 to p1.add(p2). Another way to define the Point class is using a case class, as shown here:

case class Point(x: Int, y: Int) {
  def add(that: Point): Point = new Point(x + that.x, y + that.y)
}

Example 1.8

A case class automatically adds a factory method with the name of the class, which enables us to leave out the new keyword when creating an instance. A factory method is used to create instances of a class without requiring us to explicitly call the constructor method. Refer to the following example:

scala> val p1 = Point(1,1)
p1: Point = Point(1,1)
scala> val p2 = Point(2,3)
p2: Point = Point(2,3)

Example 1.9

The compiler also adds default implementations of various methods such as toString and hashCode, which the regular class definition lacks. So, we did not have to override the toString method, as was done earlier, and yet both p1 and p2 were printed neatly on the console (Example 1.9).

All arguments in the parameter list of a case class automatically get a val prefix, which makes them parametric fields. A parametric field is a shorthand that defines a parameter and a field with the same name.

To better understand the difference, let’s look at the following example:

scala> case class Point1(x: Int, y: Int) //x and y are parametric fields
defined class Point1
scala> class Point2(x: Int, y: Int) //x and y are regular parameters
defined class Point2
scala> val p1 = Point1(1, 2)
p1: Point1 = Point1(1,2)
scala> val p2 = new Point2(3, 4)
p2: Point2 = Point2@203ced18

Example 1.10

If we now try to access p1.x, it will work because x is a parametric field, whereas trying to access p2.x will result in an error. Example 1.11 illustrates this:

scala> println(p1.x)
1
scala> println(p2.x)
<console>:13: error: value x is not a member of Point2
       println(p2.x)
                  ^

Example 1.11

Trying to access p2.x will result in a compile error, value x is not a member of Point2. Case classes also have excellent support for pattern matching, as we will see in the Understanding pattern matching section.

Scala also provides an abstract class, which, unlike a regular class, can contain abstract methods. For example, we can define the following hierarchy:

abstract class Animal
abstract class Pet extends Animal {
  def name: String
}
class Dog(val name: String) extends Pet {
  override def toString = s"Dog($name)"
}
scala> val pluto = new Dog("Pluto")
pluto: Dog = Dog(Pluto)

Example 1.12

Animal is the base class. Pet extends Animal and declares an abstract method, name. Dog extends Pet and uses a parametric field, name (it is both a parameter as well as a field). Because Scala uses the same namespace for fields and methods, this allows the field name in the Dog class to provide a concrete implementation of the abstract method name in Pet.
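
As a quick REPL check (the res number will depend on your session), the name field on pluto satisfies the abstract name method declared in Pet:

scala> pluto.name
res1: String = Pluto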

Object

Unlike Java, Scala does not support static members in classes; instead, it has singleton objects. A singleton object is defined using the object keyword, as shown here:

class Point(val x: Int, val y: Int) {
  // new keyword is not required to create a Point object
  // apply method from companion object is invoked
  def add(that: Point): Point = Point(x + that.x, y + that.y)
  override def toString: String = s"($x, $y)"
}
object Point {
  def apply(x: Int, y: Int) = new Point(x, y)
}

Example 1.13

In this example, the Point singleton object shares the same name with the class and is called that class’s companion object. The class is called the companion class of the singleton object. For an object to qualify as a companion object of a given class, it needs to be in the same source file as the class itself.

Please note that the add method does not use the new keyword on the right-hand side. Point(x1, y1) is de-sugared into Point.apply(x1, y1), which returns a Point instance.
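
As a quick illustration (assuming the Point class and its companion object above are in scope), creating and adding points now works without the new keyword:

scala> val p = Point(1, 2)   // de-sugared into Point.apply(1, 2)
p: Point = (1, 2)
scala> val q = p add Point(3, 4)
q: Point = (4, 6)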

Singleton objects are also used to write an entrypoint for Scala applications. One option is to provide an explicit main method within the singleton object, as shown here:

object SampleScalaApplication {
  def main(args: Array[String]): Unit = {
    println(s"This is a sample Scala application")
  }
}

Example 1.14

The other option is to extend the App trait, which provides a main method implementation. We will cover traits in the next section. You can also refer to the Further reading section (the third point) for more information:

object SampleScalaApplication extends App {
  println(s"This is a sample Scala application")
}

Example 1.15

Trait

Scala also has traits, which are used to define rich interfaces as well as stackable modifications. You can read more about stackable modifications in the Further reading section (the fourth point). Unlike class inheritance, where each class inherits from just one superclass, a class can mix in any number of traits. A trait can have abstract as well as concrete members. Here is a simplified example of the Ordered trait from the Scala standard library:

trait Ordered[T] {
  // compares receiver (this) with argument of the same type
  def compare(that: T): Int
  def <(that: T): Boolean = (this compare that) < 0
  def >(that: T): Boolean = (this compare that) > 0
  def <=(that: T): Boolean = (this compare that) <= 0
  def >=(that: T): Boolean = (this compare that) >= 0
}

Example 1.16

The Ordered trait takes a type parameter, T, and has an abstract method, compare. All of the other methods are defined in terms of that method. A class can add the functionalities defined by <, >, and so on, just by defining the compare method. The compare method should return a negative integer if the receiver is less than the argument, positive if the receiver is greater than the argument, and 0 if both objects are the same.

Going back to our Point example, we can define a rule to say that a point, p1, is greater than p2 if the distance of p1 from the origin is greater than that of p2:

case class Point(x: Int, y: Int) extends Ordered[Point] {
  def add(that: Point): Point = new Point(x + that.x, y + that.y)
  // compare squared distances from the origin; this preserves the ordering and avoids floating-point arithmetic
  def compare(that: Point): Int = (x * x + y * y) - (that.x * that.x + that.y * that.y)
}

Example 1.17

With the definition of compare now in place, we can perform a comparison between two arbitrary points, as follows:

scala> val p1 = Point(1,1)
p1: Point = Point(1,1)
scala> val p2 = Point(2,2)
p2: Point = Point(2,2)
scala> println(s"p1 is greater than p2: ${p1 > p2}")
p1 is greater than p2: false
Example 1.18

In this section, we looked at objects, classes, and traits. In the next section, we are going to look at HOFs.

 

Working with higher-order functions (HOFs)

In Scala, functions are first-class citizens, which means function values can be assigned to variables, passed to functions as arguments, or returned by a function as a value. HOFs take one or more functions as arguments or return a function as a value.

A method can also be passed as an argument to an HOF because the Scala compiler will coerce a method into a function of the required type. For example, let’s define a function literal and a method, both of which take a pair of integers, perform an operation, and then return an integer:

//function literal
val add: (Int, Int) => Int = (x, y) => x + y
//a method
def multiply(x: Int, y: Int): Int = x * y

Example 1.19

Let’s now define a method that takes two integer arguments and performs an operation, op, on them:

def op(x: Int, y: Int) (f: (Int, Int) => Int): Int = f(x,y)

Example 1.20

We can pass any function (or method) of type (Int, Int) => Int to op, as the following example illustrates:

scala> op(1,2)(add)
res15: Int = 3
scala> op(2,3)(multiply)
res16: Int = 6

Example 1.21

This ability to pass functions as parameters is extremely powerful as it allows us to write generic code that can execute arbitrary user-supplied functions. In fact, many of the methods defined in the Scala collection library require functions as arguments, as we will see in the next section.

Examples of HOFs from the Scala collection library

Scala collections provide transformers that take a base collection, run some transformations over each of the collection’s elements, and return a new collection. For example, we can transform a list of integers by doubling each of its elements using the map method, which we will cover in a bit:

scala> List(1,2,3,4).map(_ * 2)
res17: List[Int] = List(2, 4, 6, 8)

Example 1.22

The Traversable trait, which is the base trait for all Scala collections, implements behaviors common to all collections in terms of a foreach method with the following signature:

def foreach[U](f: A => U): Unit

Example 1.23

The argument f is a function of type A => U, which is shorthand for Function1[A,U], and thus foreach is an HOF. This is an abstract method that needs to be implemented by all classes that mix in Traversable. The return type is Unit, which means this method does not return any meaningful value and is primarily used for side effects.

Here is an example that prints the elements of a List:

scala> /** let's start with a foreach call that prints the numbers in a list
     |   * List(1,2,3,4).foreach((i: Int) => println(i))
     |   * we can skip the type annotation and let Scala infer it
     |   * List(1,2,3,4).foreach( i => println(i))
     |   * Scala provides a shorthand to replace arguments using _
     |   * if the arguments are used only once on the right side
     |   * List(1,2,3,4).foreach(println(_))
     |   * finally, Scala allows us to leave out the argument altogether
     |   * if there is only one argument used on the right side
     |   */
     | List(1,2,3,4).foreach(println)
1
2
3
4

Example 1.24

For the rest of the examples, we will continue to use the List collection type, but these methods are also available for other collection types, such as Array, Map, and Set.

map is similar to foreach, but instead of returning a unit, it returns a collection by applying the function f to each element of the base collection. Here is the signature for List[A]:

final def map[B](f: (A) ⇒ B): List[B]

Example 1.25

Using the list from the previous example, if we want to double each of the elements in the list, but return a list of Doubles instead of Ints, it can be achieved by using the following:

scala> List(1,2,3,4).map(_ * 2.0)
res22: List[Double] = List(2.0, 4.0, 6.0, 8.0)

Example 1.26

The preceding expression returns a list of Double and can be chained with foreach to print the values contained in the list:

scala> List(1,2,3,4).map(_ * 2.0).foreach(println)
2.0
4.0
6.0
8.0

Example 1.27

A close cousin of map is flatMap, which combines two operations—map and flatten. Before looking into flatMap, let’s look at flatten:

//converts a list of traversable collections into a list
//formed by the elements of the traversable collections
def flatten[B]: List[B]

Example 1.28

As the name suggests, it flattens the inner collections:

scala> List(Set(1,2,3), Set(4,5,6)).flatten
res24: List[Int] = List(1, 2, 3, 4, 5, 6)

Example 1.29

Now that we have seen what flatten does, let’s go back to flatMap.

Let’s say that for each element of List(1,2,3,4), we want to create a List of elements from 0 to that number (both inclusive) and then combine all of those individual lists into a single list. Our first pass at it would look like the following:

scala> List(1,2,3,4).map(0 to _).flatten
res25: List[Int] = List(0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4)

Example 1.30

With flatMap, we can achieve the same result in one step:

scala> List(1,2,3,4).flatMap(0 to _)
res26: List[Int] = List(0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4)

Example 1.31

Scala collections also provide filter, which accepts a predicate function (one returning a Boolean) and uses it to select elements of a given collection:

def filter(p: (A) ⇒ Boolean): List[A]

Example 1.32

For example, to select all of the even integers from a List of numbers from 1 to 100, try the following:

scala> List.tabulate(100)(_ + 1).filter(_ % 2 == 0)
res27: List[Int] = List(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100)

Example 1.33

There is also withFilter, which provides performance benefits over filter through the lazy evaluation of intermediate collections. It is part of the TraversableLike trait, with the FilterMonadic trait providing the abstract definition:

trait FilterMonadic[+A, +Repr] extends Any {
  //includes map, flatMap and foreach but are skipped here
  def withFilter(p: A => Boolean): FilterMonadic[A, Repr]
}

Example 1.34

TraversableLike defines the withFilter method through a member class, WithFilter, that extends FilterMonadic:

def withFilter(p: A => Boolean): FilterMonadic[A, Repr] = new WithFilter(p)
class WithFilter(p: A => Boolean) extends FilterMonadic[A, Repr] {
  // implementation of map, flatMap and foreach skipped here
  def withFilter(q: A => Boolean): WithFilter = new WithFilter(x =>
  p(x) && q(x)
  )
}

Example 1.35

Please note that withFilter returns an object of type FilterMonadic, which only has map, flatMap, foreach, and withFilter. These are the only methods that can be chained after a call to withFilter. For example, the following will not compile:

List.tabulate(50)(_ + 1).withFilter(_ % 2 == 0).forall(_ % 2 == 0)

Example 1.36
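
In contrast, chaining one of the supported methods works as expected. Here is a small sketch (the res number is illustrative):

scala> List(1,2,3,4).withFilter(_ % 2 == 0).map(_ * 10)
res28: List[Int] = List(20, 40)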

It is quite common to have a sequence of flatMap, filter, and map calls chained together, and Scala provides syntactic sugar to support this through for comprehensions. To see it in action, let’s consider the following Person class and its instances:

case class Person(firstName: String, isFemale: Boolean, children: Person*)
val bob = Person("Bob", false)
val jennette = Person("Jennette", true)
val laura = Person("Laura", true)
val jean = Person("Jean", true, bob, laura)
val persons = List(bob, jennette, laura, jean)

Example 1.37

Person* represents a variable argument of type Person. A variable argument of type T needs to be the last argument in a class definition or method signature and accepts zero, one, or more instances of type T.
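
Here is a minimal sketch of a variable-argument method (the sum method is ours, not from the book): inside the body, the variable argument is available as a Seq, and an existing sequence can be expanded into variable arguments with : _*:

// a hypothetical varargs method; xs is available as a Seq[Int] inside the body
def sum(xs: Int*): Int = xs.foldLeft(0)(_ + _)

sum()                    // returns 0
sum(1, 2, 3)             // returns 6
sum(Seq(4, 5, 6): _*)    // an existing sequence expanded with : _*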

Now say we want to get pairs of mother and child, which would be (Jean, Bob) and (Jean, Laura). Using flatMap, filter, and map we can write it as follows:

scala> persons.filter(_.isFemale).flatMap(p => p.children.map(c => (p.firstName, c.firstName)))
res32: List[(String, String)] = List((Jean,Bob), (Jean,Laura))

Example 1.38

The preceding expression does its job, but it is not easy to see what is happening at a glance. This is where for comprehensions come to the rescue:

scala> for {
     |   p <- persons
     |   if p.isFemale
     |   c <- p.children
     | } yield (p.firstName, c.firstName)
res33: List[(String, String)] = List((Jean,Bob), (Jean,Laura))

Example 1.39

It is much easier to understand what this snippet of code does. Behind the scenes, the Scala compiler will convert this expression into the first one, the only difference being that filter is replaced with withFilter.
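
For reference, a rough sketch of what the compiler generates for this for comprehension looks like the following (it mirrors Example 1.38, with withFilter in place of filter):

persons
  .withFilter(p => p.isFemale)
  .flatMap(p => p.children.map(c => (p.firstName, c.firstName)))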

Scala also provides methods to combine the elements of a collection using the fold and reduce families of functions. The primary difference between the two can be understood by comparing the signatures of foldLeft and reduceLeft:

def foldLeft[B](z: B)(op: (B, A) ⇒ B): B
def reduceLeft[A1 >: A](op: (A1, A1) ⇒ A1): A1

Example 1.40

Both of these methods take a binary operator to combine the elements from left to right. However, foldLeft takes a zero element (an initial accumulator value), z, of type B (this value is returned if the List is empty), and the output type can differ from the type of the elements in the List. On the other hand, reduceLeft requires A1 to be a supertype of A (>: signifies a lower bound). So, we can sum up a List[Int] and return the value as a Double using foldLeft, as follows:

scala> List(1,2,3,4).foldLeft[Double](0) ( _ + _ )
res34: Double = 10.0

Example 1.41

We cannot do the same with reduceLeft (since Double is not a supertype of Int). Trying to do so will raise a compile-time error of type arguments [Double] do not conform to method reduce's type parameter bounds [A1 >: Int]:

scala> List(1,2,3,4).reduce[Double] ( _ + _ )
<console>:12: error: type arguments [Double] do not conform to method reduce's type parameter bounds [A1 >: Int]
       List(1,2,3,4).reduce[Double] ( _ + _ )
                           ^

Example 1.42

foldRight and reduceRight combine the elements of a collection from right to left. There are also fold and reduce, for which the order in which the elements are combined is unspecified and may be nondeterministic.
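
As a quick illustration of the direction of combination (the res numbers will depend on your session), note how foldLeft and foldRight associate the elements differently when the operator is not associative, whereas summing with reduce is safe because addition is associative:

scala> List(1,2,3,4).foldRight(0)(_ - _)   // 1 - (2 - (3 - (4 - 0)))
res35: Int = -2
scala> List(1,2,3,4).foldLeft(0)(_ - _)    // (((0 - 1) - 2) - 3) - 4
res36: Int = -10
scala> List(1,2,3,4).reduce(_ + _)
res37: Int = 10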

In this section, we have seen several examples of HOFs from the Scala collection library. By now, you should have noticed that each of these functions uses type parameters. These are called polymorphic functions, which is what we will cover next.

 

Understanding polymorphic functions

A function that works with multiple types of input arguments or can return a value of different types is called a polymorphic function. When writing a polymorphic function, we provide a comma-separated list of type parameters surrounded by square brackets after the name of the function. For example, we can write a function that returns the index of the first element of a List that satisfies a given predicate:

scala> def findFirstIn[A](as: List[A], p: A => Boolean): Option[Int] =
     |   as.zipWithIndex.collect { case (e, i) if p(e) => i }.headOption
findFirstIn: [A](as: List[A], p: A => Boolean)Option[Int]
Example 1.43

This function will work for any type of list: List[Int], List[String], and so on. For example, we can search for the index of element 5 in a list of integers from 1 to 20:

scala> import scala.util.Random
import scala.util.Random
scala> val ints = Random.shuffle((1 to 20).toList)
ints: List[Int] = List(7, 9, 3, 8, 6, 13, 12, 18, 14, 15, 1, 11, 10, 16, 2, 5, 20, 17, 4, 19)
scala> findFirstIn[Int](ints, _ == 5)
res38: Option[Int] = Some(15)

Example 1.44
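
The same function works unchanged for a List[String] (assuming findFirstIn from Example 1.43 is still in scope; the res number is illustrative):

scala> findFirstIn[String](List("spark", "scala", "data"), _ == "scala")
res39: Option[Int] = Some(1)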

In the next section, we are going to look at another property of type parameters, called variance, which defines subtyping relationships between parameterized types.

 

Variance

As mentioned earlier, functions are first-class objects in Scala. Scala automatically converts function literals into objects of the FunctionN type (N = 0 to 22). For example, consider the following anonymous function:

val f: Int => Any = (x: Int) => x

Example 1.45

This function will be converted automatically to the following:

val f = new Function1[Int, Any] {def apply(x: Int) = x}

Example 1.46

Please note that the preceding syntax represents an object of an anonymous class that extends Function1[Int, Any] and implements its abstract apply method. In other words, it is equivalent to the following:

class AnonymousClass extends Function1[Int, Any] {
  def apply(x: Int): Any = x
}
val f = new AnonymousClass

Example 1.47

If we refer to the type signature of the Function1 trait, we would see the following:

Function1[-T1, +T2]

Example 1.48

T1 represents the argument type and T2 represents the return type. The type variance of T1 is contravariant and that of T2 is covariant. In general, covariance, denoted by +, means that if a class or trait is covariant in its type parameter T, that is, C[+T], then C[T1] and C[T2] follow the subtyping relationship between T1 and T2. For example, since Any is a supertype of Int, C[Any] will be a supertype of C[Int].

The order is reversed for contravariance. So, if we have C[-T], then C[Int] will be a supertype of C[Any].
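
A familiar illustration of covariance is Scala’s immutable List[+A] (a minimal sketch; the val names are ours): since Int is a subtype of Any, a List[Int] can be used wherever a List[Any] is expected, while the reverse assignment does not compile:

val ints: List[Int] = List(1, 2, 3)
val anys: List[Any] = ints          // compiles: List is covariant and Int <: Any
// val fails: List[Int] = anys      // does not compile: a List[Any] is not a List[Int]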

Since we have Function1[-T1, +T2], this means that the Function1[Int, Any] type will be a supertype of, say, Function1[Any, String].

To see it in action, let’s define a method that takes a function of type Int => Any and returns Unit:

def caller(op: Int => Any): Unit = List
  .tabulate(5)(i => i + 1)
  .foreach(i => print(s"${op(i)} "))

Example 1.49

Let’s now define two functions:

scala> val f1: Int => Any = (x: Int) => x
f1: Int => Any = $Lambda$9151/1234201645@34f561c8
scala> val f2 : Any => String = (x: Any) => x.toString
f2: Any => String = $Lambda$9152/1734317897@699fe6f6

Example 1.50

A function (or method) with a parameter of type T can be invoked with an argument that is either of type T or its subtype. And since Int => Any is a supertype of Any => String, we should be able to pass both of these functions as arguments. As can be seen, both of them indeed work:

scala> caller(f1)
1 2 3 4 5
scala> caller(f2)
1 2 3 4 5

Example 1.51

 

Option type

Scala’s Option type represents optional values. These values can take two forms: Some(x), where x is the actual value, or None, which represents a missing value. Many Scala collection library methods return a value of the Option[T] type. The following are a few examples:

scala> List(1, 2, 3, 4).headOption
res45: Option[Int] = Some(1)
scala> List(1, 2, 3, 4).lastOption
res46: Option[Int] = Some(4)
scala> List("hello,", "world").find(_ == "world")
res47: Option[String] = Some(world)
scala> Map(1 -> "a", 2 -> "b").get(3)
res48: Option[String] = None

Example 1.52

Option also has a rich API and provides many of the functions from the collection library API through an implicit conversion function, option2Iterable, in the companion object. The following are a few examples of methods supported by the Option type:

scala> Some("hello, world!").headOption
res49: Option[String] = Some(hello, world!)
scala> None.getOrElse("Empty")
res50: String = Empty
scala> Some("hello, world!").map(_.replace("!", ".."))
res51: Option[String] = Some(hello, world..)
scala> Some(List.tabulate(5)(_ + 1)).flatMap(_.headOption)
res52: Option[Int] = Some(1)

Example 1.53

Collections

Scala comes with a powerful collection library. Collections are classified into mutable and immutable collections. A mutable collection can be updated in place, whereas an immutable collection never changes. When we add, remove, or update elements of an immutable collection, a new collection is created and returned, keeping the old collection unchanged.

All collection classes are found in the scala.collection package or one of its subpackages: mutable, immutable, and generic. However, for most of our programming needs, we refer to collections in either the mutable or immutable package.

A collection in the scala.collection.immutable package is guaranteed to be immutable and will never change after it is created. So, we will not have to make any defensive copies of an immutable collection, since accessing a collection multiple times will always yield the same set of elements.

On the other hand, collections in the scala.collection.mutable package provide methods that can update a collection in place. Since these collections are mutable, we need to defend against any inadvertent updates by other parts of the code base.
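
The following sketch (the val names are ours) shows the difference: adding to an immutable Set returns a new collection, while the mutable variant is updated in place:

import scala.collection.mutable

val s1 = Set(1, 2, 3)          // scala.collection.immutable.Set by default
val s2 = s1 + 4                // returns a new Set; s1 still contains 1, 2, 3

val m = mutable.Set(1, 2, 3)
m += 4                         // updates m in place; no new collection is created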

By default, Scala picks immutable collections. This easy access is provided through the Predef object, which is implicitly imported into every Scala source file. Refer to the following example:

object Predef {
  type Set[A] = immutable.Set[A]
  type Map[A, +B] = immutable.Map[A, B]
  val Map = immutable.Map
  val Set = immutable.Set
  // ...
}

Example 1.54

The Traversable trait is the base trait for all of the collection types. This is followed by Iterable, which is divided into three subtypes: Seq, Set, and Map. Both Set and Map provide sorted and unsorted variants. Seq, on the other hand, has IndexedSeq and LinearSeq. There is quite a bit of similarity among all these classes. For instance, an instance of any collection can be created by the same uniform syntax, writing the collection class name followed by its elements:

Traversable(1, 2, 3)
Map("x" -> 24, "y" -> 25, "z" -> 26)
Set("red", "green", "blue")
SortedSet("hello", "world")
IndexedSeq(1.0, 2.0)
LinearSeq("a", "b", "c")

Example 1.55

The following is the hierarchy for scala.collection.immutable collections taken from the docs.scala-lang.org website.

Figure 1.1 – Scala collection hierarchy

The Scala collection library is very rich and has various collection types suited to specific programming needs. If you want to delve deep into the Scala collection library, please refer to the Further reading section (the fifth point).

In this section, we looked at the Scala collection hierarchy. In the next section, we will gain a high-level understanding of pattern matching.

 

Understanding pattern matching

Scala has excellent support for pattern matching. The most prominent use is the match expression, which takes the following form:

selector match { alternatives }

selector is the expression that the alternatives will be tried against. Each alternative starts with the case keyword and includes a pattern, an arrow symbol =>, and one or more expressions, which will be evaluated if the pattern matches. The patterns can be of various types, such as the following:

  • Wildcard patterns
  • Constant patterns
  • Variable patterns
  • Constructor patterns
  • Sequence patterns
  • Tuple patterns
  • Typed patterns

Before going through each of these pattern types, let’s define our own custom List:

trait List[+A]
case class Cons[+A](head: A, tail: List[A]) extends List[A]
case object Nil extends List[Nothing]
object List {
  def apply[A](as: A*): List[A] = if (as.isEmpty) Nil else Cons(as.head, apply(as.tail: _*))
}

Example 1.56

Wildcard patterns

The wildcard pattern (_) matches any object and is used as a default, catch-all alternative. Consider the following example:

scala> def emptyList[A](l: List[A]): Boolean = l match {
     |   case Nil => true
     |   case _   => false
     | }
emptyList: [A](l: List[A])Boolean
scala> emptyList(List(1, 2))
res8: Boolean = false

Example 1.57

A wildcard can also be used to ignore parts of an object that we do not care about. Refer to the following code:

scala> def threeElements[A](l: List[A]): Boolean = l match {
     |   case Cons(_, Cons(_, Cons(_, Nil))) => true
     |   case _                            => false
     | }
threeElements: [A](l: List[A])Boolean
scala> threeElements(List(true, false))
res11: Boolean = false
scala> threeElements(Nil)
res12: Boolean = false
scala> threeElements(List(1, 2, 3))
res13: Boolean = true
scala> threeElements(List("a", "b", "c", "d"))
res14: Boolean = false

Example 1.58

In the preceding example, the threeElements method checks whether a given list has exactly three elements. The values themselves are not needed and are thus discarded in the pattern match.

Constant patterns

A constant pattern matches only itself. Any literal can be used as a constant – 1, true, and "hi" are all constant patterns. Any val or singleton object can also be used as a constant. The emptyList method from the previous example uses Nil to check whether the list is empty.
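
Here is a minimal sketch (the describe method is ours, not from the book) that mixes literal constants with a singleton object used as a constant:

def describe(x: Any): String = x match {
  case 1    => "the number one"       // literal constant
  case true => "the boolean true"     // literal constant
  case "hi" => "a greeting"           // literal constant
  case Nil  => "an empty list"        // a singleton object used as a constant
  case _    => "something else"
}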

Variable patterns

Like a wildcard, a variable pattern matches any object, but it also binds the variable to the matched object. We can then use this variable to refer to the object:

scala> val ints = List(1, 2, 3, 4)
ints: List[Int] = Cons(1,Cons(2,Cons(3,Cons(4,Nil))))
scala> ints match {
     |   case Cons(_, Cons(_, Cons(_, Nil))) => println("A three element list")
     |   case l => println(s"$l is not a three element list")
     | }
Cons(1,Cons(2,Cons(3,Cons(4,Nil)))) is not a three element list

Example 1.59

In the preceding example, l is bound to the entire list, which then is printed to the console.

Constructor patterns

A constructor pattern looks like Cons(_, Cons(_, Cons(_, Nil))). It consists of the name of a case class (Cons), followed by a number of patterns in parentheses. These extra patterns can themselves be constructor patterns, and we can use them to check arbitrarily deep into an object. In this case, checks are performed at four levels.

Sequence patterns

Scala allows us to match against sequence types such as Seq, List, and Array among others. It looks similar to a constructor pattern. Refer to the following:

scala> def thirdElement[A](s: Seq[A]): Option[A] = s match {
     |   case Seq(_, _, a, _*) => Some(a)
     |   case _            => None
     | }
thirdElement: [A](s: Seq[A])Option[A]
scala> val intSeq = Seq(1, 2, 3, 4)
intSeq: Seq[Int] = List(1, 2, 3, 4)
scala> thirdElement(intSeq)
res16: Option[Int] = Some(3)
scala> thirdElement(Seq.empty[String])
res17: Option[String] = None

Example 1.60

As the example illustrates, thirdElement returns a value of type Option[A]. If a sequence has three or more elements, it will return the third element, whereas for any sequence with fewer than three elements, it will return None. Seq(_, _, a, _*) binds a to the third element if present. The _* pattern matches any number of remaining elements.

Tuple patterns

We can pattern match against tuples too:

scala> val tuple3 = (1, 2, 3)
tuple3: (Int, Int, Int) = (1,2,3)
scala> def printTuple(a: Any): Unit = a match {
     |   case (a, b, c) => println(s"Tuple has $a, $b, $c")
     |   case _     =>
     | }
printTuple: (a: Any)Unit
scala> printTuple(tuple3)
Tuple has 1, 2, 3

Example 1.61

Running the preceding program will print Tuple has 1, 2, 3 to the console.

Typed patterns

A typed pattern allows us to check types in the pattern match and can be used for type tests and type casts:

scala> def getLength(a: Any): Int =
     |   a match {
     |     case s: String    => s.length
     |     case l: List[_]   => l.length //this is List from Scala collection library
     |     case m: Map[_, _] => m.size
     |     case _            => -1
     |   }
getLength: (a: Any)Int
scala> getLength("hello, world")
res3: Int = 12
scala> getLength(List(1, 2, 3, 4))
res4: Int = 4
scala> getLength(Map.empty[Int, String])
res5: Int = 0

Example 1.62

Please note that the argument a, being of type Any, does not itself support methods such as length or size. In a typed pattern, Scala automatically applies a type test and a type cast, so the bound variable has the target type. For example, case s: String => s.length is equivalent to the following snippet:

if (a.isInstanceOf[String]) {
  val s = a.asInstanceOf[String]
  s.length
}

Example 1.63

One important thing to note, though, is that Scala does not retain type arguments at runtime. So, there is no way to check whether a list contains only integer elements. For example, the following will print A list of String to the console. The compiler will emit a warning to alert us to this runtime behavior. Arrays are the only exception because the element type is stored with the array value:

scala> List.fill(5)(0) match {
     |   case _: List[String] => println("A list of String")
     |   case _           =>
     | }
<console>:13: warning: fruitless type test: a value of type List[Int] cannot also be a List[String] (the underlying of List[String]) (but still might match its erasure)
         case _: List[String] => println("A list of String")
                 ^
A list of String

Example 1.64

 

Implicits in Scala

Scala provides implicit conversions and parameters. Implicit conversion to an expected type is the first place the compiler uses implicits. For example, the following works:

scala> val d: Double = 2
d: Double = 2.0

Example 1.65

This works because of the following implicit method definition in the Int companion object (it was part of Predef prior to 2.10.x):

implicit def int2double(x: Int): Double = x.toDouble

Example 1.66

Another application of implicit conversions is converting the receiver of a method call. For example, let’s define a Rational class:

scala> class Rational(n: Int, d: Int) extends Ordered[Rational] {
     |
     |   require(d != 0)
     |   private val g = gcd(n.abs, d.abs)
     |   private def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)
     |   val numer = n / g
     |   val denom = d / g
     |   def this(n: Int) = this(n, 1)
     |   def +(that: Rational) = new Rational(
     |     this.numer * that.denom + that.numer * this.denom,
     |     this.denom * that.denom
     |   )
     |   def compare(that: Rational) = this.numer * that.denom - that.numer * this.denom
     |   override def toString = if (denom == 1) numer.toString else s"$numer/$denom"
     | }
defined class Rational

Example 1.67

Then declare a variable of the Rational type:

scala> val r1 = new Rational(1)
r1: Rational = 1
scala> 1 + r1
<console>:14: error: overloaded method value + with alternatives:
  (x: Double)Double <and>
  (x: Float)Float <and>
  (x: Long)Long <and>
  (x: Int)Int <and>
  (x: Char)Int <and>
  (x: Short)Int <and>
  (x: Byte)Int <and>
  (x: String)String
cannot be applied to (Rational)
       1 + r1
         ^

Example 1.68

If we try to add r1 to 1, we will get a compile-time error. The reason is that none of the overloaded + methods on Int accepts an argument of type Rational. In order to make it work, we can create an implicit conversion from Int to Rational:

scala> implicit def intToRational(n: Int): Rational = new Rational(n)
intToRational: (n: Int)Rational
scala> val r1 = new Rational(1)
r1: Rational = 1
scala> 1 + r1
res11: Rational = 2

Example 1.69

 

Summary

This was a long chapter and we covered a lot of topics. We started this chapter with a brief introduction to functional programming, looked at why it is useful, and reviewed examples of RT. We then looked at various language features and constructs, starting with classes, objects, and traits. We looked at HOFs, which are one of the fundamental building blocks of functional programming. We looked at polymorphic functions and saw how they enable us to write reusable code. Then, we looked at variance, which defines subtyping relationships between objects, took a detailed tour of pattern matching, and finally, ended with implicit conversion, which is a powerful language feature used in design patterns such as type classes.

In the next chapter, we are going to focus on setting up the environment, which will allow you to follow along with the rest of the chapters.

 

Further reading

About the Authors
  • Eric Tome

    Eric Tome has over 25 years of experience working with data. He has contributed to and led teams that ingested, cleansed, standardized, and prepared data used by business intelligence, data science, and operations teams. He has a background in mathematics and currently works as a senior solutions architect at Databricks, helping customers solve their data and AI challenges.

  • Rupam Bhattacharjee

    Rupam Bhattacharjee works as a lead data engineer at IBM. He has architected and developed data pipelines, processing massive structured and unstructured data using Spark and Scala for on-premises Hadoop and K8s clusters on the public cloud. He has a degree in electrical engineering.

  • David Radford

    David Radford has worked in big data for over 10 years, with a focus on cloud technologies. He led consulting teams for several years, completing a migration from legacy systems to modern data stacks. He holds a master's degree in computer science and works as a senior solutions architect at Databricks.
