Scala for Data Science

Product type: Book
Published in: Jan 2016
Reading level: Intermediate
ISBN-13: 9781785281372
Edition: 1st Edition
Author: Pascal Bugnion

Pascal Bugnion is a data engineer at the ASI, a consultancy offering bespoke data science services. Previously, he was the head of data engineering at SCL Elections. He holds a PhD in computational physics from Cambridge University. Besides Scala, Pascal is a keen Python developer. He has contributed to NumPy, matplotlib and IPython. He also maintains scikit-monaco, an open source library for Monte Carlo integration. He currently lives in London, UK.

Preface

Data science is fashionable. Data science startups are sprouting across the globe and established companies are scrambling to assemble data science teams. The ability to analyze large datasets is also becoming increasingly important in the academic and research world.

Why this explosion in demand for data scientists? We see the emergence of data science as the serendipitous confluence of several interlinked factors. The first is data availability. Over the last fifteen years, the amount of data collected by companies has exploded. In the world of research, cheap gene sequencing techniques have drastically increased the amount of genomic data available. Social and professional networking sites have built huge graphs interlinking a significant fraction of the people living on the planet. At the same time, the development of the World Wide Web makes accessing this wealth of data possible from almost anywhere in the world.

The increased availability of data has resulted in an increase in data awareness. It is no longer acceptable for decision makers to trust their experience and "gut feeling" alone. Increasingly, one expects business decisions to be driven by data.

Finally, the tools for efficiently making sense of and extracting insights from huge data sets are starting to mature: one doesn't need to be an expert in distributed computing to analyze a large data set any more. Apache Spark, for instance, greatly eases writing distributed data analysis applications. The explosion of cloud infrastructure facilitates scaling computing needs to cope with variable data amounts.

Scala is a popular language for data science. By emphasizing immutability and functional constructs, Scala lends itself well to the construction of robust libraries for concurrency and big data analysis. A rich ecosystem of tools for data science has therefore developed around Scala, including libraries for accessing SQL and NoSQL databases, frameworks for building distributed applications like Apache Spark and libraries for linear algebra and numerical algorithms. We will explore this rich and growing ecosystem in the fourteen chapters of this book.

What this book covers

We aim to give you a flavor for what is possible with Scala, and to get you started using libraries that are useful for building data science applications. We do not aim to provide an entirely comprehensive overview of any of these topics. This is best left to online documentation or to reference books. What we will teach you is how to combine these tools to build efficient, scalable programs, and have fun along the way.

Chapter 1, Scala and Data Science, is a brief description of data science, and of Scala's place in the data scientist's tool-belt. We describe why Scala is becoming increasingly popular in data science, and how it compares to alternative languages such as Python.

Chapter 2, Manipulating Data with Breeze, introduces Breeze, a library providing support for numerical algorithms in Scala. We learn how to perform linear algebra and optimization, and solve a simple machine learning problem using logistic regression.

Chapter 3, Plotting with breeze-viz, introduces the breeze-viz library for plotting two-dimensional graphs and histograms.

Chapter 4, Parallel Collections and Futures, describes basic concurrency constructs. We will learn to parallelize simple problems by distributing them over several threads using parallel collections, and apply what we have learned to build a parallel cross-validation pipeline. We then describe how to wrap computation in a future to execute it asynchronously. We apply this pattern to query a web API, sending several requests in parallel.

Chapter 5, Scala and SQL through JDBC, looks at interacting with SQL databases in a functional manner. We learn how to use common Scala patterns to wrap the Java interface exposed by JDBC. Besides learning about JDBC, this chapter introduces type classes, the loan pattern, implicit conversions, and other patterns that are frequently leveraged in libraries and existing Scala code.

Chapter 6, Slick - A Functional Interface for SQL, describes the Slick library for mapping data in SQL tables to Scala objects.

Chapter 7, Web APIs, describes how to query web APIs in a concurrent, fault-tolerant manner using futures. We learn to parse JSON responses and formulate complex HTTP requests with authentication. We walk through querying the GitHub API to obtain information about GitHub users programmatically.

Chapter 8, Scala and MongoDB, walks the reader through interacting with MongoDB, a leading NoSQL database. We build a pipeline that fetches user data from the GitHub API and stores it in a MongoDB database.

Chapter 9, Concurrency with Akka, introduces the Akka framework for building concurrent applications with actors. We use Akka to build a scalable crawler that explores the GitHub follower graph.

Chapter 10, Distributed Batch Processing with Spark, explores the Apache Spark framework for building distributed applications. We learn how to construct and manipulate distributed datasets in memory. We touch briefly on the internals of Spark, learning how the architecture allows for distributed, fault-tolerant computation.

Chapter 11, Spark SQL and DataFrames, describes DataFrames, one of the more powerful features of Spark for the manipulation of structured data. We learn how to load JSON and Parquet files into DataFrames.

Chapter 12, Distributed Machine Learning with MLlib, explores how to build distributed machine learning pipelines with MLlib, a library built on top of Apache Spark. We use the library to train a spam filter.

Chapter 13, Web APIs with Play, describes how to use the Play framework to build web APIs. We describe the architecture of modern web applications, and how these fit into the data science pipeline. We build a simple web API that returns JSON.

Chapter 14, Visualization with D3 and the Play Framework, builds on the previous chapter to program a fully fledged web application with Play and D3. We describe how to integrate JavaScript into a Play framework application.

Appendix, Pattern Matching and Extractors, describes how pattern matching provides the programmer with a powerful construct for control flow.

What you need for this book

The examples provided in this book require that you have a working Scala installation and SBT, the Simple Build Tool, a command line utility for compiling and running Scala code. We will walk you through how to install these in the next sections.

We do not require a specific IDE. The code examples can be written in your favorite text editor or IDE.

Installing the JDK

Scala code is compiled to Java byte code. To run the byte code, you must have the Java Virtual Machine (JVM) installed, which comes as part of a Java Development Kit (JDK). There are several JDK implementations and, for the purpose of this book, it does not matter which one you choose. You may already have a JDK installed on your computer. To check this, enter the following in a terminal:

$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

If you do not have a JDK installed, you will get an error stating that the java command does not exist.

If you do have a JDK installed, you should still verify that you are running a sufficiently recent version. The number that matters is the minor version number: the 8 in 1.8.0_66. Versions 1.8.xx of Java are commonly referred to as Java 8. For the first twelve chapters of this book, Java 7 will be sufficient (your version number should be something like 1.7.xx or newer). However, you will need Java 8 for the last two chapters, since the Play framework requires it. We therefore recommend that you install Java 8.
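
The same check can be made from Scala itself once SBT is set up (see Installing and using SBT). The following is a minimal sketch, not part of the book's example code: the object name JavaVersionCheck and the helper javaRelease are hypothetical. It assumes the version string follows either the legacy 1.x.y_z scheme shown above or the newer scheme in which the release number comes first (for example, 11.0.2).

```scala
// JavaVersionCheck.scala (hypothetical helper, for illustration only)

object JavaVersionCheck {

  /** Extracts the Java release number from a version string:
    * 8 from "1.8.0_66" (legacy scheme), 11 from "11.0.2" (newer scheme).
    */
  def javaRelease(version: String): Int = {
    val parts = version.split("[._-]")
    if (parts(0) == "1" && parts.length > 1) parts(1).toInt
    else parts(0).toInt
  }

  def main(args: Array[String]): Unit = {
    // The JVM exposes its own version through a standard system property.
    val version = System.getProperty("java.version")
    val release = javaRelease(version)
    println(s"Java version string: $version (Java $release)")
    if (release < 8)
      println("Java 8 is needed for the last two chapters.")
  }
}
```

The parsing is kept in its own function so that it can be tried on any version string, not just the one reported by the JVM that happens to be running.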

On Mac, the easiest way to install a JDK is using Homebrew:

$ brew install java

At the time of writing, this installs Java 8, specifically the Java Standard Edition Development Kit from Oracle.

Homebrew is a package manager for Mac OS X. If you are not familiar with Homebrew, I highly recommend using it to install development tools. You can find installation instructions for Homebrew at http://brew.sh.

To install a JDK on Windows, go to http://www.oracle.com/technetwork/java/javase/downloads/index.html (or, if this URL does not exist, to the Oracle website, then click on Downloads and download Java Platform, Standard Edition). Select Windows x86 for 32-bit Windows, or Windows x64 for 64-bit Windows. This will download an installer, which you can run to install the JDK.

To install a JDK on Ubuntu, install OpenJDK with the package manager for your distribution:

$ sudo apt-get install openjdk-8-jdk

If you are running a sufficiently old version of Ubuntu (14.04 or earlier), this package will not be available. In this case, either fall back to openjdk-7-jdk, which will let you run examples in the first twelve chapters, or install the Java Standard Edition Development Kit from Oracle through a PPA (a non-standard package archive):

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

You then need to tell Ubuntu to prefer Java 8 with:

$ sudo update-java-alternatives -s java-8-oracle

Installing and using SBT

The Simple Build Tool (SBT) is a command line tool for managing dependencies and building and running Scala code. It is the de facto build tool for Scala. To install SBT, follow the instructions on the SBT website (http://www.scala-sbt.org/0.13/tutorial/Setup.html).

When you start a new SBT project, SBT downloads a specific version of Scala for you. You, therefore, do not need to install Scala directly on your computer. Managing the entire dependency suite from SBT, including Scala itself, is powerful: you do not have to worry about developers working on the same project having different versions of Scala or of the libraries used.

Since we will use SBT extensively in this book, let's create a simple test project. If you have used SBT previously, do skip this section.

Create a new directory called sbt-example and navigate to it. Inside this directory, create a file called build.sbt. This file encodes all the dependencies for the project. Write the following in build.sbt:

// build.sbt

scalaVersion := "2.11.7"

This specifies which version of Scala we want to use for the project. Open a terminal in the sbt-example directory and type:

$ sbt

This starts an interactive shell. Let's open a Scala console:

> console

This gives you access to a Scala console in the context of your project:

scala> println("Scala is running!")
Scala is running!
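
To confirm that the console runs the Scala version requested in build.sbt, one option is to ask the standard library. This is a small sketch: the object name ScalaVersionCheck is hypothetical, while scala.util.Properties is part of the Scala standard library.

```scala
// Can be pasted into the SBT console as-is.
object ScalaVersionCheck {
  // Version of the Scala library on the classpath, as a dotted string.
  val runningVersion: String = scala.util.Properties.versionNumberString

  def main(args: Array[String]): Unit =
    println(s"Running Scala $runningVersion")
}
```

After pasting it, calling ScalaVersionCheck.main(Array()) should report the version set by scalaVersion.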

Besides running code in the console, we will also write Scala programs. Open an editor in the sbt-example directory and enter a basic "hello, world" program. Name the file HelloWorld.scala:

// HelloWorld.scala

object HelloWorld extends App {
  println("Hello, world!")
}

Return to SBT and type:

> run

This will compile the source files and run the executable, printing "Hello, world!".
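
The App trait used above is a shortcut: it turns the body of the object into the program's entry point. As a sketch of the more explicit alternative, the same program can define main itself. The object name HelloWorldMain and the greeting value are chosen here for illustration and do not appear in the book's code.

```scala
// HelloWorldMain.scala (illustrative alternative to HelloWorld.scala)

object HelloWorldMain {
  val greeting: String = "Hello, world!"

  // The JVM uses a method with this signature as the entry point.
  def main(args: Array[String]): Unit = {
    println(greeting)
  }
}
```

If both files are present in the project, SBT's run command asks which main class to start.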

Besides compiling and running your Scala code, SBT also manages Scala dependencies. Let's specify a dependency on Breeze, a library for numerical algorithms. Modify the build.sbt file as follows:

// build.sbt

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

SBT versions before 0.13.7 require that statements be separated by empty lines, so it is safest to leave an empty line between scalaVersion and libraryDependencies. In this example, we have specified a dependency on Breeze version "0.11.2". How did we know to use these coordinates for Breeze? Most Scala packages will quote the exact SBT string to get the latest version in their documentation.

If this is not the case, or you are specifying a dependency on a Java library, head to the Maven Central website (http://mvnrepository.com) and search for the package of interest, for example "Breeze". The website provides a list of packages, including several named breeze_2.xx packages. The number after the underscore indicates the version of Scala the package was compiled for. Click on "breeze_2.11" to get a list of the different Breeze versions available. Choose "0.11.2". You will be presented with a list of package managers to choose from (Maven, Ivy, Leiningen, and so on). Choose SBT. This will print a line like:

libraryDependencies += "org.scalanlp" % "breeze_2.11" % "0.11.2"

These are the coordinates that you will want to copy to the build.sbt file. Note that we just specified "breeze", rather than "breeze_2.11". When the package name is preceded by two percentage signs, %%, SBT automatically appends the project's Scala binary version to the package name. Thus, for a project using Scala 2.11, specifying %% "breeze" is identical to % "breeze_2.11".
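
The two notations can be put side by side. The fragment below is a sketch of a build.sbt for a Scala 2.11 project, using only the coordinates quoted above. The second form is commented out because only one of the two should be active.

```scala
// build.sbt (sketch): two equivalent ways to depend on Breeze 0.11.2
// when the project uses Scala 2.11.x. Keep only one of them.

scalaVersion := "2.11.7"

// Option 1: %% lets SBT append the Scala binary version ("_2.11").
libraryDependencies += "org.scalanlp" %% "breeze" % "0.11.2"

// Option 2: single %, with the Scala version spelled out by hand.
// libraryDependencies += "org.scalanlp" % "breeze_2.11" % "0.11.2"
```

The first form is usually preferable, since the dependency keeps matching the project if scalaVersion is changed later.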

Now return to your SBT console and run:

> reload

This will fetch the Breeze jars from Maven Central. You can now import Breeze in either the console or your scripts (within the context of this Scala project). Let's test this in the console:

> console
scala> import breeze.linalg._
import breeze.linalg._

scala> import breeze.numerics._
import breeze.numerics._

scala> val vec = linspace(-2.0, 2.0, 100)
vec: breeze.linalg.DenseVector[Double] = DenseVector(-2.0, -1.9595959595959596, ...

scala> sigmoid(vec)
res0: breeze.linalg.DenseVector[Double] = DenseVector(0.11920292202211755, 0.12351078065 ...

You should now be able to compile, run and specify dependencies for your Scala scripts.
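
The console session above can also be written as a standalone program. The following sketch assumes the build.sbt shown earlier, with the Breeze dependency in place. The file name BreezeExample.scala, the object name, and the helper sigmoidOnGrid are hypothetical.

```scala
// BreezeExample.scala (hypothetical file in the sbt-example directory)

import breeze.linalg._
import breeze.numerics._

object BreezeExample {

  /** Evaluates the sigmoid function at n evenly spaced points in [-2, 2]. */
  def sigmoidOnGrid(n: Int): DenseVector[Double] = {
    val grid = linspace(-2.0, 2.0, n)
    sigmoid(grid)
  }

  def main(args: Array[String]): Unit = {
    println(sigmoidOnGrid(5))
  }
}
```

From the SBT shell, runMain BreezeExample runs this object explicitly, which avoids the main-class prompt when HelloWorld.scala is in the same project.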

Who this book is for

This book introduces the data science ecosystem for people who already know some Scala. If you are a data scientist, or data engineer, or if you want to enter data science, this book will give you all the tools you need to implement data science solutions in Scala.

For the avoidance of doubt, let me also clarify what this book is not:

  • This is not an introduction to Scala. We assume that you already have a working knowledge of the language. If you do not, we recommend Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners.

  • This is not a book about machine learning in Scala. We will use machine learning to illustrate the examples, but the aim is not to teach you how to write your own gradient-boosted tree class. Machine learning is just one (important) part of data science, and this book aims to cover the full pipeline, from data acquisition to data visualization. If you are interested more specifically in how to implement machine learning solutions in Scala, I recommend Scala for Machine Learning, by Patrick R. Nicolas.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, and user input are shown as follows: "We can import modules with the import statement."

A block of code is set as follows:

def occurrencesOf[A](elem:A, collection:List[A]):List[Int] = {
  for { 
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

def occurrencesOf[A](elem:A, collection:List[A]):List[Int] = {
  for { 
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}

Any command-line input or output is written as follows:

scala> val nTosses = 100
nTosses: Int = 100

scala> def trial = (0 until nTosses).count { i =>
   util.Random.nextBoolean() // count the number of heads
}
trial: Int

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail , and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

The code examples are also available on GitHub at www.github.com/pbugnion/s4ds.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Questions

If you have a problem with any aspect of this book, you can contact us at , and we will do our best to address the problem.
