Scala for Data Science

Product type: Book
Published in: Jan 2016
Reading level: Intermediate
ISBN-13: 9781785281372
Edition: 1st Edition
Author: Pascal Bugnion

Pascal Bugnion is a data engineer at the ASI, a consultancy offering bespoke data science services. Previously, he was the head of data engineering at SCL Elections. He holds a PhD in computational physics from Cambridge University. Besides Scala, Pascal is a keen Python developer. He has contributed to NumPy, matplotlib and IPython. He also maintains scikit-monaco, an open source library for Monte Carlo integration. He currently lives in London, UK.

Chapter 2. Manipulating Data with Breeze

Data science is, by and large, concerned with the manipulation of structured data. A large fraction of structured datasets can be viewed as tabular data: each row represents a particular instance, and columns represent different attributes of that instance. The ubiquity of tabular representations explains the success of spreadsheet programs like Microsoft Excel, or of tools like SQL databases.

To be useful to data scientists, a language must support the manipulation of columns or tables of data. Python does this through NumPy and pandas, for instance. Unfortunately, there is no single, coherent ecosystem for numerical computing in Scala that quite measures up to the SciPy ecosystem in Python.

In this chapter, we will introduce Breeze, a library for fast linear algebra and manipulation of data arrays as well as many other features necessary for scientific computing and data science.

Code examples


The easiest way to access the code examples in this book is to clone the GitHub repository:

$ git clone 'https://github.com/pbugnion/s4ds'

The code samples for each chapter are in a single, standalone folder. You may also browse the code online on GitHub.

Installing Breeze


If you have downloaded the code examples for this book, the easiest way of using Breeze is to go into the chap02 directory and type sbt console at the command line. This will open a Scala console in which you can import Breeze.

If you want to build a standalone project, the most common way of installing Breeze (and, indeed, any Scala module) is through SBT. To fetch the dependencies required for this chapter, copy the following lines to a file called build.sbt, taking care to leave an empty line after scalaVersion:

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

Open a Scala console in the same directory as your build.sbt file by typing sbt console in a terminal. You can check that Breeze is working correctly by importing Breeze from the Scala prompt:

scala> import breeze.linalg._
import breeze.linalg._

Getting help on Breeze


This chapter gives a reasonably detailed introduction to Breeze, but it does not aim to give a complete API reference.

To get a full list of Breeze's functionality, consult the Breeze wiki on GitHub at https://github.com/scalanlp/breeze/wiki. Its coverage is thorough for some modules and patchier for others. The source code (https://github.com/scalanlp/breeze/) is detailed and informative. To understand how a particular function is meant to be used, look at the unit tests for that function.

Basic Breeze data types


Breeze is an extensive library providing fast and easy manipulation of arrays of data, routines for optimization, interpolation, linear algebra, signal processing, and numerical integration.

The basic linear algebra operations underlying Breeze rely on the netlib-java library, which can use system-optimized BLAS and LAPACK libraries, if present. Thus, linear algebra operations in Breeze are often extremely fast. Breeze is still undergoing rapid development and can, therefore, be somewhat unstable.

Vectors

Breeze makes manipulating one- and two-dimensional data structures easy. To start, open a Scala console through SBT and import Breeze:

$ sbt console
scala> import breeze.linalg._
import breeze.linalg._

Let's dive straight in and define a vector:

scala> val v = DenseVector(1.0, 2.0, 3.0)
v: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)

We have just defined a three-element vector, v. Vectors are just one-dimensional arrays of data exposing methods...
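As a minimal sketch of the kind of methods a DenseVector exposes, the snippet below shows a few common operations. It is meant to be pasted into the sbt console described above; the second vector w and all of the result names are hypothetical and chosen only for illustration.

```scala
import breeze.linalg._

val v = DenseVector(1.0, 2.0, 3.0)
val w = DenseVector(4.0, 5.0, 6.0)  // hypothetical second vector

val first = v(0)          // access an element by index
val scaled = v * 2.0      // multiply every element by a scalar
val added = v + w         // element-wise addition
val dotProduct = v dot w  // inner product of v and w
val total = sum(v)        // sum of all the elements of v
```

Each of these operations returns a new value and leaves v and w unchanged.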

An example – logistic regression


Let's now imagine we want to build a classifier that takes a person's height and weight and assigns a probability to their being Male or Female. We will reuse the height and weight data introduced earlier in this chapter. Let's start by plotting the dataset:

[Figure: Height versus weight data for 181 men and women]

There are many different algorithms for classification. A first glance at the data shows that we can, approximately, separate men from women by drawing a straight line across the plot. A linear method is therefore a reasonable initial attempt at classification. In this section, we will use logistic regression to build a classifier.

A detailed explanation of logistic regression is beyond the scope of this book. The reader unfamiliar with logistic regression is referred to The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. We will just give a brief summary here.

Logistic regression estimates the probability of a given height and weight belonging...
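To make the idea concrete, here is a minimal sketch of how such a probability could be computed with Breeze, assuming the usual formulation in which a linear combination of the features is passed through the logistic (sigmoid) function. The toy feature matrix, its leading column of ones (for the intercept), and the coefficient values are all hypothetical; in practice, the coefficients are fitted to the data by an optimizer.

```scala
import breeze.linalg._
import breeze.numerics._

// Hypothetical toy data: each row is (1.0, rescaled height, rescaled weight).
// The leading 1.0 lets the first coefficient act as an intercept.
val features = DenseMatrix(
  (1.0, 0.5, 0.2),
  (1.0, -1.0, -0.7),
  (1.0, 0.1, 0.9)
)

// Hypothetical coefficients, not fitted to any real data.
val coefficients = DenseVector(0.0, 1.0, 2.0)

// Linear score for each row of the feature matrix.
val scores = features * coefficients

// The sigmoid function maps each score to a probability between 0 and 1.
val probabilities = sigmoid(scores)
```

A positive score gives a probability above 0.5 and a negative score gives one below 0.5, so the line on which the score is zero acts as the decision boundary.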

Towards re-usable code


In the previous section, we performed all of the computation in a single script. While this is fine for data exploration, it means that we cannot reuse the logistic regression code that we have built. In this section, we will start the construction of a machine learning library that you can reuse across different projects.

We will factor the logistic regression algorithm out into its own class, LogisticRegression:

import breeze.linalg._
import breeze.numerics._
import breeze.optimize._

class LogisticRegression(
    val training: DenseMatrix[Double],
    val target: DenseVector[Double])
{

The class takes, as input, a matrix representing the training set and a vector denoting the target variable. Notice that we declare these as vals, meaning that they are set when an instance is created and cannot be reassigned for the lifetime of that instance. Of course, the DenseMatrix and DenseVector objects are themselves mutable, so the values that training and target point to might change...
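The distinction between a fixed reference and mutable contents can be illustrated with a short sketch, suitable for pasting into the sbt console. The vector values and the name snapshot are hypothetical; taking a copy is just one possible way of guarding against in-place changes, not necessarily the approach taken in this chapter.

```scala
import breeze.linalg._

val target = DenseVector(0.0, 1.0, 1.0)

// The reference `target` cannot be rebound. Uncommenting the next line
// causes a compile error ("reassignment to val").
// target = DenseVector(1.0, 1.0, 1.0)

// The contents of the vector, however, can be updated in place.
target(0) = 1.0

// A copy is unaffected by later in-place updates to the original.
val snapshot = target.copy
target(1) = 0.0
```

After running this, target holds the updated elements, while snapshot still holds the values that target had when the copy was taken.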

Alternatives to Breeze


Breeze is the most feature-rich and approachable Scala framework for linear algebra and numeric computation. However, do not take my word for it: experiment with other libraries for tabular data. In particular, I recommend trying Saddle, which provides a Frame object similar to data frames in pandas or R. In the Java world, the Apache Commons Math library provides a very rich toolkit for numerical computation. In Chapter 10, Distributed Batch Processing with Spark, Chapter 11, Spark SQL and DataFrames, and Chapter 12, Distributed Machine Learning with MLlib, we will explore Spark and MLlib, which allow the user to run distributed machine learning algorithms.

Summary


This concludes our brief overview of Breeze. We have learned how to manipulate basic Breeze data types, how to use them for linear algebra, and how to perform convex optimization. We then used our knowledge to clean a real dataset and performed logistic regression on it.

In the next chapter, we will discuss breeze-viz, a plotting library for Scala.

References


The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, gives a lucid, practical description of the mathematical underpinnings of machine learning. Anyone aspiring to do more than mindlessly apply machine learning algorithms as black boxes ought to have a well-thumbed copy of this book.

Scala for Machine Learning, by Patrick R. Nicolas, describes practical implementations of many useful machine learning algorithms in Scala.

The Breeze documentation (https://github.com/scalanlp/breeze/wiki/Quickstart), API docs (http://www.scalanlp.org/api/breeze/#package), and source code (https://github.com/scalanlp/breeze) provide the most up-to-date sources of documentation on Breeze.
