Any computer scientist should understand the different classes of machine learning algorithms and be able to select the ones relevant to their domain of expertise and dataset. However, applying these algorithms represents a small fraction of the overall effort needed to extract an accurate and well-performing model from input data. A common data mining workflow consists of the following sequential steps:
Defining the problem to solve.
Loading the data.
Cleaning the data.
Discovering patterns, affinities, clusters, and classes, if needed.
Selecting the model features and the appropriate machine learning algorithm(s).
Refining and validating the model.
Improving the computational performance of the implementation.
As we will emphasize throughout this book, each stage of the process is critical for building a model appropriate for the problem.
It is impossible to describe in every detail the key machine learning algorithms and their implementation in a single book. The sheer quantity of information and Scala code would overwhelm even the most dedicated readers. Each chapter focuses on the mathematics and code that are absolutely essential for the understanding of the topic. Developers are encouraged to browse through the following areas:
Scala coding conventions and standards used in the book in the Appendix
API Scala docs
Fully documented source code, available online
This first chapter introduces the following elements:
Basic concept of machine learning
Taxonomy of machine learning algorithms
Language, tools, frameworks, and libraries used throughout the book
A typical workflow of model training and prediction
A simple concrete application using binomial logistic regression
Each chapter contains a small section dedicated to the formulation of the algorithms for those interested in the mathematical concepts behind the science and art of machine learning. These sections are optional and defined within a tip box.
For example, the mathematical expression of the mean and the variance of a variable, X, as mentioned in a tip box will be as follows:
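As a sketch of what such a tip box contains, the standard definitions of the mean and the variance can be written as below. The notation is an assumption: n observations x_i of the variable X, and the population (not sample) form of the variance.

```latex
\mu_X = \frac{1}{n}\sum_{i=0}^{n-1} x_i
\qquad
\sigma_X^2 = \frac{1}{n}\sum_{i=0}^{n-1} \left(x_i - \mu_X\right)^2
```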
The recent explosion in the number of digital devices has generated an ever-increasing amount of data. The best analogy I can find to describe the need, desire, and urgency for extracting knowledge from large datasets is the process of extracting a precious metal from a mine, and in some cases, extracting blood from a stone.
Knowledge is quite often defined as a model that can be constantly updated or tweaked as new data comes into play. Models are obviously domain-specific, ranging from credit risk assessment, face recognition, maximization of quality of service, classification of pathological symptoms of disease, optimization of computer networks, and security intrusion detection, to customers' online behavior and purchase history.
Machine learning problems are categorized as classification, prediction, optimization, and regression.
The purpose of classification is to extract knowledge from historical data. For instance, a classifier can be built to identify a disease from a set of symptoms. The scientist collects information regarding body temperature (continuous variable), congestion (discrete variables of HIGH, MEDIUM, and LOW), and the actual diagnosis (flu). This dataset is used to create a model such as IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72), which doctors can use in their diagnosis.
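The rule quoted above can be sketched in a few lines of Scala. This is only an illustration: the names FluRule, Symptoms, and fluProbability are hypothetical, and a real classifier would learn the thresholds and the probability from the data rather than hard-code them.

```scala
object FluRule {
  // Discrete congestion levels, as listed in the text
  sealed trait Congestion
  case object HIGH extends Congestion
  case object MEDIUM extends Congestion
  case object LOW extends Congestion

  // One observation: body temperature (continuous) and congestion (discrete)
  case class Symptoms(temperature: Double, congestion: Congestion)

  // Returns Some(probability of flu) when the rule fires, None otherwise.
  // The threshold 102 and the probability 0.72 are the values quoted in the text.
  def fluProbability(s: Symptoms): Option[Double] =
    if (s.temperature > 102.0 && s.congestion == HIGH) Some(0.72) else None
}
```

A doctor-facing application would call FluRule.fluProbability(FluRule.Symptoms(103.0, FluRule.HIGH)) and treat None as "the rule does not apply".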
Once the model is trained using historical observations and validated against historical observations, it can be used to predict some outcome. A doctor collects symptoms from a patient, such as body temperature and nasal congestion, and anticipates the state of his/her health.
Some global optimization problems are intractable using traditional linear and nonlinear optimization methods. Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search). You can imagine that fighting the spread of a new virus requires optimizing a process that may evolve over time as more symptoms and cases are uncovered.
Regression is closely related to classification and is particularly suitable for a continuous model. Linear (least squares), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model or function, y = f(x), x = {x_i}, to a dataset. Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.
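To make the idea of fitting y = f(x) concrete, here is a minimal sketch of a single-variable linear least squares fit. The object and method names are hypothetical and the closed-form solution shown is the textbook one, not the implementation used later in the book.

```scala
object LeastSquares {
  // Fits y = slope * x + intercept by ordinary least squares
  // and returns the pair (slope, intercept).
  def fit(xs: Vector[Double], ys: Vector[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size > 1, "need at least two paired observations")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Covariance and variance terms (unnormalized; the 1/n factors cancel)
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "x values must not all be equal")
    val slope = sxy / sxx
    (slope, meanY - slope * meanX)
  }
}
```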
Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for distributed computation of large amounts of data.
Note
Disclaimer
This section does not constitute a formal introduction or description of the features of Scala. It merely mentions some of its features that are valuable to machine learning practitioners. Experienced Scala developers may skip to the next section.
Among the capabilities of the language, the following features are deemed essential in machine learning and statistical analysis.
There are many functional features in Scala which may unsettle software engineers with experience in object-oriented programming. This section deals specifically with monadic and functorial representations of data. Functors and monads are concepts defined in the field of mathematics known as category theory. Formally:
A functor is a data type that defines how a transformation known as a map applies to it. Scala implements functors as type classes with a map method.
A monad is a wrapper around an existing data type. It applies a transformation to data of the wrapper type and returns a value of the same wrapper type. Scala implements monads as type classes with unit and flatMap methods. Monads extend functors in Scala.
Functors and monads are important concepts in functional programming.
Functors and monads are derived from category and group theory; they allow developers to create a high-level abstraction, as illustrated in the following Scala libraries:
Scalaz: https://github.com/scalaz
Twitter's Algebird: https://github.com/twitter/algebird
ScalaNLP's Breeze: https://github.com/dlwh/breeze
In mathematics, a category M is a structure that is defined by the following:
Objects of some type {x ∈ X, y ∈ Y, z ∈ Z, …}
Morphisms or maps applied to these objects x ∈ X, y ∈ Y, f: x → y
Composition of morphisms f: x → y, g: y → z => g ∘ f: x → z
Covariant functors, contravariant functors, and bifunctors are well-understood concepts in algebraic topology that are related to manifolds and vector bundles. They are commonly used in differential geometry for the generation of nonlinear models.
Higher kinded types (HKTs) are abstractions of types. They generate a new type from existing types. Let's consider the following parameterized trait:
trait M[T] { … }
A higher kinded type H over a trait M is defined as either a trait or a class:
trait H[M[_]]
class H[M[_]]
Functors and monads are higher kinded types.
How are higher kinded types relevant to data analysis and machine learning?
Scientists define observations as sets or vectors of features.
Classification problems rely on the estimation of the similarity between vectors of observations. One technique consists of comparing two vectors by computing the normalized inner (or dot) product. A covector is defined as a linear map α of a vector to the inner product (field).
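The normalized inner product mentioned above is usually called the cosine similarity. The following sketch assumes observations are stored as Vector[Double]; the names Similarity, dot, and cosine are illustrative only.

```scala
object Similarity {
  // Inner (dot) product of two vectors of the same dimension
  def dot(x: Vector[Double], y: Vector[Double]): Double = {
    require(x.size == y.size, "vectors must have the same dimension")
    x.zip(y).map { case (a, b) => a * b }.sum
  }

  // Inner product normalized by the Euclidean norms of both vectors.
  // The result lies in [-1, 1]; 1 means the vectors point in the same direction.
  def cosine(x: Vector[Double], y: Vector[Double]): Double = {
    val normX = math.sqrt(dot(x, x))
    val normY = math.sqrt(dot(y, y))
    require(normX > 0.0 && normY > 0.0, "zero vectors have no direction")
    dot(x, y) / (normX * normY)
  }
}
```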
Let's define a vector as a constructor from any field, _ => Vector[_]. A covector is then defined as the mapping function of a vector to its field: Vector[_] => _.
Let's then define a two-dimensional (two types or fields) higher kinded structure, Hom, that can be defined as either a vector or a covector by fixing one of the two types:
type Hom[T] = {
type Right[X] = Function1[X,T] // Covector
type Left[X] = Function1[T,X] // Vector
}
Note
Tensors and manifolds
Vectors and covectors are classes of tensors (contravariant and covariant). Tensors (fields) are used in manifold learning of nonlinear models and in the generation of kernel functions. Manifolds are briefly introduced in the Manifolds section. The topic of tensor fields in manifold learning is beyond the scope of this book.
The projections of the higher kinded type Hom to the Right or Left single-parameter types are known as functors:
Covariant functor for the right projection
Contravariant functor for the left projection
A covariant functor is a mapping function, such as F: C => C, with the following properties:
If f: x → y is a morphism on C then F(x) → F(y) is also a morphism on C
If id: x → x is the identity morphism on C then F(id) is also an identity morphism on C
If g: y → z is also a morphism on C then F(g ∘ f) = F(g) ∘ F(f)
The definition of the covariant functor is F[U => V] := F[U] => F[V]. Its implementation in Scala is:
trait Functor[M[_]] {
  def map[U,V](m: M[U])(f: U => V): M[V]
}
For example, let's consider an observation defined as an n-dimensional vector of type T, Obs[T]. The constructor for the observation can be represented as Function1[T,Obs]. Its functor, ObsFunctor, is implemented as:
trait ObsFunctor[T] extends Functor[(Hom[T])#Left] { self =>
  override def map[U,V](vu: Function1[T,U])(f: U => V): Function1[T,V] =
    f.compose(vu)
}
The functor is qualified as a covariant functor because the morphism is applied to the return type of the element of Obs, Function1[T, Obs]. The projection of the two-parameter type Hom to a vector is implemented as (Hom[T])#Left.
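The following self-contained sketch shows the covariant functor in use. It repeats the Functor and Hom definitions from the text so that it compiles on its own; the concrete types (String, Int, Double) and the values length and halfLength are hypothetical, and the code is written against Scala 2.11 to 2.13.

```scala
import scala.language.higherKinds

object CovariantSketch {
  trait Functor[M[_]] {
    def map[U, V](m: M[U])(f: U => V): M[V]
  }

  // Two-parameter structure from the text; fixing T leaves a one-parameter type
  type Hom[T] = {
    type Right[X] = Function1[X, T] // Covector
    type Left[X] = Function1[T, X]  // Vector
  }

  // Declared as a class (rather than a trait) so it can be instantiated directly
  class ObsFunctor[T] extends Functor[(Hom[T])#Left] {
    override def map[U, V](vu: Function1[T, U])(f: U => V): Function1[T, V] =
      f.compose(vu)
  }

  // Hypothetical example with T = String, U = Int, V = Double
  val length: String => Int = _.length
  val obsFunctor = new ObsFunctor[String]

  // map transforms the *output* of length, leaving its input type untouched
  val halfLength: String => Double = obsFunctor.map(length)(n => n / 2.0)
}
```

The morphism n => n / 2.0 is applied after length, which is why the functor is covariant in the return type.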
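A contravariant functor is a mapping function, F: C => C, with the following properties:
If f: x → y is a morphism on C then F(y) → F(x) is also a morphism on C
If id: x → x is the identity morphism on C then F(id) is also an identity morphism on C
If g: y → z is also a morphism on C then F(g ∘ f) = F(f) ∘ F(g)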
The definition of the contravariant functor is F[U => V] := F[V] => F[U], as follows:
trait CoFunctor[M[_]] {
  def map[U,V](m: M[U])(f: V => U): M[V]
}
Note that the input and output types in the morphism f are reversed from the definition of a covariant functor. The constructor for the covector can be represented as Function1[Obs,T]. Its functor, CoObsFunctor, is implemented as:
trait CoObsFunctor[T] extends CoFunctor[(Hom[T])#Right] { self =>
  override def map[U,V](vu: Function1[U,T])(f: V => U): Function1[V,T] =
    f.andThen(vu)
}
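As a sketch of how the contravariant functor behaves, the example below adapts a covector defined on Vector[Double] so that it accepts Array[Double] instead. The definitions from the text are repeated to keep the block self-contained; norm and normOfArray are hypothetical names, and the code targets Scala 2.11 to 2.13.

```scala
import scala.language.higherKinds

object ContravariantSketch {
  trait CoFunctor[M[_]] {
    def map[U, V](m: M[U])(f: V => U): M[V]
  }

  type Hom[T] = {
    type Right[X] = Function1[X, T] // Covector
    type Left[X] = Function1[T, X]  // Vector
  }

  // Declared as a class (rather than a trait) so it can be instantiated directly
  class CoObsFunctor[T] extends CoFunctor[(Hom[T])#Right] {
    override def map[U, V](vu: Function1[U, T])(f: V => U): Function1[V, T] =
      f.andThen(vu)
  }

  // A covector: maps a vector to a value of its field (here, the Euclidean norm)
  val norm: Vector[Double] => Double = v => math.sqrt(v.map(x => x * x).sum)

  val coFunctor = new CoObsFunctor[Double]

  // map transforms the *input* of norm: f converts the new input type to the old one
  val normOfArray: Array[Double] => Double =
    coFunctor.map(norm)((a: Array[Double]) => a.toVector)
}
```

Here the morphism runs before the covector is applied, which is the reversal the text describes.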
Monads are structures in algebraic topology related to category theory. Monads extend the concept of functors to allow composition known as the monadic composition of morphisms on a single type. They enable the chaining or weaving of computation into a sequence of steps sometimes known as a data pipeline. The collections bundled with the Scala standard library (List, Map, …) are constructed as monads [1:1].
Monads provide the ability for those collections to do the following:
Create the collection
Transform the elements of the collection
Flatten nested collections
The following Scala definition of a monad as a trait illustrates the concept of a higher kinded Monad trait for type M:
trait Monad[M[_]] {
  def unit[T](a: T): M[T]
  def map[U,V](m: M[U])(f: U => V): M[V]
  def flatMap[U,V](m: M[U])(f: U => M[V]): M[V]
}
Monads are therefore critical in machine learning as they enable the composition of multiple data transformation functions into a sequence or workflow. This property is applicable to any type of complex scientific computation [1:2].
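To illustrate the composition of data transformations, here is a minimal sketch of a Monad instance for List together with a two-step pipeline. The Monad trait is repeated so the block stands alone; ListMonad, positive, and pipeline are hypothetical names, not part of the book's library.

```scala
import scala.language.higherKinds

object MonadSketch {
  trait Monad[M[_]] {
    def unit[T](a: T): M[T]
    def map[U, V](m: M[U])(f: U => V): M[V]
    def flatMap[U, V](m: M[U])(f: U => M[V]): M[V]
  }

  // Instance for List, delegating to the standard library methods
  object ListMonad extends Monad[List] {
    def unit[T](a: T): List[T] = List(a)
    def map[U, V](m: List[U])(f: U => V): List[V] = m.map(f)
    def flatMap[U, V](m: List[U])(f: U => List[V]): List[V] = m.flatMap(f)
  }

  // Step 1: keep strictly positive values only (an empty list drops the value)
  def positive(x: Double): List[Double] = if (x > 0.0) List(x) else Nil

  // Step 2 is the square root; the two steps are chained through the monad
  def pipeline(xs: List[Double]): List[Double] =
    ListMonad.map(ListMonad.flatMap(xs)(x => positive(x)))(x => math.sqrt(x))
}
```

Because flatMap flattens the intermediate lists, invalid observations simply disappear from the output instead of requiring explicit error handling between steps.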
Note
Monadic composition of kernel functions
Monads are used in the composition of kernel functions in the Kernel functions monadic composition section in Chapter 12, Kernel Models and Support Vector Machines.
Machine learning models are generated through sequences of tasks or dataflows that demand a modular design.
As an object-oriented programming language, Scala allows developers to do the following:
Define high-level component abstraction
Allow different developers to work concurrently on different components
Reuse code
Isolate functionality for easier debugging and testing (unit tests)
You may wonder how Scala fares as an object-oriented programming language compared with Java.
Note
Scala versus Java
Scala is a purer object-oriented language than Java. It supports neither static methods (their role is played by methods of singleton objects) nor primitive types.
One important facet of object-oriented programming is the ability to change modules or implement functionality on the fly, without the need to recompile the client code. This technique is known as dependency injection. Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits [1:3]. One of the most commonly used dependency injection patterns, the cake pattern, is described in the Building workflows with mixins section in Chapter 2, Data Pipelines.
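A minimal sketch of self-referenced composition combined with stackable traits follows. All names (Preprocessor, DropNaN, Clip, Workflow) are hypothetical and the example is far simpler than the cake pattern described in Chapter 2; it only shows the mechanics.

```scala
object InjectionSketch {
  // The dependency: an abstract preprocessing step
  trait Preprocessor {
    def clean(xs: Vector[Double]): Vector[Double]
  }

  // A concrete implementation that removes NaN values
  trait DropNaN extends Preprocessor {
    def clean(xs: Vector[Double]): Vector[Double] = xs.filterNot(_.isNaN)
  }

  // A stackable trait: decorates whatever clean implementation precedes it
  trait Clip extends Preprocessor {
    abstract override def clean(xs: Vector[Double]): Vector[Double] =
      super.clean(xs).map(x => math.max(-1.0, math.min(1.0, x)))
  }

  // Self-type: Workflow can only be instantiated with a Preprocessor mixed in
  trait Workflow { self: Preprocessor =>
    def run(xs: Vector[Double]): Double = {
      val cleaned = clean(xs)
      if (cleaned.isEmpty) 0.0 else cleaned.sum / cleaned.size
    }
  }

  // The dependencies are injected at instantiation time, in stacking order
  val workflow = new Workflow with DropNaN with Clip
}
```

Swapping DropNaN for another Preprocessor, or removing Clip, changes the behavior of run without touching the Workflow trait itself.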
As seen previously, functors and monads enable the parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, actors are one of the core elements that make Scala scalable. Actors provide Scala developers with a high level of abstraction to build scalable, distributed, and concurrent applications. Actors hide the nitty-gritty implementation of concurrency and the management of the underlying thread pool. Actors communicate through asynchronous immutable messages. A distributed computing Scala framework such as Akka or Apache Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Apache Spark are described in detail in the last chapter of this book [1:4].
Concisely, a workflow is implemented as a sequence of activities or computational tasks. These tasks consist of higher-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala provides developers with the tools to partition datasets and execute the tasks through a cluster of actors. Scala also supports message dispatching and routing between local and remote actors. A developer may decide to deploy a workflow either locally or across multiple CPU cores and servers with very few code alterations.
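The chaining of higher-order methods can be sketched on a small local collection. The data and the names raw and result are hypothetical; the same sequence of calls would apply unchanged to a much larger collection of observations.

```scala
object WorkflowSketch {
  // Hypothetical raw observations; None marks a missing reading
  val raw: Vector[Option[Double]] = Vector(Some(1.0), None, Some(4.0), Some(-2.0))

  val result: Double = raw
    .flatMap(o => o.toList)   // task 1: drop missing readings
    .filter(_ > 0.0)          // task 2: keep positive values only
    .map(x => x * x)          // task 3: transform each observation
    .fold(0.0)(_ + _)         // task 4: aggregate into a single value
}
```

Each line is one task of the workflow, and each task only depends on the output of the previous one, which is what makes such a pipeline easy to partition.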
The following figure visualizes the different elements of the definition and deployment of a workflow (or data pipeline):
In the preceding diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 in a similar way to a scheduler. These tasks are actually executed over multiple worker nodes, and are implemented by actors. The master node or actor exchanges messages with the workers to manage the state of the execution of the workflow, as well as its reliability, as illustrated in the Scalability with actors section of Chapter 16, Parallelism with Scala and Akka. The high availability of these tasks is maintained through a hierarchy of supervising actors.
Note
Domain-specific languages (DSLs)
Scala embeds DSLs natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. A well-known application of DSLs is the emulation of the syntax used in MATLAB, which is familiar to most data scientists.
A model can be predictive, descriptive, or adaptive.
Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors (or features). They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields, including marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a preselected training set.
Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models define the first and important step in knowledge discovery. They are commonly generated through unsupervised learning.
A third category of models, known as adaptive modeling, is created through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend, and possibly execute, actions in an attempt to solve a problem, optimizing an objective function or resolving constraints.
The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications, such as genomics, social networking, advertising, or risk analysis generate a very large amount of data which can be analyzed or mined to extract knowledge or insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data [1:5].
Data mining is the process of extracting or identifying patterns in a dataset.
The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process known as density estimation in statistics is broken down into two categories: the discovery of data clusters and the discovery of latent factors. The methodology consists of processing input data to understand patterns similar to the natural learning process in infants or animals.
Unsupervised learning does not require labeled data (or expected values), and therefore, is easy to implement and execute because no expertise is needed to validate an output. However, it is possible to label the output of a clustering algorithm and use it in future classifications.
The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm is used to organize observations into clusters by minimizing the distance between observations within a cluster and maximizing the distance between observations across clusters. A clustering algorithm consists of the following steps:
Data clustering is also known as data segmentation or data partitioning.
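One common way to realize this minimization, assumed here for illustration rather than taken from the text, is Lloyd's K-means iteration: assign each observation to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the centroids stop moving. The sketch below works on one-dimensional observations and all names are hypothetical.

```scala
import scala.annotation.tailrec

object KMeansSketch {
  // Assignment step: index of the nearest centroid for each observation
  def assign(xs: Vector[Double], centroids: Vector[Double]): Vector[Int] =
    xs.map(x => centroids.indices.minBy(k => math.abs(x - centroids(k))))

  // Update step: each centroid becomes the mean of its members;
  // a centroid without members is left where it is
  def update(xs: Vector[Double], labels: Vector[Int], centroids: Vector[Double]): Vector[Double] =
    centroids.indices.toVector.map { k =>
      val members = xs.zip(labels).collect { case (x, label) if label == k => x }
      if (members.isEmpty) centroids(k) else members.sum / members.size
    }

  // Iterates until the centroids are stable or maxIters is exhausted
  @tailrec
  def fit(xs: Vector[Double], centroids: Vector[Double], maxIters: Int = 100): Vector[Double] = {
    val next = update(xs, assign(xs, centroids), centroids)
    if (maxIters <= 1 || next == centroids) next else fit(xs, next, maxIters - 1)
  }
}
```

The result depends on the initial centroids, which is why practical implementations typically run the algorithm from several starting points.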
Dimension reduction techniques aim to find the smallest, yet most relevant, set of features needed to build a reliable model. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.
There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The taxonomy breaks down these techniques according to their purpose, although the list is far from being exhaustive, as shown in the following diagram:
The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to find a relation or function f: x → y using a training set {x, y}. Supervised learning is generally more accurate than other learning strategies as long as the labeled input data is available and reliable. The downside is that a domain expert may be required to label (or tag) data as a training set.
Supervised machine learning algorithms can be broken down into two categories:
Generative models
Discriminative models
In order to simplify the description of a statistics formula, we adopt the following simplification: the probability of an event X is the same as the probability of the discrete random variable X having a value x: p(X) = p(X=x):
The notation for the joint probability is p(X,Y) = p(X=x, Y=y)
The notation for the conditional probability is p(X|Y) = p(X=x | Y=y)
Generative models attempt to fit a joint probability distribution p(X,Y) of two events (or random variables), X and Y, representing two sets of observed and hidden variables, x and y. Discriminative models compute the conditional probability p(Y|X) of an event or random variable Y of hidden variables y, given an event or random variable X of observed variables x. Generative models are commonly introduced through Bayes' rule. The conditional probability of an event Y given an event X is computed as the product of the conditional probability of the event X given the event Y and the probability of the event Y, normalized by the probability of event X [1:6].
Note
Bayes' rule
Joint probability for independent random variables X=x and Y=y:
Conditional probability of a random variable Y = y, given X = x:
Bayes' formula
Bayes' rule is the foundation of the Naïve Bayes classifier, which is described in the Introducing the multinomial Naïve Bayes section in Chapter 6, Naïve Bayes Classifiers.
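The three expressions named in this note can be written as follows. This is a sketch using the standard definitions and the notation introduced earlier in this section, with p(X=x) assumed to be non-zero.

```latex
% Joint probability for independent random variables X = x and Y = y
p(X = x, Y = y) = p(X = x)\, p(Y = y)

% Conditional probability of a random variable Y = y, given X = x
p(Y = y \mid X = x) = \frac{p(X = x, Y = y)}{p(X = x)}

% Bayes' formula
p(Y = y \mid X = x) = \frac{p(X = x \mid Y = y)\, p(Y = y)}{p(X = x)}
```

Bayes' formula follows from the second expression by writing the joint probability as p(X=x | Y=y) p(Y=y).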
Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.
Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation. Here are some brief guidelines describing which type of models make sense according to the objective or criteria of the project:
Objective | Generative models | Discriminative models
Accuracy | Highly dependent on the training set. | Depends on the training set and algorithm configuration (for example, kernel functions).
Modeling requirements | There is a need to model both observed and hidden variables, which requires a significant amount of training. | The quality of the training set does not have to be as rigorous as for generative models.
Computation cost | It is usually low. For example, any graphical method derived from Bayes' rule has low overhead. | Most algorithms rely on optimization of a convex function with significant performance overhead.
Constraints | These models assume some degree of independence among the model features. | Most discriminative algorithms accommodate dependencies between features.
We can further refine the taxonomy of supervised learning algorithms by distinguishing, somewhat arbitrarily, between sequential and random variables for generative models and by breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification). The following figure illustrates a partial taxonomy of supervised learning algorithms:
Semi-supervised learning is used to build models from a dataset with incomplete labels. Manifold learning and information geometry algorithms are commonly applied to large datasets that are partially labeled. The description of semi-supervised learning techniques is beyond the scope of the book.
Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realm of robotics or game strategy. However, since the 1990s, genetic-algorithm-based classifiers have become increasingly popular in solving problems that require the collaboration of a system with a domain expert.
For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In their simplest form, these algorithms estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert, if necessary. The foremost challenge developers of reinforcement learning systems face is that the recommended action or policy may depend on a partially observable state.
Genetic algorithms are not usually considered part of the reinforcement learning toolbox. However, advanced models such as learning classifier systems use genetic algorithms to classify and reward the best-performing rules and policies.
As with the two previous learning strategies, reinforcement learning models can be categorized as Markovian or evolutionary. The following figure represents a partial taxonomy of the reinforcement learning algorithms:
The genetic algorithm is described in Chapter 13, Evolutionary Computing, and the Q-learning reinforcement method is introduced in Chapter 15, Reinforcement Learning.
This is a brief overview of machine learning algorithms with a suggested, approximate taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse the list of references at the end of the book to find the documentation appropriate to your level of interest and understanding.
There are numerous robust, accurate, and efficient Java libraries for mathematics, linear algebra, or optimization that have been widely used for many years:
JBlas/Linpack: https://github.com/mikiobraun/jblas
Parallel Colt: https://github.com/rwl/ParallelColt
Apache Commons Math: http://commons.apache.org/proper/commons-math
There is absolutely no need to rewrite, debug, and test these components in Scala. Developers should consider creating a wrapper or interface to their favorite and reliable Java library. The book leverages the Apache Commons Math library for some specific linear algebra algorithms.
Before getting your hands dirty, you need to download and deploy the minimum set of tools and libraries; there is no need to reinvent the wheel, after all. A few key components have to be installed in order to compile and run the source code described throughout this book. We will focus on open source and commonly available libraries, although you are invited to experiment with the equivalent tools of your choice. The learning curve for the frameworks described here is minimal.
The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64. You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.
The code has been tested with Scala 2.11.4 and 2.11.8. We recommend using Scala version 2.11.4 or higher with SBT 0.13.1 or higher. Let's assume that the Scala runtime (REPL) and libraries have been properly installed and that the environment variables SCALA_HOME and PATH have been updated.
The Scala standard library can be downloaded as binaries or as part of the Typesafe Activator tool by visiting http://www.scala-lang.org/download/.
The description and installation instructions for the Eclipse Scala IDE version 4.0 and higher are available at http://scala-ide.org/docs/user/gettingstarted.html.
You can also download the IntelliJ IDEA Scala plugin version 13 or higher from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.
The ubiquitous Simple Build Tool (SBT) will be our primary building engine. It can be downloaded as part of the Typesafe Activator or directly from http://www.scala-sbt.org/download.html.
The syntax of the build file sbt/build.sbt conforms to version 0.13 and is used to compile and assemble the source code presented throughout this book. To build Scala for machine learning, do the following:
Set the maximum size for the JVM heap to 2058 Mbytes or higher and the permanent memory to 512 Mbytes or higher (that is, -Xmx4096m -Xms512m -XX:MaxPermSize=512m)
To build the Scala for machine learning library package:
$(ROOT)/sbt clean publish-local
To build the package including test and resource files:
$(ROOT)/sbt clean package
To generate Scala doc for the library:
$(ROOT)/sbt doc
To generate Scala doc for the example:
$(ROOT)/sbt test:doc
To generate report for compliance to Scala style guide:
$(ROOT)/sbt scalastyle
To compile all examples:
$(ROOT)/sbt test:compile
Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].
This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily woven into a machine learning problem. The examples used throughout the book require version 3.5 or higher.
The math library supports the following:
For more information, visit http://commons.apache.org/proper/commons-math.
The library is distributed under the Apache License 2.0; the terms are available at https://www.apache.org/licenses/LICENSE-2.0.
The installation and deployment of the Apache Commons Math library are quite simple. The steps are as follows:
Go to the download page at http://commons.apache.org/proper/commons-math/download_math.cgi.
Download the latest .jar files in the binary section, commons-math3-3.6-bin.zip (for version 3.6, for instance).
Unzip and install the .jar file.
Add commons-math3-3.6.jar to the CLASSPATH, as follows:
For macOS X:
export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.6.jar
For Windows:
Go to System property | Advanced system settings | Advanced | Environment variables and then edit the CLASSPATH variable.
Add the commons-math3-3.6.jar file to your IDE environment if needed:
Eclipse Scala IDE: Project | Properties | Java Build Path | Libraries | Add External JARs
IntelliJ IDEA: File | Project Structure | Project Settings | Libraries
Optionally, download the source commons-math3-3.6-src.zip from the source section.
JFreeChart is an open source charting and plotting Java library widely used in the Java programmer community. It was originally created by David Gilbert [1:8].
The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithm throughout the book, but you are encouraged to explore this great library on your own, as time permits.
It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.
To install and deploy JFreeChart, perform the following steps:
Download the latest version from SourceForge: https://sourceforge.net/projects/jfreechart/files/.
Unzip and deploy the .jar file.
Add jfreechart-1.0.17.jar (for version 1.0.17) to the CLASSPATH, as follows:
For macOS X:
export CLASSPATH=$CLASSPATH:/JFreeChart_path/jfreechart-1.0.17.jar
For Windows:
Go to System property | Advanced system settings | Advanced | Environment variables and then edit the CLASSPATH variable.
Add the jfreechart-1.0.17.jar file to your IDE environment:
Eclipse Scala IDE: Project | Properties | Java Build Path | Libraries | Add External JARs
IntelliJ IDEA: File | Project Structure | Project Settings | Libraries | +
Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with instructions for downloading them. Libraries related to the conditional random fields and support vector machines are described in their respective chapters.
Note
Why aren't we using Scala algebra and Scala numerical libraries?
Libraries such as Breeze, ScalaNLP, and Algebird are interesting Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using existing or legacy Java libraries [1:9].
The Scala programming language is used to implement and evaluate the machine learning techniques covered in Scala for Machine Learning. The source code presented in the book has been reduced to the minimum essential to the understanding of machine learning algorithms. The formal implementation of these algorithms is available on the website of Packt Publishing, http://www.packtpub.com.
The source code presented throughout the book follows a simple style guide and set of conventions.
Most of the Scala classes discussed in the book are parameterized with a type associated with a discrete/categorical value (Int) or a continuous value (Double) [1:10]. For this book, context bounds are used instead of view bounds, as follows:
class A[T: ToInt](param: Param)    // implicit conversion to Int
class C[T: ToDouble](param: Param) // implicit conversion to Double
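The following sketch shows how a context bound is resolved at the point where a class is instantiated. The definition of ToDouble as a small trait is an assumption made for this example (the book's own definition may differ), and Stats, intToDouble, and meanOfInts are hypothetical names.

```scala
object ContextBoundSketch {
  // Assumed evidence type: knows how to convert a T to a Double
  trait ToDouble[T] { def apply(t: T): Double }

  // Context bound: an implicit ToDouble[T] must be in scope where Stats is created
  class Stats[T: ToDouble](values: Vector[T]) {
    require(values.nonEmpty, "values must not be empty")
    private val toDouble = implicitly[ToDouble[T]]
    def mean: Double = values.map(v => toDouble(v)).sum / values.size
  }

  // Evidence for Int, picked up implicitly below
  implicit val intToDouble: ToDouble[Int] = new ToDouble[Int] {
    def apply(n: Int): Double = n.toDouble
  }

  val meanOfInts: Double = new Stats(Vector(1, 2, 3, 4)).mean
}
```

Unlike a view bound, the context bound does not make T implicitly convertible everywhere; the conversion is applied explicitly through the evidence value.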
For the sake of readability of the implementation of algorithms, code nonessential to the understanding of a concept or algorithm, such as error checking, comments, exceptions, or imports, is omitted. The following code elements are omitted from the code snippets presented in the book:
Code documentation:
// …..
/* … */
Validation of class parameters and method arguments:
require( Math.abs(x) < EPS, " …")
Class qualifiers and scope declaration:
final protected class SVM { … }
private[this] val lsError = …
Method qualifiers:
final protected def dot: = …
Exceptions:
try { correlate … } catch { case e: MathException => …. }
Try { .. } match { case Success(res) => case Failure(e) => .. }
Logging and debugging code:
private val logger = Logger.getLogger("..")
logger.info( … )
Non-essential annotation:
@inline def main = ….
@throws(classOf[IllegalStateException])
Non-essential methods
The complete list of Scala code elements omitted in the code snippets in the book can be found in the Code snippets format section in the Appendix.
The algorithms presented in this book share the same primitive types, generic operators, and implicit conversions. For the sake of the readability of the code, the following primitive types will be used:
type DblPair = (Double, Double)
type DblArray = Array[Double]
type DblMatrix = Array[DblArray]
type DblVec = Vector[Double]
type XSeries[T] = Vector[T] // One dimensional vector
type XVSeries[T] = Vector[Array[T]] // multidimensional vector
Time series, introduced in the Time series section in Chapter 3, Data Preprocessing, are implemented as XSeries[T] or XVSeries[T] of the parameterized type T.
Make a note of these six types; they are used across the entire book.
The conversion between the primitive types listed above and types introduced in the particular library (that is, the Apache Commons Math library) is described in the relevant chapters.
It is usually a good idea to reduce the number of states of an object. A method invocation transitions an object from one state to another. The larger the number of methods or states, the more cumbersome the testing process becomes.
For example, there is no point in creating a model that is not defined (trained). It therefore makes sense to train the model within the constructor of the class that implements it. As a result, the only public methods of a machine learning algorithm are the following:
Classification or prediction
Validation
Retrieval of model parameters (weights, latent variables, hidden states, and so on) if needed
Note
Performance of Scala iterators
The evaluation of the performance of Scala high-order iterative methods is beyond the scope of this book. However, it is important to be aware of the trade-off of each method. For instance, the monadic for expression is to be avoided as a counting iterator. The source code presented in this book uses the higher-order method foreach for iterative counting.
This final section introduces the key elements of the training and classification workflow. A test case using a simple logistic regression is used to illustrate each step of the computational workflow.
The book relies on financial data in order to experiment with different learning strategies. The objective of the exercise is to build a model that can discriminate between volatile and nonvolatile trading sessions for stock or commodities. For the first example, we have selected a simplified version of the binomial logistic regression as our classifier, as we treat stock price-volume action as a continuous or pseudo-continuous process.
Note
Introduction to logistic regression
Logistic regression is treated in depth in the Logistic regression section in Chapter 9, Regression and Regularization. The model treated in this example is the simple binomial logistic regression classifier for two-dimensional observations.
The classification of trading sessions according to their volatility and volume follows these steps:
Scoping the problem.
Loading data.
Preprocessing raw data.
Discovering patterns, whenever possible.
Implementing the classifier.
Evaluating the model.
The objective here is to create a model for stock price using its daily trading volume and volatility. Throughout the book, we will rely on financial data to evaluate and discuss the merits of different data processing and machine learning methods. In this example, the data is extracted from Yahoo Finance using the CSV format with the following fields:
Date
Price at open
Highest price in session
Lowest price in session
Price at session close
Volume
Adjusted price at session close
The enumerator YahooFinancials extracts historical daily trading information from the Yahoo Finance site:
type Features = Array[Double]
type Weights = Array[Double]
type ObsSet = Vector[Features]
type Fields = Array[String]

object YahooFinancials extends Enumeration {
  type YahooFinancials = Value
  val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

  def toDouble(v: Value): Fields => Double = //1
    (s: Fields) => s(v.id).toDouble
  def toArray(vs: Array[Value]): Fields => Features = //2
    (s: Fields) => vs.map(v => s(v.id).toDouble)
}
The method toDouble converts an array of strings into a single value (line 1) and toArray converts an array of strings into an array of values (line 2). The enumerator YahooFinancials is described in detail in the Data sources section in the Appendix.
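As a quick illustration of how these two converters behave, the following sketch applies them to a single row. The enumeration is repeated so that the example is self-contained, and the sample row is made up for the purpose of the illustration; it is not actual market data.

```scala
object YahooFieldsSketch {
  type Fields = Array[String]
  type Features = Array[Double]

  object YahooFinancials extends Enumeration {
    type YahooFinancials = Value
    val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value

    def toDouble(v: Value): Fields => Double =
      (s: Fields) => s(v.id).toDouble
    def toArray(vs: Array[Value]): Fields => Features =
      (s: Fields) => vs.map(v => s(v.id).toDouble)
  }
  import YahooFinancials._

  // Made-up row in the order DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE
  val row: Fields = "2013-01-02,20.0,21.0,19.5,20.5,1000000,20.4".split(",")

  // The id of each enumeration value is the index of the matching CSV column
  val close: Double = toDouble(CLOSE)(row)
  val lowHighVolume: Features = toArray(Array(LOW, HIGH, VOLUME))(row)
}
```

Because each converter returns a function of type Fields => ..., it can be built once and then mapped over every row of the file.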
Let's create a simple program that loads the content of the file, executes some simple preprocessing functions, and creates a simple model. We selected the CSCO stock price between January 1, 2012 and December 1, 2013 as our data input.
Let's consider two variables, price and volume, as illustrated by the following screenshot. The top graph displays the variation of the price of Cisco stock over time and the bottom bar chart represents the daily trading volume on Cisco stock over time:
The second step is loading the dataset from local or remote data storage. Typically, a large dataset is loaded from a database or a distributed filesystem such as Hadoop Distributed File System (HDFS). The load method takes an absolute path name, then extracts and transforms the input data from the file into a time series of type Vector[DblPair]:
def load(fileName: String): Try[Vector[DblPair]] = Try {
  val src = Source.fromFile(fileName) //3
  val data = extract(src.getLines.map(_.split(",")).drop(1)) //4
  src.close //5
  data
}
The data file is opened through an invocation of the static method Source.fromFile (line 3), then the fields are extracted through a map before the header (the first row in the file) is removed using drop (line 4). The file has to be closed to avoid leaking the file handle (line 5).
Note
Data extraction
The method invocation pipeline Source.fromFile.getLines.map returns an Iterator, which can be traversed only once.
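The single-traversal behavior is easy to overlook, so here is a small sketch of it. The in-memory lines are a hypothetical stand-in for the content returned by Source.fromFile(...).getLines.

```scala
object IteratorOnceSketch {
  // Hypothetical in-memory lines standing in for the content of a CSV file
  def sampleLines: Iterator[String] =
    Iterator("Date,Low,High", "d1,1.0,2.0", "d2,3.0,4.0")

  // Split each line into fields and skip the header row
  val it: Iterator[Array[String]] = sampleLines.map(_.split(",")).drop(1)

  val firstPass: Vector[Array[String]] = it.toVector  // consumes the iterator
  val secondPass: Vector[Array[String]] = it.toVector // nothing is left to traverse
}
```

This is why the data should be materialized into a strict collection such as a Vector before the source is closed or the rows are processed a second time.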
The purpose of the extract method is to generate a time series of two variables (relative stock volatility and relative stock daily trading volume):
def extract(cols: Iterator[Fields]): ObsSet = {
  val features = Array[YahooFinancials](LOW, HIGH, VOLUME) //6
  val conversion = toArray(features) //7
  cols.map(conversion(_)).toVector
    .map(x => Array[Double](1.0 - x(0)/x(1), x(2))) //8
}
The only purpose of the extract method is to convert the raw textual data into a two-dimensional time series. The first step consists of selecting the three features to extract: LOW (lowest stock price in the session), HIGH (highest price in the session), and VOLUME (trading volume for the session) (line 6). This feature set is used to convert each line of fields into a corresponding set of three values (line 7). Finally, the feature set is reduced to two variables (line 8):
Relative volatility of stock price in a session, 1.0 – LOW/HIGH
Trading volume for the stock in the session, VOLUME
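The reduction from three values to two variables can be sketched for a single session as follows. The features function and its argument names are illustrative and not part of the book's code.

```scala
object SessionFeaturesSketch {
  // Relative volatility (1 - LOW/HIGH) and raw trading volume for one session
  def features(low: Double, high: Double, volume: Double): Array[Double] = {
    require(high > 0.0, "highest price must be positive")
    Array(1.0 - low / high, volume)
  }
}
```

For instance, a session with a low of 15.0 and a high of 20.0 has a relative volatility of 1.0 - 15.0/20.0 = 0.25.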
Note
Code readability
A long pipeline of Scala high-order methods makes the code and its underlying logic quite difficult to read. It is recommended to take long chains of method calls, such as the following:
val cols = Source.fromFile.getLines.map(_.split(",")).toArray.drop(1)
Then, break them down into several steps:
val lines = Source.fromFile.getLines
val fields = lines.map(_.split(",")).toArray
val cols = fields.drop(1)
We strongly encourage the reader to consult the excellent guide Effective Scala written by Marius Eriksen from Twitter. This is definitely a must-read for any Scala developer [1:11].
The next step is to normalize the data in the range [0.0, 1.0] to be trained by the binomial logistic regression. It is time to introduce an immutable and flexible normalization class.
Logistic regression relies on the sigmoid curve or logistic function described in the Logistic function section in Chapter 9, Regression and Regularization. The logistic function is used to segregate training data into classes. The output value of the logistic function ranges from 0 for x = -INFINITY to 1 for x = +INFINITY. Therefore, it makes sense to normalize the input data or observation over [0, 1].
Note
To normalize or not normalize?
The purpose of normalizing data is to impose a single range of values for all the features, so the model does not favor any particular feature. Normalization techniques include linear normalization and Zscore. Normalization is an expensive operation that is not always needed.
Normalization is a linear transformation on the raw data that can be generalized to any range [l, h].
Note
Linear normalization
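M2: [0, 1] normalization of features {x_{i}} with minimum x_{min} and maximum x_{max} values:
\tilde{x}_{i} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}
M3: [l, h] normalization of features {x_{i}}:
\tilde{x}_{i} = l + (h - l)\frac{x_{i} - x_{min}}{x_{max} - x_{min}}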
The normalization of input data in supervised learning has a specific requirement: the classification and prediction of new observations have to use the normalization parameters (min, max) extracted from the training set, so all observations share the same scaling factor.
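The requirement above can be sketched independently of the class introduced next. In this illustration, the names Scale and fit are assumptions, and clamping out-of-range values to the target interval is one possible design choice among others.

```scala
object MinMaxSketch {
  // Normalization parameters, extracted once from the training set
  final case class Scale(min: Double, max: Double, low: Double, high: Double) {
    require(max > min, "max must be greater than min")
    private val ratio = (high - low) / (max - min)

    // New values outside [min, max] are clamped to the target range
    def apply(x: Double): Double =
      if (x <= min) low
      else if (x >= max) high
      else (x - min) * ratio + low
  }

  // Compute the parameters from the training data only
  def fit(training: Vector[Double], low: Double = 0.0, high: Double = 1.0): Scale =
    Scale(training.min, training.max, low, high)
}
```

A Scale fitted on the training set is then reused as is for every new observation, which guarantees that training and test data share the same scaling factor.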
Let's define the normalization class MinMax. The class is immutable: the minimum, min, and maximum, max, values are computed within the constructor. The class takes a time series of the parameterized type T, values, as an argument (line 8). The steps of the normalization process are defined as follows:
Initialize the minimum and maximum values for a given time series during instantiation (line 9).
Compute the normalization parameters (line 10) and normalize the input data (line 11).
Normalize any new data point reusing the normalization parameters (line 14):
class MinMax[T : ToDouble](val values: Vector[T]) { //8
  val zero = (Double.MaxValue, -Double.MaxValue)
  val (min, max) = values./:(zero){ case ((mn, mx), x) => {
    val z = implicitly[ToDouble[T]].apply(x)
    (if(z < mn) z else mn, if(z > mx) z else mx) //9
  }}

  case class ScaleFactors(low: Double, high: Double, ratio: Double)
  var scaleFactors: Option[ScaleFactors] = None //10

  def normalize(low: Double, high: Double): Vector[Double] //11
  def normalize(value: Double): Double
}
The class constructor computes the tuple of minimum and maximum values (min, max) using a fold (line 9). The scaling parameters scaleFactors are computed during the normalization of the time series (line 11), described as follows. The method normalize initializes the scaling factors (line 12) before normalizing the input data (line 13):
def normalize(low: Double, high: Double): Vector[Double] =
  setScaleFactors(low, high).map( scale => { //12
    values.map(x => {
      val z = implicitly[ToDouble[T]].apply(x)
      (z - min)*scale.ratio + scale.low //13
    })
  }).getOrElse(/* … */)

def setScaleFactors(l: Double, h: Double): Option[ScaleFactors] = {
  // .. error handling code
  Some(ScaleFactors(l, h, (h - l)/(max - min)))
}
Subsequent observations use the same scaling factors extracted from the input time series in normalize (line 14):
def normalize(value: Double): Double = scaleFactors.map(
  scale =>
    if(value < min) scale.low
    else if (value > max) scale.high
    else (value - min)*scale.ratio + scale.low
).getOrElse( /* … */)
The class MinMax normalizes single variable observations.
Note
Statistics class
The class that extracts the basic statistics from a dataset, Stats, introduced in the Profiling data section in Chapter 2, Data Pipelines, inherits the class MinMax.
The test case with the binomial logistic regression uses a multiple variable normalization, implemented by the class MinMaxVector, which takes observations of type Vector[Array[Double]] as input:
class MinMaxVector(series: Vector[Array[Double]]) {
  val minMaxVector: Vector[MinMax[Double]] = //15
    series.transpose.map(new MinMax[Double](_))
  def normalize(low: Double, high: Double): Vector[Array[Double]]
}
The constructor of the class MinMaxVector transposes the vector of arrays of observations in order to compute the minimum and maximum values for each dimension (line 15).
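The transposition step can be illustrated in isolation. The sketch below, with a hypothetical minMax helper, turns a vector of observations into one column per feature and extracts the range of each column.

```scala
object PerFeatureMinMaxSketch {
  // Transpose observations (rows) into features (columns),
  // then compute the (min, max) pair of each feature
  def minMax(obs: Vector[Array[Double]]): Vector[(Double, Double)] =
    obs.map(_.toVector).transpose.map(col => (col.min, col.max))
}
```

For two observations (1.0, 10.0) and (3.0, 5.0), the first feature ranges over [1.0, 3.0] and the second over [5.0, 10.0].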
The price action chart has a very interesting characteristic.
At a closer look, a sudden change in price and increase in volume occurs about every 3 months or so. Experienced investors will undoubtedly recognize that these price-volume patterns are related to the release of quarterly earnings of Cisco. Such a regular but unpredictable pattern can be a source of concern or opportunity if risk can be properly managed. The strong reaction of the stock price to the release of corporate earnings may scare some long-term investors while enticing day traders.
The following graph visualizes the potential correlation between sudden price change (volatility) and heavy trading volume:
The next section is not required for the understanding of the test case. It illustrates the capabilities of JFreeChart as a simple visualization and plotting library.
Although charting is not the primary goal of this book, we thought that you would benefit from a brief introduction to JFreeChart.
Note
Plotting classes
This section illustrates a simple Scala interface to JFreeChart Java classes. Its reading is not required for the understanding of machine learning. The visualization of the results of a computation is beyond the scope of this book.
Some of the classes used in visualization are described in the Appendix.
The dataset (volatility, volume) is converted into internal JFreeChart data structures.
The following code snippet defines the key components of a simple scatter plot:
class ScatterPlot(config: PlotInfo, theme: PlotTheme) { //16
  def display(xy: Vector[DblPair], width: Int, height: Int) //17
  // ….
}
The class ScatterPlot implements a simple configurable scatter plot with the following arguments:
config: Information, labels, and fonts of the plot
theme: Predefined theme for the plot (black, white background, and so on)
The class PlotTheme defines a specific theme or preconfiguration of the chart (line 16). The class offers a set of methods with the name display to accommodate a wide range of data structures and configurations (line 17).
Note
Visualization
The JFreeChart library is introduced as a robust charting tool. The code related to plots and charts is omitted throughout the book in order to keep the code snippets concise. On a few occasions, output data is formatted in a CSV file to be imported into a spreadsheet.
The ScatterPlot.display method is used to display the normalized input data used in the binomial logistic regression, as follows:
val plot = new ScatterPlot(("CSCO 2012-13 Model features",
  "Normalized session volatility", "Normalized session Volume"),
  new BlackPlotTheme)
plot.display(volatilityVolume, 250, 340)
The invocation of the method display generates the following output:
The scatter plot shows some level of correlation between session volume and session volatility and confirms the initial finding in the stock price and volume chart. We can leverage this information to classify trading sessions by their volatility and volume. The next step is to create a two-class model by loading a training set, observations, and expected values into our logistic regression algorithm. The classes are delimited by a decision boundary (also known as a hyperplane) drawn onto the scatter plot.
The objective of this training is to build a model that can discriminate between volatile and nonvolatile trading sessions. For the sake of the exercise, session volatility is defined as the relative difference between a session's highest price and lowest price. The total trading volume within a session constitutes the second parameter of the model. The relative price movement within a trading session (that is, closing price/open price - 1) is our expected value or label.
Logistic regression is commonly used in statistical inference.
Note
Logistic regression model (M4)
Given a model with weights w_{i}, the margin f and the logistic function l are defined as:
f(x \mid w) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i}
l(x \mid w) = \frac{1}{1 + e^{-f(x \mid w)}}
The first weight w_{0} is known as the intercept. The binomial logistic regression is described in detail in the Logistic regression section in Chapter 9, Regression and Regularization.
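The margin (the intercept plus the weighted sum of the features) and the logistic function can be sketched in a few lines. The function names and the Array-based signatures are assumptions made for this illustration; they are not the implementation presented later in the chapter.

```scala
object LogisticSketch {
  // Margin: weights(0) is the intercept, weights(i) multiplies the i-th feature
  def margin(x: Array[Double], weights: Array[Double]): Double = {
    require(weights.length == x.length + 1,
      "one weight per feature plus the intercept is expected")
    weights.tail.zip(x).foldLeft(weights.head) {
      case (s, (w, xi)) => s + w * xi
    }
  }

  // Logistic (sigmoid) function, with values in (0, 1)
  def logistic(f: Double): Double = 1.0 / (1.0 + math.exp(-f))
}
```

A margin of 0.0 maps to a probability of 0.5, which is why the sign of the margin can serve as the decision boundary between the two classes.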
The following implementation of the binomial logistic regression classifier exposes a single method, classify, to comply with our desire to reduce the complexity and life cycle of objects. The model parameters, weights, are computed during training when the class/model LogBinRegression is instantiated. As mentioned earlier, the sections of the code nonessential to the understanding of the algorithm are omitted.
The constructor LogBinRegression has five arguments (line 18):
observations: Vector of observations representing volume and volatility
expected: A vector of expected values (relative price movement)
maxIters: The maximum number of iterations allowed for the optimizer to extract the regression weights during training
eta: Learning or training rate
eps: The maximum value of the error (predicted – expected) for which the model is valid:
class LogBinRegression(
    observations: Vector[Features],
    expected: Vector[Double],
    maxIters: Int,
    eta: Double,
    eps: Double) { //18
  val model: LogBinRegressionModel = train //19
  def classify(obs: Features): Try[(Int, Double)] //20
  def train: LogBinRegressionModel
  def intercept(weights: Weights): Double
  …
}
The model LogBinRegressionModel is generated through training during the instantiation of the logistic regression class, LogBinRegression (line 19):
case class LogBinRegressionModel(
weights: Weights,
losses: List[Double]
)
The model is fully defined by its weights as described in the mathematical formula M4. The intercept weights(0) represents the mean value of the prediction for observations whose variables are zero. The intercept does not have a specific meaning in most cases and it is not always computable. The list losses contains the logistic loss collected at each iteration. It is used for debugging purposes.
Note
To intercept or not intercept?
The intercept corresponds to the value of weights when the observations have null values. It is a common practice to estimate, whenever possible, the intercept for binomial linear or logistic regressions independently from the slope of the model in the minimization of the error function. The multinomial regression models treat the intercept or weight w0 as part of the regression model, as described in the Ordinary least square regression section of Chapter 9, Regression and Regularization.
The following code snippet implements the computation of the intercept given a model, Weights:
def intercept(weights: Weights): Double = {
  val zeroObs = obsSet.filter(!_.exists(_ > 0.01))
  if( zeroObs.size > 0)
    zeroObs.aggregate(0.0)(
      (s,z) => s + dot(z, weights), _ + _
    )/zeroObs.size
  else 0.0
}
The classify method takes new observations as input and computes the index of the classes (0 or 1) that the observations belong to, along with the actual likelihood (line 20).
The goal of the training of a model using expected values is to compute the optimal weights that minimize the error or loss function.
Note
Least squares or logistic loss
The sum of least squares loss is more often used for regression problems while the logistic loss is more commonly applied to classification.
We select the Stochastic Gradient Descent (SGD) algorithm to minimize the cumulative error between the predicted and expected values for all the observations. Although there are quite a few alternative optimizers, the SGD is quite robust and simple enough for this first chapter. The algorithm consists of updating the weights w_{i} of the regression model by minimizing the cost.
For those interested in learning about optimization techniques, the Summary of optimization techniques section in the Appendix presents an overview of the most commonly used optimizers. The stochastic gradient descent is used for the training of the multilayer perceptron (refer to the Training epoch subsection in the Multilayer perceptron (MLP) section of Chapter 10, Multilayer Perceptron, for more detail).
The execution of the SGD algorithm follows these steps:
Initialize the weights of the regression model.
Shuffle the order of observations and expected pair of values.
Select the first pair of observations and expected value.
Compute the loss for this pair.
Update the model weights using the derivatives of the loss over each weight.
Repeat from step 3 until either the maximum number of iterations is reached or the incremental update of the loss is close to zero.
The purpose of shuffling the order of the observations between iterations is to avoid the minimization of the cost reaching a local minimum.
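The steps above can be sketched as a compact, self-contained function. This is an illustration under stated assumptions, not the implementation presented in this chapter: labels are assumed to take the values -1.0 or +1.0, the observations are shuffled once, and the loop stops on the iteration count only (the convergence test on the loss is left out for brevity). The names SgdSketch, train, and seed are hypothetical.

```scala
import scala.util.Random

object SgdSketch {
  type Features = Array[Double]
  type Weights = Array[Double]

  // Margin: w(0) is the intercept, the other weights multiply the features
  def margin(x: Features, w: Weights): Double =
    w.tail.zip(x).foldLeft(w.head) { case (s, (wi, xi)) => s + wi * xi }

  // Logistic loss for a label y in {-1, +1} and a margin f
  def loss(y: Double, f: Double): Double = math.log(1.0 + math.exp(-y * f))

  def train(
      data: Vector[(Features, Double)],
      eta: Double,
      maxIters: Int,
      seed: Long): Weights = {
    val rand = new Random(seed)
    // Step 1: random initial weights (one per feature plus the intercept)
    val init = Array.fill(data.head._1.length + 1)(rand.nextDouble())
    // Step 2: shuffle the (observation, expected value) pairs
    val shuffled = rand.shuffle(data)

    @annotation.tailrec
    def sgd(iter: Int, w: Weights): Weights =
      if (iter >= maxIters) w
      else {
        // Step 3: select the next pair, wrapping around with a modulo
        val (x, y) = shuffled(iter % shuffled.size)
        // Steps 4 and 5: derivative of the loss over the margin, then over each weight
        val dLoss = -y / (1.0 + math.exp(y * margin(x, w)))
        val grad = dLoss +: x.map(_ * dLoss)
        sgd(iter + 1, w.zip(grad).map { case (wi, g) => wi - eta * g })
      }

    sgd(0, init)
  }
}
```

On a small, linearly separable dataset, the sign of the margin computed with the trained weights separates the two classes; passing an explicit seed keeps the run reproducible.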
Note
Batch and SGD
The SGD is a variant of the gradient descent which updates the model weights after computing the error on each observation. Although the SGD requires a higher computation effort to process each observation, it converges toward the optimal value of weights fairly quickly after a small number of iterations. However, the SGD is sensitive to the initial value of the weights and the selection of the learning rate, which is usually defined by an adaptive formula.
The training method, train, consists of iterating through the computation of the weights using a simple gradient descent method. The method train computes the weights, collects the logistic loss, losses, at each iteration, and returns an instance of the model LogBinRegressionModel. The code is represented here:
def train: LogBinRegressionModel = {
  val nWeights = observations.head.length + 1 //21
  val init = Array.fill(nWeights)(Random.nextDouble) //22
  val (weights, losses) = sgd(0, init, List[Double]())
  new LogBinRegressionModel(weights, losses.reverse) //23
}
The method train extracts the number of weights, nWeights, for the regression model as the number of variables in each observation + 1 (line 21). The method initializes the weights with random values over [0, 1] (line 22). The weights are computed through the tail recursive method sgd and the method returns a new model for the binomial logistic regression (line 23).
Note
Unwrapping values from Try:
It is not usually recommended to invoke the method get on a Try value, unless it is enclosed in a Try statement. The best course of action is to do the following:
- Catch the failure with match { case Success(m) => … case Failure(e) => … }
- Extract the result safely with getOrElse( /* … */ )
- Propagate the result as a Try type with map( _.m)
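The three options listed in this note can be illustrated on a small parsing function. The parse helper and the method names are hypothetical and only serve the example.

```scala
import scala.util.{Try, Success, Failure}

object TrySketch {
  // Hypothetical computation that may fail
  def parse(s: String): Try[Double] = Try(s.trim.toDouble)

  // 1. Catch the failure explicitly with a pattern match
  def describe(s: String): String = parse(s) match {
    case Success(x) => s"parsed $x"
    case Failure(e) => s"failed: ${e.getClass.getSimpleName}"
  }

  // 2. Extract the result safely with a default value
  def orZero(s: String): Double = parse(s).getOrElse(0.0)

  // 3. Propagate the result as a Try and let the caller decide
  def doubled(s: String): Try[Double] = parse(s).map(_ * 2.0)
}
```

None of these variants calls get, so a failure never escapes as an unexpected exception.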
Let's look at the computation for the weights through the minimization of the loss function in the sgd method:
val shuffled = shuffle(observations.zip(expected)) //24

@tailrec
def sgd(
    nIters: Int,
    weights: Weights, //25
    losses: List[Double]): (Weights, List[Double]) = { //26
  if(nIters >= maxIters) (weights, losses) //27
  else {
    val (x, y) = shuffled(nIters % observations.size)
    val (newLoss, grad) = {
      val yDot = y * margin(x, weights)
      val gradient = derivativeLoss(y, yDot)
      (logisticLoss(yDot), //28
       Array[Double](gradient) ++ x.map(_ * gradient)) //29
    }
    if(newLoss < eps) //30
      (weights, newLoss :: losses) //31
    else {
      val newWeights = weights.zip(grad).map {
        case (w, df) => w - eta*df //33
      }
      sgd(nIters + 1, //34
        newWeights, newLoss :: losses)
    }
  }
}
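The sgd method recurses on the following arguments: the iteration count nIters, the current weights (line 25), and the list of losses collected so far (line 26).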
The method returns the pair of weights and the list of losses computed at each iteration if the maximum number of iterations allowed for the optimization is reached (line 27). The client code evaluates either the size of the losses list or extracts its head value to validate whether SGD converged.
Note
SGD exit strategies
There are many different possible behaviors when the SGD reaches the maximum allowed number of iterations:
Returns the final weights with a warning or a flag
Throws an exception with a recovery mechanism
Allows more iterations
The formula, M4, for the computation of the loss (line 28) and the formula, M5, for the gradient of the loss over each weight (line 29) rely on two simple methods: logisticLoss and derivativeLoss. The code is as follows:
def logisticLoss(z: Double): Double =
  log(1.0 + exp(-z)) / observations.size //30
def derivativeLoss(y: Double, yDot: Double): Double =
  -y / (1.0 + exp(yDot))
The logistic loss is normalized by the number of observations (line 30).
The method evaluates the new loss against the convergence criterion eps (line 30) and returns the pair (weights, losses) (line 31) if the SGD converges. The formula M4 that updates the weights is implemented by zipping the weights and the gradient (line 33). The next invocation of SGD selects the next observation in the shuffled sequence of observations using a modulo operator to avoid overflowing (line 34).
Finally, here is an example of implementation of the margin formula:
def margin(observation: Features, weights: Weights):Double =
weights.drop(1).zip(observation.view)
.aggregate(weights.head)(dot, _ + _)
This implementation of the margin includes the intercept with its weight associated to the bias, a feature of the value 1.0.
Note
Bias value
The purpose of the bias value is to prepend 1.0 to the vector of an observation so that it can be directly processed (that is, zip, dot) with the weights. For instance, a regression model for two-dimensional observations (x, y) has three weights (w_{0}, w_{1}, w_{2}). The bias value, +1, is prepended to the observations to compute the predicted value, 1.0·w_{0} + x·w_{1} + y·w_{2}.
This technique is used in the computation of the activation function of the multilayer perceptron as described in the Multilayer perceptron section in Chapter 9, Artificial.
The sequence of observations is randomly shuffled before the SGD is computed. This implementation of shuffling relies on the Scala standard library method, scala.util.Random.shuffle [1:13].
Note
Fisher-Yates shuffling
The Training and classification subsection in the Multilayer perceptron (MLP) section of Chapter 10, Multilayer Perceptron, describes an alternative and efficient shuffling algorithm.
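As a rough illustration of the Fisher-Yates algorithm named in this note, here is a generic sketch. It is not the implementation described in Chapter 10; the function name and the choice of working on a mutable copy are assumptions.

```scala
import scala.util.Random

object ShuffleSketch {
  // Fisher-Yates shuffle applied to a mutable copy of the input
  def fisherYates[T](xs: Vector[T], rand: Random): Vector[T] = {
    val buf = xs.toBuffer
    var i = buf.size - 1
    while (i > 0) {
      // Pick a position in [0, i] and swap it with position i
      val j = rand.nextInt(i + 1)
      val tmp = buf(i)
      buf(i) = buf(j)
      buf(j) = tmp
      i -= 1
    }
    buf.toVector
  }
}
```

The shuffle runs in linear time and returns a permutation of the input, leaving the original vector untouched.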
In order to train the model, we need to label input data. The labeling process consists of associating the relative price movement during a session (price at close/price at open – 1) with one of two configurations:
Volatile trading sessions with high trading volume
Trading sessions with low volatility and low trading volume
In this particular case, the labeling is automated because the relative price movement is extractable from raw data.
Once the model has been successfully created through training, it is available to classify new observations. The runtime classification of observations using the binomial logistic regression is implemented by the method classify:
def classify(obs: Features): Try[(Int, Double)] = Try {
  val linear = margin(obs, model.weights) + model.weights(0) //37
  val prediction = sigmoid(linear)
  (if(linear > 0.0) 1 else 0, prediction) //38
}
The method applies the logistic function to the linear inner product, linear, of the new observation, obs, and the weights of the model (line 37). The method returns the tuple (the predicted class of the observation {0, 1}, prediction value), where the class is defined by comparing the linear inner product to the boundary value 0.0 (line 38).
The computation of the margin as product of weights and observations is as follows:
def margin(obs: Features, weights: Weights): Double =
  weights.drop(1).zip(obs.view)
    .aggregate(0.0)({ case (s, (w, x)) => s + w*x }, _ + _)
The margin method is used in the classify method.
The first step is to define the configuration parameters for the test: the maximum number of iterations, NITERS, the convergence criterion, EPS, the learning rate, ETA, the decision boundary used to label training observations, BOUNDARY, and the paths to the training and test sets:
val NITERS = 4096; val EPS = 0.001; val ETA = 0.0001
val path_training = "supervised/regression/CSCO.csv"
val path_test = "supervised/regression/CSCO2.csv"
The various activities of creating and testing the model (loading and normalizing data, training the model, then loading and classifying test data) are organized as a workflow using the monadic composition of the Try class:
for {
  path <- getPath(path_training)
  (volatility, vol) <- load(path) //39
  minMaxVec <- Try(new MinMaxVector(volatility)) //40
  normVolatilityVol <- Try(minMaxVec.normalize(0.0, 1.0)) //41
  classifier <- logRegr(normVolatilityVol, vol) //42
  testValues <- load(path_test) //43
  normTestValue0 <- minMaxVec.normalize(testValues._1(0)) //44
  class0 <- classifier.classify(normTestValue0) //45
  normTestValue1 <- minMaxVec.normalize(testValues._1(1))
  class1 <- classifier.classify(normTestValue1)
} yield {
  val modelStr = model.toString
}
At first, the daily trading volatility and volume for the stock price, the (volatility, vol) pairs, are loaded from file (line 39). The workflow initializes the multidimensional normalizer, minMaxVec (line 40), and uses it to normalize the training set (line 41). The logRegr method instantiates the binomial logistic regression, classifier (line 42). The test data, testValues, is loaded from file (line 43), normalized using minMaxVec, which has already been applied to the training data (line 44), and classified (line 45).
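The monadic composition of Try used in this workflow can be sketched on a much smaller pipeline. The load and normalize functions below are simplified, hypothetical stand-ins that operate on an in-memory string rather than on a file.

```scala
import scala.util.Try

object WorkflowSketch {
  // Hypothetical loader: parses comma-separated values
  def load(raw: String): Try[Vector[Double]] =
    Try(raw.split(",").toVector.map(_.trim.toDouble))

  // Hypothetical [0, 1] normalization that fails on a constant series
  def normalize(xs: Vector[Double]): Try[Vector[Double]] = Try {
    val (mn, mx) = (xs.min, xs.max)
    require(mx > mn, "a constant series cannot be normalized")
    xs.map(x => (x - mn) / (mx - mn))
  }

  // Each step runs only if the previous one succeeded;
  // the first failure short-circuits the rest of the workflow
  def run(raw: String): Try[Double] =
    for {
      values <- load(raw)
      normalized <- normalize(values)
    } yield normalized.sum / normalized.size
}
```

The caller receives a single Try value, so error handling is done once at the end of the workflow rather than after each step.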
The method load extracts the data (observations) of type XVSeries[Double] from the file. The heavy lifting is done by the extract method (line 46), and then the file handle is closed (line 47) before returning the vector of raw observations:
type Labels = (Vector[Features], Vector[Double])
def load(fileName: String): Try[Labels] = Try {
  val src = Source.fromFile(fileName)
  val data = extract(src.getLines.map(_.split(",")).drop(1)) //46
  src.close; data //47
}
The method logRegr, implemented in the following code snippet, has two purposes:
Automatically labeling the observations, obs, to generate the real values after normalization (line 48)
Initializing (instantiating and training the model) the binomial logistic regression (line 49):
def logRegr(x: Labels): Try[LogBinRegression] = Try {
  val (obs, real) = x
  val normReal = normalize(real)
    .getOrElse(Vector.empty[Double]) //48
  new LogBinRegression(obs, normReal, NITERS, ETA, EPS) //49
}
Note
Validation
The simple classification in this test case is provided to illustrate the runtime application of the model. It does not constitute a validation of the model by any stretch of the imagination. The next chapter digs into validation methodologies (refer to the Assessing a model section of Chapter 2, Data Pipelines, for more detail).
The training run is performed with three different values of the learning rate. The following chart illustrates the convergence of the batch gradient descent in the minimization of the cost given different values of learning rates:
As expected, the execution of the optimizer with a higher learning rate produces the steepest descent in the cost function.
The execution of the test produces the following model:
iters = 495
weights: 0.859,3.617,6.927
input (0.0088, 4.10E7) normalized (0.063,0.061) class 1 prediction 0.5156
input (0.0694, 3.68E8) normalized (0.517,0.641) class 0 prediction 0.0012
These values may differ between experiments as the initial weights of the model are initialized randomly.
Note
Learning more about regressive models
The binomial logistic regression is merely used to illustrate the concept of training and prediction. It is described in detail in the Logistic regression section in Chapter 9, Regression and Regularization.
We hope you enjoyed this introduction to machine learning. You learned how to leverage your skills in Scala programming to create a simple logistic regression program for predicting stock price/volume action. Here are the highlights of this introductory chapter:
From monadic composition and high-order collection methods for parallelization to configurability and reusability patterns, Scala is the perfect fit to implement data mining and machine learning algorithms for large-scale projects.
There are many logical steps required to create and deploy a machine learning model.
The implementation of the binomial logistic regression classifier presented as part of the test case is simple enough to encourage you to learn how to write and apply more advanced machine learning algorithms.
To the delight of Scala programming aficionados, the next chapter will dig deeper into building a flexible workflow by leveraging monadic data transformation and stackable traits.