Logistic regression is commonly used in statistical inference.

The following implementation of the binomial logistic regression classifier exposes a single `classify` method, in keeping with our desire to reduce the complexity and life cycle of objects. The model `weights` parameters are computed during training, when the `LogBinRegression` class/model is instantiated. As mentioned earlier, the sections of the code nonessential to the understanding of the algorithm are omitted.

The goal of training a model against expected values is to compute the optimal weights that minimize the **error** or **cost function**. We select the **batch gradient descent** algorithm to minimize the cumulative error between the predicted and expected values over all the observations. Although there are quite a few alternative optimizers, gradient descent is robust and simple enough for this first chapter. The algorithm consists of updating the weights *w*_{i} of the regression model by minimizing the cost.

### Note

**Cost function**

M5: Cost (or *compound error = predicted – expected*) and its derivative over the weights:

*J*(*w*) = –1/*n*.Σ_{j} [*y*_{j}.log(*f*(*w*.*x*_{j})) + (1 – *y*_{j}).log(1 – *f*(*w*.*x*_{j}))]

∂*J*/∂*w*_{i} = Σ_{j} (*f*(*w*.*x*_{j}) – *y*_{j}).*x*_{j,i}

where *f* is the logistic function, *x*_{j} an observation, and *y*_{j} its expected value.

M6: The batch gradient descent method to update the model weights *w*_{i}, with learning rate η, is as follows:

*w*_{i} ← *w*_{i} – η.∂*J*/∂*w*_{i}

For those interested in learning more about optimization techniques, the *Summary of optimization techniques* section in Appendix A, *Basic Concepts*, presents an overview of the most commonly used optimizers. The batch gradient descent method is also used for the training of the multilayer perceptron (refer to *The training epoch* section under *The multilayer perceptron* section in Chapter 9, *Artificial Neural Networks*).

The execution of the batch gradient descent algorithm follows these steps:

1. Initialize the weights of the regression model.
2. Shuffle the order of observations and expected values.
3. Aggregate the cost or error for the entire observation set.
4. Update the model weights using the cost as the objective function.
5. Repeat from step 2 until either the maximum number of iterations is reached or the incremental update of the cost is close to zero.
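The steps above can be sketched as follows. This is a minimal, self-contained illustration, not the book's listing: the names `margin`, `step`, and the toy `train` loop are assumptions, the convergence test simply compares successive weights, and the standard `scala.util.Random.shuffle` stands in for the custom shuffle discussed later in this section.

```scala
import scala.util.Random

// Logistic (sigmoid) function
def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// Inner product of weights and observation, with w.head acting as the bias
def margin(w: Array[Double], x: Array[Double]): Double =
  w.head + w.tail.zip(x).map { case (wi, xi) => wi * xi }.sum

// One batch step: accumulate error = predicted - expected over ALL
// observations (M5), then update every weight at once (M6)
def step(w: Array[Double], obs: Vector[(Array[Double], Double)],
         eta: Double): Array[Double] = {
  val grad = Array.fill(w.length)(0.0)
  obs.foreach { case (x, y) =>
    val err = logistic(margin(w, x)) - y      // predicted - expected
    grad(0) += err                            // bias component
    x.indices.foreach(i => grad(i + 1) += err * x(i))
  }
  w.zip(grad).map { case (wi, gi) => wi - eta * gi }
}

def train(obs: Vector[(Array[Double], Double)], eta: Double,
          maxIters: Int, eps: Double): Array[Double] = {
  // Step 1: initialize the weights with random values over [0, 1]
  var w = Array.fill(obs.head._1.length + 1)(Random.nextDouble())
  var it = 0
  var converged = false
  while (!converged && it < maxIters) {
    // Step 2: shuffle, then steps 3-4: aggregate cost and update weights
    val w2 = step(w, Random.shuffle(obs), eta)
    // Step 5: stop when the weight update is close to zero
    converged = w.zip(w2).map { case (a, b) => math.abs(a - b) }.max < eps
    w = w2
    it += 1
  }
  w
}
```

A run on a linearly separable toy set (negative values labeled 0, positive labeled 1) drives the weight on the single feature positive, so the logistic output separates the two classes.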

The purpose of **shuffling** the order of the observations between iterations is to prevent the minimization of the cost from settling into a local minimum.

### Tip

**Batch and stochastic gradient descent**

Stochastic gradient descent is a variant of gradient descent that updates the model weights after computing the error on each observation. Although stochastic gradient descent requires a higher computational effort to process each observation, it converges toward the optimal weights fairly quickly, after a small number of iterations. However, stochastic gradient descent is sensitive to the initial value of the weights and to the selection of the learning rate, which is usually defined by an adaptive formula.
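For contrast, here is a sketch of the per-observation update at the heart of the stochastic variant, under the same assumptions as before (logistic model, error = predicted – expected); `sgdEpoch` is an illustrative name, not the book's listing.

```scala
// Logistic (sigmoid) function
def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// One epoch of stochastic gradient descent: the weights are updated
// after the error is computed on EACH observation, rather than once
// per pass over the whole observation set.
def sgdEpoch(w0: Array[Double], obs: Vector[(Array[Double], Double)],
             eta: Double): Array[Double] =
  obs.foldLeft(w0) { case (w, (x, y)) =>
    val linear = w.head + w.tail.zip(x).map { case (wi, xi) => wi * xi }.sum
    val err = logistic(linear) - y            // predicted - expected
    val grad = err +: x.map(_ * err)          // bias derivative first
    w.zip(grad).map { case (wi, gi) => wi - eta * gi }
  }
```

Repeating the epoch over a small separable data set moves the feature weight in the correct direction after only a few passes.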

The `train` method consists of iterating through the computation of the weights using a simple gradient descent method. The method computes the `weights` and returns an instance of the `LogBinRegressionModel` model:

The `train` method extracts the number of weights, `nWeights`, for the regression model as the *number of variables in each observation + 1* (line `21`). The method initializes the `weights` with random values over [0, 1] (line `22`). The weights are computed through the tail recursive `gradientDescent` method, and the method returns a new model for the binomial logistic regression (line `23`).

### Tip

**Unwrapping values from Try**

It is usually not recommended to invoke the `get` method on a `Try` value, unless it is enclosed in a `Try` statement. The best course of action is to do one of the following:

1. Catch the failure with `match { case Success(m) => … case Failure(e) => … }`
2. Extract the result safely with `getOrElse( /* … */ )`
3. Propagate the result as a `Try` type with `map( _.m)`
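The three options can be illustrated with the standard `scala.util.Try` API; the `Model` case class and `train` function below are hypothetical stand-ins for the book's types.

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical result type standing in for LogBinRegressionModel
case class Model(weights: Array[Double])

// A training stub that fails on an empty input set
def train(input: Seq[Double]): Try[Model] =
  Try {
    require(input.nonEmpty, "empty training set")
    Model(input.toArray)
  }

// 1. Catch the failure with a pattern match on Success/Failure
val outcome: String = train(Seq(0.5, 1.5)) match {
  case Success(m) => s"trained ${m.weights.length} weights"
  case Failure(e) => s"failed: ${e.getMessage}"
}

// 2. Extract the result safely with a default value
val model: Model = train(Seq.empty[Double]).getOrElse(Model(Array(0.0)))

// 3. Propagate the result as a Try, mapping over the wrapped value
val nWeights: Try[Int] = train(Seq(0.5, 1.5)).map(_.weights.length)
```

None of the three options ever throws: the failure path either produces a value or stays wrapped in the `Try`.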

Let's take a look at the computation of the `weights` through the minimization of the cost function in the `gradientDescent` method:

The `gradientDescent` method recurses on the vector of pairs of observations and expected values, `obsAndLbl`, the `cost`, and the model `weights` (line `24`). It throws an exception if the maximum number of iterations allowed for the optimization is reached (line `25`). It shuffles the order of the observations (line `26`) before computing `errorGrad`, the derivatives of the cost over each weight (line `27`). The computation of the derivative of the cost (or *error = predicted value – expected value*) using formula **M5** returns a pair of cumulative cost and derivative values (line `28`).

Next, the method computes the overall compound cost using the formula **M4** (line `29`) and converts it into a relative incremental cost, `relativeError`, which is compared to the `eps` convergence criterion (line `30`). The method extracts the `derivatives` of the cost over the weights by transposing the matrix of errors, and then prepends the bias value `1.0` to match the array of weights (line `31`).

### Note

**Bias value**

The purpose of the bias value is to prepend `1.0` to the vector of observations so it can be directly processed (for example, zip and dot) with the weights. For instance, a regression model for two-dimensional observations (x, y) has three weights (*w*_{0}, *w*_{1}, *w*_{2}). The bias value +1 is prepended to the observations to compute the predicted value as *w*_{0}.1.0 + *w*_{1}.x + *w*_{2}.y.

This technique is used in the computation of the activation function of the multilayer perceptron, as described in the *The multilayer perceptron* section in Chapter 9, *Artificial Neural Networks*.
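As a quick illustration of the bias technique (the `dot` helper below is a sketch, not the book's listing):

```scala
// Prepending the bias +1 to the observation lets us zip the
// (bias, x, y) triple directly with the (w0, w1, w2) weights.
def dot(w: Array[Double], obs: Array[Double]): Double =
  (1.0 +: obs).zip(w).map { case (oi, wi) => oi * wi }.sum

val weights = Array(0.5, 2.0, -1.0)   // (w0, w1, w2)
val obs = Array(3.0, 4.0)             // two-dimensional observation (x, y)
// Predicted value: w0*1.0 + w1*x + w2*y = 0.5 + 6.0 - 4.0 = 2.5
val predicted = dot(weights, obs)
```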

The formula **M6** updates the weights for the next iteration (line `32`) before invoking the method with the new weights, cost, and iteration count (line `33`).

Let's take a look at the shuffling of the order of observations using a random sequence generator. The following implementation is an alternative to the Scala standard library method `scala.util.Random.shuffle` for shuffling elements of collections. The purpose is to change the order of observations and labels between iterations in order to prevent the optimizer from reaching a local minimum. The `shuffle` method permutes the order of the `labelObs` vector of observations by partitioning it into segments of random size and reversing the order of every other segment:

Once the order of the observations is updated, the vector of pairs (observations, labels) is easily built through a map (line `34`). The actual shuffling of the indices is performed in the following recursive `shuffle` function:

The maximum size of a partition of the vector of observations, `maxChunkSize`, is computed randomly (line `35`). The method extracts the next slice (`start`, `end`) (line `36`). The slice is either added to the existing vector of indices and returned once all the observations have been shuffled (line `37`), or passed to the next invocation.

The `slice` method returns an array of indices over the range (`start`, `end`), either in the right order if the number of segments processed is odd, or in reverse order if the number of segments processed is even:
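A possible sketch of this segment-reversal scheme, with illustrative names (`slice`, `shuffle`, `shuffleIndices`); the chunk-size selection is an assumption, not the book's listing:

```scala
import scala.util.Random

// Indices over (start, end), in order for odd segment counts,
// reversed for even segment counts
def slice(start: Int, end: Int, nSegments: Int): Vector[Int] = {
  val indices = (start until end).toVector
  if (nSegments % 2 == 1) indices else indices.reverse
}

// Tail-recursive walk over the index range: cut it into chunks of
// random size and append each chunk, reversing every other one
@annotation.tailrec
def shuffle(acc: Vector[Int], start: Int, n: Int,
            nSegments: Int, maxChunkSize: Int): Vector[Int] =
  if (start >= n) acc
  else {
    val end = math.min(n, start + 1 + Random.nextInt(maxChunkSize))
    shuffle(acc ++ slice(start, end, nSegments), end, n, nSegments + 1, maxChunkSize)
  }

def shuffleIndices(n: Int): Vector[Int] = {
  val maxChunkSize = 1 + Random.nextInt(math.max(1, n / 2))
  shuffle(Vector.empty, 0, n, 1, maxChunkSize)
}
```

Whatever chunk sizes are drawn, the result is always a permutation of `0 until n`, so it is safe to use as a reordering of the observation indices.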

### Note

**Iterative versus tail recursive computation**

Tail recursion in Scala is a very efficient alternative to an iterative algorithm: it avoids the need to create a new stack frame for each invocation of the method. It is applied to the implementation of many machine learning algorithms presented throughout the book.
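A minimal comparison, assuming a simple summation: the `@tailrec` annotation asks the compiler to verify that the recursive call is in tail position, so the recursion compiles down to a loop that runs in constant stack space.

```scala
// Iterative sum over a list
def sumIter(xs: List[Int]): Int = {
  var s = 0
  for (x <- xs) s += x
  s
}

// Tail-recursive equivalent: the recursive call is the last action,
// so no new stack frame is created per invocation
@annotation.tailrec
def sumRec(xs: List[Int], acc: Int = 0): Int = xs match {
  case Nil    => acc
  case h :: t => sumRec(t, acc + h)
}
```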

In order to train the model, we need to label the input data. The labeling process consists of associating the relative price movement during a session (*price at close/price at open – 1*) with one of the following two configurations:

The two classes of training observations are segregated by a decision boundary drawn on the scatter plot in the previous section. The labeling process is usually quite cumbersome and should be automated as much as possible.
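One plausible labeling rule, shown purely for illustration: the assignment of class 1 to an up session and class 0 to a down session is an assumption, not the book's definition.

```scala
// Hypothetical labeling rule: relative price movement during a session
// (close/open - 1) mapped to one of the two classes
def label(open: Double, close: Double): Int =
  if (close / open - 1.0 > 0.0) 1 else 0
```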

### Note

**Automated labeling**

Although quite convenient, automated creation of training labels is not without risk as it may mislabel singular observations. This technique is used in this test for convenience, but it is not recommended unless a domain expert reviews the labels manually.

Once the model is successfully created through training, it is available to classify new observations. The runtime classification of observations using the binomial logistic regression is implemented by the `classify` method:

The method applies the logistic function to `linear`, the inner product of the new observation `obs` and the model `weights` (line `37`). The method returns a tuple (the predicted class of the observation {0, 1}, the prediction value), where the class is defined by comparing the prediction to the boundary value `0.0` (line `38`).
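A sketch of such a `classify` method under these assumptions (illustrative signature; the class boundary is a `linear` margin of `0.0`, which corresponds to a logistic output of 0.5):

```scala
// Logistic (sigmoid) function
def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// Returns (predicted class in {0, 1}, prediction value). The class is
// derived by comparing the linear margin to the 0.0 boundary.
def classify(weights: Array[Double], obs: Array[Double]): (Int, Double) = {
  val linear = weights.head + weights.tail.zip(obs)
    .map { case (w, x) => w * x }.sum
  val prediction = logistic(linear)
  (if (linear > 0.0) 1 else 0, prediction)
}
```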

The computation of the `dot` product of the weights and observations uses the bias value as follows:

An alternative implementation of the `dot` product of weights and observations consists of extracting the first weight, `w.head`:

The `dot` method is used in the `classify` method.
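The two variants can be sketched side by side; both names are illustrative, and the implementations are equivalent:

```scala
// Variant 1: prepend the bias 1.0 to the observation, then zip with
// the full weights vector
def dotWithBias(w: Array[Double], obs: Array[Double]): Double =
  (1.0 +: obs).zip(w).map { case (x, wi) => x * wi }.sum

// Variant 2: extract the first weight w.head as the bias term and zip
// the remaining weights with the observation
def dotWithHead(w: Array[Double], obs: Array[Double]): Double =
  w.head + w.tail.zip(obs).map { case (wi, x) => wi * x }.sum
```

Which variant to prefer is a matter of style: the first keeps a single zip over aligned vectors, the second avoids allocating the extended observation array.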