16 Derivatives and Gradients
Now that we understand why multivariate functions and high-dimensional spaces are more complex than the single-variable case we studied earlier, it’s time to see how differentiation actually works in the general case.
To recap quickly, our goal in machine learning is to optimize functions with millions of variables. For instance, think about a neural network N(x, w) trained for binary classification (see the code sketch after this list), where
- x ∈ ℝⁿ is the input data,
- w ∈ ℝᵐ is the vector collecting all of the weight parameters,
- and N(x, w) ∈ [0, 1] is the prediction, representing the probability of belonging to the positive class.
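To make this concrete, here is a minimal sketch of such a network. It is my own illustration, not a definition from this book: the single-layer logistic form, the name sigmoid, and the choice m = n are assumptions made purely to keep the example short.

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def N(x, w):
    """A minimal stand-in for a network: a linear model followed by a sigmoid.

    x : input vector in R^n
    w : weight vector (here m = n for simplicity)
    Returns a number in (0, 1), read as the probability of the positive class.
    """
    return sigmoid(np.dot(w, x))

x = np.array([0.5, -1.2, 3.0])   # one data point, n = 3
w = np.array([0.1, 0.4, -0.2])   # the weights we want to learn
print(N(x, w))                   # roughly 0.26, a probability
```

Real networks stack many such layers, but the essential point is unchanged: N(x, w) is just a function of the input x and the weight vector w.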
In the case of, say, binary cross-entropy loss, we have the loss function

L(w) = −(1/k) ∑ᵢ₌₁ᵏ [ yᵢ log N(xᵢ, w) + (1 − yᵢ) log(1 − N(xᵢ, w)) ],

where xᵢ is the i-th of the k data points, with ground truth yᵢ ∈ {0, 1}. See, I told you that we have to write much more in multivariable calculus. (We’ll talk about binary cross-entropy loss in Chapter 20.)
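As a quick sanity check, here is a short sketch that evaluates this loss on a tiny dataset, assuming the same single-layer logistic stand-in for N as in the previous snippet. The clipping constant eps is my own addition to keep log() away from zero; it is not part of the formula above.

```python
import numpy as np

def bce_loss(w, X, y, eps=1e-12):
    """Binary cross-entropy of the logistic model over a whole dataset.

    X : array of shape (k, n), one data point per row
    y : array of shape (k,), ground-truth labels in {0, 1}
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))   # N(x_i, w) for every row x_i
    p = np.clip(p, eps, 1 - eps)       # avoid log(0) for numerical safety
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[0.5, -1.2, 3.0],
              [1.0,  0.3, -0.5]])
y = np.array([1.0, 0.0])
w = np.array([0.1, 0.4, -0.2])
print(bce_loss(w, X, y))   # a single nonnegative number measuring the misfit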
Training the neural network is the same as finding a minimizer of the loss function L(w) in the weight variable w.
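To preview where this chapter is headed, here is a self-contained sketch of that idea in miniature: estimate a gradient numerically and follow it downhill until the function stops decreasing. The toy quadratic "loss", the step size 0.1, and the iteration count are assumptions chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-6):
    """Estimate the gradient of f at w by central differences, coordinate by coordinate."""
    grad = np.zeros_like(w, dtype=float)
    for j in range(len(w)):
        e = np.zeros_like(w, dtype=float)
        e[j] = h
        grad[j] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

# A stand-in for a loss: a simple bowl with its minimum at (1, -2).
loss = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

w = np.array([0.0, 0.0])
for _ in range(50):
    w = w - 0.1 * numerical_gradient(loss, w)   # step against the gradient
print(w)   # approaches [1, -2], the minimizer
```

With millions of weights, computing the gradient this way is hopeless, which is exactly why we need the machinery of derivatives and gradients developed in this chapter.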