20.6 Information theory
If you have already trained a machine learning model in practice, chances are that you are familiar with the mean-squared error

\[
\mathrm{MSE}(f) = \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2,
\]

where $f \colon \mathbb{R} \to \mathbb{R}$ represents our model, $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ is the vector of one-dimensional observations, and $y \in \mathbb{R}^n$ is the ground truth. After learning all about the expected value, this sum should look familiar: if we assume a probabilistic viewpoint and let $X$ and $Y$ be the random variables describing the data, then the mean-squared error can be written as the expected value

\[
\mathrm{MSE}(f) = \mathbb{E}\big[ (f(X) - Y)^2 \big].
\]
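To make the connection between the two formulations concrete, here is a minimal NumPy sketch. The model `f` and the data arrays are illustrative assumptions, not from the text; the point is that the finite sum above is exactly the sample average that estimates $\mathbb{E}[(f(X) - Y)^2]$.

```python
import numpy as np

def mean_squared_error(f, x, y):
    """Mean-squared error (1/n) * sum_i (f(x_i) - y_i)^2."""
    residuals = f(x) - y          # f is applied elementwise to the observations
    return np.mean(residuals**2)  # the sample mean estimates E[(f(X) - Y)^2]

# Illustrative example: a linear model on made-up one-dimensional data.
f = lambda x: 2.0 * x + 1.0
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(mean_squared_error(f, x, y))
```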
However, the mean-squared error is not suitable for classification problems. For instance, if the task is to classify the object in an image, the output is a discrete probability distribution for each sample. In this situation, we could use the so-called cross-entropy, defined by

\[
H(p, q) = - \sum_{i=1}^{n} p_i \log q_i,
\]

where $p \in \mathbb{R}^n$ denotes the one-hot encoded vector of the class label for a single data sample, and $q \in \mathbb{R}^n$ is the class probability distribution predicted by the model.
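As a companion sketch, the cross-entropy of a one-hot label $p$ against predicted class probabilities $q$ can be computed directly from the definition. The vectors below are made up for illustration, and the small constant inside the logarithm is a common numerical safeguard against $\log 0$, not part of the definition.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i).

    p: one-hot encoded class label, q: predicted class probabilities.
    eps guards against log(0) when the model assigns zero probability.
    """
    return -np.sum(p * np.log(q + eps))

# Illustrative example: the true class is the second of three classes.
p = np.array([0.0, 1.0, 0.0])
q = np.array([0.1, 0.7, 0.2])   # model's predicted distribution
print(cross_entropy(p, q))      # equals -log(0.7), roughly 0.357
```

Because $p$ is one-hot, the sum collapses to a single term: the negative log-probability the model assigns to the true class, so the cross-entropy is small exactly when the model is confident in the correct class.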