CORRELATION
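Correlation measures the extent to which a pair of variables are linearly related, expressed as a number between -1 and 1 inclusive. The three reference values are -1, 0, and 1.
A correlation of 1 means that the two variables move in the same direction in a perfectly linear fashion: when one increases, so does the other. A correlation of -1 means that they move in opposite directions in a perfectly linear fashion: when one increases, the other decreases. A correlation of 0 means that there is no linear relationship between the variables. This does not by itself mean that the variables are independent, because a nonlinear dependence (for example, y = x^2 over values of x that are symmetric around 0) can also produce a correlation of 0.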
Pandas provides the corr() method that generates a matrix containing the correlation between any pair of features in a data frame. Note that the diagonal values of this matrix are all equal to 1, because every feature is perfectly correlated with itself.
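As a minimal sketch of the corr() method, the following code builds a small data frame and prints its correlation matrix. The column names and values are invented for illustration and do not come from any dataset in this chapter.

```python
import pandas as pd

# Hypothetical data: column names and values are made up for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 79],
    "hours_gaming": [9, 7, 6, 4, 2],
})

# corr() computes the Pearson correlation for every pair of columns.
corr_matrix = df.corr()

# Each diagonal entry is the correlation of a column with itself.
print(corr_matrix.round(3))
```

In this made-up data, hours_studied and exam_score rise together, so their correlation is close to 1, whereas hours_gaming falls as hours_studied rises, so that correlation is close to -1.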
A correlation matrix can be derived from a covariance matrix: each entry in the former matrix is a covariance value divided by the product of the standard deviations of the two features in the row and column of that entry. In other words, corr(X, Y) = cov(X, Y) / (std(X) * std(Y)).
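The relationship between the two matrices can be checked numerically. The following sketch uses NumPy (an assumption, since the text only mentions Pandas) with invented data: it computes a covariance matrix, divides each entry by the product of the two relevant standard deviations, and compares the result with NumPy's own correlation matrix.

```python
import numpy as np

# Hypothetical data: each row is an observation, each column is a feature.
data = np.array([
    [1.0, 52.0, 9.0],
    [2.0, 58.0, 7.0],
    [3.0, 65.0, 6.0],
    [4.0, 70.0, 4.0],
    [5.0, 79.0, 2.0],
])

# Covariance matrix of the columns (rowvar=False treats columns as features).
cov = np.cov(data, rowvar=False)

# The diagonal of the covariance matrix holds the variances,
# so its square root gives the standard deviation of each feature.
std = np.sqrt(np.diag(cov))

# Divide entry (i, j) by std[i] * std[j].
corr_from_cov = cov / np.outer(std, std)

# Compare with the correlation matrix that NumPy computes directly.
print(np.allclose(corr_from_cov, np.corrcoef(data, rowvar=False)))
```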
This concludes the portion of the chapter pertaining to dependencies among features in a dataset. The next section discusses different types of currencies that can appear in a dataset, along with a Python code sample for currency conversion.
What Is a Good Correlation Value?
Although there is no exact value that determines whether a correlation is weak, moderate, or strong, there are some guidelines, as shown here:
• between 0.0 and 0.2: weak
• between 0.2 and 0.5: moderate
• between 0.5 and 0.7: moderately strong
• between 0.7 and 1.0: strong
The preceding ranges are for positive correlations, and the corresponding values for negative correlations are shown here:
• between -0.2 and 0: weak
• between -0.5 and -0.2: moderate
• between -0.7 and -0.5: moderately strong
• between -1.0 and -0.7: strong
However, treat the values in the preceding lists as guidelines: some people classify values between 0.0 and 0.4 as weak correlations, and values between 0.8 and 1.0 as strong correlations. In addition, a correlation of exactly 0.0 means that there is no linear correlation at all.
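The guideline ranges can be expressed as a small helper function. This is only a sketch: the function name correlation_strength is hypothetical, and because the ranges in the lists share their endpoints, the choice to assign each boundary value (0.2, 0.5, and 0.7) to the stronger category is an assumption.

```python
def correlation_strength(r):
    """Label a correlation value using the guideline ranges in this section."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")

    # The sign gives the direction; the magnitude gives the strength.
    magnitude = abs(r)

    # Assumption: boundary values fall into the stronger category.
    if magnitude < 0.2:
        return "weak"
    if magnitude < 0.5:
        return "moderate"
    if magnitude < 0.7:
        return "moderately strong"
    return "strong"


for value in (0.1, -0.3, -0.6, 0.85):
    print(value, correlation_strength(value))
```

Working with the absolute value means that the positive and negative lists are handled by the same set of comparisons.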
Discrimination Threshold
Logistic regression (discussed in Chapter 6) is based on the sigmoid function (which in turn involves Euler's number e), whereby any real number is mapped to a number in the interval (0,1). Consequently, logistic regression is well-suited for binary classification: i.e., for data points that belong to one of two classes. For datasets that contain two class values, let's call them 0 and 1, logistic regression provides a probability that a data point belongs to class 1, and this probability lies in the interval (0,1). The probability that the data point belongs to class 0 is one minus that value.
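The sigmoid function itself is short enough to write out. The following is a minimal sketch that uses only the Python standard library; the two-branch form is one common way to avoid overflow for large negative inputs, and it is not necessarily how a given library implements the function.

```python
import math


def sigmoid(z):
    """Map a real number z to a value between 0 and 1: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, an equivalent form avoids computing a huge exponential.
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(sigmoid(0))  # 0.5, the midpoint of the curve
for z in (-5, -1, 1, 5):
    print(z, sigmoid(z))
```

Large positive inputs produce values near 1 and large negative inputs produce values near 0, which is why the output can be read as the probability of class 1.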
The discrimination threshold is the value whereby larger probabilities are associated with class 1 and smaller probabilities are associated with class 0. A threshold of 0.5 is a common choice, but in general, this value can be much closer to 0 or 1. Relevant examples include health-related datasets (healthy versus cancer), sports events (win versus lose), and even the DMV (department of motor vehicles), where a score of 85% is required in order to pass the test in some US states.
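To illustrate how the threshold affects the predicted classes, the following sketch applies three different thresholds to the same list of probabilities. The function name classify and the probability values are hypothetical, and assigning a probability that is exactly equal to the threshold to class 1 is an assumption.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 for each probability at or above the threshold, else 0."""
    # Assumption: a probability exactly equal to the threshold maps to class 1.
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical class 1 probabilities, such as a logistic regression might produce.
probs = [0.10, 0.35, 0.50, 0.65, 0.90]

print(classify(probs))                 # [0, 0, 1, 1, 1]
print(classify(probs, threshold=0.8))  # [0, 0, 0, 0, 1]
print(classify(probs, threshold=0.2))  # [0, 1, 1, 1, 1]
```

Raising the threshold makes the classifier more reluctant to predict class 1, whereas lowering it has the opposite effect.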