“BINNING” DATA VALUES
The concept of “binning” refers to subdividing a set of values into multiple intervals and then treating all the numbers in the same interval as though they had the same value. There are at least three common techniques for binning data, as shown here:
• bins of equal widths
• bins of equal frequency
• bins based on k-means
More information about binning numerical features can be found here: https://towardsdatascience.com/from-numerical-to-categorical-3252cf805ea2
As a simple example of bins of equal widths, suppose that a feature in a dataset contains people’s ages. The range of values is approximately 0 to 120, so we could “bin” the ages into 12 equal intervals, each of which spans 10 values: 0 through 9, 10 through 19, 20 through 29, and so forth.
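The following snippet is a minimal sketch of equal-width binning with Pandas; the column name (“age”) and the sample values are assumptions made purely for illustration:

import pandas as pd

# hypothetical ages (the values and the column name are illustrative)
ages = pd.Series([3, 17, 25, 29, 30, 39, 47, 62, 78, 85, 101, 119], name="age")

# 12 equal-width bins: [0, 10), [10, 20), ..., [110, 120)
edges = range(0, 130, 10)
age_bins = pd.cut(ages, bins=edges, right=False)

print(age_bins.value_counts().sort_index())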
As another example, binning by quartiles (an equal-frequency technique with only four bins) is even more coarse-grained than the preceding age-related example. The drawback of binning is its unintended consequences: values that are close to each other can be assigned to different bins. For instance, some people who struggle financially because they earn a meager wage are nevertheless disqualified from financial assistance because their salary is just above the cut-off point for receiving any assistance.
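Here is a minimal sketch of equal-frequency (quartile) binning with Pandas; the salary values are invented solely for illustration:

import pandas as pd

# hypothetical salaries (illustrative values only)
salaries = pd.Series([18000, 21000, 24000, 25000, 32000, 41000,
                      45000, 52000, 67000, 88000, 120000, 250000])

# qcut() places roughly the same number of observations in each bin
quartiles = pd.qcut(salaries, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles.value_counts())

Notice that two salaries that differ by only a small amount can still land in different quartiles, which is precisely the boundary effect described above.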
Scikit-learn provides the KBinsDiscretizer class, which supports all three of the preceding strategies (equal widths, equal frequency via quantiles, and k-means clustering) for binning data:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
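As a sketch of how KBinsDiscretizer might be used (the data values are invented and the choice of three bins is arbitrary):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3], [17], [25], [29], [30], [39], [47], [62], [78], [85]])

# strategy can be "uniform" (equal widths), "quantile" (equal frequency),
# or "kmeans" (bin edges derived from one-dimensional k-means clustering)
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
binned = kbd.fit_transform(ages)

print(binned.ravel())     # the ordinal bin index assigned to each value
print(kbd.bin_edges_[0])  # the computed bin edges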
In case you’re interested, a highly technical paper (PDF) with information about clustering and binning can be accessed here:
https://www.stat.cmu.edu/tr/tr870/tr870.pdf
Programmatic Binning Techniques
Earlier in this chapter you saw a Pandas-based example of generating a histogram using data from a Titanic dataset. The number of bins was chosen on an ad hoc basis, with no relation to the data itself. However, there are several techniques that enable you to programmatically determine the optimal number of bins, some of which are shown as follows:
• Doane’s formula
• Freedman–Diaconis’ choice
• Rice’s rule
• Scott’s normal reference rule
• square-root choice
• Sturges’ rule
Doane’s formula for calculating the number of bins depends on the number of observations n and the skewness of the data (a companion statistic to the kurtosis discussed in Chapter 4). A common approximation is reproduced here:
k = 1 + log2(n) + log2(1 + abs(skewness(data)) * sqrt(n / 6.0))
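A sketch of Doane’s formula in Python might look as follows (the helper name doane_bins is illustrative, and the large-sample approximation sqrt(n/6.0) is used for the standard error of the skewness):

import numpy as np
from scipy.stats import skew

def doane_bins(data):
    n = len(data)
    g1 = abs(skew(data))  # skewness of the data
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + g1 * np.sqrt(n / 6.0))))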
Freedman–Diaconis’ choice is based on the IQR (interquartile range) of a sample x and the number of observations n. Strictly speaking, it specifies the bin width h (the number of bins then follows by dividing the data range by h), as shown in the following formula:
h = 2 * IQR(x) / [cube root of n]
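A sketch of the Freedman–Diaconis choice (the helper name fd_bins is illustrative):

import numpy as np

def fd_bins(x):
    x = np.asarray(x)
    q75, q25 = np.percentile(x, [75, 25])
    h = 2 * (q75 - q25) / np.cbrt(len(x))         # bin width
    return int(np.ceil((x.max() - x.min()) / h))  # number of bins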
Sturges’ rule for determining the number of bins k for approximately normally distributed (Gaussian) data is based on the number of observations n, and it’s expressed as follows (note that 3.322 * log10(n) equals log2(n)):
k = 1 + 3.322 * log10(n)
In addition, after specifying the number of bins k, set the minimum bin width mbw as follows:
mbw = (Max Observed Value - Min Observed Value) / k
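A sketch of Sturges’ rule together with the minimum bin width calculation (the helper names are illustrative):

import numpy as np

def sturges_bins(data):
    return int(np.ceil(1 + 3.322 * np.log10(len(data))))

def min_bin_width(data, k):
    data = np.asarray(data)
    return (data.max() - data.min()) / k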
Experiment with the preceding formulas to determine which one provides the best visual display for your data. For more information about calculating the optimal number of bins, perform an online search for blog posts and articles.
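Rather than coding every rule by hand, note that NumPy’s histogram_bin_edges() function already implements several of the preceding rules by name, which makes such experiments straightforward (the sample data here is randomly generated and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=40, scale=15, size=500)  # illustrative sample

for rule in ["sqrt", "sturges", "rice", "scott", "fd", "doane"]:
    edges = np.histogram_bin_edges(data, bins=rule)
    print(f"{rule:>8}: {len(edges) - 1} bins")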
Potential Issues When Binning Data Values
Partitioning the values of people’s ages as described in the preceding section can be problematic. In particular, suppose that person A, person B, and person C are 29, 30, and 39 years old, respectively. Person A and person B are probably much more similar to each other than person B and person C, but because of the way in which the ages are partitioned, B is classified as closer to C than to A. In fact, binning can increase Type I errors (false positives) and Type II errors (false negatives), as discussed in this blog post (along with some alternatives to binning):
https://medium.com/@peterflom/why-binning-continuous-data-is-almost-always-a-mistake-ad0b3a1d141f