Mutual information
Definition
Mutual Information is a metric that quantifies the amount of information obtained about one feature by observing another feature. It is particularly useful when the relationship between features is complex and possibly non-linear. It can be calculated for categorical vs categorical and numerical vs categorical feature pairs, but not for numerical vs numerical pairs (so at least one of the two features must always be categorical).
In the figure above, blue highlights categorical features and red highlights numerical features. The top half (where the diagonal cells are shaded with a mesh) shows categorical vs categorical pairs, and the bottom half shows numerical vs categorical pairs.
Although Mutual Information can be calculated both between two categorical features and between a numerical and a categorical feature, these two computations are quite different and arguably not comparable, so it is recommended to treat them as two separate measurements.
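A minimal sketch of the two separate computations, assuming scikit-learn and pandas are available; the column names and synthetic data are purely illustrative. `mutual_info_score` handles a pair of categorical columns, while `mutual_info_classif` gives a nearest-neighbour estimate of Mutual Information between a numerical column and a categorical target.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=1000),  # categorical
    "size": rng.choice(["S", "M", "L"], size=1000),             # categorical
    "weight": rng.normal(loc=50, scale=10, size=1000),          # numerical
})

# Categorical vs categorical: MI computed directly from the joint contingency table.
mi_cat_cat = mutual_info_score(df["colour"], df["size"])

# Numerical vs categorical: a nearest-neighbour estimate against an integer-coded target.
colour_codes = pd.factorize(df["colour"])[0]
mi_num_cat = mutual_info_classif(
    df[["weight"]], colour_codes, discrete_features=False, random_state=0
)[0]

print(f"colour vs size:   {mi_cat_cat:.4f}")   # ~0, the columns are generated independently
print(f"weight vs colour: {mi_num_cat:.4f}")   # ~0 as well, but estimated very differently
```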
Range of scores: 0 to infinity
A score of 0 indicates that the features are independent and there is no relationship between them: knowing the value of one feature provides no information about the value of the other. Higher Mutual Information scores indicate stronger dependence between the features.
Since Mutual Information does not have an upper bound, it may not be straightforward to interpret what constitutes a "high" score: what counts as high is context-dependent and varies with the specific dataset and problem at hand. If you need a metric that captures complex non-linear relationships but also has a fixed range, the Maximal Information Coefficient (MIC) or the Randomised Dependence Coefficient (RDC) may be better suited.
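As a rough illustration of the unbounded range (a hedged sketch assuming scikit-learn; the arrays are synthetic), independent features score close to 0, while a feature compared against an exact copy of itself scores at the feature's own entropy, about log(4) ≈ 1.39 nats for four equally likely categories.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=10_000)            # categorical feature with 4 levels
independent = rng.integers(0, 4, size=10_000)  # unrelated feature
identical = a.copy()                           # perfectly dependent feature

print(mutual_info_score(a, independent))  # ~0: knowing one says nothing about the other
print(mutual_info_score(a, identical))    # ~log(4) ≈ 1.386: the entropy of `a` itself
```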
How it works
The Mutual Information score is based on the concept of entropy, which, in information theory, is a measure of the uncertainty or randomness of a feature. The Mutual Information between two features is the difference between the sum of the entropies of the individual features and the entropy of the joint feature (both features considered together). This effectively measures the reduction in uncertainty about one feature given knowledge of the other. If the features are independent, their joint entropy is simply the sum of their individual entropies, so the Mutual Information is 0. If the features are identical, knowing one feature completely determines the other, and the Mutual Information equals the entropy of each feature.
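In symbols, the paragraph above corresponds to MI(X, Y) = H(X) + H(Y) - H(X, Y). A minimal from-scratch sketch of this calculation for two categorical features might look like the following (assuming pandas and SciPy are available; entropies are in nats, and the helper name and toy data are purely illustrative).

```python
import pandas as pd
from scipy.stats import entropy

def mutual_information(x: pd.Series, y: pd.Series) -> float:
    joint = pd.crosstab(x, y)                         # joint frequency table
    p_xy = joint.to_numpy().ravel() / len(x)          # joint distribution P(X, Y)
    p_x = x.value_counts(normalize=True).to_numpy()   # marginal P(X)
    p_y = y.value_counts(normalize=True).to_numpy()   # marginal P(Y)
    # MI = H(X) + H(Y) - H(X, Y); for independent features H(X, Y) = H(X) + H(Y), so MI = 0.
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

x = pd.Series(["a", "a", "b", "b", "c", "c"])
print(mutual_information(x, x))                      # identical: MI == H(x) ≈ 1.0986 (log 3)
print(mutual_information(x, pd.Series([1, 2] * 3)))  # independent: MI == 0
```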