Phi-K Correlation Coefficient
Definition
The Phi-K Correlation Coefficient (Phi-K) is a statistical metric that measures the strength of association between two features. It works for any pairing of feature types: numerical vs numerical, numerical vs categorical, and categorical vs categorical. Because Phi-K is derived from a contingency table rather than from a straight-line fit, it captures both linear and non-linear relationships. The statistical significance of an association is a separate quantity, assessed alongside the coefficient rather than encoded in it.
Phi-K is calculated from Pearson's chi-squared statistic on a contingency table of the two features. Numerical features are binned first so that they can be cross-tabulated in the same way as categorical ones. The chi-squared value is then corrected for the statistical noise expected at the given sample size and converted into a correlation-like coefficient. As a result, the same procedure applies regardless of the feature types or the shape of the relationship.
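For reference, Pearson's chi-squared statistic on a contingency table has the standard form shown here. The symbols are notation introduced for this note and are not taken from the text above: O_ij is the observed count in cell (i, j), R_i and C_j are the row and column totals, N is the total number of samples, and E_ij is the count expected if the two features were independent.

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
```

The statistic is 0 when every observed count equals its expected count, and it grows as the table departs from independence.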
Range of scores: 0 to 1
The Phi-K coefficient is always bounded between 0 and 1.
A score of 0 indicates no detectable association between the two features beyond what statistical noise would produce. A score of 1 indicates a perfect relationship, meaning that knowing one feature fully determines the other. Because the metric is bounded, it is easier to interpret than unnormalized Mutual Information, whose maximum depends on the entropies of the features rather than being a fixed value. Phi-K’s consistent range makes scores easier to compare across feature pairs and datasets, although for numerical features the value also depends on how the data is binned.
How it works
The Phi-K Correlation Coefficient converts a chi-squared test of independence into a correlation-like score. Its calculation involves the following steps:
Data Transformation: Numerical features are binned into discrete intervals, while categorical features remain unchanged. Both features can then be cross-tabulated in a single contingency table.
Chi-Squared Calculation: Pearson's chi-squared statistic is computed on the contingency table by comparing the observed cell counts with the counts expected if the two features were independent.
Noise Correction: The chi-squared value is corrected for the statistical fluctuations expected at the given sample size, so that small samples do not produce spuriously high scores. No separate adjustment for non-linearity is needed, because the chi-squared statistic responds to any departure from independence, including patterns missed by standard linear correlation metrics (like Pearson’s correlation).
Normalization: The corrected chi-squared value is mapped onto the range of 0 to 1 by finding the correlation of a bivariate normal distribution that would produce the same chi-squared value under the same binning. This keeps the score consistent and interpretable across different datasets and feature types.
By combining these steps, Phi-K effectively accounts for both linear and non-linear dependencies, making it suitable for exploratory data analysis, especially when feature relationships are complex or unknown.
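The first two steps can be sketched in plain Python. This is a minimal illustration, not the reference Phi-K implementation: the function names (equal_width_bins, contingency_table, chi_squared), the equal-width binning rule, and the sample data are all assumptions made for this example, and the final conversion of the chi-squared value into a score between 0 and 1 is deliberately left out.

```python
from collections import Counter


def equal_width_bins(values, n_bins=4):
    """Assign each numeric value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant data
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def contingency_table(x, y):
    """Count how often each pair of labels from x and y occurs together."""
    counts = Counter(zip(x, y))
    rows, cols = sorted(set(x)), sorted(set(y))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi_squared(table):
    """Pearson chi-squared statistic against the independence hypothesis."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical data: a numerical feature and a categorical feature.
ages = [22, 25, 31, 38, 44, 47, 53, 59]
segment = ["a", "a", "a", "a", "b", "b", "b", "b"]

age_bins = equal_width_bins(ages, n_bins=2)      # [0, 0, 0, 0, 1, 1, 1, 1]
table = contingency_table(age_bins, segment)     # [[4, 0], [0, 4]]
statistic = chi_squared(table)                   # 8.0
```

In this toy data the age bin fully determines the segment, so the chi-squared value reaches its maximum for a 2x2 table of 8 samples. A table such as [[1, 1], [1, 1]], where the features are independent, gives a statistic of 0.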