column_associations
- skrub.column_associations(df)
Get measures of statistical associations between all pairs of columns.
At the moment, the only reported metric is Cramer’s V statistic. More may be added in the future.
The result is returned as a dataframe with columns:
['left_column_name', 'left_column_idx', 'right_column_name', 'right_column_idx', 'cramer_v']
Because the statistic is symmetric, each pair of columns appears only once (as either col_1, col_2 or col_2, col_1, but not both). The results are sorted from the most associated pair to the least associated.
To compute the Cramer’s V statistic, all columns are discretized. Numeric columns are binned with 10 bins. For categorical columns, only the 10 most frequent categories are considered. In both cases, nulls are treated as a separate category, i.e. a separate row in the contingency table. Thus associations between the values of 2 columns, or between their missingness patterns, may be captured.
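The discretize-then-measure idea described above can be illustrated with a minimal pure-Python sketch. This is not skrub's implementation: the helper names (`discretize_numeric`, `discretize_categorical`, `cramers_v`) are hypothetical, equal-width binning is an assumption, grouping the less frequent categories under a single "other" label is an assumption, and the formula is the textbook Cramér's V, sqrt(chi2 / (n * (min(rows, cols) - 1))).

```python
import math
from collections import Counter

NULL = "<null>"    # label used for missing values (assumed convention)
OTHER = "<other>"  # label for infrequent categories (assumption, not from skrub)


def discretize_numeric(values, n_bins=10):
    """Map numbers to equal-width bin indices; None gets its own label."""
    present = [v for v in values if v is not None]
    if not present:
        return [NULL] * len(values)
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant columns
    return [
        NULL if v is None else min(int((v - lo) / width), n_bins - 1)
        for v in values
    ]


def discretize_categorical(values, n_top=10):
    """Keep the n_top most frequent categories; None gets its own label."""
    counts = Counter(v for v in values if v is not None)
    top = {category for category, _ in counts.most_common(n_top)}
    return [NULL if v is None else (v if v in top else OTHER) for v in values]


def cramers_v(x, y):
    """Textbook Cramér's V between two equally long sequences of labels."""
    n = len(x)
    row_totals, col_totals = Counter(x), Counter(y)
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or min_dim == 0:
        return 0.0  # a constant column carries no association
    joint = Counter(zip(x, y))  # contingency table, missing cells count as 0
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))


# Nulls occur in the same rows of both columns, so their shared
# missingness pattern contributes to the association.
ages = [23, 35, None, 41, 58, 62, None, 30]
cities = ["Paris", "Lyon", None, "Paris", "Lyon", "Paris", None, "Lyon"]
v = cramers_v(discretize_numeric(ages), discretize_categorical(cities))
print(f"Cramér's V: {v:.3f}")
```

Because nulls are kept as their own label, two columns that are missing in the same rows end up with a non-zero association even if their observed values are unrelated.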
- Parameters:
- df : dataframe
The dataframe whose columns will be compared to each other.
- Returns:
- dataframe
The computed associations.