column_associations

skrub.column_associations(df)

Get measures of statistical associations between all pairs of columns.

At the moment, the only reported metric is Cramer’s V statistic. More may be added in the future.

The result is returned as a dataframe with columns:

['left_column_name', 'left_column_idx', 'right_column_name', 'right_column_idx', 'cramer_v']

Because the measure is symmetric, each pair of columns appears only once (either as col_1, col_2 or as col_2, col_1, but not both). The results are sorted from most associated to least associated.

To compute Cramer's V statistic, all columns are discretized. Numeric columns are binned with 10 bins. For categorical columns, only the 10 most frequent categories are considered. In both cases, nulls are treated as a separate category, i.e. a separate row in the contingency table. Thus associations between the values of 2 columns or between their missingness patterns may be captured.
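To make the statistic concrete, here is a minimal sketch of the textbook Cramer's V for two already-discretized columns, using only the Python standard library. The helper name `cramers_v` is hypothetical and is not part of skrub; skrub's own implementation may differ in details such as binning and the handling of rare categories. Nulls (`None`) are counted as an ordinary category, mirroring the description above.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Textbook Cramer's V for two equally long sequences of discrete labels.

    Hypothetical helper for illustration only; not skrub's implementation.
    """
    n = len(xs)
    row_totals = Counter(xs)  # marginal counts of the first column
    col_totals = Counter(ys)  # marginal counts of the second column
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or min_dim <= 0:
        # Empty input or a constant column: no association can be measured.
        return 0.0
    joint = Counter(zip(xs, ys))  # contingency table as a sparse mapping
    chi2 = 0.0
    for x, row_count in row_totals.items():
        for y, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * min_dim))


# None is treated as its own category, so matching missingness
# patterns contribute to the association.
left = ["a", "a", "b", "b", None, None]
right = [1, 1, 2, 2, None, None]
perfect = cramers_v(left, right)  # close to 1: each value determines the other

independent = cramers_v(["a", "a", "b", "b"], [1, 2, 1, 2])  # 0: no association
print(perfect, independent)
```

The statistic lies between 0 (no association) and 1 (each column fully determines the other), which is why sorting by it ranks pairs from most to least associated.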

Parameters:
df : dataframe

The dataframe whose columns will be compared to each other.

Returns:
dataframe

The computed associations.