column_associations
- skrub.column_associations(df)
Get measures of statistical associations between all pairs of columns.
At the moment, the only reported metric is Cramér’s V. More may be added in the future.
The result is returned as a dataframe with columns:
['left_column_name', 'left_column_idx', 'right_column_name', 'right_column_idx', 'cramer_v']
Because the measure is symmetric, each pair of columns appears only once (either col_1, col_2 or col_2, col_1, but not both). The results are sorted from most associated to least associated.
To compute Cramér’s V, all columns are discretized. Numeric columns are binned into 10 bins. For categorical columns, only the 10 most frequent categories are considered. In both cases, nulls are treated as a separate category, i.e. a separate row in the contingency table. Thus, associations between the values of two columns or between their missingness patterns may be captured.
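The discretization step described above can be sketched in plain Python. This is only an illustration, not skrub's implementation: the helper names, the `<null>` and `<other>` labels, the use of equal-width bins, and the choice to group infrequent categories into one bucket are all assumptions, since the text only states the number of bins and categories.

```python
from collections import Counter

NULL = "<null>"    # hypothetical label for the separate null category
OTHER = "<other>"  # hypothetical bucket for infrequent categories (assumption)


def discretize_numeric(values, n_bins=10):
    """Map numbers to equal-width bin indices; None becomes its own category."""
    present = [v for v in values if v is not None]
    if not present:
        return [NULL] * len(values)
    low, high = min(present), max(present)
    # Fall back to a width of 1.0 for constant columns to avoid dividing by zero.
    width = (high - low) / n_bins or 1.0
    codes = []
    for v in values:
        if v is None:
            codes.append(NULL)
        else:
            # Clamp so that the maximum value lands in the last bin.
            codes.append(min(int((v - low) / width), n_bins - 1))
    return codes


def discretize_categorical(values, n_top=10):
    """Keep the n_top most frequent categories; None becomes its own category."""
    counts = Counter(v for v in values if v is not None)
    frequent = {category for category, _ in counts.most_common(n_top)}
    return [
        NULL if v is None else (v if v in frequent else OTHER)
        for v in values
    ]


print(discretize_numeric([0.0, 5.0, 10.0, None]))
# [0, 5, 9, '<null>']
print(discretize_categorical(["a", "a", "b", None], n_top=1))
# ['a', 'a', '<other>', '<null>']
```

Because nulls keep their own label instead of being dropped, two columns that tend to be missing on the same rows will share a populated cell in the contingency table, which is how missingness patterns can contribute to the association.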
- Parameters:
- df : dataframe
The dataframe whose columns will be compared to each other.
- Returns:
- dataframe
The computed associations.
Notes
Cramér’s V is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive).
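For reference, the textbook definition is V = sqrt(chi2 / (n * min(r - 1, c - 1))), where chi2 is the chi-squared statistic of the r-by-c contingency table and n is the number of observations. The sketch below computes it for two already-discretized columns. It is a minimal illustration under that definition; the function name `cramers_v` is hypothetical, and whether skrub applies any correction on top of this formula is not stated here.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed counts per (row, column) cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or min_dim == 0:
        # A constant (or empty) column carries no association information.
        return 0.0
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))


# Perfectly associated labels give 1.0, independent labels give 0.0.
print(cramers_v(["a", "b", "a", "b"], [1, 2, 1, 2]))  # 1.0
print(cramers_v(["a", "a", "b", "b"], [1, 2, 1, 2]))  # 0.0
```

The value depends only on how the labels co-occur, not on their order or meaning, which is why numeric columns have to be binned before the statistic can be computed.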
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import skrub
>>> pd.set_option('display.width', 200)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.precision', 4)
>>> rng = np.random.default_rng(33)
>>> df = pd.DataFrame({f"c_{i}": rng.random(size=20)*10 for i in range(5)})
>>> df["c_str"] = [f"val {i}" for i in range(df.shape[0])]
>>> df.shape
(20, 6)
>>> df.head()
      c_0     c_1     c_2     c_3     c_4  c_str
0  4.4364  4.0114  6.9271  7.0970  4.8913  val 0
1  5.6849  0.7192  7.6430  4.6441  2.5116  val 1
2  9.0810  9.4011  1.9257  5.7429  6.2358  val 2
3  2.5425  2.9678  9.7801  9.9879  6.0709  val 3
4  5.8878  9.3223  5.3840  7.2006  2.1494  val 4
>>> associations = skrub.column_associations(df)
>>> associations
    left_column_name  left_column_idx  right_column_name  right_column_idx  cramer_v
0                c_3                3              c_str                 5    0.8215
1                c_1                1                c_4                 4    0.8215
2                c_0                0                c_1                 1    0.8215
3                c_2                2              c_str                 5    0.7551
4                c_0                0              c_str                 5    0.7551
5                c_0                0                c_3                 3    0.7551
6                c_1                1                c_3                 3    0.6837
7                c_0                0                c_4                 4    0.6837
8                c_4                4              c_str                 5    0.6837
9                c_3                3                c_4                 4    0.6053
10               c_2                2                c_3                 3    0.6053
11               c_1                1              c_str                 5    0.6053
12               c_0                0                c_2                 2    0.6053
13               c_2                2                c_4                 4    0.5169
14               c_1                1                c_2                 2    0.4122
>>> pd.reset_option('display.width')
>>> pd.reset_option('display.max_columns')
>>> pd.reset_option('display.precision')
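The 15 rows in the result above are consistent with each unordered pair of the 6 columns appearing exactly once. A quick check with the standard library, independent of skrub, enumerates those pairs:

```python
from itertools import combinations

# Column names taken from the example dataframe above.
columns = ["c_0", "c_1", "c_2", "c_3", "c_4", "c_str"]

# Unordered pairs: (c_0, c_1) is listed, (c_1, c_0) is not.
pairs = list(combinations(columns, 2))

print(len(pairs))  # 15
```

In general, a dataframe with k columns yields k * (k - 1) / 2 rows, so the size of the result grows quadratically with the number of columns.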