How to find correlated columns in a dataframe

In addition to TableReport’s Associations tab, you can compute associations using the column_associations() function, which returns a dataframe containing the associations.

The reported metrics are Cramer’s V statistic and Pearson’s correlation coefficient. Each row of the result describes one pair of columns: it gives the name and positional index (idx) of the left and right column, followed by both association measures. Rows are sorted in descending order of Cramer’s V.

This gives direct access to the information shown in the TableReport so that it can be reused later (e.g., to select which columns to drop).

from skrub import column_associations
from skrub.datasets import fetch_employee_salaries
import pandas as pd
path = fetch_employee_salaries().path
df = pd.read_csv(path)
column_associations(df).head()

      left_column_name  left_column_idx        right_column_name  right_column_idx  cramer_v  pearson_corr
0           department                1          department_name                 2  1.000000           NaN
1  assignment_category                4    current_annual_salary                 8  0.635525           NaN
2             division                3      assignment_category                 4  0.601097           NaN
3  assignment_category                4  employee_position_title                 5  0.496814           NaN
4             division                3  employee_position_title                 5  0.416034           NaN
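
As one possible way to act on this table, the sketch below picks columns to drop when they are strongly associated with another column. It is only an illustration under stated assumptions: the helper name `columns_to_drop` and the 0.9 threshold are hypothetical choices, not part of skrub, and the associations table is copied by hand from the output above so that the snippet runs without downloading the dataset. In practice the table would come from `column_associations(df)`.

```python
import pandas as pd

# Rows copied from the output shown above (only the columns used here).
# In practice: associations = column_associations(df)
associations = pd.DataFrame(
    {
        "left_column_name": [
            "department",
            "assignment_category",
            "division",
            "assignment_category",
            "division",
        ],
        "right_column_name": [
            "department_name",
            "current_annual_salary",
            "assignment_category",
            "employee_position_title",
            "employee_position_title",
        ],
        "cramer_v": [1.000000, 0.635525, 0.601097, 0.496814, 0.416034],
    }
)


def columns_to_drop(associations, threshold=0.9):
    """Hypothetical helper: for each strongly associated pair, keep the
    left column and mark the right column for removal."""
    strong = associations[associations["cramer_v"] >= threshold]
    to_drop = []
    for left, right in zip(strong["left_column_name"], strong["right_column_name"]):
        # Skip pairs where one side is already marked for removal,
        # so that both members of a pair are never dropped together.
        if left in to_drop or right in to_drop:
            continue
        to_drop.append(right)
    return to_drop


to_drop = columns_to_drop(associations)
print(to_drop)  # ['department_name']
```

The resulting list could then be passed to `df.drop(columns=to_drop)`. Keeping the left column of each pair is an arbitrary convention; which column of a redundant pair to keep usually depends on the downstream task.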