Hands-On with Column Selection and Transformers#

In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_learner() to create pipelines.

In this example, we show how to build more flexible pipelines by selecting and transforming dataframe columns with arbitrary logic.

We begin by loading a dataset with heterogeneous datatypes and replacing the default pandas display with the TableReport display via skrub.set_config().

import skrub
from skrub.datasets import fetch_employee_salaries

skrub.set_config(use_tablereport=True)
data = fetch_employee_salaries()
X, y = data.X, data.y
X

Our goal is now to apply a StringEncoder to two columns of our choosing: division and employee_position_title.

We can achieve this using ApplyToCols, whose job is to apply a transformer to each selected column independently and to let unmatched columns through without changes. It can be seen as a handy drop-in replacement for scikit-learn's ColumnTransformer.

Since we select two columns and set the number of components to 30, ApplyToCols creates 2 × 30 = 60 embedding columns in the output dataframe Xt, whose names we prefix with lsa_ via rename_columns.

from skrub import ApplyToCols, StringEncoder

apply_string_encoder = ApplyToCols(
    StringEncoder(n_components=30),
    cols=["division", "employee_position_title"],
    rename_columns="lsa_{}",
)
Xt = apply_string_encoder.fit_transform(X)
Xt
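
As a quick sanity check on the claim above, we can count the newly created embedding columns (a small verification snippet, not part of the original example):

lsa_cols = [c for c in Xt.columns if c.startswith("lsa_")]
print(len(lsa_cols))  # 2 columns x 30 components = 60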

In addition to the ApplyToCols class, the ApplyToFrame class is useful for transformers that operate on multiple columns at once, such as PCA, which jointly reduces the dimensionality of the selected columns.

To select columns without hardcoding their names, we introduce selectors, which allow for flexible matching patterns and composable logic.

The regex selector below matches all columns prefixed with "lsa" and passes them to ApplyToFrame, which assembles these columns into a dataframe and finally hands it to the PCA.

from sklearn.decomposition import PCA

from skrub import ApplyToFrame
from skrub import selectors as s

apply_pca = ApplyToFrame(PCA(n_components=8), cols=s.regex("lsa"))
Xt = apply_pca.fit_transform(Xt)
Xt

These two transformers are scikit-learn compatible and can be chained together within a Pipeline.

from sklearn.pipeline import make_pipeline

Xt = make_pipeline(
    apply_string_encoder,
    apply_pca,
).fit_transform(X)

Note that selectors also come in handy in a pipeline to select or drop columns, using SelectCols and DropCols; a DropCols variant is sketched after the example below.

from sklearn.preprocessing import StandardScaler

from skrub import SelectCols

# Select only numerical columns
pipeline = make_pipeline(
    SelectCols(cols=s.numeric()),
    StandardScaler(),
).set_output(transform="pandas")
pipeline.fit_transform(Xt)
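
DropCols works the other way around: instead of listing the columns to keep, we list the columns to remove. Assuming DropCols accepts selectors the same way SelectCols does, the sketch below drops every non-numeric column and should yield the same result:

from skrub import DropCols

# Drop everything that is *not* numeric, rather than selecting numeric columns
pipeline = make_pipeline(
    DropCols(cols=~s.numeric()),
    StandardScaler(),
).set_output(transform="pandas")
pipeline.fit_transform(Xt)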

Let’s run through one more example to showcase the expressiveness of the selectors. Suppose we want to apply an OrdinalEncoder on categorical columns with low cardinality (e.g., fewer than 40 unique values).

We define a column filter using skrub selectors with a lambda function. Note that the same effect can be obtained directly with the built-in cardinality_below() selector, as sketched after the example below.

from sklearn.preprocessing import OrdinalEncoder

low_cardinality = s.filter(lambda col: col.nunique() < 40)
ApplyToCols(OrdinalEncoder(), cols=s.string() & low_cardinality).fit_transform(X)
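
For reference, here is the same selection written with the built-in cardinality_below() selector instead of a lambda (assuming a threshold of 40 reproduces the filter above):

# Same selection using the built-in cardinality_below() selector
ApplyToCols(
    OrdinalEncoder(), cols=s.string() & s.cardinality_below(40)
).fit_transform(X)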

Notice how we composed the selector with string() using the & (logical and) operator. The resulting selector matches string columns with cardinality below 40.
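
Selectors support the usual boolean operators (& for intersection, | for union, ~ for negation), and their expand() method lets us inspect which columns a selector matches on a given dataframe. A minimal sketch:

# Which columns does each selector match on X?
print(s.string().expand(X))
print((s.string() & low_cardinality).expand(X))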

We can also define the opposite selector high_cardinality using the negation operator ~ and apply a skrub.StringEncoder to vectorize those columns.

from sklearn.ensemble import HistGradientBoostingRegressor

high_cardinality = ~low_cardinality
pipeline = make_pipeline(
    ApplyToCols(
        OrdinalEncoder(),
        cols=s.string() & low_cardinality,
    ),
    ApplyToCols(
        StringEncoder(),
        cols=s.string() & high_cardinality,
    ),
    HistGradientBoostingRegressor(),
).fit(X, y)
pipeline
Pipeline(steps=[('applytocols-1',
                 ApplyToCols(cols=(string() & filter(<lambda>)),
                             transformer=OrdinalEncoder())),
                ('applytocols-2',
                 ApplyToCols(cols=(string() & (~filter(<lambda>))),
                             transformer=StringEncoder())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])
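
As a quick sanity check, the full pipeline can be evaluated with cross-validation; a minimal sketch (the exact scores depend on your environment):

from sklearn.model_selection import cross_val_score

# R^2 scores (the default for regressors), estimated with 3-fold cross-validation
scores = cross_val_score(pipeline, X, y, cv=3)
print(scores.mean())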


Interestingly, the pipeline above is similar to the datatype dispatching performed by TableVectorizer, also used in tabular_learner().

Click on the dropdown arrows next to each datatype to see how the columns are mapped to the different transformers in TableVectorizer.

from skrub import tabular_learner

tabular_learner("regressor").fit(X, y)
/home/circleci/project/skrub/_tabular_pipeline.py:75: FutureWarning:

tabular_learner will be deprecated in the next release. Equivalent functionality is available in skrub.set_config.
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

