Hands-On with Column Selection and Transformers

In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_pipeline() to create pipelines.

In this new example, we show how to create more flexible pipelines by selecting and transforming dataframe columns using arbitrary logic.

We begin by loading a dataset with heterogeneous datatypes and replacing the default pandas display with the TableReport display via skrub.patch_display().

import pandas as pd

import skrub
from skrub.datasets import fetch_employee_salaries

skrub.patch_display()
file_path = fetch_employee_salaries().path
data = pd.read_csv(file_path)
X = data.drop(columns="current_annual_salary")
y = data["current_annual_salary"]
X

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	09/22/1986	1,986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1,988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1,989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	05/05/2014	2,014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	03/05/2007	2,007

9,223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	11/03/2015	2,015
9,224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	11/28/1988	1,988
9,225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Services	Parttime-Regular	Medical Doctor IV - Psychiatrist	04/30/2001	2,001
9,226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2,006
9,227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2,012

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	StringDtype	False	17 (0.2%)	2 (< 0.1%)
1	department	StringDtype	False	0 (0.0%)	37 (0.4%)
2	department_name	StringDtype	False	0 (0.0%)	37 (0.4%)
3	division	StringDtype	False	0 (0.0%)	694 (7.5%)
4	assignment_category	StringDtype	False	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	StringDtype	False	0 (0.0%)	443 (4.8%)
6	date_first_hired	StringDtype	False	0 (0.0%)	2264 (24.5%)
7	year_first_hired	Int64DType	False	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016

Column 1	Column 2	Cramér's V
department	department_name	1.00
division	assignment_category	0.593
assignment_category	employee_position_title	0.497
department_name	assignment_category	0.422
department	assignment_category	0.422
department	employee_position_title	0.413
department_name	employee_position_title	0.413
division	employee_position_title	0.410
department	division	0.381
department_name	division	0.381
gender	department	0.380
gender	department_name	0.380
gender	assignment_category	0.294
gender	employee_position_title	0.275
gender	division	0.265
employee_position_title	date_first_hired	0.179
date_first_hired	year_first_hired	0.151
department	date_first_hired	0.150
department_name	date_first_hired	0.150
employee_position_title	year_first_hired	0.131
gender	date_first_hired	0.104
division	year_first_hired	0.0862
department	year_first_hired	0.0811
department_name	year_first_hired	0.0811
assignment_category	date_first_hired	0.0756
division	date_first_hired	0.0728
gender	year_first_hired	0.0641
assignment_category	year_first_hired	0.0519

Our goal is now to apply a StringEncoder to two columns of our choosing: division and employee_position_title.

We can achieve this using ApplyToCols, which applies a transformer to the selected columns and passes unmatched columns through unchanged. It can be seen as a handy drop-in replacement for scikit-learn's ColumnTransformer.

Since we selected two columns and set the number of components to 30 for each, ApplyToCols will create 2 × 30 = 60 embedding columns in the dataframe Xt, which we prefix with lsa_.

from skrub import ApplyToCols, StringEncoder

apply_string_encoder = ApplyToCols(
    StringEncoder(n_components=30),
    cols=["division", "employee_position_title"],
    rename_columns="lsa_{}",
)
Xt = apply_string_encoder.fit_transform(X)
Xt

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	StringDtype	False	17 (0.2%)	2 (< 0.1%)
1	department	StringDtype	False	0 (0.0%)	37 (0.4%)
2	department_name	StringDtype	False	0 (0.0%)	37 (0.4%)
3	lsa_division_00	Float32DType	False	0 (0.0%)	685 (7.4%)	0.223	0.281	8.44e-05	0.134	1.13
4	lsa_division_01	Float32DType	False	0 (0.0%)	685 (7.4%)	0.179	0.280	-0.551	0.177	0.784
5	lsa_division_02	Float32DType	False	0 (0.0%)	685 (7.4%)	0.0284	0.314	-0.615	-0.00738	0.977
6	lsa_division_03	Float32DType	False	0 (0.0%)	685 (7.4%)	0.0428	0.305	-0.443	-0.00381	1.10
7	lsa_division_04	Float32DType	False	0 (0.0%)	688 (7.5%)	-0.0379	0.233	-0.774	-0.00801	0.362
8	lsa_division_05	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.0430	0.218	-0.678	-0.0435	0.691
9	lsa_division_06	Float32DType	False	0 (0.0%)	694 (7.5%)	0.0115	0.212	-0.338	-0.00442	1.20
10	lsa_division_07	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00129	0.206	-0.822	-0.00478	0.942
11	lsa_division_08	Float32DType	False	0 (0.0%)	692 (7.5%)	-0.0148	0.199	-0.826	-0.0126	0.674
12	lsa_division_09	Float32DType	False	0 (0.0%)	694 (7.5%)	0.0293	0.188	-0.178	0.00183	1.31
13	lsa_division_10	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00125	0.185	-0.551	0.00335	1.07
14	lsa_division_11	Float32DType	False	0 (0.0%)	693 (7.5%)	0.0192	0.175	-0.637	0.00608	0.612
15	lsa_division_12	Float32DType	False	0 (0.0%)	694 (7.5%)	0.0127	0.169	-0.626	-0.00488	0.919
16	lsa_division_13	Float32DType	False	0 (0.0%)	694 (7.5%)	0.0291	0.157	-0.398	-0.000230	1.01
17	lsa_division_14	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.0125	0.154	-0.503	-0.00473	0.559
18	lsa_division_15	Float32DType	False	0 (0.0%)	694 (7.5%)	0.0102	0.151	-0.580	0.00736	0.753
19	lsa_division_16	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00131	0.148	-0.525	-0.00243	0.554
20	lsa_division_17	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.0205	0.140	-0.423	0.000758	0.490
21	lsa_division_18	Float32DType	False	0 (0.0%)	694 (7.5%)	0.0152	0.140	-0.611	0.00745	0.508
22	lsa_division_19	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00209	0.140	-0.407	0.000610	0.558
23	lsa_division_20	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00864	0.132	-0.660	-0.00157	0.555
24	lsa_division_21	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.00641	0.131	-0.491	0.00138	0.822
25	lsa_division_22	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.00685	0.127	-0.455	-0.00134	0.539
26	lsa_division_23	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.00294	0.121	-0.685	0.00157	0.505
27	lsa_division_24	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00303	0.119	-0.725	-0.00224	0.617
28	lsa_division_25	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00285	0.116	-0.416	0.00636	0.750
29	lsa_division_26	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.000121	0.115	-0.337	-0.00348	0.688
30	lsa_division_27	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00794	0.113	-0.442	-0.00146	0.476
31	lsa_division_28	Float32DType	False	0 (0.0%)	694 (7.5%)	0.00360	0.109	-0.375	-0.00393	0.573
32	lsa_division_29	Float32DType	False	0 (0.0%)	694 (7.5%)	-0.000215	0.107	-0.317	-0.00172	0.545
33	assignment_category	StringDtype	False	0 (0.0%)	2 (< 0.1%)
34	lsa_employee_position_title_00	Float32DType	False	0 (0.0%)	443 (4.8%)	0.230	0.315	0.000534	0.0888	1.11
35	lsa_employee_position_title_01	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0831	0.338	-0.342	0.00643	1.09
36	lsa_employee_position_title_02	Float32DType	False	0 (0.0%)	443 (4.8%)	0.110	0.301	-0.0618	0.0156	1.16
37	lsa_employee_position_title_03	Float32DType	False	0 (0.0%)	443 (4.8%)	0.108	0.275	-0.164	0.0336	0.984
38	lsa_employee_position_title_04	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0766	0.242	-0.617	0.0431	0.658
39	lsa_employee_position_title_05	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0532	0.208	-0.315	0.000349	0.954
40	lsa_employee_position_title_06	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0332	0.192	-0.323	0.00455	0.782
41	lsa_employee_position_title_07	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0123	0.190	-0.547	-0.00157	0.640
42	lsa_employee_position_title_08	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0359	0.178	-0.312	0.00360	0.590
43	lsa_employee_position_title_09	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00944	0.181	-0.620	-0.00301	0.812
44	lsa_employee_position_title_10	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0198	0.175	-0.503	-0.00949	0.730
45	lsa_employee_position_title_11	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.000953	0.173	-0.458	0.0114	0.815
46	lsa_employee_position_title_12	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.00365	0.168	-0.332	-0.00364	0.961
47	lsa_employee_position_title_13	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0136	0.165	-0.331	0.00564	0.649
48	lsa_employee_position_title_14	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00150	0.161	-0.375	-0.0214	0.465
49	lsa_employee_position_title_15	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0191	0.155	-0.203	0.000354	1.10
50	lsa_employee_position_title_16	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0113	0.147	-0.311	-8.11e-06	0.839
51	lsa_employee_position_title_17	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00431	0.140	-0.260	-0.00162	0.757
52	lsa_employee_position_title_18	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00522	0.137	-0.377	0.00172	1.00
53	lsa_employee_position_title_19	Float32DType	False	0 (0.0%)	443 (4.8%)	0.0104	0.133	-0.343	-0.00210	0.467
54	lsa_employee_position_title_20	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.00300	0.130	-0.436	-0.00930	0.485
55	lsa_employee_position_title_21	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00396	0.128	-0.538	-0.0142	0.553
56	lsa_employee_position_title_22	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00136	0.126	-0.335	-0.000361	0.500
57	lsa_employee_position_title_23	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00775	0.122	-0.328	-0.00613	0.945
58	lsa_employee_position_title_24	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.00134	0.122	-0.590	0.00416	0.689
59	lsa_employee_position_title_25	Float32DType	False	0 (0.0%)	443 (4.8%)	0.00925	0.119	-0.189	0.00503	0.661
60	lsa_employee_position_title_26	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.000847	0.114	-0.383	0.00144	0.493
61	lsa_employee_position_title_27	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.00444	0.108	-0.473	-0.00100	0.414
62	lsa_employee_position_title_28	Float32DType	False	0 (0.0%)	443 (4.8%)	0.000971	0.104	-0.298	-0.00388	0.618
63	lsa_employee_position_title_29	Float32DType	False	0 (0.0%)	443 (4.8%)	-0.00536	0.0975	-0.850	0.0105	0.175
64	date_first_hired	StringDtype	False	0 (0.0%)	2264 (24.5%)
65	year_first_hired	Int64DType	False	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016

The ApplyToCols class can detect automatically whether the transformer is a SingleColumnTransformer (i.e., it can only be applied to one column at a time) or not, and apply it accordingly. The StringEncoder is a SingleColumnTransformer and thus applied to each column independently.
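To make the "each column independently" behaviour concrete, here is a minimal sketch that mimics it by hand with pandas and scikit-learn only. It is not skrub's implementation: the helper apply_to_each_col, the toy dataframe, and the tf-idf plus truncated SVD pipeline used as a stand-in for a single-column text encoder are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical values, same spirit as the employee salaries table)
toy = pd.DataFrame(
    {
        "division": ["Records Management", "Major Crimes", "Case Management", "Security"],
        "title": ["Office Coordinator", "Police Officer", "Social Worker", "Supervisor"],
        "year": [1986, 1988, 1989, 2014],
    }
)

# Stand-in single-column text encoder (an assumption, not skrub's StringEncoder)
encoder = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    TruncatedSVD(n_components=2, random_state=0),
)


def apply_to_each_col(df, transformer, cols, prefix="lsa_"):
    """Fit a fresh clone of `transformer` on each selected column separately."""
    parts = []
    for name in df.columns:
        if name in cols:
            fitted = clone(transformer).fit(df[name])
            values = fitted.transform(df[name])
            out_names = [f"{prefix}{name}_{i}" for i in range(values.shape[1])]
            parts.append(pd.DataFrame(values, columns=out_names, index=df.index))
        else:
            # Unmatched columns pass through unchanged
            parts.append(df[[name]])
    return pd.concat(parts, axis=1)


toy_t = apply_to_each_col(toy, encoder, cols=["division", "title"])
print(list(toy_t.columns))
```

Each selected column gets its own fitted copy of the encoder, so the vocabulary learned on division never leaks into title, and the new columns take the place of the column they were derived from.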

The ApplyToCols class can also be used with transformers that are applied to multiple columns at once, such as PCA. Here, we want to use PCA to reduce the number of dimensions of the new lsa_ columns.

To select columns without hardcoding their names, we introduce selectors, which allow for flexible pattern matching and composable logic.

The regex selector below matches all columns prefixed with "lsa" and passes them to ApplyToCols, which assembles these columns into a dataframe and passes it to the PCA.

Note that ApplyToCols will automatically detect that PCA is not a SingleColumnTransformer and apply it to the whole sub-dataframe of columns chosen by the selector at once.

from sklearn.decomposition import PCA

from skrub import selectors as s

apply_pca = ApplyToCols(PCA(n_components=8), cols=s.regex("lsa"))
Xt = apply_pca.fit_transform(Xt)
Xt

	gender	department	department_name	assignment_category	date_first_hired	year_first_hired	pca0	pca1	pca2	pca3	pca4	pca5	pca6	pca7
0	F	POL	Department of Police	Fulltime-Regular	09/22/1986	1,986	0.113	-0.0331	0.0343	-0.133	-0.246	0.122	0.0579	0.375
1	M	POL	Department of Police	Fulltime-Regular	09/12/1988	1,988	0.499	0.140	-0.101	-0.0479	-0.167	0.236	-0.0248	0.275
2	F	HHS	Department of Health and Human Services	Fulltime-Regular	11/19/1989	1,989	-0.122	-0.124	0.393	0.0332	0.110	-0.367	-0.0472	0.364
3	M	COR	Correction and Rehabilitation	Fulltime-Regular	05/05/2014	2,014	-0.0733	-0.0882	0.167	-0.184	-0.398	-0.115	-0.323	-0.0888
4	M	HCA	Department of Housing and Community Affairs	Fulltime-Regular	03/05/2007	2,007	-0.0834	-0.0603	0.192	-0.237	-0.0749	-0.0451	0.205	-0.0660

9,223	F	HHS	Department of Health and Human Services	Fulltime-Regular	11/03/2015	2,015	-0.105	-0.124	0.403	0.618	-0.0456	0.352	0.0133	-0.406
9,224	F	FRS	Fire and Rescue Services	Fulltime-Regular	11/28/1988	1,988	-0.125	0.122	0.0631	-0.0400	-0.136	0.0860	0.0707	0.130
9,225	M	HHS	Department of Health and Human Services	Parttime-Regular	04/30/2001	2,001	-0.116	-0.131	0.306	0.202	0.0381	-0.121	-0.0168	0.0251
9,226	M	CCL	County Council	Fulltime-Regular	09/05/2006	2,006	-0.127	0.0353	0.225	-0.412	0.663	0.382	-0.561	0.0160
9,227	M	DLC	Department of Liquor Control	Fulltime-Regular	01/30/2012	2,012	-0.130	0.0597	0.127	-0.163	-0.0907	-0.00561	0.152	0.0481

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	StringDtype	False	17 (0.2%)	2 (< 0.1%)
1	department	StringDtype	False	0 (0.0%)	37 (0.4%)
2	department_name	StringDtype	False	0 (0.0%)	37 (0.4%)
3	assignment_category	StringDtype	False	0 (0.0%)	2 (< 0.1%)
4	date_first_hired	StringDtype	False	0 (0.0%)	2264 (24.5%)
5	year_first_hired	Int64DType	False	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016
6	pca0	Float32DType	False	0 (0.0%)	2779 (30.1%)	2.11e-08	0.476	-0.516	-0.107	1.48
7	pca1	Float32DType	False	0 (0.0%)	2779 (30.1%)	2.15e-08	0.445	-0.985	-0.0617	1.28
8	pca2	Float32DType	False	0 (0.0%)	2779 (30.1%)	6.20e-10	0.406	-1.08	0.136	0.729
9	pca3	Float32DType	False	0 (0.0%)	2779 (30.1%)	-3.41e-08	0.318	-0.659	-0.0400	1.33
10	pca4	Float32DType	False	0 (0.0%)	2779 (30.1%)	-4.34e-09	0.277	-1.02	0.0133	0.877
11	pca5	Float32DType	False	0 (0.0%)	2779 (30.1%)	-1.51e-08	0.266	-0.714	-0.0411	1.15
12	pca6	Float32DType	False	0 (0.0%)	2779 (30.1%)	8.53e-10	0.265	-1.12	0.0128	0.704
13	pca7	Float32DType	False	0 (0.0%)	2779 (30.1%)	3.88e-09	0.249	-0.724	-0.0382	0.772

Column 1	Column 2	Cramér's V	Pearson's Correlation
department	department_name	1.00
pca1	pca2	0.617	0.00847
pca0	pca2	0.537	0.00792
pca2	pca3	0.536	0.00363
pca4	pca6	0.508	0.0273
department_name	pca2	0.493
department	pca2	0.493
department_name	pca1	0.490
department	pca1	0.490
pca5	pca7	0.485	-0.0180
assignment_category	pca5	0.471
assignment_category	pca7	0.465
pca6	pca7	0.458	0.0239
assignment_category	pca3	0.454
pca3	pca7	0.446	-0.0145
pca0	pca1	0.439	0.00163
assignment_category	pca2	0.424
department	assignment_category	0.422
department_name	assignment_category	0.422
department_name	pca0	0.402
department	pca0	0.402
gender	department	0.380
gender	department_name	0.380
pca4	pca5	0.379	0.00966
pca2	pca7	0.378	0.0113
pca3	pca6	0.378	-0.00490
department	pca4	0.377
department_name	pca4	0.377
pca3	pca5	0.373	-0.000446
assignment_category	pca1	0.370
department	pca7	0.365
department_name	pca7	0.365
pca1	pca3	0.349	-0.00197
department	pca6	0.347
department_name	pca6	0.347
pca1	pca7	0.328	-0.00213
pca4	pca7	0.327	0.0195
pca3	pca4	0.326	0.0277
pca5	pca6	0.324	0.00838
pca2	pca5	0.321	-0.00382
pca2	pca4	0.319	-0.0149
department	pca3	0.312
department_name	pca3	0.312
gender	pca2	0.308
department	pca5	0.301
department_name	pca5	0.301
pca0	pca3	0.296	-0.000132
gender	assignment_category	0.294
pca0	pca4	0.286	0.00936
pca0	pca7	0.278	-0.00750
pca1	pca5	0.267	-0.00307
pca2	pca6	0.265	0.00585
gender	pca1	0.258
pca0	pca6	0.257	-0.00983
gender	pca3	0.256
assignment_category	pca0	0.256
gender	pca0	0.252
date_first_hired	pca1	0.251
assignment_category	pca4	0.243
gender	pca5	0.242
gender	pca7	0.232
pca0	pca5	0.231	-0.00537
date_first_hired	pca2	0.224
gender	pca6	0.222
pca1	pca4	0.217	0.00294
assignment_category	pca6	0.212
date_first_hired	pca0	0.193
pca1	pca6	0.186	-0.00145
date_first_hired	year_first_hired	0.151
department	date_first_hired	0.150
department_name	date_first_hired	0.150
gender	pca4	0.126
date_first_hired	pca7	0.121
date_first_hired	pca4	0.111
gender	date_first_hired	0.104
date_first_hired	pca3	0.102
year_first_hired	pca6	0.0938	0.0180
year_first_hired	pca0	0.0938	-0.0210
year_first_hired	pca2	0.0916	-0.125
year_first_hired	pca1	0.0892	-0.0369
year_first_hired	pca4	0.0839	-0.0514
date_first_hired	pca5	0.0824
year_first_hired	pca7	0.0822	-0.0689
department	year_first_hired	0.0811
department_name	year_first_hired	0.0811
year_first_hired	pca5	0.0775	-0.0246
date_first_hired	pca6	0.0773
assignment_category	date_first_hired	0.0756
year_first_hired	pca3	0.0690	0.0638
gender	year_first_hired	0.0641
assignment_category	year_first_hired	0.0519
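The sub-dataframe behaviour of the PCA step above can be mimicked with plain pandas and scikit-learn. The sketch below is an illustration under assumptions, not skrub's code: the toy dataframe is made up, and we assume the pattern is matched from the start of the column name (via re.match), which is consistent with the "prefixed with" wording above.

```python
import re

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M", "F", "M"],
        "lsa_division_0": rng.normal(size=6),
        "lsa_division_1": rng.normal(size=6),
        "lsa_title_0": rng.normal(size=6),
        "year": [1986, 1988, 1989, 2014, 2007, 2015],
    }
)

# Select every column whose name starts with "lsa"
selected = [name for name in toy.columns if re.match("lsa", name)]

# Fit a single PCA on the whole block of selected columns at once
pca = PCA(n_components=2)
components = pca.fit_transform(toy[selected])
pca_frame = pd.DataFrame(components, columns=["pca0", "pca1"], index=toy.index)

# Replace the selected block with the PCA output, keep the other columns
result = pd.concat([toy.drop(columns=selected), pca_frame], axis=1)
print(list(result.columns))
```

Because PCA centers its input, the resulting components have a mean of roughly zero, which matches the near-zero means reported for pca0 to pca7 in the table above.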

These two ApplyToCols instances are scikit-learn transformers and can be chained together within a Pipeline.

from sklearn.pipeline import make_pipeline

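encode_then_reduce = make_pipeline(
    apply_string_encoder,
    apply_pca,
)
# fit_transform returns the transformed dataframe, not a fitted model
Xt_chained = encode_then_reduce.fit_transform(X)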

Under the hood of ApplyToCols

ApplyToCols is implemented using the ApplyToEachCol and ApplyToSubFrame classes. The former applies a transformer to each column independently, while the latter applies a transformer to a sub-dataframe. Normally, users don’t need to worry about these two classes, but they can be useful when more control is needed.

Note that selectors also come in handy in a pipeline to select or drop columns, using SelectCols and DropCols.

from sklearn.preprocessing import StandardScaler

from skrub import SelectCols

# Select only numerical columns
pipeline = make_pipeline(
    SelectCols(cols=s.numeric()),
    StandardScaler(),
).set_output(transform="pandas")
pipeline.fit_transform(Xt)

	year_first_hired	pca0	pca1	pca2	pca3	pca4	pca5	pca6	pca7
0	-1.89	0.238	-0.0744	0.0845	-0.418	-0.889	0.458	0.219	1.50
1	-1.67	1.05	0.316	-0.249	-0.151	-0.602	0.886	-0.0937	1.10
2	-1.57	-0.257	-0.279	0.967	0.104	0.397	-1.38	-0.178	1.46
3	1.12	-0.154	-0.198	0.410	-0.578	-1.44	-0.433	-1.22	-0.356
4	0.365	-0.175	-0.136	0.473	-0.744	-0.270	-0.169	0.776	-0.265

9,223	1.22	-0.222	-0.278	0.992	1.94	-0.165	1.32	0.0501	-1.63
9,224	-1.67	-0.263	0.273	0.155	-0.126	-0.490	0.323	0.267	0.523
9,225	-0.279	-0.244	-0.294	0.753	0.635	0.138	-0.454	-0.0636	0.101
9,226	0.258	-0.266	0.0794	0.554	-1.29	2.39	1.44	-2.12	0.0643
9,227	0.901	-0.272	0.134	0.313	-0.512	-0.327	-0.0211	0.573	0.193

Column	Column name	dtype	Is sorted	Unique values	Mean	Std	Min	Median	Max
0	year_first_hired	Float64DType	False	51 (0.6%)	-1.15e-14	1.00	-4.14	0.150	1.33
1	pca0	Float64DType	False	2779 (30.1%)	1.31e-17	1.00	-1.09	-0.224	3.12
2	pca1	Float64DType	False	2779 (30.1%)	-1.39e-17	1.00	-2.21	-0.139	2.87
3	pca2	Float64DType	False	2779 (30.1%)	1.00e-17	1.00	-2.65	0.334	1.79
4	pca3	Float64DType	False	2779 (30.1%)	2.00e-17	1.00	-2.07	-0.126	4.19
5	pca4	Float64DType	False	2779 (30.1%)	-1.39e-17	1.00	-3.67	0.0479	3.17
6	pca5	Float64DType	False	2779 (30.1%)	6.93e-18	1.00	-2.68	-0.155	4.33
7	pca6	Float64DType	False	2779 (30.1%)	3.85e-18	1.00	-4.23	0.0485	2.66
8	pca7	Float64DType	False	2779 (30.1%)	-7.31e-18	1.00	-2.91	-0.153	3.10

Column 1	Column 2	Cramér's V	Pearson's Correlation
pca1	pca2	0.617	0.00847
pca0	pca2	0.537	0.00792
pca2	pca3	0.536	0.00363
pca4	pca6	0.508	0.0273
pca5	pca7	0.485	-0.0180
pca6	pca7	0.458	0.0239
pca3	pca7	0.446	-0.0145
pca0	pca1	0.439	0.00163
pca4	pca5	0.379	0.00966
pca2	pca7	0.378	0.0113
pca3	pca6	0.378	-0.00490
pca3	pca5	0.373	-0.000446
pca1	pca3	0.349	-0.00197
pca1	pca7	0.328	-0.00213
pca4	pca7	0.327	0.0195
pca3	pca4	0.326	0.0277
pca5	pca6	0.324	0.00838
pca2	pca5	0.321	-0.00382
pca2	pca4	0.319	-0.0149
pca0	pca3	0.296	-0.000132
pca0	pca4	0.286	0.00936
pca0	pca7	0.278	-0.00750
pca1	pca5	0.267	-0.00307
pca2	pca6	0.265	0.00585
pca0	pca6	0.257	-0.00983
pca0	pca5	0.231	-0.00537
pca1	pca4	0.217	0.00294
pca1	pca6	0.186	-0.00145
year_first_hired	pca6	0.0938	0.0180
year_first_hired	pca0	0.0938	-0.0210
year_first_hired	pca2	0.0916	-0.125
year_first_hired	pca1	0.0892	-0.0369
year_first_hired	pca4	0.0839	-0.0514
year_first_hired	pca7	0.0822	-0.0689
year_first_hired	pca5	0.0775	-0.0246
year_first_hired	pca3	0.0690	0.0638

Let’s run through one more example to showcase the expressiveness of the selectors. Suppose we want to apply an OrdinalEncoder on categorical columns with low cardinality (e.g., fewer than 40 unique values).

We define a column filter using skrub selectors with a lambda function. Note that the same effect can be obtained directly by using cardinality_below().

from sklearn.preprocessing import OrdinalEncoder

low_cardinality = s.filter(lambda col: col.nunique() < 40)
ApplyToCols(OrdinalEncoder(), cols=s.string() & low_cardinality).fit_transform(X)

	division	employee_position_title	date_first_hired	year_first_hired	gender	department	department_name	assignment_category
0	MSB Information Mgmt and Tech Division Records Management Section	Office Services Coordinator	09/22/1986	1,986	0.00	32.0	14.0	0.00
1	ISB Major Crimes Division Fugitive Section	Master Police Officer	09/12/1988	1,988	1.00	32.0	14.0	0.00
2	Adult Protective and Case Management Services	Social Worker IV	11/19/1989	1,989	0.00	19.0	10.0	0.00
3	PRRS Facility and Security	Resident Supervisor II	05/05/2014	2,014	1.00	6.00	4.00	0.00
4	Affordable Housing Programs	Planning Specialist III	03/05/2007	2,007	1.00	18.0	11.0	0.00

9,223	School Based Health Centers	Community Health Nurse II	11/03/2015	2,015	0.00	19.0	10.0	0.00
9,224	Human Resources Division	Fire/Rescue Division Chief	11/28/1988	1,988	0.00	17.0	20.0	0.00
9,225	Child and Adolescent Mental Health Clinic Services	Medical Doctor IV - Psychiatrist	04/30/2001	2,001	1.00	19.0	10.0	1.00
9,226	Council Central Staff	Manager II	09/05/2006	2,006	1.00	3.00	6.00	0.00
9,227	Licensure, Regulation and Education	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2,012	1.00	11.0	12.0	0.00

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	division	StringDtype	False	0 (0.0%)	694 (7.5%)
1	employee_position_title	StringDtype	False	0 (0.0%)	443 (4.8%)
2	date_first_hired	StringDtype	False	0 (0.0%)	2264 (24.5%)
3	year_first_hired	Int64DType	False	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016
4	gender	Float64DType	False	17 (0.2%)	2 (< 0.1%)	0.595	0.491	0.00	1.00	1.00
5	department	Float64DType	False	0 (0.0%)	37 (0.4%)	18.8	8.97	0.00	17.0	36.0
6	department_name	Float64DType	False	0 (0.0%)	37 (0.4%)	14.4	6.22	0.00	14.0	36.0
7	assignment_category	Float64DType	False	0 (0.0%)	2 (< 0.1%)	0.0904	0.287	0.00	0.00	1.00

Column 1	Column 2	Cramér's V	Pearson's Correlation
department	department_name	0.652	0.389
division	assignment_category	0.593
employee_position_title	assignment_category	0.497
division	employee_position_title	0.410
employee_position_title	department_name	0.389
division	department_name	0.340
employee_position_title	department	0.328
division	department	0.318
department	assignment_category	0.316	0.0679
gender	assignment_category	0.294	-0.294
gender	department_name	0.284	0.184
employee_position_title	gender	0.275
division	gender	0.265
department_name	assignment_category	0.256	-0.0874
gender	department	0.223	-0.123
employee_position_title	date_first_hired	0.179
date_first_hired	department_name	0.153
date_first_hired	year_first_hired	0.151
employee_position_title	year_first_hired	0.131
date_first_hired	department	0.117
date_first_hired	gender	0.104
division	year_first_hired	0.0862
date_first_hired	assignment_category	0.0756
division	date_first_hired	0.0728
year_first_hired	department	0.0723	-0.0516
year_first_hired	gender	0.0641	0.0395
year_first_hired	department_name	0.0610	0.00346
year_first_hired	assignment_category	0.0519	0.0217

Notice how we composed the selector with string() using the logical operator &. The resulting selector matches string columns with cardinality below 40.
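The composition logic can be pictured with a small stand-in class. The sketch below is purely illustrative: ColumnPredicate, its expand method, and the toy dataframe are hypothetical and do not reproduce skrub's selector implementation. It only shows how & and ~ can combine column-level tests.

```python
import pandas as pd


class ColumnPredicate:
    """Hypothetical stand-in for a selector: wraps a function column -> bool."""

    def __init__(self, fn):
        self.fn = fn

    def __and__(self, other):
        # Both tests must pass for a column to be selected
        return ColumnPredicate(lambda col: self.fn(col) and other.fn(col))

    def __invert__(self):
        # Select exactly the columns the original test rejects
        return ColumnPredicate(lambda col: not self.fn(col))

    def expand(self, df):
        # List the names of the columns that satisfy the test
        return [name for name in df.columns if self.fn(df[name])]


toy = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M"],
        "division": ["Records", "Crimes", "Housing", "Security"],
        "year_first_hired": [1986, 1988, 1989, 2014],
    }
)

is_string = ColumnPredicate(lambda col: pd.api.types.is_string_dtype(col))
# Threshold of 3 instead of 40 because the toy table is tiny
low_cardinality = ColumnPredicate(lambda col: col.nunique() < 3)

print((is_string & low_cardinality).expand(toy))
print((is_string & ~low_cardinality).expand(toy))
print((~low_cardinality).expand(toy))
```

Note that negating the cardinality test alone also picks up the numeric year_first_hired column, which is why the negated test is still combined with the string test before encoding.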

We can also define the opposite selector high_cardinality using the negation operator ~ and apply a StringEncoder to vectorize those columns.

from sklearn.ensemble import HistGradientBoostingRegressor

high_cardinality = ~low_cardinality
pipeline = make_pipeline(
    ApplyToCols(
        OrdinalEncoder(),
        cols=s.string() & low_cardinality,
    ),
    ApplyToCols(
        StringEncoder(),
        cols=s.string() & high_cardinality,
    ),
    HistGradientBoostingRegressor(),
).fit(X, y)
pipeline

Pipeline(steps=[('applytocols-1',
                 ApplyToCols(cols=(string() & filter(<lambda>)),
                             transformer=OrdinalEncoder())),
                ('applytocols-2',
                 ApplyToCols(cols=(string() & (~filter(<lambda>))),
                             transformer=StringEncoder())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])

Interestingly, the pipeline above is similar to the datatype dispatching performed by TableVectorizer, also used in tabular_pipeline().

Click on the dropdown arrows next to each datatype to see how the columns are mapped to the different transformers in TableVectorizer.

from skrub import tabular_pipeline

tabular_pipeline("regressor").fit(X, y)

Total running time of the script: (0 minutes 15.630 seconds)

Estimated memory usage: 500 MB

	gender	department	department_name	lsa_division_00	lsa_division_01	lsa_division_02	lsa_division_03	lsa_division_04	lsa_division_05	lsa_division_06	lsa_division_07	lsa_division_08	lsa_division_09	lsa_division_10	lsa_division_11	lsa_division_12	lsa_division_13	lsa_division_14	lsa_division_15	lsa_division_16	lsa_division_17	lsa_division_18	lsa_division_19	lsa_division_20	lsa_division_21	lsa_division_22	lsa_division_23	lsa_division_24	lsa_division_25	lsa_division_26	lsa_division_27	lsa_division_28	lsa_division_29	assignment_category	lsa_employee_position_title_00	lsa_employee_position_title_01	lsa_employee_position_title_02	lsa_employee_position_title_03	lsa_employee_position_title_04	lsa_employee_position_title_05	lsa_employee_position_title_06	lsa_employee_position_title_07	lsa_employee_position_title_08	lsa_employee_position_title_09	lsa_employee_position_title_10	lsa_employee_position_title_11	lsa_employee_position_title_12	lsa_employee_position_title_13	lsa_employee_position_title_14	lsa_employee_position_title_15	lsa_employee_position_title_16	lsa_employee_position_title_17	lsa_employee_position_title_18	lsa_employee_position_title_19	lsa_employee_position_title_20	lsa_employee_position_title_21	lsa_employee_position_title_22	lsa_employee_position_title_23	lsa_employee_position_title_24	lsa_employee_position_title_25	lsa_employee_position_title_26	lsa_employee_position_title_27	lsa_employee_position_title_28	lsa_employee_position_title_29	date_first_hired	year_first_hired
0	F	POL	Department of Police	0.218	0.353	-0.0416	-0.0900	-0.446	-0.238	-0.185	-0.0513	-0.229	-0.0690	-0.0257	-0.125	-0.0817	0.0489	0.0790	0.118	0.0212	-0.231	-0.197	0.127	-0.357	0.0603	-0.190	-0.151	-0.152	0.00963	0.0310	0.00560	-0.0285	-0.0117	Fulltime-Regular	0.398	-0.146	0.180	-0.0653	0.0948	0.0932	0.782	0.305	-0.285	-0.127	-0.119	-0.416	-0.0521	-0.172	0.203	0.00908	-0.118	-0.0229	0.0897	-0.116	-0.0295	0.0976	-0.0360	-0.0154	0.0202	-0.0287	0.0453	-0.0964	-0.0940	0.0351	09/22/1986	1,986
1	M	POL	Department of Police	0.163	0.232	-0.0295	-0.0615	-0.383	-0.0161	-0.0946	-0.0577	0.108	-0.0466	0.0327	-0.129	-0.0128	0.0293	-0.103	-0.0437	-0.0392	-0.0517	-0.273	0.251	0.146	-0.117	-0.00275	0.0329	0.0568	-0.00178	-0.0142	-0.0781	0.100	-0.0648	Fulltime-Regular	0.847	-0.118	-0.0486	-0.111	-0.0529	-0.0486	-0.108	0.00809	-0.0236	-0.0552	-0.0566	-0.00863	0.0561	-0.0586	-0.0214	-0.00300	-0.00646	-0.0785	-0.148	-0.0202	-0.104	0.170	-0.00675	-0.0696	-0.0957	0.661	0.196	0.0715	0.0774	0.0974	09/12/1988	1,988
2	F	HHS	Department of Health and Human Services	0.132	0.255	0.350	-0.0107	-0.0790	-0.404	-0.152	-0.0184	-0.423	-0.0726	-0.0265	-0.0905	0.0101	0.0806	-0.0181	0.0315	0.0655	-0.276	-0.139	-0.00810	-0.0380	0.0808	-0.0761	-0.0160	0.378	0.00669	-0.0115	-0.241	-0.246	0.157	Fulltime-Regular	0.0480	0.0160	0.00704	0.0872	0.120	0.0590	0.213	-0.235	0.376	0.732	-0.460	0.0477	0.0600	0.0295	0.176	0.0318	0.0129	-0.0176	-0.0549	-0.0607	-0.0867	0.0282	0.0299	0.00569	-0.00473	0.0607	0.0337	-0.0402	0.00271	0.111	11/19/1989	1,989
3	M	COR	Correction and Rehabilitation	0.0582	0.0880	0.0638	0.000581	-0.295	-0.130	0.575	0.0640	-0.0902	0.0136	-0.00555	0.0259	0.0386	0.426	-0.160	-0.174	0.0337	0.0241	-0.00819	0.0129	-0.153	-0.0447	0.174	0.0614	-0.0166	0.120	-0.0487	0.0614	-0.00252	0.0123	Fulltime-Regular	0.0461	0.0237	0.0694	0.0389	0.0573	0.0490	0.131	0.0620	0.0208	0.0182	0.0629	-0.0171	-0.0608	-0.00768	-0.0256	-0.0199	0.116	0.0469	-0.0894	0.131	0.152	-0.0125	0.0457	0.176	-0.00261	0.0742	-0.235	-0.320	0.596	0.0178	05/05/2014	2,014
4	M	HCA	Department of Housing and Community Affairs	0.0146	0.0259	-0.00150	0.0268	-0.0367	-0.0312	-0.0198	0.0507	-0.0216	0.0109	-0.000835	0.0265	0.0217	0.104	-0.0260	0.0488	0.0335	-0.0834	0.00874	-0.0729	0.0649	0.0495	-0.0827	-0.0350	-0.0253	0.0708	-0.0578	-0.0261	0.00334	-0.0729	Fulltime-Regular	0.0903	0.0242	0.0259	0.243	0.390	-0.0631	-0.0142	-0.146	-0.0338	0.0443	0.00448	-0.0805	0.0182	0.122	-0.191	0.0626	-0.0622	0.0510	0.0723	-0.0911	0.0427	0.0407	-0.254	-0.0395	0.0135	-0.138	0.167	0.156	0.178	0.0203	03/05/2007	2,007

9,223	F	HHS	Department of Health and Human Services	0.147	0.199	0.370	0.0113	0.00128	0.550	-0.0127	0.0527	-0.210	0.0391	-0.0799	0.225	-0.0198	0.0514	0.0463	0.00927	-0.0495	-0.118	0.0447	-0.137	0.0338	-0.294	-0.146	-0.0938	-0.0867	0.0872	-0.119	-0.0807	0.202	-0.101	Fulltime-Regular	0.0500	0.0114	0.0106	0.0892	0.122	0.520	0.271	0.0834	-0.199	-0.111	-0.00622	0.815	0.237	0.187	-0.160	-0.0113	-0.129	0.0219	-0.00224	-0.0851	-0.177	-0.0202	0.0971	0.0873	0.0492	-0.00918	0.210	-0.180	-0.0123	-0.0343	11/03/2015	2,015
9,224	F	FRS	Fire and Rescue Services	0.105	0.157	0.0427	-0.0275	-0.237	-0.0218	-0.0836	-0.0329	0.0707	-0.0263	0.0240	-0.0619	-0.0814	0.0185	0.0188	0.0378	-0.0322	0.00967	-0.0790	0.0856	-0.0287	0.0478	0.102	-0.0449	0.0427	-0.0343	-0.0265	-0.0424	0.113	-0.0554	Fulltime-Regular	0.0622	0.229	0.00245	-0.0313	0.0356	-0.00290	0.00727	0.0515	0.0472	-0.00977	0.0784	-0.00967	-0.131	0.252	0.152	0.0308	-0.0992	-0.0162	-0.00196	0.000132	0.0174	0.00392	0.0397	0.0227	-0.0239	0.0107	-0.0385	-0.0329	0.0228	-0.00111	11/28/1988	1,988
9,225	M	HHS	Department of Health and Human Services	0.142	0.253	0.431	-0.00543	0.0660	-0.0169	-0.0222	0.0217	-0.112	0.0670	0.233	0.136	0.0419	0.0895	0.0879	0.0540	0.0238	-0.180	-0.0490	-0.0372	0.0115	-0.0392	0.0100	-0.0121	0.0795	0.0324	-0.120	0.0183	0.130	-0.0734	Parttime-Regular	0.00774	-5.33e-05	0.0428	0.0203	0.0429	0.00315	0.0190	-0.00684	0.0168	0.0226	0.00579	-0.0132	0.00341	0.00497	-0.00566	0.000371	-0.00198	0.0253	-0.00153	0.00520	0.0197	-0.0123	-0.0159	0.00736	0.00787	0.0205	0.00861	-0.0233	-0.0319	0.0208	04/30/2001	2,001
9,226	M	CCL	County Council	0.0668	0.137	-0.0695	-0.0385	0.00973	-0.00640	0.0185	0.00934	-0.0521	0.0368	0.00269	0.0847	-0.0996	0.219	0.0582	0.0166	0.0495	-0.356	0.0962	-0.407	0.342	-0.491	0.401	-0.685	-0.0919	-0.289	-0.131	0.172	-0.252	0.0244	Fulltime-Regular	0.201	0.112	0.0314	0.881	-0.611	0.117	0.0993	-0.0987	0.0927	-0.0567	0.0543	-0.0508	-0.0906	0.0323	-0.102	-0.0344	-0.0376	-0.0372	-0.0342	-0.0783	0.0574	0.0388	-0.0412	0.0248	0.00412	0.0496	0.0724	-0.0172	-0.0206	-0.0664	09/05/2006	2,006
9,227	M	DLC	Department of Liquor Control	0.122	0.232	-0.105	-0.0828	-0.0926	-0.0381	-0.0450	-0.00405	0.0190	0.0154	0.00623	0.116	0.0127	0.102	0.0632	0.107	0.0392	-0.0591	-0.0211	0.0545	-0.0672	-0.0137	0.0923	0.0434	-0.000109	0.0719	-0.0150	0.0551	0.0610	-0.0646	Fulltime-Regular	0.0339	0.0105	0.0290	0.137	0.230	-0.0199	-0.00293	-0.0603	-0.0148	0.0251	0.0193	-0.0732	-0.00262	0.0958	-0.129	-0.00914	-0.0181	0.242	-0.0296	0.0179	-0.0280	-0.0510	-0.112	-0.00997	0.00466	0.0338	0.0339	-0.0599	-0.0522	0.0234	01/30/2012	2,012

	steps	[('applytocols-1', ...), ('applytocols-2', ...), ...]
	transform_input	None
	memory	None
	verbose	False

	categories	'auto'
	dtype	<class 'numpy.float64'>
	handle_unknown	'error'
	unknown_value	None
	encoded_missing_value	nan
	min_frequency	None
	max_categories	None

	transformer	StringEncoder()
	cols	(string() & (...er(<lambda>)))
	exclude_cols	None
	allow_reject	False
	keep_original	False
	rename_columns	'{}'
	n_jobs	None

	n_components	30
	vectorizer	'tfidf'
	ngram_range	(3, ...)
	analyzer	'char_wb'
	stop_words	None
	random_state	None
	vocabulary	None

	transformer	OrdinalEncoder()
	cols	(string() & filter(<lambda>))
	exclude_cols	None
	allow_reject	False
	keep_original	False
	rename_columns	'{}'
	n_jobs	None

	loss	'squared_error'
	quantile	None
	learning_rate	0.1
	max_iter	100
	max_leaf_nodes	31
	max_depth	None
	min_samples_leaf	20
	l2_regularization	0.0
	max_features	1.0
	max_bins	255
	categorical_features	'from_dtype'
	monotonic_cst	None
	interaction_cst	None
	warm_start	False
	early_stopping	'auto'
	scoring	'loss'
	validation_fraction	0.1
	n_iter_no_change	10
	tol	1e-07
	verbose	0
	random_state	None

	steps	[('tablevectorizer', ...), ('histgradientboostingregressor', ...)]
	transform_input	None
	memory	None
	verbose	False

	cardinality_threshold	40
	low_cardinality	ToCategorical()
	high_cardinality	StringEncoder()
	numeric	PassThrough()
	datetime	DatetimeEncoder()
	specific_transformers	()
	drop_null_fraction	1.0
	drop_if_constant	False
	drop_if_unique	False
	datetime_format	None
	null_strings	None
	n_jobs	None

	resolution	'hour'
	add_weekday	False
	add_total_seconds	True
	add_day_of_year	False
	periodic_encoding	None
