Subsampling for faster development#

Here we show how to use .skb.subsample() to speed up the interactive creation of skrub expressions by subsampling the data when computing preview results.

import skrub
import skrub.datasets

dataset = skrub.datasets.fetch_employee_salaries().employee_salaries

full_data = skrub.var("data", dataset)
full_data

<Var 'data'>

Result:

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired	current_annual_salary
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986	69222.18
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1988	97392.47
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1989	104717.28
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	05/05/2014	2014	52734.57
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	03/05/2007	2007	93396.0

9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	11/03/2015	2015	72094.53
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	11/28/1988	1988	169543.85
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Services	Parttime-Regular	Medical Doctor IV - Psychiatrist	04/30/2001	2001	102736.52
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2006	153747.5
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2012	75484.08

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	17 (0.2%)	2 (< 0.1%)
1	department	ObjectDType	0 (0.0%)	37 (0.4%)
2	department_name	ObjectDType	0 (0.0%)	37 (0.4%)
3	division	ObjectDType	0 (0.0%)	694 (7.5%)
4	assignment_category	ObjectDType	0 (0.0%)	2 (< 0.1%)
5	employee_position_title	ObjectDType	0 (0.0%)	443 (4.8%)
6	date_first_hired	ObjectDType	0 (0.0%)	2264 (24.5%)
7	year_first_hired	Int64DType	0 (0.0%)	51 (0.6%)	2.00e+03	9.33	1,965	2,005	2,016
8	current_annual_salary	Float64DType	0 (0.0%)	3403 (36.9%)	7.34e+04	2.91e+04	9.20e+03	6.94e+04	3.03e+05

We are working with a dataset of 9,228 rows. As we build up our pipeline, skrub shows previews of the intermediate results so we can check that each step behaves as we expect. However, if some estimators are slow, fitting them and computing results on the whole dataset for every preview can slow us down.

Lightweight construction of the pipeline on a subsample#

We can tell skrub to subsample the data when computing the previews, with .skb.subsample().

data = full_data.skb.subsample(n=100)
data

<SubsamplePreviews>

Result (on a subsample):

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired	current_annual_salary
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records Management Section	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986	69222.18
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1988	97392.47
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1989	104717.28
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	05/05/2014	2014	52734.57
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	03/05/2007	2007	93396.0

95	M	COR	Correction and Rehabilitation	DS MCDC Custody and Security	Fulltime-Regular	Correctional Officer III (Corporal)	02/05/2007	2007	62903.0
96	F	DLC	Department of Liquor Control	Muddy Branch	Parttime-Regular	Liquor Store Clerk I	04/23/2012	2012	29779.07
97	M	COR	Correction and Rehabilitation	DS MCDC Custody and Security	Fulltime-Regular	Correctional Officer II (PFC)	06/01/2004	2004	65623.0
98	F	DLC	Department of Liquor Control	Walnut Hill	Parttime-Regular	Liquor Store Clerk I	05/20/2012	2012	29779.07
99	M	DOT	Department of Transportation	Transit Silver Spring Ride On	Fulltime-Regular	Bus Operator	03/26/2007	2007	48258.81

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	gender	ObjectDType	1 (1.0%)	2 (2.0%)
1	department	ObjectDType	0 (0.0%)	19 (19.0%)
2	department_name	ObjectDType	0 (0.0%)	19 (19.0%)
3	division	ObjectDType	0 (0.0%)	77 (77.0%)
4	assignment_category	ObjectDType	0 (0.0%)	2 (2.0%)
5	employee_position_title	ObjectDType	0 (0.0%)	59 (59.0%)
6	date_first_hired	ObjectDType	0 (0.0%)	92 (92.0%)
7	year_first_hired	Int64DType	0 (0.0%)	25 (25.0%)	2.01e+03	8.68	1,979	2,007	2,016
8	current_annual_salary	Float64DType	0 (0.0%)	90 (90.0%)	6.62e+04	3.40e+04	1.60e+04	6.05e+04	2.28e+05

The rest of the pipeline will now use only 100 points for its previews.
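The preview above lists row labels 0 through 99, which is consistent with taking the first 100 rows rather than a random draw. The toy sketch below contrasts the two strategies on plain Python lists. The helper names subsample_head and subsample_random are hypothetical and only illustrate the idea; they are not skrub's implementation.

```python
import random

# Stand-in for the row labels 0..9227 of the full table.
rows = list(range(9228))


def subsample_head(rows, n):
    """Keep the first n rows (cheap and deterministic)."""
    return rows[:n]


def subsample_random(rows, n, seed=0):
    """Draw n distinct rows at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(rows, n))


head = subsample_head(rows, 100)
rand = subsample_random(rows, 100)

print("head subsample spans labels", head[0], "to", head[-1])
print("random subsample spans labels", rand[0], "to", rand[-1])
```

Taking the first rows is fast and reproducible, but if the table is sorted (for example by date or department) the first rows may not be representative of the whole. That is acceptable for previews, whose purpose is to check that each step runs and produces sensible output.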

To continue our pipeline we now define X and y:

employees = data.drop(
    columns="current_annual_salary",
    errors="ignore",
).skb.mark_as_X()

salaries = data["current_annual_salary"].skb.mark_as_y()

And finally we apply a TableVectorizer then gradient boosting:

from sklearn.ensemble import HistGradientBoostingRegressor

predictions = employees.skb.apply(skrub.TableVectorizer()).skb.apply(
    HistGradientBoostingRegressor(), y=salaries
)

All the lines above run quickly, including fitting the predictor, because the previews are computed on only 100 rows.

When we display our predictions expression, we see that the preview is computed on a subsample: the result column has only 100 entries.

predictions

<Apply HistGradientBoostingRegressor>

Result (on a subsample):

	current_annual_salary
0	67234.80878531722
1	102160.2513973197
2	102899.52580672638
3	55083.94058825617
4	97331.49980787367

95	64044.29899692974
96	27111.135496437655
97	67453.01398897123
98	27271.653311181468
99	48021.357751342526

Column	Column name	dtype	Null values	Unique values	Mean	Std	Min	Median	Max
0	current_annual_salary	Float64DType	0 (0.0%)	98 (98.0%)	6.62e+04	2.84e+04	2.18e+04	6.19e+04	1.55e+05

We can also turn on subsampling for other methods of the expression, such as .skb.cross_validate(). Here we run the cross-validation on the small subsample of 100 rows we configured. With such a small subsample the scores will be low and noisy, but this can help us quickly detect errors in our cross-validation scheme.

predictions.skb.cross_validate(keep_subsampling=True)

/home/circleci/project/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning:

Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros

	fit_time	score_time	test_score
0	0.165644	0.065649	0.069192
1	0.162616	0.064839	0.418448
2	0.173402	0.067149	0.657612
3	0.185402	0.068664	-0.036562
4	0.181036	0.065015	0.363958
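One fold above has a negative test_score. Assuming the score is the regressor's default metric, which for scikit-learn regressors is the coefficient of determination R² (the page does not name the metric), a negative value means the model did worse on that fold than always predicting the mean of the test targets. The sketch below computes R² by hand on made-up salaries to show how that happens.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up salaries, for illustration only.
y_true = [30_000.0, 50_000.0, 70_000.0, 90_000.0]

# Predicting every value exactly.
perfect = r2_score(y_true, y_true)

# Always predicting the mean of the targets.
constant_mean = r2_score(y_true, [60_000.0] * 4)

# Predictions that are further from the truth than the mean is.
worse_than_mean = r2_score(y_true, [90_000.0, 70_000.0, 50_000.0, 30_000.0])

print(perfect, constant_mean, worse_than_mean)
```

With only 100 rows split into folds, each model is trained and scored on very few rows, so scores close to zero or below it are not surprising.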

Evaluating the pipeline on the full data#

By default, when we do not explicitly pass keep_subsampling=True, methods such as .skb.cross_validate() use the full data: the subsampling only applies to the previews.

Here we run the cross-validation on the full data. Note the longer fit_time and much better test_score.

predictions.skb.cross_validate()

	fit_time	score_time	test_score
0	2.047098	0.205897	0.912157
1	2.011364	0.191790	0.878659
2	2.004743	0.194426	0.914859
3	2.016466	0.189403	0.924642
4	1.998402	0.193294	0.925497
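To put numbers on the comparison, the short script below averages the fit_time and test_score columns copied from the two cross-validation tables above. The figures are specific to this run and will vary from machine to machine.

```python
from statistics import mean

# Values copied from the cross-validation table on the 100-row subsample.
subsample_fit_times = [0.165644, 0.162616, 0.173402, 0.185402, 0.181036]
subsample_scores = [0.069192, 0.418448, 0.657612, -0.036562, 0.363958]

# Values copied from the cross-validation table on the full data.
full_fit_times = [2.047098, 2.011364, 2.004743, 2.016466, 1.998402]
full_scores = [0.912157, 0.878659, 0.914859, 0.924642, 0.925497]

mean_sub_fit = mean(subsample_fit_times)
mean_full_fit = mean(full_fit_times)
mean_sub_score = mean(subsample_scores)
mean_full_score = mean(full_scores)
fit_time_ratio = mean_full_fit / mean_sub_fit

print(f"mean fit_time:   {mean_sub_fit:.3f} s (subsample) vs {mean_full_fit:.3f} s (full)")
print(f"mean test_score: {mean_sub_score:.3f} (subsample) vs {mean_full_score:.3f} (full)")
print(f"fit_time ratio (full / subsample): {fit_time_ratio:.1f}")
```

In this run, fitting on the full data takes roughly an order of magnitude longer per fold, while the mean score rises from about 0.29 to about 0.91.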

Total running time of the script: (0 minutes 13.412 seconds)
