Subsampling for faster development

Here we show how to use .skb.subsample() to speed up the interactive construction of skrub expressions by computing preview results on a subsample of the data.

import skrub
import skrub.datasets

dataset = skrub.datasets.fetch_employee_salaries().employee_salaries

full_data = skrub.var("data", dataset)
full_data
<Var 'data'>

We are working with a dataset of over 9K rows. As we build up our pipeline, we see previews of the intermediate results so we can check that it behaves as we expect. However, if some estimators are slow, fitting them and computing results on the whole data can slow us down.

Lightweight construction of the pipeline on a subsample

We can tell skrub to subsample the data when computing the previews, with .skb.subsample().

data = full_data.skb.subsample(n=100)
data
<SubsamplePreviews>
Result (on a subsample): [interactive table report not shown]

The rest of the pipeline will now use only 100 rows for its previews.

To continue our pipeline we now define X and y:

employees = data.drop(
    columns="current_annual_salary",
    errors="ignore",
).skb.mark_as_X()

salaries = data["current_annual_salary"].skb.mark_as_y()

Finally, we apply a TableVectorizer followed by a HistGradientBoostingRegressor to obtain the predictions expression.

All the lines above run very fast, including fitting the predictor, because the previews are computed on only 100 rows.

When we display our predictions expression, we see that the preview is computed on a subsample: the result column has only 100 entries.

<Apply HistGradientBoostingRegressor>
Graph nodes: VAR 'data', SUBSAMPLEPREVIEWS, X: CALLMETHOD 'drop', y: GETITEM 'current_annual_salary', APPLY TableVectorizer, APPLY HistGradientBoostingRegressor

Result (on a subsample): [interactive table report not shown]

We can also turn on subsampling for other methods of the expression, such as .skb.cross_validate(). Here we run the cross-validation on the small subsample of 100 rows we configured. With so few rows the scores will be poor and unstable across folds, but a quick run like this can help us detect errors in our cross-validation scheme.

predictions.skb.cross_validate(keep_subsampling=True)
/home/circleci/project/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning:

Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros

   fit_time  score_time  test_score
0  0.182258    0.071600    0.101392
1  0.190907    0.093078    0.415373
2  0.244727    0.118236    0.640438
3  0.264047    0.070756    0.138710
4  0.196654    0.114999    0.360285


Evaluating the pipeline on the full data

By default, outside of previews, no subsampling takes place unless we explicitly pass keep_subsampling=True.

Here we run the cross-validation on the full data. Note the longer fit_time and much better test_score.

predictions.skb.cross_validate()
   fit_time  score_time  test_score
0  2.839187    0.261421    0.910676
1  2.282177    0.265637    0.885640
2  2.603117    0.237282    0.917163
3  2.597563    0.240626    0.924788
4  2.252195    0.237839    0.923810
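To put numbers on the comparison, the sketch below averages the fit_time and test_score columns of the two cross-validation tables above. The values are copied from those tables; the variable names are only illustrative.

```python
from statistics import mean

# Values copied from the two cross-validation outputs above.
subsample_cv = {
    "fit_time": [0.182258, 0.190907, 0.244727, 0.264047, 0.196654],
    "test_score": [0.101392, 0.415373, 0.640438, 0.138710, 0.360285],
}
full_cv = {
    "fit_time": [2.839187, 2.282177, 2.603117, 2.597563, 2.252195],
    "test_score": [0.910676, 0.885640, 0.917163, 0.924788, 0.923810],
}

mean_sub_score = mean(subsample_cv["test_score"])
mean_full_score = mean(full_cv["test_score"])
# How many times longer a fit takes on the full data, on average.
slowdown = mean(full_cv["fit_time"]) / mean(subsample_cv["fit_time"])

print(f"mean test_score: {mean_sub_score:.3f} (subsample) vs {mean_full_score:.3f} (full)")
print(f"mean fit_time ratio, full / subsample: {slowdown:.1f}x")
```

On these runs the mean test_score rises from about 0.33 to about 0.91, while each fit takes roughly 12 times longer on the full data.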


Total running time of the script: (0 minutes 17.034 seconds)
