.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/expressions/12_subsampling.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_expressions_12_subsampling.py>`
        to download the full example code or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_expressions_12_subsampling.py:

.. currentmodule:: skrub

.. _example_subsampling:

Subsampling for faster development
==================================

Here we show how to use :meth:`.skb.subsample()` to speed up the interactive
creation of skrub expressions by subsampling the data when computing preview
results.

.. GENERATED FROM PYTHON SOURCE LINES 15-24

.. code-block:: Python

    import skrub
    import skrub.datasets

    dataset = skrub.datasets.fetch_employee_salaries().employee_salaries

    full_data = skrub.var("data", dataset)
    full_data

.. raw:: html
<Var 'data'>
Show graph VAR 'data'

Result:




.. GENERATED FROM PYTHON SOURCE LINES 25-35

We are working with a dataset of over 9K rows. As we build up our pipeline, we
see previews of the intermediate results so we can check that it behaves as we
expect. However, if some estimators are slow, fitting them and computing
results on the whole data can slow us down.

Lightweight construction of the pipeline on a subsample
-------------------------------------------------------

We can tell skrub to subsample the data when computing the previews, with
:meth:`.skb.subsample()`.

.. GENERATED FROM PYTHON SOURCE LINES 37-40

.. code-block:: Python

    data = full_data.skb.subsample(n=100)
    data

.. raw:: html
<SubsamplePreviews>
Show graph VAR 'data' SUBSAMPLEPREVIEWS

Result (on a subsample):




.. GENERATED FROM PYTHON SOURCE LINES 41-54

The rest of the pipeline will now use only 100 rows for its previews.

.. topic:: Subsampling only applies to previews by default

   By default subsampling is applied *only for previews*: the results
   shown when we display the expression, and the output of calling
   :meth:`.skb.preview()`. For other methods such as
   :meth:`.skb.get_pipeline()` or :meth:`.skb.cross_validate()`, *no
   subsampling is done by default*. We can explicitly ask for it with
   ``keep_subsampling=True`` as we will see below.

To continue our pipeline we now define X and y:

.. GENERATED FROM PYTHON SOURCE LINES 56-63

.. code-block:: Python

    employees = data.drop(
        columns="current_annual_salary",
        errors="ignore",
    ).skb.mark_as_X()
    salaries = data["current_annual_salary"].skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 64-65

And finally we apply a TableVectorizer then gradient boosting:

.. GENERATED FROM PYTHON SOURCE LINES 67-73

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor

    predictions = employees.skb.apply(skrub.TableVectorizer()).skb.apply(
        HistGradientBoostingRegressor(), y=salaries
    )

.. GENERATED FROM PYTHON SOURCE LINES 74-78

All the lines above run very fast, including fitting the predictor. When we
display our ``predictions`` expression, we see that the preview is computed on
a subsample: the result column has only 100 entries.

.. GENERATED FROM PYTHON SOURCE LINES 81-83

.. code-block:: Python

    predictions

.. raw:: html
<Apply HistGradientBoostingRegressor>
Show graph VAR 'data' SUBSAMPLEPREVIEWS X: CALLMETHOD 'drop' y: GETITEM 'current_annual_salary' APPLY TableVectorizer APPLY HistGradientBoostingRegressor

Result (on a subsample):

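The preview-only default described above can be pictured with a toy model.
The sketch below is not skrub's implementation: ``ToyExpr`` and its methods
are hypothetical names made up for illustration, and it assumes the subsample
is simply the first ``n`` rows, which may differ from what skrub actually
selects. It only shows the idea that the full data is kept, that previews use
the subsample, and that other methods use everything unless
``keep_subsampling=True`` is passed.

.. code-block:: Python

    class ToyExpr:
        """Hypothetical stand-in for an expression (not skrub's code)."""

        def __init__(self, rows, preview_n=None):
            self.rows = list(rows)
            self.preview_n = preview_n  # None: no subsampling configured

        def subsample(self, n):
            # Only record the subsample size; the full data is left untouched.
            return ToyExpr(self.rows, preview_n=n)

        def _select(self, use_subsample):
            if use_subsample and self.preview_n is not None:
                # Assumption: the subsample is the first n rows.
                return self.rows[: self.preview_n]
            return self.rows

        def preview(self):
            # Previews always use the configured subsample.
            return self._select(use_subsample=True)

        def evaluate(self, keep_subsampling=False):
            # Other methods use the full data unless asked otherwise.
            return self._select(use_subsample=keep_subsampling)


    expr = ToyExpr(range(9000)).subsample(n=100)
    preview_len = len(expr.preview())
    full_len = len(expr.evaluate())
    kept_len = len(expr.evaluate(keep_subsampling=True))
    print(preview_len, full_len, kept_len)  # prints: 100 9000 100
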



.. GENERATED FROM PYTHON SOURCE LINES 84-89

We can also turn on subsampling for other methods of the expression, such as
:meth:`.skb.cross_validate()`. Here we run the cross-validation on the small
subsample of 100 rows we configured. With such a small subsample the scores
will be low and will vary a lot between folds, but this might help us quickly
detect errors in our cross-validation scheme.

.. GENERATED FROM PYTHON SOURCE LINES 91-93

.. code-block:: Python

    predictions.skb.cross_validate(keep_subsampling=True)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros

.. raw:: html
fit_time score_time test_score
0 0.182258 0.071600 0.101392
1 0.190907 0.093078 0.415373
2 0.244727 0.118236 0.640438
3 0.264047 0.070756 0.138710
4 0.196654 0.114999 0.360285
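To get a feel for how much the sample size alone can change cross-validation
results, here is a minimal sketch that uses scikit-learn directly, without
skrub. It relies on synthetic data from ``make_regression`` rather than the
employee salaries dataset, and the sizes and noise level are arbitrary
assumptions, so the numbers it prints are not comparable to the tables in this
example. One would typically expect the 100-row run to be faster and to score
lower.

.. code-block:: Python

    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_validate

    # Synthetic stand-in data; sizes and noise level are arbitrary choices.
    X, y = make_regression(
        n_samples=3000, n_features=10, noise=10.0, random_state=0
    )

    model = HistGradientBoostingRegressor(random_state=0)

    # Cross-validate on the first 100 rows only, then on all rows.
    small = cross_validate(model, X[:100], y[:100], cv=5)
    full = cross_validate(model, X, y, cv=5)

    print("100 rows:", small["test_score"].mean(), small["fit_time"].mean())
    print("all rows:", full["test_score"].mean(), full["fit_time"].mean())
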


.. GENERATED FROM PYTHON SOURCE LINES 94-101

Evaluating the pipeline on the full data
----------------------------------------

By default, when we do not explicitly ask for ``keep_subsampling=True``, no
subsampling takes place.

Here we run the cross-validation **on the full data**. Note the longer
``fit_time`` and much better ``test_score``.

.. GENERATED FROM PYTHON SOURCE LINES 104-105

.. code-block:: Python

    predictions.skb.cross_validate()

.. raw:: html
fit_time score_time test_score
0 2.839187 0.261421 0.910676
1 2.282177 0.265637 0.885640
2 2.603117 0.237282 0.917163
3 2.597563 0.240626 0.924788
4 2.252195 0.237839 0.923810
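The difference between the two runs is easier to read once the folds are
averaged. The sketch below only uses the standard library and the values
copied by hand from the two tables in this example; the variable names are
ours and nothing here calls skrub.

.. code-block:: Python

    from statistics import mean

    # Values copied from the two cross-validation tables in this example.
    subsample_cv = {
        "fit_time": [0.182258, 0.190907, 0.244727, 0.264047, 0.196654],
        "test_score": [0.101392, 0.415373, 0.640438, 0.138710, 0.360285],
    }
    full_cv = {
        "fit_time": [2.839187, 2.282177, 2.603117, 2.597563, 2.252195],
        "test_score": [0.910676, 0.885640, 0.917163, 0.924788, 0.923810],
    }

    # Average each column over the five folds.
    summary = {
        name: {key: mean(values) for key, values in cv.items()}
        for name, cv in {"subsample": subsample_cv, "full": full_cv}.items()
    }
    # How many times longer a fit takes on the full data.
    slowdown = summary["full"]["fit_time"] / summary["subsample"]["fit_time"]

    for name, stats in summary.items():
        print(
            f"{name:>9}: mean fit_time={stats['fit_time']:.2f}s, "
            f"mean test_score={stats['test_score']:.2f}"
        )
    print(f"full-data fits are about {slowdown:.0f}x slower")

With the numbers reported above, this gives a mean score of about 0.33 on the
subsample versus about 0.91 on the full data, for fits that take roughly
twelve times longer.
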


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 17.034 seconds)


.. _sphx_glr_download_auto_examples_expressions_12_subsampling.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/expressions/12_subsampling.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/expressions/12_subsampling.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 12_subsampling.ipynb <12_subsampling.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 12_subsampling.py <12_subsampling.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 12_subsampling.zip <12_subsampling.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_