` to speed up interactive creation of skrub expressions by subsampling the data when computing preview results.

.. GENERATED FROM PYTHON SOURCE LINES 15-24

.. code-block:: Python

    import skrub
    import skrub.datasets

    dataset = skrub.datasets.fetch_employee_salaries().employee_salaries
    full_data = skrub.var("data", dataset)
    full_data

``<Var 'data'>``

Result:

.. csv-table::
   :header-rows: 1

   "",gender,department,department_name,division,assignment_category,employee_position_title,date_first_hired,year_first_hired,current_annual_salary
   0,F,POL,Department of Police,MSB Information Mgmt and Tech Division Records Management Section,Fulltime-Regular,Office Services Coordinator,09/22/1986,"1,986",6.92e+04
   1,M,POL,Department of Police,ISB Major Crimes Division Fugitive Section,Fulltime-Regular,Master Police Officer,09/12/1988,"1,988",9.74e+04
   2,F,HHS,Department of Health and Human Services,Adult Protective and Case Management Services,Fulltime-Regular,Social Worker IV,11/19/1989,"1,989",1.05e+05
   3,M,COR,Correction and Rehabilitation,PRRS Facility and Security,Fulltime-Regular,Resident Supervisor II,05/05/2014,"2,014",5.27e+04
   4,M,HCA,Department of Housing and Community Affairs,Affordable Housing Programs,Fulltime-Regular,Planning Specialist III,03/05/2007,"2,007",9.34e+04
   ...,,,,,,,,,
   "9,223",F,HHS,Department of Health and Human Services,School Based Health Centers,Fulltime-Regular,Community Health Nurse II,11/03/2015,"2,015",7.21e+04
   "9,224",F,FRS,Fire and Rescue Services,Human Resources Division,Fulltime-Regular,Fire/Rescue Division Chief,11/28/1988,"1,988",1.70e+05
   "9,225",M,HHS,Department of Health and Human Services,Child and Adolescent Mental Health Clinic Services,Parttime-Regular,Medical Doctor IV - Psychiatrist,04/30/2001,"2,001",1.03e+05
   "9,226",M,CCL,County Council,Council Central Staff,Fulltime-Regular,Manager II,09/05/2006,"2,006",1.54e+05
   "9,227",M,DLC,Department of Liquor Control,"Licensure, Regulation and Education",Fulltime-Regular,Alcohol/Tobacco Enforcement Specialist II,01/30/2012,"2,012",7.55e+04

.. csv-table::
   :header-rows: 1

   Column,Column name,dtype,Is sorted,Null values,Unique values,Mean,Std,Min,Median,Max
   0,gender,ObjectDType,False,17 (0.2%),2 (< 0.1%),,,,,
   1,department,ObjectDType,False,0 (0.0%),37 (0.4%),,,,,
   2,department_name,ObjectDType,False,0 (0.0%),37 (0.4%),,,,,
   3,division,ObjectDType,False,0 (0.0%),694 (7.5%),,,,,
   4,assignment_category,ObjectDType,False,0 (0.0%),2 (< 0.1%),,,,,
   5,employee_position_title,ObjectDType,False,0 (0.0%),443 (4.8%),,,,,
   6,date_first_hired,ObjectDType,False,0 (0.0%),2264 (24.5%),,,,,
   7,year_first_hired,Int64DType,False,0 (0.0%),51 (0.6%),2.00e+03,9.33,"1,965","2,005","2,016"
   8,current_annual_salary,Float64DType,False,0 (0.0%),3403 (36.9%),7.34e+04,2.91e+04,9.20e+03,6.94e+04,3.03e+05

.. GENERATED FROM PYTHON SOURCE LINES 25-35

We are working with a dataset of more than 9,000 rows. As we build up our pipeline, we see previews of the intermediate results so we can check that it behaves as we expect. However, if some estimators are slow, fitting them and computing results on the whole dataset can slow us down.

Lightweight construction of the pipeline on a subsample
-------------------------------------------------------

We can tell skrub to subsample the data when computing the previews, with :meth:`.skb.subsample()`.

.. GENERATED FROM PYTHON SOURCE LINES 37-40

.. code-block:: Python

    # Use only 100 rows when computing previews.
    data = full_data.skb.subsample(n=100)
    data

``<SubsamplePreviews>``

Result (on a subsample):

.. csv-table::
   :header-rows: 1

   "",gender,department,department_name,division,assignment_category,employee_position_title,date_first_hired,year_first_hired,current_annual_salary
   0,F,POL,Department of Police,MSB Information Mgmt and Tech Division Records Management Section,Fulltime-Regular,Office Services Coordinator,09/22/1986,"1,986",6.92e+04
   1,M,POL,Department of Police,ISB Major Crimes Division Fugitive Section,Fulltime-Regular,Master Police Officer,09/12/1988,"1,988",9.74e+04
   2,F,HHS,Department of Health and Human Services,Adult Protective and Case Management Services,Fulltime-Regular,Social Worker IV,11/19/1989,"1,989",1.05e+05
   3,M,COR,Correction and Rehabilitation,PRRS Facility and Security,Fulltime-Regular,Resident Supervisor II,05/05/2014,"2,014",5.27e+04
   4,M,HCA,Department of Housing and Community Affairs,Affordable Housing Programs,Fulltime-Regular,Planning Specialist III,03/05/2007,"2,007",9.34e+04
   ...,,,,,,,,,
   95,M,COR,Correction and Rehabilitation,DS MCDC Custody and Security,Fulltime-Regular,Correctional Officer III (Corporal),02/05/2007,"2,007",6.29e+04
   96,F,DLC,Department of Liquor Control,Muddy Branch,Parttime-Regular,Liquor Store Clerk I,04/23/2012,"2,012",2.98e+04
   97,M,COR,Correction and Rehabilitation,DS MCDC Custody and Security,Fulltime-Regular,Correctional Officer II (PFC),06/01/2004,"2,004",6.56e+04
   98,F,DLC,Department of Liquor Control,Walnut Hill,Parttime-Regular,Liquor Store Clerk I,05/20/2012,"2,012",2.98e+04
   99,M,DOT,Department of Transportation,Transit Silver Spring Ride On,Fulltime-Regular,Bus Operator,03/26/2007,"2,007",4.83e+04

.. csv-table::
   :header-rows: 1

   Column,Column name,dtype,Is sorted,Null values,Unique values,Mean,Std,Min,Median,Max
   0,gender,ObjectDType,False,1 (1.0%),2 (2.0%),,,,,
   1,department,ObjectDType,False,0 (0.0%),19 (19.0%),,,,,
   2,department_name,ObjectDType,False,0 (0.0%),19 (19.0%),,,,,
   3,division,ObjectDType,False,0 (0.0%),77 (77.0%),,,,,
   4,assignment_category,ObjectDType,False,0 (0.0%),2 (2.0%),,,,,
   5,employee_position_title,ObjectDType,False,0 (0.0%),59 (59.0%),,,,,
   6,date_first_hired,ObjectDType,False,0 (0.0%),92 (92.0%),,,,,
   7,year_first_hired,Int64DType,False,0 (0.0%),25 (25.0%),2.01e+03,8.68,"1,979","2,007","2,016"
   8,current_annual_salary,Float64DType,False,0 (0.0%),90 (90.0%),6.62e+04,3.40e+04,1.60e+04,6.05e+04,2.28e+05
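The preview above lists indices 0 to 99, which suggests that the subsample keeps the first rows of the table. Such a subsample is cheap to compute but not necessarily representative: the summary reports 19 distinct departments, against 37 in the full data. The sketch below reproduces this effect on synthetic data using only the standard library. The ``head_subsample`` helper and the toy rows are hypothetical and are not part of skrub.

```python
from itertools import islice


def head_subsample(rows, n):
    """Return the first ``n`` rows (hypothetical helper, not skrub's code)."""
    return list(islice(rows, n))


# Synthetic table: 1,000 rows cycling through 37 made-up department codes.
full_rows = [
    {"department": f"D{i % 37:02d}", "salary": 30_000 + 50 * i}
    for i in range(1000)
]

# Keep only the first 20 rows, as a preview would.
preview_rows = head_subsample(full_rows, 20)

n_departments_full = len({row["department"] for row in full_rows})
n_departments_preview = len({row["department"] for row in preview_rows})

print(len(preview_rows), n_departments_preview, n_departments_full)
```

The preview sees fewer distinct categories than the full table, so statistics read from a preview should be treated as indicative only.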

.. GENERATED FROM PYTHON SOURCE LINES 41-54

The rest of the pipeline will now use only 100 rows for its previews.

.. topic:: Subsampling only applies to previews by default

    By default, subsampling is applied *only for previews*: the results shown when we display the expression, and the output of calling :meth:`.skb.preview()`. For other methods such as :meth:`.skb.get_pipeline()` or :meth:`.skb.cross_validate()`, *no subsampling is done by default*. We can explicitly ask for it with ``keep_subsampling=True``, as we will see below.

To continue our pipeline, we now define X and y:

.. GENERATED FROM PYTHON SOURCE LINES 56-63

.. code-block:: Python

    # X: every column except the target.
    employees = data.drop(
        columns="current_annual_salary",
        errors="ignore",
    ).skb.mark_as_X()
    # y: the target column.
    salaries = data["current_annual_salary"].skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 64-65

Finally, we apply a ``TableVectorizer`` and then gradient boosting:

.. GENERATED FROM PYTHON SOURCE LINES 67-73

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor

    predictions = employees.skb.apply(skrub.TableVectorizer()).skb.apply(
        HistGradientBoostingRegressor(), y=salaries
    )

.. GENERATED FROM PYTHON SOURCE LINES 74-78

All the lines above run very quickly, including fitting the predictor, because the previews only use the subsample. When we display our ``predictions`` expression, we see that the preview is computed on a subsample: the result column has only 100 entries.

.. GENERATED FROM PYTHON SOURCE LINES 81-83

.. code-block:: Python

    predictions

``<Apply HistGradientBoostingRegressor>``

Result (on a subsample):

.. csv-table::
   :header-rows: 1

   "",current_annual_salary
   0,6.66e+04
   1,1.01e+05
   2,1.04e+05
   3,5.51e+04
   4,9.79e+04
   ...,
   95,6.34e+04
   96,2.85e+04
   97,6.70e+04
   98,2.59e+04
   99,4.83e+04

.. csv-table::
   :header-rows: 1

   Column,Column name,dtype,Is sorted,Null values,Unique values,Mean,Std,Min,Median,Max
   0,current_annual_salary,Float64DType,False,0 (0.0%),96 (96.0%),6.62e+04,2.87e+04,2.25e+04,6.20e+04,1.55e+05
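The default described in the topic box above (previews use the subsample, other methods use the full data unless ``keep_subsampling=True`` is passed) can be pictured with a toy stand-in. The ``SubsampledSource`` class below is purely hypothetical and does not reflect how skrub is implemented. The row count of 9,228 is the one implied by the index range 0 to 9,227 in the first preview.

```python
class SubsampledSource:
    """Toy stand-in for data with a preview subsample (hypothetical)."""

    def __init__(self, rows, n):
        self.rows = rows
        self.n = n

    def get(self, keep_subsampling=False):
        # The subsample is only used when explicitly requested.
        return self.rows[: self.n] if keep_subsampling else self.rows

    def preview(self):
        # Previews always work on the subsample.
        return self.get(keep_subsampling=True)


source = SubsampledSource(list(range(9228)), n=100)

n_preview = len(source.preview())
n_default = len(source.get())
n_kept = len(source.get(keep_subsampling=True))

print(n_preview, n_default, n_kept)
```

Only the preview and the explicit ``keep_subsampling=True`` call see 100 rows; the default call sees all of them.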

.. GENERATED FROM PYTHON SOURCE LINES 84-89

We can also turn on subsampling for other methods of the expression, such as :meth:`.skb.cross_validate()`. Here we run the cross-validation on the small subsample of 100 rows we configured. With such a small subsample the scores will be very low, but this can help us quickly detect errors in our cross-validation scheme.

.. GENERATED FROM PYTHON SOURCE LINES 91-93

.. code-block:: Python

    predictions.skb.cross_validate(keep_subsampling=True)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/circleci/project/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros

.. csv-table::
   :header-rows: 1

   "",fit_time,score_time,test_score
   0,0.165258,0.064703,0.081532
   1,0.172600,0.063867,0.433360
   2,0.177924,0.064813,0.657920
   3,0.163901,0.064709,0.276539
   4,0.165837,0.064898,0.300465
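The ``UserWarning`` about unknown categories is a likely side effect of cross-validating on very few rows: a category present in a test fold may be absent from the matching training fold. The sketch below illustrates this with made-up department codes and a simple contiguous 5-fold split. Both are assumptions for illustration, and the split is not necessarily the one skrub or scikit-learn uses here.

```python
def contiguous_folds(n_rows, n_splits):
    """Yield (train, test) index lists for contiguous, equal-sized folds.

    Assumes ``n_rows`` is divisible by ``n_splits``.
    """
    fold_size = n_rows // n_splits
    for k in range(n_splits):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        test_set = set(test)
        train = [i for i in range(n_rows) if i not in test_set]
        yield train, test


# Ten toy rows; "RARE" appears only once, so it is unseen during training
# whenever it falls in the test fold.
departments = [
    "POL", "HHS", "POL", "COR", "HHS",
    "POL", "RARE", "COR", "HHS", "POL",
]

unseen_per_fold = []
for train, test in contiguous_folds(len(departments), n_splits=5):
    seen = {departments[i] for i in train}
    unseen = sorted({departments[i] for i in test} - seen)
    unseen_per_fold.append(unseen)

print(unseen_per_fold)
```

With more rows, rare categories are more likely to appear in every training fold, which is one reason such warnings tend to disappear on the full data.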

.. GENERATED FROM PYTHON SOURCE LINES 94-101

Evaluating the pipeline on the full data
----------------------------------------

By default, when we do not explicitly ask for ``keep_subsampling=True``, no subsampling takes place.

Here we run the cross-validation **on the full data**. Note the longer ``fit_time`` and much better ``test_score``.

.. GENERATED FROM PYTHON SOURCE LINES 104-105

.. code-block:: Python

    predictions.skb.cross_validate()

.. csv-table::
   :header-rows: 1

   "",fit_time,score_time,test_score
   0,1.894383,0.186427,0.912998
   1,1.879840,0.187146,0.881846
   2,1.876838,0.183320,0.916099
   3,1.889258,0.186023,0.925034
   4,1.839404,0.186718,0.925449
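To put numbers on the comparison, the per-fold values from the two cross-validation tables above can be averaged. This is a small standard-library sketch; the figures are copied from this page and will differ from run to run.

```python
from statistics import mean

# Per-fold values copied from the two cross-validation tables above.
subsample = {
    "fit_time": [0.165258, 0.172600, 0.177924, 0.163901, 0.165837],
    "test_score": [0.081532, 0.433360, 0.657920, 0.276539, 0.300465],
}
full = {
    "fit_time": [1.894383, 1.879840, 1.876838, 1.889258, 1.839404],
    "test_score": [0.912998, 0.881846, 0.916099, 0.925034, 0.925449],
}

mean_score_subsample = mean(subsample["test_score"])
mean_score_full = mean(full["test_score"])
fit_time_ratio = mean(full["fit_time"]) / mean(subsample["fit_time"])

print(f"mean test_score: {mean_score_subsample:.3f} (subsample) "
      f"vs {mean_score_full:.3f} (full data)")
print(f"fit_time ratio (full / subsample): {fit_time_ratio:.1f}")
```

On these figures, fitting on the full data takes roughly eleven times longer per fold, while the mean score rises from about 0.35 to about 0.91.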

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 12.614 seconds)


.. _sphx_glr_download_auto_examples_expressions_12_subsampling.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/expressions/12_subsampling.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/expressions/12_subsampling.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 12_subsampling.ipynb <12_subsampling.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 12_subsampling.py <12_subsampling.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 12_subsampling.zip <12_subsampling.zip>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery
