Subsampling for faster development#
Here we show how to use .skb.subsample() to speed up the interactive creation of skrub expressions by subsampling the data when computing preview results.
 | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | current_annual_salary |
---|---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 | 69222.18 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 | 97392.47 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 | 104717.28 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014 | 52734.57 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007 | 93396.0 |
9223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 11/03/2015 | 2015 | 72094.53 |
9224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 11/28/1988 | 1988 | 169543.85 |
9225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Services | Parttime-Regular | Medical Doctor IV - Psychiatrist | 04/30/2001 | 2001 | 102736.52 |
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2006 | 153747.5 |
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012 | 2012 | 75484.08 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | 17 (0.2%) | 2 (< 0.1%) | |||||
1 | department | ObjectDType | 0 (0.0%) | 37 (0.4%) | |||||
2 | department_name | ObjectDType | 0 (0.0%) | 37 (0.4%) | |||||
3 | division | ObjectDType | 0 (0.0%) | 694 (7.5%) | |||||
4 | assignment_category | ObjectDType | 0 (0.0%) | 2 (< 0.1%) | |||||
5 | employee_position_title | ObjectDType | 0 (0.0%) | 443 (4.8%) | |||||
6 | date_first_hired | ObjectDType | 0 (0.0%) | 2264 (24.5%) | |||||
7 | year_first_hired | Int64DType | 0 (0.0%) | 51 (0.6%) | 2.00e+03 | 9.33 | 1,965 | 2,005 | 2,016 |
8 | current_annual_salary | Float64DType | 0 (0.0%) | 3403 (36.9%) | 7.34e+04 | 2.91e+04 | 9.20e+03 | 6.94e+04 | 3.03e+05 |
We are working with a dataset of over 9K rows. As we build up our pipeline, we see previews of the intermediate results so we can check that it behaves as we expect. However, if some estimators are slow, fitting them and computing results on the whole data can slow us down.
Lightweight construction of the pipeline on a subsample#
We can tell skrub to subsample the data when computing the previews, with .skb.subsample().
 | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | current_annual_salary |
---|---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 | 69222.18 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 | 97392.47 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 | 104717.28 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014 | 52734.57 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007 | 93396.0 |
95 | M | COR | Correction and Rehabilitation | DS MCDC Custody and Security | Fulltime-Regular | Correctional Officer III (Corporal) | 02/05/2007 | 2007 | 62903.0 |
96 | F | DLC | Department of Liquor Control | Muddy Branch | Parttime-Regular | Liquor Store Clerk I | 04/23/2012 | 2012 | 29779.07 |
97 | M | COR | Correction and Rehabilitation | DS MCDC Custody and Security | Fulltime-Regular | Correctional Officer II (PFC) | 06/01/2004 | 2004 | 65623.0 |
98 | F | DLC | Department of Liquor Control | Walnut Hill | Parttime-Regular | Liquor Store Clerk I | 05/20/2012 | 2012 | 29779.07 |
99 | M | DOT | Department of Transportation | Transit Silver Spring Ride On | Fulltime-Regular | Bus Operator | 03/26/2007 | 2007 | 48258.81 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | 1 (1.0%) | 2 (2.0%) | |||||
1 | department | ObjectDType | 0 (0.0%) | 19 (19.0%) | |||||
2 | department_name | ObjectDType | 0 (0.0%) | 19 (19.0%) | |||||
3 | division | ObjectDType | 0 (0.0%) | 77 (77.0%) | |||||
4 | assignment_category | ObjectDType | 0 (0.0%) | 2 (2.0%) | |||||
5 | employee_position_title | ObjectDType | 0 (0.0%) | 59 (59.0%) | |||||
6 | date_first_hired | ObjectDType | 0 (0.0%) | 92 (92.0%) | |||||
7 | year_first_hired | Int64DType | 0 (0.0%) | 25 (25.0%) | 2.01e+03 | 8.68 | 1,979 | 2,007 | 2,016 |
8 | current_annual_salary | Float64DType | 0 (0.0%) | 90 (90.0%) | 6.62e+04 | 3.40e+04 | 1.60e+04 | 6.05e+04 | 2.28e+05 |
The rest of the pipeline will now use only 100 points for its previews.
To continue our pipeline we now define X and y:
And finally we apply a TableVectorizer then gradient boosting:
from sklearn.ensemble import HistGradientBoostingRegressor
predictions = employees.skb.apply(skrub.TableVectorizer()).skb.apply(
HistGradientBoostingRegressor(), y=salaries
)
All the lines above run very fast, including fitting the predictor above.
When we display our predictions expression, we see that the preview is computed on a subsample: the result column has only 100 entries.
 | current_annual_salary |
---|---|
0 | 66215.3326461818 |
1 | 98098.66216012926 |
2 | 104936.48241706485 |
3 | 56398.44582794643 |
4 | 97213.95682438718 |
95 | 63344.67859036612 |
96 | 30322.562396599835 |
97 | 66498.37905597333 |
98 | 26069.92391548338 |
99 | 45731.34194164067 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | current_annual_salary | Float64DType | 0 (0.0%) | 98 (98.0%) | 6.62e+04 | 2.87e+04 | 2.35e+04 | 6.21e+04 | 1.55e+05 |
We can also turn on subsampling for other methods of the expression, such as .skb.cross_validate(). Here we run the cross-validation on the small subsample of 100 rows we configured. With such a small subsample the scores are likely to be low, but this can help us quickly detect errors in our cross-validation scheme.
predictions.skb.cross_validate(keep_subsampling=True)
/home/circleci/project/.pixi/envs/doc/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:246: UserWarning:
Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
 | fit_time | score_time | test_score |
---|---|---|---|
0 | 0.182258 | 0.071600 | 0.101392 |
1 | 0.190907 | 0.093078 | 0.415373 |
2 | 0.244727 | 0.118236 | 0.640438 |
3 | 0.264047 | 0.070756 | 0.138710 |
4 | 0.196654 | 0.114999 | 0.360285 |
Evaluating the pipeline on the full data#
By default, when we do not explicitly ask for keep_subsampling=True, no subsampling takes place.
Here we run the cross-validation on the full data.
Note the longer fit_time and much better test_score.
predictions.skb.cross_validate()
 | fit_time | score_time | test_score |
---|---|---|---|
0 | 2.839187 | 0.261421 | 0.910676 |
1 | 2.282177 | 0.265637 | 0.885640 |
2 | 2.603117 | 0.237282 | 0.917163 |
3 | 2.597563 | 0.240626 | 0.924788 |
4 | 2.252195 | 0.237839 | 0.923810 |
Total running time of the script: (0 minutes 17.034 seconds)