Note
Go to the end to download the full example code. or to run this example in your browser via JupyterLite or Binder
Getting Started#
This guide showcases some of the features of skrub
, an open-source package
that aims at bridging the gap between tabular data stored in Pandas or Polars
dataframes, and machine-learning models.
Much of skrub
revolves around simplifying many of the tasks that are involved
in pre-processing raw data into a format that shallow or classic machine-learning
models can understand, that is, numerical data.
skrub
does this by vectorizing, assembling, and encoding tabular data through
a number of features that we present in this example and the following.
Downloading example datasets#
The datasets
module allows us to download tabular datasets and
demonstrate skrub
’s features.
Note
You can control the directory where the datasets are stored by:
setting in your environment the
SKRUB_DATA_DIRECTORY
variable to an absolute directory path,using the parameter
data_directory
in fetch functions, which takes precedence over the envar.
By default, the datasets are stored in a folder named “skrub_data” in the user home folder.
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
employees_df, salaries = dataset.X, dataset.y
Explore all the available datasets in Datasets.
Preliminary exploration and parsing of data#
Typically, the first operations that are done on new data involve data exploration
and parsing.
To quickly get an overview of a dataframe’s contents, use the
TableReport
.
Here, we also use the Cleaner
, a transformer that cleans the
dataframe by parsing nulls and dates, and by dropping “uninformative” columns
(e.g., that contain too many nulls, or that are constant).
from skrub import Cleaner, TableReport
TableReport(employees_df)
gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | |
---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1,986 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1,988 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1,989 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2,014 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2,007 |
9,223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 11/03/2015 | 2,015 |
9,224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 11/28/1988 | 1,988 |
9,225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Services | Parttime-Regular | Medical Doctor IV - Psychiatrist | 04/30/2001 | 2,001 |
9,226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2,006 |
9,227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012 | 2,012 |
gender
ObjectDType- Null values
- 17 (0.2%)
- Unique values
- 2 (< 0.1%)
Most frequent values
M
F
['M', 'F']
department
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
POL
HHS
FRS
DOT
COR
DLC
DGS
LIB
DPS
SHF
['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
Department of Police
Department of Health and Human Services
Fire and Rescue Services
Department of Transportation
Correction and Rehabilitation
Department of Liquor Control
Department of General Services
Department of Public Libraries
Department of Permitting Services
Sheriff's Office
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 2 (< 0.1%)
Most frequent values
Fulltime-Regular
Parttime-Regular
['Fulltime-Regular', 'Parttime-Regular']
employee_position_title
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
443 (4.8%)
This column has a high cardinality (> 40).
Most frequent values
Bus Operator
Police Officer III
Firefighter/Rescuer III
Manager III
Firefighter/Rescuer II
Master Firefighter/Rescuer
Office Services Coordinator
School Health Room Technician I
Police Officer II
Community Health Nurse II
['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
2,264 (24.5%)
This column has a high cardinality (> 40).
Most frequent values
12/12/2016
01/14/2013
02/24/2014
03/10/2014
08/12/2013
10/06/2014
09/22/2014
03/19/2007
07/29/2013
07/16/2012
['12/12/2016', '01/14/2013', '02/24/2014', '03/10/2014', '08/12/2013', '10/06/2014', '09/22/2014', '03/19/2007', '07/29/2013', '07/16/2012']
year_first_hired
Int64DType- Null values
- 0 (0.0%)
- Unique values
-
51 (0.6%)
This column has a high cardinality (> 40).
- Mean ± Std
- 2.00e+03 ± 9.33
- Median ± IQR
- 2,005 ± 14
- Min | Max
- 1,965 | 2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
|
Column name
|
dtype
|
Is sorted
|
Null values
|
Unique values
|
Mean
|
Std
|
Min
|
Median
|
Max
|
---|---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | False | 17 (0.2%) | 2 (< 0.1%) | |||||
1 | department | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | |||||
2 | department_name | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | |||||
3 | division | ObjectDType | False | 0 (0.0%) | 694 (7.5%) | |||||
4 | assignment_category | ObjectDType | False | 0 (0.0%) | 2 (< 0.1%) | |||||
5 | employee_position_title | ObjectDType | False | 0 (0.0%) | 443 (4.8%) | |||||
6 | date_first_hired | ObjectDType | False | 0 (0.0%) | 2264 (24.5%) | |||||
7 | year_first_hired | Int64DType | False | 0 (0.0%) | 51 (0.6%) | 2.00e+03 | 9.33 | 1,965 | 2,005 | 2,016 |
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
gender
ObjectDType- Null values
- 17 (0.2%)
- Unique values
- 2 (< 0.1%)
Most frequent values
M
F
['M', 'F']
department
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
POL
HHS
FRS
DOT
COR
DLC
DGS
LIB
DPS
SHF
['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
Department of Police
Department of Health and Human Services
Fire and Rescue Services
Department of Transportation
Correction and Rehabilitation
Department of Liquor Control
Department of General Services
Department of Public Libraries
Department of Permitting Services
Sheriff's Office
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 2 (< 0.1%)
Most frequent values
Fulltime-Regular
Parttime-Regular
['Fulltime-Regular', 'Parttime-Regular']
employee_position_title
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
443 (4.8%)
This column has a high cardinality (> 40).
Most frequent values
Bus Operator
Police Officer III
Firefighter/Rescuer III
Manager III
Firefighter/Rescuer II
Master Firefighter/Rescuer
Office Services Coordinator
School Health Room Technician I
Police Officer II
Community Health Nurse II
['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
2,264 (24.5%)
This column has a high cardinality (> 40).
Most frequent values
12/12/2016
01/14/2013
02/24/2014
03/10/2014
08/12/2013
10/06/2014
09/22/2014
03/19/2007
07/29/2013
07/16/2012
['12/12/2016', '01/14/2013', '02/24/2014', '03/10/2014', '08/12/2013', '10/06/2014', '09/22/2014', '03/19/2007', '07/29/2013', '07/16/2012']
year_first_hired
Int64DType- Null values
- 0 (0.0%)
- Unique values
-
51 (0.6%)
This column has a high cardinality (> 40).
- Mean ± Std
- 2.00e+03 ± 9.33
- Median ± IQR
- 2,005 ± 14
- Min | Max
- 1,965 | 2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column 1 | Column 2 | Cramér's V | Pearson's Correlation |
---|---|---|---|
department | department_name | 1.00 | |
division | assignment_category | 0.567 | |
division | employee_position_title | 0.463 | |
assignment_category | employee_position_title | 0.459 | |
department_name | employee_position_title | 0.417 | |
department | employee_position_title | 0.417 | |
department | assignment_category | 0.399 | |
department_name | assignment_category | 0.399 | |
department | division | 0.368 | |
department_name | division | 0.368 | |
gender | department | 0.362 | |
gender | department_name | 0.362 | |
employee_position_title | date_first_hired | 0.276 | |
gender | employee_position_title | 0.249 | |
gender | assignment_category | 0.248 | |
gender | division | 0.239 | |
department | date_first_hired | 0.152 | |
department_name | date_first_hired | 0.152 | |
date_first_hired | year_first_hired | 0.148 | |
employee_position_title | year_first_hired | 0.139 |
Please enable javascript
The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").
From the Report above, we can see that there are datetime columns, so we use the
Cleaner
to parse them.
employees_df = Cleaner().fit_transform(employees_df)
TableReport(employees_df)
gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | |
---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 1986-09-22 00:00:00 | 1,986 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 1988-09-12 00:00:00 | 1,988 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 1989-11-19 00:00:00 | 1,989 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 2014-05-05 00:00:00 | 2,014 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 2007-03-05 00:00:00 | 2,007 |
9,223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 2015-11-03 00:00:00 | 2,015 |
9,224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 1988-11-28 00:00:00 | 1,988 |
9,225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Services | Parttime-Regular | Medical Doctor IV - Psychiatrist | 2001-04-30 00:00:00 | 2,001 |
9,226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 2006-09-05 00:00:00 | 2,006 |
9,227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 2012-01-30 00:00:00 | 2,012 |
gender
ObjectDType- Null values
- 17 (0.2%)
- Unique values
- 2 (< 0.1%)
Most frequent values
M
F
['M', 'F']
department
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
POL
HHS
FRS
DOT
COR
DLC
DGS
LIB
DPS
SHF
['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
Department of Police
Department of Health and Human Services
Fire and Rescue Services
Department of Transportation
Correction and Rehabilitation
Department of Liquor Control
Department of General Services
Department of Public Libraries
Department of Permitting Services
Sheriff's Office
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 2 (< 0.1%)
Most frequent values
Fulltime-Regular
Parttime-Regular
['Fulltime-Regular', 'Parttime-Regular']
employee_position_title
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
443 (4.8%)
This column has a high cardinality (> 40).
Most frequent values
Bus Operator
Police Officer III
Firefighter/Rescuer III
Manager III
Firefighter/Rescuer II
Master Firefighter/Rescuer
Office Services Coordinator
School Health Room Technician I
Police Officer II
Community Health Nurse II
['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
-
2,264 (24.5%)
This column has a high cardinality (> 40).
- Min | Max
- 1965-09-30T00:00:00 | 2016-12-27T00:00:00
year_first_hired
Int64DType- Null values
- 0 (0.0%)
- Unique values
-
51 (0.6%)
This column has a high cardinality (> 40).
- Mean ± Std
- 2.00e+03 ± 9.33
- Median ± IQR
- 2,005 ± 14
- Min | Max
- 1,965 | 2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
|
Column name
|
dtype
|
Is sorted
|
Null values
|
Unique values
|
Mean
|
Std
|
Min
|
Median
|
Max
|
---|---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | False | 17 (0.2%) | 2 (< 0.1%) | |||||
1 | department | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | |||||
2 | department_name | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | |||||
3 | division | ObjectDType | False | 0 (0.0%) | 694 (7.5%) | |||||
4 | assignment_category | ObjectDType | False | 0 (0.0%) | 2 (< 0.1%) | |||||
5 | employee_position_title | ObjectDType | False | 0 (0.0%) | 443 (4.8%) | |||||
6 | date_first_hired | DateTime64DType | False | 0 (0.0%) | 2264 (24.5%) | 1965-09-30T00:00:00 | 2016-12-27T00:00:00 | |||
7 | year_first_hired | Int64DType | False | 0 (0.0%) | 51 (0.6%) | 2.00e+03 | 9.33 | 1,965 | 2,005 | 2,016 |
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
gender
ObjectDType- Null values
- 17 (0.2%)
- Unique values
- 2 (< 0.1%)
Most frequent values
M
F
['M', 'F']
department
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
POL
HHS
FRS
DOT
COR
DLC
DGS
LIB
DPS
SHF
['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
Department of Police
Department of Health and Human Services
Fire and Rescue Services
Department of Transportation
Correction and Rehabilitation
Department of Liquor Control
Department of General Services
Department of Public Libraries
Department of Permitting Services
Sheriff's Office
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 2 (< 0.1%)
Most frequent values
Fulltime-Regular
Parttime-Regular
['Fulltime-Regular', 'Parttime-Regular']
employee_position_title
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
443 (4.8%)
This column has a high cardinality (> 40).
Most frequent values
Bus Operator
Police Officer III
Firefighter/Rescuer III
Manager III
Firefighter/Rescuer II
Master Firefighter/Rescuer
Office Services Coordinator
School Health Room Technician I
Police Officer II
Community Health Nurse II
['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
-
2,264 (24.5%)
This column has a high cardinality (> 40).
- Min | Max
- 1965-09-30T00:00:00 | 2016-12-27T00:00:00
year_first_hired
Int64DType- Null values
- 0 (0.0%)
- Unique values
-
51 (0.6%)
This column has a high cardinality (> 40).
- Mean ± Std
- 2.00e+03 ± 9.33
- Median ± IQR
- 2,005 ± 14
- Min | Max
- 1,965 | 2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column 1 | Column 2 | Cramér's V | Pearson's Correlation |
---|---|---|---|
department | department_name | 1.00 | |
date_first_hired | year_first_hired | 0.916 | |
division | assignment_category | 0.619 | |
assignment_category | employee_position_title | 0.530 | |
department_name | employee_position_title | 0.455 | |
department | employee_position_title | 0.455 | |
division | employee_position_title | 0.427 | |
department | assignment_category | 0.400 | |
department_name | assignment_category | 0.400 | |
department | division | 0.372 | |
department_name | division | 0.372 | |
gender | department | 0.362 | |
gender | department_name | 0.362 | |
gender | employee_position_title | 0.276 | |
gender | division | 0.257 | |
gender | assignment_category | 0.241 | |
employee_position_title | year_first_hired | 0.137 | |
employee_position_title | date_first_hired | 0.135 | |
department | date_first_hired | 0.0801 | |
department_name | date_first_hired | 0.0801 |
Please enable javascript
The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").
You can use the interactive display above to explore the dataset visually.
Note
You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.
It is also possible to tell skrub
to replace the default pandas & polars
displays with TableReport
by modifying the global config with
set_config()
.
from skrub import set_config
set_config(use_table_report=True)
employees_df
gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | |
---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 1986-09-22 00:00:00 | 1,986 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 1988-09-12 00:00:00 | 1,988 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 1989-11-19 00:00:00 | 1,989 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 2014-05-05 00:00:00 | 2,014 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 2007-03-05 00:00:00 | 2,007 |
9,223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 2015-11-03 00:00:00 | 2,015 |
9,224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 1988-11-28 00:00:00 | 1,988 |
9,225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Services | Parttime-Regular | Medical Doctor IV - Psychiatrist | 2001-04-30 00:00:00 | 2,001 |
9,226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 2006-09-05 00:00:00 | 2,006 |
9,227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 2012-01-30 00:00:00 | 2,012 |
gender
ObjectDType- Null values
- 17 (0.2%)
- Unique values
- 2 (< 0.1%)
Most frequent values
M
F
['M', 'F']
department
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
POL
HHS
FRS
DOT
COR
DLC
DGS
LIB
DPS
SHF
['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
Department of Police
Department of Health and Human Services
Fire and Rescue Services
Department of Transportation
Correction and Rehabilitation
Department of Liquor Control
Department of General Services
Department of Public Libraries
Department of Permitting Services
Sheriff's Office
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 2 (< 0.1%)
Most frequent values
Fulltime-Regular
Parttime-Regular
['Fulltime-Regular', 'Parttime-Regular']
employee_position_title
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
443 (4.8%)
This column has a high cardinality (> 40).
Most frequent values
Bus Operator
Police Officer III
Firefighter/Rescuer III
Manager III
Firefighter/Rescuer II
Master Firefighter/Rescuer
Office Services Coordinator
School Health Room Technician I
Police Officer II
Community Health Nurse II
['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
-
2,264 (24.5%)
This column has a high cardinality (> 40).
- Min | Max
- 1965-09-30T00:00:00 | 2016-12-27T00:00:00
year_first_hired
Int64DType- Null values
- 0 (0.0%)
- Unique values
-
51 (0.6%)
This column has a high cardinality (> 40).
- Mean ± Std
- 2.00e+03 ± 9.33
- Median ± IQR
- 2,005 ± 14
- Min | Max
- 1,965 | 2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
|
Column name
|
dtype
|
Is sorted
|
Null values
|
Unique values
|
Mean
|
Std
|
Min
|
Median
|
Max
|
---|---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | False | 17 (0.2%) | 2 (< 0.1%) | |||||
1 | department | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | |||||
2 | department_name | ObjectDType | False | 0 (0.0%) | 37 (0.4%) | |||||
3 | division | ObjectDType | False | 0 (0.0%) | 694 (7.5%) | |||||
4 | assignment_category | ObjectDType | False | 0 (0.0%) | 2 (< 0.1%) | |||||
5 | employee_position_title | ObjectDType | False | 0 (0.0%) | 443 (4.8%) | |||||
6 | date_first_hired | DateTime64DType | False | 0 (0.0%) | 2264 (24.5%) | 1965-09-30T00:00:00 | 2016-12-27T00:00:00 | |||
7 | year_first_hired | Int64DType | False | 0 (0.0%) | 51 (0.6%) | 2.00e+03 | 9.33 | 1,965 | 2,005 | 2,016 |
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
gender
ObjectDType- Null values
- 17 (0.2%)
- Unique values
- 2 (< 0.1%)
Most frequent values
M
F
['M', 'F']
department
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
POL
HHS
FRS
DOT
COR
DLC
DGS
LIB
DPS
SHF
['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 37 (0.4%)
Most frequent values
Department of Police
Department of Health and Human Services
Fire and Rescue Services
Department of Transportation
Correction and Rehabilitation
Department of Liquor Control
Department of General Services
Department of Public Libraries
Department of Permitting Services
Sheriff's Office
['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
694 (7.5%)
This column has a high cardinality (> 40).
Most frequent values
School Health Services
Transit Silver Spring Ride On
Transit Gaithersburg Ride On
Highway Services
Child Welfare Services
FSB Traffic Division School Safety Section
Income Supports
PSB 3rd District Patrol
PSB 4th District Patrol
Transit Nicholson Ride On
['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 2 (< 0.1%)
Most frequent values
Fulltime-Regular
Parttime-Regular
['Fulltime-Regular', 'Parttime-Regular']
employee_position_title
ObjectDType- Null values
- 0 (0.0%)
- Unique values
-
443 (4.8%)
This column has a high cardinality (> 40).
Most frequent values
Bus Operator
Police Officer III
Firefighter/Rescuer III
Manager III
Firefighter/Rescuer II
Master Firefighter/Rescuer
Office Services Coordinator
School Health Room Technician I
Police Officer II
Community Health Nurse II
['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
-
2,264 (24.5%)
This column has a high cardinality (> 40).
- Min | Max
- 1965-09-30T00:00:00 | 2016-12-27T00:00:00
year_first_hired
Int64DType- Null values
- 0 (0.0%)
- Unique values
-
51 (0.6%)
This column has a high cardinality (> 40).
- Mean ± Std
- 2.00e+03 ± 9.33
- Median ± IQR
- 2,005 ± 14
- Min | Max
- 1,965 | 2,016
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column 1 | Column 2 | Cramér's V | Pearson's Correlation |
---|---|---|---|
department | department_name | 1.00 | |
date_first_hired | year_first_hired | 0.923 | |
division | assignment_category | 0.591 | |
assignment_category | employee_position_title | 0.501 | |
department_name | assignment_category | 0.431 | |
department | assignment_category | 0.431 | |
division | employee_position_title | 0.424 | |
department | employee_position_title | 0.422 | |
department_name | employee_position_title | 0.422 | |
gender | department_name | 0.377 | |
gender | department | 0.377 | |
department_name | division | 0.368 | |
department | division | 0.368 | |
gender | employee_position_title | 0.285 | |
gender | assignment_category | 0.272 | |
gender | division | 0.272 | |
employee_position_title | year_first_hired | 0.140 | |
employee_position_title | date_first_hired | 0.139 | |
division | year_first_hired | 0.0863 | |
division | date_first_hired | 0.0847 |
Please enable javascript
The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").
This setting can easily be reverted:
set_config(use_table_report=False)
employees_df
gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | |
---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 1986-09-22 | 1986 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 1988-09-12 | 1988 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 1989-11-19 | 1989 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 2014-05-05 | 2014 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 2007-03-05 | 2007 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
9223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 2015-11-03 | 2015 |
9224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 1988-11-28 | 1988 |
9225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Serv... | Parttime-Regular | Medical Doctor IV - Psychiatrist | 2001-04-30 | 2001 |
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 2006-09-05 | 2006 |
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 2012-01-30 | 2012 |
9228 rows × 8 columns
Easily building a strong baseline for tabular machine learning#
The goal of skrub
is to ease tabular data preparation for machine learning.
The tabular_pipeline()
function provides an easy way to build a simple
but reliable machine learning model that works well on most tabular data.
from sklearn.model_selection import cross_validate
from skrub import tabular_pipeline
model = tabular_pipeline("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.90515874, 0.88150207, 0.91658913, 0.9211787 , 0.92464814])
To handle rich tabular data and feed it to a machine learning model, the
pipeline returned by tabular_pipeline()
preprocesses and encodes
strings, categories and dates using the TableVectorizer
.
See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for
more details. An overview of the chosen defaults is available in
Strong baseline pipelines.
Assembling data#
skrub
allows imperfect assembly of data, such as joining dataframes
on columns that contain typos. skrub
’s joiners have fit
and
transform
methods, storing information about the data across calls.
The Joiner
allows fuzzy-joining multiple tables, each row of
a main table will be augmented with values from the best match in the auxiliary table.
You can control how distant fuzzy-matches are allowed to be with the
max_dist
parameter.
In the following, we add information about countries to a table containing airports and the cities they are in:
import pandas as pd
from skrub import Joiner
airports = pd.DataFrame(
{
"airport_id": [1, 2],
"airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
"city": ["Paris", "Roma"],
}
)
# Notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
{"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
capitals,
main_key="city",
aux_key="capital",
max_dist=0.8,
add_match_info=False,
)
joiner.fit_transform(airports)
airport_id | airport_name | city | capital | country | |
---|---|---|---|---|---|
0 | 1 | Charles de Gaulle | Paris | Paris | France |
1 | 2 | Aeroporto Leonardo da Vinci | Roma | Rome | Italy |
Information about countries have been added, even if the rows aren’t exactly matching.
skrub
allows to aggregate multiple tables according to various strategies: you
can see other ways to join multiple tables in Assembling: joining multiple tables.
Encoding any data as numerical features#
Tabular data can contain a variety of datatypes, ranging from numerical, to datetimes, to categories, strings, and text. Encoding features in a meaningful way requires a lot of effort and is a major part of the feature engineering process that is required to properly train machine learning models.
skrub
helps with this by providing various transformers that automatically
encode different datatypes into float32
features.
For numerical features, the SquashingScaler
applies a robust
scaling technique that is less sensitive to outliers. Check the
relative example
for more information on the feature.
For datetime columns, skrub
provides the DatetimeEncoder
which can extract useful features such as year, month, day, as well as additional
features such as weekday or day of year. Periodic encoding with trigonometric
or spline features is also available. Refer to the DatetimeEncoder
documentation for more detail.
import pandas as pd
data = pd.DataFrame(
{
"event": ["A", "B", "C"],
"date_1": ["2020-01-01", "2020-06-15", "2021-03-22"],
"date_2": ["2020-01-15", "2020-07-01", "2021-04-05"],
}
)
data = Cleaner().fit_transform(data)
TableReport(data)
event | date_1 | date_2 | |
---|---|---|---|
0 | A | 2020-01-01 00:00:00 | 2020-01-15 00:00:00 |
1 | B | 2020-06-15 00:00:00 | 2020-07-01 00:00:00 |
2 | C | 2021-03-22 00:00:00 | 2021-04-05 00:00:00 |
event
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 3 (100.0%)
Most frequent values
A
B
C
['A', 'B', 'C']
date_1
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
- 3 (100.0%)
- Min | Max
- 2020-01-01T00:00:00 | 2021-03-22T00:00:00
date_2
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
- 3 (100.0%)
- Min | Max
- 2020-01-15T00:00:00 | 2021-04-05T00:00:00
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column
|
Column name
|
dtype
|
Is sorted
|
Null values
|
Unique values
|
Mean
|
Std
|
Min
|
Median
|
Max
|
---|---|---|---|---|---|---|---|---|---|---|
0 | event | ObjectDType | True | 0 (0.0%) | 3 (100.0%) | |||||
1 | date_1 | DateTime64DType | True | 0 (0.0%) | 3 (100.0%) | 2020-01-01T00:00:00 | 2021-03-22T00:00:00 | |||
2 | date_2 | DateTime64DType | True | 0 (0.0%) | 3 (100.0%) | 2020-01-15T00:00:00 | 2021-04-05T00:00:00 |
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
event
ObjectDType- Null values
- 0 (0.0%)
- Unique values
- 3 (100.0%)
Most frequent values
A
B
C
['A', 'B', 'C']
date_1
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
- 3 (100.0%)
- Min | Max
- 2020-01-01T00:00:00 | 2021-03-22T00:00:00
date_2
DateTime64DType- Null values
- 0 (0.0%)
- Unique values
- 3 (100.0%)
- Min | Max
- 2020-01-15T00:00:00 | 2021-04-05T00:00:00
No columns match the selected filter: . You can change the column filter in the dropdown menu above.
Column 1 | Column 2 | Cramér's V | Pearson's Correlation |
---|---|---|---|
date_1 | date_2 | 1.00 | |
event | date_2 | 1.00 | |
event | date_1 | 1.00 |
Please enable javascript
The skrub table reports need javascript to display correctly. If you are displaying a report in a Jupyter notebook and you see this message, you may need to re-execute the cell or to trust the notebook (button on the top right or "File > Trust notebook").
skrub
transformers are applied column-by-column, but it is possible to use
the ApplyToCols
meta-transformer to apply a transformer to
multiple columns at once. Complex column selection is possible using
skrub’s column selectors.
from skrub import ApplyToCols, DatetimeEncoder
ApplyToCols(
DatetimeEncoder(add_total_seconds=False), cols=["date_1", "date_2"]
).fit_transform(data)
event | date_1_year | date_1_month | date_1_day | date_2_year | date_2_month | date_2_day | |
---|---|---|---|---|---|---|---|
0 | A | 2020.0 | 1.0 | 1.0 | 2020.0 | 1.0 | 15.0 |
1 | B | 2020.0 | 6.0 | 15.0 | 2020.0 | 7.0 | 1.0 |
2 | C | 2021.0 | 3.0 | 22.0 | 2021.0 | 4.0 | 5.0 |
Finally, when a column contains categorical or string data, it can be
encoded using various encoders provided by skrub
. The default encoder is
the StringEncoder
, which encodes categories using
Latent Semantic Analysis (LSA).
It is a simple and efficient way to encode categories, and works well in
practice.
data = pd.DataFrame(
{
"city": ["Paris", "London", "Berlin", "Madrid", "Rome"],
"country": ["France", "UK", "Germany", "Spain", "Italy"],
}
)
TableReport(data)
from skrub import StringEncoder
StringEncoder(n_components=3).fit_transform(data["city"])
city_0 | city_1 | city_2 | |
---|---|---|---|
0 | 9.549681e-08 | -2.087682e-08 | 1.525498e+00 |
1 | 1.014391e+00 | -1.847013e-01 | -8.526597e-08 |
2 | 8.380367e-01 | 9.361470e-01 | 1.514242e-07 |
3 | 2.806120e-01 | 5.614111e-01 | -4.905649e-08 |
4 | -7.191011e-01 | 1.049512e+00 | 3.392050e-08 |
If your data includes a lot of text, you may want to use the
TextEncoder
,
which uses pre-trained language models retrieved from the HuggingFace hub to
create meaningful text embeddings.
See Feature engineering for categorical data for more details on all the categorical encoders
provided by skrub
, and Encoding: from a dataframe to a numerical matrix for machine learning for a
comparison between the different methods.
Advanced use cases#
If your use case involves more complex data preparation, hyperparameter tuning,
or model selection, if you want to build a multi-table pipeline that requires
assembling and preparing multiple tables, or if you want to make sure that the
data preparation can be reproduced exactly, you can use the skrub
Data Ops,
a powerful framework which provides tools to build complex data processing pipelines.
See the relative user guide and the
Skrub DataOps
examples for more details.
Next steps#
We have briefly covered pipeline creation, vectorizing, assembling, and encoding
data. We presented the main functionalities of skrub
, but there is much
more to it!
Please refer to our User Guide for a more in-depth presentation of
skrub
’s concepts, or visit our
examples for more
illustrations of the tools that we provide!
Total running time of the script: (0 minutes 9.884 seconds)