Getting Started#
This guide showcases the features of skrub, an open-source package that aims at bridging the gap between tabular data sources and machine-learning models.
Much of skrub revolves around vectorizing, assembling, and encoding tabular data, to prepare data in a format that shallow or classic machine-learning models understand.
Downloading example datasets#
The datasets module allows us to download tabular datasets and demonstrate skrub’s features.
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
employees_df, salaries = dataset.X, dataset.y
Explore all the available datasets in Downloading a dataset.
Generating an interactive report for a dataframe#
To quickly get an overview of a dataframe’s contents, use the TableReport.
from skrub import TableReport
TableReport(employees_df)
 | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired |
---|---|---|---|---|---|---|---|---|
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records Management Section | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 |
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 |
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 |
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014 |
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007 |
9223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 11/03/2015 | 2015 |
9224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 11/28/1988 | 1988 |
9225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Services | Parttime-Regular | Medical Doctor IV - Psychiatrist | 04/30/2001 | 2001 |
9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2006 |
9227 | M | DLC | Department of Liquor Control | Licensure, Regulation and Education | Fulltime-Regular | Alcohol/Tobacco Enforcement Specialist II | 01/30/2012 | 2012 |
gender (ObjectDType): 17 null values (0.2%); 2 unique values (< 0.1%). Most frequent values: ['M', 'F']
department (ObjectDType): 0 null values (0.0%); 37 unique values (0.4%). Most frequent values: ['POL', 'HHS', 'FRS', 'DOT', 'COR', 'DLC', 'DGS', 'LIB', 'DPS', 'SHF']
department_name (ObjectDType): 0 null values (0.0%); 37 unique values (0.4%). Most frequent values: ['Department of Police', 'Department of Health and Human Services', 'Fire and Rescue Services', 'Department of Transportation', 'Correction and Rehabilitation', 'Department of Liquor Control', 'Department of General Services', 'Department of Public Libraries', 'Department of Permitting Services', "Sheriff's Office"]
division (ObjectDType): 0 null values (0.0%); 694 unique values (7.5%). Most frequent values: ['School Health Services', 'Transit Silver Spring Ride On', 'Transit Gaithersburg Ride On', 'Highway Services', 'Child Welfare Services', 'FSB Traffic Division School Safety Section', 'Income Supports', 'PSB 3rd District Patrol', 'PSB 4th District Patrol', 'Transit Nicholson Ride On']
assignment_category (ObjectDType): 0 null values (0.0%); 2 unique values (< 0.1%). Most frequent values: ['Fulltime-Regular', 'Parttime-Regular']
employee_position_title (ObjectDType): 0 null values (0.0%); 443 unique values (4.8%). Most frequent values: ['Bus Operator', 'Police Officer III', 'Firefighter/Rescuer III', 'Manager III', 'Firefighter/Rescuer II', 'Master Firefighter/Rescuer', 'Office Services Coordinator', 'School Health Room Technician I', 'Police Officer II', 'Community Health Nurse II']
date_first_hired (ObjectDType): 0 null values (0.0%); 2,264 unique values (24.5%). Most frequent values: ['12/12/2016', '01/14/2013', '02/24/2014', '03/10/2014', '08/12/2013', '10/06/2014', '09/22/2014', '03/19/2007', '07/16/2012', '07/29/2013']
year_first_hired (Int64DType): 0 null values (0.0%); 51 unique values (0.6%). Mean ± Std: 2.00e+03 ± 9.33; Median ± IQR: 2,005 ± 14; Min | Max: 1,965 | 2,016
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | gender | ObjectDType | 17 (0.2%) | 2 (< 0.1%) | |||||
1 | department | ObjectDType | 0 (0.0%) | 37 (0.4%) | |||||
2 | department_name | ObjectDType | 0 (0.0%) | 37 (0.4%) | |||||
3 | division | ObjectDType | 0 (0.0%) | 694 (7.5%) | |||||
4 | assignment_category | ObjectDType | 0 (0.0%) | 2 (< 0.1%) | |||||
5 | employee_position_title | ObjectDType | 0 (0.0%) | 443 (4.8%) | |||||
6 | date_first_hired | ObjectDType | 0 (0.0%) | 2264 (24.5%) | |||||
7 | year_first_hired | Int64DType | 0 (0.0%) | 51 (0.6%) | 2.00e+03 | 9.33 | 1,965 | 2,005 | 2,016 |
Column 1 | Column 2 | Cramér's V |
---|---|---|
department | department_name | 1.00 |
division | assignment_category | 0.626 |
assignment_category | employee_position_title | 0.514 |
division | employee_position_title | 0.431 |
department_name | employee_position_title | 0.416 |
department | employee_position_title | 0.416 |
department | assignment_category | 0.393 |
department_name | assignment_category | 0.393 |
gender | department | 0.374 |
gender | department_name | 0.374 |
department | division | 0.368 |
department_name | division | 0.368 |
gender | employee_position_title | 0.290 |
gender | assignment_category | 0.267 |
gender | division | 0.253 |
employee_position_title | date_first_hired | 0.235 |
date_first_hired | year_first_hired | 0.164 |
department_name | date_first_hired | 0.158 |
department | date_first_hired | 0.158 |
employee_position_title | year_first_hired | 0.121 |
You can use the interactive display above to explore the dataset visually.
Note
You can see a few more example reports online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser, without installing anything.
It is also possible to tell skrub to replace the default pandas & polars displays with TableReport.
from skrub import patch_display, unpatch_display
patch_display()
employees_df
The effect of patch_display can be undone with skrub.unpatch_display().
Easily building a strong baseline for tabular machine learning#
The goal of skrub is to ease tabular data preparation for machine learning.
The tabular_learner() function provides an easy way to build a simple but reliable machine-learning model that works well on most tabular data.
from sklearn.model_selection import cross_validate
from skrub import tabular_learner
model = tabular_learner("regressor")
results = cross_validate(model, employees_df, salaries)
results["test_score"]
array([0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666])
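cross_validate returns one test score per fold. Assuming the default scoring, which for scikit-learn regressors is the coefficient of determination R², the five values can be summarized as a mean and a spread. The following is a minimal standard-library sketch that simply copies the scores printed above:

```python
from statistics import mean, stdev

# Test scores copied from the cross_validate output above (one per fold).
test_scores = [0.89370447, 0.89279068, 0.92282557, 0.92319094, 0.92162666]

mean_score = mean(test_scores)
score_spread = stdev(test_scores)  # sample standard deviation across folds

print(f"score: {mean_score:.3f} ± {score_spread:.3f}")
```

Reporting the spread alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold variation.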
To handle rich tabular data and feed it to a machine-learning model, the pipeline returned by tabular_learner() preprocesses and encodes strings, categories and dates using the TableVectorizer.
See its documentation or Encoding: from a dataframe to a numerical matrix for machine learning for more details. An overview of the chosen defaults is available in End-to-end predictive models.
Assembling data#
Skrub allows imperfect assembly of data, such as joining dataframes on columns that contain typos. Skrub’s joiners have fit and transform methods, storing information about the data across calls.
The Joiner allows fuzzy-joining multiple tables: each row of the main table is augmented with values from the best match in the auxiliary table.
You can control how distant fuzzy matches are allowed to be with the max_dist parameter.
In the following, we add information about countries to a table containing airports and the cities they are in:
import pandas as pd
from skrub import Joiner
airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
# notice the "Rome" instead of "Roma"
capitals = pd.DataFrame(
    {"capital": ["Berlin", "Paris", "Rome"], "country": ["Germany", "France", "Italy"]}
)
joiner = Joiner(
    capitals,
    main_key="city",
    aux_key="capital",
    max_dist=0.8,
    add_match_info=False,
)
joiner.fit_transform(airports)
 | airport_id | airport_name | city | capital | country |
---|---|---|---|---|---|
0 | 1 | Charles de Gaulle | Paris | Paris | France |
1 | 2 | Aeroporto Leonardo da Vinci | Roma | Rome | Italy |
Information about countries has been added, even though the keys do not match exactly.
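To make the role of a distance cutoff more concrete, here is a minimal standard-library sketch of fuzzy matching. It is not how Joiner is implemented: it swaps in difflib.SequenceMatcher as the similarity measure, and the helper best_match and its distance scale are hypothetical, chosen for illustration only.

```python
from difflib import SequenceMatcher


def best_match(value, candidates, max_dist):
    """Return the closest candidate, or None if it is farther than max_dist.

    Distance is 1 - SequenceMatcher ratio: 0.0 for identical strings,
    1.0 for strings with nothing in common. This scale is an assumption
    of the sketch and is not comparable to skrub's max_dist.
    """
    scored = [
        (1.0 - SequenceMatcher(None, value.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    distance, candidate = min(scored)
    return candidate if distance <= max_dist else None


capitals = {"Berlin": "Germany", "Paris": "France", "Rome": "Italy"}

for city in ["Paris", "Roma", "Tokyo"]:
    match = best_match(city, capitals, max_dist=0.5)
    country = capitals.get(match)  # None when no capital is close enough
    print(city, "->", match, country)
```

With this cutoff, "Roma" is paired with "Rome" while "Tokyo" is left unmatched, which is the trade-off a distance threshold controls: a looser value tolerates more typos but risks spurious matches.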
It’s also possible to augment data by joining and aggregating multiple dataframes with the AggJoiner. This is particularly useful to summarize information scattered across tables, for instance adding statistics about flights to the dataframe of airports:
from skrub import AggJoiner
flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],  # the cols to perform aggregation on
    operations=["mean", "std"],  # the operations to compute
)
agg_joiner.fit_transform(airports)
 | airport_id | airport_name | city | total_passengers_mean | total_passengers_std |
---|---|---|---|---|---|
0 | 1 | Charles de Gaulle | Paris | 103.333333 | 15.275252 |
1 | 2 | Aeroporto Leonardo da Vinci | Roma | 80.000000 | 10.000000 |
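Conceptually, the result above amounts to grouping the auxiliary table by its key, aggregating each group, and attaching the aggregates to the matching rows of the main table. The plain-Python sketch below reproduces the two statistics for the toy flights data. It illustrates the idea only and says nothing about how AggJoiner is implemented; it assumes "std" denotes the sample standard deviation, which is consistent with the values shown above.

```python
from collections import defaultdict
from statistics import mean, stdev

airports = [
    {"airport_id": 1, "city": "Paris"},
    {"airport_id": 2, "city": "Roma"},
]
# (from_airport, total_passengers) pairs, same values as the flights table.
flights = list(zip([1, 1, 1, 2, 2, 2], [90, 120, 100, 70, 80, 90]))

# Step 1: group passenger counts by departure airport.
passengers = defaultdict(list)
for from_airport, total_passengers in flights:
    passengers[from_airport].append(total_passengers)

# Step 2: aggregate each group and attach it to the matching airport row.
augmented = []
for airport in airports:
    counts = passengers[airport["airport_id"]]
    augmented.append(
        {
            **airport,
            "total_passengers_mean": mean(counts),
            "total_passengers_std": stdev(counts),  # sample std (n - 1)
        }
    )

for row in augmented:
    print(row)
```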
For joining multiple auxiliary tables on a main table at once, use the MultiAggJoiner.
See other ways to join multiple tables in Assembling: joining multiple tables.
Encoding data#
When a column contains categories with variations and typos, it can be encoded using one of skrub’s encoders, such as the GapEncoder.
The GapEncoder creates a continuous encoding, based on the activation of latent categories. It builds the encoding from combinations of substrings that frequently co-occur.
For instance, we might want to encode a column X that contains information about cities, being either Madrid or Rome:
from skrub import GapEncoder
X = pd.Series(
    [
        "Rome, Italy",
        "Rome",
        "Roma, Italia",
        "Madrid, SP",
        "Madrid, spain",
        "Madrid",
        "Romq",
        "Rome, It",
    ],
    name="city",
)
enc = GapEncoder(n_components=2, random_state=0)  # 2 topics in the data
enc.fit(X)
GapEncoder(n_components=2, random_state=0)
The GapEncoder has found the following two topics:
enc.get_feature_names_out()
['city: madrid, spain, sp', 'city: italia, italy, romq']
These correspond to the two cities.
Let’s see the activation of each topic depending on the rows of X:
encoded = enc.fit_transform(X).assign(original=X)
encoded
 | city: madrid, spain, sp | city: italia, italy, romq | original |
---|---|---|---|
0 | 0.052257 | 13.547743 | Rome, Italy |
1 | 0.050202 | 3.049798 | Rome |
2 | 0.063282 | 15.036718 | Roma, Italia |
3 | 12.047028 | 0.052972 | Madrid, SP |
4 | 16.547818 | 0.052182 | Madrid, spain |
5 | 6.048861 | 0.051139 | Madrid |
6 | 0.050019 | 3.049981 | Romq |
7 | 0.053193 | 9.046807 | Rome, It |
The higher the activation, the closer the row to the latent topic. These columns can now be understood by a machine-learning model.
The other encoders are presented in Encoding: creating feature matrices.
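Substring-based encoding copes with typos and variations because misspelled or reformatted strings still share many of their character n-grams. The sketch below illustrates only this intuition, using a Jaccard similarity over 3-grams; it is not the GapEncoder algorithm, and the helpers char_ngrams and jaccard are hypothetical names introduced for the example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Share of n-grams two strings have in common (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    # Strings shorter than n have no n-grams; treat that as no overlap.
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Two spellings of the same city share several 3-grams...
print(jaccard("Rome, Italy", "Roma, Italia"))
print(jaccard("Madrid, SP", "Madrid, spain"))
# ...while entries about different cities share none here.
print(jaccard("Rome, Italy", "Madrid, spain"))
```

Entries that refer to the same city score well above zero while the cross-city pair scores zero, which is the kind of signal an encoder can exploit without any manual cleaning of the strings.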
Next steps#
We have briefly covered pipeline creation, vectorizing, assembling, and encoding data. We presented the main functionalities of skrub, but there is much more to it!
Please refer to our User guide for a more in-depth presentation of skrub’s concepts, or visit our examples for more illustrations of the tools that we provide!