Note
Go to the end to download the full example code, or to run this example in your browser via JupyterLite or Binder.
Building a predictive model by combining multiple tables with the skrub DataOps#
This example introduces the skrub DataOps, which allow building complex data processing pipelines that handle multiple tables, with hyperparameter tuning and model selection.
DataOps implicitly form a “plan”, which records all the operations performed on the data. The DataOps plan can be exported as a Learner, a standalone object that can be saved to disk, loaded in a new environment, and used to make predictions on new data.
Here we show the basics of the skrub DataOps in a two-table scenario: how to create DataOps, how to use them to leverage dataframe operations, how to combine them into a full DataOps plan, how to do simple hyperparameter tuning, and finally how to export the plan as a Learner.
The credit fraud dataset#
This dataset originates from an e-commerce website and is structured into two tables:
The “baskets” table contains order IDs, each representing a list of purchased products. For a subset of these orders (the training set), a flag indicates whether the order was fraudulent. This fraud flag is the target variable we aim to predict during inference.
The “products” table provides the detailed contents of all baskets, including those without a known fraud label.
The baskets table contains a basket ID and a flag indicating whether the order was fraudulent.
We start by loading the baskets table and exploring it with the TableReport.
import skrub.datasets
from skrub import TableReport
dataset = skrub.datasets.fetch_credit_fraud() # load labeled data
TableReport(dataset.baskets)
|        | ID     | fraud_flag |
|--------|--------|------------|
| 1      | 51,113 | 0 |
| 7      | 41,798 | 0 |
| 9      | 39,361 | 0 |
| 13     | 38,615 | 0 |
| 14     | 70,262 | 0 |
| …      | …      | … |
| 92,785 | 21,243 | 0 |
| 92,786 | 45,891 | 0 |
| 92,787 | 42,613 | 0 |
| 92,788 | 43,567 | 0 |
| 92,789 | 68,268 | 0 |
The report summarizes the two columns: "ID" (Int64, 61,241 unique values, no missing values, range 0–76,543) and "fraud_flag" (Int64, binary, no missing values, mean 0.0130 — about 1.3% of the labeled baskets are fraudulent). The association between the two columns is negligible (Cramér's V 0.0694).
We then load the products
table, which contains one row per purchased product.
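The report below can be produced in the same way as for the baskets (a small sketch; the exact call is not shown on this page):

TableReport(dataset.products)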
|         | basket_ID | item | cash_price | make | model | goods_code | Nbr_of_prod_purchas |
|---------|-----------|------|------------|------|-------|------------|---------------------|
| 1       | 51,113 | COMPUTER PERIPHERALS ACCESSORIES | 409 | APPLE | APPLE WATCH SERIES 6 GPS 44MM SPACE GREY ALUMINIUM | 239001518 | 1 |
| 9       | 41,798 | COMPUTERS | 1,187 | APPLE | 2020 APPLE MACBOOK PRO 13 TOUCH BAR M1 PROCESSOR 8 | 239246780 | 1 |
| 11      | 39,361 | COMPUTERS | 898 | APPLE | 2020 APPLE MACBOOK AIR 13 3 RETINA DISPLAY M1 PROC | 239246776 | 1 |
| 15      | 38,615 | COMPUTER PERIPHERALS ACCESSORIES | 379 | APPLE | APPLE WATCH SERIES 6 GPS 40MM BLUE ALUMINIUM CASE | 239001540 | 1 |
| 16      | 70,262 | COMPUTERS | 1,899 | APPLE | 2021 APPLE MACBOOK PRO 14 M1 PRO PROCESSOR 16GB RA | 240575990 | 1 |
| …       | … | … | … | … | … | … | … |
| 163,352 | 42,613 | BEDROOM FURNITURE | 259 | SILENTNIGHT | SILENTNIGHT SLEEP GENIUS FULL HEIGHT HEADBOARD DOU | 236938439 | 1 |
| 163,353 | 42,613 | OUTDOOR FURNITURE | 949 | LG OUTDOOR | LG OUTDOOR BERGEN 2-SEAT GARDEN SIDE TABLE RECLINI | 239742814 | 1 |
| 163,354 | 43,567 | COMPUTERS | 1,099 | APPLE | 2021 APPLE IPAD PRO 12 9 M1 PROCESSOR IOS WI-FI 25 | 240040978 | 1 |
| 163,355 | 43,567 | COMPUTERS | 2,099 | APPLE | 2020 APPLE IMAC 27 ALL-IN-ONE INTEL CORE I7 8GB RA | 238923518 | 1 |
| 163,356 | 68,268 | TELEVISIONS HOME CINEMA | 799 | LG | LG OLED48A16LA 2021 OLED HDR 4K ULTRA HD SMART TV | 239866717 | 1 |
The report summarizes the columns: "basket_ID" (Int64, 61,241 unique values — several product rows can share the same basket), "cash_price" (Int64, range 0–18,349) and "Nbr_of_prod_purchas" (Int64, range 1–40, typically 1), plus four high-cardinality string columns: "item" (166 unique values, most frequently COMPUTERS), "make" (690 unique values, most frequently APPLE), "model" (6,477 unique values) and "goods_code" (10,738 unique values). "make" and "model" each have about 1.2% missing values. The strongest associations are between "model" and "goods_code", "item" and "make", and "item" and "model".
Each basket contains at least one product, and products can
be associated with the corresponding basket through the "basket_ID"
column.
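As a quick, purely illustrative check (not part of the original example), plain pandas can confirm that every labeled basket has matching product rows:

dataset.baskets["ID"].isin(dataset.products["basket_ID"]).all()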
A design problem: how to combine tables while avoiding leakage?#
We want to fit a HistGradientBoostingClassifier to predict the fraud flag (our target y). To do so, we build a design matrix in which each row corresponds to a basket, and we want to add features from the products table to each basket.
The general structure of the pipeline looks like this:
First, as the products table contains strings and categories (such as "SAMSUNG"), we vectorize those entries to extract numeric features. This is easily done with skrub’s TableVectorizer. Then, since each basket can contain several products, we want to aggregate all the rows in products that correspond to a single basket into a single vector that can then be attached to the basket.
The difficulty is that the vectorized products should be aggregated before joining to baskets and, in order to compute a meaningful aggregation, must be vectorized before the aggregation. Thus, we have a TableVectorizer to fit on a table which does not (yet) have the same number of rows as the target y — something that the scikit-learn Pipeline, with its single-input, linear structure, does not allow.
We can fit it ourselves, outside of any pipeline with something like:
vectorizer = skrub.TableVectorizer()
vectorized_products = vectorizer.fit_transform(dataset.products)
However, because it is dissociated from the main estimator which handles X (the baskets), we have to manage this transformer ourselves. We lose the scikit-learn machinery for grouping all transformation steps, storing fitted estimators, cross-validating the data, and tuning hyperparameters.
Moreover, we might need some pandas code to perform the aggregation and join. Again, as this transformation is not in a scikit-learn estimator, it is error-prone: we have to keep track of it ourselves to apply it later to unseen data, and we cannot tune any of its choices (like the choice of the aggregation function).
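To make this concrete, continuing the snippet above by hand could look like the sketch below (illustrative only; names such as aggregated, X and y are ours, and this is not the approach the rest of the example takes):

# aggregate the vectorized product rows of each basket ("basket_ID" is
# numeric, so the vectorizer keeps its values and we can still group on it)
aggregated = vectorized_products.groupby("basket_ID").mean().reset_index()
# join the aggregated product features onto the baskets table
X = dataset.baskets[["ID"]].merge(aggregated, left_on="ID", right_on="basket_ID")
y = dataset.baskets["fraud_flag"]
# at prediction time we must remember to re-apply exactly the same
# vectorizer, aggregation and join, and to keep X's rows aligned with y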
Fortunately, skrub provides an alternative way to build more flexible pipelines.
DataOps make DataOps plans#
In a skrub DataOps plan, we do not have an explicit, sequential list of transformation steps. Instead, we perform “Data Operations” (or “DataOps”): operations that act on variables and wrap user operations to keep track of their parameters.
User operations could be dataframe operations (selection, merge, group by, etc.), scikit-learn estimators (such as a RandomForest with its hyperparameters), or arbitrary code (for loading data, converting values, etc.).
As we perform operations on skrub variables, the plan records each DataOp and its parameters. This record can later be synthesized into a standalone object called a “learner”, which can replay these operations on unseen data, ensuring that the same operations and parameters are used.
In a DataOps plan, we manipulate skrub objects representing intermediate results. The plan is built implicitly as we perform operations (such as applying operators or calling functions) on those objects.
We start by creating skrub variables, which are the inputs to our plan. In our example, we create three variables: “products”, “baskets”, and “fraud flags”:
products = skrub.var("products", dataset.products)
full_baskets = skrub.var("baskets", dataset.baskets)
baskets = full_baskets[["ID"]].skb.mark_as_X()
fraud_flags = full_baskets["fraud_flag"].skb.mark_as_y()
Variables are given a name and an (optional) initial value, which is used to show previews of the result of each DataOp, detect errors early, and provide data for cross-validation and hyperparameter search.
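As an aside (purely illustrative, not part of this example’s plan), the initial value is indeed optional: a variable can be declared with just a name, in which case no previews are computed and the corresponding data must be supplied later, when fitting the learner or asking for predictions:

# hypothetical variable declared without an initial value
orders = skrub.var("orders")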
Then, the plan is built by applying DataOps to those variables, that is, by performing user operations that have been wrapped in a DataOp.
Above, mark_as_X() and mark_as_y() indicate that the baskets and the fraud flags are respectively our design matrix and target variable, which should be split into training and testing sets for cross-validation. Here, they are direct inputs to the plan, but any intermediate result could be marked as X or y.
By setting products, baskets and fraud_flags as skrub variables, we can manipulate those objects as if they were dataframes, while keeping track of all the operations that are performed on them. Additionally, skrub variables and DataOps provide their own set of methods, accessible through the skb attribute of any skrub variable or DataOp.
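For example (a small sketch; we assume the .skb.eval() method available on skrub DataOps), regular dataframe syntax keeps producing DataOps, while DataOps-specific functionality lives under .skb:

n_baskets = baskets.shape[0]  # attribute access is deferred: still a DataOp
n_baskets.skb.eval()          # evaluates the DataOp on the preview data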
For instance, we can filter products to keep only those that match one of the
baskets in the baskets
table, and then add a column containing the total
amount for each kind of product in a basket:
kept_products = products[products["basket_ID"].isin(baskets["ID"])]
products_with_total = kept_products.assign(
total_price=kept_products["Nbr_of_prod_purchas"] * kept_products["cash_price"]
)
products_with_total
|         | basket_ID | item | cash_price | make | model | goods_code | Nbr_of_prod_purchas | total_price |
|---------|-----------|------|------------|------|-------|------------|---------------------|-------------|
| 1       | 51,113 | COMPUTER PERIPHERALS ACCESSORIES | 409 | APPLE | APPLE WATCH SERIES 6 GPS 44MM SPACE GREY ALUMINIUM | 239001518 | 1 | 409 |
| 9       | 41,798 | COMPUTERS | 1,187 | APPLE | 2020 APPLE MACBOOK PRO 13 TOUCH BAR M1 PROCESSOR 8 | 239246780 | 1 | 1,187 |
| 11      | 39,361 | COMPUTERS | 898 | APPLE | 2020 APPLE MACBOOK AIR 13 3 RETINA DISPLAY M1 PROC | 239246776 | 1 | 898 |
| 15      | 38,615 | COMPUTER PERIPHERALS ACCESSORIES | 379 | APPLE | APPLE WATCH SERIES 6 GPS 40MM BLUE ALUMINIUM CASE | 239001540 | 1 | 379 |
| 16      | 70,262 | COMPUTERS | 1,899 | APPLE | 2021 APPLE MACBOOK PRO 14 M1 PRO PROCESSOR 16GB RA | 240575990 | 1 | 1,899 |
| …       | … | … | … | … | … | … | … | … |
| 163,352 | 42,613 | BEDROOM FURNITURE | 259 | SILENTNIGHT | SILENTNIGHT SLEEP GENIUS FULL HEIGHT HEADBOARD DOU | 236938439 | 1 | 259 |
| 163,353 | 42,613 | OUTDOOR FURNITURE | 949 | LG OUTDOOR | LG OUTDOOR BERGEN 2-SEAT GARDEN SIDE TABLE RECLINI | 239742814 | 1 | 949 |
| 163,354 | 43,567 | COMPUTERS | 1,099 | APPLE | 2021 APPLE IPAD PRO 12 9 M1 PROCESSOR IOS WI-FI 25 | 240040978 | 1 | 1,099 |
| 163,355 | 43,567 | COMPUTERS | 2,099 | APPLE | 2020 APPLE IMAC 27 ALL-IN-ONE INTEL CORE I7 8GB RA | 238923518 | 1 | 2,099 |
| 163,356 | 68,268 | TELEVISIONS HOME CINEMA | 799 | LG | LG OLED48A16LA 2021 OLED HDR 4K ULTRA HD SMART TV | 239866717 | 1 | 799 |
The report shows the same columns as before, plus the new "total_price" column (Int64, no missing values, mean 704, range 0–71,280).
We see previews of the output of intermediate results. For
example, the added "total_price"
column is in the output above.
The “Show graph” dropdown at the top allows us to check the
structure of the DataOps plan and all the DataOps it contains.
Note
We recommend assigning each new skrub DataOp to a new variable name, as is done above: for example kept_products = products[...] instead of reusing the name products = products[...]. This makes it easy to backtrack to any step of the plan and change the subsequent steps, and it avoids ending up in a confusing state in Jupyter notebooks when the same cell might be re-executed several times.
A major advantage of the skrub DataOps plan is that it lets us specify a grid of hyperparameter choices right where each parameter is defined, improving code readability and maintainability. We do so by replacing a parameter’s value with a skrub “choice”, which indicates the range of values to consider during hyperparameter selection.
Skrub choices can be nested arbitrarily. They are not restricted to the parameters of a scikit-learn estimator, but can be anything: a choice between different estimators, arguments to function calls, whole sections of the plan, etc.
In-depth information about choices and hyperparameter/model selection is provided in the Tuning DataOps example.
Here, we build a skrub TableVectorizer with choices for the high-cardinality encoder and for the number of components it uses.
n = skrub.choose_int(5, 15, name="n_components")
encoder = skrub.choose_from(
{
"MinHash": skrub.MinHashEncoder(n_components=n),
"LSA": skrub.StringEncoder(n_components=n),
},
name="encoder",
)
vectorizer = skrub.TableVectorizer(high_cardinality=encoder)
A transformer does not have to apply to the full dataframe; we can restrict it to some columns using the cols or exclude_cols parameters. In our example, we vectorize all columns except "basket_ID".
vectorized_products = products_with_total.skb.apply(
vectorizer, exclude_cols="basket_ID"
)
Having access to the underlying dataframe’s API, we can perform the data wrangling we need. These transformations are implicitly recorded as DataOps in our plan.
aggregated_products = vectorized_products.groupby("basket_ID").agg("mean").reset_index()
augmented_baskets = baskets.merge(
aggregated_products, left_on="ID", right_on="basket_ID"
).drop(columns=["ID", "basket_ID"])
At any point we can also ask for a full report of the plan and inspect the result of each DataOp; for example, with the final predictions DataOp defined below:
predictions.skb.full_report()
This produces a folder on disk rather than displaying inline in a notebook, so we do not run it here; the rendered report is available in the online version of this example.
Finally, we add a supervised estimator:
from sklearn.ensemble import HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(
learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="learning_rate")
)
predictions = augmented_baskets.skb.apply(hgb, y=fraud_flags)
predictions
|        | fraud_flag |
|--------|------------|
| 0      | 0 |
| 1      | 0 |
| 2      | 0 |
| 3      | 0 |
| 4      | 0 |
| …      | … |
| 61,236 | 0 |
| 61,237 | 0 |
| 61,238 | 0 |
| 61,239 | 0 |
| 61,240 | 0 |
The report shows the predicted "fraud_flag" column (Int64, binary, mean 0.00292 on the preview data).
And our DataOps plan is complete!
From the choices we inserted at different locations in our plan, skrub can build a grid of hyperparameters and run the hyperparameter search for us, backed by scikit-learn’s GridSearchCV or RandomizedSearchCV.
print(predictions.skb.describe_param_grid())
- learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
  encoder: 'MinHash'
  n_components: choose_int(5, 15, name='n_components')
- learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
  encoder: 'LSA'
  n_components: choose_int(5, 15, name='n_components')
search = predictions.skb.make_randomized_search(
scoring="roc_auc", n_iter=8, n_jobs=4, random_state=0, fitted=True
)
search.results_
|   | n_components | encoder | learning_rate | mean_test_score |
|---|--------------|---------|---------------|-----------------|
| 0 | 13 | LSA     | 0.067286 | 0.885875 |
| 1 | 14 | LSA     | 0.052436 | 0.885407 |
| 2 | 10 | LSA     | 0.038147 | 0.883859 |
| 3 | 17 | MinHash | 0.086695 | 0.880904 |
| 4 | 19 | MinHash | 0.056148 | 0.880410 |
| 5 | 18 | LSA     | 0.013766 | 0.876848 |
| 6 | 13 | MinHash | 0.446581 | 0.747957 |
| 7 | 16 | LSA     | 0.501435 | 0.712538 |
We can also run a cross-validation, using the first choices defined in the choose objects:
predictions.skb.cross_validate(scoring="roc_auc", verbose=1, n_jobs=4)
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 out of 5 | elapsed: 16.0s finished
|   | fit_time | score_time | test_score |
|---|----------|------------|------------|
| 0 | 8.872665 | 2.215057 | 0.878377 |
| 1 | 8.355339 | 2.792936 | 0.868375 |
| 2 | 9.008083 | 2.196138 | 0.869367 |
| 3 | 9.828978 | 1.783087 | 0.871620 |
| 4 | 3.330792 | 0.882430 | 0.909514 |
We can also display a parallel coordinates plot of the results.
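The plot is obtained from the search object; assuming the plot_results() helper of skrub’s parameter search, the call would look like this:

# interactive parallel-coordinates plot of the search results
search.plot_results()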
In a parallel coordinates plot, each line corresponds to one combination of hyperparameter (choice) values, followed by the corresponding test score and the training and scoring durations; each of these quantities appears in its own column.
By clicking and dragging the mouse on any column, we can restrict the set of lines we see. This allows quickly inspecting which hyperparameters are important, which values perform best, and potential trade-offs between the quality of predictions and computation time.
It seems here that using the LSA encoder brings better test scores, but at the expense of training and scoring time.
From the DataOps plan to the learner#
The learner is a standalone object that can replay all the DataOps recorded in the plan, and can be used to make predictions on new, unseen data. The learner can be saved and loaded, allowing us to use it later without having to rebuild the plan. We would usually save the learner in a binary file, but to avoid accessing the filesystem with this example notebook, we serialize the learner in memory instead.
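For reference, persisting the learner to an actual file is the usual route in practice; a minimal sketch (the file name is ours):

import pickle

# save the fitted learner to disk...
with open("fraud_learner.pkl", "wb") as f:
    pickle.dump(search.best_learner_, f)

# ...and load it back, possibly in another environment
with open("fraud_learner.pkl", "rb") as f:
    reloaded_learner = pickle.load(f)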
import pickle
saved_model = pickle.dumps(search.best_learner_)
Let’s say we got some new data, and we want to use the learner we just saved to make predictions on it:
new_data = skrub.datasets.fetch_credit_fraud(split="test")
new_baskets = new_data.baskets[["ID"]]
new_products = new_data.products
Our learner expects the same variable names as the training plan, which is why we pass a dictionary that maps those names ("baskets" and "products") to the new dataframes:
loaded_model = pickle.loads(saved_model)
loaded_model.predict({"baskets": new_baskets, "products": new_products})
array([0, 0, 0, ..., 0, 0, 0], shape=(31549,))
Conclusion#
If you are curious to learn more about building your own complex, multi-table plans with easy hyperparameter tuning, and about turning them into reusable learners, see the next examples for an in-depth tutorial.
Indeed, the following examples will explain how to tune hyperparameters in detail (see Tuning DataOps), and how to speed up development by subsampling preview data (see Subsampling for faster development).
Total running time of the script: (4 minutes 15.992 seconds)