.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/10_expressions_intro.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_10_expressions_intro.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_10_expressions_intro.py:

.. currentmodule:: skrub

.. _example_expressions_intro:

Building a predictive model by combining multiple tables with the skrub DataOps
=================================================================================

This example introduces the skrub DataOps, which let us build complex data
processing pipelines that handle multiple tables, with hyperparameter tuning
and model selection. DataOps implicitly form a "plan", which records all the
operations performed on the data; the plan can be exported as a ``Learner``,
a standalone object that can be saved on disk, loaded in a new environment,
and used to make predictions on new data.

Here we show the basics of the skrub DataOps in a two-table scenario: how to
create DataOps, how to use them to leverage dataframe operations, how to
combine them into a full DataOps plan, how to do simple hyperparameter
tuning, and finally how to export the plan as a ``Learner``.

.. GENERATED FROM PYTHON SOURCE LINES 26-42

The credit fraud dataset
------------------------

This dataset originates from an e-commerce website and is structured into two
tables:

- The "baskets" table contains order IDs, each representing a list of
  purchased products. For a subset of these orders (the training set), a flag
  indicates whether the order was fraudulent. This fraud flag is the target
  variable we aim to predict at inference time.
- The "products" table provides the detailed contents of all baskets,
  including those without a known fraud label.

The ``baskets`` table thus associates each basket ID with a flag indicating
whether the order was fraudulent. We start by loading it and exploring it
with the ``TableReport``.

.. GENERATED FROM PYTHON SOURCE LINES 44-50

.. code-block:: Python

    import skrub.datasets
    from skrub import TableReport

    dataset = skrub.datasets.fetch_credit_fraud()  # load labeled data
    TableReport(dataset.baskets)

.. (an interactive ``TableReport`` of the ``baskets`` table is rendered here)



.. GENERATED FROM PYTHON SOURCE LINES 51-52

We then load the ``products`` table, which contains one row per purchased
product.

.. GENERATED FROM PYTHON SOURCE LINES 54-56

.. code-block:: Python

    TableReport(dataset.products)

.. (an interactive ``TableReport`` of the ``products`` table is rendered here)

.. GENERATED FROM PYTHON SOURCE LINES 57-107

Each basket contains at least one product, and each product row can be
associated with its basket through the ``"basket_ID"`` column.



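We can check this link directly with plain pandas; a quick sketch (not part
of the original example):

.. code-block:: Python

    # Count the product rows whose basket appears in the labeled "baskets"
    # table; the remaining rows belong to baskets without a fraud label.
    dataset.products["basket_ID"].isin(dataset.baskets["ID"]).sum()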
A design problem: how to combine tables while avoiding leakage?
----------------------------------------------------------------

We want to fit a ``HistGradientBoostingClassifier`` to predict the fraud
flag (our ``y``). To do so, we build a design matrix where each row
corresponds to a basket, and we want to add features from the ``products``
table to each basket.

The general structure of the pipeline looks like this:

.. image:: ../../_static/credit_fraud_diagram.svg
    :width: 300

First, as the ``products`` table contains strings and categories (such as
``"SAMSUNG"``), we vectorize those entries to extract numeric features. This
is easily done with skrub's ``TableVectorizer``. Then, since each basket can
contain several products, we want to aggregate all the rows in ``products``
that correspond to a single basket into a single vector that can then be
attached to the basket.

The difficulty is that ``products`` must be vectorized *before* the
aggregation (so that the aggregation is meaningful), and aggregated before
joining to ``baskets``. Thus, we have a ``TableVectorizer`` to fit on a
table which does not (yet) have the same number of rows as the target
``y``, something that the scikit-learn ``Pipeline``, with its single-input,
linear structure, does not allow.

We can fit it ourselves, outside of any pipeline, with something like::

    vectorizer = skrub.TableVectorizer()
    vectorized_products = vectorizer.fit_transform(dataset.products)

However, because this transformer is dissociated from the main estimator
that handles ``X`` (the baskets), we have to manage it ourselves. We lose
the scikit-learn machinery for grouping all transformation steps, storing
fitted estimators, splitting the data during cross-validation, and tuning
hyper-parameters. Moreover, we need some pandas code to perform the
aggregation and the join. Again, as this transformation is not wrapped in a
scikit-learn estimator, it is error-prone: we have to keep track of it
ourselves to apply it later to unseen data, and we cannot tune any choices
(like the choice of the aggregation function).
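To make that bookkeeping concrete, here is a minimal sketch of the manual
approach with plain pandas, continuing the snippet above (the variable
names are illustrative; the aggregation and join mirror what we build with
skrub below):

.. code-block:: Python

    # Vectorize everything except the join key, then put the key back.
    vectorizer = skrub.TableVectorizer()
    vectorized = vectorizer.fit_transform(
        dataset.products.drop(columns=["basket_ID"])
    )
    vectorized["basket_ID"] = dataset.products["basket_ID"].values

    # Aggregate each basket's product rows, then join onto the baskets.
    aggregated = vectorized.groupby("basket_ID").mean().reset_index()
    joined = dataset.baskets.merge(
        aggregated, left_on="ID", right_on="basket_ID"
    )
    X = joined.drop(columns=["ID", "basket_ID", "fraud_flag"])
    y = joined["fraud_flag"]

Every step here (the fitted ``vectorizer``, the aggregation function, the
join keys) must be stored and replayed by hand on any new data.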
Fortunately, skrub provides an alternative way to build more flexible
pipelines.

.. GENERATED FROM PYTHON SOURCE LINES 109-129

DataOps make DataOps plans
--------------------------

In a skrub DataOps plan, we do not have an explicit, sequential list of
transformation steps. Instead, we perform "Data Operations" (or "DataOps"):
operations that act on variables and wrap user operations to keep track of
their parameters. User operations can be dataframe operations (selection,
merge, group by, etc.), scikit-learn estimators (such as a RandomForest
with its hyperparameters), or arbitrary code (for loading data, converting
values, etc.).

As we perform operations on skrub variables, the plan records each DataOp
and its parameters. This record can later be synthesized into a standalone
object called a "learner", which can replay these operations on unseen
data, ensuring that the same operations and parameters are used.

In a DataOps plan, we manipulate skrub objects representing intermediate
results. The plan is built implicitly as we perform operations (such as
applying operators or calling functions) on those objects.
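To make this concrete, here is a tiny sketch with toy values (separate from
the fraud example): operations on variables are recorded in the plan, and
their result is previewed on the example values we supplied.

.. code-block:: Python

    a = skrub.var("a", 2)  # a variable with a name and an example value
    b = skrub.var("b", 3)
    c = a + b  # a DataOp whose preview evaluates to 5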
.. GENERATED FROM PYTHON SOURCE LINES 131-133

We start by creating skrub variables, which are the inputs to our plan. In
our example, we create three variables: "products", "baskets", and "fraud
flags":

.. GENERATED FROM PYTHON SOURCE LINES 135-141

.. code-block:: Python

    products = skrub.var("products", dataset.products)
    full_baskets = skrub.var("baskets", dataset.baskets)

    baskets = full_baskets[["ID"]].skb.mark_as_X()
    fraud_flags = full_baskets["fraud_flag"].skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 142-165

Variables are given a name and an (optional) initial value, which is used
to show previews of the result of each DataOp, detect errors early, and
provide data for cross-validation and hyperparameter search. The plan is
then built by applying DataOps to those variables, that is, by performing
user operations that have been wrapped in a DataOp.

Above, ``mark_as_X()`` and ``mark_as_y()`` indicate that the baskets and
the fraud flags are respectively our design matrix and target variable,
which should be split into training and testing sets for cross-validation.
Here, they are direct inputs to the plan, but any intermediate result could
be marked as X or y.

By setting ``products``, ``baskets`` and ``fraud_flags`` as skrub
variables, we can manipulate those objects as if they were dataframes,
while keeping track of all the operations that are performed on them.
Additionally, skrub variables and DataOps provide their own set of methods,
accessible through the ``skb`` attribute of any skrub variable or DataOp.

For instance, we can filter products to keep only those that match one of
the baskets in the ``baskets`` table, and then add a column containing the
total amount for each kind of product in a basket:

.. GENERATED FROM PYTHON SOURCE LINES 167-173

.. code-block:: Python

    kept_products = products[products["basket_ID"].isin(baskets["ID"])]
    products_with_total = kept_products.assign(
        total_price=kept_products["Nbr_of_prod_purchas"] * kept_products["cash_price"]
    )
    products_with_total

.. (preview of ``products_with_total``: the computation graph of the plan and an interactive table report of the result are rendered here)



.. GENERATED FROM PYTHON SOURCE LINES 174-205

We see previews of the output of intermediate results. For example, the
added ``"total_price"`` column appears in the output above. The "Show
graph" dropdown at the top lets us check the structure of the DataOps plan
and all the DataOps it contains.

.. note::

   We recommend assigning each new skrub DataOp to a new variable name, as
   is done above. For example, ``kept_products = products[...]`` instead of
   reusing the name ``products = products[...]``. This makes it easy to
   backtrack to any step of the plan and change the subsequent steps, and
   avoids ending up in a confusing state in Jupyter notebooks when the same
   cell may be re-executed several times.

A major advantage of the skrub DataOps plan is that it lets us specify a
grid of hyperparameter choices right where each parameter is defined,
improving code readability and maintainability. We do so by replacing a
parameter's value with a skrub "choice", which indicates the range of
values to consider during hyperparameter selection.

Skrub choices can be nested arbitrarily. They are not restricted to the
parameters of a scikit-learn estimator, but can be anything: the choice
between different estimators, arguments to function calls, whole sections
of the plan, etc. In-depth information about choices and
hyperparameter/model selection is provided in the
:ref:`Tuning Data Plans example <example_tuning_pipelines>`.

Here, we build a skrub ``TableVectorizer`` with choices for the
high-cardinality encoder and for the number of components it uses.

.. GENERATED FROM PYTHON SOURCE LINES 207-217

.. code-block:: Python

    n = skrub.choose_int(5, 15, name="n_components")
    encoder = skrub.choose_from(
        {
            "MinHash": skrub.MinHashEncoder(n_components=n),
            "LSA": skrub.StringEncoder(n_components=n),
        },
        name="encoder",
    )
    vectorizer = skrub.TableVectorizer(high_cardinality=encoder)

.. GENERATED FROM PYTHON SOURCE LINES 218-221

A transformer does not have to apply to the full dataframe; we can restrict
it to some columns, using the ``cols`` or ``exclude_cols`` parameters. In
our example, we vectorize all columns except ``"basket_ID"``.

.. GENERATED FROM PYTHON SOURCE LINES 223-227

.. code-block:: Python

    vectorized_products = products_with_total.skb.apply(
        vectorizer, exclude_cols="basket_ID"
    )

.. GENERATED FROM PYTHON SOURCE LINES 228-231

Having access to the underlying dataframe's API, we can perform the
data-wrangling we need. Those transformations are implicitly recorded as
DataOps in our plan.

.. GENERATED FROM PYTHON SOURCE LINES 233-238

.. code-block:: Python

    aggregated_products = (
        vectorized_products.groupby("basket_ID").agg("mean").reset_index()
    )
    augmented_baskets = baskets.merge(
        aggregated_products, left_on="ID", right_on="basket_ID"
    ).drop(columns=["ID", "basket_ID"])
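Because choices are not restricted to estimator hyperparameters, even the
aggregation function used above could be made tunable, as mentioned
earlier. A sketch (the ``"agg function"`` choice name is illustrative; we
keep the plain ``"mean"`` version in the rest of this example):

.. code-block:: Python

    # The choice object stands in for the ``agg`` argument and would be
    # added to the hyperparameter grid, as described above.
    agg_function = skrub.choose_from(["mean", "max"], name="agg function")
    tunable_aggregation = (
        vectorized_products.groupby("basket_ID").agg(agg_function).reset_index()
    )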
.. GENERATED FROM PYTHON SOURCE LINES 239-249

We can ask for a full report of the plan and inspect the results of each
DataOp::

    predictions.skb.full_report()

This produces a folder on disk rather than displaying inline in a notebook,
so we do not run it here. But you can
`see the output here <../../_static/credit_fraud_report/index.html>`_.

Finally, we add a supervised estimator:

.. GENERATED FROM PYTHON SOURCE LINES 251-259

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingClassifier

    hgb = HistGradientBoostingClassifier(
        learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="learning_rate")
    )
    predictions = augmented_baskets.skb.apply(hgb, y=fraud_flags)
    predictions

.. (preview of ``predictions``: the full computation graph of the plan and an interactive report of the result are rendered here)



.. GENERATED FROM PYTHON SOURCE LINES 260-265

And our DataOps plan is complete! From the choices we inserted at different
locations in our plan, skrub can build a grid of hyperparameters and run
the hyperparameter search for us, backed by scikit-learn's
``GridSearchCV`` or ``RandomizedSearchCV``.

.. GENERATED FROM PYTHON SOURCE LINES 267-269

.. code-block:: Python

    print(predictions.skb.describe_param_grid())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    - learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
      encoder: 'MinHash'
      n_components: choose_int(5, 15, name='n_components')
    - learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
      encoder: 'LSA'
      n_components: choose_int(5, 15, name='n_components')

.. GENERATED FROM PYTHON SOURCE LINES 270-275

.. code-block:: Python

    search = predictions.skb.make_randomized_search(
        scoring="roc_auc", n_iter=8, n_jobs=4, random_state=0, fitted=True
    )
    search.results_
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

       n_components  encoder  learning_rate  mean_test_score
    0            13      LSA       0.067286         0.885289
    1            10      LSA       0.038147         0.885147
    2            14      LSA       0.052436         0.882626
    3            17  MinHash       0.086695         0.882019
    4            19  MinHash       0.056148         0.881646
    5            18      LSA       0.013766         0.874913
    6            13  MinHash       0.446581         0.756478
    7            16      LSA       0.501435         0.712367
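``results_`` is a regular dataframe, here sorted by decreasing
``mean_test_score``, so standard pandas operations apply; for instance, a
small sketch to pull out the best combination:

.. code-block:: Python

    # Row with the highest mean ROC AUC across the search iterations.
    best_row = search.results_.sort_values(
        "mean_test_score", ascending=False
    ).iloc[0]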


.. GENERATED FROM PYTHON SOURCE LINES 276-278

We can also run a cross-validation, using the first value defined in each
of the ``choose`` objects:

.. GENERATED FROM PYTHON SOURCE LINES 278-281

.. code-block:: Python

    predictions.skb.cross_validate(scoring="roc_auc", verbose=1, n_jobs=4)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
    [Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   17.1s finished
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

        fit_time  score_time  test_score
    0  11.007291    1.960853    0.878432
    1   9.180779    2.794137    0.869466
    2  10.076163    2.418898    0.859110
    3   9.606173    2.697500    0.876415
    4   3.310782    0.901558    0.907842
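The cross-validation results also come back as a dataframe, so we can
summarize them directly; a small sketch:

.. code-block:: Python

    # Average the ROC AUC over the five folds.
    cv_results = predictions.skb.cross_validate(scoring="roc_auc", n_jobs=4)
    print(cv_results["test_score"].mean())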


.. GENERATED FROM PYTHON SOURCE LINES 282-293

We can also display a parallel coordinates plot of the results. In a
parallel coordinates plot, each line corresponds to a combination of
hyperparameter (choices) values, followed by the corresponding test score
and the training and scoring durations. Different columns show the
hyperparameter values.

By **clicking and dragging the mouse** on any column, we can restrict the
set of lines we see. This makes it quick to inspect which hyperparameters
are important, which values perform best, and potential trade-offs between
the quality of predictions and computation time.

.. GENERATED FROM PYTHON SOURCE LINES 295-297

.. code-block:: Python

    search.plot_results()

.. (an interactive parallel coordinates plot of the search results is rendered here)


.. GENERATED FROM PYTHON SOURCE LINES 298-309

Here, it seems that using LSA as the encoder brings better test scores, but
at the expense of training and scoring time.

From the DataOps plan to the learner
------------------------------------

The learner is a standalone object that can replay all the DataOps recorded
in the plan, and can be used to make predictions on new, unseen data. It
can be saved and loaded, allowing us to use it later without having to
rebuild the plan. We would usually save the learner in a binary file, but
to avoid accessing the filesystem in this example notebook, we serialize
the learner in memory instead.

.. GENERATED FROM PYTHON SOURCE LINES 309-313

.. code-block:: Python

    import pickle

    saved_model = pickle.dumps(search.best_learner_)

.. GENERATED FROM PYTHON SOURCE LINES 314-316

Let's say we got some new data, and we want to use the learner we just
saved to make predictions on it:

.. GENERATED FROM PYTHON SOURCE LINES 316-320

.. code-block:: Python

    new_data = skrub.datasets.fetch_credit_fraud(split="test")
    new_baskets = new_data.baskets[["ID"]]
    new_products = new_data.products

.. GENERATED FROM PYTHON SOURCE LINES 321-323

Our learner expects the same variable names as the training plan, which is
why we pass a dictionary mapping each variable name to the corresponding
new dataframe:

.. GENERATED FROM PYTHON SOURCE LINES 323-326

.. code-block:: Python

    loaded_model = pickle.loads(saved_model)
    loaded_model.predict({"baskets": new_baskets, "products": new_products})

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0, 0, 0, ..., 0, 0, 0], shape=(31549,))

.. GENERATED FROM PYTHON SOURCE LINES 327-337

Conclusion
----------

If you are curious to know more about building your own complex,
multi-table plans with easy hyperparameter tuning, and about turning them
into reusable learners, see the next examples for an in-depth tutorial:
:ref:`example_tuning_pipelines` explains hyperparameter tuning in detail,
and :ref:`example_subsampling` shows how to speed up development by
subsampling preview data.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (4 minutes 29.200 seconds)

.. _sphx_glr_download_auto_examples_data_ops_10_expressions_intro.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/data_ops/10_expressions_intro.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/data_ops/10_expressions_intro.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 10_expressions_intro.ipynb <10_expressions_intro.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 10_expressions_intro.py <10_expressions_intro.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 10_expressions_intro.zip <10_expressions_intro.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_