.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/11_multiple_tables.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_data_ops_11_multiple_tables.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_11_multiple_tables.py:

Multiple tables: building machine learning pipelines with DataOps
==================================================================

In this example, we show how to build a DataOps plan that handles the
pre-processing, validation and hyperparameter tuning of a dataset with
**multiple tables**.

We consider the credit fraud dataset, which contains two tables: one for
baskets (orders) and one for products. The goal is to predict whether a
basket (a single order placed on the website) is fraudulent or not, based on
the products it contains.

.. currentmodule:: skrub

.. |choose_from| replace:: :func:`skrub.choose_from`
.. |choose_int| replace:: :func:`skrub.choose_int`
.. |choose_float| replace:: :func:`skrub.choose_float`
.. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder`
.. |StringEncoder| replace:: :class:`~skrub.StringEncoder`
.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
.. |var| replace:: :func:`skrub.var`
.. |TableReport| replace:: :class:`~skrub.TableReport`
.. |HistGradientBoostingClassifier| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier`
.. |make_randomized_search| replace:: :func:`~skrub.DataOp.skb.make_randomized_search`

.. GENERATED FROM PYTHON SOURCE LINES 33-38

The credit fraud dataset
------------------------

The ``baskets`` table contains a basket ID and a flag indicating whether the
order was fraudulent or not (the customer never made the payment).

.. GENERATED FROM PYTHON SOURCE LINES 40-46

.. code-block:: Python

    import skrub
    import skrub.datasets

    dataset = skrub.datasets.fetch_credit_fraud()
    skrub.TableReport(dataset.baskets)
(Output: an interactive |TableReport| of the ``baskets`` table, not rendered
here.)
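Fraud detection problems are usually highly imbalanced: only a small fraction
of orders are fraudulent. As a quick aside (not part of the original
example), we can check the class balance of the target column directly on the
baskets table:

.. code-block:: Python

    # Count fraudulent (1) vs. legitimate (0) baskets; in a fraud-detection
    # dataset we expect the legitimate class to dominate.
    print(dataset.baskets["fraud_flag"].value_counts())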
.. GENERATED FROM PYTHON SOURCE LINES 47-51

The ``products`` table contains information about the products that have been
purchased, and the basket they belong to. A basket contains at least one
product. Products can be associated with the corresponding basket through the
``"basket_ID"`` column.

.. GENERATED FROM PYTHON SOURCE LINES 53-55

.. code-block:: Python

    skrub.TableReport(dataset.products)
(Output: an interactive |TableReport| of the ``products`` table, not rendered
here.)
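Note that the two tables have different numbers of rows: there is one row per
product but only one label per basket, so ``products`` has several rows for
each row of ``baskets``. A quick way to see this mismatch, which is at the
heart of the challenge discussed next:

.. code-block:: Python

    # The products table is much longer than the baskets table, since a
    # basket contains one or more products.
    print(dataset.baskets.shape, dataset.products.shape)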
.. GENERATED FROM PYTHON SOURCE LINES 56-82

A data-processing challenge
---------------------------

The general structure of the DataOps plan we want to build looks like this:

.. image:: ../../_static/credit_fraud_diagram.svg
    :width: 300

We want to fit a |HistGradientBoostingClassifier| to predict the fraud flag
(y). However, since the features for each basket are stored in the products
table, we need to extract these features, aggregate them at the basket level,
and merge the result with the basket data.

We can use the |TableVectorizer| to vectorize the products, but we then need
to aggregate the resulting vectors to obtain a single row per basket. Using a
scikit-learn Pipeline is tricky because the |TableVectorizer| would be fitted
on a table with a different number of rows than the target y (the baskets
table), which scikit-learn does not allow.

While we could fit the |TableVectorizer| manually, this would forfeit
scikit-learn's tooling for managing transformations, storing fitted
estimators, splitting data, cross-validation, and hyperparameter tuning. We
would also have to handle the aggregation and join ourselves, likely with
error-prone pandas code.

Fortunately, skrub DataOps provide a powerful alternative for building
flexible plans that address these problems.

.. GENERATED FROM PYTHON SOURCE LINES 84-89

Building a multi-table DataOps plan
-----------------------------------

We start by creating skrub variables, which are the inputs to our plan. In
our example, we create two skrub |var| objects, ``products`` and ``baskets``:

.. GENERATED FROM PYTHON SOURCE LINES 91-97

.. code-block:: Python

    products = skrub.var("products", dataset.products)
    baskets = skrub.var("baskets", dataset.baskets)

    basket_ids = baskets[["ID"]].skb.mark_as_X()
    fraud_flags = baskets["fraud_flag"].skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 98-107

We mark ``basket_ids`` as ``X`` and ``fraud_flags`` as ``y`` so that DataOps
can use their indices for train-test splitting and cross-validation.

We then build the plan by applying transformations to those inputs. Since our
DataOps wrap dataframes for products, baskets and fraud flags, we manipulate
those objects as we would manipulate pandas dataframes. For instance, we
filter products to keep only those that match one of the baskets in the
``baskets`` table, and then add a column containing the total amount for each
kind of product in a basket:

.. GENERATED FROM PYTHON SOURCE LINES 109-115

.. code-block:: Python

    kept_products = products[products["basket_ID"].isin(basket_ids["ID"])]
    products_with_total = kept_products.assign(
        total_price=kept_products["Nbr_of_prod_purchas"] * kept_products["cash_price"]
    )
    products_with_total
(Output: a preview of the ``products_with_total`` DataOp, with its
computation graph ending in ``CallMethod 'assign'`` and an interactive
|TableReport| of the result; not rendered here.)
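At any point while building the plan, we can run it on the preview data to
inspect an intermediate result. As a hedged aside: recent skrub versions
expose this through ``.skb.eval()``; check the DataOps API of your version if
the call differs.

.. code-block:: Python

    # Evaluate the DataOps plan built so far on the data the variables were
    # initialized with; this returns a regular pandas DataFrame.
    preview_df = products_with_total.skb.eval()
    print(preview_df.head())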
.. GENERATED FROM PYTHON SOURCE LINES 116-127

We then build a skrub ``TableVectorizer`` with different choices of the type
of encoder for high-cardinality categorical or string columns, and of the
number of components it uses.

With skrub, there is no need to specify a separate grid of hyperparameters
outside the pipeline. Instead, within a DataOps plan, we can directly replace
a parameter's value with one of skrub's ``choose_*`` functions, which define
the range of values to consider during hyperparameter selection. In this
example, we use |choose_int| to select the number of components for the
encoder, and |choose_from| to select the type of encoder.

.. GENERATED FROM PYTHON SOURCE LINES 129-139

.. code-block:: Python

    n = skrub.choose_int(5, 15, name="n_components")
    encoder = skrub.choose_from(
        {
            "MinHash": skrub.MinHashEncoder(n_components=n),
            "LSA": skrub.StringEncoder(n_components=n),
        },
        name="encoder",
    )
    vectorizer = skrub.TableVectorizer(high_cardinality=encoder)

.. GENERATED FROM PYTHON SOURCE LINES 140-143

We can restrict the vectorizer to a subset of columns: in our case, we want
to vectorize all columns except ``"basket_ID"``, which is not a feature but a
link to the basket the products belong to.

.. GENERATED FROM PYTHON SOURCE LINES 145-149

.. code-block:: Python

    vectorized_products = products_with_total.skb.apply(
        vectorizer, exclude_cols="basket_ID"
    )

.. GENERATED FROM PYTHON SOURCE LINES 150-152

We then aggregate the vectorized products by basket ID and merge the result
with the baskets table.

.. GENERATED FROM PYTHON SOURCE LINES 154-159

.. code-block:: Python

    aggregated_products = (
        vectorized_products.groupby("basket_ID").agg("mean").reset_index()
    )
    augmented_baskets = basket_ids.merge(
        aggregated_products, left_on="ID", right_on="basket_ID"
    ).drop(columns=["ID", "basket_ID"])

.. GENERATED FROM PYTHON SOURCE LINES 160-162

Finally, we add a supervised estimator, using |choose_float| to tune the
learning rate as a hyperparameter.

.. GENERATED FROM PYTHON SOURCE LINES 164-172

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingClassifier

    hgb = HistGradientBoostingClassifier(
        learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="learning_rate")
    )
    predictions = augmented_baskets.skb.apply(hgb, y=fraud_flags)
    predictions
(Output: a preview of the ``predictions`` DataOp, with the full computation
graph ending in ``Apply HistGradientBoostingClassifier`` and an interactive
|TableReport| of the result; not rendered here.)
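Our DataOps plan is now complete. Because ``basket_ids`` and ``fraud_flags``
were marked as ``X`` and ``y``, the whole plan can be cross-validated before
any tuning. This is a hedged sketch using the ``.skb.cross_validate`` helper
available in recent skrub versions:

.. code-block:: Python

    # Cross-validate the full plan (filtering, vectorization, aggregation,
    # join and classifier) with the default value of every choice.
    cv_results = predictions.skb.cross_validate(scoring="roc_auc", n_jobs=4)
    print(cv_results["test_score"])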
.. GENERATED FROM PYTHON SOURCE LINES 173-178

We can now use |make_randomized_search| to tune the hyperparameters and find
the best combination for our model. The parameter grid below shows the
hyperparameter ranges that define our search space.

.. GENERATED FROM PYTHON SOURCE LINES 180-182

.. code-block:: Python

    print(predictions.skb.describe_param_grid())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    - learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
      encoder: 'MinHash'
      n_components: choose_int(5, 15, name='n_components')
    - learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
      encoder: 'LSA'
      n_components: choose_int(5, 15, name='n_components')

.. GENERATED FROM PYTHON SOURCE LINES 183-185

|make_randomized_search| returns a :class:`~skrub.ParamSearch` object, which
contains the search results and some plotting logic.

.. GENERATED FROM PYTHON SOURCE LINES 185-190

.. code-block:: Python

    search = predictions.skb.make_randomized_search(
        scoring="roc_auc", n_iter=8, n_jobs=4, random_state=0, fitted=True
    )
    search.results_
.. rst-class:: sphx-glr-script-out

.. code-block:: none

       n_components  encoder  learning_rate  mean_test_score
    0            10      LSA       0.038147         0.884499
    1            14      LSA       0.052436         0.884043
    2            13      LSA       0.067286         0.883262
    3            19  MinHash       0.056148         0.882233
    4            17  MinHash       0.086695         0.882082
    5            18      LSA       0.013766         0.874956
    6            13  MinHash       0.446581         0.767410
    7            16      LSA       0.501435         0.710838
.. GENERATED FROM PYTHON SOURCE LINES 191-192

We can also display the results of the search in a parallel coordinates plot:

.. GENERATED FROM PYTHON SOURCE LINES 192-194

.. code-block:: Python

    search.plot_results()
(Output: an interactive parallel coordinates plot of the search results; not
rendered here.)
.. GENERATED FROM PYTHON SOURCE LINES 195-200

It seems that using LSA as the encoder brings better test scores here, at the
expense of training and scoring time. We can get the best performing
:class:`~skrub.SkrubLearner` via ``best_learner_``, and use it for inference
on new data:

.. GENERATED FROM PYTHON SOURCE LINES 200-218

.. code-block:: Python

    import pandas as pd

    new_baskets = pd.DataFrame([dict(ID="abc")])
    new_products = pd.DataFrame(
        [
            dict(
                basket_ID="abc",
                item="COMPUTER",
                cash_price=200,
                make="APPLE",
                model="XXX-X",
                goods_code="239246782",
                Nbr_of_prod_purchas=1,
            )
        ]
    )
    search.best_learner_.predict_proba({"baskets": new_baskets, "products": new_products})

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([[0.99894788, 0.00105212]])

.. GENERATED FROM PYTHON SOURCE LINES 219-231

Conclusion
----------

In this example, we have shown how to build a multi-table machine learning
pipeline with skrub DataOps. We have seen how DataOps let us manipulate
dataframes with pandas, build a plan that makes use of multiple tables, and
tune the hyperparameters of the resulting pipeline.

To learn more about tuning hyperparameters with skrub DataOps, see the
:ref:`Tuning Pipelines example ` for an in-depth tutorial.

.. rst-class:: sphx-glr-timing

    **Total running time of the script:** (4 minutes 32.535 seconds)

.. _sphx_glr_download_auto_examples_data_ops_11_multiple_tables.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/data_ops/11_multiple_tables.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../../lite/lab/index.html?path=auto_examples/data_ops/11_multiple_tables.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 11_multiple_tables.ipynb <11_multiple_tables.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 11_multiple_tables.py <11_multiple_tables.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 11_multiple_tables.zip <11_multiple_tables.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_