.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/expressions/10_expressions_intro.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_expressions_10_expressions_intro.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_expressions_10_expressions_intro.py:


.. _example_expressions_intro:

Building complex tabular pipelines
==================================

In this example, we show a simple pipeline handling a dataset with 2 tables,
which would be difficult to implement, validate, and deploy correctly without
skrub.

.. GENERATED FROM PYTHON SOURCE LINES 13-22

The credit fraud dataset
------------------------

This dataset comes from an e-commerce website. We have a set of "baskets"
(orders that have been placed with the website). The task is to detect which
orders were fraudulent (the customer never made the payment).

The ``baskets`` table contains a basket ID and a flag indicating if the order
was fraudulent or not.

.. GENERATED FROM PYTHON SOURCE LINES 24-30

.. code-block:: Python

    import skrub
    import skrub.datasets

    dataset = skrub.datasets.fetch_credit_fraud()
    skrub.TableReport(dataset.baskets)

(Output: an interactive skrub ``TableReport`` of the ``baskets`` table,
rendered as HTML and omitted here.)
.. GENERATED FROM PYTHON SOURCE LINES 31-35

Each basket contains one or more products. Each row in the ``products`` table
corresponds to a type of product present in a basket. Products can be
associated with the corresponding basket through the ``"basket_ID"`` column.

.. GENERATED FROM PYTHON SOURCE LINES 37-39

.. code-block:: Python

    skrub.TableReport(dataset.products)

(Output: an interactive skrub ``TableReport`` of the ``products`` table,
rendered as HTML and omitted here.)
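As a quick sanity check, the two tables can be linked with an ordinary pandas
merge. This is a sketch for illustration only (the ``linked`` name is ours,
and this join is not part of the pipeline built below):

.. code-block:: Python

    # Each product row references its basket through "basket_ID"; a plain
    # merge associates every product with its basket and fraud flag.
    linked = dataset.products.merge(
        dataset.baskets, left_on="basket_ID", right_on="ID"
    )
    print(linked.shape)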
.. GENERATED FROM PYTHON SOURCE LINES 40-87

A data-processing challenge
---------------------------

We want to fit a ``HistGradientBoostingClassifier`` to predict the fraud flag
(or ``y``). We build a design matrix with one row per basket (and thus per
fraud flag). Our ``baskets`` (or ``X``) table only contains IDs. We enrich it
by adding features from the ``products`` table.

The general structure of the pipeline looks like this:

.. image:: ../../_static/credit_fraud_diagram.svg
    :width: 300

First, as the ``products`` table contains strings and categories (such as
``"SAMSUNG"``), we vectorize those entries to extract numeric features. This
is easily done with skrub's ``TableVectorizer``. Then, as each basket can
contain several products, all the product lines corresponding to a basket are
aggregated into a single feature vector that can be attached to the basket.

The difficulty is that the products should be aggregated before joining to
``baskets`` and, in order to compute a meaningful aggregation, must be
vectorized *before* the aggregation. Thus, we have a ``TableVectorizer`` to
fit on a table which does not (yet) have the same number of rows as the
target ``y``, something that the scikit-learn ``Pipeline``, with its
single-input, linear structure, does not allow.

We can fit it ourselves, outside of any pipeline, with something like::

    vectorizer = skrub.TableVectorizer()
    vectorized_products = vectorizer.fit_transform(dataset.products)

However, because it is dissociated from the main estimator which handles
``X`` (the baskets), we have to manage this transformer ourselves. We lose
the scikit-learn machinery for grouping all transformation steps, storing
fitted estimators, splitting the input data, cross-validation, and
hyper-parameter tuning.

Moreover, we might need some pandas code to perform the aggregation and join.
Again, as this transformation is not in a scikit-learn estimator, it is
error-prone: we have to keep track of it ourselves to apply it later to
unseen data, and we cannot tune any of its choices (such as the choice of the
aggregation function).

Fortunately, skrub provides an alternative way to build more flexible
pipelines.

.. GENERATED FROM PYTHON SOURCE LINES 89-97

A solution with skrub
---------------------

In a skrub pipeline, we do not provide an explicit list of transformation
steps. Rather, we manipulate skrub objects representing intermediate results.
The pipeline is built implicitly as we perform operations (such as applying
operators or calling functions) on those objects.

.. GENERATED FROM PYTHON SOURCE LINES 99-101

We start by creating skrub variables, which are the inputs to our pipeline.
In our example, we create three variables: "products", "baskets", and "fraud
flags":

.. GENERATED FROM PYTHON SOURCE LINES 103-107

.. code-block:: Python

    products = skrub.var("products", dataset.products)
    baskets = skrub.var("baskets", dataset.baskets[["ID"]]).skb.mark_as_X()
    fraud_flags = skrub.var("fraud_flags", dataset.baskets["fraud_flag"]).skb.mark_as_y()

.. GENERATED FROM PYTHON SOURCE LINES 108-128

They are given a name and an (optional) initial value, used to show previews
of the pipeline's output, detect errors early, and provide data for
cross-validation and hyperparameter search. We then build the pipeline by
applying transformations to those inputs.
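To make this style of pipeline-building concrete, here is a minimal sketch
that is independent of the fraud dataset (the names ``a``, ``b`` and ``c``
are ours, for illustration only):

.. code-block:: Python

    import skrub

    # A variable is a named input; the optional second argument is an
    # example value used to compute previews and catch errors early.
    a = skrub.var("a", 10)
    b = skrub.var("b", 5)

    # Operations are not executed eagerly: they record a step in the
    # pipeline, and the preview (here, 15) is computed from the example
    # values.
    c = a + b

    # The recorded pipeline can then be evaluated on new inputs; in the
    # skrub version used for this example, this is done with `.skb.eval()`.
    c.skb.eval({"a": 1, "b": 2})  # returns 3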
In the fraud pipeline above, ``mark_as_X()`` and ``mark_as_y()`` indicate
that the baskets and flags are respectively our design matrix and target
variables, which should be split into training and testing sets for
cross-validation. Here, they are direct inputs to the pipeline, but any
intermediate result could be marked as X or y.

Because our pipeline expects dataframes for the products, baskets and fraud
flags, we manipulate those objects as we would manipulate dataframes. All
attribute accesses are transparently forwarded to the actual input dataframes
when we run the pipeline.

For instance, we filter products to keep only those that match one of the
baskets in the ``baskets`` table, and then add a column containing the total
amount for each kind of product in a basket:

.. GENERATED FROM PYTHON SOURCE LINES 130-136

.. code-block:: Python

    kept_products = products[products["basket_ID"].isin(baskets["ID"])]
    products_with_total = kept_products.assign(
        total_price=kept_products["Nbr_of_prod_purchas"] * kept_products["cash_price"]
    )
    products_with_total
(Output: a preview of ``products_with_total``, with a "Show graph" dropdown
displaying the computation graph built so far and an interactive table report
of the resulting dataframe, including the new ``total_price`` column.)
.. GENERATED FROM PYTHON SOURCE LINES 137-167

We see previews of the output of intermediate results. For example, the added
``"total_price"`` column appears in the output above. The "Show graph"
dropdown at the top lets us check the structure of the pipeline and all the
steps it contains.

.. note::

    We recommend assigning each new skrub expression to a new variable name,
    as is done above. For example, ``kept_products = products[...]`` instead
    of reusing the name ``products = products[...]``. This makes it easy to
    backtrack to any step of the pipeline and change the subsequent steps,
    and avoids ending up in a confusing state in Jupyter notebooks when the
    same cell might be re-executed several times.

With skrub, we do not need to specify a grid of hyperparameters separately
from the pipeline. Instead, we replace a parameter's value with a skrub
"choice", which indicates the range of values we consider during
hyperparameter selection.

Skrub choices can be nested arbitrarily. They are not restricted to
parameters of a scikit-learn estimator, but can be anything: choosing between
different estimators, arguments to function calls, whole sections of the
pipeline, etc. In-depth information about choices and hyperparameter/model
selection is provided in the :ref:`Tuning Pipelines example `.

We build a skrub ``TableVectorizer`` with two choices: the type of encoder
for high-cardinality categorical or string columns, and the number of
components it uses.

.. GENERATED FROM PYTHON SOURCE LINES 169-179

.. code-block:: Python

    n = skrub.choose_int(5, 15, name="n_components")
    encoder = skrub.choose_from(
        {
            "MinHash": skrub.MinHashEncoder(n_components=n),
            "LSA": skrub.StringEncoder(n_components=n),
        },
        name="encoder",
    )
    vectorizer = skrub.TableVectorizer(high_cardinality=encoder)

.. GENERATED FROM PYTHON SOURCE LINES 180-183

A transformer does not have to apply to the full dataframe; we can restrict
it to some columns, using the ``cols`` or ``exclude_cols`` parameters. In our
example, we vectorize all columns except ``"basket_ID"``.

.. GENERATED FROM PYTHON SOURCE LINES 185-189

.. code-block:: Python

    vectorized_products = products_with_total.skb.apply(
        vectorizer, exclude_cols="basket_ID"
    )

.. GENERATED FROM PYTHON SOURCE LINES 190-193

Having access to the underlying dataframe's API, we can perform the
data-wrangling we need. Those transformations are implicitly added as steps
in our machine-learning pipeline.

.. GENERATED FROM PYTHON SOURCE LINES 195-200

.. code-block:: Python

    aggregated_products = (
        vectorized_products.groupby("basket_ID").agg("mean").reset_index()
    )
    augmented_baskets = baskets.merge(
        aggregated_products, left_on="ID", right_on="basket_ID"
    ).drop(columns=["ID", "basket_ID"])

.. GENERATED FROM PYTHON SOURCE LINES 201-211

We can also ask for a full report of the pipeline and inspect the results of
every step::

    predictions.skb.full_report()

This produces a folder on disk rather than displaying inline in a notebook,
so we do not run it here. But you can
`see the output here <../../_static/credit_fraud_report/index.html>`_.

Finally, we add a supervised estimator:

.. GENERATED FROM PYTHON SOURCE LINES 213-221

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingClassifier

    hgb = HistGradientBoostingClassifier(
        learning_rate=skrub.choose_float(0.01, 0.9, log=True, name="learning_rate")
    )
    predictions = augmented_baskets.skb.apply(hgb, y=fraud_flags)
    predictions
(Output: a preview of the ``predictions`` expression, with its full
computation graph, from the input variables through the ``TableVectorizer``,
aggregation, join, and ``HistGradientBoostingClassifier``, and a table report
of the result.)
.. GENERATED FROM PYTHON SOURCE LINES 222-227

And our pipeline is complete! From the choices we inserted at different
locations in our pipeline, skrub can build a grid of hyperparameters and run
the hyperparameter search for us, backed by scikit-learn's ``GridSearchCV``
or ``RandomizedSearchCV``.

.. GENERATED FROM PYTHON SOURCE LINES 229-231

.. code-block:: Python

    print(predictions.skb.describe_param_grid())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    - learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
      encoder: 'MinHash'
      n_components: choose_int(5, 15, name='n_components')
    - learning_rate: choose_float(0.01, 0.9, log=True, name='learning_rate')
      encoder: 'LSA'
      n_components: choose_int(5, 15, name='n_components')

.. GENERATED FROM PYTHON SOURCE LINES 232-237

.. code-block:: Python

    search = predictions.skb.get_randomized_search(
        scoring="roc_auc", n_iter=8, n_jobs=4, random_state=0, fitted=True
    )
    search.results_
.. rst-class:: sphx-glr-script-out

.. code-block:: none

       mean_test_score  n_components  encoder  learning_rate
    0         0.884405            14      LSA       0.052436
    1         0.883460            10      LSA       0.038147
    2         0.881597            13      LSA       0.067286
    3         0.879897            17  MinHash       0.086695
    4         0.879757            19  MinHash       0.056148
    5         0.876065            18      LSA       0.013766
    6         0.756046            13  MinHash       0.446581
    7         0.721020            16      LSA       0.501435
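Since ``search.results_`` is a regular dataframe, we can summarize it with
ordinary pandas code. As a small sketch (the ``groupby`` summary below is
ours, not part of the original example), we can compare the best score
reached by each encoder:

.. code-block:: Python

    # `results_` is a plain pandas dataframe, so the usual groupby
    # machinery applies: best mean test score per encoder.
    print(search.results_.groupby("encoder")["mean_test_score"].max())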
.. GENERATED FROM PYTHON SOURCE LINES 238-240

We can also run a cross-validation, using the first choice defined in each of
the ``choose`` objects:

.. GENERATED FROM PYTHON SOURCE LINES 240-243

.. code-block:: Python

    predictions.skb.cross_validate(scoring="roc_auc", verbose=1, n_jobs=4)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
    [Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   16.2s finished
.. rst-class:: sphx-glr-script-out

.. code-block:: none

        fit_time  score_time  test_score
    0  10.660552    1.629370    0.876892
    1   8.531549    2.432822    0.869317
    2   8.900710    2.557423    0.867150
    3   8.900213    2.315498    0.878599
    4   3.597726    0.919810    0.903728
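The cross-validation scores also come back as a dataframe, so aggregating
them is ordinary pandas. A small sketch, assuming we capture the returned
dataframe in a ``cv_results`` variable of our own:

.. code-block:: Python

    # Capture the per-fold results and summarize the ROC AUC across folds.
    cv_results = predictions.skb.cross_validate(scoring="roc_auc", n_jobs=4)
    print(cv_results["test_score"].mean(), cv_results["test_score"].std())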
.. GENERATED FROM PYTHON SOURCE LINES 244-255

We can also display a parallel coordinates plot of the results. In a parallel
coordinates plot, each line corresponds to one combination of hyperparameter
(choice) values, together with the corresponding test score and the training
and scoring durations. Different columns show the hyperparameter values.

By **clicking and dragging the mouse** on any column, we can restrict the set
of lines we see. This allows us to quickly inspect which hyperparameters are
important, which values perform best, and potential trade-offs between the
quality of predictions and computation time.

.. GENERATED FROM PYTHON SOURCE LINES 257-259

.. code-block:: Python

    search.plot_results()
(Output: an interactive parallel coordinates plot of the search results,
rendered as HTML and omitted here.)
.. GENERATED FROM PYTHON SOURCE LINES 260-267

It seems here that using LSA as the encoder brings better test scores, but at
the expense of training and scoring time.

Serializing
-----------

We would usually save this model in a binary file, but to avoid accessing the
filesystem with this example notebook, we serialize the model in memory
instead.

.. GENERATED FROM PYTHON SOURCE LINES 267-271

.. code-block:: Python

    import pickle

    saved_model = pickle.dumps(search.best_pipeline_)

.. GENERATED FROM PYTHON SOURCE LINES 272-274

Let's say we got some new data, and we want to use the model we just saved to
make predictions on it:

.. GENERATED FROM PYTHON SOURCE LINES 274-278

.. code-block:: Python

    new_data = skrub.datasets.fetch_credit_fraud(split="test")
    new_baskets = new_data.baskets[["ID"]]
    new_products = new_data.products

.. GENERATED FROM PYTHON SOURCE LINES 279-281

Our estimator expects the same variable names as the training pipeline, which
is why we pass a dictionary mapping the same variable names to the new
dataframes:

.. GENERATED FROM PYTHON SOURCE LINES 281-284

.. code-block:: Python

    loaded_model = pickle.loads(saved_model)
    loaded_model.predict({"baskets": new_baskets, "products": new_products})

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0, 0, 0, ..., 0, 0, 0], shape=(31549,))

.. GENERATED FROM PYTHON SOURCE LINES 285-291

Conclusion
----------

If you are curious to learn more about how to build your own complex,
multi-table pipelines with easy hyperparameter tuning, please see the next
examples for an in-depth tutorial.


.. rst-class:: sphx-glr-timing

    **Total running time of the script:** (4 minutes 41.480 seconds)


.. _sphx_glr_download_auto_examples_expressions_10_expressions_intro.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: binder-badge

            .. image:: images/binder_badge_logo.svg
                :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/expressions/10_expressions_intro.ipynb
                :alt: Launch binder
                :width: 150 px

        .. container:: lite-badge

            .. image:: images/jupyterlite_badge_logo.svg
                :target: ../../lite/lab/index.html?path=auto_examples/expressions/10_expressions_intro.ipynb
                :alt: Launch JupyterLite
                :width: 150 px

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: 10_expressions_intro.ipynb <10_expressions_intro.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: 10_expressions_intro.py <10_expressions_intro.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: 10_expressions_intro.zip <10_expressions_intro.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_