.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/08_join_aggregation.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end <sphx_glr_download_auto_examples_08_join_aggregation.py>` to download the full example code, or to run this example in your browser via JupyterLite or Binder. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_08_join_aggregation.py: AggJoiner on a credit fraud dataset =================================== Many problems involve tables whose entities have a one-to-many relationship. To simplify aggregate-then-join operations for machine learning, we can include the |AggJoiner| in our pipeline. In this example, we are tackling a fraudulent loan detection use case. Because fraud is rare, this dataset is extremely imbalanced, with a prevalence of around 1.4%. The data consists of two distinct entities: e-commerce "baskets", and "products". Baskets can be tagged fraudulent (1) or not (0), and are essentially a list of products of variable size. Each basket is linked to at least one product, e.g. basket 1 can have products 1 and 2. .. image:: ../../_static/08_example_data.png :width: 450 px | Our aim is to predict which baskets are fraudulent. The products dataframe can be joined on the baskets dataframe using the ``basket_ID`` column. Each product has several attributes: - a category (marked by the column ``"item"``), - a model (``"model"``), - a brand (``"make"``), - a merchant code (``"goods_code"``), - a price per unit (``"cash_price"``), - a quantity selected in the basket (``"Nbr_of_prod_purchas"``) .. |AggJoiner| replace:: :class:`~skrub.AggJoiner` .. |Joiner| replace:: :class:`~skrub.Joiner` .. |DropCols| replace:: :class:`~skrub.DropCols` .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` .. |TableReport| replace:: :class:`~skrub.TableReport` .. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder` .. 
|TargetEncoder| replace:: :class:`~sklearn.preprocessing.TargetEncoder` .. |make_pipeline| replace:: :func:`~sklearn.pipeline.make_pipeline` .. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline` .. |HGBC| replace:: :class:`~sklearn.ensemble.HistGradientBoostingClassifier` .. |OrdinalEncoder| replace:: :class:`~sklearn.preprocessing.OrdinalEncoder` .. |TunedThresholdClassifierCV| replace:: :class:`~sklearn.model_selection.TunedThresholdClassifierCV` .. |CalibrationDisplay| replace:: :class:`~sklearn.calibration.CalibrationDisplay` .. |pandas.melt| replace:: :func:`~pandas.melt` .. GENERATED FROM PYTHON SOURCE LINES 82-89 .. code-block:: Python from skrub import TableReport from skrub.datasets import fetch_credit_fraud bunch = fetch_credit_fraud() products, baskets = bunch.products, bunch.baskets TableReport(products) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



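The two reports above reflect the one-to-many relationship described in the introduction: each basket row is linked to one or more product rows. On toy data (hypothetical IDs and values, not the actual dataset), this relationship can be checked with plain pandas:

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical IDs and values).
baskets_toy = pd.DataFrame({"ID": [1, 2, 3], "fraud_flag": [0, 1, 0]})
products_toy = pd.DataFrame(
    {"basket_ID": [1, 1, 2, 3, 3, 3], "cash_price": [10, 25, 5, 8, 8, 30]}
)

# Every basket should appear at least once in the products table.
assert baskets_toy["ID"].isin(products_toy["basket_ID"]).all()

# The number of products per basket varies: this is the "many" side.
n_products = products_toy["basket_ID"].value_counts().sort_index()
print(n_products.to_dict())  # {1: 2, 2: 1, 3: 3}
```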
.. GENERATED FROM PYTHON SOURCE LINES 90-92 .. code-block:: Python TableReport(baskets) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



.. GENERATED FROM PYTHON SOURCE LINES 93-106 Naive aggregation ----------------- Let's explore a naive solution first. .. note:: Click :ref:`here <agg-joiner-anchor>` to skip this section and see the AggJoiner in action! The first idea that comes to mind to merge these two tables is to aggregate the product attributes into lists, using their basket IDs. .. GENERATED FROM PYTHON SOURCE LINES 106-109 .. code-block:: Python products_grouped = products.groupby("basket_ID").agg(list) TableReport(products_grouped) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



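To see concretely what the ``groupby(...).agg(list)`` call above produces, here is the same aggregation on a toy products table (hypothetical values):

```python
import pandas as pd

# Toy products table: basket 1 holds two products, basket 2 holds one.
toy_products = pd.DataFrame(
    {
        "basket_ID": [1, 1, 2],
        "item": ["phone", "case", "tv"],
        "cash_price": [500, 20, 900],
    }
)

# After aggregation, each cell holds a list of variable length, one per basket.
toy_grouped = toy_products.groupby("basket_ID").agg(list)
print(toy_grouped.loc[1, "item"])  # ['phone', 'case']
print(toy_grouped.loc[2, "cash_price"])  # [900]
```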
.. GENERATED FROM PYTHON SOURCE LINES 110-113 Then, we can expand all lists into columns, as if we were "flattening" the dataframe. We end up with a products dataframe ready to be joined on the baskets dataframe, using ``"basket_ID"`` as the join key. .. GENERATED FROM PYTHON SOURCE LINES 113-123 .. code-block:: Python import pandas as pd products_flatten = [] for col in products_grouped.columns: cols = [f"{col}{idx}" for idx in range(24)] products_flatten.append(pd.DataFrame(products_grouped[col].to_list(), columns=cols)) products_flatten = pd.concat(products_flatten, axis=1) products_flatten.insert(0, "basket_ID", products_grouped.index) TableReport(products_flatten) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



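Expanding lists of unequal lengths, as done above, pads the shorter baskets with missing values. A toy sketch of that expansion (hypothetical values):

```python
import pandas as pd

# Lists of unequal lengths, as produced by the aggregation step.
item_lists = pd.Series([["phone", "case"], ["tv"]])

# Expanding into columns pads the shorter list with a missing value.
flat = pd.DataFrame(item_lists.to_list(), columns=["item0", "item1"])
assert flat.loc[0, "item1"] == "case"
assert pd.isna(flat.loc[1, "item1"])  # the single-product basket is padded
```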
.. GENERATED FROM PYTHON SOURCE LINES 124-144 Look at the "Stats" section of the |TableReport| above. Does anything strike you? Not only did we create 144 columns, but most of these columns are filled with NaN, which is very inefficient for learning! This is because each basket contains a variable number of products, up to 24, and we created one column for each product attribute, for each position (up to 24) in the dataframe. Moreover, if we wanted to replace text columns with encodings, we would create :math:`d \times 24 \times 2` columns (encoding of dimensionality :math:`d`, for 24 products, for the ``"item"`` and ``"make"`` columns), which would explode the memory usage. .. _agg-joiner-anchor: AggJoiner --------- Let's now see how the |AggJoiner| can help us solve this. We begin by splitting our baskets dataset into a training and a testing set. .. GENERATED FROM PYTHON SOURCE LINES 144-150 .. code-block:: Python from sklearn.model_selection import train_test_split X, y = baskets[["ID"]], baskets["fraud_flag"] X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1) X_train.shape, y_train.shape .. rst-class:: sphx-glr-script-out .. code-block:: none ((83511, 1), (83511,)) .. GENERATED FROM PYTHON SOURCE LINES 151-163 Before aggregating our product dataframe, we need to vectorize our categorical columns. To do so, we use: - |MinHashEncoder| on the "item" and "model" columns, because they both exhibit typos and textual similarities. - |OrdinalEncoder| on the "make" and "goods_code" columns, because they consist of orthogonal categories. We bring this logic into a |TableVectorizer| to vectorize these columns in a single step. See `this example `_ for more details about these encoding choices. .. GENERATED FROM PYTHON SOURCE LINES 163-176 ..
code-block:: Python from sklearn.preprocessing import OrdinalEncoder from skrub import MinHashEncoder, TableVectorizer vectorizer = TableVectorizer( high_cardinality=MinHashEncoder(), # encode ["item", "model"] specific_transformers=[ (OrdinalEncoder(), ["make", "goods_code"]), ], ) products_transformed = vectorizer.fit_transform(products) TableReport(products_transformed) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



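A property of MinHash worth keeping in mind for what follows: the element-wise minimum of two signatures equals the signature of the union of their n-gram sets. The sketch below illustrates this with a simplified pure-Python minhash (salted CRC32 as stand-in hash functions; this is an illustration of the principle, not skrub's actual implementation):

```python
import zlib


def ngrams(s, n=3):
    """Set of character n-grams of a string."""
    return {s[i : i + n] for i in range(len(s) - n + 1)}


def minhash(grams, n_components=4):
    """One signature entry per simulated hash function (salted CRC32)."""
    return [
        min(zlib.crc32(f"{seed}{g}".encode()) for g in grams)
        for seed in range(n_components)
    ]


a, b = ngrams("computer"), ngrams("table")

# Signature of the union == element-wise min of the two signatures.
sig_union = minhash(a | b)
sig_min = [min(x, y) for x, y in zip(minhash(a), minhash(b))]
assert sig_union == sig_min
```

This is why min-aggregating the MinHash columns of a basket's products gives a meaningful basket representation: it approximates the signature of all the products' n-grams pooled together.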
.. GENERATED FROM PYTHON SOURCE LINES 177-197 Our objective is now to aggregate this vectorized product dataframe by ``"basket_ID"``, then to merge it on the baskets dataframe, still using ``"basket_ID"``. .. image:: ../../_static/08_example_aggjoiner.png :width: 900 | |AggJoiner| can help us achieve exactly this. We need to pass the product dataframe as an auxiliary table argument to |AggJoiner| in ``__init__``. The ``aux_key`` argument represents both the columns to group by and the columns to join on. The basket dataframe is our main table, and we indicate the columns to join on with ``main_key``. Note that we pass the main table during ``fit``, and we discuss the limitations of this design in the conclusion at the bottom of this notebook. The minimum ("min") is the most appropriate operation to aggregate encodings from |MinHashEncoder|, for reasons that are outside the scope of this notebook. .. GENERATED FROM PYTHON SOURCE LINES 197-215 .. code-block:: Python from skrub import AggJoiner from skrub import _selectors as s # Skrub selectors allow us to select columns using regexes, which reduces # the boilerplate. minhash_cols_query = s.glob("item*") | s.glob("model*") minhash_cols = s.select(products_transformed, minhash_cols_query).columns agg_joiner = AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=minhash_cols, operations=["min"], ) baskets_products = agg_joiner.fit_transform(baskets) TableReport(baskets_products) .. raw:: html

    <!-- skrub TableReport: interactive HTML output omitted -->



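Conceptually, the fit-transform above performs an aggregate-then-join that can be sketched in plain pandas on toy data (hypothetical values; the column naming here is simplified and not skrub's exact output):

```python
import pandas as pd

# Toy main (baskets) and auxiliary (products) tables.
main = pd.DataFrame({"ID": [1, 2]})
aux = pd.DataFrame({"basket_ID": [1, 1, 2], "item_0": [0.3, 0.1, 0.7]})

# 1) aggregate the auxiliary table on the key, 2) left-join on the main table.
agg = aux.groupby("basket_ID")[["item_0"]].min().add_suffix("_min")
joined = main.merge(agg, left_on="ID", right_index=True, how="left")
print(joined)
#    ID  item_0_min
# 0   1         0.1
# 1   2         0.7
```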
.. GENERATED FROM PYTHON SOURCE LINES 216-227 Now that we understand how to use the |AggJoiner|, we can assemble our pipeline by chaining two |AggJoiner| together: - the first one to deal with the |MinHashEncoder| vectors, as we just saw - the second one to deal with all the other columns For the second |AggJoiner|, we use the sum, mean, standard deviation, minimum and maximum operations to extract a representative summary of each distribution. |DropCols| is another skrub transformer which removes the "ID" column, which no longer carries any information after the join. .. GENERATED FROM PYTHON SOURCE LINES 227-253 .. code-block:: Python from scipy.stats import loguniform, randint from sklearn.ensemble import HistGradientBoostingClassifier from sklearn.pipeline import make_pipeline from skrub import DropCols model = make_pipeline( AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=minhash_cols, operations=["min"], ), AggJoiner( aux_table=products_transformed, aux_key="basket_ID", main_key="ID", cols=["make", "goods_code", "cash_price", "Nbr_of_prod_purchas"], operations=["sum", "mean", "std", "min", "max"], ), DropCols(["ID"]), HistGradientBoostingClassifier(), ) model .. raw:: html
Pipeline(steps=[('aggjoiner-1',
                     AggJoiner(aux_key='basket_ID',
                               aux_table=        basket_ID       item_00  ...  goods_code  Nbr_of_prod_purchas
    0         85517.0 -2.119082e+09  ...     11181.0                  1.0
    1         51113.0 -2.119082e+09  ...     10552.0                  1.0
    2         83008.0 -2.128260e+09  ...     12038.0                  1.0
    3         78712.0 -2.119082e+09  ...     10513.0                  1.0
    4         78712.0 -2.119082e+09  ...      4925.0                  1.0
    ...           ...           ...  ...         ...                  ...
    163352    42613.0 -1.944861e+09  ...      2807.0                  1.0
    163353...
    163354    43567.0 -2.119082e+09  ...     13080.0                  1.0
    163355    43567.0 -2.119082e+09  ...      9971.0                  1.0
    163356    68268.0 -2.128260e+09  ...     12106.0                  1.0

    [163357 rows x 65 columns],
                               cols=['make', 'goods_code', 'cash_price',
                                     'Nbr_of_prod_purchas'],
                               main_key='ID',
                               operations=['sum', 'mean', 'std', 'min', 'max'])),
                    ('dropcols', DropCols(cols=['ID'])),
                    ('histgradientboostingclassifier',
                     HistGradientBoostingClassifier())])


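The second |AggJoiner| in the pipeline computes several statistics per basket. In plain pandas, that step corresponds to something like the following (toy values, simplified naming):

```python
import pandas as pd

# Toy products table with one numeric column (hypothetical values).
aux = pd.DataFrame({"basket_ID": [1, 1, 2], "cash_price": [10.0, 30.0, 5.0]})

# One summary column per (source column, operation) pair.
summary = aux.groupby("basket_ID")["cash_price"].agg(
    ["sum", "mean", "std", "min", "max"]
)
print(summary.loc[1].to_dict())
# {'sum': 40.0, 'mean': 20.0, 'std': 14.142..., 'min': 10.0, 'max': 30.0}
```

Note that single-product baskets yield NaN for "std"; this is fine here because HistGradientBoostingClassifier supports missing values natively.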
.. GENERATED FROM PYTHON SOURCE LINES 254-255 We tune the hyper-parameters of the |HGBC| to get good performance. .. GENERATED FROM PYTHON SOURCE LINES 255-277 .. code-block:: Python from time import time from sklearn.model_selection import RandomizedSearchCV param_distributions = dict( histgradientboostingclassifier__learning_rate=loguniform(1e-3, 1), histgradientboostingclassifier__max_depth=randint(3, 9), histgradientboostingclassifier__max_leaf_nodes=[None, 10, 30, 60, 90], histgradientboostingclassifier__max_iter=randint(50, 500), ) tic = time() search = RandomizedSearchCV( model, param_distributions, scoring="neg_log_loss", refit=False, n_iter=10, cv=3, verbose=1, ).fit(X_train, y_train) print(f"This operation took {time() - tic:.1f}s") .. rst-class:: sphx-glr-script-out .. code-block:: none Fitting 3 folds for each of 10 candidates, totalling 30 fits This operation took 150.3s .. GENERATED FROM PYTHON SOURCE LINES 278-279 The best hyper-parameters are: .. GENERATED FROM PYTHON SOURCE LINES 279-282 .. code-block:: Python pd.Series(search.best_params_) .. rst-class:: sphx-glr-script-out .. code-block:: none histgradientboostingclassifier__learning_rate 0.022994 histgradientboostingclassifier__max_depth 7.000000 histgradientboostingclassifier__max_iter 398.000000 histgradientboostingclassifier__max_leaf_nodes 60.000000 dtype: float64 .. GENERATED FROM PYTHON SOURCE LINES 283-291 To benchmark our performance, we plot the log loss of our model on the test set against the log loss of a dummy model that always outputs the observed probability of the two classes. As this dataset is extremely imbalanced, this dummy model should be a good baseline. The vertical bar represents one standard deviation around the mean of the cross-validation log loss. .. GENERATED FROM PYTHON SOURCE LINES 291-325 ..
code-block:: Python import seaborn as sns from matplotlib import pyplot as plt from sklearn.dummy import DummyClassifier from sklearn.metrics import log_loss results = search.cv_results_ best_idx = search.best_index_ log_loss_model_mean = -results["mean_test_score"][best_idx] log_loss_model_std = results["std_test_score"][best_idx] dummy = DummyClassifier(strategy="prior").fit(X_train, y_train) y_proba_dummy = dummy.predict_proba(X_test) log_loss_dummy = log_loss(y_true=y_test, y_pred=y_proba_dummy) fig, ax = plt.subplots() ax.bar( height=[log_loss_model_mean, log_loss_dummy], x=["AggJoiner model", "Dummy"], color=["C0", "C4"], ) for container in ax.containers: ax.bar_label(container, padding=4) ax.vlines( x="AggJoiner model", ymin=log_loss_model_mean - log_loss_model_std, ymax=log_loss_model_mean + log_loss_model_std, linestyle="-", linewidth=1, color="k", ) sns.despine() ax.set_title("Log loss (lower is better)") .. image-sg:: /auto_examples/images/sphx_glr_08_join_aggregation_001.png :alt: Log loss (lower is better) :srcset: /auto_examples/images/sphx_glr_08_join_aggregation_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Text(0.5, 1.0, 'Log loss (lower is better)') .. GENERATED FROM PYTHON SOURCE LINES 326-344 Conclusion ---------- With |AggJoiner|, you can bring the aggregation and joining operations into a scikit-learn pipeline, and train models more efficiently. One known limitation of both the |AggJoiner| and |Joiner| is that the auxiliary data to join is passed during the ``__init__`` method instead of the ``fit`` method, and is therefore fixed once the model has been trained. This limitation causes two main issues: 1. **Bigger model serialization:** Since the dataset has to be pickled along with the model, it can result in a massive file size on disk. 2. 
**Inflexibility with new, unseen data in a production environment:** To use new auxiliary data, you would need to replace the auxiliary table in the |AggJoiner| that was used during ``fit`` with the updated data, which is a rather hacky approach. These limitations will be addressed later in skrub. .. rst-class:: sphx-glr-timing **Total running time of the script:** (4 minutes 31.548 seconds) .. _sphx_glr_download_auto_examples_08_join_aggregation.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.4.1?urlpath=lab/tree/notebooks/auto_examples/08_join_aggregation.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../lite/lab/index.html?path=auto_examples/08_join_aggregation.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 08_join_aggregation.ipynb <08_join_aggregation.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 08_join_aggregation.py <08_join_aggregation.py>` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: 08_join_aggregation.zip <08_join_aggregation.zip>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_