.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/06_ken_embeddings.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_06_ken_embeddings.py>`
        to download the full example code or to run this example in your browser
        via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_06_ken_embeddings.py:


Wikipedia embeddings to enrich the data
=======================================

When the data comprises common entities (cities, companies or famous
people), bringing in new information assembled from external sources may
be the key to improving the analysis.

Embeddings, or vector representations of entities, are a convenient way to
capture and summarize the information on an entity.
Relational data embeddings capture all common entities from Wikipedia. [#]_
These will be called `KEN embeddings` in the following example.

We will see that these embeddings of common entities significantly improve our
results.

.. note::
    This example requires `pyarrow` to be installed.

.. [#] https://soda-inria.github.io/ken_embeddings/


.. |Pipeline| replace::
    :class:`~sklearn.pipeline.Pipeline`

.. |OneHotEncoder| replace::
    :class:`~sklearn.preprocessing.OneHotEncoder`

.. |ColumnTransformer| replace::
    :class:`~sklearn.compose.ColumnTransformer`

.. |MinHash| replace::
    :class:`~skrub.MinHashEncoder`

.. |HGBR| replace::
    :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. GENERATED FROM PYTHON SOURCE LINES 40-45

The data
--------

We will take a look at the video game sales dataset.
Let's retrieve the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 45-56

.. code-block:: Python

    import pandas as pd

    X = pd.read_csv(
        "https://raw.githubusercontent.com/William2064888/vgsales.csv/main/vgsales.csv",
        sep=";",
        on_bad_lines="skip",
    )
    # Shuffle the data
    X = X.sample(frac=1, random_state=11, ignore_index=True)
    X.head(3)
.. rst-class:: sphx-glr-script-out

.. code-block:: none

        Rank                                        Name Platform  Year     Genre           Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
    0   6500                    Star Wars: Bounty Hunter       GC  2002   Shooter           LucasArts      0.20      0.05       0.0         0.01          0.26
    1  13442                  Thrillville: Off the Rails       DS  2007  Strategy           LucasArts      0.03      0.01       0.0         0.00          0.05
    2  15074  Thomas and Friends: Steaming around Sodor      3DS  2015    Action  Avanquest Software      0.00      0.02       0.0         0.00          0.02
.. GENERATED FROM PYTHON SOURCE LINES 57-58

Our goal will be to predict the sales amount (y, our target column):

.. GENERATED FROM PYTHON SOURCE LINES 58-62

.. code-block:: Python

    y = X["Global_Sales"]
    y

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0        0.26
    1        0.05
    2        0.02
    3        1.16
    4        0.03
             ...
    16567    0.25
    16568    0.49
    16569    0.22
    16570    0.53
    16571    0.11
    Name: Global_Sales, Length: 16572, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 63-64

Let's take a look at the distribution of our target variable:

.. GENERATED FROM PYTHON SOURCE LINES 64-72

.. code-block:: Python

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_theme(style="ticks")

    sns.histplot(y)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_06_ken_embeddings_001.png
   :alt: 06 ken embeddings
   :srcset: /auto_examples/images/sphx_glr_06_ken_embeddings_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 73-74

The distribution is heavily skewed, so it is better to model the log of the
sales rather than the raw values:

.. GENERATED FROM PYTHON SOURCE LINES 74-80

.. code-block:: Python

    import numpy as np

    y = np.log(y)
    sns.histplot(y)
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_06_ken_embeddings_002.png
   :alt: 06 ken embeddings
   :srcset: /auto_examples/images/sphx_glr_06_ken_embeddings_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 81-82

Before moving further, let's carry out some basic preprocessing:

.. GENERATED FROM PYTHON SOURCE LINES 82-89

.. code-block:: Python

    # Get a mask of the rows with missing values in "Publisher" or "Global_Sales"
    mask = X.isna()["Publisher"] | X.isna()["Global_Sales"]
    # Remove them from X, and use the mask to drop the same rows from y,
    # so that X and y stay aligned
    X.dropna(subset=["Publisher", "Global_Sales"], inplace=True)
    y = y[~mask]

.. GENERATED FROM PYTHON SOURCE LINES 90-97

Extracting entity embeddings
----------------------------

We will use KEN embeddings to enrich our data.

We will start by checking out the available tables with
:func:`~skrub.datasets.fetch_ken_table_aliases`:

.. GENERATED FROM PYTHON SOURCE LINES 97-101

.. code-block:: Python

    from skrub.datasets import fetch_ken_table_aliases

    fetch_ken_table_aliases()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'companies', 'games', 'all_entities', 'schools', 'movies', 'albums'}

.. GENERATED FROM PYTHON SOURCE LINES 102-105

The *games* table is the most relevant to our case.
Let's see what kinds of types we can find in it with the function
:func:`~skrub.datasets.fetch_ken_types`:

.. GENERATED FROM PYTHON SOURCE LINES 105-109

.. code-block:: Python

    from skrub.datasets import fetch_ken_types

    fetch_ken_types(embedding_table_id="games")
.. rst-class:: sphx-glr-script-out

.. code-block:: none

                                                       Type
    0                              wikicat_1994_video_games
    1                                    wikicat_irem_games
    2                           wikicat_ea_guingamp_players
    3    wikicat_video_game_companies_of_the_united_kin...
    4            wikicat_asian_games_medalists_in_swimming
    ..                                                  ...
    636                      wikicat_college_football_games
    637                            wikicat_sonic_team_games
    638                     wikicat_space_opera_video_games
    639              wikicat_boxers_at_the_2002_asian_games
    640                      wikicat_motorcycle_video_games

    [641 rows x 1 columns]

.. GENERATED FROM PYTHON SOURCE LINES 110-114

Interesting, we have a broad range of topics!
Next, we'll use :func:`~skrub.datasets.fetch_ken_embeddings` to extract the
embeddings of the entities we need:

.. GENERATED FROM PYTHON SOURCE LINES 114-116

.. code-block:: Python

    from skrub.datasets import fetch_ken_embeddings

.. GENERATED FROM PYTHON SOURCE LINES 117-128

KEN embeddings are classified by type.
See the example on :func:`~skrub.datasets.fetch_ken_embeddings` to understand
how you can filter the types you are interested in.

The :func:`~skrub.datasets.fetch_ken_embeddings` function allows us to specify
the types to be included and/or excluded, so as not to load all of Wikipedia's
entity embeddings into a single table.

In a first table, we include all embeddings whose type name contains "game"
and exclude those whose type name contains "companies" or "developer".

.. GENERATED FROM PYTHON SOURCE LINES 128-134

.. code-block:: Python

    embedding_games = fetch_ken_embeddings(
        search_types="game",
        exclude="companies|developer",
        embedding_table_id="games",
    )

.. GENERATED FROM PYTHON SOURCE LINES 135-137

In a second table, we include all embeddings whose type name contains
"game_development_companies", "game_companies" or "game_publish":

.. GENERATED FROM PYTHON SOURCE LINES 137-150

.. code-block:: Python

    embedding_publisher = fetch_ken_embeddings(
        search_types="game_development_companies|game_companies|game_publish",
        embedding_table_id="games",
        suffix="_aux",
    )

    # We keep the names of the 200 embedding columns in two lists (for the |Pipeline|):
    n_dim = 200

    emb_columns = [f"X{j}" for j in range(n_dim)]

    emb_columns2 = [f"X{j}_aux" for j in range(n_dim)]

.. GENERATED FROM PYTHON SOURCE LINES 151-160

Merging the entities
....................

We will now merge the Wikipedia entities with their equivalent matches in our
video game sales table: the entities from the 'embedding_games' table are
merged on the "Name" column, and those from the 'embedding_publisher' table
on the "Publisher" column.

.. GENERATED FROM PYTHON SOURCE LINES 160-168

.. code-block:: Python

    from skrub import Joiner

    fa1 = Joiner(embedding_games, aux_key="Entity", main_key="Name")
    fa2 = Joiner(embedding_publisher, aux_key="Entity", main_key="Publisher", suffix="_aux")

    X_full = fa1.fit_transform(X)
    X_full = fa2.fit_transform(X_full)

.. GENERATED FROM PYTHON SOURCE LINES 169-175

Prediction with base features
-----------------------------

For now, let's put the KEN embeddings aside and build a typical learning
pipeline, in which we try to predict the amount of sales using only the base
features contained in the initial table.

.. GENERATED FROM PYTHON SOURCE LINES 177-180

We first use scikit-learn's |ColumnTransformer| to define the columns that
will be included in the learning process and the appropriate encoding of
categorical variables, using the |MinHash| and |OneHotEncoder|:

.. GENERATED FROM PYTHON SOURCE LINES 180-195

.. code-block:: Python

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder

    from skrub import MinHashEncoder

    min_hash = MinHashEncoder(n_components=100)
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

    encoder = make_column_transformer(
        ("passthrough", ["Year"]),
        (ohe, ["Genre"]),
        (min_hash, ["Platform"]),
        remainder="drop",
    )
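As a quick sanity check (illustrative, not part of the original example), we
can fit the encoder on its own and inspect the shape of its output: one
passthrough column for "Year", one one-hot column per genre, and 100 MinHash
components for "Platform":

.. code-block:: Python

    # Illustrative: the number of output columns should be
    # 1 (Year) + number of distinct genres + 100 (MinHash components).
    encoded = encoder.fit_transform(X_full)
    print(encoded.shape)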
.. GENERATED FROM PYTHON SOURCE LINES 196-198

We incorporate our |ColumnTransformer| into a |Pipeline|.
We define our predictor, |HGBR|, which is fast and reliable for big datasets.

.. GENERATED FROM PYTHON SOURCE LINES 198-204

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    hgb = HistGradientBoostingRegressor(random_state=0)
    pipeline = make_pipeline(encoder, hgb)

.. GENERATED FROM PYTHON SOURCE LINES 205-206

The |Pipeline| can now be readily applied to the dataframe for prediction:

.. GENERATED FROM PYTHON SOURCE LINES 206-227

.. code-block:: Python

    from sklearn.model_selection import cross_validate

    # We will save the results in a dictionary:
    all_r2_scores = dict()
    all_rmse_scores = dict()

    cv_results = cross_validate(
        pipeline, X_full, y, scoring=["r2", "neg_root_mean_squared_error"]
    )

    all_r2_scores["Base features"] = cv_results["test_r2"]
    all_rmse_scores["Base features"] = -cv_results["test_neg_root_mean_squared_error"]

    print("With base features:")
    print(
        f"Mean R2 is {all_r2_scores['Base features'].mean():.2f} +-"
        f" {all_r2_scores['Base features'].std():.2f} and the RMSE is"
        f" {all_rmse_scores['Base features'].mean():.2f} +-"
        f" {all_rmse_scores['Base features'].std():.2f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    With base features:
    Mean R2 is 0.21 +- 0.01 and the RMSE is 1.30 +- 0.01

.. GENERATED FROM PYTHON SOURCE LINES 228-233

Prediction with KEN Embeddings
------------------------------

We will now build a second learning pipeline using only the KEN embeddings
from Wikipedia.

.. GENERATED FROM PYTHON SOURCE LINES 235-236

We keep only the embedding columns:

.. GENERATED FROM PYTHON SOURCE LINES 236-240

.. code-block:: Python

    encoder2 = make_column_transformer(
        ("passthrough", emb_columns), ("passthrough", emb_columns2), remainder="drop"
    )

.. GENERATED FROM PYTHON SOURCE LINES 241-242

We redefine the |Pipeline|:

.. GENERATED FROM PYTHON SOURCE LINES 242-244

.. code-block:: Python

    pipeline2 = make_pipeline(encoder2, hgb)

.. GENERATED FROM PYTHON SOURCE LINES 245-246

Let's look at the results:

.. GENERATED FROM PYTHON SOURCE LINES 246-261

.. code-block:: Python

    cv_results = cross_validate(
        pipeline2, X_full, y, scoring=["r2", "neg_root_mean_squared_error"]
    )

    all_r2_scores["KEN features"] = cv_results["test_r2"]
    all_rmse_scores["KEN features"] = -cv_results["test_neg_root_mean_squared_error"]

    print("With KEN Embeddings:")
    print(
        f"Mean R2 is {all_r2_scores['KEN features'].mean():.2f} +-"
        f" {all_r2_scores['KEN features'].std():.2f} and the RMSE is"
        f" {all_rmse_scores['KEN features'].mean():.2f} +-"
        f" {all_rmse_scores['KEN features'].std():.2f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    With KEN Embeddings:
    Mean R2 is 0.36 +- 0.01 and the RMSE is 1.17 +- 0.01

.. GENERATED FROM PYTHON SOURCE LINES 262-264

Including the embeddings is clearly relevant to the prediction task at hand!
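At this point we have cross-validated scores for two feature sets stored in
our dictionaries. As a quick illustrative recap (not part of the original
example), we can print them side by side:

.. code-block:: Python

    # Illustrative: compare the mean cross-validated scores collected so far.
    for name in all_r2_scores:
        print(
            f"{name}: R2 = {all_r2_scores[name].mean():.2f}, "
            f"RMSE = {all_rmse_scores[name].mean():.2f}"
        )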
.. GENERATED FROM PYTHON SOURCE LINES 266-272

Prediction with KEN Embeddings and base features
------------------------------------------------

Having measured the prediction scores with the base features alone and with
the embeddings alone, we now make a final prediction with all variables
included.

.. GENERATED FROM PYTHON SOURCE LINES 274-275

We include both the embeddings and the base features:

.. GENERATED FROM PYTHON SOURCE LINES 275-284

.. code-block:: Python

    encoder3 = make_column_transformer(
        ("passthrough", emb_columns),
        ("passthrough", emb_columns2),
        ("passthrough", ["Year"]),
        (ohe, ["Genre"]),
        (min_hash, ["Platform"]),
        remainder="drop",
    )

.. GENERATED FROM PYTHON SOURCE LINES 285-286

We redefine the |Pipeline|:

.. GENERATED FROM PYTHON SOURCE LINES 286-288

.. code-block:: Python

    pipeline3 = make_pipeline(encoder3, hgb)

.. GENERATED FROM PYTHON SOURCE LINES 289-290

Let's look at the results:

.. GENERATED FROM PYTHON SOURCE LINES 290-305

.. code-block:: Python

    cv_results = cross_validate(
        pipeline3, X_full, y, scoring=["r2", "neg_root_mean_squared_error"]
    )

    all_r2_scores["Base + KEN features"] = cv_results["test_r2"]
    all_rmse_scores["Base + KEN features"] = -cv_results["test_neg_root_mean_squared_error"]

    print("With KEN Embeddings and base features:")
    print(
        f"Mean R2 is {all_r2_scores['Base + KEN features'].mean():.2f} +-"
        f" {all_r2_scores['Base + KEN features'].std():.2f} and the RMSE is"
        f" {all_rmse_scores['Base + KEN features'].mean():.2f} +-"
        f" {all_rmse_scores['Base + KEN features'].std():.2f}"
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    With KEN Embeddings and base features:
    Mean R2 is 0.49 +- 0.01 and the RMSE is 1.04 +- 0.01

.. GENERATED FROM PYTHON SOURCE LINES 306-310

Plotting the results
....................

Finally, we plot the scores on a boxplot:

.. GENERATED FROM PYTHON SOURCE LINES 310-317

.. code-block:: Python

    plt.figure(figsize=(5, 3))
    # sphinx_gallery_thumbnail_number = -1
    ax = sns.boxplot(data=pd.DataFrame(all_r2_scores), orient="h")
    plt.xlabel("Prediction accuracy", size=15)
    plt.yticks(size=15)
    plt.tight_layout()

.. image-sg:: /auto_examples/images/sphx_glr_06_ken_embeddings_003.png
   :alt: 06 ken embeddings
   :srcset: /auto_examples/images/sphx_glr_06_ken_embeddings_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 318-327

There is a clear improvement when the KEN embeddings are included among the
explanatory variables. Here, the embeddings from Wikipedia bring in background
information on both the game and its publisher that would otherwise be missed,
and this significantly improves the prediction score.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (13 minutes 4.578 seconds)


.. _sphx_glr_download_auto_examples_06_ken_embeddings.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.1.0?urlpath=lab/tree/notebooks/auto_examples/06_ken_embeddings.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/?path=auto_examples/06_ken_embeddings.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 06_ken_embeddings.ipynb <06_ken_embeddings.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 06_ken_embeddings.py <06_ken_embeddings.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_