.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/04_fuzzy_joining.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via JupyterLite or Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_04_fuzzy_joining.py: Fuzzy joining dirty tables with the Joiner ========================================== Here we show how to combine data from different sources, with a vocabulary not well normalized. Joining is difficult: one entry on one side does not have an exact match on the other side. The |fj| function enables to join tables without cleaning the data by accounting for the label variations. To illustrate, we will join data from the `2022 World Happiness Report `_, with tables provided in `the World Bank open data platform `_ in order to create a first prediction model. Moreover, the |joiner| is a scikit-learn Transformer that makes it easy to use such fuzzy joining multiple tables to bring in information in a machine-learning pipeline. In particular, it enables tuning parameters of |fj| to find the matches that maximize prediction accuracy. .. |fj| replace:: :func:`~skrub.fuzzy_join` .. |joiner| replace:: :func:`~skrub.Joiner` .. GENERATED FROM PYTHON SOURCE LINES 31-35 Data Importing and preprocessing -------------------------------- We import the happiness score table first: .. GENERATED FROM PYTHON SOURCE LINES 35-46 .. code-block:: Python import pandas as pd df = pd.read_csv( ( "https://raw.githubusercontent.com/skrub-data/datasets/" "master/data/Happiness_report_2022.csv" ), thousands=",", ) df.drop(df.tail(1).index, inplace=True) .. GENERATED FROM PYTHON SOURCE LINES 47-48 Let's look at the table: .. GENERATED FROM PYTHON SOURCE LINES 48-50 .. code-block:: Python df.head(3) .. raw:: html
RANK Country Happiness score Whisker-high Whisker-low Dystopia (1.83) + residual Explained by: GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption
0 1 Finland 7821.0 7886.0 7756.0 2518.0 1892.0 1258.0 775.0 736.0 109.0 534.0
1 2 Denmark 7636.0 7710.0 7563.0 2226.0 1953.0 1243.0 777.0 719.0 188.0 532.0
2 3 Iceland 7557.0 7651.0 7464.0 2320.0 1936.0 1320.0 803.0 718.0 270.0 191.0


.. GENERATED FROM PYTHON SOURCE LINES 51-55 This is a table that contains the happiness index of a country along with some of the possible explanatory factors: GDP per capita, Social support, Generosity etc. .. GENERATED FROM PYTHON SOURCE LINES 57-59 For the sake of this example, we only keep the country names and our variable of interest: the 'Happiness score'. .. GENERATED FROM PYTHON SOURCE LINES 59-61 .. code-block:: Python df = df[["Country", "Happiness score"]] .. GENERATED FROM PYTHON SOURCE LINES 62-71 Additional tables from other sources ------------------------------------ Now, we need to include explanatory factors from other sources, to complete our covariates (X table). Interesting tables can be found on `the World Bank open data platform `_, for which we have a downloading function: .. GENERATED FROM PYTHON SOURCE LINES 71-73 .. code-block:: Python from skrub.datasets import fetch_world_bank_indicator .. GENERATED FROM PYTHON SOURCE LINES 74-75 We extract the table containing GDP per capita by country: .. GENERATED FROM PYTHON SOURCE LINES 75-78 .. code-block:: Python gdp_per_capita = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X gdp_per_capita.head(3) .. raw:: html
Country Name GDP per capita (current US$)
0 Aruba 33300.838819
1 Africa Eastern and Southern 1644.062829
2 Afghanistan 355.777826


.. GENERATED FROM PYTHON SOURCE LINES 79-80 Then another table, with life expectancy by country: .. GENERATED FROM PYTHON SOURCE LINES 80-83 .. code-block:: Python life_exp = fetch_world_bank_indicator("SP.DYN.LE00.IN").X life_exp.head(3) .. raw:: html
Country Name Life expectancy at birth, total (years)
0 Aruba 74.992000
1 Africa Eastern and Southern 62.899031
2 Afghanistan 62.879000


.. GENERATED FROM PYTHON SOURCE LINES 84-85 And a table with legal rights strength by country: .. GENERATED FROM PYTHON SOURCE LINES 85-88 .. code-block:: Python legal_rights = fetch_world_bank_indicator("IC.LGL.CRED.XQ").X legal_rights.head(3) .. raw:: html
Country Name Strength of legal rights index (0=weak to 12=strong)
0 Africa Eastern and Southern 4.538462
1 Afghanistan 10.000000
2 Africa Western and Central 5.863636


.. GENERATED FROM PYTHON SOURCE LINES 89-95 A correspondence problem ------------------------ Alas, the entries for countries do not perfectly match between our original table (df), and those that we downloaded from the worldbank (gdp_per_capita): .. GENERATED FROM PYTHON SOURCE LINES 95-98 .. code-block:: Python df.sort_values(by="Country").tail(7) .. raw:: html
Country Happiness score
29 Uruguay 6474.0
52 Uzbekistan 6063.0
107 Venezuela 4925.0
76 Vietnam 5485.0
131 Yemen* 4197.0
136 Zambia 3760.0
143 Zimbabwe 2995.0


.. GENERATED FROM PYTHON SOURCE LINES 99-101 .. code-block:: Python gdp_per_capita.sort_values(by="Country Name").tail(7) .. raw:: html
Country Name GDP per capita (current US$)
253 Viet Nam 4163.514299
252 Virgin Islands (U.S.) 41976.008312
193 West Bank and Gaza 3789.327966
255 World 12687.741894
258 Yemen, Rep. 650.272218
260 Zambia 1456.901570
261 Zimbabwe 1676.821489


.. GENERATED FROM PYTHON SOURCE LINES 102-107 We can see that Yemen is written "Yemen*" on one side, and "Yemen, Rep." on the other. We also have entries that probably do not have correspondences: "World" on one side, whereas the other table only has country-level data. .. GENERATED FROM PYTHON SOURCE LINES 109-115 Joining tables with imperfect correspondence -------------------------------------------- We will now join our initial table, df, with the 3 additional ones that we have extracted. .. GENERATED FROM PYTHON SOURCE LINES 117-123 .. _example_fuzzy_join: 1. Joining GDP per capita table ............................... To join them with skrub, we only need to do the following: .. GENERATED FROM PYTHON SOURCE LINES 123-137 .. code-block:: Python from skrub import fuzzy_join augmented_df = fuzzy_join( df, # our table to join gdp_per_capita, # the table to join with left_on="Country", # the first join key column right_on="Country Name", # the second join key column add_match_info=True, ) augmented_df.tail(20) # We merged the first World Bank table to our initial one. .. raw:: html
Country Happiness score Country Name GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
126 Sri Lanka 4362.0 Sri Lanka 3354.383408 0.000000 0.000000 True
127 Madagascar* 4339.0 Madagascar 516.592616 0.616018 0.440370 True
128 Egypt 4288.0 Egypt, Arab Rep. 4295.407496 0.957242 0.684299 True
129 Chad* 4251.0 Chad 716.804381 0.921325 0.658623 True
130 Ethiopia 4241.0 Ethiopia 1027.585911 0.000000 0.000000 True
131 Yemen* 4197.0 Yemen, Rep. 650.272218 0.989239 0.707172 True
132 Mauritania* 4153.0 Mauritania 2065.155516 0.618608 0.442221 True
133 Jordan 4152.0 Jordan 4311.000046 0.000000 0.000000 True
134 Togo 4112.0 Togo 942.649902 0.000000 0.000000 True
135 India 3777.0 India 2410.888021 0.000000 0.000000 True
136 Zambia 3760.0 Zambia 1456.901570 0.000000 0.000000 True
137 Malawi 3750.0 Malawi 645.158692 0.000000 0.000000 True
138 Tanzania 3702.0 Tanzania 1192.766479 0.000000 0.000000 True
139 Sierra Leone 3574.0 Sierra Leone 475.795728 0.000000 0.000000 True
140 Lesotho* 3512.0 Lesotho 969.935760 0.719706 0.514492 True
141 Botswana* 3471.0 Botswana 7738.878804 0.638844 0.456687 True
142 Rwanda* 3268.0 Rwanda 966.232071 0.754071 0.539059 True
143 Zimbabwe 2995.0 Zimbabwe 1676.821489 0.000000 0.000000 True
144 Lebanon 2955.0 Lebanon 4136.146575 0.000000 0.000000 True
145 Afghanistan 2404.0 Afghanistan 355.777826 0.000000 0.000000 True


.. GENERATED FROM PYTHON SOURCE LINES 138-143 .. topic:: Note: We set the ``add_match_info`` parameter to `True` to show distances between the rows that have been matched, that we will use later to show what are the worst matches. .. GENERATED FROM PYTHON SOURCE LINES 145-161 We see that our |fj| succesfully identified the countries, even though some country names differ between tables. For instance, "Egypt" and "Egypt, Arab Rep." are correctly matched, as are "Lesotho*" and "Lesotho". .. topic:: Note: This would all be missed out if we were using other methods such as `pandas.merge `_, which can only find exact matches. In this case, to reach the best result, we would have to `manually` clean the data (e.g. remove the * after country name) and look for matching patterns in every observation. Let's do some more inspection of the merging done. .. GENERATED FROM PYTHON SOURCE LINES 164-166 Let's print the worst matches, which will give us an overview of the situation: .. GENERATED FROM PYTHON SOURCE LINES 166-169 .. code-block:: Python augmented_df.sort_values("skrub_Joiner_rescaled_distance").tail(10) .. raw:: html
Country Happiness score Country Name GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
124 Eswatini, Kingdom of* 4396.0 Eswatini 3986.886637 0.983023 0.702729 True
131 Yemen* 4197.0 Yemen, Rep. 650.272218 0.989239 0.707172 True
34 Slovakia 6391.0 Slovak Republic 21256.808427 0.989245 0.707177 True
63 Kyrgyzstan 5828.0 Kyrgyz Republic 1655.072698 0.990845 0.708320 True
109 Iran 4888.0 Iraq 5937.195466 1.055793 0.754750 True
25 Taiwan Province of China 6512.0 China 12720.216318 1.084546 0.775304 True
111 Turkey 4744.0 Turkiye 10674.504173 1.134734 0.811182 True
94 Laos 5140.0 Lao PDR 2054.430781 1.144947 0.818483 True
121 Palestinian Territories* 4483.0 Palau 12921.827321 1.310487 0.936821 True
87 Ivory Coast 5235.0 East Asia & Pacific 12930.691675 1.320558 0.944021 True


.. GENERATED FROM PYTHON SOURCE LINES 170-173 We see that some matches were unsuccesful (e.g "Palestinian Territories*" and "Palau"), because there is simply no match in the two tables. .. GENERATED FROM PYTHON SOURCE LINES 175-178 In this case, it is better to use the threshold parameter (``max_dist``) so as to include only precise-enough matches: .. GENERATED FROM PYTHON SOURCE LINES 178-188 .. code-block:: Python augmented_df = fuzzy_join( df, gdp_per_capita, left_on="Country", right_on="Country Name", max_dist=0.9, add_match_info=True, ) augmented_df.sort_values("skrub_Joiner_rescaled_distance", ascending=False).head() .. raw:: html
Country Happiness score Country Name GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
87 Ivory Coast 5235.0 NaN NaN 1.320558 0.944021 False
121 Palestinian Territories* 4483.0 NaN NaN 1.310487 0.936821 False
94 Laos 5140.0 Lao PDR 2054.430781 1.144947 0.818483 True
111 Turkey 4744.0 Turkiye 10674.504173 1.134734 0.811182 True
25 Taiwan Province of China 6512.0 China 12720.216318 1.084546 0.775304 True


.. GENERATED FROM PYTHON SOURCE LINES 189-191 Matches that are not available (or precise enough) are marked as `NaN`. We will remove them using the ``drop_unmatched`` parameter: .. GENERATED FROM PYTHON SOURCE LINES 191-204 .. code-block:: Python augmented_df = fuzzy_join( df, gdp_per_capita, left_on="Country", right_on="Country Name", drop_unmatched=True, max_dist=0.9, add_match_info=True, ) augmented_df.drop(columns=["Country Name"], inplace=True) .. GENERATED FROM PYTHON SOURCE LINES 205-207 We can finally plot and look at the link between GDP per capital and happiness: .. GENERATED FROM PYTHON SOURCE LINES 207-224 .. code-block:: Python import matplotlib.pyplot as plt import seaborn as sns sns.set_context("notebook") plt.figure(figsize=(4, 3)) ax = sns.regplot( data=augmented_df, x="GDP per capita (current US$)", y="Happiness score", lowess=True, ) ax.set_ylabel("Happiness index") ax.set_title("Is a higher GDP per capita linked to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_001.png :alt: Is a higher GDP per capita linked to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 225-229 It seems that the happiest countries are those having a high GDP per capita. However, unhappy countries do not have only low levels of GDP per capita. We have to search for other patterns. .. GENERATED FROM PYTHON SOURCE LINES 231-236 2. Joining life expectancy table ................................ Now let's include other information that may be relevant, such as in the life_exp table: .. GENERATED FROM PYTHON SOURCE LINES 236-249 .. code-block:: Python augmented_df = fuzzy_join( augmented_df, life_exp, left_on="Country", right_on="Country Name", max_dist=0.9, add_match_info=True, ) augmented_df.drop(columns=["Country Name"], inplace=True) augmented_df.head(3) .. raw:: html
Country Happiness score GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted Life expectancy at birth, total (years)
0 Finland 7821.0 50871.930451 0.0 0.0 True 81.187805
1 Denmark 7636.0 67790.053992 0.0 0.0 True 81.304878
2 Iceland 7557.0 73466.778667 0.0 0.0 True 82.170732


.. GENERATED FROM PYTHON SOURCE LINES 250-251 Let's plot this relation: .. GENERATED FROM PYTHON SOURCE LINES 251-263 .. code-block:: Python plt.figure(figsize=(4, 3)) fig = sns.regplot( data=augmented_df, x="Life expectancy at birth, total (years)", y="Happiness score", lowess=True, ) fig.set_ylabel("Happiness index") fig.set_title("Is a higher life expectancy linked to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_002.png :alt: Is a higher life expectancy linked to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 264-266 It seems the answer is yes! Countries with higher life expectancy are also happier. .. GENERATED FROM PYTHON SOURCE LINES 269-273 3. Joining legal rights strength table ...................................... And the table with a measure of legal rights strength in the country: .. GENERATED FROM PYTHON SOURCE LINES 273-286 .. code-block:: Python augmented_df = fuzzy_join( augmented_df, legal_rights, left_on="Country", right_on="Country Name", max_dist=0.9, add_match_info=True, ) augmented_df.drop(columns=["Country Name"], inplace=True) augmented_df.head(3) .. raw:: html
Country Happiness score GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted Life expectancy at birth, total (years) Strength of legal rights index (0=weak to 12=strong)
0 Finland 7821.0 50871.930451 0.0 0.0 True 81.187805 6.0
1 Denmark 7636.0 67790.053992 0.0 0.0 True 81.304878 8.0
2 Iceland 7557.0 73466.778667 0.0 0.0 True 82.170732 4.0


.. GENERATED FROM PYTHON SOURCE LINES 287-288 Let's take a look at their correspondence in a figure: .. GENERATED FROM PYTHON SOURCE LINES 288-300 .. code-block:: Python plt.figure(figsize=(4, 3)) fig = sns.regplot( data=augmented_df, x="Strength of legal rights index (0=weak to 12=strong)", y="Happiness score", lowess=True, ) fig.set_ylabel("Happiness index") fig.set_title("Does a country's legal rights strength lead to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_003.png :alt: Does a country's legal rights strength lead to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 301-303 From this plot, it is not clear that this measure of legal strength is linked to happiness. .. GENERATED FROM PYTHON SOURCE LINES 305-307 Great! Our joined table has become bigger and full of useful information. And now we are ready to apply a first machine learning model to it! .. GENERATED FROM PYTHON SOURCE LINES 309-314 Prediction model ---------------- We now separate our covariates (X), from the target (or exogenous) variables: y. .. GENERATED FROM PYTHON SOURCE LINES 314-317 .. code-block:: Python y = augmented_df["Happiness score"] X = augmented_df.drop(["Happiness score", "Country"], axis=1) .. GENERATED FROM PYTHON SOURCE LINES 318-319 Let us now define the model that will be used to predict the happiness score: .. GENERATED FROM PYTHON SOURCE LINES 319-326 .. code-block:: Python from sklearn.ensemble import HistGradientBoostingRegressor from sklearn.model_selection import KFold hgdb = HistGradientBoostingRegressor(random_state=0) cv = KFold(n_splits=5, shuffle=True, random_state=0) .. GENERATED FROM PYTHON SOURCE LINES 327-331 To evaluate our model, we will apply a `5-fold cross-validation`. We evaluate our model using the `R2` score. Let's finally assess the results of our models: .. GENERATED FROM PYTHON SOURCE LINES 331-339 .. code-block:: Python from sklearn.model_selection import cross_validate cv_results_t = cross_validate(hgdb, X, y, cv=cv, scoring="r2") cv_r2_t = cv_results_t["test_score"] print(f"Mean R² score is {cv_r2_t.mean():.2f} +- {cv_r2_t.std():.2f}") .. rst-class:: sphx-glr-script-out .. code-block:: none Mean R² score is 0.64 +- 0.09 .. GENERATED FROM PYTHON SOURCE LINES 340-351 We have a satisfying first result: an R² of 0.63! Data cleaning varies from dataset to dataset: there are as many ways to clean a table as there are errors. |fj| method is generalizable across all datasets. Data transformation is also often very costly in both time and ressources. |fj| is fast and easy-to-use. Now up to you, try improving our model by adding information into it and beating our result! .. GENERATED FROM PYTHON SOURCE LINES 353-360 Using the |joiner| to fuzzy join multiple tables -------------------------------------------- A convenient way to merge different tables from the World Bank to `X` in a scikit-learn Pipeline and tune the parameters is to use the |joiner|. The |joiner| is a transformer that can fuzzy-join a table on a main table. .. GENERATED FROM PYTHON SOURCE LINES 362-366 .. _example_joiner: Instantiating the transformer ............................. .. GENERATED FROM PYTHON SOURCE LINES 366-396 .. code-block:: Python y = df["Happiness score"] df = df.drop("Happiness score", axis=1) from sklearn.pipeline import make_pipeline from skrub import Joiner, SelectCols # We create a selector that we will insert at the end of our pipeline, to # select the relevant columns before fitting the regressor selector = SelectCols( [ "GDP per capita (current US$) gdp", "Life expectancy at birth, total (years) life_exp", "Strength of legal rights index (0=weak to 12=strong) legal_rights", ] ) # And we can now put together the pipeline pipeline = make_pipeline( Joiner(gdp_per_capita, main_key="Country", aux_key="Country Name", suffix=" gdp"), Joiner(life_exp, main_key="Country", aux_key="Country Name", suffix=" life_exp"), Joiner( legal_rights, main_key="Country", aux_key="Country Name", suffix=" legal_rights" ), selector, HistGradientBoostingRegressor(), ) .. GENERATED FROM PYTHON SOURCE LINES 397-400 And the best part is that we are now able to evaluate the parameters of the |fj|. For instance, the ``match_score`` was manually picked and can now be introduced into a grid search: .. GENERATED FROM PYTHON SOURCE LINES 400-414 .. code-block:: Python from sklearn.model_selection import GridSearchCV # We will test 2 possible values of max_dist: params = { "joiner-1__max_dist": [0.1, 0.9], "joiner-2__max_dist": [0.1, 0.9], "joiner-3__max_dist": [0.1, 0.9], } grid = GridSearchCV(pipeline, param_grid=params, cv=cv) grid.fit(df, y) print("Best parameters:", grid.best_params_) .. rst-class:: sphx-glr-script-out .. code-block:: none Best parameters: {'joiner-1__max_dist': 0.1, 'joiner-2__max_dist': 0.9, 'joiner-3__max_dist': 0.1} .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 13.196 seconds) .. _sphx_glr_download_auto_examples_04_fuzzy_joining.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.2.0?urlpath=lab/tree/notebooks/auto_examples/04_fuzzy_joining.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../lite/lab/?path=auto_examples/04_fuzzy_joining.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 04_fuzzy_joining.ipynb <04_fuzzy_joining.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 04_fuzzy_joining.py <04_fuzzy_joining.py>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_