.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/04_fuzzy_joining.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code or to run this example in your browser via JupyterLite or Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_04_fuzzy_joining.py: Fuzzy joining dirty tables with the Joiner ========================================== Here we show how to combine data from different sources, with a vocabulary not well normalized. Joining is difficult: one entry on one side does not have an exact match on the other side. The |fj| function enables to join tables without cleaning the data by accounting for the label variations. To illustrate, we will join data from the `2022 World Happiness Report `_, with tables provided in `the World Bank open data platform `_ in order to create a first prediction model. Moreover, the |joiner| is a scikit-learn Transformer that makes it easy to use such fuzzy joining multiple tables to bring in information in a machine-learning pipeline. In particular, it enables tuning parameters of |fj| to find the matches that maximize prediction accuracy. .. |fj| replace:: :func:`~skrub.fuzzy_join` .. |joiner| replace:: :func:`~skrub.Joiner` .. GENERATED FROM PYTHON SOURCE LINES 31-35 Data Importing and preprocessing -------------------------------- We import the happiness score table first: .. GENERATED FROM PYTHON SOURCE LINES 35-43 .. code-block:: Python import pandas as pd df = pd.read_csv( "https://raw.githubusercontent.com/skrub-data/datasets/master/data/Happiness_report_2022.csv", # noqa thousands=",", ) df.drop(df.tail(1).index, inplace=True) .. GENERATED FROM PYTHON SOURCE LINES 44-45 Let's look at the table: .. GENERATED FROM PYTHON SOURCE LINES 45-47 .. code-block:: Python df.head(3) .. raw:: html
RANK Country Happiness score Whisker-high Whisker-low Dystopia (1.83) + residual Explained by: GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption
0 1 Finland 7821.0 7886.0 7756.0 2518.0 1892.0 1258.0 775.0 736.0 109.0 534.0
1 2 Denmark 7636.0 7710.0 7563.0 2226.0 1953.0 1243.0 777.0 719.0 188.0 532.0
2 3 Iceland 7557.0 7651.0 7464.0 2320.0 1936.0 1320.0 803.0 718.0 270.0 191.0


.. GENERATED FROM PYTHON SOURCE LINES 48-52 This is a table that contains the happiness index of a country along with some of the possible explanatory factors: GDP per capita, Social support, Generosity etc. .. GENERATED FROM PYTHON SOURCE LINES 54-56 For the sake of this example, we only keep the country names and our variable of interest: the 'Happiness score'. .. GENERATED FROM PYTHON SOURCE LINES 56-58 .. code-block:: Python df = df[["Country", "Happiness score"]] .. GENERATED FROM PYTHON SOURCE LINES 59-68 Additional tables from other sources ------------------------------------ Now, we need to include explanatory factors from other sources, to complete our covariates (X table). Interesting tables can be found on `the World Bank open data platform `_, for which we have a downloading function: .. GENERATED FROM PYTHON SOURCE LINES 68-70 .. code-block:: Python from skrub.datasets import fetch_world_bank_indicator .. GENERATED FROM PYTHON SOURCE LINES 71-72 We extract the table containing GDP per capita by country: .. GENERATED FROM PYTHON SOURCE LINES 72-75 .. code-block:: Python gdppc = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X gdppc.head(3) .. raw:: html
Country Name GDP per capita (current US$)
0 Aruba 29342.100730
1 Africa Eastern and Southern 1622.391720
2 Afghanistan 363.674087


.. GENERATED FROM PYTHON SOURCE LINES 76-77 Then another table, with life expectancy by country: .. GENERATED FROM PYTHON SOURCE LINES 77-80 .. code-block:: Python life_exp = fetch_world_bank_indicator("SP.DYN.LE00.IN").X life_exp.head(3) .. raw:: html
Country Name Life expectancy at birth, total (years)
0 Aruba 74.62600
1 Africa Eastern and Southern 62.45459
2 Afghanistan 61.98200


.. GENERATED FROM PYTHON SOURCE LINES 81-82 And a table with legal rights strength by country: .. GENERATED FROM PYTHON SOURCE LINES 82-85 .. code-block:: Python legal_rights = fetch_world_bank_indicator("IC.LGL.CRED.XQ").X legal_rights.head(3) .. raw:: html
Country Name Strength of legal rights index (0=weak to 12=strong)
0 Africa Eastern and Southern 4.538462
1 Afghanistan 10.000000
2 Africa Western and Central 5.863636


.. GENERATED FROM PYTHON SOURCE LINES 86-92 A correspondence problem ------------------------ Alas, the entries for countries do not perfectly match between our original table (df), and those that we downloaded from the worldbank (gdppc): .. GENERATED FROM PYTHON SOURCE LINES 92-95 .. code-block:: Python df.sort_values(by="Country").tail(7) .. raw:: html
Country Happiness score
29 Uruguay 6474.0
52 Uzbekistan 6063.0
107 Venezuela 4925.0
76 Vietnam 5485.0
131 Yemen* 4197.0
136 Zambia 3760.0
143 Zimbabwe 2995.0


.. GENERATED FROM PYTHON SOURCE LINES 96-98 .. code-block:: Python gdppc.sort_values(by="Country Name").tail(7) .. raw:: html
Country Name GDP per capita (current US$)
253 Viet Nam 4163.514300
252 Virgin Islands (U.S.) 39552.168595
193 West Bank and Gaza 3789.327966
255 World 12647.480789
258 Yemen, Rep. 676.928385
260 Zambia 1487.907764
261 Zimbabwe 1266.996031


.. GENERATED FROM PYTHON SOURCE LINES 99-104 We can see that Yemen is written "Yemen*" on one side, and "Yemen, Rep." on the other. We also have entries that probably do not have correspondences: "World" on one side, whereas the other table only has country-level data. .. GENERATED FROM PYTHON SOURCE LINES 106-112 Joining tables with imperfect correspondence -------------------------------------------- We will now join our initial table, df, with the 3 additional ones that we have extracted. .. GENERATED FROM PYTHON SOURCE LINES 114-120 .. _example_fuzzy_join: 1. Joining GDP per capita table ............................... To join them with skrub, we only need to do the following: .. GENERATED FROM PYTHON SOURCE LINES 120-133 .. code-block:: Python from skrub import fuzzy_join df1 = fuzzy_join( df, # our table to join gdppc, # the table to join with left_on="Country", # the first join key column right_on="Country Name", # the second join key column return_score=True, ) df1.tail(20) # We merged the first World Bank table to our initial one. .. raw:: html
Country Happiness score Country Name GDP per capita (current US$) matching_score
126 Sri Lanka 4362.0 Sri Lanka 3354.383408 1.000000
127 Madagascar* 4339.0 Madagascar 505.035528 0.558621
128 Egypt 4288.0 Egypt, Arab Rep. 4295.407496 0.255220
129 Chad* 4251.0 Chad 716.804381 0.318124
130 Ethiopia 4241.0 Ethiopia 1027.585918 1.000000
131 Yemen* 4197.0 Yemen, Rep. 676.928385 0.254184
132 Mauritania* 4153.0 Mauritania 2190.700205 0.592422
133 Jordan 4152.0 Jordan 4204.505640 1.000000
134 Togo 4112.0 Togo 918.376756 1.000000
135 India 3777.0 India 2388.621198 1.000000
136 Zambia 3760.0 Zambia 1487.907764 1.000000
137 Malawi 3750.0 Malawi 645.158692 1.000000
138 Tanzania 3702.0 Tanzania 1192.403809 1.000000
139 Sierra Leone 3574.0 Sierra Leone 461.361138 1.000000
140 Lesotho* 3512.0 Lesotho 1107.395298 0.472890
141 Botswana* 3471.0 Botswana 7737.654681 0.560316
142 Rwanda* 3268.0 Rwanda 966.327110 0.471537
143 Zimbabwe 2995.0 Zimbabwe 1266.996031 1.000000
144 Lebanon 2955.0 Lebanon 4136.146575 1.000000
145 Afghanistan 2404.0 Afghanistan 363.674087 1.000000


.. GENERATED FROM PYTHON SOURCE LINES 134-138 .. topic:: Note: We fix the ``return_score`` parameter to `True` so as to keep the matching score, that we will use later to show what are the worst matches. .. GENERATED FROM PYTHON SOURCE LINES 140-156 We see that our |fj| succesfully identified the countries, even though some country names differ between tables. For instance, "Czechia" is well identified as "Czech Republic" and "Luxembourg*" as "Luxembourg". .. topic:: Note: This would all be missed out if we were using other methods such as `pandas.merge `_, # noqa which can only find exact matches. In this case, to reach the best result, we would have to `manually` clean the data (e.g. remove the * after country name) and look for matching patterns in every observation. Let's do some more inspection of the merging done. .. GENERATED FROM PYTHON SOURCE LINES 159-161 Let's print the four worst matches, which will give us an overview of the situation: .. GENERATED FROM PYTHON SOURCE LINES 161-164 .. code-block:: Python df1.sort_values("matching_score").head(4) .. raw:: html
Country Happiness score Country Name GDP per capita (current US$) matching_score
87 Ivory Coast 5235.0 East Asia & Pacific 12906.694014 0.000000
121 Palestinian Territories* 4483.0 Palau 12083.888149 0.002512
94 Laos 5140.0 Lao PDR 2088.377182 0.124818
111 Turkey 4744.0 Turkiye 10616.061045 0.134366


.. GENERATED FROM PYTHON SOURCE LINES 165-168 We see that some matches were unsuccesful (e.g "Palestinian Territories*" and "Palau"), because there is simply no match in the two tables. .. GENERATED FROM PYTHON SOURCE LINES 170-173 In this case, it is better to use the threshold parameter so as to include only precise-enough matches: .. GENERATED FROM PYTHON SOURCE LINES 173-183 .. code-block:: Python df1 = fuzzy_join( df, gdppc, left_on="Country", right_on="Country Name", match_score=0.1, return_score=True, ) df1.sort_values("matching_score").head(4) .. raw:: html
Country Happiness score Country Name GDP per capita (current US$) matching_score
87 Ivory Coast 5235.0 <NA> <NA> 0.000000
121 Palestinian Territories* 4483.0 <NA> <NA> 0.002512
94 Laos 5140.0 Lao PDR 2088.377182 0.124818
111 Turkey 4744.0 Turkiye 10616.061045 0.134366


.. GENERATED FROM PYTHON SOURCE LINES 184-186 Matches that are not available (or precise enough) are marked as `NaN`. We will remove them using the drop_unmatched parameter: .. GENERATED FROM PYTHON SOURCE LINES 186-198 .. code-block:: Python df1 = fuzzy_join( df, gdppc, left_on="Country", right_on="Country Name", match_score=0.1, drop_unmatched=True, ) df1.drop(columns=["Country Name"], inplace=True) .. GENERATED FROM PYTHON SOURCE LINES 199-201 We can finally plot and look at the link between GDP per capital and happiness: .. GENERATED FROM PYTHON SOURCE LINES 201-218 .. code-block:: Python import matplotlib.pyplot as plt import seaborn as sns sns.set_context("notebook") plt.figure(figsize=(4, 3)) ax = sns.regplot( data=df1, x="GDP per capita (current US$)", y="Happiness score", lowess=True, ) ax.set_ylabel("Happiness index") ax.set_title("Is a higher GDP per capita linked to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_001.png :alt: Is a higher GDP per capita linked to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 219-223 It seems that the happiest countries are those having a high GDP per capita. However, unhappy countries do not have only low levels of GDP per capita. We have to search for other patterns. .. GENERATED FROM PYTHON SOURCE LINES 225-230 2. Joining life expectancy table ................................ Now let's include other information that may be relevant, such as in the life_exp table: .. GENERATED FROM PYTHON SOURCE LINES 230-242 .. code-block:: Python df2 = fuzzy_join( df1, life_exp, left_on="Country", right_on="Country Name", match_score=0.1, ) df2.drop(columns=["Country Name"], inplace=True) df2.head(3) .. raw:: html
Country Happiness score GDP per capita (current US$) Life expectancy at birth, total (years)
0 Finland 7821.0 50536.624467 81.934146
1 Denmark 7636.0 66983.132002 81.404878
2 Iceland 7557.0 72902.979954 83.117073


.. GENERATED FROM PYTHON SOURCE LINES 243-244 Let's plot this relation: .. GENERATED FROM PYTHON SOURCE LINES 244-256 .. code-block:: Python plt.figure(figsize=(4, 3)) fig = sns.regplot( data=df2, x="Life expectancy at birth, total (years)", y="Happiness score", lowess=True, ) fig.set_ylabel("Happiness index") fig.set_title("Is a higher life expectancy linked to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_002.png :alt: Is a higher life expectancy linked to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 257-259 It seems the answer is yes! Countries with higher life expectancy are also happier. .. GENERATED FROM PYTHON SOURCE LINES 262-266 3. Joining legal rights strength table ...................................... And the table with a measure of legal rights strength in the country: .. GENERATED FROM PYTHON SOURCE LINES 266-278 .. code-block:: Python df3 = fuzzy_join( df2, legal_rights, left_on="Country", right_on="Country Name", match_score=0.1, ) df3.drop(columns=["Country Name"], inplace=True) df3.head(3) .. raw:: html
Country Happiness score GDP per capita (current US$) Life expectancy at birth, total (years) Strength of legal rights index (0=weak to 12=strong)
0 Finland 7821.0 50536.624467 81.934146 6.0
1 Denmark 7636.0 66983.132002 81.404878 8.0
2 Iceland 7557.0 72902.979954 83.117073 4.0


.. GENERATED FROM PYTHON SOURCE LINES 279-280 Let's take a look at their correspondence in a figure: .. GENERATED FROM PYTHON SOURCE LINES 280-292 .. code-block:: Python plt.figure(figsize=(4, 3)) fig = sns.regplot( data=df3, x="Strength of legal rights index (0=weak to 12=strong)", y="Happiness score", lowess=True, ) fig.set_ylabel("Happiness index") fig.set_title("Does a country's legal rights strength lead to happiness?") plt.tight_layout() plt.show() .. image-sg:: /auto_examples/images/sphx_glr_04_fuzzy_joining_003.png :alt: Does a country's legal rights strength lead to happiness? :srcset: /auto_examples/images/sphx_glr_04_fuzzy_joining_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 293-295 From this plot, it is not clear that this measure of legal strength is linked to happiness. .. GENERATED FROM PYTHON SOURCE LINES 297-299 Great! Our joined table has become bigger and full of useful information. And now we are ready to apply a first machine learning model to it! .. GENERATED FROM PYTHON SOURCE LINES 301-306 Prediction model ---------------- We now separate our covariates (X), from the target (or exogenous) variables: y .. GENERATED FROM PYTHON SOURCE LINES 306-309 .. code-block:: Python y = df3["Happiness score"] X = df3.drop(["Happiness score", "Country"], axis=1) .. GENERATED FROM PYTHON SOURCE LINES 310-311 Let us now define the model that will be used to predict the happiness score: .. GENERATED FROM PYTHON SOURCE LINES 311-318 .. code-block:: Python from sklearn.ensemble import HistGradientBoostingRegressor from sklearn.model_selection import KFold hgdb = HistGradientBoostingRegressor(random_state=0) cv = KFold(n_splits=5, shuffle=True, random_state=0) .. GENERATED FROM PYTHON SOURCE LINES 319-323 To evaluate our model, we will apply a `5-fold cross-validation`. We evaluate our model using the `R2` score. Let's finally assess the results of our models: .. GENERATED FROM PYTHON SOURCE LINES 323-331 .. code-block:: Python from sklearn.model_selection import cross_validate cv_results_t = cross_validate(hgdb, X, y, cv=cv, scoring="r2") cv_r2_t = cv_results_t["test_score"] print(f"Mean R² score is {cv_r2_t.mean():.2f} +- {cv_r2_t.std():.2f}") .. rst-class:: sphx-glr-script-out .. code-block:: none Mean R² score is 0.63 +- 0.10 .. GENERATED FROM PYTHON SOURCE LINES 332-343 We have a satisfying first result: an R² of 0.63! Data cleaning varies from dataset to dataset: there are as many ways to clean a table as there are errors. |fj| method is generalizable across all datasets. Data transformation is also often very costly in both time and ressources. |fj| is fast and easy-to-use. Now up to you, try improving our model by adding information into it and beating our result! .. GENERATED FROM PYTHON SOURCE LINES 345-352 Using the |joiner| to fuzzy join multiple tables -------------------------------------------- A faster way to merge different tables from the World Bank to `X` is to use the |joiner|. The |joiner| is a transformer that can easily chain joins of tables on a main table. .. GENERATED FROM PYTHON SOURCE LINES 354-358 .. _example_joiner: Instantiating the transformer ............................. .. GENERATED FROM PYTHON SOURCE LINES 358-385 .. code-block:: Python y = df["Happiness score"] df = df.drop("Happiness score", axis=1) from skrub import Joiner, SelectCols, DropCols from sklearn.pipeline import make_pipeline # We create a selector that we will insert at the end of our pipeline, to # select the relevant columns before fitting the regressor pipeline = make_pipeline( Joiner(gdppc, main_key="Country", aux_key="Country Name"), DropCols("Country Name"), Joiner(life_exp, main_key="Country", aux_key="Country Name"), DropCols("Country Name"), Joiner(legal_rights, main_key="Country", aux_key="Country Name"), SelectCols( [ "GDP per capita (current US$)", "Life expectancy at birth, total (years)", "Strength of legal rights index (0=weak to 12=strong)", ] ), HistGradientBoostingRegressor(), ) .. GENERATED FROM PYTHON SOURCE LINES 386-389 And the best part is that we are now able to evaluate the parameters of the |fj|. For instance, the ``match_score`` was manually picked and can now be introduced into a grid search: .. GENERATED FROM PYTHON SOURCE LINES 389-406 .. code-block:: Python from sklearn.model_selection import GridSearchCV # We will test four possible values of match_score: params = { "joiner-1__match_score": [0.1, 0.9], "joiner-2__match_score": [0.1, 0.9], "joiner-3__match_score": [0.1, 0.9], } grid = GridSearchCV(pipeline, param_grid=params, cv=cv) grid.fit(df, y) print("Best parameters:", grid.best_params_) # The gridsearch selects a stricter threshold on the matching_score than what # we had set manually for the GDP and legal rights joins. .. rst-class:: sphx-glr-script-out .. code-block:: none Best parameters: {'joiner-1__match_score': 0.9, 'joiner-2__match_score': 0.1, 'joiner-3__match_score': 0.9} .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 28.247 seconds) .. _sphx_glr_download_auto_examples_04_fuzzy_joining.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/04_fuzzy_joining.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../lite/lab/?path=auto_examples/04_fuzzy_joining.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 04_fuzzy_joining.ipynb <04_fuzzy_joining.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 04_fuzzy_joining.py <04_fuzzy_joining.py>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_