Fuzzy joining dirty tables with the Joiner#

Here we show how to combine data from different sources whose vocabularies are not well normalized.

Such joins are difficult: an entry on one side often has no exact match on the other side.

The fuzzy_join() function makes it possible to join such tables without cleaning the data first, by accounting for variations in the labels.

To illustrate, we will join data from the 2022 World Happiness Report with tables from the World Bank open data platform, and use the result to build a first prediction model.

Moreover, the Joiner() is a scikit-learn transformer that makes it easy to fuzzy-join multiple tables and bring their information into a machine-learning pipeline. In particular, it allows tuning the parameters of fuzzy_join() to find the matches that maximize prediction accuracy.

Data importing and preprocessing#

We import the happiness score table first:

import pandas as pd

df = pd.read_csv(
    (
        "https://raw.githubusercontent.com/skrub-data/datasets/"
        "master/data/Happiness_report_2022.csv"
    ),
    thousands=",",
)
# Drop the last row of the file, which does not contain country data.
df.drop(df.tail(1).index, inplace=True)

Let’s look at the table:

RANK Country Happiness score Whisker-high Whisker-low Dystopia (1.83) + residual Explained by: GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption
0 1 Finland 7821.0 7886.0 7756.0 2518.0 1892.0 1258.0 775.0 736.0 109.0 534.0
1 2 Denmark 7636.0 7710.0 7563.0 2226.0 1953.0 1243.0 777.0 719.0 188.0 532.0
2 3 Iceland 7557.0 7651.0 7464.0 2320.0 1936.0 1320.0 803.0 718.0 270.0 191.0


This table contains the happiness index of each country along with some of the possible explanatory factors: GDP per capita, social support, generosity, etc.

For the sake of this example, we only keep the country names and our variable of interest: the ‘Happiness score’.

df = df[["Country", "Happiness score"]]

Additional tables from other sources#

Now we need to include explanatory factors from other sources to complete our covariates (the X table).

Interesting tables can be found on the World Bank open data platform, for which skrub provides a download function:

from skrub.datasets import fetch_world_bank_indicator

We extract the table containing GDP per capita by country:

gdp_per_capita = fetch_world_bank_indicator(indicator_id="NY.GDP.PCAP.CD").X
gdp_per_capita.head(3)
Country Name GDP per capita (current US$)
0 Aruba 33300.838819
1 Africa Eastern and Southern 1672.505957
2 Afghanistan 352.603733


Then another table, with life expectancy by country:

life_exp = fetch_world_bank_indicator(indicator_id="SP.DYN.LE00.IN").X
life_exp.head(3)
Country Name Life expectancy at birth, total (years)
0 Aruba 74.992000
1 Africa Eastern and Southern 62.899031
2 Afghanistan 62.879000


And a table with legal rights strength by country:

legal_rights = fetch_world_bank_indicator(indicator_id="IC.LGL.CRED.XQ").X
legal_rights.head(3)
Country Name Strength of legal rights index (0=weak to 12=strong)
0 Africa Eastern and Southern 4.538462
1 Afghanistan 10.000000
2 Africa Western and Central 5.863636


A correspondence problem#

Alas, the entries for countries do not perfectly match between our original table (df) and those that we downloaded from the World Bank (gdp_per_capita):

df.sort_values(by="Country").tail(7)
Country Happiness score
29 Uruguay 6474.0
52 Uzbekistan 6063.0
107 Venezuela 4925.0
76 Vietnam 5485.0
131 Yemen* 4197.0
136 Zambia 3760.0
143 Zimbabwe 2995.0


gdp_per_capita.sort_values(by="Country Name").tail(7)
Country Name GDP per capita (current US$)
253 Viet Nam 4346.768492
252 Virgin Islands (U.S.) 41976.008312
193 West Bank and Gaza 3367.606990
255 World 13138.327546
258 Yemen, Rep. 533.367123
260 Zambia 1369.129365
261 Zimbabwe 1592.416574


We can see that Yemen is written “Yemen*” on one side, and “Yemen, Rep.” on the other.

We also have entries that probably have no correspondence at all: “World”, for instance, appears in the World Bank table, whereas our happiness table only contains country-level data.
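To make the mismatch concrete, here is a minimal sanity-check sketch (plain pandas, on the tables loaded above) counting how many countries an exact join would retain:

# An exact merge silently drops every country whose spelling differs
# between the two tables ("Yemen*" vs "Yemen, Rep.", "Vietnam" vs "Viet Nam", ...).
exact = df.merge(gdp_per_capita, left_on="Country", right_on="Country Name")
print(f"{len(exact)} of {len(df)} countries matched exactly")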

Joining tables with imperfect correspondence#

We will now join our initial table, df, with the 3 additional ones that we have extracted.

1. Joining the GDP per capita table#

To join them with skrub, we only need to do the following:

from skrub import fuzzy_join

augmented_df = fuzzy_join(
    df,  # our main table to augment
    gdp_per_capita,  # the auxiliary table to join with
    left_on="Country",  # the join key in the left table
    right_on="Country Name",  # the join key in the right table
    add_match_info=True,
)

augmented_df.tail(20)

# We have merged the first World Bank table into our initial one.
Country Happiness score Country Name GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
126 Sri Lanka 4362.0 Sri Lanka 3827.964806 0.000000 0.000000 True
127 Madagascar* 4339.0 Madagascar 528.650155 0.616018 0.440370 True
128 Egypt 4288.0 Egypt, Arab Rep. 3512.580065 0.957242 0.684299 True
129 Chad* 4251.0 Chad 719.384875 0.921325 0.658623 True
130 Ethiopia 4241.0 Ethiopia 1293.778008 0.000000 0.000000 True
131 Yemen* 4197.0 Yemen, Rep. 533.367123 0.989239 0.707172 True
132 Mauritania* 4153.0 Mauritania 2149.414087 0.618608 0.442221 True
133 Jordan 4152.0 Jordan 4482.086026 0.000000 0.000000 True
134 Togo 4112.0 Togo 1012.973873 0.000000 0.000000 True
135 India 3777.0 India 2484.845429 0.000000 0.000000 True
136 Zambia 3760.0 Zambia 1369.129365 0.000000 0.000000 True
137 Malawi 3750.0 Malawi 672.869702 0.000000 0.000000 True
138 Tanzania 3702.0 Tanzania 1211.058594 0.000000 0.000000 True
139 Sierra Leone 3574.0 Sierra Leone 433.374174 0.000000 0.000000 True
140 Lesotho* 3512.0 Lesotho 878.008505 0.719706 0.514492 True
141 Botswana* 3471.0 Botswana 7249.799326 0.638844 0.456687 True
142 Rwanda* 3268.0 Rwanda 1000.218923 0.754071 0.539059 True
143 Zimbabwe 2995.0 Zimbabwe 1592.416574 0.000000 0.000000 True
144 Lebanon 2955.0 Lebanon 3823.938069 0.000000 0.000000 True
145 Afghanistan 2404.0 Afghanistan 352.603733 0.000000 0.000000 True


We see that our fuzzy_join() successfully identified the countries, even though some country names differ between tables.

For instance, “Egypt” and “Egypt, Arab Rep.” are correctly matched, as are “Lesotho*” and “Lesotho”.

Let’s inspect the merge in more detail.

Let’s print the worst matches, which will give us an overview of the situation:

augmented_df.sort_values("skrub_Joiner_rescaled_distance").tail(10)
Country Happiness score Country Name GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
124 Eswatini, Kingdom of* 4396.0 Eswatini 3797.301209 0.983023 0.702729 True
131 Yemen* 4197.0 Yemen, Rep. 533.367123 0.989239 0.707172 True
34 Slovakia 6391.0 Slovak Republic 24470.238538 0.989245 0.707177 True
63 Kyrgyzstan 5828.0 Kyrgyz Republic 1969.866481 0.990845 0.708320 True
109 Iran 4888.0 Iraq 5512.475720 1.055793 0.754750 True
25 Taiwan Province of China 6512.0 China 12614.060995 1.084546 0.775304 True
111 Turkey 4744.0 Turkiye 12985.753150 1.134734 0.811182 True
94 Laos 5140.0 Lao PDR 2075.401414 1.144947 0.818483 True
121 Palestinian Territories* 4483.0 Palau 14565.330253 1.310487 0.936821 True
87 Ivory Coast 5235.0 East Asia & Pacific 12927.683808 1.320558 0.944021 True


We see that some matches are unsuccessful (e.g. “Palestinian Territories*” matched to “Palau”), because some entries simply have no counterpart in the other table.

In such cases, it is better to use the max_dist threshold parameter to keep only precise-enough matches:

augmented_df = fuzzy_join(
    df,
    gdp_per_capita,
    left_on="Country",
    right_on="Country Name",
    max_dist=0.9,
    add_match_info=True,
)
augmented_df.sort_values("skrub_Joiner_rescaled_distance", ascending=False).head()
Country Happiness score Country Name GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
87 Ivory Coast 5235.0 NaN NaN 1.320558 0.944021 False
121 Palestinian Territories* 4483.0 NaN NaN 1.310487 0.936821 False
94 Laos 5140.0 Lao PDR 2075.401414 1.144947 0.818483 True
111 Turkey 4744.0 Turkiye 12985.753150 1.134734 0.811182 True
25 Taiwan Province of China 6512.0 China 12614.060995 1.084546 0.775304 True


Matches that are unavailable (or not precise enough) are marked as NaN. We can remove them with the drop_unmatched parameter:

augmented_df = fuzzy_join(
    df,
    gdp_per_capita,
    left_on="Country",
    right_on="Country Name",
    drop_unmatched=True,
    max_dist=0.9,
    add_match_info=True,
)

augmented_df.drop(columns=["Country Name"], inplace=True)

We can finally plot the link between GDP per capita and happiness:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("notebook")

plt.figure(figsize=(4, 3))
ax = sns.regplot(
    data=augmented_df,
    x="GDP per capita (current US$)",
    y="Happiness score",
    lowess=True,
)
ax.set_ylabel("Happiness index")
ax.set_title("Is a higher GDP per capita linked to happiness?")
plt.tight_layout()
plt.show()
[Figure: Is a higher GDP per capita linked to happiness?]

It seems that the happiest countries tend to have a high GDP per capita. However, unhappy countries are not confined to low levels of GDP per capita, so we have to look for other patterns.

2. Joining the life expectancy table#

Now let’s include other information that may be relevant, such as the life_exp table:

augmented_df = fuzzy_join(
    augmented_df,
    life_exp,
    left_on="Country",
    right_on="Country Name",
    max_dist=0.9,
    add_match_info=True,
)

augmented_df.drop(columns=["Country Name"], inplace=True)

augmented_df.head(3)
Country Happiness score GDP per capita (current US$) skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted Life expectancy at birth, total (years)
0 Finland 7821.0 53755.911736 0.0 0.0 True 81.187805
1 Denmark 7636.0 67967.381869 0.0 0.0 True 81.304878
2 Iceland 7557.0 78811.058392 0.0 0.0 True 82.170732


Let’s plot this relation:

plt.figure(figsize=(4, 3))
ax = sns.regplot(
    data=augmented_df,
    x="Life expectancy at birth, total (years)",
    y="Happiness score",
    lowess=True,
)
ax.set_ylabel("Happiness index")
ax.set_title("Is a higher life expectancy linked to happiness?")
plt.tight_layout()
plt.show()
[Figure: Is a higher life expectancy linked to happiness?]

It seems the answer is yes! Countries with higher life expectancy are also happier.

Prediction model#

We now separate our covariates (X) from the target variable (y):

y = augmented_df["Happiness score"]
X = augmented_df.drop(["Happiness score", "Country"], axis=1)

Let us now define the model that will be used to predict the happiness score:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import KFold

hgdb = HistGradientBoostingRegressor(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

To evaluate our model, we apply 5-fold cross-validation and measure the R² score.

Let’s finally assess the results of our model:

from sklearn.model_selection import cross_validate

cv_results_t = cross_validate(hgdb, X, y, cv=cv, scoring="r2")

cv_r2_t = cv_results_t["test_score"]

print(f"Mean R² score is {cv_r2_t.mean():.2f} +- {cv_r2_t.std():.2f}")
Mean R² score is 0.64 +- 0.08

We have a satisfying first result: an R² of 0.64!

Data cleaning varies from dataset to dataset: there are as many ways to clean a table as there are types of errors. The fuzzy_join() approach, by contrast, generalizes across datasets.

Data transformation is also often very costly in both time and resources. fuzzy_join() is fast and easy to use.

Now it’s up to you: try to improve our model by adding more information and beating our result! One possible starting point is sketched below.
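For instance, here is a hedged sketch of one extension, assuming the World Bank indicator SP.POP.TOTL (total population) as an illustrative extra covariate; whether it actually improves the score is for you to check:

# Fetch total population and fuzzy-join it in, reusing the settings from above.
population = fetch_world_bank_indicator(indicator_id="SP.POP.TOTL").X
augmented_df = fuzzy_join(
    augmented_df,
    population,
    left_on="Country",
    right_on="Country Name",
    max_dist=0.9,
    drop_unmatched=True,
)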

Using the Joiner() to fuzzy join multiple tables#

A convenient way to merge the different World Bank tables into X within a scikit-learn Pipeline, and to tune the join parameters, is to use the Joiner().

The Joiner() is a transformer that fuzzy-joins an auxiliary table onto a main table.
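Before assembling the full pipeline, here is a minimal sketch of a single Joiner() used on its own, applied to the tables defined above:

from skrub import Joiner

# Fuzzy-join the GDP table onto the main table; the suffix is appended to
# the names of the added columns to avoid name collisions.
joiner = Joiner(
    gdp_per_capita, main_key="Country", aux_key="Country Name", suffix=" gdp"
)
augmented = joiner.fit_transform(df)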

Instantiating the transformer#

y = df["Happiness score"]
df = df.drop("Happiness score", axis=1)

from sklearn.pipeline import make_pipeline

from skrub import Joiner, SelectCols

# We create a selector that we will insert into our pipeline, to keep
# only the relevant columns before fitting the regressor
selector = SelectCols(
    [
        "GDP per capita (current US$) gdp",
        "Life expectancy at birth, total (years) life_exp",
        "Strength of legal rights index (0=weak to 12=strong) legal_rights",
    ]
)

# And we can now put together the pipeline
pipeline = make_pipeline(
    Joiner(gdp_per_capita, main_key="Country", aux_key="Country Name", suffix=" gdp"),
    Joiner(life_exp, main_key="Country", aux_key="Country Name", suffix=" life_exp"),
    Joiner(
        legal_rights, main_key="Country", aux_key="Country Name", suffix=" legal_rights"
    ),
    selector,
    HistGradientBoostingRegressor(),
)

And the best part is that we are now able to tune the parameters of the fuzzy join. For instance, max_dist was picked manually above and can now be included in a grid search:

from sklearn.model_selection import GridSearchCV

# We will test 2 possible values of max_dist:
params = {
    "joiner-1__max_dist": [0.1, 0.9],
    "joiner-2__max_dist": [0.1, 0.9],
    "joiner-3__max_dist": [0.1, 0.9],
}

grid = GridSearchCV(pipeline, param_grid=params, cv=cv)
grid.fit(df, y)

print("Best parameters:", grid.best_params_)
Best parameters: {'joiner-1__max_dist': 0.1, 'joiner-2__max_dist': 0.9, 'joiner-3__max_dist': 0.1}
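We can also read off the cross-validated score achieved by the best parameter combination (with no scoring specified, GridSearchCV falls back on the regressor’s default R² score):

print(f"Best cross-validated R² score: {grid.best_score_:.2f}")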
