.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/07_multiple_key_join.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via JupyterLite or Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_07_multiple_key_join.py: .. _example_multiple_key_join : Spatial join for flight data: Joining across multiple columns ============================================================= Joining tables may be difficult if one entry on one side does not have an exact match on the other side. This problem becomes even more complex when multiple columns are significant for the join. For instance, this is the case for **spatial joins** on two columns, typically longitude and latitude. |joiner| is a scikit-learn compatible transformer that enables performing joins across multiple keys, independently of the data type (numerical, string or mixed). The following example uses US domestic flights data to illustrate how space and time information from a pool of tables are combined for machine learning. .. |fj| replace:: :func:`~skrub.fuzzy_join` .. |joiner| replace:: :func:`~skrub.Joiner` .. |Pipeline| replace:: :class:`~sklearn.pipeline.Pipeline` .. GENERATED FROM PYTHON SOURCE LINES 32-38 Flight-delays data ------------------ The goal is to predict flight delays. We have a pool of tables that we will use to improve our prediction. The following tables are at our disposal: .. GENERATED FROM PYTHON SOURCE LINES 40-45 The main table: flights dataset ............................... - The `flights` datasets. It contains all US flights date, origin and destination airports and flight time. Here, we consider only flights from 2008. .. GENERATED FROM PYTHON SOURCE LINES 45-58 .. code-block:: Python import pandas as pd from skrub.datasets import fetch_flight_delays dataset = fetch_flight_delays() seed = 1 flights = dataset.flights # Sampling for faster computation. flights = flights.sample(5_000, random_state=seed, ignore_index=True) flights.head() .. rst-class:: sphx-glr-script-out .. code-block:: none Downloading 'flight_delays' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/flight_delays.zip (attempt 1/3) .. raw:: html
Year_Month_DayofMonth DayOfWeek CRSDepTime CRSArrTime UniqueCarrier FlightNum TailNum CRSElapsedTime ArrDelay Origin Dest Distance
0 2008-01-13 7 1900-01-01 18:35:00 1900-01-01 20:08:00 CO 150 N17244 213.0 1.0 IAH ONT 1334.0
1 2008-02-21 4 1900-01-01 14:30:00 1900-01-01 16:06:00 NW 807 N590NW 216.0 2.0 MSP SEA 1399.0
2 2008-03-26 3 1900-01-01 07:00:00 1900-01-01 09:38:00 US 455 N627AW 98.0 -1.0 PHX SLC 507.0
3 2008-01-03 4 1900-01-01 08:40:00 1900-01-01 12:03:00 CO 287 N21723 383.0 46.0 EWR SNA 2433.0
4 2008-01-31 4 1900-01-01 12:50:00 1900-01-01 14:10:00 MQ 3157 N848AE 80.0 -14.0 SJC SNA 342.0


.. GENERATED FROM PYTHON SOURCE LINES 59-60 Let us see the arrival delay of the flights in the dataset: .. GENERATED FROM PYTHON SOURCE LINES 60-69 .. code-block:: Python import matplotlib.pyplot as plt import seaborn as sns sns.set_theme(style="ticks") ax = sns.histplot(data=flights, x="ArrDelay") ax.set_yscale("log") plt.show() .. image-sg:: /auto_examples/images/sphx_glr_07_multiple_key_join_001.png :alt: 07 multiple key join :srcset: /auto_examples/images/sphx_glr_07_multiple_key_join_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 70-72 Interesting, most delays are relatively short (<100 min), but there are some very long ones. .. GENERATED FROM PYTHON SOURCE LINES 74-78 Airport data: an auxiliary table from the same database ....................................................... - The ``airports`` dataset, with information such as their name and location (longitude, latitude). .. GENERATED FROM PYTHON SOURCE LINES 78-82 .. code-block:: Python airports = dataset.airports airports.head() .. raw:: html
iata airport city state country lat long
0 00M Thigpen Bay Springs MS USA 31.953765 -89.234505
1 00R Livingston Municipal Livingston TX USA 30.685861 -95.017928
2 00V Meadow Lake Colorado Springs CO USA 38.945749 -104.569893
3 01G Perry-Warsaw Perry NY USA 42.741347 -78.052081
4 01J Hilliard Airpark Hilliard FL USA 30.688012 -81.905944


.. GENERATED FROM PYTHON SOURCE LINES 83-88 Weather data: auxiliary tables from external sources .................................................... - The ``weather`` table. Weather details by measurement station. Both tables are from the Global Historical Climatology Network. Here, we consider only weather measurements from 2008. .. GENERATED FROM PYTHON SOURCE LINES 88-94 .. code-block:: Python weather = dataset.weather # Sampling for faster computation. weather = weather.sample(10_000, random_state=seed, ignore_index=True) weather.head() .. raw:: html
ID YEAR/MONTH/DAY TMAX PRCP SNOW
0 RPM00098325 2008-08-20 290.0 856.0 NaN
1 ASN00023820 2008-07-20 NaN 28.0 NaN
2 MXN00024056 2008-04-13 250.0 0.0 NaN
3 GME00126742 2008-11-06 116.0 4.0 NaN
4 ASN00074201 2008-04-12 NaN 0.0 NaN


.. GENERATED FROM PYTHON SOURCE LINES 95-97 - The ``stations`` dataset. Provides location of all the weather measurement stations in the US. .. GENERATED FROM PYTHON SOURCE LINES 97-101 .. code-block:: Python stations = dataset.stations stations.head() .. raw:: html
ID LATITUDE LONGITUDE ELEVATION STATE NAME GSN FLAG HCN/CRN FLAG WMO ID
0 ACW00011604 17.1167 -61.7833 10.1 ST JOHNS COOLIDGE FLD NaN NaN NaN NaN
1 ACW00011647 17.1333 -61.7833 19.2 ST JOHNS NaN NaN NaN NaN
2 AE000041196 25.3330 55.5170 34.0 SHARJAH INTER. AIRP NaN GSN 41196.0 NaN
3 AEM00041194 25.2550 55.3640 10.4 DUBAI INTL NaN NaN 41194.0 NaN
4 AEM00041217 24.4330 54.6510 26.8 ABU DHABI INTL NaN NaN 41217.0 NaN


.. GENERATED FROM PYTHON SOURCE LINES 102-105 Joining: feature augmentation across tables ------------------------------------------- First we join the stations with weather on the ID (exact join): .. GENERATED FROM PYTHON SOURCE LINES 105-109 .. code-block:: Python aux = pd.merge(stations, weather, on="ID") aux.head() .. raw:: html
ID LATITUDE LONGITUDE ELEVATION STATE NAME GSN FLAG HCN/CRN FLAG WMO ID YEAR/MONTH/DAY TMAX PRCP SNOW
0 AGE00147708 36.720 4.050 222.0 TIZI OUZOU NaN NaN 60395.0 NaN 2008-04-17 225.0 0.0 NaN
1 AGM00060403 36.467 7.467 228.0 GUELMA NaN NaN 60403.0 NaN 2008-09-17 324.0 0.0 NaN
2 AGM00060403 36.467 7.467 228.0 GUELMA NaN NaN 60403.0 NaN 2008-04-11 245.0 0.0 NaN
3 AGM00060419 36.276 6.620 690.4 MOHAMED BOUDIAF INTL NaN NaN 60419.0 NaN 2008-06-30 340.0 0.0 NaN
4 AGM00060430 36.300 2.233 721.0 MILIANA NaN NaN 60430.0 NaN 2008-08-17 323.0 0.0 NaN


.. GENERATED FROM PYTHON SOURCE LINES 110-112 Then we join this table with the airports so that we get all auxiliary tables into one. .. GENERATED FROM PYTHON SOURCE LINES 112-121 .. code-block:: Python from skrub import Joiner joiner = Joiner(airports, aux_key=["lat", "long"], main_key=["LATITUDE", "LONGITUDE"]) aux_augmented = joiner.fit_transform(aux) aux_augmented.head() .. raw:: html
ID LATITUDE LONGITUDE ELEVATION STATE NAME GSN FLAG HCN/CRN FLAG WMO ID YEAR/MONTH/DAY TMAX PRCP SNOW iata airport city state country lat long skrub_Joiner_distance skrub_Joiner_rescaled_distance skrub_Joiner_match_accepted
0 AGE00147708 36.720 4.050 222.0 TIZI OUZOU NaN NaN 60395.0 NaN 2008-04-17 225.0 0.0 NaN EPM Eastport Municipal Eastport ME USA 44.910111 -67.012694 3.259659 4.467246 True
1 AGM00060403 36.467 7.467 228.0 GUELMA NaN NaN 60403.0 NaN 2008-09-17 324.0 0.0 NaN EPM Eastport Municipal Eastport ME USA 44.910111 -67.012694 3.411334 4.675112 True
2 AGM00060403 36.467 7.467 228.0 GUELMA NaN NaN 60403.0 NaN 2008-04-11 245.0 0.0 NaN EPM Eastport Municipal Eastport ME USA 44.910111 -67.012694 3.411334 4.675112 True
3 AGM00060419 36.276 6.620 690.4 MOHAMED BOUDIAF INTL NaN NaN 60419.0 NaN 2008-06-30 340.0 0.0 NaN EPM Eastport Municipal Eastport ME USA 44.910111 -67.012694 3.382942 4.636201 True
4 AGM00060430 36.300 2.233 721.0 MILIANA NaN NaN 60430.0 NaN 2008-08-17 323.0 0.0 NaN EPM Eastport Municipal Eastport ME USA 44.910111 -67.012694 3.199924 4.385382 True


.. GENERATED FROM PYTHON SOURCE LINES 122-124 Joining airports with flights data: Let's instantiate another multiple key joiner on the date and the airport: .. GENERATED FROM PYTHON SOURCE LINES 124-133 .. code-block:: Python joiner = Joiner( aux_augmented, aux_key=["YEAR/MONTH/DAY", "iata"], main_key=["Year_Month_DayofMonth", "Origin"], ) flights.drop(columns=["TailNum", "FlightNum"]) .. raw:: html
Year_Month_DayofMonth DayOfWeek CRSDepTime CRSArrTime UniqueCarrier CRSElapsedTime ArrDelay Origin Dest Distance
0 2008-01-13 7 1900-01-01 18:35:00 1900-01-01 20:08:00 CO 213.0 1.0 IAH ONT 1334.0
1 2008-02-21 4 1900-01-01 14:30:00 1900-01-01 16:06:00 NW 216.0 2.0 MSP SEA 1399.0
2 2008-03-26 3 1900-01-01 07:00:00 1900-01-01 09:38:00 US 98.0 -1.0 PHX SLC 507.0
3 2008-01-03 4 1900-01-01 08:40:00 1900-01-01 12:03:00 CO 383.0 46.0 EWR SNA 2433.0
4 2008-01-31 4 1900-01-01 12:50:00 1900-01-01 14:10:00 MQ 80.0 -14.0 SJC SNA 342.0
... ... ... ... ... ... ... ... ... ... ...
4995 2008-04-01 2 1900-01-01 10:14:00 1900-01-01 10:45:00 EV 91.0 50.0 ATL PFN 247.0
4996 2008-02-25 1 1900-01-01 12:00:00 1900-01-01 13:30:00 AA 210.0 -2.0 DFW RNO 1345.0
4997 2008-01-20 7 1900-01-01 06:00:00 1900-01-01 07:30:00 AQ 90.0 -13.0 LAS OAK 407.0
4998 2008-03-14 5 1900-01-01 06:42:00 1900-01-01 08:04:00 XE 82.0 -16.0 ROC CLE 245.0
4999 2008-04-18 5 1900-01-01 19:38:00 1900-01-01 20:06:00 OO 88.0 -3.0 ICT DEN 419.0

5000 rows × 10 columns



.. GENERATED FROM PYTHON SOURCE LINES 134-139 Training data is then passed through a |Pipeline|: - We will combine all the information from our pool of tables into "flights", our main table. - We will use this main table to model the prediction of flight delay. .. GENERATED FROM PYTHON SOURCE LINES 139-150 .. code-block:: Python from sklearn.ensemble import HistGradientBoostingClassifier from sklearn.pipeline import make_pipeline from skrub import TableVectorizer tv = TableVectorizer() hgb = HistGradientBoostingClassifier() pipeline_hgb = make_pipeline(joiner, tv, hgb) .. GENERATED FROM PYTHON SOURCE LINES 151-152 We isolate our target variable and remove useless ID variables: .. GENERATED FROM PYTHON SOURCE LINES 152-156 .. code-block:: Python y = flights["ArrDelay"] X = flights.drop(columns=["ArrDelay"]) .. GENERATED FROM PYTHON SOURCE LINES 157-163 We want to frame this as a classification problem: suppose that your company is obliged to reimburse the ticket price if the flight is delayed. We have a binary classification problem: the flight was delayed (1) or not (0). .. GENERATED FROM PYTHON SOURCE LINES 163-167 .. code-block:: Python y = (y > 0).astype(int) y.value_counts() .. rst-class:: sphx-glr-script-out .. code-block:: none ArrDelay 0 2727 1 2273 Name: count, dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 168-169 The results: .. GENERATED FROM PYTHON SOURCE LINES 169-175 .. code-block:: Python from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed) pipeline_hgb.fit(X_train, y_train).score(X_test, y_test) .. rst-class:: sphx-glr-script-out .. code-block:: none 0.5504 .. GENERATED FROM PYTHON SOURCE LINES 176-184 Conclusion ---------- In this example, we have combined multiple tables with complex joins on imprecise and multiple-key correspondences. This is made easy by skrub's |Joiner| transformer. Our final cross-validated accuracy score is 0.55. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 27.180 seconds) .. _sphx_glr_download_auto_examples_07_multiple_key_join.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.5.4?urlpath=lab/tree/notebooks/auto_examples/07_multiple_key_join.ipynb :alt: Launch binder :width: 150 px .. container:: lite-badge .. image:: images/jupyterlite_badge_logo.svg :target: ../lite/lab/index.html?path=auto_examples/07_multiple_key_join.ipynb :alt: Launch JupyterLite :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: 07_multiple_key_join.ipynb <07_multiple_key_join.ipynb>` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: 07_multiple_key_join.py <07_multiple_key_join.py>` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: 07_multiple_key_join.zip <07_multiple_key_join.zip>` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_