Spatial join for flight data: Joining across multiple columns#

Joining tables may be difficult if one entry on one side does not have an exact match on the other side.

This problem becomes even more complex when multiple columns are significant for the join. For instance, this is the case for spatial joins on two columns, typically longitude and latitude.

Joiner() is a scikit-learn compatible transformer that performs joins across multiple keys, regardless of the data type (numerical, string or mixed).
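To make the idea of a nearest-match join on two columns concrete, here is a minimal sketch in plain Python. It is not how Joiner works internally: the `haversine_km` and `nearest_join` helpers and the two station rows are hypothetical, while the three airport coordinates are taken from the airports table shown later in this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_join(main_rows, aux_rows):
    """Attach to each main row the closest aux row, using both key columns."""
    joined = []
    for row in main_rows:
        distances = [
            haversine_km(row["LATITUDE"], row["LONGITUDE"], aux["lat"], aux["long"])
            for aux in aux_rows
        ]
        best = min(range(len(aux_rows)), key=distances.__getitem__)
        joined.append({**row, **aux_rows[best], "distance_km": distances[best]})
    return joined


# Coordinates copied from the airports table of this example.
airports_toy = [
    {"iata": "00M", "lat": 31.953765, "long": -89.234505},
    {"iata": "00R", "lat": 30.685861, "long": -95.017928},
    {"iata": "00V", "lat": 38.945749, "long": -104.569893},
]
# Hypothetical station coordinates, made up for illustration.
stations_toy = [
    {"ID": "station_a", "LATITUDE": 32.0, "LONGITUDE": -89.0},
    {"ID": "station_b", "LATITUDE": 39.0, "LONGITUDE": -104.5},
]

matches = nearest_join(stations_toy, airports_toy)
for match in matches:
    print(match["ID"], "->", match["iata"], f"({match['distance_km']:.1f} km)")
```

Every main row gets a match here, however far away the nearest candidate is, which is why keeping the distance alongside the match is useful.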

The following example uses US domestic flights data to illustrate how space and time information from a pool of tables is combined for machine learning.

Flight-delays data#

The goal is to predict flight delays. We have a pool of tables that we will use to improve our prediction.

The following tables are at our disposal:

The main table: flights dataset#

The flights dataset contains the date, origin and destination airports, and flight time of all US flights. Here, we consider only flights from 2008.

import pandas as pd

from skrub.datasets import fetch_flight_delays

dataset = fetch_flight_delays()
seed = 1
flights = dataset.flights

# Sampling for faster computation.
flights = flights.sample(5_000, random_state=seed, ignore_index=True)
flights.head()

	Year_Month_DayofMonth	DayOfWeek	CRSDepTime	CRSArrTime	UniqueCarrier	FlightNum	TailNum	CRSElapsedTime	ArrDelay	Origin	Dest	Distance
0	2008-01-13	7	1900-01-01 18:35:00	1900-01-01 20:08:00	CO	150	N17244	213.0	1.0	IAH	ONT	1334.0
1	2008-02-21	4	1900-01-01 14:30:00	1900-01-01 16:06:00	NW	807	N590NW	216.0	2.0	MSP	SEA	1399.0
2	2008-03-26	3	1900-01-01 07:00:00	1900-01-01 09:38:00	US	455	N627AW	98.0	-1.0	PHX	SLC	507.0
3	2008-01-03	4	1900-01-01 08:40:00	1900-01-01 12:03:00	CO	287	N21723	383.0	46.0	EWR	SNA	2433.0
4	2008-01-31	4	1900-01-01 12:50:00	1900-01-01 14:10:00	MQ	3157	N848AE	80.0	-14.0	SJC	SNA	342.0

Let us see the arrival delay of the flights in the dataset:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="ticks")

ax = sns.histplot(data=flights, x="ArrDelay")
ax.set_yscale("log")
plt.show()

Interesting, most delays are relatively short (<100 min), but there are some very long ones.

Airport data: an auxiliary table from the same database#

The airports dataset contains information about each airport, such as its name and location (longitude, latitude).

airports = dataset.airports
airports.head()

	iata	airport	city	state	country	lat	long
0	00M	Thigpen	Bay Springs	MS	USA	31.953765	-89.234505
1	00R	Livingston Municipal	Livingston	TX	USA	30.685861	-95.017928
2	00V	Meadow Lake	Colorado Springs	CO	USA	38.945749	-104.569893
3	01G	Perry-Warsaw	Perry	NY	USA	42.741347	-78.052081
4	01J	Hilliard Airpark	Hilliard	FL	USA	30.688012	-81.905944

Weather data: auxiliary tables from external sources#

The weather table contains weather details by measurement station. Both this table and the stations table below come from the Global Historical Climatology Network. Here, we consider only weather measurements from 2008.

weather = dataset.weather
# Sampling for faster computation.
weather = weather.sample(10_000, random_state=seed, ignore_index=True)
weather.head()

	ID	YEAR/MONTH/DAY	TMAX	PRCP	SNOW
0	RPM00098325	2008-08-20	290.0	856.0	NaN
1	ASN00023820	2008-07-20	NaN	28.0	NaN
2	MXN00024056	2008-04-13	250.0	0.0	NaN
3	GME00126742	2008-11-06	116.0	4.0	NaN
4	ASN00074201	2008-04-12	NaN	0.0	NaN

The stations dataset provides the location of the weather measurement stations. As the preview below shows, it also includes stations outside the US.

stations = dataset.stations
stations.head()

	ID	LATITUDE	LONGITUDE	ELEVATION	STATE	NAME	GSN FLAG	HCN/CRN FLAG	WMO ID
0	ACW00011604	17.1167	-61.7833	10.1	ST JOHNS COOLIDGE FLD	NaN	NaN	NaN	NaN
1	ACW00011647	17.1333	-61.7833	19.2	ST JOHNS	NaN	NaN	NaN	NaN
2	AE000041196	25.3330	55.5170	34.0	SHARJAH INTER. AIRP	NaN	GSN	41196.0	NaN
3	AEM00041194	25.2550	55.3640	10.4	DUBAI INTL	NaN	NaN	41194.0	NaN
4	AEM00041217	24.4330	54.6510	26.8	ABU DHABI INTL	NaN	NaN	41217.0	NaN

Joining: feature augmentation across tables#

First we join the stations with weather on the ID (exact join):

aux = pd.merge(stations, weather, on="ID")
aux.head()

	ID	LATITUDE	LONGITUDE	ELEVATION	STATE	NAME	GSN FLAG	HCN/CRN FLAG	WMO ID	YEAR/MONTH/DAY	TMAX	SNOW
0	AGE00147708	36.720	4.050	222.0	TIZI OUZOU	NaN	NaN	60395.0	NaN	2008-04-17	225.0	NaN
1	AGM00060403	36.467	7.467	228.0	GUELMA	NaN	NaN	60403.0	NaN	2008-09-17	324.0	NaN
2	AGM00060403	36.467	7.467	228.0	GUELMA	NaN	NaN	60403.0	NaN	2008-04-11	245.0	NaN
3	AGM00060419	36.276	6.620	690.4	MOHAMED BOUDIAF INTL	NaN	NaN	60419.0	NaN	2008-06-30	340.0	NaN
4	AGM00060430	36.300	2.233	721.0	MILIANA	NaN	NaN	60430.0	NaN	2008-08-17	323.0	NaN
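Note that `pd.merge` defaults to an inner join, so stations without any measurement in the sampled weather table are dropped from `aux`. A minimal sketch on toy tables shows the difference between the default and a left join with `indicator=True`; the `_toy` frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables, made up for illustration.
stations_toy = pd.DataFrame({"ID": ["A", "B", "C"], "LATITUDE": [36.7, 36.5, 36.3]})
weather_toy = pd.DataFrame({"ID": ["A", "A", "D"], "TMAX": [225.0, 324.0, 245.0]})

# Default (inner) join: only IDs present in both tables are kept.
inner = pd.merge(stations_toy, weather_toy, on="ID")

# Left join: every station is kept; `_merge` tells which rows found a match.
left = pd.merge(stations_toy, weather_toy, on="ID", how="left", indicator=True)

print(len(inner), "rows after the inner join")
print(left["_merge"].value_counts())
```

Station "A" appears twice in both results because it has two measurements, which mirrors the repeated station rows in the `aux` preview above.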

Then we join this table with the airports so that we get all auxiliary tables into one.

from skrub import Joiner

joiner = Joiner(airports, aux_key=["lat", "long"], main_key=["LATITUDE", "LONGITUDE"])

aux_augmented = joiner.fit_transform(aux)

aux_augmented.head()

	ID	LATITUDE	LONGITUDE	ELEVATION	STATE	NAME	GSN FLAG	HCN/CRN FLAG	WMO ID	YEAR/MONTH/DAY	TMAX	SNOW	iata	airport	city	state	country	lat	long	skrub_Joiner_distance	skrub_Joiner_rescaled_distance	skrub_Joiner_match_accepted
0	AGE00147708	36.720	4.050	222.0	TIZI OUZOU	NaN	NaN	60395.0	NaN	2008-04-17	225.0	NaN	EPM	Eastport Municipal	Eastport	ME	USA	44.910111	-67.012694	3.259659	4.467246	True
1	AGM00060403	36.467	7.467	228.0	GUELMA	NaN	NaN	60403.0	NaN	2008-09-17	324.0	NaN	EPM	Eastport Municipal	Eastport	ME	USA	44.910111	-67.012694	3.411334	4.675112	True
2	AGM00060403	36.467	7.467	228.0	GUELMA	NaN	NaN	60403.0	NaN	2008-04-11	245.0	NaN	EPM	Eastport Municipal	Eastport	ME	USA	44.910111	-67.012694	3.411334	4.675112	True
3	AGM00060419	36.276	6.620	690.4	MOHAMED BOUDIAF INTL	NaN	NaN	60419.0	NaN	2008-06-30	340.0	NaN	EPM	Eastport Municipal	Eastport	ME	USA	44.910111	-67.012694	3.382942	4.636201	True
4	AGM00060430	36.300	2.233	721.0	MILIANA	NaN	NaN	60430.0	NaN	2008-08-17	323.0	NaN	EPM	Eastport Municipal	Eastport	ME	USA	44.910111	-67.012694	3.199924	4.385382	True
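The three `skrub_Joiner_*` columns describe the quality of each match. In the preview above, weather stations located in Algeria are paired with an airport in Maine, and every match is still marked as accepted, so it can be worth filtering on the distance before relying on the joined features. Below is a sketch of such a filter in pandas. The cutoff value and the second row are hypothetical; the first row reuses the rescaled distance from the preview. Joiner's documentation also describes a `max_dist` parameter for rejecting distant matches at join time, which may be preferable; check it against your installed skrub version.

```python
import pandas as pd

matches = pd.DataFrame(
    {
        "ID": ["AGE00147708", "hypothetical_station"],
        "iata": ["EPM", "EPM"],
        # First value copied from the preview; the second one is made up.
        "skrub_Joiner_rescaled_distance": [4.467246, 0.02],
    }
)

# Hypothetical cutoff; a sensible value depends on the data.
max_rescaled_distance = 1.0
close_matches = matches[
    matches["skrub_Joiner_rescaled_distance"] <= max_rescaled_distance
]
print(close_matches)
```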

Joining airports with flights data: let’s instantiate another multiple-key joiner on the date and the airport:

joiner = Joiner(
    aux_augmented,
    aux_key=["YEAR/MONTH/DAY", "iata"],
    main_key=["Year_Month_DayofMonth", "Origin"],
)

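# Preview the table without the ID-like columns. drop() returns a new
# DataFrame, so `flights` itself is left unchanged here.
flights.drop(columns=["TailNum", "FlightNum"])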

	Year_Month_DayofMonth	DayOfWeek	CRSDepTime	CRSArrTime	UniqueCarrier	CRSElapsedTime	ArrDelay	Origin	Dest	Distance
0	2008-01-13	7	1900-01-01 18:35:00	1900-01-01 20:08:00	CO	213.0	1.0	IAH	ONT	1334.0
1	2008-02-21	4	1900-01-01 14:30:00	1900-01-01 16:06:00	NW	216.0	2.0	MSP	SEA	1399.0
2	2008-03-26	3	1900-01-01 07:00:00	1900-01-01 09:38:00	US	98.0	-1.0	PHX	SLC	507.0
3	2008-01-03	4	1900-01-01 08:40:00	1900-01-01 12:03:00	CO	383.0	46.0	EWR	SNA	2433.0
4	2008-01-31	4	1900-01-01 12:50:00	1900-01-01 14:10:00	MQ	80.0	-14.0	SJC	SNA	342.0
...	...	...	...	...	...	...	...	...	...	...
4995	2008-04-01	2	1900-01-01 10:14:00	1900-01-01 10:45:00	EV	91.0	50.0	ATL	PFN	247.0
4996	2008-02-25	1	1900-01-01 12:00:00	1900-01-01 13:30:00	AA	210.0	-2.0	DFW	RNO	1345.0
4997	2008-01-20	7	1900-01-01 06:00:00	1900-01-01 07:30:00	AQ	90.0	-13.0	LAS	OAK	407.0
4998	2008-03-14	5	1900-01-01 06:42:00	1900-01-01 08:04:00	XE	82.0	-16.0	ROC	CLE	245.0
4999	2008-04-18	5	1900-01-01 19:38:00	1900-01-01 20:06:00	OO	88.0	-3.0	ICT	DEN	419.0

5000 rows × 10 columns

Training data is then passed through a Pipeline:

- We will combine all the information from our pool of tables into “flights”, our main table.
- We will use this main table to model the prediction of flight delay.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline

from skrub import TableVectorizer

tv = TableVectorizer()
hgb = HistGradientBoostingClassifier()

pipeline_hgb = make_pipeline(joiner, tv, hgb)
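Because Joiner is a transformer, the join runs inside the pipeline each time the pipeline is fitted or used for prediction. The sketch below illustrates this pattern with a hypothetical `ExactJoiner` class that only handles exact matches. It is a simplified stand-in that does not reproduce skrub's fuzzy matching, and all table values are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline


class ExactJoiner(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a join transformer (exact matches only)."""

    def __init__(self, aux_table, main_key, aux_key):
        self.aux_table = aux_table
        self.main_key = main_key
        self.aux_key = aux_key

    def fit(self, X, y=None):
        # Nothing to learn for an exact join.
        return self

    def transform(self, X):
        # A left join keeps one output row per input row when aux keys are unique.
        return X.merge(
            self.aux_table, how="left", left_on=self.main_key, right_on=self.aux_key
        )


# Made-up toy data.
aux_toy = pd.DataFrame({"iata": ["IAH", "MSP"], "TMAX": [250.0, 116.0]})
X_toy = pd.DataFrame({"Origin": ["IAH", "MSP", "IAH", "PHX"]})
y_toy = [1, 0, 1, 0]

pipeline_toy = make_pipeline(
    ExactJoiner(aux_toy, main_key="Origin", aux_key="iata"),
    DummyClassifier(strategy="most_frequent"),
)
pipeline_toy.fit(X_toy, y_toy)

# Inspect what the first step hands to the classifier.
X_joined = pipeline_toy[0].transform(X_toy)
print(X_joined)
```

The row for "PHX" has no counterpart in the auxiliary table, so its joined value is missing; downstream steps need to tolerate such gaps.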

We isolate our target variable from the features:

y = flights["ArrDelay"]
X = flights.drop(columns=["ArrDelay"])

We want to frame this as a classification problem: suppose that your company is obliged to reimburse the ticket price if the flight is delayed.

We have a binary classification problem: the flight was delayed (1) or not (0).

y = (y > 0).astype(int)
y.value_counts()

ArrDelay
0    2727
1    2273
Name: count, dtype: int64
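These counts give a useful reference point for any accuracy score. As a small arithmetic sketch, a model that always predicts the majority class (not delayed) would be right on the following share of the full sample; the share in a given test split may differ slightly.

```python
# Class counts reported by y.value_counts() above.
n_not_delayed, n_delayed = 2727, 2273

# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(n_not_delayed, n_delayed) / (n_not_delayed + n_delayed)
print(f"{baseline_accuracy:.4f}")  # prints 0.5454
```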

The results:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
pipeline_hgb.fit(X_train, y_train).score(X_test, y_test)

0.5616

Conclusion#

In this example, we have combined multiple tables with complex joins on imprecise and multiple-key correspondences. This is made easy by skrub’s Joiner() transformer.

Our final accuracy score on the held-out test set is 0.56.
