skrub.InterpolationJoiner#

Usage examples at the bottom of this page.

class skrub.InterpolationJoiner(aux_table, *, key=None, main_key=None, aux_key=None, suffix='', regressor=HistGradientBoostingRegressor(), classifier=HistGradientBoostingClassifier(), vectorizer=TableVectorizer(high_cardinality=MinHashEncoder()), n_jobs=None, on_estimator_failure='warn')[source]#

Join with a table augmented by machine-learning predictions.

This is similar to a usual equi-join, but instead of looking for actual rows in the right table that satisfy the join condition, we estimate what those rows would contain if they existed in the table.

Suppose we want to join a table buildings(latitude, longitude, n_stories) with a table annual_avg_temp(latitude, longitude, avg_temp). Our annual average temperature table may not contain data for the exact latitude and longitude of our buildings. However, we can interpolate what we need from the data points it does contain. Using annual_avg_temp, we train a model to predict the temperature, given the latitude and longitude. Then, we use this model to estimate the values we want to add to our buildings table. In a way we are joining buildings to a virtual table, in which rows for any (latitude, longitude) location are inferred, rather than retrieved, when requested. This is done with:

InterpolationJoiner(
    annual_avg_temp, key=["latitude", "longitude"]
).fit_transform(buildings)
Parameters:
aux_table : DataFrame

The (auxiliary) table to be joined to the main_table (which is the argument of transform). aux_table is used to train a model that takes as inputs the contents of the columns listed in aux_key, and predicts the contents of the other columns. In the example above, we want our transformer to add temperature data to the table it is operating on. Therefore, aux_table is the annual_avg_temp table.

key : list of str, or str

Column names to use for both main_key and aux_key, when they are the same. Provide either key (only) or both main_key and aux_key.

main_key : list of str, or str

The columns in the main table used for joining. The main table is the argument of transform, to which we add information inferred using aux_table. The column names listed in main_key will provide the inputs (features) of the interpolators at prediction (joining) time. In the example above, main_key is ["latitude", "longitude"], which refer to columns in the buildings table. When joining on a single column, we can pass its name rather than a list: "latitude" is equivalent to ["latitude"].

aux_key : list of str, or str

The columns in aux_table used for joining. Their number and types must match those of the main_key columns in the main table. These columns provide the features for the estimators to be fitted. As for main_key, it is possible to pass a string when using a single column.
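
For instance, reusing the buildings and annual_avg_temp tables described above, if the auxiliary table named its coordinate columns "lat" and "lon" (hypothetical names, for illustration only), the keys would be passed separately:

InterpolationJoiner(
    annual_avg_temp_renamed,  # hypothetical copy of annual_avg_temp with columns "lat" and "lon"
    main_key=["latitude", "longitude"],
    aux_key=["lat", "lon"],
).fit_transform(buildings)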

suffix : str

Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.
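
For example, appending a suffix such as "_pred" (an arbitrary choice, for illustration) keeps the columns contributed by annual_avg_temp easy to tell apart:

InterpolationJoiner(
    annual_avg_temp, key=["latitude", "longitude"], suffix="_pred"
).fit_transform(buildings)
# the inferred column is then added as "avg_temp_pred" rather than "avg_temp"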

regressor : scikit-learn regressor

Model used to predict the numerical columns of aux_table.

classifier : scikit-learn classifier

Model used to predict the categorical (string) columns of aux_table.

vectorizer : scikit-learn transformer that can operate on a DataFrame

Used to transform the feature columns before passing them to the scikit-learn estimators. This is useful if we are joining on columns that need some transformation, such as dates or strings representing high-cardinality categories. By default we use a MinHashEncoder to vectorize text columns. This is because the MinHashEncoder is very fast and usually gives good results with downstream learners based on trees like the gradient-boosted trees used by default for regressor and classifier. If you replace the default regressor and classifier with models such as nearest-neighbors or linear models, consider passing vectorizer=TableVectorizer() which will encode text with a GapEncoder rather than a MinHashEncoder.
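
As a sketch of that suggestion, reusing the tables described above, nearest-neighbors estimators could be combined with a plain TableVectorizer:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from skrub import InterpolationJoiner, TableVectorizer

InterpolationJoiner(
    annual_avg_temp,
    key=["latitude", "longitude"],
    regressor=KNeighborsRegressor(2),
    classifier=KNeighborsClassifier(2),
    vectorizer=TableVectorizer(),  # encodes text with a GapEncoder by default
).fit_transform(buildings)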

n_jobs : int or None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Depending on the estimators used and the contents of aux_table, several estimators may need to be fitted – for example one for continuous outputs (regressor) and one for categorical outputs (classifier), or one for each column when the provided estimators do not support multi-output tasks. Fitting and querying these estimators can be done in parallel.

on_estimator_failure : “warn”, “raise” or “pass”

How to handle exceptions raised when fitting one of the estimators (regressors and classifiers) or querying them for a prediction. If “raise”, exceptions are propagated. If “pass”: (i) if an exception is raised during fit, the corresponding columns are ignored – they will not appear in the join; (ii) if an exception is raised during transform, the corresponding column is filled with nulls. Columns are filled with nulls during transform rather than dropped so that the output always has the same shape. If “warn” (the default), behave like “pass” but issue a warning.

See also

Joiner

Works in a similar way but instead of inferring values, picks the closest row from the auxiliary table.

Examples

>>> import pandas as pd
>>> buildings = pd.DataFrame(
...     {"latitude": [1.0, 2.0], "longitude": [1.0, 2.0], "n_stories": [3, 7]}
... )
>>> annual_avg_temp = pd.DataFrame(
...     {
...         "latitude": [1.2, 0.9, 1.9, 1.7, 5.0],
...         "longitude": [0.8, 1.1, 1.8, 1.8, 5.0],
...         "avg_temp": [10.0, 11.0, 15.0, 16.0, 20.0],
...     }
... )
>>> buildings
   latitude  longitude  n_stories
0       1.0        1.0          3
1       2.0        2.0          7
>>> annual_avg_temp
   latitude  longitude  avg_temp
0       1.2        0.8      10.0
1       0.9        1.1      11.0
2       1.9        1.8      15.0
3       1.7        1.8      16.0
4       5.0        5.0      20.0

Let’s interpolate the average temperature:

>>> from sklearn.neighbors import KNeighborsRegressor
>>> from skrub import InterpolationJoiner
>>> InterpolationJoiner(
...     annual_avg_temp,
...     key=["latitude", "longitude"],
...     regressor=KNeighborsRegressor(2),
... ).fit_transform(buildings)
   latitude  longitude  n_stories  avg_temp
0       1.0        1.0          3      10.5
1       2.0        2.0          7      15.5
Attributes:
vectorizer_ : scikit-learn transformer

The transformer used to vectorize the feature columns.

estimators_ : list of dicts

The estimators used to infer values to be joined. Each entry in this list is a dictionary with keys "estimator" (the fitted estimator) and "columns" (the list of columns in aux_table that it is trained to predict).
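
After fitting, these attributes can be inspected directly; a small sketch, continuing the temperature example from above:

joiner = InterpolationJoiner(
    annual_avg_temp, key=["latitude", "longitude"]
).fit(buildings)
for fitted in joiner.estimators_:
    # each entry pairs a fitted estimator with the aux_table columns it predicts
    print(fitted["columns"], fitted["estimator"])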

Methods

fit(X[, y])

Fit estimators to the aux_table provided during initialization.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a table by joining inferred values to it.

fit(X, y=None)[source]#

Fit estimators to the aux_table provided during initialization.

X and y are mostly for scikit-learn compatibility.

Parameters:
X : array_like or None

The main table to which self.aux_table could be joined. If X is not None, an error is raised if any of the matching columns listed in self.main_key (or self.key) is missing from X.

y : array_like

Ignored; only exists for compatibility with scikit-learn.

Returns:
self : InterpolationJoiner

Returns self.
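
For example, using the buildings and annual_avg_temp tables from the Examples section above, the joiner can be fitted once and then applied to any table that has the main_key columns:

joiner = InterpolationJoiner(
    annual_avg_temp, key=["latitude", "longitude"]
).fit(buildings)
augmented = joiner.transform(buildings)  # adds the inferred "avg_temp" column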

fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Input samples.

y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
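
For instance, parameters of the default regressor can be reached through the regressor__ prefix (the learning_rate value below is only illustrative):

joiner = InterpolationJoiner(annual_avg_temp, key=["latitude", "longitude"])
joiner.set_params(regressor__learning_rate=0.05, suffix="_pred")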

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)[source]#

Transform a table by joining inferred values to it.

The values of the main_key columns in X (the main table) are used to predict likely values for the contents of a matching row in self.aux_table (the auxiliary table).

Parameters:
X : DataFrame

The (main) table to transform.

Returns:
join : DataFrame

The result of the join between X and inferred rows from self.aux_table.

Examples using skrub.InterpolationJoiner#

Interpolation join: infer missing rows when joining two tables