skrub.InterpolationJoiner
Usage examples at the bottom of this page.
- class skrub.InterpolationJoiner(aux_table, *, main_key=None, aux_key=None, key=None, suffix='', regressor=HistGradientBoostingRegressor(), classifier=HistGradientBoostingClassifier(), vectorizer=TableVectorizer(high_cardinality_transformer=MinHashEncoder()), n_jobs=None, on_estimator_failure='warn')
Join with a table augmented by machine-learning predictions.
This is similar to a usual equi-join, but instead of looking for actual rows in the right table that satisfy the join condition, we estimate what those rows would contain if they existed in the table.
Suppose we want to join a table buildings(latitude, longitude, n_stories) with a table annual_avg_temp(latitude, longitude, avg_temp). Our annual average temperature table may not contain data for the exact latitude and longitude of our buildings. However, we can interpolate what we need from the data points it does contain. Using annual_avg_temp, we train a model to predict the temperature, given the latitude and longitude. Then, we use this model to estimate the values we want to add to our buildings table. In a way we are joining buildings to a virtual table, in which rows for any (latitude, longitude) location are inferred, rather than retrieved, when requested. This is done with:
InterpolationJoiner(annual_avg_temp, key=["latitude", "longitude"]).fit_transform(buildings)
- Parameters:
- aux_table : DataFrame
The (auxiliary) table to be joined to the main_table (which is the argument of transform). aux_table is used to train a model that takes as inputs the contents of the columns listed in aux_key, and predicts the contents of the other columns. In the example above, we want our transformer to add temperature data to the table it is operating on. Therefore, aux_table is the annual_avg_temp table.
- main_key : list of str, or str
The columns in the main table used for joining. The main table is the argument of transform, to which we add information inferred using aux_table. The column names listed in main_key will provide the inputs (features) of the interpolators at prediction (joining) time. In the example above, main_key is ["latitude", "longitude"], which refer to columns in the buildings table. When joining on a single column, we can pass its name rather than a list: "latitude" is equivalent to ["latitude"].
- aux_key : list of str, or str
The columns in aux_table used for joining. Their number and types must match those of the main_key columns in the main table. These columns provide the features for the estimators to be fitted. As for main_key, it is possible to pass a string when using a single column.
- key : list of str, or str
Column names to use for both main_key and aux_key, when they are the same. Provide either key (only) or both main_key and aux_key.
- suffix : str
Suffix to append to the aux_table's column names. You can use it to avoid duplicate column names in the join.
- regressor : scikit-learn regressor
Model used to predict the numerical columns of aux_table.
- classifier : scikit-learn classifier
Model used to predict the categorical (string) columns of aux_table.
- vectorizer : scikit-learn transformer that can operate on a DataFrame
Used to transform the feature columns before passing them to the scikit-learn estimators. This is useful if we are joining on columns that need some transformation, such as dates or strings representing high-cardinality categories. By default we use a MinHashEncoder to vectorize text columns. This is because the MinHashEncoder is very fast and usually gives good results with downstream learners based on trees like the gradient-boosted trees used by default for regressor and classifier. If you replace the default regressor and classifier with models such as nearest-neighbors or linear models, consider passing vectorizer=TableVectorizer(), which will encode text with a GapEncoder rather than a MinHashEncoder.
- n_jobs : int or None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Depending on the estimators used and the contents of aux_table, several estimators may need to be fitted: for example one for continuous outputs (regressor) and one for categorical outputs (classifier), or one for each column when the provided estimators do not support multi-output tasks. Fitting and querying these estimators can be done in parallel.
- on_estimator_failure : "warn", "raise" or "pass"
How to handle exceptions raised when fitting one of the estimators (regressors and classifiers) or querying them for a prediction. If "raise", exceptions are propagated. If "pass", (i) if an exception is raised during fit, the corresponding columns are ignored (they will not appear in the join) and (ii) if an exception is raised during transform, the corresponding column will be filled with nulls. Columns are filled with nulls during transform rather than dropped so that the output always has the same shape. If "warn" (the default), behave like "pass" but issue a warning.
See also
Joiner
Works in a similar way but instead of inferring values, picks the closest row from the auxiliary table.
Examples
>>> import pandas as pd
>>> from skrub import InterpolationJoiner
>>> buildings = pd.DataFrame(
...     {"latitude": [1.0, 2.0], "longitude": [1.0, 2.0], "n_stories": [3, 7]}
... )
>>> annual_avg_temp = pd.DataFrame(
...     {
...         "latitude": [1.2, 0.9, 1.9, 1.7, 5.0],
...         "longitude": [0.8, 1.1, 1.8, 1.8, 5.0],
...         "avg_temp": [10.0, 11.0, 15.0, 16.0, 20.0],
...     }
... )
>>> buildings
   latitude  longitude  n_stories
0       1.0        1.0          3
1       2.0        2.0          7
>>> annual_avg_temp
   latitude  longitude  avg_temp
0       1.2        0.8      10.0
1       0.9        1.1      11.0
2       1.9        1.8      15.0
3       1.7        1.8      16.0
4       5.0        5.0      20.0
>>> from sklearn.neighbors import KNeighborsRegressor
>>> InterpolationJoiner(
...     annual_avg_temp,
...     key=["latitude", "longitude"],
...     regressor=KNeighborsRegressor(2),
... ).fit_transform(buildings)
   latitude  longitude  n_stories  avg_temp
0       1.0        1.0          3      10.5
1       2.0        2.0          7      15.5
- Attributes:
- vectorizer_ : scikit-learn transformer
The transformer used to vectorize the feature columns.
- estimators_ : list of dicts
The estimators used to infer values to be joined. Each entry in this list is a dictionary with keys "estimator" (the fitted estimator) and "columns" (the list of columns in aux_table that it is trained to predict).
Methods
fit(X[, y]): Fit estimators to the aux_table provided during initialization.
fit_transform(X[, y]): Fit to data, then transform it.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
set_output(*[, transform]): Set output container.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform a table by joining inferred values to it.
- fit(X, y=None)
Fit estimators to the aux_table provided during initialization.
X and y are mostly for scikit-learn compatibility.
- Parameters:
- X : array_like or None
The main table to which self.aux_table could be joined. If X is not None, an error is raised if any of the matching columns listed in self.main_key (or self.key) is missing from X.
- y : array_like
Ignored; only exists for compatibility with scikit-learn.
- Returns:
- self : InterpolationJoiner
Returns self.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas"}, default=None
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
- transform(X)
Transform a table by joining inferred values to it.
The values of the main_key columns in X (the main table) are used to predict likely values for the contents of a matching row in self.aux_table (the auxiliary table).
- Parameters:
- X : DataFrame
The (main) table to transform.
- Returns:
- join : DataFrame
The result of the join between X and inferred rows from self.aux_table.
Examples using skrub.InterpolationJoiner

Interpolation join: infer missing rows when joining two tables