InterpolationJoiner#
- class skrub.InterpolationJoiner(aux_table, *, key=None, main_key=None, aux_key=None, suffix='', regressor=HistGradientBoostingRegressor(), classifier=HistGradientBoostingClassifier(), vectorizer=TableVectorizer(high_cardinality=MinHashEncoder()), n_jobs=None, on_estimator_failure='warn')[source]#
Join with a table augmented by machine-learning predictions.
This is similar to a usual equi-join, but instead of looking for actual rows in the right table that satisfy the join condition, we estimate what those rows would contain if they existed in the table.
Suppose we want to join a table buildings(latitude, longitude, n_stories) with a table annual_avg_temp(latitude, longitude, avg_temp). Our annual average temperature table may not contain data for the exact latitude and longitude of our buildings. However, we can interpolate what we need from the data points it does contain. Using annual_avg_temp, we train a model to predict the temperature, given the latitude and longitude. Then, we use this model to estimate the values we want to add to our buildings table. In a way we are joining buildings to a virtual table, in which rows for any (latitude, longitude) location are inferred, rather than retrieved, when requested. This is done with:

InterpolationJoiner(
    annual_avg_temp, key=["latitude", "longitude"]
).fit_transform(buildings)
- Parameters:
- aux_table : DataFrame
  The (auxiliary) table to be joined to the main_table (which is the argument of transform). aux_table is used to train a model that takes as inputs the contents of the columns listed in aux_key, and predicts the contents of the other columns. In the example above, we want our transformer to add temperature data to the table it is operating on. Therefore, aux_table is the annual_avg_temp table.
- key : str or list of str, default=None
  Column names to use for both main_key and aux_key, when they are the same. Provide either key (only) or both main_key and aux_key.
- main_key : str or list of str, default=None
  The columns in the main table used for joining. The main table is the argument of transform, to which we add information inferred using aux_table. The column names listed in main_key will provide the inputs (features) of the interpolators at prediction (joining) time. In the example above, main_key is ["latitude", "longitude"], which refer to columns in the buildings table. When joining on a single column, we can pass its name rather than a list: "latitude" is equivalent to ["latitude"].
- aux_key : str or list of str, default=None
  The columns in aux_table used for joining. Their number and types must match those of the main_key columns in the main table. These columns provide the features for the estimators to be fitted. As for main_key, it is possible to pass a string when using a single column.
- suffix : str, default=""
  Suffix to append to the aux_table's column names. If duplicate column names are found, a __skrub_<random string>__ is added at the end of columns that would otherwise be duplicates.
- regressor : scikit-learn regressor, default=HistGradientBoostingRegressor
  Model used to predict the numerical columns of aux_table.
- classifier : scikit-learn classifier, default=HistGradientBoostingClassifier
  Model used to predict the string and categorical columns of aux_table.
- vectorizer : scikit-learn transformer that can operate on a DataFrame
  Used to transform the feature columns before passing them to the scikit-learn estimators. This is useful if we are joining on columns that need some transformation, such as dates or strings representing high-cardinality categories. By default we use a MinHashEncoder to vectorize text columns. This is because the MinHashEncoder is very fast and usually gives good results with downstream learners based on trees, like the gradient-boosted trees used by default for regressor and classifier. If you replace the default regressor and classifier with models such as nearest-neighbors or linear models, consider passing vectorizer=TableVectorizer(), which will encode text with a GapEncoder rather than a MinHashEncoder (see the sketch after this parameter list).
- n_jobs : int or None, default=None
  Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Depending on the estimators used and the contents of aux_table, several estimators may need to be fitted – for example one for continuous outputs (regressor) and one for categorical outputs (classifier), or one for each column when the provided estimators do not support multi-output tasks. Fitting and querying these estimators can be done in parallel.
- on_estimator_failure : "warn", "raise" or "pass", default="warn"
  How to handle exceptions raised when fitting one of the estimators (regressors and classifiers) or querying them for a prediction. If "raise", exceptions are propagated. If "pass": (i) if an exception is raised during fit, the corresponding columns are ignored – they will not appear in the join; and (ii) if an exception is raised during transform, the corresponding column will be filled with nulls. Columns are filled with nulls during transform rather than dropped so that the output always has the same shape. If "warn" (the default), behave like "pass" but issue a warning.
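To make the regressor, classifier and vectorizer parameters more concrete, here is a minimal sketch (not part of the original documentation) that swaps in a nearest-neighbors regressor together with an explicit TableVectorizer(), as the vectorizer description above suggests. The table values simply reuse the data from the Examples section further down; InterpolationJoiner, TableVectorizer and KNeighborsRegressor are the only names assumed.

>>> import pandas as pd
>>> from sklearn.neighbors import KNeighborsRegressor
>>> from skrub import InterpolationJoiner, TableVectorizer
>>> annual_avg_temp = pd.DataFrame(
...     {
...         "latitude": [1.2, 0.9, 1.9, 1.7, 5.0],
...         "longitude": [0.8, 1.1, 1.8, 1.8, 5.0],
...         "avg_temp": [10.0, 11.0, 15.0, 16.0, 20.0],
...     }
... )
>>> # Distance-based models tend to work better with the GapEncoder-based
>>> # TableVectorizer() than with the MinHashEncoder default.
>>> joiner = InterpolationJoiner(
...     annual_avg_temp,
...     key=["latitude", "longitude"],
...     regressor=KNeighborsRegressor(n_neighbors=2),
...     vectorizer=TableVectorizer(),
... )

Calling fit_transform on a main table that has latitude and longitude columns would then add an interpolated avg_temp column.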
- Attributes:
- vectorizer_ : scikit-learn transformer
  The transformer used to vectorize the feature columns.
- estimators_ : list of dicts
  The estimators used to infer values to be joined. Each entry in this list is a dictionary with keys "estimator" (the fitted estimator) and "columns" (the list of columns in aux_table that it is trained to predict); see the sketch after this list.
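As a rough, illustrative sketch of how these fitted attributes can be inspected, reusing the annual_avg_temp and buildings tables from the Examples section below and relying only on the attribute structure described above:

>>> joiner = InterpolationJoiner(
...     annual_avg_temp, key=["latitude", "longitude"]
... ).fit(buildings)
>>> feature_vectorizer = joiner.vectorizer_
>>> predicted_columns = [entry["columns"] for entry in joiner.estimators_]
>>> fitted_models = [entry["estimator"] for entry in joiner.estimators_]

With the example tables, estimators_ would be expected to contain a single entry whose "columns" is ["avg_temp"].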
See also
Joiner
Works in a similar way but instead of inferring values, picks the closest row from the auxiliary table.
Examples
>>> import pandas as pd
>>> buildings = pd.DataFrame(
...     {"latitude": [1.0, 2.0], "longitude": [1.0, 2.0], "n_stories": [3, 7]}
... )
>>> annual_avg_temp = pd.DataFrame(
...     {
...         "latitude": [1.2, 0.9, 1.9, 1.7, 5.0],
...         "longitude": [0.8, 1.1, 1.8, 1.8, 5.0],
...         "avg_temp": [10.0, 11.0, 15.0, 16.0, 20.0],
...     }
... )
>>> buildings
   latitude  longitude  n_stories
0       1.0        1.0          3
1       2.0        2.0          7
>>> annual_avg_temp
   latitude  longitude  avg_temp
0       1.2        0.8      10.0
1       0.9        1.1      11.0
2       1.9        1.8      15.0
3       1.7        1.8      16.0
4       5.0        5.0      20.0
Let’s interpolate the average temperature:
>>> from sklearn.neighbors import KNeighborsRegressor
>>> from skrub import InterpolationJoiner
>>> InterpolationJoiner(
...     annual_avg_temp,
...     key=["latitude", "longitude"],
...     regressor=KNeighborsRegressor(2),
... ).fit_transform(buildings)
   latitude  longitude  n_stories  avg_temp
0       1.0        1.0          3      10.5
1       2.0        2.0          7      15.5
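If the join columns are named differently in the two tables, main_key and aux_key can be passed instead of key. A sketch continuing the session above (the renamed weather table is made up for illustration):

>>> weather = annual_avg_temp.rename(
...     columns={"latitude": "lat", "longitude": "lon"}
... )
>>> joined = InterpolationJoiner(
...     weather,
...     main_key=["latitude", "longitude"],
...     aux_key=["lat", "lon"],
...     regressor=KNeighborsRegressor(2),
... ).fit_transform(buildings)

Since only the auxiliary column names changed, the joined result should match the key= example above.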
Methods
fit(X[, y])
  Fit estimators to the aux_table provided during initialization.
fit_transform(X[, y])
  Fit to data, then transform it.
get_metadata_routing()
  Get metadata routing of this object.
get_params([deep])
  Get parameters for this estimator.
set_output(*[, transform])
  Set output container.
set_params(**params)
  Set the parameters of this estimator.
transform(X)
  Transform a table by joining inferred values to it.
- fit(X, y=None)[source]#
Fit estimators to the aux_table provided during initialization. X and y are mostly for scikit-learn compatibility.
- Parameters:
- X : array_like or None
  The main table to which self.aux_table could be joined. If X is not None, an error is raised if any of the matching columns listed in self.main_key (or self.key) are missing from X.
- y : array_like
  Ignored; only exists for compatibility with scikit-learn.
- Returns:
- InterpolationJoiner
  Fitted InterpolationJoiner instance (self).
- fit_transform(X, y=None, **fit_params)[source]#
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
  Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
  Target values (None for unsupervised transformations).
- **fit_params : dict
  Additional fit parameters.
- Returns:
- DataFrame
  The transformed version of X: the result of the join between X and inferred rows from aux_table.
- get_metadata_routing()[source]#
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
  A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)[source]#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas", "polars"}, default=None
  Configure output of transform and fit_transform.
  - "default": Default output format of a transformer
  - "pandas": DataFrame output
  - "polars": Polars output
  - None: Transform configuration is unchanged
  Added in version 1.4: the "polars" option was added.
- Returns:
- self : estimator instance
  Estimator instance.
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
  Estimator parameters.
- Returns:
- self : estimator instance
  Estimator instance.
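For illustration, a small sketch of the <component>__<parameter> form with this class (the annual_avg_temp table is the one from the Examples section above; the chosen values are arbitrary):

>>> joiner = InterpolationJoiner(annual_avg_temp, key=["latitude", "longitude"])
>>> joiner = joiner.set_params(suffix="_temp")
>>> joiner = joiner.set_params(regressor__max_depth=3)  # nested parameter of the default regressor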
- transform(X)[source]#
Transform a table by joining inferred values to it.
The values of the main_key columns in X (the main table) are used to predict likely values for the contents of a matching row in aux_table (the auxiliary table).
- Parameters:
- X : DataFrame
  The (main) table to transform.
- Returns:
- DataFrame
  The result of the join between X and inferred rows from aux_table.
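Since fitting and joining are separate steps, a fitted joiner can be reused to transform new main tables. A minimal sketch, reusing the tables from the Examples section above (new_buildings is hypothetical):

>>> import pandas as pd
>>> joiner = InterpolationJoiner(
...     annual_avg_temp,
...     key=["latitude", "longitude"],
...     regressor=KNeighborsRegressor(2),
... ).fit(buildings)
>>> new_buildings = pd.DataFrame(
...     {"latitude": [1.5], "longitude": [1.5], "n_stories": [4]}
... )
>>> joined = joiner.transform(new_buildings)  # adds an inferred avg_temp column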
Gallery examples#
Interpolation join: infer missing rows when joining two tables