InterpolationJoiner
- class skrub.InterpolationJoiner(aux_table, *, key=None, main_key=None, aux_key=None, suffix='', regressor=HistGradientBoostingRegressor(), classifier=HistGradientBoostingClassifier(), vectorizer=TableVectorizer(high_cardinality=MinHashEncoder()), n_jobs=None, on_estimator_failure='warn')
Join with a table augmented by machine-learning predictions.
This is similar to a usual equi-join, but instead of looking for actual rows in the right table that satisfy the join condition, we estimate what those rows would contain if they existed in the table.
Suppose we want to join a table buildings(latitude, longitude, n_stories) with a table annual_avg_temp(latitude, longitude, avg_temp). Our annual average temperature table may not contain data for the exact latitude and longitude of our buildings. However, we can interpolate what we need from the data points it does contain. Using annual_avg_temp, we train a model to predict the temperature, given the latitude and longitude. Then, we use this model to estimate the values we want to add to our buildings table. In a way we are joining buildings to a virtual table, in which rows for any (latitude, longitude) location are inferred, rather than retrieved, when requested. This is done with:
InterpolationJoiner(
    annual_avg_temp, key=["latitude", "longitude"]
).fit_transform(buildings)
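To make the idea of a "virtual table" concrete, here is a minimal sketch of the two steps described above, written with pandas and scikit-learn only. It is an illustration of the principle under simplifying assumptions (a single numeric output column, no vectorization of the key columns), not the actual implementation of InterpolationJoiner. The data is the same as in the Examples section of this page.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

buildings = pd.DataFrame(
    {"latitude": [1.0, 2.0], "longitude": [1.0, 2.0], "n_stories": [3, 7]}
)
annual_avg_temp = pd.DataFrame(
    {
        "latitude": [1.2, 0.9, 1.9, 1.7, 5.0],
        "longitude": [0.8, 1.1, 1.8, 1.8, 5.0],
        "avg_temp": [10.0, 11.0, 15.0, 16.0, 20.0],
    }
)
key = ["latitude", "longitude"]

# Step 1: learn avg_temp as a function of the key columns,
# using only the auxiliary table.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(annual_avg_temp[key], annual_avg_temp["avg_temp"])

# Step 2: for each row of the main table, infer the value that a
# matching row of the auxiliary table would contain, and append it.
joined = buildings.assign(avg_temp=model.predict(buildings[key]))
print(joined)
```

With two neighbors, each building receives the mean temperature of its two closest measurement points.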
- Parameters:
- aux_table : DataFrame
The (auxiliary) table to be joined to the main_table (which is the argument of transform). aux_table is used to train a model that takes as inputs the contents of the columns listed in aux_key, and predicts the contents of the other columns. In the example above, we want our transformer to add temperature data to the table it is operating on. Therefore, aux_table is the annual_avg_temp table.
- key : str or list of str, default=None
Column names to use for both main_key and aux_key, when they are the same. Provide either key (only) or both main_key and aux_key.
- main_key : str or list of str, default=None
The columns in the main table used for joining. The main table is the argument of transform, to which we add information inferred using aux_table. The column names listed in main_key will provide the inputs (features) of the interpolators at prediction (joining) time. In the example above, main_key is ["latitude", "longitude"], which refer to columns in the buildings table. When joining on a single column, we can pass its name rather than a list: "latitude" is equivalent to ["latitude"].
- aux_key : str or list of str, default=None
The columns in aux_table used for joining. Their number and types must match those of the main_key columns in the main table. These columns provide the features for the estimators to be fitted. As for main_key, it is possible to pass a string when using a single column.
- suffix : str, default=""
Suffix to append to the aux_table’s column names. If duplicate column names are found, a __skrub_<random string>__ is added at the end of columns that would otherwise be duplicates.
- regressor : scikit-learn regressor, default=HistGradientBoostingRegressor
Model used to predict the numerical columns of aux_table.
- classifier : scikit-learn classifier, default=HistGradientBoostingClassifier
Model used to predict the string and categorical columns of aux_table.
- vectorizer : scikit-learn transformer that can operate on a DataFrame
Used to transform the feature columns before passing them to the scikit-learn estimators. This is useful if we are joining on columns that need some transformation, such as dates or strings representing high-cardinality categories. By default we use a MinHashEncoder to vectorize text columns. This is because the MinHashEncoder is very fast and usually gives good results with downstream learners based on trees like the gradient-boosted trees used by default for regressor and classifier. If you replace the default regressor and classifier with models such as nearest-neighbors or linear models, consider passing vectorizer=TableVectorizer() which will encode text with a GapEncoder rather than a MinHashEncoder.
- n_jobs : int or None, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. Depending on the estimators used and the contents of aux_table, several estimators may need to be fitted, for example one for continuous outputs (regressor) and one for categorical outputs (classifier), or one for each column when the provided estimators do not support multi-output tasks. Fitting and querying these estimators can be done in parallel.
- on_estimator_failure : "warn", "raise" or "pass", default="warn"
How to handle exceptions raised when fitting one of the estimators (regressors and classifiers) or querying them for a prediction. If "raise", exceptions are propagated. If "pass": (i) if an exception is raised during fit, the corresponding columns are ignored (they will not appear in the join), and (ii) if an exception is raised during transform, the corresponding column will be filled with nulls. Columns are filled with nulls during transform rather than dropped so that the output always has the same shape. If "warn" (the default), behave like "pass" but issue a warning.
- Attributes:
- vectorizer_ : scikit-learn transformer
The transformer used to vectorize the feature columns.
- estimators_ : list of dicts
The estimators used to infer values to be joined. Each entry in this list is a dictionary with keys "estimator" (the fitted estimator) and "columns" (the list of columns in aux_table that it is trained to predict).
See also
Joiner
Works in a similar way but instead of inferring values, picks the closest row from the auxiliary table.
Examples
>>> import pandas as pd
>>> buildings = pd.DataFrame(
...     {"latitude": [1.0, 2.0], "longitude": [1.0, 2.0], "n_stories": [3, 7]}
... )
>>> annual_avg_temp = pd.DataFrame(
...     {
...         "latitude": [1.2, 0.9, 1.9, 1.7, 5.0],
...         "longitude": [0.8, 1.1, 1.8, 1.8, 5.0],
...         "avg_temp": [10.0, 11.0, 15.0, 16.0, 20.0],
...     }
... )
>>> buildings
   latitude  longitude  n_stories
0       1.0        1.0          3
1       2.0        2.0          7
>>> annual_avg_temp
   latitude  longitude  avg_temp
0       1.2        0.8      10.0
1       0.9        1.1      11.0
2       1.9        1.8      15.0
3       1.7        1.8      16.0
4       5.0        5.0      20.0
Let’s interpolate the average temperature:
>>> from sklearn.neighbors import KNeighborsRegressor
>>> from skrub import InterpolationJoiner
>>> InterpolationJoiner(
...     annual_avg_temp,
...     key=["latitude", "longitude"],
...     regressor=KNeighborsRegressor(2),
... ).fit_transform(buildings)
   latitude  longitude  n_stories  avg_temp
0       1.0        1.0          3      10.5
1       2.0        2.0          7      15.5
Methods
- fit(X[, y]): Fit estimators to the aux_table provided during initialization.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform a table by joining inferred values to it.
- fit(X, y=None)
Fit estimators to the aux_table provided during initialization.
X and y are mostly for scikit-learn compatibility.
- Parameters:
- X : array_like or None
The main table to which self.aux_table could be joined. If X is not None, an error is raised if any of the matching columns listed in self.main_key (or self.key) are missing from X.
- y : array_like
Ignored; only exists for compatibility with scikit-learn.
- Returns:
- InterpolationJoiner
Fitted InterpolationJoiner instance (self).
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {"default", "pandas", "polars"}, default=None
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
- transform(X)
Transform a table by joining inferred values to it.
The values of the main_key columns in X (the main table) are used to predict likely values for the contents of a matching row in aux_table (the auxiliary table).
- Parameters:
- X : DataFrame
The (main) table to transform.
- Returns:
- DataFrame
The result of the join between X and inferred rows from aux_table.
Gallery examples
Interpolation join: infer missing rows when joining two tables