skrub.Joiner
Usage examples at the bottom of this page.
- class skrub.Joiner(aux_table, *, main_key=None, aux_key=None, key=None, suffix='', match_score=0.0, analyzer='char_wb', ngram_range=(2, 4))
Augment a main table by fuzzy joining an auxiliary table to it.
Given an auxiliary table and matching column names, fuzzy join it to the main table. The principle is as follows:
- The auxiliary table and the matching column names are provided at initialisation.
- The main table is provided for fitting, and will be joined when Joiner.transform is called.
It is advised to use hyperparameter tuning tools such as GridSearchCV to determine the best match_score parameter, as this can significantly improve your results (see the example ‘Fuzzy joining dirty tables with the Joiner’ for an illustration).
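To make the role of match_score more concrete, the following is a minimal sketch of threshold-based fuzzy matching on character n-grams. It is not skrub's implementation: the helper names (char_wb_ngrams, cosine_similarity, best_match) are hypothetical, the n-gram extraction only approximates the 'char_wb' analyzer, and cosine similarity is an assumed choice of score.

```python
from collections import Counter
from math import sqrt


def char_wb_ngrams(text, ngram_range=(2, 4)):
    """Count character n-grams inside word boundaries (hypothetical helper).

    Each word is lower-cased and padded with one space on each side,
    which approximates the 'char_wb' analyzer described on this page.
    """
    min_n, max_n = ngram_range
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (assumed score)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(query, candidates, match_score=0.0):
    """Return (candidate, score) for the closest candidate.

    The candidate is replaced by None when its score is below match_score,
    mimicking a join that leaves the row unmatched.
    """
    query_grams = char_wb_ngrams(query)
    scored = [(cosine_similarity(query_grams, char_wb_ngrams(c)), c) for c in candidates]
    score, candidate = max(scored)
    if score < match_score:
        return None, score
    return candidate, score


countries = ["germany", "france", "italy"]
print(best_match("France", countries))                   # exact match after lower-casing
print(best_match("Frnace", countries))                   # typo, still the closest candidate
print(best_match("Frnace", countries, match_score=0.5))  # typo rejected by a stricter threshold
```

In this sketch a low threshold accepts the misspelled key while a higher one rejects it, which is the trade-off a grid search over match_score would explore.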
- Parameters:
- aux_table : DataFrame
  The auxiliary table, which will be fuzzy-joined to the main table when calling transform.
- main_key : str or list of str, default=None
  The column names in the main table on which the join will be performed. Can be a string if joining on a single column. If None, aux_key must also be None and key must be provided.
- aux_key : str or list of str, default=None
  The column names in the auxiliary table on which the join will be performed. Can be a string if joining on a single column. If None, main_key must also be None and key must be provided.
- key : str or list of str, default=None
  The column names to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key.
- suffix : str, default=''
  Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.
- match_score : float, default=0.0
  Distance score between the closest matches that will be accepted, in the [0, 1] interval. 1 means that only a perfect match will be accepted, and 0 means that the closest match will be accepted, no matter how distant. For numerical joins, this defines the maximum Euclidean distance between the matches.
- analyzer : {‘word’, ‘char’, ‘char_wb’}, default='char_wb'
  Analyzer parameter for the CountVectorizer used for the string similarities. Describes whether the matrix V to factorize should be made of word counts or character n-gram counts. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces.
- ngram_range : 2-tuple of int, default=(2, 4)
  The lower and upper boundaries of the range of n-values for different n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.
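The rule relating key, main_key and aux_key can be summarised with a small sketch. The helper resolve_keys below is hypothetical and is not part of skrub; it only mirrors the constraints stated in the parameter descriptions (give either key alone, or both main_key and aux_key), and the error messages are invented for illustration.

```python
def resolve_keys(main_key=None, aux_key=None, key=None):
    """Hypothetical helper: return (main_columns, aux_columns) as lists of names."""

    def as_list(value):
        # A single column may be given as a plain string.
        return [value] if isinstance(value, str) else list(value)

    if key is not None:
        # `key` is shorthand for identical column names in both tables.
        if main_key is not None or aux_key is not None:
            raise ValueError("Provide either `key`, or both `main_key` and `aux_key`.")
        return as_list(key), as_list(key)

    if main_key is None or aux_key is None:
        raise ValueError("Without `key`, both `main_key` and `aux_key` are required.")
    return as_list(main_key), as_list(aux_key)


# Same column name in both tables.
print(resolve_keys(key="Country"))

# Different names, joining on several columns at once.
print(resolve_keys(main_key=["lat", "lon"], aux_key=["latitude", "longitude"]))
```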
See also
AggJoiner
Aggregate auxiliary dataframes before joining them on a base dataframe.
fuzzy_join
Join two tables (dataframes) based on approximate column matching.
get_ken_embeddings
Download vector embeddings for many common entities (cities, places, people…).
Examples
>>> import pandas as pd
>>> from skrub import Joiner
>>> X = pd.DataFrame(['France', 'Germany', 'Italy'], columns=['Country'])
>>> X
   Country
0   France
1  Germany
2    Italy
>>> aux_table = pd.DataFrame([['germany', 84_000_000],
...                           ['france', 68_000_000],
...                           ['italy', 59_000_000]],
...                          columns=['Country', 'Population'])
>>> aux_table
   Country  Population
0  germany    84000000
1   france    68000000
2    italy    59000000
>>> joiner = Joiner(aux_table, key='Country', suffix='_aux')
>>> augmented_table = joiner.fit_transform(X)
>>> augmented_table
   Country Country_aux  Population
0   France      france    68000000
1  Germany     germany    84000000
2    Italy       italy    59000000
Methods
- fit(X[, y]): Fit the instance to the main table.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- transform(X[, y]): Transform X using the specified encoding scheme.
- fit(X, y=None)
Fit the instance to the main table.
In practice, this only checks that the key columns exist in X (the main table) and in the auxiliary table.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
  Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
  Target values (None for unsupervised transformations).
- **fit_params : dict
  Additional fit parameters.
- Returns:
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
  A MetadataRequest encapsulating routing information.
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example of how to use the API.
- Parameters:
- transform : {“default”, “pandas”}, default=None
  Configure output of transform and fit_transform.
  - “default”: Default output format of a transformer
  - “pandas”: DataFrame output
  - None: Transform configuration is unchanged
- Returns:
- self : estimator instance
  Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
  Estimator parameters.
- Returns:
- self : estimator instance
  Estimator instance.
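The <component>__<parameter> naming can be illustrated with a small scikit-learn pipeline. The step names "scale" and "clf" are arbitrary choices for this sketch, and the estimators are stand-ins rather than anything specific to skrub.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Each keyword is <step name>__<parameter of that step>.
pipe.set_params(clf__C=0.1, scale__with_mean=False)

print(pipe.named_steps["clf"].C)
print(pipe.named_steps["scale"].with_mean)
```

By the same convention, a Joiner placed in a pipeline under a step named "joiner" (a hypothetical name) would expose its threshold as joiner__match_score, which is the form a GridSearchCV parameter grid expects.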
Examples using skrub.Joiner

Spatial join for flight data: Joining across multiple columns