skrub.Joiner

Usage examples at the bottom of this page.

class skrub.Joiner(aux_table, *, main_key=None, aux_key=None, key=None, suffix='', match_score=0.0, analyzer='char_wb', ngram_range=(2, 4))

Augment a main table by fuzzy joining an auxiliary table to it.

Given an auxiliary table and matching column names, fuzzy join it to the main table. The principle is as follows:

  1. The auxiliary table and the matching column names are provided at initialisation.

  2. The main table is provided for fitting, and will be joined when Joiner.transform is called.

Tuning the match_score parameter with a hyperparameter search tool such as GridSearchCV is advised, as it can significantly improve results (see the example ‘Fuzzy joining dirty tables with the Joiner’ for an illustration).

Parameters:
aux_table : DataFrame

The auxiliary table, which will be fuzzy-joined to the main table when calling transform.

main_key : str or list of str, default=None

The column names in the main table on which the join will be performed. Can be a string if joining on a single column. If None, aux_key must also be None and key must be provided.

aux_key : str or list of str, default=None

The column names in the auxiliary table on which the join will be performed. Can be a string if joining on a single column. If None, main_key must also be None and key must be provided.

key : str or list of str, default=None

The column names to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key.

suffix : str, default=''

Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.

match_score : float, default=0.0

Threshold on the score of the closest match for it to be accepted, in the interval [0, 1]. A value of 1 means that only a perfect match will be accepted; 0 means that the closest match will be accepted, no matter how distant. For numerical joins, this defines the maximum Euclidean distance between the matches.

analyzer : {‘word’, ‘char’, ‘char_wb’}, default=‘char_wb’

Analyzer parameter for the CountVectorizer used for the string similarities. Determines whether the features are built from word counts or character n-gram counts. The option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.

ngram_range : 2-tuple of int, default=(2, 4)

The lower and upper boundaries of the range of n-values for the different n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.

See also

AggJoiner

Aggregate auxiliary dataframes before joining them on a base dataframe.

fuzzy_join

Join two tables (dataframes) based on approximate column matching.

get_ken_embeddings

Download vector embeddings for many common entities (cities, places, people…).

Examples

>>> import pandas as pd
>>> from skrub import Joiner
>>> X = pd.DataFrame(['France', 'Germany', 'Italy'], columns=['Country'])
>>> X
   Country
0   France
1  Germany
2    Italy
>>> aux_table = pd.DataFrame([['germany', 84_000_000],
...                         ['france', 68_000_000],
...                         ['italy', 59_000_000]],
...                         columns=['Country', 'Population'])
>>> aux_table
   Country  Population
0  germany    84000000
1   france    68000000
2    italy    59000000
>>> joiner = Joiner(aux_table, key='Country', suffix='_aux')
>>> augmented_table = joiner.fit_transform(X)
>>> augmented_table
   Country Country_aux  Population
0   France      france    68000000
1  Germany     germany    84000000
2    Italy       italy    59000000

Methods

fit(X[, y])

Fit the instance to the main table.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, y])

Fuzzy join the auxiliary table to the main table X.

fit(X, y=None)

Fit the instance to the main table.

In practice, this only checks that the key columns exist in X, the main table, and in the auxiliary table.

Parameters:
X : DataFrame, shape [n_samples, n_features]

The main table, to be joined to the auxiliary table.

y : None

Unused, only here for compatibility.

Returns:
Joiner

Fitted Joiner instance (self).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Input samples.

y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X, y=None)

Fuzzy join the auxiliary table to the main table X.

Parameters:
X : DataFrame, shape [n_samples, n_features]

The main table, to be joined to the auxiliary table.

y : None

Unused, only here for compatibility.

Returns:
DataFrame

The final joined table.

Examples using skrub.Joiner

Fuzzy joining dirty tables with the Joiner

Wikipedia embeddings to enrich the data

Spatial join for flight data: Joining across multiple columns
