skrub.Joiner#

Usage examples at the bottom of this page.

class skrub.Joiner(aux_table, *, main_key=None, aux_key=None, key=None, suffix='', max_dist=inf, ref_dist='random_pairs', string_encoder=Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function _as_str>)), ('hashingvectorizer', HashingVectorizer(analyzer='char_wb', ngram_range=(2, 4))), ('tfidftransformer', TfidfTransformer())]), add_match_info=True)[source]#

Augment features in a main table by fuzzy-joining an auxiliary table to it.

This transformer is initialized with an auxiliary table aux_table. It transforms a main table by joining it, with approximate (“fuzzy”) matching, to the auxiliary table. The output of transform has the same rows as the main table (ie as the argument passed to transform), but each row is augmented with values from the best match in the auxiliary table.

To identify the best match for each row, values from the matching columns (main_key and aux_key) are vectorized, ie represented by vectors of continuous values. Then, the Euclidean distances between these vectors are computed to find, for each main table row, its nearest neighbor within the auxiliary table.

Optionally, a maximum distance threshold, max_dist, can be set. Matches between vectors that are separated by a distance (strictly) greater than max_dist will be rejected. We will consider that main table rows that are farther than max_dist from their nearest neighbor do not have a matching row in the auxiliary table, and the output will contain nulls for the entries that would normally have come from the auxiliary table (as in a traditional left join).

To make it easier to set a max_dist threshold, the distances are rescaled by dividing them by a reference distance, which can be chosen with ref_dist. The default is 'random_pairs'. The possible choices are:

‘random_pairs’

Pairs of rows are sampled randomly from the auxiliary table and their distance is computed. The reference distance is the first quartile of those distances.

‘second_neighbor’

The reference distance is the distance to the second nearest neighbor in the auxiliary table.

‘self_join_neighbor’

Once the match candidate (ie the nearest neigbor from the auxiliary table) has been found, we find its nearest neighbor in the auxiliary table (excluding itself). The reference distance is the distance that separates those 2 auxiliary rows.

‘no_rescaling’

The reference distance is 1.0, ie no rescaling of the distances is applied.

Parameters:
aux_tableDataFrame

The auxiliary table, which will be fuzzy-joined to the main table when calling transform.

main_keystr or list of str, default=None

The column names in the main table on which the join will be performed. Can be a string if joining on a single column. If None, aux_key must also be None and key must be provided.

aux_keystr or list of str, default=None

The column names in the auxiliary table on which the join will be performed. Can be a string if joining on a single column. If None, main_key must also be None and key must be provided.

keystr or list of str, default=None

The column names to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key.

suffixstr, default=””

Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.

max_distfloat, default=np.inf

Maximum acceptable (rescaled) distance between a row in the main_table and its nearest neighbor in the aux_table. Rows that are farther apart are not considered to match. By default, the distance is rescaled so that a value between 0 and 1 is typically a good choice, although rescaled distances can be greater than 1 for some choices of ref_dist. None, "inf", float("inf") or numpy.inf mean that no matches are rejected.

ref_distreference distance for rescaling, default = ‘random_pairs’

Options are {“random_pairs”, “second_neighbor”, “self_join_neighbor”, “no_rescaling”}. See above for a description of each option. To facilitate the choice of max_dist, distances between rows in main_table and their nearest neighbor in aux_table will be rescaled by this reference distance.

string_encoderscikit-learn transformer used to vectorize text columns

By default a HashingVectorizer combined with a TfidfTransformer is used. Here we use raw TF-IDF features rather than transforming them for example with GapEncoder or MinHashEncoder because it is faster, these features are only used to find nearest neighbors and not used by downstream estimators, and distances between TF-IDF vectors have a somewhat simpler interpretation.

add_match_infobool, default=True

Insert some columns whose names start with skrub_Joiner containing the distance, rescaled distance and whether the rescaled distance is above the threshold. Those values can be helpful for an estimator that uses the joined features, or to inspect the result of the join and set a max_dist threshold.

See also

AggJoiner

Aggregate auxiliary dataframes before joining them on a base dataframe.

fuzzy_join

Join two tables (dataframes) based on approximate column matching. This is the same functionality as provided by the Joiner but exposed as a function rather than a transformer.

Examples

>>> import pandas as pd
>>> main_table = pd.DataFrame({"Country": ["France", "Italia", "Spain"]})
>>> aux_table = pd.DataFrame( {"Country": ["Germany", "France", "Italy"],
...                            "Capital": ["Berlin", "Paris", "Rome"]} )
>>> main_table
  Country
0  France
1  Italia
2   Spain
>>> aux_table
   Country Capital
0  Germany  Berlin
1   France   Paris
2    Italy    Rome
>>> joiner = Joiner(
...     aux_table,
...     key="Country",
...     suffix="_capitals",
...     max_dist=0.9,
...     add_match_info=False,
... )
>>> joiner.fit_transform(main_table)
  Country Country_capitals Capital_capitals
0  France           France            Paris
1  Italia            Italy             Rome
2   Spain              NaN              NaN
Attributes:
max_dist_the maximum distance for a match to be accepted

Equal to the parameter max_dist except that "inf" and None are mapped to np.inf (ie accept all matches).

vectorizer_scikit-learn ColumnTransformer

The fitted transformer used to transform the matching columns into numerical vectors.

Methods

fit(X[, y])

Fit the instance to the main table.

fit_transform(X[, y])

Fit to data, then transform it.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, y])

Transform X using the specified encoding scheme.

fit(X, y=None)[source]#

Fit the instance to the main table.

Parameters:
XDataFrame, shape [n_samples, n_features]

The main table, to be joined to the auxiliary ones.

yNone

Unused, only here for compatibility.

Returns:
Joiner

Fitted Joiner instance (self).

fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
Xarray_like of shape (n_samples, n_features)

Input samples.

yarray_like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_paramsdict

Additional fit parameters.

Returns:
X_newndarray array of shape (n_samples, n_features_new)

Transformed array.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform{“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
selfestimator instance

Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

transform(X, y=None)[source]#

Transform X using the specified encoding scheme.

Parameters:
XDataFrame, shape [n_samples, n_features]

The main table, to be joined to the auxiliary ones.

yNone

Unused, only here for compatibility.

Returns:
DataFrame

The final joined table.

Examples using skrub.Joiner#

Fuzzy joining dirty tables with the Joiner

Fuzzy joining dirty tables with the Joiner

Wikipedia embeddings to enrich the data

Wikipedia embeddings to enrich the data

Spatial join for flight data: Joining across multiple columns

Spatial join for flight data: Joining across multiple columns