Joiner#

class skrub.Joiner(aux_table, *, key=None, main_key=None, aux_key=None, suffix='', max_dist=inf, ref_dist='random_pairs', string_encoder=Pipeline(steps=[('functiontransformer', FunctionTransformer(func=functools.partial(<function fill_nulls>, value=''))), ('tostr', ToStr()), ('hashingvectorizer', HashingVectorizer(analyzer='char_wb', ngram_range=(2, 4))), ('tfidftransformer', TfidfTransformer())]), add_match_info=True)[source]#

Augment features in a main table by fuzzy-joining an auxiliary table to it.

This transformer is initialized with an auxiliary table aux_table. It transforms a main table by joining it, with approximate (“fuzzy”) matching, to the auxiliary table. The output of transform has the same rows as the main table (i.e. as the argument passed to transform), but each row is augmented with values from the best match in the auxiliary table.

To identify the best match for each row, values from the matching columns (main_key and aux_key) are vectorized, i.e. represented by vectors of continuous values. Then, the Euclidean distances between these vectors are computed to find, for each main table row, its nearest neighbor within the auxiliary table.

Optionally, a maximum distance threshold, max_dist, can be set. Matches between vectors that are separated by a distance (strictly) greater than max_dist will be rejected. We will consider that main table rows that are farther than max_dist from their nearest neighbor do not have a matching row in the auxiliary table, and the output will contain nulls for the entries that would normally have come from the auxiliary table (as in a traditional left join).

To make it easier to set a max_dist threshold, the distances are rescaled by dividing them by a reference distance, which can be chosen with ref_dist. The default is ‘random_pairs’. The possible choices are:

‘random_pairs’

Pairs of rows are sampled randomly from the auxiliary table and their distance is computed. The reference distance is the first quartile of those distances.

‘second_neighbor’

The reference distance is the distance to the second nearest neighbor in the auxiliary table.

‘self_join_neighbor’

Once the match candidate (i.e. the nearest neighbor from the auxiliary table) has been found, we find its nearest neighbor in the auxiliary table (excluding itself). The reference distance is the distance that separates those 2 auxiliary rows.

‘no_rescaling’

The reference distance is 1.0, i.e. no rescaling of the distances is applied.

Parameters:
aux_tabledataframe

The auxiliary table, which will be fuzzy-joined to the main table when calling transform.

keystr or list of str, default=None

The column names to use for both main_key and aux_key when they are the same. Provide either key or both main_key and aux_key.

main_keystr or list of str, default=None

The column names in the main table on which the join will be performed. Can be a string if joining on a single column. If None, aux_key must also be None and key must be provided.

aux_keystr or list of str, default=None

The column names in the auxiliary table on which the join will be performed. Can be a string if joining on a single column. If None, main_key must also be None and key must be provided.

suffixstr, default=””

Suffix to append to the aux_table’s column names. You can use it to avoid duplicate column names in the join.

max_distint, float, None or np.inf, default=`np.inf`

Maximum acceptable (rescaled) distance between a row in the main_table and its nearest neighbor in the aux_table. Rows that are farther apart are not considered to match. By default, the distance is rescaled so that a value between 0 and 1 is typically a good choice, although rescaled distances can be greater than 1 for some choices of ref_dist. None, “inf”, float(“inf”) or numpy.inf mean that no matches are rejected.

ref_distreference distance for rescaling, default=’random_pairs’

Options are {“random_pairs”, “second_neighbor”, “self_join_neighbor”, “no_rescaling”}. See above for a description of each option. To facilitate the choice of max_dist, distances between rows in main_table and their nearest neighbor in aux_table will be rescaled by this reference distance.

string_encoderscikit-learn transformer used to vectorize text columns

By default a HashingVectorizer combined with a TfidfTransformer is used. Here we use raw TF-IDF features rather than transforming them for example with GapEncoder or MinHashEncoder because it is faster, these features are only used to find nearest neighbors and not used by downstream estimators, and distances between TF-IDF vectors have a somewhat simpler interpretation.

add_match_infobool, default=True

Insert some columns whose names start with skrub_Joiner containing the distance, rescaled distance and whether the rescaled distance is above the threshold. Those values can be helpful for an estimator that uses the joined features, or to inspect the result of the join and set a max_dist threshold.

Attributes:
max_dist_the maximum distance for a match to be accepted

Equal to the parameter max_dist except that “inf” and None are mapped to np.inf (i.e. accept all matches).

vectorizer_scikit-learn ColumnTransformer

The fitted transformer used to transform the matching columns into numerical vectors.

See also

AggJoiner

Aggregate an auxiliary dataframe before joining it on a base dataframe.

fuzzy_join

Join two tables (dataframes) based on approximate column matching. This is the same functionality as provided by the Joiner but exposed as a function rather than a transformer.

Examples

>>> import pandas as pd
>>> from skrub import Joiner
>>> main_table = pd.DataFrame({"Country": ["France", "Italia", "Georgia"]})
>>> aux_table = pd.DataFrame( {"Country": ["Germany", "France", "Italy"],
...                            "Capital": ["Berlin", "Paris", "Rome"]} )
>>> main_table
  Country
0  France
1  Italia
2   Georgia
>>> aux_table
   Country Capital
0  Germany  Berlin
1   France   Paris
2    Italy    Rome
>>> joiner = Joiner(
...     aux_table,
...     key="Country",
...     suffix="_aux",
...     max_dist=0.8,
...     add_match_info=False,
... )
>>> joiner.fit_transform(main_table)
  Country      Country_aux      Capital_aux
0  France           France            Paris
1  Italia            Italy             Rome
2  Georgia              NaN              NaN

Methods

fit(X[, y])

Fit the instance to the main table.

fit_transform(X[, y])

Fit the instance to the main table.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, y])

Transform X using the specified encoding scheme.

fit(X, y=None)[source]#

Fit the instance to the main table.

Parameters:
XDataFrame

The main table, to be joined to the auxiliary ones.

yNone

Unused, only here for compatibility.

Returns:
Joiner

Fitted :class:`Joiner`instance (self).

fit_transform(X, y=None)[source]#

Fit the instance to the main table.

Parameters:
Xdataframe

The main table, to be joined to the auxiliary ones.

yNone

Unused, only here for compatibility.

Returns:
DataFrame

The final joined table.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routingMetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deepbool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
paramsdict

Parameter names mapped to their values.

set_output(*, transform=None)[source]#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform{“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
selfestimator instance

Estimator instance.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**paramsdict

Estimator parameters.

Returns:
selfestimator instance

Estimator instance.

transform(X, y=None)[source]#

Transform X using the specified encoding scheme.

Parameters:
Xdataframe

The main table, to be joined to the auxiliary ones.

yNone

Unused, only here for compatibility.

Returns:
dataframe

The final joined table.