skrub
.Joiner#
Usage examples at the bottom of this page.
- class skrub.Joiner(aux_table, *, main_key=None, aux_key=None, key=None, suffix='', max_dist=inf, ref_dist='random_pairs', string_encoder=Pipeline(steps=[('functiontransformer', FunctionTransformer(func=<function _as_str>)), ('hashingvectorizer', HashingVectorizer(analyzer='char_wb', ngram_range=(2, 4))), ('tfidftransformer', TfidfTransformer())]), add_match_info=True)[source]#
Augment features in a main table by fuzzy-joining an auxiliary table to it.
This transformer is initialized with an auxiliary table
aux_table
. It transforms a main table by joining it, with approximate (“fuzzy”) matching, to the auxiliary table. The output oftransform
has the same rows as the main table (ie as the argument passed totransform
), but each row is augmented with values from the best match in the auxiliary table.To identify the best match for each row, values from the matching columns (
main_key
andaux_key
) are vectorized, ie represented by vectors of continuous values. Then, the Euclidean distances between these vectors are computed to find, for each main table row, its nearest neighbor within the auxiliary table.Optionally, a maximum distance threshold,
max_dist
, can be set. Matches between vectors that are separated by a distance (strictly) greater thanmax_dist
will be rejected. We will consider that main table rows that are farther thanmax_dist
from their nearest neighbor do not have a matching row in the auxiliary table, and the output will contain nulls for the entries that would normally have come from the auxiliary table (as in a traditional left join).To make it easier to set a
max_dist
threshold, the distances are rescaled by dividing them by a reference distance, which can be chosen withref_dist
. The default is'random_pairs'
. The possible choices are:- ‘random_pairs’
Pairs of rows are sampled randomly from the auxiliary table and their distance is computed. The reference distance is the first quartile of those distances.
- ‘second_neighbor’
The reference distance is the distance to the second nearest neighbor in the auxiliary table.
- ‘self_join_neighbor’
Once the match candidate (ie the nearest neigbor from the auxiliary table) has been found, we find its nearest neighbor in the auxiliary table (excluding itself). The reference distance is the distance that separates those 2 auxiliary rows.
- ‘no_rescaling’
The reference distance is 1.0, ie no rescaling of the distances is applied.
- Parameters:
- aux_table
DataFrame
The auxiliary table, which will be fuzzy-joined to the main table when calling
transform
.- main_key
str
orlist
ofstr
, default=None The column names in the main table on which the join will be performed. Can be a string if joining on a single column. If
None
, aux_key must also beNone
and key must be provided.- aux_key
str
orlist
ofstr
, default=None The column names in the auxiliary table on which the join will be performed. Can be a string if joining on a single column. If
None
, main_key must also beNone
and key must be provided.- key
str
orlist
ofstr
, default=None The column names to use for both
main_key
andaux_key
when they are the same. Provide eitherkey
or bothmain_key
andaux_key
.- suffix
str
, default=”” Suffix to append to the
aux_table
’s column names. You can use it to avoid duplicate column names in the join.- max_dist
float
, default=np.inf Maximum acceptable (rescaled) distance between a row in the
main_table
and its nearest neighbor in theaux_table
. Rows that are farther apart are not considered to match. By default, the distance is rescaled so that a value between 0 and 1 is typically a good choice, although rescaled distances can be greater than 1 for some choices ofref_dist
.None
,"inf"
,float("inf")
ornumpy.inf
mean that no matches are rejected.- ref_distreference distance for rescaling, default = ‘random_pairs’
Options are {“random_pairs”, “second_neighbor”, “self_join_neighbor”, “no_rescaling”}. See above for a description of each option. To facilitate the choice of
max_dist
, distances between rows inmain_table
and their nearest neighbor inaux_table
will be rescaled by this reference distance.- string_encoderscikit-learn transformer used to vectorize text columns
By default a
HashingVectorizer
combined with aTfidfTransformer
is used. Here we use raw TF-IDF features rather than transforming them for example withGapEncoder
orMinHashEncoder
because it is faster, these features are only used to find nearest neighbors and not used by downstream estimators, and distances between TF-IDF vectors have a somewhat simpler interpretation.- add_match_info
bool
, default=True Insert some columns whose names start with skrub_Joiner containing the distance, rescaled distance and whether the rescaled distance is above the threshold. Those values can be helpful for an estimator that uses the joined features, or to inspect the result of the join and set a
max_dist
threshold.
- aux_table
See also
AggJoiner
Aggregate auxiliary dataframes before joining them on a base dataframe.
fuzzy_join
Join two tables (dataframes) based on approximate column matching. This is the same functionality as provided by the
Joiner
but exposed as a function rather than a transformer.
Examples
>>> import pandas as pd >>> main_table = pd.DataFrame({"Country": ["France", "Italia", "Spain"]}) >>> aux_table = pd.DataFrame( {"Country": ["Germany", "France", "Italy"], ... "Capital": ["Berlin", "Paris", "Rome"]} ) >>> main_table Country 0 France 1 Italia 2 Spain >>> aux_table Country Capital 0 Germany Berlin 1 France Paris 2 Italy Rome >>> joiner = Joiner( ... aux_table, ... key="Country", ... suffix="_capitals", ... max_dist=0.9, ... add_match_info=False, ... ) >>> joiner.fit_transform(main_table) Country Country_capitals Capital_capitals 0 France France Paris 1 Italia Italy Rome 2 Spain NaN NaN
- Attributes:
- max_dist_the maximum distance for a match to be accepted
Equal to the parameter
max_dist
except that"inf"
andNone
are mapped tonp.inf
(ie accept all matches).- vectorizer_scikit-learn
ColumnTransformer
The fitted transformer used to transform the matching columns into numerical vectors.
Methods
fit
(X[, y])Fit the instance to the main table.
fit_transform
(X[, y])Fit to data, then transform it.
Get metadata routing of this object.
get_params
([deep])Get parameters for this estimator.
set_output
(*[, transform])Set output container.
set_params
(**params)Set the parameters of this estimator.
transform
(X[, y])Transform X using the specified encoding scheme.
- fit_transform(X, y=None, **fit_params)[source]#
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- Xarray_like of shape (n_samples, n_features)
Input samples.
- yarray_like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params
dict
Additional fit parameters.
- Returns:
- get_metadata_routing()[source]#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A
MetadataRequest
encapsulating routing information.
- set_output(*, transform=None)[source]#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform{“default”, “pandas”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
None: Transform configuration is unchanged
- Returns:
- selfestimator instance
Estimator instance.
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form<component>__<parameter>
so that it’s possible to update each component of a nested object.- Parameters:
- **params
dict
Estimator parameters.
- **params
- Returns:
- selfestimator instance
Estimator instance.
Examples using skrub.Joiner
#
Fuzzy joining dirty tables with the Joiner
Wikipedia embeddings to enrich the data
Spatial join for flight data: Joining across multiple columns