SimilarityEncoder
Usage examples at the bottom of this page.
- class skrub.SimilarityEncoder(*, ngram_range=(2, 4), analyzer='char', categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='ignore', handle_missing='', hashing_dim=None, n_jobs=None)
Encode string categories to a similarity matrix, to capture fuzziness across a few categories.
The input to this transformer should be an array-like of strings. The method is based on calculating the morphological similarities between the categories. This encoding is an alternative to OneHotEncoder for dirty categorical variables.
The principle of this encoder is as follows:
Given an input string array X = [x1, ..., xn] with k unique categories [c1, ..., ck] and a similarity measure sim(s1, s2) between strings, we define the encoded vector of xi as [sim(xi, c1), ..., sim(xi, ck)]. Similarity encoding of X results in a matrix with shape (n, k) that captures morphological similarities between string entries.
To avoid dealing with high-dimensional encodings when k is high, we can use d << k prototypes [p1, ..., pd] with which similarities will be computed: xi -> [sim(xi, p1), ..., sim(xi, pd)]. These prototypes can be provided by the user. Otherwise, we recommend using the MinHashEncoder or GapEncoder when taking all unique entries leads to too many prototypes.
The similarity measure is based on the proportion of common n-grams between two strings.
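The principle above can be illustrated with a minimal pure-Python sketch. It assumes a Jaccard-style ratio (shared n-grams divided by all distinct n-grams of the two strings) as the similarity measure; the helper names `char_ngrams`, `ngram_similarity` and `similarity_encode` are hypothetical, and the exact formula used by the library may differ.

```python
def char_ngrams(s, ngram_range=(2, 4)):
    """Return the set of character n-grams of s for every n in the range."""
    min_n, max_n = ngram_range
    return {
        s[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(s) - n + 1)
    }


def ngram_similarity(s1, s2, ngram_range=(2, 4)):
    """Assumed measure: shared n-grams over all distinct n-grams."""
    a, b = char_ngrams(s1, ngram_range), char_ngrams(s2, ngram_range)
    if not a and not b:
        # Strings too short to yield any n-gram: fall back to exact match.
        return 1.0 if s1 == s2 else 0.0
    return len(a & b) / len(a | b)


def similarity_encode(X, prototypes=None, ngram_range=(2, 4)):
    """Encode each string in X as its similarities to the prototypes.

    With prototypes=None the sorted unique values of X are used, giving an
    (n, k) matrix; with d prototypes the result has shape (n, d).
    """
    if prototypes is None:
        prototypes = sorted(set(X))
    return [[ngram_similarity(x, p, ngram_range) for p in prototypes] for x in X]


X = ["north", "west", "north-west", "north"]
encoded = similarity_encode(X)             # 4 rows, 3 columns (k = 3)
reduced = similarity_encode(X, ["north"])  # 4 rows, 1 column (d = 1)
```

In this sketch "north" and "west" share no n-gram and get similarity 0, while "north-west" is partially similar to both, which is the kind of fuzzy link the encoding is meant to capture.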
- Parameters:
- ngram_range : int 2-tuple (min_n, max_n), default=(2, 4)
The lower and upper boundaries of the range of n-values for different n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.
- analyzer : {‘word’, ‘char’, ‘char_wb’}, default=‘char’
Analyzer parameter for the HashingVectorizer / CountVectorizer. Determines whether the n-gram counts are computed over words or over characters. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- categories : {‘auto’} or list of list of str, default=‘auto’
Categories (unique values) per feature:
‘auto’ : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the i-th column. The passed categories must be sorted and should not mix strings and numeric values.
The categories used can be found in the SimilarityEncoder.categories_ attribute.
- dtype : number type, default=float64
Desired dtype of output.
- handle_unknown : ‘error’ or ‘ignore’, default=‘ignore’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to ignore). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
- handle_missing : ‘error’ or ‘’, default=‘’
Whether to raise an error or impute with blank string ‘’ if missing values (NaN) are present during fit (default is to impute). When this parameter is set to ‘’, and a missing value is encountered during fit_transform, the resulting encoded columns for this feature will be all zeros. In the inverse transform, the missing category will be denoted as None. “Missing values” are any values for which pandas.isna returns True, such as numpy.nan or None.
- hashing_dim : int, optional
If None, the base vectorizer is a CountVectorizer, otherwise it is a HashingVectorizer with a number of features equal to hashing_dim.
- n_jobs : int, optional
Maximum number of processes used to compute similarity matrices. Used only if fast=True in SimilarityEncoder.transform.
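The difference between the ‘char’ and ‘char_wb’ analyzers described in the parameter list can be sketched in pure Python. This is a simplified illustration for a single n, not the vectorizers' own code; the helper names `char_ngrams` and `char_wb_ngrams` are hypothetical, and words shorter than n are simply skipped here.

```python
def char_ngrams(text, n):
    """'char' analyzer sketch: n-grams slide over the whole string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    """'char_wb' analyzer sketch: n-grams stay inside space-padded words."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_ngrams("ab cd", 2))     # ['ab', 'b ', ' c', 'cd']
print(char_wb_ngrams("ab cd", 2))  # [' a', 'ab', 'b ', ' c', 'cd', 'd ']
```

With ‘char_wb’, every n-gram belongs to a single word, and the padding spaces mark where a word starts or ends.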
- Attributes:
See also
MinHashEncoder
Encode string columns as a numeric array with the minhash method.
GapEncoder
Encodes dirty categories (strings) by constructing latent topics with continuous encoding.
deduplicate
Deduplicate data by hierarchically clustering similar strings.
Notes
The functionality of SimilarityEncoder is easy to explain and understand, but it is not scalable. It is useful only to capture links across a few categories (e.g. “west”, “north”, “north-west”), but not when there are many categories, as with open-ended entries. Instead, the GapEncoder is usually recommended.
References
For a detailed description of the method, see Similarity encoding for learning with dirty categorical variables by Cerda, Varoquaux, Kegl. 2018 (Machine Learning journal, Springer).
Examples
>>> from skrub import SimilarityEncoder
>>> enc = SimilarityEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
SimilarityEncoder()
It inherits the same methods as the OneHotEncoder:
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
But it provides a continuous encoding based on similarity instead of a discrete one based on exact matches:
>>> enc.transform([['Female', 1], ['Male', 4]])
array([[1. , 0.42..., 1. , 0. , 0. ],
       [0.42..., 1. , 0. , 0. , 0. ]])
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], dtype=object)
Methods
- fit(X[, y]): Fit the instance to X.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_feature_names_out([input_features]): Get output feature names for transformation.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- inverse_transform(X): Convert the data back to the original representation.
- set_output(*[, transform]): Set output container.
- set_params(**params): Set the parameters of this estimator.
- set_transform_request(*[, fast]): Request metadata passed to the transform method.
- transform(X[, fast]): Transform X using specified encoding scheme.
- fit(X, y=None)
Fit the instance to X.
- Parameters:
- X : array_like, shape [n_samples, n_features]
The data to determine the categories of each feature.
- y : None
Unused, only here for compatibility.
- Returns:
SimilarityEncoder
The fitted SimilarityEncoder instance (self).
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Input samples.
- y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- get_feature_names_out(input_features=None)
Get output feature names for transformation.
- Parameters:
- input_features : array_like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns:
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- property infrequent_categories_
Infrequent categories for each feature.
- inverse_transform(X)
Convert the data back to the original representation.
When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.
For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.
- Parameters:
- X : {array_like, sparse matrix} of shape (n_samples, n_encoded_features)
The transformed data.
- Returns:
- X_tr : ndarray of shape (n_samples, n_features)
Inverse transformed array.
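The handling of unknown categories described above can be illustrated with scikit-learn's OneHotEncoder, from which this method is inherited. This is a sketch under that assumption: OneHotEncoder stands in for SimilarityEncoder here, and the all-zero row is built by hand to mimic an unknown category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories; unknown ones are ignored at transform time.
enc = OneHotEncoder(handle_unknown="ignore").fit([["Female"], ["Male"]])

# First row encodes 'Female'; the all-zero row is what an unknown
# category encodes to.
rows = np.array([[1.0, 0.0], [0.0, 0.0]])
decoded = enc.inverse_transform(rows)
print(decoded)
```

The second row of `decoded` holds None, since no known category matches an all-zero encoding.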
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”, “polars”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged
Added in version 1.4: “polars” option was added.
- Returns:
- self : estimator instance
Estimator instance.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
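The <component>__<parameter> form for nested objects can be sketched with a scikit-learn Pipeline. The step names "enc" and "clf" are arbitrary choices for this illustration, and OneHotEncoder stands in for SimilarityEncoder so that the snippet depends on scikit-learn only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = Pipeline([("enc", OneHotEncoder()), ("clf", LogisticRegression())])

# Update one parameter on each step through the pipeline itself.
pipe.set_params(enc__handle_unknown="ignore", clf__C=0.1)

params = pipe.get_params()
print(params["enc__handle_unknown"], params["clf__C"])
```

The same call works on a single estimator with plain parameter names, for example `OneHotEncoder().set_params(handle_unknown="ignore")`.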
- set_transform_request(*, fast='$UNCHANGED$')
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- transform(X, fast=True)
Transform X using specified encoding scheme.
- Parameters:
- X : array_like, shape [n_samples, n_features]
The data to encode.
- fast : bool, default=True
Whether to use the fast computation of ngrams.
- Returns:
ndarray, shape [n_samples, n_features_new]
Transformed input.