StringEncoder

class skrub.StringEncoder(n_components=30, vectorizer='tfidf', ngram_range=(3, 4), analyzer='char_wb')

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

Note

StringEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((StringEncoder(), 'col_name_1'), (StringEncoder(), 'col_name_2')) instead of make_column_transformer((StringEncoder(), ['col_name_1', 'col_name_2'])).

The encoder first applies a tf-idf vectorization to the text, then reduces the dimensionality with a truncated SVD that keeps the given number of components.
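
As a rough illustration of these two steps, the sketch below chains scikit-learn's TfidfVectorizer and TruncatedSVD by hand. It is an approximation and not skrub's actual implementation; the variable names and random_state=0 are choices made for this example only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "The professor snatched a good interview out of the jaws of these questions.",
    "Bookmarking this to watch later.",
    "When you don't know the lyrics of the song except the chorus",
]

# Step 1: tf-idf over character n-grams, mirroring the encoder's defaults.
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
sparse_features = tfidf.fit_transform(texts)  # one row per string, one column per n-gram

# Step 2: truncated SVD keeps only the leading components.
svd = TruncatedSVD(n_components=2, random_state=0)
embedding = svd.fit_transform(sparse_features)

print(sparse_features.shape[1] > embedding.shape[1])  # the tf-idf matrix is far wider
print(embedding.shape)
```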

New features will be named {col_name}_{component} if the series has a name, and tsvd_{component} if it does not.

Parameters:
n_components : int, default=30

Number of components to be used for the singular value decomposition (SVD). Must be a positive integer.

vectorizer : str, “tfidf” or “hashing”, default=“tfidf”

Vectorizer to apply to the strings: “tfidf” selects scikit-learn's TfidfVectorizer and “hashing” selects scikit-learn's HashingVectorizer.
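
The practical difference between the two scikit-learn vectorizers can be seen by using them directly. This sketch only illustrates the underlying classes; how StringEncoder wires them internally is not shown here, and the sample strings are invented.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

texts = ["red apple pie", "green apple tart", "blue cheese burger"]

tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
hashing = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 4))

# TfidfVectorizer learns a vocabulary, so its width depends on the data.
tfidf_features = tfidf.fit_transform(texts)

# HashingVectorizer is stateless: n-grams are hashed into a fixed number
# of columns (n_features, 2**20 by default) and no vocabulary is stored.
hashed_features = hashing.transform(texts)

print(tfidf_features.shape[1] == len(tfidf.vocabulary_))
print(hashed_features.shape)
```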

ngram_range : tuple of (int, int), default=(3, 4)

The lower and upper boundary (min_n, max_n) of the range of n-values for the n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
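
The three ranges mentioned above can be inspected with scikit-learn's analyzer directly. The sketch uses word n-grams so the output is easy to read; the sample sentence is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "the quick fox"

results = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    # build_analyzer returns the function that splits a string into n-grams.
    analyze = TfidfVectorizer(analyzer="word", ngram_range=ngram_range).build_analyzer()
    results[ngram_range] = analyze(text)
    print(ngram_range, results[ngram_range])
# (1, 1) ['the', 'quick', 'fox']
# (1, 2) ['the', 'quick', 'fox', 'the quick', 'quick fox']
# (2, 2) ['the quick', 'quick fox']
```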

analyzer : str, “char”, “word” or “char_wb”, default=“char_wb”

Whether the features should be made of word or character n-grams. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.
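
The difference between “char” and “char_wb” is easiest to see on a tiny input. The sketch below uses scikit-learn's analyzer directly; the ngrams helper is a hypothetical function written for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "ab cd"

def ngrams(analyzer):
    # Hypothetical helper: list the character 3-grams a given analyzer emits.
    vectorizer = TfidfVectorizer(analyzer=analyzer, ngram_range=(3, 3))
    return vectorizer.build_analyzer()(text)

char_grams = ngrams("char")
char_wb_grams = ngrams("char_wb")

print(char_grams)     # ['ab ', 'b c', ' cd']: n-grams may span two words
print(char_wb_grams)  # [' ab', 'ab ', ' cd', 'cd ']: each word is padded and handled alone
```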

See also

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

GapEncoder

Encode string columns by constructing latent topics.

TextEncoder

Encode string columns using pre-trained language models.

Examples

>>> import pandas as pd
>>> from skrub import StringEncoder

We will encode the comments using 2 components:

>>> enc = StringEncoder(n_components=2)
>>> X = pd.Series([
...   "The professor snatched a good interview out of the jaws of these questions.",
...   "Bookmarking this to watch later.",
...   "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
>>> enc.fit_transform(X) 
   video comments_0  video comments_1
0      8.218069e-01      4.557474e-17
1      6.971618e-16      1.000000e+00
2      8.218069e-01     -3.046564e-16

Methods

fit(column[, y])

Fit the transformer.

fit_transform(X[, y])

Fit the encoder and transform a column.

get_feature_names_out()

Get output feature names for transformation.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, column])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a column.

fit(column, y=None)

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.

fit_transform(X, y=None)

Fit the encoder and transform a column.

Parameters:
X : pandas or polars Series

The column to transform.

y : None

Unused. Here for compatibility with scikit-learn.

Returns:
X_out : pandas or polars DataFrame with shape (len(X), n_components)

The embedding representation of the input.

get_feature_names_out()

Get output feature names for transformation.

Returns:
feature_names_out : list of str objects

Transformed feature names.

get_metadata_routing()

Get metadata routing of this object.

Please check the scikit-learn User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_fit_request(*, column='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for column parameter in fit.

Returns:
self : object

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
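
The <component>__<parameter> convention can be sketched with a plain scikit-learn Pipeline. The step names "tfidf" and "tsvd" are hypothetical labels chosen for this example and are not part of skrub.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
    ("tsvd", TruncatedSVD(n_components=30)),
])

# Nested parameters are addressed as <component>__<parameter>.
pipe.set_params(tsvd__n_components=2, tfidf__ngram_range=(2, 3))

# get_params(deep=True) exposes the same nested names.
params = pipe.get_params()
print(params["tsvd__n_components"])  # 2
print(params["tfidf__ngram_range"])  # (2, 3)
```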

transform(X)

Transform a column.

Parameters:
X : pandas or polars Series

The column to transform.

Returns:
result : pandas or polars DataFrame with shape (len(X), n_components)

The embedding representation of the input.