StringEncoder#

class skrub.StringEncoder(n_components=30, vectorizer='tfidf', ngram_range=(3, 4), analyzer='char_wb', stop_words=None, random_state=None)[source]#

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

Note

StringEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((StringEncoder(), 'col_name_1'), (StringEncoder(), 'col_name_2')) instead of make_column_transformer((StringEncoder(), ['col_name_1', 'col_name_2'])).
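
For instance, a minimal sketch of the ColumnTransformer usage described in this note (the column names are made up):

>>> from sklearn.compose import make_column_transformer
>>> from skrub import StringEncoder
>>> ct = make_column_transformer(
...     (StringEncoder(), 'col_name_1'),
...     (StringEncoder(), 'col_name_2'),
... )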

First, a tf-idf vectorization of the text is applied, then the dimensionality is reduced with a truncated SVD to the given number of components.

New features will be named {col_name}_{component} if the series has a name, and tsvd_{component} if it does not.

Parameters:
n_components : int, default=30

Number of components to be used for the singular value decomposition (SVD). Must be a positive integer.

vectorizer : str, “tfidf” or “hashing”, default=”tfidf”

Vectorizer to apply to the strings: “tfidf” for scikit-learn’s TfidfVectorizer, or “hashing” for its HashingVectorizer.

ngram_range : tuple of (int, int), default=(3, 4)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzer : str, “char”, “word” or “char_wb”, default=”char_wb”

Whether the feature should be made of word or character n-grams. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

stop_words : {‘english’}, list, default=None

If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used.

random_state : int, RandomState instance or None, default=None

Used during the randomized SVD. Pass an int for reproducible results across multiple function calls.
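
The sketch below pulls several of these parameters together; the values are illustrative only:

>>> from skrub import StringEncoder
>>> enc = StringEncoder(
...     n_components=2,
...     vectorizer='hashing',
...     analyzer='word',
...     ngram_range=(1, 2),
...     stop_words='english',
...     random_state=0,
... )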

Attributes:
input_name_ : str

Name of the fitted column, or “string_enc” if the column has no name.

n_components_ : int

The number of dimensions of the embeddings after dimensionality reduction.

all_outputs_ : list of str

The names of all the features generated from the fitted column.

See also

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

GapEncoder

Encode string columns by constructing latent topics.

TextEncoder

Encode string columns using pre-trained language models.

Notes

Skrub provides StringEncoder as a simple interface to perform Latent Semantic Analysis (LSA). As such, it does not expose all the hyper-parameters of the underlying vectorizer (TfidfVectorizer or HashingVectorizer) and TruncatedSVD. If you need more flexibility than StringEncoder offers, build your own LSA with a scikit-learn Pipeline, for example:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.decomposition import TruncatedSVD
>>> make_pipeline(TfidfVectorizer(max_df=300), TruncatedSVD())
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(max_df=300)),
            ('truncatedsvd', TruncatedSVD())])
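
A quick check of this sketch on made-up strings:

>>> pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
>>> X_out = pipe.fit_transform(['one text', 'another text', 'a third one'])
>>> X_out.shape
(3, 2)

Note that, unlike StringEncoder, such a pipeline consumes a plain iterable of strings rather than a single dataframe column.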

Examples

>>> import pandas as pd
>>> from skrub import StringEncoder

We will encode the comments using 2 components:

>>> enc = StringEncoder(n_components=2)
>>> X = pd.Series([
...   "The professor snatched a good interview out of the jaws of these questions.",
...   "Bookmarking this to watch later.",
...   "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
>>> enc.fit_transform(X)
   video comments_0  video comments_1
0      8.218069e-01      4.557474e-17
1      6.971618e-16      1.000000e+00
2      8.218069e-01     -3.046564e-16
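
Once fitted, the encoder can embed unseen comments with transform; the exact values depend on the fitted vocabulary, so this sketch only checks the shape:

>>> X_new = pd.Series(["Great song!"], name='video comments')
>>> enc.transform(X_new).shape
(1, 2)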

Methods

fit(column[, y])

Fit the transformer.

fit_transform(X[, y])

Fit the encoder and transform a column.

get_feature_names_out()

Return a list of features generated by the transformer.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a column.

fit(column, y=None)[source]#

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.
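
A minimal sketch of fitting on a named column (made-up data):

>>> import pandas as pd
>>> from skrub import StringEncoder
>>> enc = StringEncoder(n_components=2)
>>> _ = enc.fit(pd.Series(['one', 'two', 'three'], name='x'))
>>> enc.input_name_
'x'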

fit_transform(X, y=None)[source]#

Fit the encoder and transform a column.

Parameters:
X : Pandas or Polars series

The column to transform.

y : None

Unused. Here for compatibility with scikit-learn.

Returns:
X_out : Pandas or Polars dataframe with shape (len(X), n_components_)

The embedding representation of the input.

get_feature_names_out()[source]#

Return a list of features generated by the transformer.

Each feature has the format {input_name}_{n_component}, where input_name is the name of the input column (or a default name for the encoder) and n_component is the index of the component.

Returns:
list of str

The list of feature names.
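
For example, after fitting an encoder with n_components=2 on a column named 'city' (a sketch with made-up data):

>>> import pandas as pd
>>> from skrub import StringEncoder
>>> enc = StringEncoder(n_components=2)
>>> _ = enc.fit(pd.Series(['Paris', 'London', 'Berlin'], name='city'))
>>> enc.get_feature_names_out()
['city_0', 'city_1']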

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
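
For example (a small sketch):

>>> from skrub import StringEncoder
>>> enc = StringEncoder()
>>> _ = enc.set_params(n_components=10, analyzer='word')
>>> enc.get_params()['n_components']
10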

transform(X)[source]#

Transform a column.

Parameters:
X : Pandas or Polars series

The column to transform.

Returns:
result : Pandas or Polars dataframe with shape (len(X), n_components_)

The embedding representation of the input.