StringEncoder

class skrub.StringEncoder(n_components=30, vectorizer='tfidf', ngram_range=(3, 4), analyzer='char_wb')

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

Note

StringEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((StringEncoder(), 'col_name_1'), (StringEncoder(), 'col_name_2')) instead of make_column_transformer((StringEncoder(), ['col_name_1', 'col_name_2'])).

The encoder first applies a tf-idf vectorization to the text, then reduces the dimensionality with a truncated SVD that keeps the given number of components.
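The two steps can be approximated with scikit-learn directly. The sketch below is an illustration of the idea under the default analyzer and ngram_range, not skrub's actual implementation, which may differ in details such as output type and naming.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "The professor snatched a good interview out of the jaws of these questions.",
    "Bookmarking this to watch later.",
    "When you don't know the lyrics of the song except the chorus",
]

# Step 1: tf-idf over character n-grams. Step 2: truncated SVD.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    TruncatedSVD(n_components=2),
)
embedding = pipeline.fit_transform(texts)

# One row per input string, one column per SVD component.
print(embedding.shape)
```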

New features will be named {col_name}_{component} if the series has a name, and tsvd_{component} if it does not.

Parameters:
n_components : int, default=30

Number of components to be used for the singular value decomposition (SVD). Must be a positive integer.

vectorizer : str, "tfidf" or "hashing", default="tfidf"

Vectorizer to apply to the strings, either tfidf or hashing for scikit-learn TfidfVectorizer or HashingVectorizer respectively.

ngram_range : tuple of (int, int), default=(3, 4)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzer : str, "char", "word" or "char_wb", default="char_wb"

Whether the feature should be made of word or character n-grams. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
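To make the interaction between ngram_range and the char_wb analyzer concrete, here is a simplified pure-Python sketch. The function char_wb_ngrams is hypothetical and written for this illustration; it is not the scikit-learn analyzer, which handles more cases (for example accent stripping and very short words).

```python
def char_wb_ngrams(text, ngram_range=(3, 4)):
    """Simplified character n-grams restricted to word boundaries."""
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "  # pad each word with one space on both sides
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                break  # the padded word is too short for this n
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


print(char_wb_ngrams("to be", ngram_range=(3, 4)))
# [' to', 'to ', ' to ', ' be', 'be ', ' be ']
print(char_wb_ngrams("cat", ngram_range=(2, 2)))
# [' c', 'ca', 'at', 't ']
```

No n-gram spans the gap between "to" and "be": each word is padded and processed on its own, which is what distinguishes char_wb from char.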

See also

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

GapEncoder

Encode string columns by constructing latent topics.

TextEncoder

Encode string columns using pre-trained language models.

Examples

>>> import pandas as pd
>>> from skrub import StringEncoder

We will encode the comments using 2 components:

>>> enc = StringEncoder(n_components=2)
>>> X = pd.Series([
...   "The professor snatched a good interview out of the jaws of these questions.",
...   "Bookmarking this to watch later.",
...   "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
>>> enc.fit_transform(X)
   video comments_0  video comments_1
0      8.218069e-01      4.557474e-17
1      6.971618e-16      1.000000e+00
2      8.218069e-01     -3.046564e-16

Methods

fit(column[, y])

Fit the transformer.

fit_transform(X[, y])

Fit the encoder and transform a column.

get_feature_names_out()

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a column.

fit(column, y=None)

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.

fit_transform(X, y=None)

Fit the encoder and transform a column.

Parameters:
X : pandas or polars Series

The column to transform.

y : None

Unused. Here for compatibility with scikit-learn.

Returns:
X_out : pandas or polars DataFrame of shape (len(X), n_components)

The embedding representation of the input.

get_feature_names_out()

Get output feature names for transformation.

Returns:
feature_names_out : list of str

Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)

Transform a column.

Parameters:
X : pandas or polars Series

The column to transform.

Returns:
result : pandas or polars DataFrame of shape (len(X), n_components)

The embedding representation of the input.