StringEncoder#

class skrub.StringEncoder(n_components=30, vectorizer='tfidf', ngram_range=(3, 4), analyzer='char_wb', stop_words=None, random_state=None)[source]#

Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).

Note

StringEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((StringEncoder(), 'col_name_1'), (StringEncoder(), 'col_name_2')) instead of make_column_transformer((StringEncoder(), ['col_name_1', 'col_name_2'])).
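
For instance, a minimal sketch of the ColumnTransformer usage described in this note (the column names are made up):

>>> from sklearn.compose import make_column_transformer
>>> from skrub import StringEncoder
>>> ct = make_column_transformer(
...     (StringEncoder(), 'col_name_1'),
...     (StringEncoder(), 'col_name_2'),
... )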

First, a tf-idf vectorization of the text is applied, then the dimensionality is reduced with a truncated SVD to the given number of components.

New features will be named {col_name}_{component} if the series has a name, and tsvd_{component} if it does not.

Parameters:
n_components : int, default=30

Number of components to be used for the singular value decomposition (SVD). Must be a positive integer.

vectorizer : str, “tfidf” or “hashing”, default=”tfidf”

Vectorizer to apply to the strings: “tfidf” for scikit-learn’s TfidfVectorizer, or “hashing” for its HashingVectorizer.

ngram_range : tuple of (int, int), default=(3, 4)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

analyzer : str, “char”, “word” or “char_wb”, default=”char_wb”

Whether the feature should be made of word or character n-grams. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

stop_words : {‘english’}, list, default=None

If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used.

random_state : int, RandomState instance or None, default=None

Used during the randomized SVD. Pass an int for reproducible results across multiple function calls.
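
The sketch below pulls several of these parameters together; the values are illustrative only:

>>> from skrub import StringEncoder
>>> enc = StringEncoder(
...     n_components=2,
...     vectorizer='hashing',
...     analyzer='word',
...     ngram_range=(1, 2),
...     stop_words='english',
...     random_state=0,
... )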

Attributes:
input_name_ : str

Name of the fitted column, or “string_enc” if the column has no name.

n_components_ : int

The number of dimensions of the embeddings after dimensionality reduction.

all_outputs_ : list of str

The names of all the features generated from the fitted column.

See also

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

GapEncoder

Encode string columns by constructing latent topics.

TextEncoder

Encode string columns using pre-trained language models.

Notes

Skrub provides StringEncoder as a simple interface to perform Latent Semantic Analysis (LSA). As such, it does not expose all the hyper-parameters of the underlying vectorizer (TfidfVectorizer or HashingVectorizer) and TruncatedSVD. If you need more flexibility than StringEncoder offers, build your own LSA with a scikit-learn Pipeline, for example:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.decomposition import TruncatedSVD
>>> make_pipeline(TfidfVectorizer(max_df=300), TruncatedSVD())
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(max_df=300)),
            ('truncatedsvd', TruncatedSVD())])
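
A quick check of this sketch on made-up strings:

>>> pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
>>> X_out = pipe.fit_transform(['one text', 'another text', 'a third one'])
>>> X_out.shape
(3, 2)

Note that, unlike StringEncoder, such a pipeline consumes a plain iterable of strings rather than a single dataframe column.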

Examples

>>> import pandas as pd
>>> from skrub import StringEncoder

We will encode the comments using 2 components:

>>> enc = StringEncoder(n_components=2)
>>> X = pd.Series([
...   "The professor snatched a good interview out of the jaws of these questions.",
...   "Bookmarking this to watch later.",
...   "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
>>> enc.fit_transform(X)
   video comments_0  video comments_1
0      8.218069e-01      4.557474e-17
1      6.971618e-16      1.000000e+00
2      8.218069e-01     -3.046564e-16
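
Once fitted, the encoder can embed unseen comments with transform; the exact values depend on the fitted vocabulary, so this sketch only checks the shape:

>>> X_new = pd.Series(["Great song!"], name='video comments')
>>> enc.transform(X_new).shape
(1, 2)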

Methods

fit(column[, y])

Fit the transformer.

fit_transform(X[, y])

Fit the encoder and transform a column.

get_feature_names_out()

Return a list of features generated by the transformer.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a column.

fit(column, y=None)[source]#

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.
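
A minimal sketch of fitting on a named column (made-up data):

>>> import pandas as pd
>>> from skrub import StringEncoder
>>> enc = StringEncoder(n_components=2)
>>> _ = enc.fit(pd.Series(['one', 'two', 'three'], name='x'))
>>> enc.input_name_
'x'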

fit_transform(X, y=None)[source]#

Fit the encoder and transform a column.

Parameters:
X : Pandas or Polars series

The column to transform.

y : None

Unused. Here for compatibility with scikit-learn.

Returns:
X_out : Pandas or Polars dataframe with shape (len(X), n_components_)

The embedding representation of the input.

get_feature_names_out()[source]#

Return a list of features generated by the transformer.

Each feature has the format {input_name}_{n_component}, where input_name is the name of the input column (or a default name for the encoder) and n_component is the index of the component.

Returns:
list of str

The list of feature names.
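
For example, after fitting an encoder with n_components=2 on a column named 'city' (a sketch with made-up data):

>>> import pandas as pd
>>> from skrub import StringEncoder
>>> enc = StringEncoder(n_components=2)
>>> _ = enc.fit(pd.Series(['Paris', 'London', 'Berlin'], name='city'))
>>> enc.get_feature_names_out()
['city_0', 'city_1']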

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
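
For example (a small sketch):

>>> from skrub import StringEncoder
>>> enc = StringEncoder()
>>> _ = enc.set_params(n_components=10, analyzer='word')
>>> enc.get_params()['n_components']
10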

transform(X)[source]#

Transform a column.

Parameters:
X : Pandas or Polars series

The column to transform.

Returns:
result : Pandas or Polars dataframe with shape (len(X), n_components_)

The embedding representation of the input.