StringEncoder#
- class skrub.StringEncoder(n_components=30, vectorizer='tfidf', ngram_range=(3, 4), analyzer='char_wb', stop_words=None, random_state=None)[source]#
Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD). First, apply a tf-idf vectorization of the text, then reduce the dimensionality with a truncated SVD with the given number of components.
Note
StringEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((StringEncoder(), 'col_name_1'), (StringEncoder(), 'col_name_2')) instead of make_column_transformer((StringEncoder(), ['col_name_1', 'col_name_2'])); a runnable sketch of both options follows.
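For illustration, a minimal sketch of the two routes described in this note; the column names are the placeholders used above, and the high_cardinality parameter of TableVectorizer is an assumption about its signature rather than something this page documents:

>>> from sklearn.compose import make_column_transformer
>>> from skrub import StringEncoder, TableVectorizer

>>> # One StringEncoder per column in a ColumnTransformer:
>>> ct = make_column_transformer(
...     (StringEncoder(), 'col_name_1'),
...     (StringEncoder(), 'col_name_2'),
... )

>>> # Or let a TableVectorizer dispatch the encoder to high-cardinality
>>> # string columns (high_cardinality is an assumed parameter name):
>>> tv = TableVectorizer(high_cardinality=StringEncoder())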
New features will be named {col_name}_{component} if the series has a name, and tsvd_{component} if it does not.
- Parameters:
- n_components : int, default=30
  Number of components to be used for the singular value decomposition (SVD). Must be a positive integer.
- vectorizer : str, "tfidf" or "hashing", default="tfidf"
  Vectorizer to apply to the strings, either tfidf or hashing for scikit-learn TfidfVectorizer or HashingVectorizer respectively.
- ngram_range : tuple of (int, int), default=(3, 4)
  The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzer : str, "char", "word" or "char_wb", default="char_wb"
  Whether the feature should be made of word or character n-grams. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- stop_words : {'english'}, list, default=None
  If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used.
- random_state : int, RandomState instance or None, default=None
  Used during randomized SVD. Pass an int for reproducible results across multiple function calls.
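For illustration, the parameters above can be combined into a word-level variant; these settings are arbitrary examples rather than recommended values:

>>> from skrub import StringEncoder

>>> enc = StringEncoder(
...     n_components=10,       # size of the SVD output space
...     vectorizer='hashing',  # HashingVectorizer instead of TfidfVectorizer
...     analyzer='word',       # word n-grams instead of character n-grams
...     ngram_range=(1, 2),    # unigrams and bigrams
...     stop_words='english',  # only applies because analyzer == 'word'
...     random_state=0,        # reproducible randomized SVD
... )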
- Attributes:
See also
MinHashEncoder : Encode string columns as a numeric array with the minhash method.
GapEncoder : Encode string columns by constructing latent topics.
TextEncoder : Encode string columns using pre-trained language models.
Notes
Skrub provides StringEncoder as a simple interface to perform Latent Semantic Analysis (LSA). As such, it doesn't support all hyper-parameters exposed by the underlying TfidfVectorizer, HashingVectorizer and TruncatedSVD. If you need more flexibility than the proposed hyper-parameters of StringEncoder, you must create your own LSA using a scikit-learn Pipeline, such as:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.decomposition import TruncatedSVD

>>> make_pipeline(TfidfVectorizer(max_df=300), TruncatedSVD())
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(max_df=300)),
                ('truncatedsvd', TruncatedSVD())])
Examples
>>> import pandas as pd
>>> from skrub import StringEncoder

>>> X = pd.Series([
...     "The professor snatched a good interview out of the jaws of these questions.",
...     "Bookmarking this to watch later.",
...     "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')

We will encode the comments using 2 components:

>>> enc = StringEncoder(n_components=2)
>>> enc.fit_transform(X)
   video comments_0  video comments_1
0          1.322973         -0.163070
1          0.379688          1.659319
2          1.306400         -0.317120
Methods
fit(column[, y]) : Fit the transformer.
fit_transform(X[, y]) : Fit the encoder and transform a column.
get_feature_names_out([input_features]) : Return a list of features generated by the transformer.
get_params([deep]) : Get parameters for this estimator.
set_params(**params) : Set the parameters of this estimator.
transform(X) : Transform a column.
- fit(column, y=None)[source]#
Fit the transformer.
Subclasses should implement fit_transform and transform.
- Parameters:
- column : a pandas or polars Series
  Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
- y : column or dataframe
  Prediction targets.
- Returns:
- self
The fitted transformer.
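A minimal sketch of fit followed by transform on an illustrative series (the strings and column name are invented for this sketch); since fit returns the fitted transformer, the calls can be chained:

>>> import pandas as pd
>>> from skrub import StringEncoder

>>> X = pd.Series(['one two', 'three four', 'five six'], name='col')
>>> enc = StringEncoder(n_components=2).fit(X)
>>> enc.transform(pd.Series(['seven eight'], name='col')).shape
(1, 2)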
- fit_transform(X, y=None)[source]#
Fit the encoder and transform a column.
- Parameters:
- X : Pandas or Polars series
The column to transform.
- y : None
  Unused. Here for compatibility with scikit-learn.
- Returns:
- X_out : Pandas or Polars dataframe with shape (len(X), n_components)
The embedding representation of the input.
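A sketch with an invented series: the result is a dataframe with n_components columns, named after the input column as described above:

>>> import pandas as pd
>>> from skrub import StringEncoder

>>> X = pd.Series(['one two', 'three four', 'five six'], name='col')
>>> out = StringEncoder(n_components=2).fit_transform(X)
>>> out.shape
(3, 2)
>>> list(out.columns)
['col_0', 'col_1']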
- get_feature_names_out(input_features=None)[source]#
Return a list of features generated by the transformer.
Each feature has format {input_name}_{n_component}, where input_name is the name of the input column, or a default name for the encoder, and n_component is the index of the specific feature.
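A sketch continuing the naming scheme above, assuming an encoder with two components fitted on a series named 'col' (the exact return container may differ by version; the description above states it is a list):

>>> import pandas as pd
>>> from skrub import StringEncoder

>>> enc = StringEncoder(n_components=2)
>>> enc = enc.fit(pd.Series(['one two', 'three four', 'five six'], name='col'))
>>> enc.get_feature_names_out()
['col_0', 'col_1']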
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
  Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
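A sketch of updating a parameter after construction; the repr shown assumes scikit-learn's convention of printing only non-default parameters:

>>> from skrub import StringEncoder

>>> enc = StringEncoder()
>>> enc.set_params(n_components=10)
StringEncoder(n_components=10)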
Gallery examples#
Various string encoders: a sentiment analysis example
Multiple tables: building machine learning pipelines with DataOps