StringEncoder
- class skrub.StringEncoder(n_components=30, vectorizer='tfidf', ngram_range=(3, 4), analyzer='char_wb')
Generate a lightweight string encoding of a given column using tf-idf vectorization and truncated singular value decomposition (SVD).
Note
StringEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((StringEncoder(), 'col_name_1'), (StringEncoder(), 'col_name_2')) instead of make_column_transformer((StringEncoder(), ['col_name_1', 'col_name_2'])).

First, apply a tf-idf vectorization of the text, then reduce the dimensionality with a truncated SVD with the given number of components.
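The two steps described above can be approximated with plain scikit-learn building blocks. The following is a minimal sketch, not skrub's actual implementation: the pipeline, the variable names and the `random_state` value are assumptions made for illustration, and the resulting values need not match those produced by StringEncoder.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

comments = pd.Series(
    [
        "The professor snatched a good interview out of the jaws of these questions.",
        "Bookmarking this to watch later.",
        "When you don't know the lyrics of the song except the chorus",
    ],
    name="video comments",
)

# Step 1: tf-idf over character n-grams (mirroring the documented defaults).
# Step 2: truncated SVD down to a small number of components.
sketch = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    TruncatedSVD(n_components=2, random_state=0),
)

embedding = sketch.fit_transform(comments)
print(embedding.shape)  # one row per comment, one column per SVD component
```

The sparse tf-idf matrix has one column per distinct n-gram, so the SVD step is what keeps the encoding lightweight.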
New features will be named {col_name}_{component} if the series has a name, and tsvd_{component} if it does not.
- Parameters:
- n_components : int, default=30
Number of components to be used for the singular value decomposition (SVD). Must be a positive integer.
- vectorizer : str, "tfidf" or "hashing", default="tfidf"
Vectorizer to apply to the strings, either tfidf or hashing for scikit-learn TfidfVectorizer or HashingVectorizer respectively.
- ngram_range : tuple of (int, int) pairs, default=(3, 4)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- analyzer : str, "char", "word" or "char_wb", default="char_wb"
Whether the feature should be made of word or character n-grams. Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
See also
MinHashEncoder
Encode string columns as a numeric array with the minhash method.
GapEncoder
Encode string columns by constructing latent topics.
TextEncoder
Encode string columns using pre-trained language models.
Examples
>>> import pandas as pd
>>> from skrub import StringEncoder
We will encode the comments using 2 components:
>>> enc = StringEncoder(n_components=2)
>>> X = pd.Series([
...   "The professor snatched a good interview out of the jaws of these questions.",
...   "Bookmarking this to watch later.",
...   "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
>>> enc.fit_transform(X)
   video comments_0  video comments_1
0      8.218069e-01      4.557474e-17
1      6.971618e-16      1.000000e+00
2      8.218069e-01     -3.046564e-16
Methods
fit(column[, y])           Fit the transformer.
fit_transform(X[, y])      Fit the encoder and transform a column.
get_feature_names_out()    Get output feature names for transformation.
get_params([deep])         Get parameters for this estimator.
set_params(**params)       Set the parameters of this estimator.
transform(X)               Transform a column.
- fit(column, y=None)
Fit the transformer.
Subclasses should implement fit_transform and transform.
- Parameters:
- column : a pandas or polars Series
Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
- y : column or dataframe
Prediction targets.
- Returns:
- self
The fitted transformer.
- fit_transform(X, y=None)
Fit the encoder and transform a column.
- Parameters:
- X : pandas or polars Series
The column to transform.
- y : None
Unused. Here for compatibility with scikit-learn.
- Returns:
- X_out : pandas or polars dataframe with shape (len(X), tsvd_n_components)
The embedding representation of the input.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
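The <component>__<parameter> form can be sketched with a scikit-learn Pipeline. The step names "tfidf" and "svd" and the parameter values are assumptions chosen for the example; the same convention applies to any nested scikit-learn-compatible estimator.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
        ("svd", TruncatedSVD(n_components=30)),
    ]
)

# Nested parameters are addressed as <component>__<parameter>.
pipeline.set_params(svd__n_components=10, tfidf__ngram_range=(2, 3))

print(pipeline.get_params()["svd__n_components"])
```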
Gallery examples

Various string encoders: a sentiment analysis example