TextEncoder
- class skrub.TextEncoder(model_name='intfloat/e5-small-v2', n_components=30, device=None, batch_size=32, token_env_variable=None, cache_folder=None, store_weights_in_pickle=False, random_state=None, verbose=False)
Encode string features by applying a pretrained language model downloaded from the HuggingFace Hub.
Note

TextEncoder is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((TextEncoder(), 'col_name_1'), (TextEncoder(), 'col_name_2')) instead of make_column_transformer((TextEncoder(), ['col_name_1', 'col_name_2'])).
This is a thin wrapper around SentenceTransformer that follows the scikit-learn API, making it usable within a scikit-learn pipeline.

Warning
To use this class, you need to install the optional transformers dependencies for skrub. See the "deep learning dependencies" section in the Install guide for more details.

This class uses a pre-trained model, so calling fit or fit_transform will not train or fine-tune the model. Instead, the model is loaded from disk, and a PCA is fitted to reduce the dimension of the language model's output, if n_components is not None.

When PCA is disabled, this class is essentially stateless: loading the pre-trained model from disk is the only difference between fit_transform and transform.

Be aware that parallelizing this class (e.g., using TableVectorizer with n_jobs > 1) may be computationally expensive, because a copy of the pre-trained model is loaded into memory for each thread. We therefore recommend keeping the default n_jobs=None (or setting it to 1) in the TableVectorizer and letting PyTorch handle parallelism. If memory usage is a concern, check the characteristics of your selected model.
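For example, here is a minimal sketch of routing text columns to a TextEncoder through a TableVectorizer (assuming the optional deep learning dependencies are installed; the dataframe and column names are illustrative):

>>> import pandas as pd
>>> from skrub import TableVectorizer, TextEncoder
>>> df = pd.DataFrame({
...     "comment": [
...         "Great explanation, thanks!",
...         "The audio is too quiet.",
...         "Subscribed after watching this.",
...     ],
...     "likes": [10, 3, 7],
... })
>>> # Free-text string columns with many unique values (see the
>>> # TableVectorizer's cardinality_threshold) are routed to the
>>> # high_cardinality encoder; numeric columns are handled separately.
>>> vectorizer = TableVectorizer(high_cardinality=TextEncoder(n_components=2))
>>> embeddings = vectorizer.fit_transform(df)  # doctest: +SKIP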
- Parameters:

  - model_name : str, default="intfloat/e5-small-v2"

    If a filepath on disk is passed, this class loads the model from that path. Otherwise, it first tries to download a pre-trained SentenceTransformer model. If that fails, it tries to construct a model from the Hugging Face models repository with that name.

    The following models have a good performance/memory usage tradeoff:

    - intfloat/e5-small-v2
    - all-MiniLM-L6-v2
    - all-mpnet-base-v2

    You can find more options in the sentence-transformers documentation.

    The default model is a shrunk version of e5-v2, which has shown good performance in the benchmark of [1].

  - n_components : int or None, default=30

    The number of embedding dimensions. As the number of dimensions differs across embedding models, this class uses a PCA to reduce the embeddings to n_components dimensions during transform. Set n_components=None to skip the PCA dimension reduction. See [1] for more details on the choice of the PCA and of the default n_components.

  - device : str, default=None

    Device (e.g. "cpu", "cuda", "mps") that should be used for computation. If None, checks whether a GPU can be used. Note that macOS ARM64 users can enable the GPU on their local machine by setting device="mps".

  - batch_size : int, default=32

    The batch size to use during transform.

  - token_env_variable : str, default=None

    The name of the environment variable which stores your HuggingFace authentication token, used to download private models. Note that only the name of the variable is stored, not the token itself. See the sketch after this parameter list.

  - cache_folder : str, default=None

    Path where models are stored. By default ~/skrub_data; see skrub.datasets._utils.get_data_dir(). Note that when unpickling a TextEncoder on another machine, the cache_folder path needs to be accessible to store the downloaded model.

  - store_weights_in_pickle : bool, default=False

    Whether to keep the loaded sentence-transformers model in the TextEncoder when pickling.

    When set to False, the _estimator attribute is removed from the object to pickle, which significantly reduces the size of the serialized object. Note that when the serialized object is unpickled on another machine, the TextEncoder will try to download the sentence-transformers model again from the HuggingFace Hub. This could fail if, for example, the machine doesn't have internet access. Additionally, if you use weights stored on disk that are not on the HuggingFace Hub (by passing a path to model_name), these weights will not be pickled either, so you would need to copy them to the machine where you unpickle the TextEncoder.

    When set to True, the _estimator attribute is included in the serialized object. Users deploying fine-tuned models stored on disk are recommended to use this option. Note that the machine where the TextEncoder is unpickled must have the same device as the machine where it was pickled.

  - random_state : int, RandomState instance or None, default=None

    Used when the PCA dimension reduction mechanism is enabled, for reproducible results across multiple function calls.

  - verbose : bool, default=False

    Whether to show a progress bar during transform.
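A minimal sketch combining token_env_variable and store_weights_in_pickle (the environment variable name, token value and model name are hypothetical):

>>> import os
>>> import pickle
>>> os.environ["HF_TOKEN"] = "hf_your_token_here"  # hypothetical token value
>>> enc = TextEncoder(
...     model_name="your-org/private-model",       # hypothetical private model
...     token_env_variable="HF_TOKEN",             # the variable's name, not the token
...     store_weights_in_pickle=True,              # keep model weights in the pickle
... )  # doctest: +SKIP
>>> payload = pickle.dumps(enc)  # doctest: +SKIP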
- Attributes:
See also
MinHashEncoder
Encode string columns as a numeric array with the minhash method.
GapEncoder
Encode string columns by constructing latent topics.
SimilarityEncoder
Encode string columns as a numeric array with n-gram string similarity.
References
[1] L. Grinsztajn, M. Kim, E. Oyallon, G. Varoquaux. "Vectorizing string entries for data processing on tables: when are larger language models better?", 2023. https://hal.science/hal-04345931
Examples
>>> import pandas as pd
>>> from skrub import TextEncoder
Let’s encode video comments using only 2 embedding dimensions:
>>> enc = TextEncoder(model_name='intfloat/e5-small-v2', n_components=2)
>>> X = pd.Series([
...     "The professor snatched a good interview out of the jaws of these questions.",
...     "Bookmarking this to watch later.",
...     "When you don't know the lyrics of the song except the chorus",
... ], name='video comments')
Fitting does not train the underlying pre-trained deep-learning model, but it performs various checks and enables the dimension reduction.
>>> enc.fit_transform(X)
   video comments_1  video comments_2
0          0.411395          0.096504
1         -0.105210         -0.344567
2         -0.306184          0.248063
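The fitted encoder can then embed unseen comments with transform; the values depend on the model and the PCA fitted above:

>>> X_new = pd.Series(["A brand new comment."], name='video comments')
>>> enc.transform(X_new)  # doctest: +SKIP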
Methods

- fit(column[, y]) : Fit the transformer.
- fit_transform(column[, y]) : Fit the TextEncoder from column.
- get_feature_names_out() : Get output feature names for transformation.
- get_metadata_routing() : Get metadata routing of this object.
- get_params([deep]) : Get parameters for this estimator.
- set_fit_request(*[, column]) : Request metadata passed to the fit method.
- set_output(*[, transform]) : Set output container.
- set_params(**params) : Set the parameters of this estimator.
- set_transform_request(*[, column]) : Request metadata passed to the transform method.
- transform(column) : Transform column using the TextEncoder.

- fit(column, y=None)
Fit the transformer.
  Subclasses should implement fit_transform and transform.

  - Parameters:
    - column : a pandas or polars Series
      Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
    - y : column or dataframe
      Prediction targets.

  - Returns:
    - self
      The fitted transformer.
- fit_transform(column, y=None)

  Fit the TextEncoder from column.

  In practice, it loads the pre-trained model from disk and returns the embeddings of the column.
- get_metadata_routing()

  Get metadata routing of this object.

  Please check User Guide on how the routing mechanism works.

  - Returns:
    - routing : MetadataRequest
      A MetadataRequest encapsulating routing information.
- set_fit_request(*, column='$UNCHANGED$')

  Request metadata passed to the fit method.

  Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

  The options for each parameter are:

  - True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to fit.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

  The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

  Added in version 1.3.

  Note

  This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_output(*, transform=None)

  Set output container.

  See Introducing the set_output API for an example on how to use the API.

  - Parameters:
    - transform : {"default", "pandas", "polars"}, default=None
      Configure output of transform and fit_transform.
      - "default": Default output format of a transformer
      - "pandas": DataFrame output
      - "polars": Polars output
      - None: Transform configuration is unchanged
      Added in version 1.4: the "polars" option was added.

  - Returns:
    - self : estimator instance
      Estimator instance.
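For instance, a sketch of requesting polars output from transform:

>>> enc = TextEncoder(n_components=2).set_output(transform="polars")  # doctest: +SKIP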
- set_params(**params)

  Set the parameters of this estimator.

  The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

  - Parameters:
    - **params : dict
      Estimator parameters.

  - Returns:
    - self : estimator instance
      Estimator instance.
- set_transform_request(*, column='$UNCHANGED$')

  Request metadata passed to the transform method.

  Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

  The options for each parameter are:

  - True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to transform.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

  The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

  Added in version 1.3.

  Note

  This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- transform(column)

  Transform column using the TextEncoder.

  This method uses the embedding model loaded in memory during fit or fit_transform.

  - Parameters:
    - column : pandas or polars Series of shape (n_samples,)
      The string column to compute embeddings from.

  - Returns:
    - X_out : pandas or polars DataFrame of shape (n_samples, n_components)
      The embedding representation of the input.
Gallery examples
Various string encoders: a sentiment analysis example