ToCategorical#
- class skrub.ToCategorical[source]#
Convert a string column to Categorical dtype.
Note

ToCategorical is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((ToCategorical(), 'col_name_1'), (ToCategorical(), 'col_name_2')) instead of make_column_transformer((ToCategorical(), ['col_name_1', 'col_name_2'])).

The main benefit is that categorical columns can then be recognized by scikit-learn’s HistGradientBoostingRegressor and HistGradientBoostingClassifier with their categorical_features='from_dtype' option. This transformer is therefore particularly useful as the low_cardinality_transformer parameter of the TableVectorizer when combined with one of those supervised learners.

A pandas column with dtype string or object containing strings, or a polars column with dtype String, is converted to a categorical column. Categorical columns are passed through. Any other type of column is rejected by raising a RejectColumn exception. Note: the TableVectorizer only sends string or categorical columns to its low_cardinality_transformer, so it is always safe to use a ToCategorical instance as the low_cardinality_transformer.

The output of transform always has a Categorical dtype, but the categories are not necessarily the same across different calls to transform. Indeed, scikit-learn estimators do not inspect the dtype’s categories but the actual values. Converting to a Categorical is therefore just a way to mark a column and indicate to downstream estimators that it should be treated as categorical. Ensuring categories are encoded consistently, handling unseen categories at test time, etc. is the responsibility of encoders such as OneHotEncoder and LabelEncoder, or of estimators that handle categories themselves, such as HistGradientBoostingRegressor
Examples
>>> import pandas as pd
>>> from skrub import ToCategorical
A string column is converted to a categorical column.
>>> s = pd.Series(['one', 'two', None], name='c')
>>> s
0     one
1     two
2    None
Name: c, dtype: object
>>> to_cat = ToCategorical()
>>> to_cat.fit_transform(s)
0    one
1    two
2    NaN
Name: c, dtype: category
Categories (2, object): ['one', 'two']
The dtypes (the list of categories) of the outputs of transform may vary. This transformer only ensures the dtype is Categorical, to mark the column as such for downstream encoders which will perform the actual encoding.

>>> s = pd.Series(['four', 'five'], name='c')
>>> to_cat.transform(s)
0    four
1    five
Name: c, dtype: category
Categories (2, object): ['five', 'four']
Columns that already have a Categorical dtype are passed through:
>>> s = pd.Series(['one', 'two'], name='c', dtype='category')
>>> to_cat.fit_transform(s) is s
True
Columns that are neither strings nor categorical are rejected:

>>> to_cat.fit_transform(pd.Series([1.1, 2.2], name='c'))
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'c' does not contain strings.
object columns that do not contain only strings are also rejected:

>>> s = pd.Series(['one', 1], name='c')
>>> to_cat.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'c' does not contain strings.
No special handling of StringDtype vs object columns is done; the behavior is the same as that of Series.astype('category'): if the input uses the extension dtype, the categories of the output will, too.

>>> s = pd.Series(['cat A', 'cat B', None], name='c', dtype='string')
>>> s
0    cat A
1    cat B
2     <NA>
Name: c, dtype: string
>>> to_cat.fit_transform(s)
0    cat A
1    cat B
2     <NA>
Name: c, dtype: category
Categories (2, string): [cat A, cat B]
>>> _.cat.categories.dtype
string[python]
Polars string columns are converted to the Categorical dtype (not Enum). As for pandas, categories may vary across calls to transform.

>>> import pytest
>>> pl = pytest.importorskip("polars")
>>> s = pl.Series('c', ['one', 'two', None])
>>> to_cat.fit_transform(s)
shape: (3,)
Series: 'c' [cat]
[
	"one"
	"two"
	null
]
Polars Categorical or Enum columns are passed through:
>>> s = pl.Series('c', ['one', 'two'], dtype=pl.Enum(['one', 'two', 'three']))
>>> s
shape: (2,)
Series: 'c' [enum]
[
	"one"
	"two"
]
>>> to_cat.fit_transform(s) is s
True
Methods

fit(column[, y])
    Fit the transformer.
fit_transform(column[, y])
    Fit the encoder and transform a column.
get_metadata_routing()
    Get metadata routing of this object.
get_params([deep])
    Get parameters for this estimator.
set_fit_request(*[, column])
    Request metadata passed to the fit method.
set_params(**params)
    Set the parameters of this estimator.
set_transform_request(*[, column])
    Request metadata passed to the transform method.
transform(column)
    Transform a column.
- fit(column, y=None)[source]#
Fit the transformer.
Subclasses should implement fit_transform and transform.
- Parameters:
  - column : a pandas or polars Series
    Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
  - y : column or dataframe
    Prediction targets.
- Returns:
  - self
    The fitted transformer.
- get_metadata_routing()[source]#
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
  - routing : MetadataRequest
    A MetadataRequest encapsulating routing information.
- set_fit_request(*, column='$UNCHANGED$')[source]#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
  - **params : dict
    Estimator parameters.
- Returns:
  - self : estimator instance
    Estimator instance.
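The <component>__<parameter> form can be seen in a short generic sketch (standard scikit-learn behavior, not specific to ToCategorical), where the step name and the nested parameter name are joined by a double underscore:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Address each nested estimator's parameter as <component>__<parameter>.
pipe.set_params(clf__C=0.5, scale__with_mean=False)
print(pipe.get_params()["clf__C"])  # prints 0.5
```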
- set_transform_request(*, column='$UNCHANGED$')[source]#
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Gallery examples#
Encoding: from a dataframe to a numerical matrix for machine learning