ToCategorical#
- class skrub.ToCategorical[source]#
Convert a string column to Categorical dtype.
Note

ToCategorical is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((ToCategorical(), 'col_name_1'), (ToCategorical(), 'col_name_2')) instead of make_column_transformer((ToCategorical(), ['col_name_1', 'col_name_2'])).

The main benefit is that categorical columns can then be recognized by scikit-learn’s HistGradientBoostingRegressor and HistGradientBoostingClassifier with their categorical_features='from_dtype' option. This transformer is therefore particularly useful as the low_cardinality_transformer parameter of the TableVectorizer when combined with one of those supervised learners.

A pandas column with dtype string or object containing strings, or a polars column with dtype String, is converted to a categorical column. Categorical columns are passed through. Any other type of column is rejected by raising a RejectColumn exception. Note: the TableVectorizer only sends string or categorical columns to its low_cardinality_transformer, so it is always safe to use a ToCategorical instance as the low_cardinality_transformer.

The output of transform always has a Categorical dtype, but the categories are not necessarily the same across different calls to transform. Indeed, scikit-learn estimators do not inspect the dtype’s categories but the actual values. Converting to a Categorical is therefore just a way to mark a column and indicate to downstream estimators that it should be treated as categorical. Ensuring categories are encoded consistently, handling unseen categories at test time, etc. is the responsibility of encoders such as OneHotEncoder and LabelEncoder, or of estimators that handle categories themselves, such as HistGradientBoostingRegressor
Examples
>>> import pandas as pd
>>> from skrub import ToCategorical
A string column is converted to a categorical column.
>>> s = pd.Series(['one', 'two', None], name='c')
>>> s
0     one
1     two
2    None
Name: c, dtype: object
>>> to_cat = ToCategorical()
>>> to_cat.fit_transform(s)
0    one
1    two
2    NaN
Name: c, dtype: category
Categories (2, object): ['one', 'two']
The dtypes (the list of categories) of the outputs of transform may vary. This transformer only ensures the dtype is Categorical, to mark the column as such for downstream encoders which will perform the actual encoding.

>>> s = pd.Series(['four', 'five'], name='c')
>>> to_cat.transform(s)
0    four
1    five
Name: c, dtype: category
Categories (2, object): ['five', 'four']
Columns that already have a Categorical dtype are passed through:
>>> s = pd.Series(['one', 'two'], name='c', dtype='category')
>>> to_cat.fit_transform(s) is s
True
Columns that are neither strings nor categorical are rejected:

>>> to_cat.fit_transform(pd.Series([1.1, 2.2], name='c'))
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'c' does not contain strings.
object columns that do not contain only strings are also rejected:

>>> s = pd.Series(['one', 1], name='c')
>>> to_cat.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'c' does not contain strings.
No special handling of StringDtype vs object columns is done; the behavior is the same as that of Series.astype('category'): if the input uses the extension dtype, the categories of the output will, too.

>>> s = pd.Series(['cat A', 'cat B', None], name='c', dtype='string')
>>> s
0    cat A
1    cat B
2     <NA>
Name: c, dtype: string
>>> to_cat.fit_transform(s)
0    cat A
1    cat B
2     <NA>
Name: c, dtype: category
Categories (2, string): [cat A, cat B]
>>> _.cat.categories.dtype
string[python]
Polars string columns are converted to the Categorical dtype (not Enum). As for pandas, categories may vary across calls to transform.

>>> import pytest
>>> pl = pytest.importorskip("polars")
>>> s = pl.Series('c', ['one', 'two', None])
>>> to_cat.fit_transform(s)
shape: (3,)
Series: 'c' [cat]
[
	"one"
	"two"
	null
]
Polars Categorical or Enum columns are passed through:
>>> s = pl.Series('c', ['one', 'two'], dtype=pl.Enum(['one', 'two', 'three']))
>>> s
shape: (2,)
Series: 'c' [enum]
[
	"one"
	"two"
]
>>> to_cat.fit_transform(s) is s
True
Methods

fit(column[, y])
    Fit the transformer.
fit_transform(column[, y])
    Fit the encoder and transform a column.
get_metadata_routing()
    Get metadata routing of this object.
get_params([deep])
    Get parameters for this estimator.
set_fit_request(*[, column])
    Request metadata passed to the fit method.
set_params(**params)
    Set the parameters of this estimator.
set_transform_request(*[, column])
    Request metadata passed to the transform method.
transform(column)
    Transform a column.
- fit(column, y=None)[source]#
Fit the transformer.
Subclasses should implement fit_transform and transform.
- Parameters:
  - column : a pandas or polars Series
    Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.
  - y : column or dataframe
    Prediction targets.
- Returns:
  - self
    The fitted transformer.
- get_metadata_routing()[source]#
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
  - routing : MetadataRequest
    A MetadataRequest encapsulating routing information.
- set_fit_request(*, column='$UNCHANGED$')[source]#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
  - **params : dict
    Estimator parameters.
- Returns:
  - self : estimator instance
    Estimator instance.
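The <component>__<parameter> form can be seen in a short generic sketch (standard scikit-learn behavior, not specific to ToCategorical), where the step name and the nested parameter name are joined by a double underscore:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Address each nested estimator's parameter as <component>__<parameter>.
pipe.set_params(clf__C=0.5, scale__with_mean=False)
print(pipe.get_params()["clf__C"])  # prints 0.5
```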
- set_transform_request(*, column='$UNCHANGED$')[source]#
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Gallery examples#
Encoding: from a dataframe to a numerical matrix for machine learning