skrub.ToCategorical#

Usage examples at the bottom of this page.

class skrub.ToCategorical[source]#

Convert a string column to Categorical dtype.

Note

ToCategorical is a type of single-column transformer. Unlike most scikit-learn estimators, its fit, transform and fit_transform methods expect a single column (a pandas or polars Series) rather than a full dataframe. To apply this transformer to one or more columns in a dataframe, use it as a parameter in a skrub.TableVectorizer or sklearn.compose.ColumnTransformer. In the ColumnTransformer, pass a single column: make_column_transformer((ToCategorical(), 'col_name_1'), (ToCategorical(), 'col_name_2')) instead of make_column_transformer((ToCategorical(), ['col_name_1', 'col_name_2'])).
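The distinction matters because in a ColumnTransformer a string selector hands the transformer a single 1-dimensional column, while a list selector hands it a 2-dimensional dataframe. A minimal sketch of this selection behavior, using scikit-learn's FunctionTransformer as a stand-in to record what each transformer receives (the column names and recording helper are made up for illustration):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"a": ["x", "y"], "b": ["u", "v"]})

# Record the dimensionality of what each transformer receives.
seen_ndim = []

def record(X):
    seen_ndim.append(X.ndim)  # 1 for a Series, 2 for a DataFrame
    return pd.DataFrame(X)

ct = make_column_transformer(
    (FunctionTransformer(record), "a"),    # string selector -> 1D Series
    (FunctionTransformer(record), ["b"]),  # list selector -> 2D DataFrame
)
ct.fit_transform(df)
print(seen_ndim)  # [1, 2]
```

This is why a single-column transformer such as ToCategorical must be paired with a string selector, one tuple per column.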

The main benefit is that categorical columns can then be recognized by scikit-learn’s HistGradientBoostingRegressor and HistGradientBoostingClassifier with their categorical_features='from_dtype' option. This transformer is therefore particularly useful as the low_cardinality_transformer parameter of the TableVectorizer when combined with one of those supervised learners.

A pandas column with dtype string or object containing strings, or a polars column with dtype String, is converted to a categorical column. Categorical columns are passed through.

Any other type of column is rejected by raising a RejectColumn exception. Note: the TableVectorizer only sends string or categorical columns to its low_cardinality_transformer. Therefore it is always safe to use a ToCategorical instance as the low_cardinality_transformer.

The output of transform also always has a Categorical dtype. The categories are not necessarily the same across different calls to transform. Indeed, scikit-learn estimators do not inspect the dtype’s categories but the actual values. Converting to a Categorical is therefore just a way to mark a column and indicate to downstream estimators that this column should be treated as categorical. Ensuring categories are encoded consistently, handling unseen categories at test time, etc. is the responsibility of encoders such as OneHotEncoder and LabelEncoder, or of estimators that handle categories themselves, such as HistGradientBoostingRegressor.

Examples

>>> import pandas as pd
>>> from skrub import ToCategorical

A string column is converted to a categorical column.

>>> s = pd.Series(['one', 'two', None], name='c')
>>> s
0     one
1     two
2    None
Name: c, dtype: object
>>> to_cat = ToCategorical()
>>> to_cat.fit_transform(s)
0    one
1    two
2    NaN
Name: c, dtype: category
Categories (2, object): ['one', 'two']

The dtype (i.e. the list of categories) of the output of transform may vary across calls. This transformer only ensures the dtype is Categorical, marking the column as such for the downstream encoders that will perform the actual encoding.

>>> s = pd.Series(['four', 'five'], name='c')
>>> to_cat.transform(s)
0    four
1    five
Name: c, dtype: category
Categories (2, object): ['five', 'four']

Columns that already have a Categorical dtype are passed through:

>>> s = pd.Series(['one', 'two'], name='c', dtype='category')
>>> to_cat.fit_transform(s) is s
True

Columns that are neither strings nor categorical are rejected:

>>> to_cat.fit_transform(pd.Series([1.1, 2.2], name='c'))
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'c' does not contain strings.

Columns of dtype object that contain non-string values are also rejected:

>>> s = pd.Series(['one', 1], name='c')
>>> to_cat.fit_transform(s)
Traceback (most recent call last):
    ...
skrub._on_each_column.RejectColumn: Column 'c' does not contain strings.

StringDtype and object columns receive no special handling; the behavior is the same as Series.astype('category'): if the input uses the extension dtype, the categories of the output will, too.

>>> s = pd.Series(['cat A', 'cat B', None], name='c', dtype='string')
>>> s
0    cat A
1    cat B
2     <NA>
Name: c, dtype: string
>>> to_cat.fit_transform(s)
0    cat A
1    cat B
2     <NA>
Name: c, dtype: category
Categories (2, string): [cat A, cat B]
>>> _.cat.categories.dtype
string[python]

Polars String columns are converted to the Categorical dtype (not Enum). As with pandas, categories may vary across calls to transform.

>>> import pytest
>>> pl = pytest.importorskip("polars")
>>> s = pl.Series('c', ['one', 'two', None])
>>> to_cat.fit_transform(s)
shape: (3,)
Series: 'c' [cat]
[
    "one"
    "two"
    null
]

Polars Categorical or Enum columns are passed through:

>>> s = pl.Series('c', ['one', 'two'], dtype=pl.Enum(['one', 'two', 'three']))
>>> s
shape: (2,)
Series: 'c' [enum]
[
    "one"
    "two"
]
>>> to_cat.fit_transform(s) is s
True

Methods

fit(column[, y])

Fit the transformer.

fit_transform(column[, y])

Fit the encoder and transform a column.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, column])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

set_transform_request(*[, column])

Request metadata passed to the transform method.

transform(column)

Transform a column.

fit(column, y=None)[source]#

Fit the transformer.

Subclasses should implement fit_transform and transform.

Parameters:
column : a pandas or polars Series

Unlike most scikit-learn transformers, single-column transformers transform a single column, not a whole dataframe.

y : column or dataframe

Prediction targets.

Returns:
self

The fitted transformer.

fit_transform(column, y=None)[source]#

Fit the encoder and transform a column.

Parameters:
column : pandas or polars Series

The input to transform.

y : None

Ignored.

Returns:
transformed : pandas or polars Series

The input transformed to Categorical.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_fit_request(*, column='$UNCHANGED$')[source]#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for column parameter in fit.

Returns:
self : object

The updated object.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

set_transform_request(*, column='$UNCHANGED$')[source]#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
column : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for column parameter in transform.

Returns:
self : object

The updated object.

transform(column)[source]#

Transform a column.

Parameters:
column : pandas or polars Series

The input to transform.

Returns:
transformed : pandas or polars Series

The input transformed to Categorical.

Examples using skrub.ToCategorical#

Encoding: from a dataframe to a numerical matrix for machine learning
