Transforming a table into an array of numeric features: TableVectorizer

In tabular machine learning pipelines, practitioners often convert categorical features to numeric features using various encodings (OneHotEncoder, OrdinalEncoder, etc.).

The objective of the TableVectorizer is to take any dataframe as input and produce a feature-engineered version of it as output.
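
For instance, here is a minimal sketch using a made-up toy dataframe:

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> df = pd.DataFrame({
...     "city": ["Paris", "London", "Berlin"],
...     "temperature": [21, 17, 19],
... })
>>> features = TableVectorizer().fit_transform(df)

The resulting features dataframe contains only numeric columns and can be passed directly to a scikit-learn estimator.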

First, the TableVectorizer parses the data type of each column and maps it to an encoder, in order to produce numeric features for machine learning models.

Parsing is handled internally by running a Cleaner on the input data. Note that in this case numeric values are always converted to float32 (whereas the default Cleaner behavior is to keep the original datatype): this ensures that the numeric dtype, including that of missing values, is consistent across downstream methods. For most applications, float32 provides sufficient precision and reduces the memory footprint of the resulting features.
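
A small illustration of this behavior (the column name is made up; the output shown is pandas' default dtype listing):

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> TableVectorizer().fit_transform(pd.DataFrame({"x": [1, 2, 3]})).dtypes
x    float32
dtype: object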

The same parameters used for the Cleaner can also be set when creating the TableVectorizer: this includes the DropUninformative parameters (drop_null_fraction, etc.) and a datetime_format parameter for the datetime-parsing step.
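
For example, a sketch of setting these parameters directly on the TableVectorizer (the values below are arbitrary illustrations):

>>> from skrub import TableVectorizer
>>> table_vec = TableVectorizer(drop_null_fraction=0.5, datetime_format="%d/%m/%Y")

Here, columns whose fraction of null values exceeds 0.5 are dropped, and the datetime-parsing step uses the given format instead of inferring one.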

After detecting the datatypes, the TableVectorizer maps each column to one of four groups, depending on the datatype and, for categorical/string columns, on the number of unique values.

The default transformers used by the TableVectorizer for each column category are the following:

  • High-cardinality categorical columns: StringEncoder

  • Low-cardinality categorical columns: scikit-learn OneHotEncoder

  • Numeric columns: “passthrough” (no transformation)

  • Datetime columns: DatetimeEncoder
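
After fitting, the group assigned to each column can be inspected; below is a sketch, assuming the column_to_kind_ attribute of recent skrub versions (the exact output may vary):

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> df = pd.DataFrame({
...     "quantity": [3, 1, 2],
...     "date": ["2024-01-01", "2024-02-01", "2024-03-01"],
...     "category": ["a", "b", "a"],
... })
>>> table_vec = TableVectorizer()
>>> _ = table_vec.fit(df)
>>> table_vec.column_to_kind_
{'quantity': 'numeric', 'date': 'datetime', 'category': 'low_cardinality'}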

High-cardinality categorical columns are those with more than 40 unique values, while all other categorical columns are considered low-cardinality. The threshold can be changed by setting the cardinality_threshold parameter of the TableVectorizer, or by changing the configuration parameter with the same name using set_config().
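
For instance, a minimal sketch that treats columns with more than 10 unique values as high-cardinality:

>>> from skrub import TableVectorizer
>>> table_vec = TableVectorizer(cardinality_threshold=10)

or, to change the default globally through the skrub configuration:

>>> import skrub
>>> skrub.set_config(cardinality_threshold=10)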

To change the encoder or alter default parameters, instantiate an encoder and pass it to TableVectorizer.

>>> from skrub import TableVectorizer, DatetimeEncoder, TextEncoder, SquashingScaler
>>> datetime_enc = DatetimeEncoder(periodic_encoding="circular")
>>> text_enc = TextEncoder()
>>> num_enc = SquashingScaler()
>>> table_vec = TableVectorizer(datetime=datetime_enc, high_cardinality=text_enc, numeric=num_enc)
>>> table_vec
TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='circular'),
                high_cardinality=TextEncoder(), numeric=SquashingScaler())

Besides the transformers provided by skrub, the TableVectorizer can also take user-specified transformers that are applied to given columns.

>>> from sklearn.preprocessing import OrdinalEncoder
>>> import pandas as pd
>>> encoder = OrdinalEncoder()
>>> df = pd.DataFrame({
...     "values": ["A", "B", "C"]
... })

We define the list of column-specific transformers:

>>> specific_transformers = [(encoder, ["values"])]

We can then transform the dataframe:

>>> TableVectorizer(specific_transformers=specific_transformers).fit_transform(df)
   values
0     0.0
1     1.0
2     2.0

Note that the columns specified in specific_transformers are passed to the transformer without any modification, so the transformer must be able to handle the content of the column on its own.

If you need to define complex transformers to pass to a single instance of TableVectorizer, consider using the skrub Data Ops, ApplyToCols, or the skrub selectors instead: they are more versatile and give finer control over which operations are applied to which columns.
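
As an illustration, here is a minimal sketch combining ApplyToCols with a skrub selector (assuming a recent skrub version; the object is only constructed, not fitted):

>>> from sklearn.preprocessing import OrdinalEncoder
>>> from skrub import ApplyToCols, selectors as s
>>> encode_strings = ApplyToCols(OrdinalEncoder(), cols=s.string())

The intent is that each selected (string) column is encoded with its own clone of the transformer, while the remaining columns are left untouched.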

The TableVectorizer is used in the example "Encoding: from a dataframe to a numerical matrix for machine learning", while the docstring of the class provides more details on the parameters and usage, as well as various examples.