Transforming a table into an array of numeric features: TableVectorizer
In tabular machine learning pipelines, practitioners often convert categorical features to numeric features using various encodings (OneHotEncoder, OrdinalEncoder, etc.).
The objective of the TableVectorizer is to take any dataframe as input, and produce as output a feature-engineered version of the dataframe.
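For instance, a minimal sketch (the toy columns below are invented for illustration):

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> df = pd.DataFrame({
...     "department": ["IT", "HR", "IT"],
...     "salary": [45000, 52000, 49000],
... })
>>> features = TableVectorizer().fit_transform(df)

Here, features is a fully numeric dataframe, ready to be passed to a scikit-learn estimator.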
Initially, the TableVectorizer parses the data type of each column and maps each column to an encoder, in order to produce numeric features for machine learning models. Parsing is handled internally by running a Cleaner on the input data.
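The Cleaner can also be run on its own. A small sketch (assuming its default datetime parsing, the string dates below end up with a proper datetime dtype):

>>> import pandas as pd
>>> from skrub import Cleaner
>>> dates = pd.DataFrame({"when": ["2024-01-01", "2024-06-01"]})
>>> cleaned = Cleaner().fit_transform(dates)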
Note that in this case numeric values are always converted to float32 (whereas the default Cleaner behavior is to keep the original datatype): this ensures that the numeric dtype (including that of the missing values) is consistent for the downstream methods. For most applications, float32 has sufficient precision and reduces the memory footprint of the resulting features.
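A quick way to check this conversion (a sketch; an integer column goes through as float32):

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> out = TableVectorizer().fit_transform(pd.DataFrame({"x": [1, 2, 3]}))
>>> str(out["x"].dtype)
'float32'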
The same parameters used for the Cleaner can also be set when creating the TableVectorizer: this includes the parameters for DropUninformative (drop_null_fraction, etc.), and a datetime_format parameter for the datetime parsing step.
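For example (a sketch; the parameter values here are arbitrary):

>>> from skrub import TableVectorizer
>>> vectorizer = TableVectorizer(
...     drop_null_fraction=0.9,      # DropUninformative: drop columns that are mostly null
...     datetime_format="%Y-%m-%d",  # hint for the datetime parsing step
... )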
After detecting the datatypes, the TableVectorizer maps columns to one of four groups, depending on the datatype and, for categorical/string columns, on the number of unique values.
The default transformers used by the TableVectorizer for each column category are the following:

- High-cardinality categorical columns: StringEncoder
- Low-cardinality categorical columns: scikit-learn OneHotEncoder
- Numeric columns: "passthrough" (no transformation)
- Datetime columns: DatetimeEncoder
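A sketch illustrating this dispatch on a small table (the column contents are invented for the example):

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> df = pd.DataFrame({
...     "city": ["Paris", "London", "Paris"],  # low cardinality -> OneHotEncoder
...     "temp": [21.0, 17.5, 19.0],            # numeric -> passthrough
...     "when": pd.to_datetime(
...         ["2024-01-01", "2024-06-01", "2024-12-01"]
...     ),                                     # datetime -> DatetimeEncoder
... })
>>> out = TableVectorizer().fit_transform(df)

The output contains one-hot columns derived from "city", the "temp" values unchanged (cast to float32), and numeric features such as the year and month extracted from "when".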
High-cardinality categorical columns are those with more than 40 unique values, while all other categorical columns are considered low-cardinality: the threshold can be changed by setting the cardinality_threshold parameter of TableVectorizer, or by changing the configuration parameter with the same name using set_config().
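Both ways of changing the threshold, sketched (assuming skrub's set_config accepts the parameter directly, as described above):

>>> import skrub
>>> from skrub import TableVectorizer
>>> vectorizer = TableVectorizer(cardinality_threshold=10)  # per-instance setting
>>> skrub.set_config(cardinality_threshold=10)              # global default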
To change the encoder or alter default parameters, instantiate an encoder and pass it to TableVectorizer.
>>> from skrub import TableVectorizer, DatetimeEncoder, TextEncoder, SquashingScaler
>>> datetime_enc = DatetimeEncoder(periodic_encoding="circular")
>>> text_enc = TextEncoder()
>>> num_enc = SquashingScaler()
>>> table_vec = TableVectorizer(datetime=datetime_enc, high_cardinality=text_enc, numeric=num_enc)
>>> table_vec
TableVectorizer(datetime=DatetimeEncoder(periodic_encoding='circular'),
                high_cardinality=TextEncoder(), numeric=SquashingScaler())
Besides the transformers provided by skrub, the TableVectorizer can also take user-specified transformers that are applied to given columns.
>>> from sklearn.preprocessing import OrdinalEncoder
>>> import pandas as pd
>>> encoder = OrdinalEncoder()
>>> df = pd.DataFrame({
... "values": ["A", "B", "C"]
... })
We define the list of column-specific transformers:
>>> specific_transformers = [(encoder, ["values"])]
We can then encode the dataframe:
>>> TableVectorizer(specific_transformers=specific_transformers).fit_transform(df)
   values
0     0.0
1     1.0
2     2.0
Note that the columns specified in specific_transformers are passed to the transformer without any modification, which means that the transformer must be able to handle the content of the column on its own.
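For example, missing values become the specific transformer's responsibility. Continuing the session above, a hedged sketch (encoded_missing_value requires scikit-learn >= 1.1):

>>> import numpy as np
>>> df_na = pd.DataFrame({"values": ["A", np.nan, "C"]})
>>> enc = OrdinalEncoder(encoded_missing_value=-1)
>>> out = TableVectorizer(specific_transformers=[(enc, ["values"])]).fit_transform(df_na)

Here the missing entry is encoded as -1 instead of being propagated as NaN.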
If you need to define complex transformers to pass to a single instance of TableVectorizer, consider using the skrub Data Ops, ApplyToCols, or the skrub selectors instead, as they are more versatile and allow a higher degree of control over which operations are applied to which columns.
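For instance, a sketch combining ApplyToCols with a selector (assuming, as in recent skrub versions, that the selectors module exposes a string() selector and that ApplyToCols takes a cols argument):

>>> from sklearn.preprocessing import OrdinalEncoder
>>> from skrub import ApplyToCols
>>> from skrub import selectors as s
>>> transformer = ApplyToCols(OrdinalEncoder(), cols=s.string())

This applies the encoder to every string column, whatever the rest of the table contains.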
The TableVectorizer is used in the example "Encoding: from a dataframe to a numerical matrix for machine learning", while the docstring of the class provides more details on the parameters and usage, as well as various examples.