skrub.TableVectorizer#
Usage examples at the bottom of this page.
- class skrub.TableVectorizer(*, cardinality_threshold=40, low_cardinality_transformer=OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False), high_cardinality_transformer=GapEncoder(n_components=30), numerical_transformer='passthrough', datetime_transformer=DatetimeEncoder(), specific_transformers=None, auto_cast=True, impute_missing='auto', remainder='passthrough', sparse_threshold=0.0, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=False)[source]#
Automatically transform a heterogeneous dataframe to a numerical array.
Easily transforms a heterogeneous data table (such as a pandas.DataFrame) to a numerical array for machine learning. To do so, the TableVectorizer transforms each column depending on its data type.
- Parameters:
- cardinality_threshold : int, default=40
Two lists of features will be created depending on this value: strictly under this value, the low-cardinality categorical features, and above or equal, the high-cardinality categorical features. Different transformers will be applied to these two groups, defined by the parameters low_cardinality_transformer and high_cardinality_transformer respectively. Note: currently, missing values are counted as a single unique value (so they count towards the cardinality).
- low_cardinality_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional
Transformer used on categorical/string features with low cardinality (the threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. OneHotEncoder), a Pipeline containing the preprocessing steps, 'drop' to drop the columns, 'remainder' to apply remainder, or 'passthrough' to return the unencoded columns. The default transformer is OneHotEncoder(handle_unknown="ignore", drop="if_binary"). Features classified under this category are imputed based on the strategy defined with impute_missing.
- high_cardinality_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional
Transformer used on categorical/string features with high cardinality (the threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder), a Pipeline containing the preprocessing steps, 'drop' to drop the columns, 'remainder' to apply remainder, or 'passthrough' to return the unencoded columns. The default transformer is GapEncoder(n_components=30). Features classified under this category are imputed based on the strategy defined with impute_missing.
- numerical_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional
Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler), a Pipeline containing the preprocessing steps, 'drop' to drop the columns, 'remainder' to apply remainder, or 'passthrough' to return the unencoded columns (default). Features classified under this category are not imputed at all (regardless of impute_missing).
- datetime_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional
Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder), a Pipeline containing the preprocessing steps, 'drop' to drop the columns, 'remainder' to apply remainder, or 'passthrough' to return the unencoded columns. The default transformer is DatetimeEncoder(). Features classified under this category are not imputed at all (regardless of impute_missing).
- specific_transformers : list of tuples ({'drop', 'remainder', 'passthrough'} or Transformer, list of str or int) or (str, {'drop', 'remainder', 'passthrough'} or Transformer, list of str or int), optional
On top of the default column type classification (see the parameters above), this parameter allows you to manually specify transformers for specific columns. This is equivalent to using a ColumnTransformer for assigning the column-specific transformers, and passing the TableVectorizer as the remainder. This parameter can take two different formats: either a list of 2-tuples (transformer, column names or indices), or a list of 3-tuples (name, transformer, column names or indices). In the latter format, you can specify the name of the assignment. Mixing the two formats is not supported; see the sketch after this parameter list.
- auto_cast : bool, default=True
If set to True, will try to convert each column to the best possible data type (dtype).
- impute_missing : {'auto', 'force', 'skip'}, default='auto'
When to impute missing values in categorical (textual) columns. 'auto' will impute missing values if it is considered appropriate (we are using an encoder that does not support missing values and/or specific versions of pandas, numpy and scikit-learn). 'force' will impute missing values in all categorical columns. 'skip' will not impute at all. When imputed, missing values are replaced by the string 'missing' before being encoded. As imputation logic for numerical features can be quite intricate, it is left to the user to manage. See also the attribute imputed_columns_.
- remainder : {'drop', 'passthrough'} or Transformer, default='passthrough'
By default, all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By specifying remainder='drop', only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.
- sparse_threshold : float, default=0.0
If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.
- n_jobs : int, default=None
Number of jobs to run in parallel. This number of jobs will be dispatched to the underlying transformers, if those support parallelization and they do not set n_jobs themselves. None (the default) means 1 unless in a joblib.parallel_config() context. -1 means using all processors.
- transformer_weights : dict, default=None
Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.
- verbose : bool, default=False
If True, the time elapsed while fitting each transformer will be printed as it is completed.
- verbose_feature_names_out : bool, default=False
If True, TableVectorizer.get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, TableVectorizer.get_feature_names_out() will not prefix any feature names and will error if feature names are not unique.
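For illustration, here is a minimal sketch of specific_transformers combined with a custom cardinality_threshold. The 3-tuple format is used throughout, since the two formats cannot be mixed; the column names follow the employee salaries dataset from the Examples section, and MinMaxScaler/MinHashEncoder are only one possible choice of overrides:
>>> from sklearn.preprocessing import MinMaxScaler
>>> from skrub import MinHashEncoder, TableVectorizer
>>> tv = TableVectorizer(
...     cardinality_threshold=20,
...     specific_transformers=[
...         ("scaled_year", MinMaxScaler(), ["year_first_hired"]),
...         ("minhash_title", MinHashEncoder(n_components=16),
...          ["employee_position_title"]),
...     ],
... )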
See also
GapEncoder
Encodes dirty categories (strings) by constructing latent topics with continuous encoding.
MinHashEncoder
Encode string columns as a numeric array with the minhash method.
SimilarityEncoder
Encode string columns as a numeric array with n-gram string similarity.
Notes
The column order of the input data is not guaranteed to be the same as the output data (returned by TableVectorizer.transform). This is due to the way the underlying ColumnTransformer works. However, the output column order will always be the same for different calls to TableVectorizer.transform on the same fitted TableVectorizer instance. For example, if the input data has columns ['name', 'job', 'year'], then the output columns might be shuffled, e.g. ['job', 'year', 'name'], but every call to TableVectorizer.transform on this instance will return this order.
Examples
Fit a TableVectorizer on an example dataset:
>>> from skrub import TableVectorizer
>>> from skrub.datasets import fetch_employee_salaries
>>> ds = fetch_employee_salaries()
>>> ds.X.head(3)
  gender department  ... date_first_hired year_first_hired
0      F        POL  ...       09/22/1986             1986
1      M        POL  ...       09/12/1988             1988
2      F        HHS  ...       11/19/1989             1989
[3 rows x 8 columns]
>>> tv = TableVectorizer()
>>> tv.fit(ds.X)
TableVectorizer()
Now, we can inspect the transformers assigned to each column:
>>> tv.transformers_
[('numeric', 'passthrough', ['year_first_hired']),
 ('datetime', DatetimeEncoder(), ['date_first_hired']),
 ('low_card_cat', OneHotEncoder(drop='if_binary', handle_unknown='ignore',
                                sparse_output=False),
  ['gender', 'department', 'department_name', 'assignment_category']),
 ('high_card_cat', GapEncoder(n_components=30),
  ['division', 'employee_position_title'])]
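The fitted instance can then transform the table into a purely numerical array. A short sketch (the number of output columns depends on the fitted encoders, so only the row count is checked here):
>>> X_enc = tv.transform(ds.X)
>>> X_enc.shape[0] == ds.X.shape[0]  # one output row per input row
True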
- Attributes:
- transformers_ : list of 3-tuples (str, Transformer or str, list of str)
Transformers applied to the different columns.
- types_ : dict mapping of int to type
A mapping of inferred types per column. Key is the index of a column, value is the inferred dtype. Exists only if auto_cast=True.
- imputed_columns_ : list of str
The list of columns in which we imputed the missing values.
Methods
fit(X[, y])
Fit all transformers using X.
fit_transform(X[, y])
Fit all transformers, transform the data, and concatenate the results.
get_feature_names_out([input_features])
Return clean feature names.
get_metadata_routing()
Get metadata routing of this object.
get_params([deep])
Get parameters for this estimator.
set_output(*[, transform])
Set output container.
set_params(**params)
Set the parameters of this estimator.
transform(X)
Transform X by applying the fitted transformers on the columns.
- fit(X, y=None)[source]#
Fit all transformers using X.
- Parameters:
- X : {array_like, dataframe} of shape (n_samples, n_features)
Input data, of which specified subsets are used to fit the transformers.
- y : array_like of shape (n_samples, …), default=None
Targets for supervised learning.
- Returns:
- self : TableVectorizer
This estimator.
- fit_transform(X, y=None)[source]#
Fit all transformers, transform the data, and concatenate the results.
In practice, it (1) converts features to their best possible types if auto_cast=True, (2) classifies columns based on their data type, (3) replaces “false missing” values (see _replace_false_missing) and imputes categorical columns depending on impute_missing, and (4) finally transforms X.
- Parameters:
- X : array_like of shape (n_samples, n_features)
Input data, of which specified subsets are used to fit the transformers.
- y : array_like of shape (n_samples,), optional
Targets for supervised learning.
- Returns:
- {array_like, sparse matrix} of shape (n_samples, sum_n_components)
Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
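A minimal sketch of fit_transform (the toy dataframe below is made up for illustration):
>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> df = pd.DataFrame({
...     "city": ["Paris", "London", "Paris"],  # low-cardinality string column
...     "temperature": [12.5, 10.1, 14.0],     # numerical column, passed through
... })
>>> X_enc = TableVectorizer().fit_transform(df)
>>> X_enc.shape[0]
3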
- get_feature_names_out(input_features=None)[source]#
Return clean feature names.
Feature names are formatted like: “<column_name>_<value>” if encoded by OneHotEncoder or alike, (e.g. “job_title_Police officer”), or “<column_name>” otherwise.
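Continuing the fitted example from the Examples section, a sketch (the exact names depend on the fitted encoders):
>>> names = tv.get_feature_names_out()  # e.g. one-hot names such as 'gender_F'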
- get_metadata_routing()[source]#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- property named_transformers_#
Map transformer names to transformer objects.
Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
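For example, with the vectorizer fitted in the Examples section above, the fitted GapEncoder can be retrieved by the name it was assigned ('high_card_cat'):
>>> gap_encoder = tv.named_transformers_["high_card_cat"]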
- property output_indices_#
Map the transformer names to their input indices.
A dictionary from each transformer name to a slice, where the slice corresponds to indices in the transformed output. This is useful to inspect which transformer is responsible for which transformed feature(s).
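A sketch using the vectorizer fitted in the Examples section (the slice bounds depend on the fitted encoders):
>>> dt_slice = tv.output_indices_["datetime"]      # output columns of the DatetimeEncoder
>>> dt_features = tv.transform(ds.X)[:, dt_slice]  # only the datetime-derived features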
- set_output(*, transform=None)[source]#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
- transform : {“default”, “pandas”}, default=None
Configure output of transform and fit_transform.
“default”: Default output format of a transformer
“pandas”: DataFrame output
None: Transform configuration is unchanged
- Returns:
- self : estimator instance
Estimator instance.
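For example, a sketch (requires a scikit-learn version that provides the set_output API):
>>> tv = TableVectorizer().set_output(transform="pandas")
>>> df_enc = tv.fit_transform(ds.X)  # a pandas.DataFrame instead of a numpy array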
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
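For example, the nested <component>__<parameter> syntax can reach into the default GapEncoder (a sketch):
>>> tv = TableVectorizer()
>>> _ = tv.set_params(cardinality_threshold=20,
...                   high_cardinality_transformer__n_components=10)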
- property sparse_output_#
Whether the output of transform is sparse or dense.
Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.
- transform(X)[source]#
Transform X by applying the fitted transformers on the columns.
- Parameters:
- X : array_like of shape (n_samples, n_features)
The data to be transformed.
- Returns:
- {array_like, sparse matrix} of shape (n_samples, sum_n_components)
Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
- property transformers_#
Transformers applied to the different columns.
Examples using skrub.TableVectorizer#
Encoding: from a dataframe to a numerical matrix for machine learning
Handling datetime features with the DatetimeEncoder
Spatial join for flight data: Joining across multiple columns