skrub.TableVectorizer

Usage examples at the bottom of this page.

class skrub.TableVectorizer(*, cardinality_threshold=40, low_cardinality_transformer=OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False), high_cardinality_transformer=GapEncoder(n_components=30), numerical_transformer='passthrough', datetime_transformer=DatetimeEncoder(), specific_transformers=None, auto_cast=True, impute_missing='auto', remainder='passthrough', sparse_threshold=0.0, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=False)

Automatically transform a heterogeneous dataframe to a numerical array.

Easily transforms a heterogeneous data table (such as a pandas.DataFrame) to a numerical array for machine learning. To do so, the TableVectorizer transforms each column depending on its data type.

Parameters:
cardinality_threshold : int, default=40

Categorical features are split into two groups based on this value: features whose cardinality is strictly below the threshold are treated as low-cardinality, and features whose cardinality is equal to or above it as high-cardinality. Different transformers are applied to these two groups, defined by the parameters low_cardinality_transformer and high_cardinality_transformer respectively. Note: currently, missing values count as a single unique value, so they contribute to the cardinality.
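
For instance, a column with only a couple of distinct values stays below the default threshold and goes to the low-cardinality transformer, while a column with many distinct values is routed to the high-cardinality transformer. A minimal sketch with an illustrative threshold and data:

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> X = pd.DataFrame({
...     "city": ["Paris", "London", "Paris", "London"],      # 2 unique values: low cardinality
...     "title": ["engineer", "nurse", "teacher", "clerk"],  # 4 unique values: high cardinality here
... })
>>> tv = TableVectorizer(cardinality_threshold=3)
>>> _ = tv.fit(X)  # tv.transformers_ shows which group each column was assigned to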

low_cardinality_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional

Transformer used on categorical/string features with low cardinality (the threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. OneHotEncoder), a Pipeline containing the preprocessing steps, 'drop' to drop the columns, 'remainder' to apply remainder, or 'passthrough' to return the unencoded columns. The default transformer is OneHotEncoder(handle_unknown="ignore", drop="if_binary"). Features classified under this category are imputed based on the strategy defined with impute_missing.

high_cardinality_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional

Transformer used on categorical/string features with high cardinality (threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns. The default transformer is GapEncoder(n_components=30). Features classified under this category are imputed based on the strategy defined with impute_missing.

numerical_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional

Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns (default). Features classified under this category are not imputed at all (regardless of impute_missing).

datetime_transformer : {'drop', 'remainder', 'passthrough'} or Transformer, optional

Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder), a Pipeline containing the preprocessing steps, 'drop' to drop the columns, 'remainder' to apply remainder, or 'passthrough' to return the unencoded columns. The default transformer is DatetimeEncoder(). Features classified under this category are not imputed at all (regardless of impute_missing).

specific_transformers : list of tuples ({'drop', 'remainder', 'passthrough'} or Transformer, list of str or int) or (str, {'drop', 'remainder', 'passthrough'} or Transformer, list of str or int), optional

On top of the default column type classification (see the parameters above), this parameter lets you manually assign transformers to specific columns. This is equivalent to using a ColumnTransformer to assign the column-specific transformers and passing the TableVectorizer as the remainder. The parameter accepts two formats:

  • a list of 2-tuples (transformer, column names or indices)

  • a list of 3-tuples (name, transformer, column names or indices)

In the latter format, you can specify the name of the assignment. Mixing the two formats is not supported.
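
As an illustration, the sketch below pins a StandardScaler to a single column (the column name is illustrative), using each of the two accepted formats:

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import TableVectorizer
>>> # 3-tuple format: the assignment is explicitly named
>>> tv = TableVectorizer(
...     specific_transformers=[("scaled_year", StandardScaler(), ["year_first_hired"])]
... )
>>> # 2-tuple format: no explicit name for the assignment
>>> tv = TableVectorizer(
...     specific_transformers=[(StandardScaler(), ["year_first_hired"])]
... )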

auto_cast : bool, default=True

If set to True, the TableVectorizer will try to convert each column to the best possible data type (dtype).

impute_missing : {'auto', 'force', 'skip'}, default='auto'

When to impute missing values in categorical (textual) columns. 'auto' will impute missing values only when deemed necessary, for example when the chosen encoder does not support missing values, or with specific versions of pandas, numpy and scikit-learn. 'force' will impute missing values in all categorical columns, while 'skip' will not impute at all. When imputed, missing values are replaced by the string 'missing' before being encoded. As imputation logic for numerical features can be quite intricate, it is left to the user to manage. See also the imputed_columns_ attribute.
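
After fitting, the imputed_columns_ attribute lists the columns that were affected; a minimal sketch with illustrative data:

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> X = pd.DataFrame({"city": ["Paris", None, "London", "Paris"]})
>>> tv = TableVectorizer(impute_missing="force")
>>> _ = tv.fit(X)
>>> cols = tv.imputed_columns_  # the categorical columns where missing values were replaced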

remainder : {'drop', 'passthrough'} or Transformer, default='passthrough'

By default, all remaining columns that were not specified in transformers are passed through; this subset of columns is concatenated with the output of the transformers. By specifying remainder='drop', only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. By setting remainder to an estimator, the remaining non-specified columns are handled by that estimator; the estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

sparse_threshold : float, default=0.0

If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.

n_jobs : int, default=None

Number of jobs to run in parallel. The jobs are dispatched to the underlying transformers, if they support parallelization and do not explicitly set n_jobs themselves. None (the default) means 1 unless in a joblib.parallel_config() context. -1 means using all processors.

transformer_weights : dict, default=None

Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

verbose : bool, default=False

If True, the time elapsed while fitting each transformer will be printed as it is completed.

verbose_feature_names_out : bool, default=False

If True, TableVectorizer.get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, TableVectorizer.get_feature_names_out() will not prefix any feature names and will error if feature names are not unique.
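
For example, with prefixing enabled, each output feature name carries the name of the transformer that produced it (assuming the scikit-learn ColumnTransformer convention of a '<transformer name>__' prefix); a minimal sketch:

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> X = pd.DataFrame({"gender": ["F", "M", "F"], "year": [1986, 1988, 1989]})
>>> tv = TableVectorizer(verbose_feature_names_out=True)
>>> _ = tv.fit(X)
>>> names = tv.get_feature_names_out()  # e.g. names like 'low_card_cat__gender_F'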

See also

GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

Notes

The column order of the input data is not guaranteed to be preserved in the output data (returned by TableVectorizer.transform). This is due to the way the underlying ColumnTransformer works. However, the output column order will always be the same across different calls to TableVectorizer.transform on the same fitted TableVectorizer instance. For example, if the input data has columns ['name', 'job', 'year'], the output columns might be shuffled, e.g. ['job', 'year', 'name'], but every call to TableVectorizer.transform on this instance will return them in that same order.

Examples

Fit a TableVectorizer on an example dataset:

>>> from skrub import TableVectorizer
>>> from skrub.datasets import fetch_employee_salaries
>>> ds = fetch_employee_salaries()
>>> ds.X.head(3)
  gender department  ... date_first_hired year_first_hired
0      F        POL  ...       09/22/1986             1986
1      M        POL  ...       09/12/1988             1988
2      F        HHS  ...       11/19/1989             1989
[3 rows x 8 columns]
>>> tv = TableVectorizer()
>>> tv.fit(ds.X)
TableVectorizer()

Now, we can inspect the transformers assigned to each column:

>>> tv.transformers_
[('numeric', 'passthrough', ['year_first_hired']), ('datetime', DatetimeEncoder(), ['date_first_hired']), ('low_card_cat', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False), ['gender', 'department', 'department_name', 'assignment_category']), ('high_card_cat', GapEncoder(n_components=30), ['division', 'employee_position_title'])]
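
The fitted vectorizer can then transform the table into a numerical array, and the generated feature names can be retrieved:

>>> X_enc = tv.transform(ds.X)
>>> X_enc.shape[0] == ds.X.shape[0]
True
>>> names = tv.get_feature_names_out()  # one name per output column
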
Attributes:
transformers_ : list of 3-tuples (str, Transformer or str, list of str)

Transformers applied to the different columns.

types_ : dict mapping of int to type

A mapping of inferred types per column. Key is the index of a column, value is the inferred dtype. Exists only if auto_cast=True.

imputed_columns_ : list of str

The list of columns in which we imputed the missing values.

Methods

fit(X[, y])

Fit all transformers using X.

fit_transform(X[, y])

Fit all transformers, transform the data, and concatenate the results.

get_feature_names_out([input_features])

Return clean feature names.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X by applying the fitted transformers on the columns.

fit(X, y=None)

Fit all transformers using X.

Parameters:
X : {array_like, dataframe} of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array_like of shape (n_samples, …), default=None

Targets for supervised learning.

Returns:
self : TableVectorizer

This estimator.

fit_transform(X, y=None)

Fit all transformers, transform the data, and concatenate the results.

In practice, it (1) converts features to their best possible types if auto_cast=True, (2) classifies columns based on their data type, (3) replaces "false missing" values (see _replace_false_missing) and imputes categorical columns depending on impute_missing, and finally (4) transforms X.

Parameters:
X : array_like of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array_like of shape (n_samples,), optional

Targets for supervised learning.

Returns:
{array_like, sparse matrix} of shape (n_samples, sum_n_components)

Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.
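
For instance, reusing the employee salaries table from the example above, fitting and transforming happen in one step (note that the GapEncoder step can take a while on this dataset):

>>> tv = TableVectorizer()
>>> X_enc = tv.fit_transform(ds.X)  # numerical array ready for a scikit-learn estimator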

get_feature_names_out(input_features=None)

Return clean feature names.

Feature names are formatted like "<column_name>_<value>" if encoded by OneHotEncoder or a similar encoder (e.g. "job_title_Police officer"), or "<column_name>" otherwise.

Parameters:
input_features : None

Unused, only here for compatibility.

Returns:
list of str

Feature names.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

property named_transformers_

Map transformer names to transformer objects.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

property output_indices_

Map the transformer names to their input indices.

A dictionary from each transformer name to a slice, where the slice corresponds to indices in the transformed output. This is useful to inspect which transformer is responsible for which transformed feature(s).
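
For instance, the slice stored for a transformer name can be used to recover the columns it produced from the transformed array. A minimal sketch, assuming a fitted instance tv and its output X_enc; the 'datetime' name matches the default assignment shown in the Examples above:

>>> idx = tv.output_indices_["datetime"]  # a slice into the output columns
>>> datetime_features = X_enc[:, idx]     # the columns produced by the datetime transformer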

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {"default", "pandas"}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.
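
For instance, to get a pandas.DataFrame back from transform and fit_transform (a sketch reusing the ds.X table from the Examples above):

>>> tv = TableVectorizer().set_output(transform="pandas")
>>> df_out = tv.fit_transform(ds.X)  # a DataFrame with one column per generated feature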

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
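
Nested parameters use the <component>__<parameter> form; for instance, to change the number of GapEncoder components on an existing instance (a sketch):

>>> tv = TableVectorizer()
>>> _ = tv.set_params(high_cardinality_transformer__n_components=10)
>>> _ = tv.set_params(cardinality_threshold=20)  # top-level parameters work the same way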

property sparse_output_

Whether the output of transform is sparse or dense.

Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.

transform(X)

Transform X by applying the fitted transformers on the columns.

Parameters:
X : array_like of shape (n_samples, n_features)

The data to be transformed.

Returns:
{array_like, sparse matrix} of shape (n_samples, sum_n_components)

Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

property transformers_

Transformers applied to the different columns.

Examples using skrub.TableVectorizer

Encoding: from a dataframe to a numerical matrix for machine learning

Handling datetime features with the DatetimeEncoder

Spatial join for flight data: Joining across multiple columns

Self-aggregation on MovieLens