skrub.TableVectorizer

Usage examples can be found at the bottom of this page.

class skrub.TableVectorizer(*, cardinality_threshold=40, low_cardinality_transformer=OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False), high_cardinality_transformer=GapEncoder(n_components=30), numerical_transformer='passthrough', datetime_transformer=DatetimeEncoder(), specific_transformers=None, auto_cast=True, remainder='passthrough', sparse_threshold=0.0, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=False)

Automatically transform a heterogeneous dataframe to a numerical array.

Easily transforms a heterogeneous data table (such as a pandas.DataFrame) to a numerical array for machine learning. To do so, the TableVectorizer transforms each column depending on its data type.

Parameters:
cardinality_threshold : int, default=40

Categorical features are split into two groups based on this value: features whose cardinality is strictly below the threshold are treated as low cardinality, and features whose cardinality is greater than or equal to it as high cardinality. Different transformers are applied to these two groups, defined by the parameters low_cardinality_transformer and high_cardinality_transformer respectively. Note: currently, missing values are counted as a single unique value (so they count in the cardinality).

low_cardinality_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on categorical/string features with low cardinality (the threshold is defined by cardinality_threshold). Can either be:

- a transformer object instance (e.g. OneHotEncoder)
- a Pipeline containing the preprocessing steps
- ‘drop’ for dropping the columns
- ‘remainder’ for applying remainder
- ‘passthrough’ to return the unencoded columns

The default transformer is:

    OneHotEncoder(
        handle_unknown='ignore', drop='if_binary', sparse_output=False,
    )

When the downstream estimator is a tree-based model (e.g. scikit-learn's HistGradientBoostingRegressor), the OneHotEncoder may lead to lower performance than other transformers, such as the OrdinalEncoder.
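For example, a minimal sketch of such a pipeline (the pipeline itself is an illustration, not part of the skrub docs):

>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import OrdinalEncoder
>>> from skrub import TableVectorizer
>>> model = make_pipeline(
...     TableVectorizer(
...         # Encode unseen categories as -1 instead of raising an error.
...         low_cardinality_transformer=OrdinalEncoder(
...             handle_unknown='use_encoded_value', unknown_value=-1,
...         ),
...     ),
...     HistGradientBoostingRegressor(),
... )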

high_cardinality_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on categorical/string features with high cardinality (the threshold is defined by cardinality_threshold). Can either be a transformer object instance (e.g. GapEncoder), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns. The default transformer is GapEncoder(n_components=30).
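For example, a sketch swapping in the MinHashEncoder listed under “See also”:

>>> from skrub import MinHashEncoder, TableVectorizer
>>> tv = TableVectorizer(high_cardinality_transformer=MinHashEncoder())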

numerical_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns (default).
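For example, a sketch scaling numeric columns instead of passing them through:

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import TableVectorizer
>>> tv = TableVectorizer(numerical_transformer=StandardScaler())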

datetime_transformer : {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, optional

Transformer used on datetime features. Can either be a transformer object instance (e.g. DatetimeEncoder), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns. The default transformer is DatetimeEncoder().

specific_transformers : list of tuples ({‘drop’, ‘remainder’, ‘passthrough’} or Transformer, list of str or int) or (str, {‘drop’, ‘remainder’, ‘passthrough’} or Transformer, list of str or int), optional

On top of the default column type classification (see the parameters above), this parameter allows you to manually specify transformers for specific columns. This is equivalent to using a ColumnTransformer to assign the column-specific transformers and passing the TableVectorizer as the remainder. This parameter can take two formats, which cannot be mixed:

- a list of 2-tuples (transformer, column names or indices)
- a list of 3-tuples (name, transformer, column names or indices)

In the latter format, you can specify the name of the assignment, as in the sketch below.
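For instance, a minimal sketch of the 3-tuple format (the assignment name 'dept_ordinal' and the column 'department' are illustrative assumptions):

>>> from sklearn.preprocessing import OrdinalEncoder
>>> from skrub import TableVectorizer
>>> tv = TableVectorizer(
...     specific_transformers=[('dept_ordinal', OrdinalEncoder(), ['department'])],
... )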

auto_cast : bool, default=True

If set to True, calling fit, transform or fit_transform will call _auto_cast to convert each column to the “optimal” dtype for scikit-learn estimators. The main heuristics are the following:

- pandas extension dtypes are converted to numpy dtypes
- datetimes are converted using skrub.to_datetime
- numerics are converted using pandas.to_numeric
- numeric columns with missing values are converted to float so they can hold np.nan
- categorical column dtypes are updated with the new entries (if any) encountered during transform
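For instance, a toy sketch of the expected casting behavior (the dataframe below is made up for illustration):

>>> import pandas as pd
>>> from skrub import TableVectorizer
>>> X = pd.DataFrame({
...     'when': ['2021-01-01', '2021-02-03'],  # strings parsed as datetimes
...     'n': ['1', '2'],  # strings converted to numeric
... })
>>> out = TableVectorizer().fit_transform(X)  # auto_cast=True by default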

remainder : {‘drop’, ‘passthrough’} or Transformer, default=’passthrough’

By default, all remaining columns that were not specified in transformers are automatically passed through. This subset of columns is concatenated with the output of the transformers. By specifying remainder='drop', only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. By setting remainder to an estimator, the remaining non-specified columns are handled by that estimator, which must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

sparse_threshold : float, default=0.0

If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored. Note that with the default encoders, the output will never be sparse.

n_jobs : int, default=None

Number of jobs to run in parallel. The jobs are dispatched to the underlying transformers, if they support parallelization and do not explicitly set n_jobs themselves. None (the default) means 1 unless in a joblib.parallel_config() context. -1 means using all processors.

transformer_weights : dict, default=None

Multiplicative weights for features per transformer. The output of each transformer is multiplied by its weight. Keys are transformer names, values are the weights.
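A sketch, using one of the transformer names assigned by the TableVectorizer (e.g. 'datetime', as shown in the transformers_ example below):

>>> from skrub import TableVectorizer
>>> # Features produced by the datetime transformer are multiplied by 0.5.
>>> tv = TableVectorizer(transformer_weights={'datetime': 0.5})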

verbose : bool, default=False

If True, the time elapsed while fitting each transformer will be printed as it is completed.

verbose_feature_names_out : bool, default=False

If True, TableVectorizer.get_feature_names_out() will prefix all feature names with the name of the transformer that generated that feature. If False, TableVectorizer.get_feature_names_out() will not prefix any feature names and will raise an error if feature names are not unique.

See also

GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

Notes

The column order of the input data is not guaranteed to be the same as that of the output data (returned by TableVectorizer.transform). This is due to the way the underlying ColumnTransformer works. However, the output column order will always be the same across calls to TableVectorizer.transform on the same fitted TableVectorizer instance. For example, if the input data has columns [‘name’, ‘job’, ‘year’], then the output columns might be shuffled, e.g. [‘job’, ‘year’, ‘name’], but every call to TableVectorizer.transform on this instance will return them in that same order.

Examples

Fit a TableVectorizer on an example dataset:

>>> from skrub.datasets import fetch_employee_salaries
>>> ds = fetch_employee_salaries()
>>> ds.X.head(3)
  gender department  ... date_first_hired year_first_hired
0      F        POL  ...       09/22/1986             1986
1      M        POL  ...       09/12/1988             1988
2      F        HHS  ...       11/19/1989             1989
[3 rows x 8 columns]
>>> from skrub import TableVectorizer
>>> tv = TableVectorizer()
>>> tv.fit(ds.X)
TableVectorizer()

Now, we can inspect the transformers assigned to each column:

>>> tv.transformers_
[('numeric', 'passthrough', ['year_first_hired']), ('datetime', DatetimeEncoder(), ['date_first_hired']), ('low_cardinality', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False), ['gender', 'department', 'department_name', 'assignment_category']), ('high_cardinality', GapEncoder(n_components=30), ['division', 'employee_position_title'])]

Attributes:
transformers_ : list of 3-tuples (str, Transformer or str, list of str)

Transformers applied to the different columns.

inferred_column_types_ : dict mapping int to type

A mapping of inferred types per column.

Methods

fit(X[, y])

Fit all transformers using X.

fit_transform(X[, y])

Fit all transformers, transform the data, and concatenate the results.

get_feature_names_out([input_features])

Return clean feature names.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X by applying the fitted transformers on the columns.

fit(X, y=None)

Fit all transformers using X.

Parameters:
X : dataframe of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array_like of shape (n_samples, …), default=None

Targets for supervised learning.

Returns:
self : TableVectorizer

This estimator.

fit_transform(X, y=None)

Fit all transformers, transform the data, and concatenate the results.

In practice, it:

  • converts features to their best possible types for scikit-learn estimators if auto_cast=True (see the auto_cast docstring);

  • classifies columns based on their data types and matches them to the dtype-specific transformers;

  • uses a scikit-learn ColumnTransformer to run fit_transform on all transformers.

Parameters:
X : dataframe of shape (n_samples, n_features)

Input data, of which specified subsets are used to fit the transformers.

y : array_like of shape (n_samples,), optional

Targets for supervised learning.

Returns:
{array_like, sparse matrix} of shape (n_samples, sum_n_components)

Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

get_feature_names_out(input_features=None)

Return clean feature names.

Feature names are formatted as “<column_name>_<value>” if encoded by OneHotEncoder or similar (e.g. “job_title_Police officer”), or as “<column_name>” otherwise.

Parameters:
input_features : None

Unused, only here for compatibility.

Returns:
feature_names : ndarray of str

Feature names.
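For instance, a sketch reusing the fitted tv from the Examples section (the names in the comments are illustrative assumptions):

>>> names = tv.get_feature_names_out()
>>> # One-hot encoded features look like 'department_POL'; passthrough
>>> # numeric features keep their original name, e.g. 'year_first_hired'.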

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

property named_transformers_

Map transformer names to transformer objects.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.
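A sketch, reusing the fitted tv from the Examples section:

>>> fitted_dt = tv.named_transformers_['datetime']  # the fitted DatetimeEncoder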

property output_indices_

Map the transformer names to their input indices.

A dictionary from each transformer name to a slice, where the slice corresponds to indices in the transformed output. This is useful to inspect which transformer is responsible for which transformed feature(s).
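A sketch, assuming the fitted tv and dataset ds from the Examples section and a dense numpy output:

>>> out = tv.transform(ds.X)
>>> dt_slice = tv.output_indices_['datetime']
>>> datetime_features = out[:, dt_slice]  # features produced by the datetime transformer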

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:
self : estimator instance

Estimator instance.
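For example, a minimal sketch requesting pandas output (standard scikit-learn set_output usage):

>>> from skrub import TableVectorizer
>>> tv = TableVectorizer().set_output(transform='pandas')
>>> # transform and fit_transform will now return pandas DataFrames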

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
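For example, a sketch updating a nested parameter of the default high-cardinality transformer (the default GapEncoder exposes n_components):

>>> from skrub import TableVectorizer
>>> tv = TableVectorizer()
>>> _ = tv.set_params(high_cardinality_transformer__n_components=10)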

property sparse_output_

Whether the output of transform is sparse or dense.

Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.

transform(X)

Transform X by applying the fitted transformers on the columns.

Parameters:
X : dataframe of shape (n_samples, n_features)

The data to be transformed.

Returns:
{array_like, sparse matrix} of shape (n_samples, sum_n_components)

Hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

property transformers_

Transformers applied to the different columns.

Examples using skrub.TableVectorizer

Encoding: from a dataframe to a numerical matrix for machine learning

Handling datetime features with the DatetimeEncoder

Spatial join for flight data: Joining across multiple columns

Self-aggregation on MovieLens