SelectCols

class skrub.SelectCols(cols)

Select a subset of a DataFrame’s columns.

A ValueError is raised if any of the provided column names are not in the dataframe.

Accepts pandas.DataFrame and polars.DataFrame inputs.

Parameters:

cols : list of str, str or selector
    The columns to select, or a selector. A single column name can be passed as a str: "col_name" is the same as ["col_name"]. See the selectors user guide for more info on selectors.

See also

DropCols: Drop columns by name, dtype, or general skrub selectors.
Cleaner: Can be used to drop columns with too many NaNs.

Examples

>>> import pandas as pd
>>> from skrub import SelectCols
>>> df = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": ["x", "y"]})
>>> df
   A   B  C
0  1  10  x
1  2  20  y
>>> SelectCols(["C", "A"]).fit_transform(df)
   C  A
0  x  1
1  y  2
>>> SelectCols(["X", "A"]).fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: The following columns are requested for selection but missing from dataframe: ['X']

Methods

`fit`(X[, y])	Fit the transformer.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_params`([deep])	Get parameters for this estimator.
`set_output`(*[, transform])	Set output container.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Transform a dataframe by selecting columns.

fit(X, y=None)

Fit the transformer.

Parameters:

X : DataFrame or None
    If X is a DataFrame, the transformer checks that all the column names provided in self.cols can be found in X.
y : None
    Unused.

Returns:

SelectCols: The transformer itself.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X : array_like of shape (n_samples, n_features)
    Input samples.
y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
**fit_params : dict
    Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new : ndarray of shape (n_samples, n_features_new)
    Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features : array_like of str or None, default=None
    Ignored.

Returns:

feature_names_out : ndarray of str objects
    Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True
    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict
    Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See “Introducing the set_output API” for an example of how to use the API.

Parameters:

transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

“default”: Default output format of a transformer
“pandas”: DataFrame output
“polars”: Polars output
None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self : estimator instance
    Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: