SelectCols

class skrub.SelectCols(cols)

Select a subset of a DataFrame’s columns.

A ValueError is raised if any of the provided column names are not in the dataframe.

Accepts pandas.DataFrame and polars.DataFrame inputs.

Parameters:
cols : list of str, str or selector

The columns to select, or a selector. A single column name can be passed as a str: "col_name" is the same as ["col_name"]. See the selectors user guide for more info on selectors.

See also

DropCols

Drop columns by name, dtype, or general skrub selectors.

Cleaner

Can be used to drop columns with too many NaNs.

Examples

>>> import pandas as pd
>>> from skrub import SelectCols
>>> df = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": ["x", "y"]})
>>> df
   A   B  C
0  1  10  x
1  2  20  y
>>> SelectCols(["C", "A"]).fit_transform(df)
   C  A
0  x  1
1  y  2
>>> SelectCols(["X", "A"]).fit_transform(df)
Traceback (most recent call last):
    ...
ValueError: The following columns are requested for selection but missing from dataframe: ['X']

Methods

fit(X[, y])

Fit the transformer.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform a dataframe by selecting columns.

fit(X, y=None)

Fit the transformer.

Parameters:
X : DataFrame or None

If X is a DataFrame, the transformer checks that all the column names provided in self.cols can be found in X.

y : None

Unused.

Returns:
SelectCols

The transformer itself.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : array_like of shape (n_samples, n_features)

Input samples.

y : array_like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params : dict

Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:
X_new : ndarray of shape (n_samples, n_features_new)

Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:
input_features : array_like of str or None, default=None

Ignored.

Returns:
feature_names_out : ndarray of str objects

Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:
transform : {“default”, “pandas”, “polars”}, default=None

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:
self : estimator instance

Estimator instance.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

transform(X)

Transform a dataframe by selecting columns.

Parameters:
X : DataFrame

The DataFrame on which to apply the selection.

Returns:
DataFrame

The input DataFrame X after selecting only the columns listed in self.cols (in the provided order).