skrub.DataOp.skb.apply
- DataOp.skb.apply(estimator, *, y=None, cols=all(), exclude_cols=None, how='auto', allow_reject=False, unsupervised=False)
Apply a scikit-learn estimator to a dataframe or numpy array.
- Parameters:
- estimator : scikit-learn estimator
  The transformer or predictor to apply.
- y : dataframe, column or numpy array, optional
  The prediction targets when estimator is a supervised estimator.
- cols : str, list of strings or skrub selector, optional
  The columns to transform, when estimator is a transformer.
- exclude_cols : str, list of strings or skrub selector, optional
  When estimator is a transformer, columns to which it should _not_ be applied. The columns that are matched by cols AND not matched by exclude_cols are transformed.
- how : “auto”, “cols”, “frame” or “no_wrap”, optional
  How the estimator is applied. In most cases the default “auto” is appropriate.
  - “cols” means estimator is wrapped in an ApplyToCols transformer, which fits a separate clone of estimator to each column in cols. estimator must be a transformer (have a fit_transform method).
  - “frame” means estimator is wrapped in an ApplyToFrame transformer, which fits a single clone of estimator to the selected part of the input dataframe. estimator must be a transformer.
  - “no_wrap” means no wrapping: estimator is applied directly to the unmodified input.
  - “auto” chooses the wrapping depending on the input and estimator. If the input is not a dataframe or the estimator is not a transformer, the “no_wrap” strategy is chosen. Otherwise, if the estimator has a __single_column_transformer__ attribute, “cols” is chosen. Otherwise “frame” is chosen.
- allow_reject : bool, optional
  Whether the transformer can refuse to transform columns to which it does not apply, in which case they are passed through unchanged. This can be useful to avoid specifying exactly which columns should be transformed. For example, if we apply skrub.ToDatetime() to all columns with allow_reject=True, string columns that can be parsed as dates will be converted and all other columns will be passed through. If we use allow_reject=False (the default), an error is raised if the dataframe contains columns to which ToDatetime does not apply (e.g. a column of numbers).
- unsupervised : bool, optional
  Use this to indicate that y is required for scoring but not fitting, as is the case for clustering algorithms. If y is not required at all (for example when applying an unsupervised transformer, or when we are not interested in scoring with ground-truth labels), simply leave the default y=None; there is no need to pass a value for unsupervised.
- Returns:
- result
  The transformed dataframe when estimator is a transformer, and the fitted estimator’s predictions if it is a supervised predictor.
See also
skrub.DataOp.skb.make_learner
Get a skrub learner for this DataOp.
skrub.ApplyToCols
Transformer that applies a given transformer separately to each selected column.
skrub.ApplyToFrame
Transformer that applies a given transformer to part of a dataframe.
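The rule that how=“auto” follows, as described in the parameter list, can be sketched in plain Python. This is only an illustration of the documented decision order, not skrub's implementation: the helper choose_wrapping and the three toy classes are hypothetical names, and "is a transformer" is approximated here by the presence of a fit_transform method.

```python
def choose_wrapping(input_is_dataframe, estimator):
    """Hypothetical helper mirroring the documented "auto" rule."""
    # Assumption: a transformer is recognized by its fit_transform method.
    is_transformer = hasattr(estimator, "fit_transform")
    if not input_is_dataframe or not is_transformer:
        return "no_wrap"
    if hasattr(estimator, "__single_column_transformer__"):
        return "cols"
    return "frame"


class WholeFrameTransformer:
    """Toy transformer that works on a whole table."""

    def fit_transform(self, X, y=None):
        return X


class SingleColumnTransformer:
    """Toy transformer flagged as operating on one column at a time."""

    __single_column_transformer__ = True

    def fit_transform(self, X, y=None):
        return X


class Predictor:
    """Toy predictor: it has fit and predict but no fit_transform."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


print(choose_wrapping(True, WholeFrameTransformer()))     # frame
print(choose_wrapping(True, SingleColumnTransformer()))   # cols
print(choose_wrapping(True, Predictor()))                 # no_wrap
print(choose_wrapping(False, SingleColumnTransformer()))  # no_wrap
```

Passing how explicitly bypasses this choice, which may be useful when the default wrapping is not the one wanted.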
Examples
>>> import skrub
>>> data = skrub.datasets.toy_orders()
>>> x = skrub.X(data.X)
>>> x
<Var 'X'>
Result:
―――――――
   ID product  quantity        date
0   1     pen         2  2020-04-03
1   2     cup         3  2020-04-04
2   3     cup         5  2020-04-04
3   4   spoon         1  2020-04-05

>>> datetime_encoder = skrub.DatetimeEncoder(add_total_seconds=False)
>>> x.skb.apply(skrub.TableVectorizer(datetime=datetime_encoder))
<Apply TableVectorizer>
Result:
―――――――
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0
Transform only the 'product' column:

>>> x.skb.apply(skrub.StringEncoder(n_components=2), cols='product')
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity        date
0   1 -2.560113e-16  1.000000e+00         2  2020-04-03
1   2  1.000000e+00  7.447602e-17         3  2020-04-04
2   3  1.000000e+00  7.447602e-17         5  2020-04-04
3   4 -3.955170e-16 -8.326673e-17         1  2020-04-05
Transform all but the 'ID' and 'quantity' columns:

>>> x.skb.apply(
...     skrub.StringEncoder(n_components=2), exclude_cols=["ID", "quantity"]
... )
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity    date_0    date_1
0   1  9.775252e-08  7.830415e-01         2  0.766318 -0.406667
1   2  9.999999e-01  0.000000e+00         3  0.943929  0.330148
2   3  9.999998e-01 -1.490116e-08         5  0.943929  0.330149
3   4  9.910963e-08 -6.219692e-01         1  0.766318 -0.406668
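The way cols and exclude_cols combine (a column is transformed when it is matched by cols AND not matched by exclude_cols) can be illustrated with a small plain-Python sketch. It only models column names given as a string or a list of strings, not skrub selectors, and columns_to_transform is a hypothetical helper, not part of skrub.

```python
def _as_names(spec):
    # A single string names one column; a list names several.
    return [spec] if isinstance(spec, str) else list(spec)


def columns_to_transform(all_columns, cols=None, exclude_cols=None):
    """Hypothetical helper: names matched by cols and not by exclude_cols."""
    # Assumption: cols=None stands in for the default "all columns" selector.
    included = set(all_columns) if cols is None else set(_as_names(cols))
    excluded = set() if exclude_cols is None else set(_as_names(exclude_cols))
    # Keep the original column order of the table.
    return [c for c in all_columns if c in included and c not in excluded]


columns = ["ID", "product", "quantity", "date"]
print(columns_to_transform(columns, cols="product"))
# ['product']
print(columns_to_transform(columns, exclude_cols=["ID", "quantity"]))
# ['product', 'date']
print(columns_to_transform(columns, cols=["ID", "quantity"], exclude_cols="ID"))
# ['quantity']
```

The second call corresponds to the exclude_cols example, where only 'product' and 'date' are encoded.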
More complex selection of the columns to transform, here all numeric columns except 'ID':

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import selectors as s

>>> x.skb.apply(StandardScaler(), cols=s.numeric() - "ID")
<Apply StandardScaler>
Result:
―――――――
   ID product        date  quantity
0   1     pen  2020-04-03 -0.507093
1   2     cup  2020-04-04  0.169031
2   3     cup  2020-04-04  1.521278
3   4   spoon  2020-04-05 -1.183216
For supervised estimators, pass the targets as the argument for y:

>>> from sklearn.dummy import DummyClassifier
>>> y = skrub.y(data.y)
>>> y
<Var 'y'>
Result:
―――――――
0    False
1    False
2     True
3    False
Name: delayed, dtype: bool

>>> x.skb.apply(skrub.TableVectorizer()).skb.apply(DummyClassifier(), y=y)
<Apply DummyClassifier>
Result:
―――――――
0    False
1    False
2    False
3    False
Name: delayed, dtype: bool
Sometimes we want to pass a value for y because it is required for scoring and cross-validation, but it is not needed for fitting the estimator. In this case pass unsupervised=True.

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans

>>> X, y = make_blobs(n_samples=10, random_state=0)
>>> e = skrub.X(X).skb.apply(
...     KMeans(n_clusters=2, n_init=1, random_state=0),
...     y=skrub.y(y),
...     unsupervised=True,
... )
>>> e.skb.cross_validate()["test_score"]
0   -19.437348
1   -12.463938
2   -11.804288
3   -37.238832
4    -4.857855
Name: test_score, dtype: float64
>>> learner = e.skb.make_learner().fit({"X": X})
>>> learner.predict({"X": X})
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=int32)
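
The pass-through behavior of allow_reject can be pictured with a plain-Python sketch that does not use skrub at all. Everything here is hypothetical and chosen for illustration: the Rejected exception, the to_date_column function (a stand-in for a per-column transformer such as ToDatetime, restricted to ISO-formatted strings) and apply_to_columns (a stand-in for the per-column wrapping).

```python
from datetime import date


class Rejected(Exception):
    """Hypothetical signal that a transformer does not apply to a column."""


def to_date_column(values):
    # Stand-in for a per-column transformer: parse ISO dates or reject.
    try:
        return [date.fromisoformat(v) for v in values]
    except (TypeError, ValueError):
        raise Rejected("column does not contain ISO-formatted dates") from None


def apply_to_columns(table, transform, allow_reject=False):
    """Apply transform to each column of a dict-of-lists table."""
    result = {}
    for name, values in table.items():
        try:
            result[name] = transform(values)
        except Rejected:
            if not allow_reject:
                # Strict mode: a rejected column is an error.
                raise
            # Lenient mode: the rejected column is passed through unchanged.
            result[name] = values
    return result


table = {
    "product": ["pen", "cup"],
    "quantity": [2, 3],
    "date": ["2020-04-03", "2020-04-04"],
}
converted = apply_to_columns(table, to_date_column, allow_reject=True)
print(converted["date"])
# [datetime.date(2020, 4, 3), datetime.date(2020, 4, 4)]
print(converted["quantity"])
# [2, 3]
```

With allow_reject=False the same call raises on the first column that cannot be parsed, which matches the documented default of reporting an error rather than silently skipping columns.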