skrub.DataOp.skb.apply
- DataOp.skb.apply(estimator, *, y=None, cols=all(), exclude_cols=None, how='auto', allow_reject=False, unsupervised=False)
Apply a scikit-learn estimator to a dataframe or numpy array.
- Parameters:
- estimator : scikit-learn estimator
  The transformer or predictor to apply.
- y : dataframe, column or numpy array, optional
  The prediction targets when estimator is a supervised estimator.
- cols : str, list of strings or skrub selector, optional
  The columns to transform, when estimator is a transformer.
- exclude_cols : str, list of strings or skrub selector, optional
  When estimator is a transformer, columns to which it should _not_ be applied. The columns that are matched by cols AND not matched by exclude_cols are transformed.
- how : “auto”, “cols”, “frame” or “no_wrap”, optional
  How the estimator is applied. In most cases the default “auto” is appropriate.
  - “cols” means estimator is wrapped in an ApplyToCols transformer, which fits a separate clone of estimator to each column in cols. estimator must be a transformer (have a fit_transform method).
  - “frame” means estimator is wrapped in an ApplyToFrame transformer, which fits a single clone of estimator to the selected part of the input dataframe. estimator must be a transformer.
  - “no_wrap” means no wrapping: estimator is applied directly to the unmodified input.
  - “auto” chooses the wrapping depending on the input and estimator. If the input is not a dataframe or the estimator is not a transformer, the “no_wrap” strategy is chosen. Otherwise, if the estimator has a __single_column_transformer__ attribute, “cols” is chosen. Otherwise “frame” is chosen.
- allow_reject : bool, optional
  Whether the transformer can refuse to transform columns to which it does not apply, in which case they are passed through unchanged. This can be useful to avoid specifying exactly which columns should be transformed. For example, if we apply skrub.ToDatetime() to all columns with allow_reject=True, string columns that can be parsed as dates will be converted and all other columns will be passed through. If we use allow_reject=False (the default), an error is raised if the dataframe contains columns to which ToDatetime does not apply (e.g. a column of numbers).
- unsupervised : bool, optional
  Use this to indicate that y is required for scoring but not fitting, as is the case for clustering algorithms. If y is not required at all (for example when applying an unsupervised transformer, or when we are not interested in scoring with ground-truth labels), simply leave the default y=None; there is no need to pass a value for unsupervised.
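The decision order described for how=“auto” can be sketched in plain Python. This only illustrates the documented rule and is not skrub's implementation: resolve_how and the three toy classes are hypothetical names, and recognizing a transformer by its fit_transform method is a simplifying assumption.

```python
def resolve_how(how, input_is_dataframe, estimator):
    """Return the wrapping strategy following the documented "auto" rule.

    Hypothetical helper for illustration only; not part of skrub.
    """
    if how != "auto":
        # An explicit choice is used as given.
        return how
    # Assumption: a transformer is recognized by its fit_transform method.
    is_transformer = hasattr(estimator, "fit_transform")
    if not input_is_dataframe or not is_transformer:
        return "no_wrap"
    if hasattr(estimator, "__single_column_transformer__"):
        return "cols"
    return "frame"


class FrameTransformer:
    """Toy transformer that works on a whole dataframe."""

    def fit_transform(self, X, y=None):
        return X


class SingleColumnTransformer(FrameTransformer):
    """Toy transformer flagged as operating on one column at a time."""

    __single_column_transformer__ = True


class Predictor:
    """Toy supervised predictor (no fit_transform method)."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


print(resolve_how("auto", True, SingleColumnTransformer()))   # cols
print(resolve_how("auto", True, FrameTransformer()))          # frame
print(resolve_how("auto", True, Predictor()))                 # no_wrap
print(resolve_how("auto", False, SingleColumnTransformer()))  # no_wrap
```

Passing an explicit value such as how="frame" bypasses the rule entirely, which is why the sketch returns it unchanged.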
- Returns:
- result
  The transformed dataframe when estimator is a transformer, and the fitted estimator’s predictions if it is a supervised predictor.
See also
skrub.DataOp.skb.make_learner
  Get a skrub learner for this DataOp.
skrub.ApplyToCols
  Transformer that applies a given transformer separately to each selected column.
skrub.ApplyToFrame
  Transformer that applies a given transformer to part of a dataframe.
Examples
>>> import skrub
>>> data = skrub.datasets.toy_orders()
>>> x = skrub.X(data.X)
>>> x
<Var 'X'>
Result:
―――――――
   ID product  quantity        date
0   1     pen         2  2020-04-03
1   2     cup         3  2020-04-04
2   3     cup         5  2020-04-04
3   4   spoon         1  2020-04-05

>>> datetime_encoder = skrub.DatetimeEncoder(add_total_seconds=False)
>>> x.skb.apply(skrub.TableVectorizer(datetime=datetime_encoder))
<Apply TableVectorizer>
Result:
―――――――
    ID  product_cup  product_pen  ...  date_year  date_month  date_day
0  1.0          0.0          1.0  ...     2020.0         4.0       3.0
1  2.0          1.0          0.0  ...     2020.0         4.0       4.0
2  3.0          1.0          0.0  ...     2020.0         4.0       4.0
3  4.0          0.0          0.0  ...     2020.0         4.0       5.0
Transform only the 'product' column:

>>> x.skb.apply(skrub.StringEncoder(n_components=2), cols='product')
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity        date
0   1 -2.560113e-16  1.000000e+00         2  2020-04-03
1   2  1.000000e+00  7.447602e-17         3  2020-04-04
2   3  1.000000e+00  7.447602e-17         5  2020-04-04
3   4 -3.955170e-16 -8.326673e-17         1  2020-04-05
Transform all but the 'ID' and 'quantity' columns:

>>> x.skb.apply(
...     skrub.StringEncoder(n_components=2), exclude_cols=["ID", "quantity"]
... )
<Apply StringEncoder>
Result:
―――――――
   ID     product_0     product_1  quantity    date_0    date_1
0   1  9.775252e-08  7.830415e-01         2  0.766318 -0.406667
1   2  9.999999e-01  0.000000e+00         3  0.943929  0.330148
2   3  9.999998e-01 -1.490116e-08         5  0.943929  0.330149
3   4  9.910963e-08 -6.219692e-01         1  0.766318 -0.406668
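The rule that decides which columns get transformed (matched by cols AND not matched by exclude_cols) can be pictured with a small plain-Python sketch. It models column names only, not skrub selectors; select_columns is a hypothetical helper, and cols=None stands in for the default “all columns” selection.

```python
def select_columns(all_columns, cols=None, exclude_cols=None):
    """Names matched by cols AND not matched by exclude_cols.

    Hypothetical helper for illustration only; not part of skrub.
    """

    def as_set(spec):
        # Accept either a single column name or a list of names.
        return {spec} if isinstance(spec, str) else set(spec)

    included = set(all_columns) if cols is None else as_set(cols)
    excluded = set() if exclude_cols is None else as_set(exclude_cols)
    # Preserve the original column order of the dataframe.
    return [c for c in all_columns if c in included and c not in excluded]


columns = ["ID", "product", "quantity", "date"]
print(select_columns(columns, exclude_cols=["ID", "quantity"]))
# ['product', 'date']
print(select_columns(columns, cols="product"))
# ['product']
print(select_columns(columns, cols=["product", "date"], exclude_cols="date"))
# ['product']
```

The first call mirrors the exclude_cols example on the toy orders table, where only 'product' and 'date' end up encoded.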
More complex selection of the columns to transform, here all numeric columns except 'ID':

>>> from sklearn.preprocessing import StandardScaler
>>> from skrub import selectors as s

>>> x.skb.apply(StandardScaler(), cols=s.numeric() - "ID")
<Apply StandardScaler>
Result:
―――――――
   ID product        date  quantity
0   1     pen  2020-04-03 -0.507093
1   2     cup  2020-04-04  0.169031
2   3     cup  2020-04-04  1.521278
3   4   spoon  2020-04-05 -1.183216
For supervised estimators, pass the targets as the argument for y:

>>> from sklearn.dummy import DummyClassifier
>>> y = skrub.y(data.y)
>>> y
<Var 'y'>
Result:
―――――――
0    False
1    False
2     True
3    False
Name: delayed, dtype: bool

>>> x.skb.apply(skrub.TableVectorizer()).skb.apply(DummyClassifier(), y=y)
<Apply DummyClassifier>
Result:
―――――――
0    False
1    False
2    False
3    False
Name: delayed, dtype: bool
Sometimes we want to pass a value for y because it is required for scoring and cross-validation, but it is not needed for fitting the estimator. In this case pass unsupervised=True.

>>> from sklearn.datasets import make_blobs
>>> from sklearn.cluster import KMeans

>>> X, y = make_blobs(n_samples=10, random_state=0)
>>> e = skrub.X(X).skb.apply(
...     KMeans(n_clusters=2, n_init=1, random_state=0),
...     y=skrub.y(y),
...     unsupervised=True,
... )
>>> e.skb.cross_validate()["test_score"]
0   -19.437348
1   -12.463938
2   -11.804288
3   -37.238832
4    -4.857855
Name: test_score, dtype: float64
>>> learner = e.skb.make_learner().fit({"X": X})
>>> learner.predict({"X": X})
array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=int32)