skrub.DataOp.skb.mark_as_X
DataOp.skb.mark_as_X()
Mark this DataOp as being the X table.
This is used for cross-validation and hyperparameter selection: operations done before skb.mark_as_X() and skb.mark_as_y() are executed on the entire data and cannot benefit from hyperparameter tuning. Returns a copy; the original DataOp is left unchanged.
Returns:
A copy of the input DataOp, which has been marked as being X.
See also
skrub.X()
skrub.X(value) can be used as a shorthand for skrub.var('X', value).skb.mark_as_X().
Notes
During cross-validation, all the previous steps are first executed, until X and y have been materialized. Then, those are split into training and testing sets. The following steps in the DataOp are fitted on the train data, and applied to test data, within each split.
This means that any step that comes before mark_as_X() or mark_as_y(), that is, any step needed to compute X and y, sees the full dataset and cannot benefit from hyperparameter tuning. So we should be careful to start our learner by building X and y, and to use mark_as_X() and mark_as_y() as soon as possible.
skrub.X(value) can be used as a shorthand for skrub.var('X', value).skb.mark_as_X().
Note: mark_as_X() does not modify the DataOp it is called on; it returns a marked copy.
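The difference between a step placed before the split and one placed after it can be illustrated with a toy sketch. This is not skrub code: it uses only the Python standard library, and the helper k_fold_indices and the sample values are hypothetical, chosen purely for illustration. The "step" here is simply computing a mean that would be used to center the data.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds (hypothetical helper)."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = n if i == k - 1 else start + fold
        test = list(range(start, stop))
        train = [j for j in range(n) if j < start or j >= stop]
        yield train, test


# Hypothetical data; the last value is an outlier that ends up in one fold only.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# A step placed BEFORE the split is computed once, on every row,
# so it also sees the rows that will later be used for testing.
global_mean = mean(values)

# The same step placed AFTER the split is recomputed on each training set only.
train_means = [
    mean(values[j] for j in train)
    for train, _test in k_fold_indices(len(values), 2)
]

print("mean computed on the full data:", global_mean)
print("means computed per training fold:", train_means)
```

In this sketch the statistic computed up front differs from both per-fold statistics, which is the situation the notes above warn about: anything computed before X and y are marked is shared across all splits.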
Examples
>>> import skrub
>>> orders = skrub.var('orders', skrub.datasets.toy_orders(split='all').orders)
>>> features = orders.drop(columns='delayed', errors='ignore')
>>> features.skb.is_X
False
>>> X = features.skb.mark_as_X()
>>> X.skb.is_X
True
Note that the original is left unchanged:
>>> features.skb.is_X
False
>>> y = orders['delayed'].skb.mark_as_y()
>>> y.skb.is_y
True
Now if we run cross-validation:
>>> from sklearn.dummy import DummyClassifier
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> pred.skb.cross_validate(cv=2)['test_score']
0    0.666667
1    0.666667
Name: test_score, dtype: float64
First (outside of the cross-validation loop), X and y are computed. Then, they are split into training and test sets. Then the rest of the learner (in this case the last step, the DummyClassifier) is evaluated on those splits.
Please see the examples gallery for more information.