skrub.Expr.skb.mark_as_X
- Expr.skb.mark_as_X()
Mark this expression as being the X table.

Returns a copy; the original expression is left unchanged.
This is used for cross-validation and hyperparameter selection: the nodes marked with .skb.mark_as_X() and .skb.mark_as_y() define the cross-validation splits.

During cross-validation, all the previous steps are first executed, until X and y have been materialized. Then, those are split into training and testing sets. The following steps in the expression are fitted on the train data, and applied to test data, within each split.
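As a rough illustration of this procedure, here is a plain-Python sketch. It does not use skrub and is not how skrub implements cross-validation; the toy data, the `two_fold_indices` helper and the threshold "model" are all hypothetical stand-ins. X and y are computed once, then each split fits on the training rows and scores on the test rows.

```python
from statistics import mean

# Hypothetical toy data standing in for the materialized X and y.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]


def two_fold_indices(n):
    """Yield (train, test) index lists for a simple 2-fold split."""
    first, second = list(range(0, n, 2)), list(range(1, n, 2))
    yield second, first
    yield first, second


def fit_threshold(X_train, y_train):
    """'Fit' step: learn a threshold halfway between the two class means."""
    mean_0 = mean(x for x, label in zip(X_train, y_train) if label == 0)
    mean_1 = mean(x for x, label in zip(X_train, y_train) if label == 1)
    return (mean_0 + mean_1) / 2


scores = []
for train, test in two_fold_indices(len(X)):
    # Steps after X and y are fitted on the training rows only...
    threshold = fit_threshold([X[i] for i in train], [y[i] for i in train])
    # ...and then applied to the held-out test rows.
    predictions = [int(X[i] > threshold) for i in test]
    accuracy = mean(int(p == y[i]) for p, i in zip(predictions, test))
    scores.append(accuracy)
```

Only the code inside the loop is refitted for each split; everything that produced `X` and `y` ran a single time beforehand.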
This means that any step that comes before mark_as_X() or mark_as_y() (that is, any step needed to compute X and y) sees the full dataset and cannot benefit from hyperparameter tuning. So we should be careful to start our pipeline by building X and y, and to use mark_as_X() and mark_as_y() as soon as possible.

skrub.X(value) can be used as a shorthand for skrub.var('X', value).skb.mark_as_X().

Please see the examples gallery for more information.
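To make concrete why a step placed before mark_as_X() "sees the full dataset", the following plain-Python sketch compares a centering step computed before and after a split. It does not use skrub; the values and the fixed train/test split are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical column of values and a fixed train/test split.
values = [1.0, 2.0, 3.0, 10.0]
train, test = values[:2], values[2:]

# Step placed before the split: the mean uses every row, test rows included.
full_mean = mean(values)
centered_before_split = [v - full_mean for v in test]

# Step placed after the split: the mean is learned on the training rows only.
train_mean = mean(train)
centered_after_split = [v - train_mean for v in test]
```

The two versions transform the same test rows differently because the first one has already used information from those rows, which is the situation the advice above aims to avoid.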
Note: because a copy is returned, the mark applies only to the returned expression; use that expression in the rest of the pipeline.
- Returns:
  - A copy of the input expression, which has been marked as being X.
Examples
>>> import skrub
>>> orders = skrub.var('orders', skrub.toy_orders(split='all').orders)
>>> features = orders.drop(columns='delayed', errors='ignore')
>>> features.skb.is_X
False
>>> X = features.skb.mark_as_X()
>>> X.skb.is_X
True

Note the original is left unchanged:

>>> features.skb.is_X
False

>>> y = orders['delayed'].skb.mark_as_y()
>>> y.skb.is_y
True
Now if we run cross-validation:
>>> from sklearn.dummy import DummyClassifier
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> pred.skb.cross_validate(cv=2)['test_score']
0    0.666667
1    0.666667
Name: test_score, dtype: float64
First (outside of the cross-validation loop), X and y are computed. Then, they are split into training and test sets. Then the rest of the pipeline (in this case the last step, the DummyClassifier) is evaluated on those splits.