skrub.Expr.skb.mark_as_X

Expr.skb.mark_as_X()

Mark this expression as being the X table.

This is used for cross-validation and hyperparameter selection: operations done before skb.mark_as_X() and skb.mark_as_y() are executed on the entire data and cannot benefit from hyperparameter tuning. Returns a copy; the original expression is left unchanged.

Returns:

A copy of the input expression, marked as being X.

See also

skrub.X(): skrub.X(value) can be used as a shorthand for skrub.var('X', value).skb.mark_as_X().

Notes

During cross-validation, all the previous steps are first executed, until X and y have been materialized. Then, those are split into training and testing sets. The following steps in the expression are fitted on the train data, and applied to test data, within each split.

This means that any step that comes before mark_as_X() or mark_as_y() (that is, any step needed to compute X and y) sees the full dataset and cannot benefit from hyperparameter tuning. We should therefore start the pipeline by building X and y, and call mark_as_X() and mark_as_y() as early as possible.
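To make the consequence concrete, here is a minimal plain-Python sketch (it does not use skrub) of a statistic computed before the split versus one fitted on the training rows only. The values and the fold indices are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical toy column; the last value is an outlier that lands in the test fold.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train_idx, test_idx = [0, 1, 2, 3], [4]

# Step placed BEFORE the split (analogous to a step before mark_as_X()):
# the statistic is computed on every row, including the test row.
full_mean = mean(values)
centered_before = [v - full_mean for v in values]

# Step placed AFTER the split (analogous to a step after mark_as_X()):
# the statistic is fitted on the training rows only, then applied to the test rows.
train_mean = mean(values[i] for i in train_idx)
centered_after_test = [values[i] - train_mean for i in test_idx]

print(full_mean)   # 22.0, influenced by the test row
print(train_mean)  # 2.5, computed without looking at the test row
```

In the first case the test row has already shaped the transformation, so the evaluation on that row is no longer independent of the fitting step.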

Note: mark_as_X() does not modify the expression in place; it returns a marked copy, as the example below shows.

Examples

>>> import skrub
>>> orders = skrub.var('orders', skrub.toy_orders(split='all').orders)
>>> features = orders.drop(columns='delayed', errors='ignore')
>>> features.skb.is_X
False
>>> X = features.skb.mark_as_X()
>>> X.skb.is_X
True

Note that the original expression is left unchanged:

>>> features.skb.is_X
False

>>> y = orders['delayed'].skb.mark_as_y()
>>> y.skb.is_y
True

Now if we run cross-validation:

>>> from sklearn.dummy import DummyClassifier
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> pred.skb.cross_validate(cv=2)['test_score']
0    0.666667
1    0.666667
Name: test_score, dtype: float64

First, outside of the cross-validation loop, X and y are computed. They are then split into training and test sets, and the rest of the pipeline (in this case the last step, the DummyClassifier) is fitted on the training part and evaluated on the test part of each split.
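The order of operations described above can be sketched in plain Python, without skrub or scikit-learn. This is only an illustration under stated assumptions: the data are hypothetical, the helper two_fold_indices is made up for this sketch, and the most-frequent-class predictor is a simplified stand-in for a dummy classifier, not the library's implementation.

```python
from collections import Counter

# Hypothetical, already-materialized X and y (stand-ins for the marked expressions).
X = [[10], [20], [30], [40], [50], [60]]
y = [False, True, False, False, True, False]


def two_fold_indices(n):
    """Yield (train, test) index lists for a plain, unshuffled 2-fold split."""
    half = n // 2
    first, second = list(range(half)), list(range(half, n))
    yield second, first
    yield first, second


scores = []
for train, test in two_fold_indices(len(y)):
    # "Fit": a most-frequent-class predictor only looks at the training labels
    # (like a dummy classifier, it ignores the features in X).
    majority = Counter(y[i] for i in train).most_common(1)[0][0]
    # "Apply": predict that class for every test row and measure accuracy.
    correct = sum(y[i] == majority for i in test)
    scores.append(correct / len(test))

print(scores)
```

X and y are built once, before the loop; only the fit and apply steps are repeated inside each split.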

Please see the examples gallery for more information.
