skrub.Expr.skb.mark_as_X

Expr.skb.mark_as_X()

Mark this expression as being the X table.

Returns a copy; the original expression is left unchanged.

This is used for cross-validation and hyperparameter selection: the nodes marked with .skb.mark_as_X() and .skb.mark_as_y() define the cross-validation splits.

During cross-validation, all the steps leading up to X and y are executed first, so that X and y are materialized. X and y are then split into training and testing sets. Within each split, the remaining steps of the expression are fitted on the training data and applied to the test data.

As a result, any step that comes before mark_as_X() or mark_as_y() (that is, any step needed to compute X and y) sees the full dataset and cannot benefit from hyperparameter tuning. We should therefore start the pipeline by building X and y, and call mark_as_X() and mark_as_y() as early as possible.
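To make this concern concrete, here is a minimal plain-Python sketch that does not use skrub. The column values and the single train/test split are hypothetical; the point is only that a statistic computed before the split depends on rows that later land in the test set.

```python
from statistics import mean

# Hypothetical toy column; the last row is an outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

# One hypothetical split: the outlier ends up in the test set.
train, test = values[:4], values[4:]

# A step placed BEFORE the split computes its statistic from every row,
# including the row that is later used for testing.
mean_before_split = mean(values)

# A step placed AFTER the split computes its statistic from training rows only.
mean_after_split = mean(train)

print(mean_before_split)  # 22.0
print(mean_after_split)   # 2.5
```

The two means differ only because of the test row, which is the kind of information a step placed before mark_as_X() would have access to.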

skrub.X(value) can be used as a shorthand for skrub.var('X', value).skb.mark_as_X().

Please see the examples gallery for more information.

Note: the expression is not modified in place. Use the returned copy, which carries the X mark.

Returns:
A copy of the input expression, marked as being X.

Examples

>>> import skrub
>>> orders = skrub.var('orders', skrub.toy_orders(split='all').orders)
>>> features = orders.drop(columns='delayed', errors='ignore')
>>> features.skb.is_X
False
>>> X = features.skb.mark_as_X()
>>> X.skb.is_X
True

Note that the original is left unchanged:

>>> features.skb.is_X
False
>>> y = orders['delayed'].skb.mark_as_y()
>>> y.skb.is_y
True

Now if we run cross-validation:

>>> from sklearn.dummy import DummyClassifier
>>> pred = X.skb.apply(DummyClassifier(), y=y)
>>> pred.skb.cross_validate(cv=2)['test_score']
0    0.666667
1    0.666667
Name: test_score, dtype: float64

First, outside of the cross-validation loop, X and y are computed. They are then split into training and test sets. Finally, the rest of the pipeline (in this case the last step, the DummyClassifier) is fitted and evaluated on those splits.
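The same sequence can be sketched by hand with scikit-learn alone. This is a conceptual analogue under stated assumptions, not skrub's actual implementation: the arrays are hypothetical, and plain KFold without shuffling is an assumption chosen to keep the example deterministic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold

# Step 1: X and y are computed once, outside the cross-validation loop.
# (Hypothetical toy data standing in for the marked X and y.)
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 0, 1, 0])

scores = []
# Step 2: X and y are split into training and test sets.
for train_idx, test_idx in KFold(n_splits=2).split(X):
    # Step 3: the remaining step is fitted on the training rows only...
    model = DummyClassifier().fit(X[train_idx], y[train_idx])
    # ...and scored on the held-out test rows.
    scores.append(model.score(X[test_idx], y[test_idx]))

print(scores)
```

Anything that had been computed in step 1 would be shared by both splits, which is why only the steps after the marked nodes are refitted per split.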