skrub.DataOp.skb.iter_cv_splits

DataOp.skb.iter_cv_splits(environment=None, *, keep_subsampling=False, cv=KFold(5))

Yield splits of an environment into training and testing environments.

Parameters:
environment : dict, optional

The environment (dict mapping variable names to values) containing the full data. If None (the default), the data is retrieved from the DataOp itself (see the sketch after the Examples).

keep_subsampling : bool, default=False

If True, and if subsampling has been configured (see DataOp.skb.subsample()), use a subsample of the data. By default subsampling is not applied and all the data is used.

cv : int, cross-validation generator or iterable, default=KFold(5)

Can be a cross-validation splitter, an iterable yielding (train, test) index pairs, or an int specifying the number of folds for KFold splitting. The default is 5-fold KFold without shuffling; a sketch passing a custom splitter is shown below.
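
Any scikit-learn splitter can be passed as cv. The sketch below reuses the small pipeline from the Examples section further down; the ShuffleSplit settings are illustrative, not required by iter_cv_splits.

>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.model_selection import ShuffleSplit
>>> orders = skrub.var("orders")
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> df = skrub.datasets.toy_orders().orders
>>> # Shuffled 75/25 splits instead of the default unshuffled 5-fold KFold
>>> splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
>>> splits = list(delayed.skb.iter_cv_splits({"orders": df}, cv=splitter))

Passing an int, e.g. cv=3 as in the Examples below, is equivalent to KFold splitting with that many folds.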

Yields:
dict

For each split, a dict is produced, containing the following keys:

  • train: a dictionary containing the training environment

  • test: a dictionary containing the test environment

  • X_train: the value of the variable marked with skb.mark_as_X() in the train environment

  • X_test: the value of the variable marked with skb.mark_as_X() in the test environment

  • y_train: the value of the variable marked with skb.mark_as_y() in the train environment, if there is one (there may be none for unsupervised learning).

  • y_test: the value of the variable marked with skb.mark_as_y() in the test environment, if there is one (there may be none for unsupervised learning).

Examples

>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.metrics import accuracy_score
>>> orders = skrub.var("orders")
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> df = skrub.datasets.toy_orders().orders
>>> accuracies = []
>>> for split in delayed.skb.iter_cv_splits({"orders": df}, cv=3):
...     learner = delayed.skb.make_learner().fit(split["train"])
...     prediction = learner.predict(split["test"])
...     accuracies.append(accuracy_score(split["y_test"], prediction))
>>> accuracies
[1.0, 0.0, 1.0]
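
The same splits can be obtained without passing an environment when the variable is created with its value, in which case the data is retrieved from the DataOp itself. A minimal sketch continuing the session above, assuming skrub.var can be given the dataframe directly as the variable's value:

>>> orders = skrub.var("orders", df)  # variable created with its value (assumed form)
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> splits = list(delayed.skb.iter_cv_splits(cv=3))  # environment=None: data taken from the DataOp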