skrub.DataOp.skb.iter_cv_splits

DataOp.skb.iter_cv_splits(environment=None, *, keep_subsampling=False, cv=KFold(5))

Yield splits of an environment into training and testing environments.

Parameters:
environment : dict, optional

The environment (dict mapping variable names to values) containing the full data. If None (the default), the data is retrieved from the DataOp itself (see the sketch after the Examples).

keep_subsampling : bool, default=False

If True, and if subsampling has been configured (see DataOp.skb.subsample()), use a subsample of the data. By default subsampling is not applied and all the data is used.

cv : int, cross-validation generator or iterable, default=KFold(5)

Can be a cross-validation splitter, an iterable yielding (train, test) index pairs, or an int specifying the number of folds for KFold splitting. The default is 5-fold KFold without shuffling; a sketch passing a custom splitter is shown below.
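
Any scikit-learn splitter can be passed as cv. The sketch below reuses the small pipeline from the Examples section further down; the ShuffleSplit settings are illustrative, not required by iter_cv_splits.

>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.model_selection import ShuffleSplit
>>> orders = skrub.var("orders")
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> df = skrub.datasets.toy_orders().orders
>>> # Shuffled 75/25 splits instead of the default unshuffled 5-fold KFold
>>> splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
>>> splits = list(delayed.skb.iter_cv_splits({"orders": df}, cv=splitter))

Passing an int, e.g. cv=3 as in the Examples below, is equivalent to KFold splitting with that many folds.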

Yields:
dict

For each split, a dict is produced, containing the following keys:

  • train: a dictionary containing the training environment

  • test: a dictionary containing the test environment

  • X_train: the value of the variable marked with skb.mark_as_X() in the train environment

  • X_test: the value of the variable marked with skb.mark_as_X() in the test environment

  • y_train: the value of the variable marked with skb.mark_as_y() in the train environment, if there is one (there may be none for unsupervised learning).

  • y_test: the value of the variable marked with skb.mark_as_y() in the test environment, if there is one (there may be none for unsupervised learning).

Examples

>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.metrics import accuracy_score
>>> orders = skrub.var("orders")
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> df = skrub.datasets.toy_orders().orders
>>> accuracies = []
>>> for split in delayed.skb.iter_cv_splits({"orders": df}, cv=3):
...     learner = delayed.skb.make_learner().fit(split["train"])
...     prediction = learner.predict(split["test"])
...     accuracies.append(accuracy_score(split["y_test"], prediction))
>>> accuracies
[1.0, 0.0, 1.0]
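
The same splits can be obtained without passing an environment when the variable is created with its value, in which case the data is retrieved from the DataOp itself. A minimal sketch continuing the session above, assuming skrub.var can be given the dataframe directly as the variable's value:

>>> orders = skrub.var("orders", df)  # variable created with its value (assumed form)
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> splits = list(delayed.skb.iter_cv_splits(cv=3))  # environment=None: data taken from the DataOp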