skrub.DataOp.skb.iter_cv_splits
DataOp.skb.iter_cv_splits(environment=None, *, keep_subsampling=False, cv=KFold(5))
Yield splits of an environment into training and testing environments.
Parameters:
- environment : dict, optional
  The environment (dict mapping variable names to values) containing the full data. If None (the default), the data is retrieved from the DataOp.
- keep_subsampling : bool, default=False
  If True, and if subsampling has been configured (see DataOp.skb.subsample()), use a subsample of the data. By default subsampling is not applied and all the data is used.
- cv : int, cross-validation generator or iterable, default=KFold(5)
  The default is 5-fold without shuffling. Can be a cross-validation splitter, an iterable yielding pairs of (train, test) indices, or an int specifying the number of folds for KFold splitting.
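The cv argument accepts the same kinds of values as scikit-learn's cross-validation utilities. As a minimal sketch of the three accepted forms (KFold is a standard scikit-learn splitter; the explicit index pairs assume a hypothetical 4-row table and are shown only for illustration):

>>> from sklearn.model_selection import KFold
>>> cv_as_int = 3                                                  # 3 folds, KFold without shuffling
>>> cv_as_splitter = KFold(n_splits=3, shuffle=True, random_state=0)   # any cross-validation splitter
>>> cv_as_iterable = [([0, 1, 2], [3]), ([1, 2, 3], [0])]          # explicit (train, test) index pairs

Any of these can be passed as the cv argument of iter_cv_splits.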
Yields:
- dict
  For each split, a dict is produced containing the following keys:
  - train: a dictionary containing the training environment
  - test: a dictionary containing the test environment
  - X_train: the value of the variable marked with skb.mark_as_X() in the train environment
  - X_test: the value of the variable marked with skb.mark_as_X() in the test environment
  - y_train: the value of the variable marked with skb.mark_as_y() in the train environment, if there is one (may not be the case for unsupervised learning)
  - y_test: the value of the variable marked with skb.mark_as_y() in the test environment, if there is one (may not be the case for unsupervised learning)
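As a rough sketch of unpacking one yielded dict (it reuses the delayed and df names built in the Examples below; the keys are exactly the ones listed above):

>>> split = next(iter(delayed.skb.iter_cv_splits({"orders": df}, cv=3)))
>>> train_env, test_env = split["train"], split["test"]
>>> X_train, X_test = split["X_train"], split["X_test"]
>>> y_train, y_test = split["y_train"], split["y_test"]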
Examples
>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.metrics import accuracy_score

>>> orders = skrub.var("orders")
>>> X = orders.skb.drop("delayed").skb.mark_as_X()
>>> y = orders["delayed"].skb.mark_as_y()
>>> delayed = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> df = skrub.datasets.toy_orders().orders
>>> accuracies = []
>>> for split in delayed.skb.iter_cv_splits({"orders": df}, cv=3):
...     learner = delayed.skb.make_learner().fit(split["train"])
...     prediction = learner.predict(split["test"])
...     accuracies.append(accuracy_score(split["y_test"], prediction))
>>> accuracies
[1.0, 0.0, 1.0]
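The same loop works with a splitter object instead of an int. A minimal sketch reusing the names from the example above (ShuffleSplit is a standard scikit-learn splitter, chosen here only for illustration):

>>> from sklearn.model_selection import ShuffleSplit
>>> shuffle_cv = ShuffleSplit(n_splits=2, test_size=0.25, random_state=0)
>>> scores = []
>>> for split in delayed.skb.iter_cv_splits({"orders": df}, cv=shuffle_cv):
...     learner = delayed.skb.make_learner().fit(split["train"])
...     scores.append(accuracy_score(split["y_test"], learner.predict(split["test"])))
>>> len(scores)
2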