Subsampling data for easier development and debugging#

If the data used for the preview is large, it can be useful to work on a subsample of the data to speed up the development and debugging process. This can be done by calling the .skb.subsample() method on a variable: this signals to skrub that what is shown when printing DataOps, or returned by .skb.preview() is computed on a subsample of the data.

Note that subsampling is “local”: if it is applied to a variable, it only affects the variable itself. This may lead to unexpected results and errors if, for example, X is subsampled but y is not.

Subsampling is turned off by default when we call other methods such as .skb.eval(), .skb.cross_validate(), .skb.train_test_split, DataOp.skb.make_learner(), DataOp.skb.make_randomized_search(), etc. However, all of those methods have a keep_subsampling parameter that we can set to True to force using the subsampling when we call them. Note that even if we set keep_subsampling=True, subsampling is not applied when using predict.

See more details in a full example.