skrub.Expr.skb.subsample

Expr.skb.subsample(n=1000, *, how='head')

Configure subsampling of a dataframe or numpy array.

Enables faster development by computing previews on a subsample of the available data. Outside of previews, no subsampling takes place by default, but it can be turned on with the keep_subsampling parameter – see the Notes section for details.

Parameters:
n : int, default=1000

Number of rows to keep.

how : 'head' or 'random', default='head'

How subsampling is done (when it takes place). If 'head', the first n rows are kept. If 'random', n rows are sampled uniformly at random, without replacement; the original row order is not preserved.

Returns:
subsampled data :

The subsampled dataframe, column or numpy array.

See also

Expr.skb.preview

Access a preview of the result on the subsampled data.

Notes

This method configures how the dataframe should be subsampled. Even once it has been configured, subsampling actually only takes place in some specific situations:

when previews of the expression's results are computed;

when keep_subsampling=True is passed to functions that fit or evaluate the pipeline, such as get_randomized_search shown in the Examples below.

When subsampling has not been configured (subsample has not been called anywhere in the expression), no subsampling is ever done.

This method can only be used on steps that produce a dataframe, a column (series) or a numpy array.
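The semantics of the two how strategies can be sketched with plain NumPy (a minimal illustration of what 'head' and 'random' mean, not skrub's implementation):

```python
import numpy as np

# A toy array standing in for the data skrub would subsample.
data = np.arange(100).reshape(50, 2)
n = 15

# how='head': keep the first n rows, in order.
head = data[:n]

# how='random': draw n rows without replacement;
# the original row order is not preserved.
rng = np.random.default_rng(0)
random_rows = data[rng.choice(data.shape[0], size=n, replace=False)]

print(head.shape, random_rows.shape)  # (15, 2) (15, 2)
```

Both strategies yield exactly n rows; 'head' is deterministic, while 'random' gives a more representative sample when the data has any ordering (e.g. sorted by date).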

Examples

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import Ridge
>>> import skrub
>>> df = load_diabetes(as_frame=True)["frame"]
>>> df.shape
(442, 11)
>>> data = skrub.var("data", df).skb.subsample(n=15)

We can see that the previews use only a subsample of 15 rows:

>>> data.shape
<GetAttr 'shape'>
Result (on a subsample):
――――――――――――――――――――――――
(15, 11)
>>> X = data.drop("target", axis=1, errors="ignore").skb.mark_as_X()
>>> y = data["target"].skb.mark_as_y()
>>> pred = X.skb.apply(
...     Ridge(alpha=skrub.choose_float(0.01, 10.0, log=True, name="α")), y=y
... )

Here also, the preview for the predictions contains 15 rows:

>>> pred
<Apply Ridge>
Result (on a subsample):
――――――――――――――――――――――――
        target
0   142.866906
1   130.980765
2   138.555388
3   149.703363
4   136.015214
5   139.773213
6   134.110415
7   129.224783
8   140.161363
9   155.272033
10  139.552110
11  130.318783
12  135.956591
13  142.998060
14  132.511013

By default, model fitting and hyperparameter search are done on the full data, so if we want the subsampling to take place, we have to pass keep_subsampling=True:

>>> quick_search = pred.skb.get_randomized_search(
...     keep_subsampling=True, fitted=True, n_iter=4, random_state=0
... )
>>> quick_search.detailed_results_[["mean_test_score", "mean_fit_time", "α"]]
   mean_test_score  mean_fit_time         α
0        -0.597596       0.004322  0.431171
1        -0.599036       0.004328  0.443038
2        -0.615900       0.004272  0.643117
3        -0.637498       0.004219  1.398196

Now that we have checked that our pipeline works on a subsample, we can fit the hyperparameter search on the full data:

>>> full_search = pred.skb.get_randomized_search(
...     fitted=True, n_iter=4, random_state=0
... )
>>> full_search.detailed_results_[["mean_test_score", "mean_fit_time", "α"]]
   mean_test_score  mean_fit_time         α
0         0.457807       0.004791  0.431171
1         0.456808       0.004834  0.443038
2         0.439670       0.004849  0.643117
3         0.380719       0.004827  1.398196

This example dataset is so small that subsampling does not change the fit time, but we can tell that the second search used the full data from its higher scores. For datasets of realistic size, subsampling lets us do a “dry run” of the cross-validation or model fitting much faster than using the full data.