skrub.Expr.skb.subsample#
- Expr.skb.subsample(n=1000, *, how='head')[source]#
Configure subsampling of a dataframe or numpy array.
Enables faster development by computing the previews on a subsample of the available data. Outside of previews, no subsampling takes place by default but it can be turned on with the
keep_subsampling
parameter – see the Notes section for details.- Parameters:
- n
int
, default=1000 Number of rows to keep.
- how‘head’ or ‘random’
How subsampling should be done (when it takes place). If ‘head’, the first
n
rows are kept. If ‘random’,n
rows are sampled randomly, without maintaining order and without replacement.
- n
- Returns:
- subsampled data
The subsampled dataframe, column or numpy array.
See also
Expr.skb.preview
Access a preview of the result on the subsampled data.
Notes
This method configures how the dataframe should be subsampled. If it has been configured, subsampling actually only takes place in some specific situations:
When computing the previews (results displayed when printing an expression and the output of
Expr.skb.preview()
).When it is explicitly requested by passing
keep_subsampling=True
to one of the functions that expose that parameter such asExpr.skb.get_randomized_search()
orcross_validate()
.
When subsampling has not been configured (
subsample
has not been called anywhere in the expression), no subsampling is ever done.This method can only be used on steps that produce a dataframe, a column (series) or a numpy array.
Examples
>>> from sklearn.datasets import load_diabetes >>> from sklearn.linear_model import Ridge >>> import skrub
>>> df = load_diabetes(as_frame=True)["frame"] >>> df.shape (442, 11)
>>> data = skrub.var("data", df).skb.subsample(n=15)
We can see that the previews use only a subsample of 15 rows:
>>> data.shape <GetAttr 'shape'> Result (on a subsample): ―――――――――――――――――――――――― (15, 11) >>> X = data.drop("target", axis=1, errors="ignore").skb.mark_as_X() >>> y = data["target"].skb.mark_as_y() >>> pred = X.skb.apply( ... Ridge(alpha=skrub.choose_float(0.01, 10.0, log=True, name="α")), y=y ... )
Here also, the preview for the predictions contains 15 rows:
>>> pred <Apply Ridge> Result (on a subsample): ―――――――――――――――――――――――― target 0 142.866906 1 130.980765 2 138.555388 3 149.703363 4 136.015214 5 139.773213 6 134.110415 7 129.224783 8 140.161363 9 155.272033 10 139.552110 11 130.318783 12 135.956591 13 142.998060 14 132.511013
By default, model fitting and hyperparameter search are done on the full data, so if we want the subsampling to take place we have to pass
keep_subsampling=True
:>>> quick_search = pred.skb.get_randomized_search( ... keep_subsampling=True, fitted=True, n_iter=4, random_state=0 ... ) >>> quick_search.detailed_results_[["mean_test_score", "mean_fit_time", "α"]] mean_test_score mean_fit_time α 0 -0.597596 0.004322 0.431171 1 -0.599036 0.004328 0.443038 2 -0.615900 0.004272 0.643117 3 -0.637498 0.004219 1.398196
Now that we have checked our pipeline works on a subsample, we can fit the hyperparameter search on the full data:
>>> full_search = pred.skb.get_randomized_search( ... fitted=True, n_iter=4, random_state=0 ... ) >>> full_search.detailed_results_[["mean_test_score", "mean_fit_time", "α"]] mean_test_score mean_fit_time α 0 0.457807 0.004791 0.431171 1 0.456808 0.004834 0.443038 2 0.439670 0.004849 0.643117 3 0.380719 0.004827 1.398196
This example dataset is so small that the subsampling does not change the fit computation time but we can tell the second search used the full data from the higher scores. For datasets of a realistic size using the subsampling allows us to do a “dry run” of the cross-validation or model fitting much faster than when using the full data.