skrub.Expr.skb.get_pipeline

Expr.skb.get_pipeline(*, fitted=False, keep_subsampling=False)

Get a skrub pipeline for this expression.

Returns a SkrubPipeline with a fit() method, so that it can be fitted to some training data and then applied to unseen data by calling transform() or predict(). Unlike scikit-learn estimators, skrub pipelines accept a dictionary of inputs rather than X and y arguments.

Warning

If the expression contains choices (e.g. choose_from(...)), this pipeline uses the default value of each choice. To actually pick the best value with hyperparameter tuning, use Expr.skb.get_randomized_search() or Expr.skb.get_grid_search() instead.

Parameters:

fitted (bool, default=False): If True, the returned pipeline is fitted to the data provided when initializing variables in skrub.var("name", value=...) and skrub.X(value).
keep_subsampling (bool, default=False): If True, and if subsampling has been configured (see Expr.skb.subsample()), fit on a subsample of the data. By default subsampling is not applied and all the data is used. Subsampling only applies when fitting the estimator with fitted=True; subsequent use of the estimator is not affected by it. It is therefore an error to pass keep_subsampling=True together with fitted=False, because keep_subsampling=True would have no effect.

Returns:

pipeline: A skrub pipeline with an interface similar to scikit-learn’s, except that its methods accept a dictionary of named inputs rather than X and y arguments.

Examples

>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> orders_df = skrub.toy_orders().orders
>>> orders = skrub.var('orders', orders_df)
>>> X = orders.drop(columns='delayed', errors='ignore').skb.mark_as_X()
>>> y = orders['delayed'].skb.mark_as_y()
>>> pred = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> pred
<Apply DummyClassifier>
Result:
―――――――
   delayed
0    False
1    False
2    False
3    False
>>> pipeline = pred.skb.get_pipeline(fitted=True)
>>> new_orders_df = skrub.toy_orders(split='test').X
>>> new_orders_df
   ID product  quantity        date
4   5     cup         5  2020-04-11
5   6    fork         2  2020-04-12
>>> pipeline.predict({'orders': new_orders_df})
array([False, False])

Note that the 'orders' key in the dictionary passed to predict corresponds to the name 'orders' in skrub.var('orders', orders_df) above.

Please see the examples gallery for full information about expressions and the pipelines they generate.
