skrub.Expr.skb.get_pipeline
- Expr.skb.get_pipeline(*, fitted=False, keep_subsampling=False)
Get a skrub pipeline for this expression.
Returns a SkrubPipeline. Please see the examples gallery for full information about expressions and the pipelines they generate.
Provides a skrub pipeline with a fit() method so we can fit it to some training data and then apply it to unseen data by calling transform() or predict().
An important difference between skrub pipelines and scikit-learn estimators is that fit(), transform() etc. accept a dictionary of inputs rather than X and y arguments (see examples below).
We can pass fitted=True to get a pipeline fitted to the data provided as the values in skrub.var("name", value=...) and skrub.X(value).
Warning
If the expression contains choices (e.g. choose_from(...)), this pipeline uses the default value of each choice. To actually pick the best value with hyperparameter tuning, use Expr.skb.get_randomized_search() or Expr.skb.get_grid_search() instead.
- Parameters:
- fitted
bool (default=False)
If True, the returned pipeline is fitted to the data provided when initializing variables in the expression.
- keep_subsampling
bool (default=False)
If True, and if subsampling has been configured (see Expr.skb.subsample()), fit on a subsample of the data. By default subsampling is not applied and all the data is used. Subsampling is only applied when fitting the estimator with fitted=True; subsequent use of the estimator is not affected by it. Therefore it is an error to pass keep_subsampling=True and fitted=False (because keep_subsampling=True would have no effect).
- Returns:
- pipeline
A skrub pipeline with an interface similar to scikit-learn’s, except that its methods accept a dictionary of named inputs rather than X and y arguments.
Examples
>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> orders_df = skrub.toy_orders().orders
>>> orders = skrub.var('orders', orders_df)
>>> X = orders.drop(columns='delayed', errors='ignore').skb.mark_as_X()
>>> y = orders['delayed'].skb.mark_as_y()
>>> pred = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> pred
<Apply DummyClassifier>
Result:
―――――――
   delayed
0    False
1    False
2    False
3    False
>>> pipeline = pred.skb.get_pipeline(fitted=True)
>>> new_orders_df = skrub.toy_orders(split='test').X
>>> new_orders_df
   ID product  quantity        date
4   5     cup         5  2020-04-11
5   6    fork         2  2020-04-12
>>> pipeline.predict({'orders': new_orders_df})
array([False, False])
Note that the 'orders' key in the dictionary passed to predict corresponds to the name 'orders' in skrub.var('orders', orders_df) above.