skrub.DataOp.skb.make_learner

DataOp.skb.make_learner(*, fitted=False, keep_subsampling=False)

Get a skrub learner for this DataOp.

Returns a SkrubLearner with a fit() method, so it can be fitted to training data and then applied to unseen data by calling transform() or predict(). Unlike scikit-learn estimators, skrub learners accept a dictionary of inputs rather than X and y arguments.

Warning

If the DataOp contains choices (e.g. choose_from(...)), this learner uses the default value of each choice. To actually pick the best value with hyperparameter tuning, use DataOp.skb.make_randomized_search() or DataOp.skb.make_grid_search() instead.

Parameters:
fitted : bool (default=False)

If True, the returned learner is fitted to the data provided when initializing variables in skrub.var("name", value=...) and skrub.X(value).

keep_subsampling : bool (default=False)

If True, and if subsampling has been configured (see DataOp.skb.subsample()), fit on a subsample of the data. By default subsampling is not applied and all the data is used. Subsampling is only applied when fitting the estimator with fitted=True; subsequent use of the estimator is not affected by it. It is therefore an error to pass keep_subsampling=True together with fitted=False, because keep_subsampling=True would have no effect.

Returns:
learner

A skrub learner with an interface similar to scikit-learn’s, except that its methods accept a dictionary of named inputs rather than X and y arguments.

Examples

>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> orders_df = skrub.datasets.toy_orders(split="train").orders
>>> orders = skrub.var('orders', orders_df)
>>> X = orders.drop(columns='delayed', errors='ignore').skb.mark_as_X()
>>> y = orders['delayed'].skb.mark_as_y()
>>> pred = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> pred
<Apply DummyClassifier>
Result:
―――――――
   delayed
0    False
1    False
2    False
3    False
>>> learner = pred.skb.make_learner(fitted=True)
>>> new_orders_df = skrub.datasets.toy_orders(split='test').X
>>> new_orders_df
   ID product  quantity        date
4   5     cup         5  2020-04-11
5   6    fork         2  2020-04-12
>>> learner.predict({'orders': new_orders_df})
array([False, False])

Note that the 'orders' key in the dictionary passed to predict corresponds to the name 'orders' in skrub.var('orders', orders_df) above.

Please see the examples gallery for full information about DataOps and the learners they generate.