skrub.DataOp.skb.make_learner
DataOp.skb.make_learner(*, fitted=False, keep_subsampling=False, choose='default')
Get a skrub learner for this DataOp.
Returns a SkrubLearner with a fit() method, so it can be fitted to some training data and then applied to unseen data by calling transform() or predict(). Unlike scikit-learn estimators, skrub learners accept a dictionary of inputs rather than X and y arguments.
Warning
If the DataOp contains choices (e.g. choose_from(...)), by default this learner uses the default value of each choice. See the choose parameter for other options (random or from an Optuna trial). To actually pick the best value with hyperparameter tuning, use DataOp.skb.make_randomized_search() or DataOp.skb.make_grid_search() instead, or an Optuna Study as shown in this example.
- Parameters:
- fitted
bool (default=False) If True, the returned learner is fitted to the data provided when initializing variables, as in skrub.var("name", value=...) and skrub.X(value).
- keep_subsampling
bool (default=False) If True, and if subsampling has been configured (see DataOp.skb.subsample()), fit on a subsample of the data. By default subsampling is not applied and all the data is used. Subsampling only applies to fitting the estimator when fitted=True; subsequent use of the estimator is not affected. It is therefore an error to pass keep_subsampling=True with fitted=False (because keep_subsampling=True would have no effect).
- choose
‘default’, ‘random’, ‘random([seed])’ or optuna.Trial instance. How to resolve choices contained in the DataOp. The different options are:
‘default’: the corresponding parameters of the SkrubLearner are not set; the default values of the choices are used.
‘random’: a random value is picked according to the distribution of each choice. The form ‘random([seed])’ is also accepted to set the random seed: for example ‘random(0)’ sets it to 0. ‘random()’ is the same as ‘random’.
an instance of numpy.random.RandomState: same as ‘random’, but the provided RandomState is used to sample values.
an instance of optuna.Trial or optuna.FrozenTrial: it is used to suggest values for the choices.
Note that none of these options picks the best choice value according to an evaluation criterion, as this function creates a single learner. These options can be combined with external logic to evaluate and select the resulting learners, or one of DataOp.skb.make_grid_search(), DataOp.skb.make_randomized_search() or optuna.Study.optimize (as shown in this example) can be used to automatically select the best hyperparameters.
- Returns:
- learner
A skrub learner with an interface similar to scikit-learn’s, except that its methods accept a dictionary of named inputs rather than X and y arguments.
Examples
>>> import skrub
>>> from sklearn.dummy import DummyClassifier
>>> orders_df = skrub.datasets.toy_orders(split="train").orders
>>> orders = skrub.var('orders', orders_df)
>>> X = orders.drop(columns='delayed', errors='ignore').skb.mark_as_X()
>>> y = orders['delayed'].skb.mark_as_y()
>>> pred = X.skb.apply(skrub.TableVectorizer()).skb.apply(
...     DummyClassifier(), y=y
... )
>>> pred
<Apply DummyClassifier>
Result:
―――――――
0    False
1    False
2    False
3    False
Name: delayed, dtype: bool
>>> learner = pred.skb.make_learner(fitted=True)
>>> new_orders_df = skrub.datasets.toy_orders(split='test').X
>>> new_orders_df
   ID product  quantity        date
4   5     cup         5  2020-04-11
5   6    fork         2  2020-04-12
>>> learner.predict({'orders': new_orders_df})
array([False, False])
Note that the 'orders' key in the dictionary passed to predict corresponds to the name 'orders' in skrub.var('orders', orders_df) above.
The choose parameter allows us to control how choices contained in the DataOp should be handled. The default is to use the default value of each choice.
>>> def mult(x, factor):
...     return x * factor
>>> out = skrub.var("x").skb.apply_func(
...     mult, skrub.choose_int(-10, 10, default=2)
... )
>>> out.skb.make_learner().fit_transform({'x': 1})
2
The ‘random’ option samples new choice outcomes for each created learner:
>>> out.skb.make_learner(choose='random').fit_transform({'x': 1})
np.int64(3)
>>> out.skb.make_learner(choose='random').fit_transform({'x': 1})
np.int64(-5)
If an optuna.Trial instance is passed instead, the choice outcomes are obtained by calling the trial’s suggest_int, suggest_float or suggest_categorical methods. This allows easily selecting hyperparameters with optuna.
Please see the examples gallery for full information about DataOps and the learners they generate.