skrub.DataOp.skb.make_randomized_search
DataOp.skb.make_randomized_search(*, fitted=False, keep_subsampling=False, backend='sklearn', n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False, storage=None, study_name=None, sampler=None, timeout=None)
Find the best parameters with randomized search.
This function returns a ParamSearch, an object similar to scikit-learn's RandomizedSearchCV; the main difference is that fit() and predict() accept a dictionary of inputs rather than X and y. The best learner is stored in the attribute .best_learner_.
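For instance, a minimal sketch of this dictionary-based interface (the LogisticRegression plan below is illustrative; the dictionary keys "X" and "y" come from the skrub.X and skrub.y variables):

>>> import skrub
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X_a, y_a = make_classification(random_state=0)
>>> X, y = skrub.X(X_a), skrub.y(y_a)
>>> pred = X.skb.apply(
...     LogisticRegression(C=skrub.choose_float(0.1, 10.0, log=True, name="C")),
...     y=y,
... )
>>> search = pred.skb.make_randomized_search(random_state=0)
>>> search = search.fit({"X": X_a, "y": y_a})  # a dict of inputs, not (X, y)
>>> predictions = search.predict({"X": X_a})   # same dict-based interface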
Parameters:
- fitted : bool, default=False
  If True, the randomized search is fitted on the data provided when initializing variables in this DataOp (the data returned by .skb.get_data()).
- keep_subsampling : bool, default=False
  If True, and if subsampling has been configured (see DataOp.skb.subsample()), fit on a subsample of the data. By default subsampling is not applied and all the data is used. This only applies to fitting the randomized search when fitted=True; subsequent use of the randomized search is not affected by subsampling. It is therefore an error to pass keep_subsampling=True and fitted=False (because keep_subsampling=True would have no effect). See the subsampling sketch after this parameter list.
- backend : 'sklearn' or 'optuna', default='sklearn'
  Which library to use for hyperparameter search. The default is 'sklearn', which uses scikit-learn's RandomizedSearchCV. If 'optuna', an Optuna Study is used instead and it is possible to choose the sampler and storage (see the Optuna sketch after this parameter list).
- n_iter : int, default=10
  Number of parameter combinations to try.
- scoring : str, callable, list or dict, default=None
  Strategy to evaluate the model's predictions. It can be:
  - None: use the predictor's score method
  - a metric name such as 'accuracy'
  - a list of such names
  - a callable (estimator, X_test, y_test) → score
  - a dict mapping metric name to callable
  - a callable returning a dict mapping metric name to value
  See the scikit-learn documentation for details, and the multi-metric example in the Examples section below.
- n_jobs : int or None, default=None
  Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- refit : bool or str, default=True
  Whether to refit a learner on the whole dataset using the best parameters found. For multiple metric evaluation it should be the name of the metric used to pick the best parameters. If backend='optuna', it is also the metric that drives the Optuna optimization.
- cv : int, cross-validation iterator or iterable, default=None
  Cross-validation splitting strategy. It can be:
  - None: 5-fold (stratified) cross-validation
  - an integer: the number of folds
  - a scikit-learn CV splitter
  - an iterable yielding (train, test) splits as arrays of indices
- verbose : int, default=0
  Verbosity; the higher, the more verbose. It is recommended to leave it at 0 when using backend='optuna'.
- pre_dispatch : int or str, default='2*n_jobs'
  Number of jobs dispatched during parallel execution, when using the joblib parallelization.
- random_state : int, RandomState instance or None, default=None
  Pseudo random number generator state used for random sampling. Pass an int for reproducible output across multiple function calls.
  Note: the result will never be deterministic when using backend='optuna' and n_jobs > 1 (as the sampled parameters depend on previous runs).
- error_score : 'raise' or float, default=np.nan
  Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised.
- return_train_score : bool, default=False
  Also compute scores on the training set, in which case they will be available in the cv_results_ attribute in addition to the scores on the test set.
- storage : None or str, default=None
  The URL of the database to use as the Optuna storage. In addition to the usual relational database URLs (e.g. 'sqlite:///<file_path>'), it can be 'journal:///<file_path>' to use Optuna's JournalStorage. See the SQLAlchemy documentation for information on how to construct database URLs, which take the general form dialect+driver://username:password@host:port/database.
- study_name : None or str, default=None
  The name to use for the created (or loaded) Optuna study. If the study already exists in the provided storage, the existing one is loaded. If None, a random name is generated.
- sampler : None or Optuna sampler, default=None
  The sampler to use when the backend is 'optuna'. If None, a TPESampler is used (the same default as create_study()).
- timeout : None or float, default=None
  Timeout after which no new trials are created. Trials already started when reaching the timeout are still completed. If None, there is no timeout and all n_iter trials are completed.
  Note: if this parameter is used, parallelization when n_jobs > 1 is always done with Optuna's built-in parallelization, which relies on multithreading. This means threads are used (rather than processes) regardless of the joblib backend.
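The subsampling sketch referenced above: a minimal, illustrative example where subsampling is configured on a single source table before splitting X and y (the DataFrame, column names and n=100 are illustrative choices, not defaults):

>>> import pandas as pd
>>> import skrub
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X_a, y_a = make_classification(n_samples=500, random_state=0)
>>> df = pd.DataFrame(X_a, columns=[f"c{i}" for i in range(X_a.shape[1])])
>>> df["target"] = y_a
>>> data = skrub.var("data", df).skb.subsample(n=100)  # configure subsampling
>>> X = data.drop(columns="target").skb.mark_as_X()
>>> y = data["target"].skb.mark_as_y()
>>> pred = X.skb.apply(
...     LogisticRegression(C=skrub.choose_float(0.1, 10.0, log=True, name="C")),
...     y=y,
... )
>>> # keep_subsampling=True: this fit uses only the 100-row subsample; the
>>> # fitted search is then used as usual, without subsampling.
>>> search = pred.skb.make_randomized_search(
...     fitted=True, keep_subsampling=True, random_state=0
... )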
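The Optuna sketch referenced above, continuing from the plan defined in the previous sketch. The storage URL, study name, sampler and timeout are illustrative choices, not defaults, and optuna must be installed:

>>> import optuna
>>> search = pred.skb.make_randomized_search(
...     fitted=True,
...     backend="optuna",
...     n_iter=20,
...     storage="sqlite:///searches.db",        # persist trials in SQLite
...     study_name="logistic-search",           # reuse the study across runs
...     sampler=optuna.samplers.RandomSampler(seed=0),
...     timeout=60.0,                           # stop creating trials after 60s
... )
>>> study = search.study_  # the underlying optuna Study, for direct inspection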
Returns:
- ParamSearch or OptunaParamSearch
  An object implementing the hyperparameter search. Besides the usual fit and predict, attributes of interest are results_, plot_results(), and best_learner_. If backend='optuna' was used, the returned object is an OptunaParamSearch, which additionally has a study_ attribute: the Optuna Study that performed the hyperparameter optimization.
See also
skrub.DataOp.skb.make_grid_search
  Find the best parameters with grid search.
skrub.DataOp.skb.make_learner
  Make a SkrubLearner without actually searching for the best hyperparameters. The strategy to resolve choices can be using the default value, random sampling, or taking suggestions from an Optuna optuna.trial.Trial. This allows using Optuna directly, rather than through the make_randomized_search interface, for more advanced use cases.
Examples
>>> import skrub
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.dummy import DummyClassifier

>>> X_a, y_a = make_classification(random_state=0)
>>> X, y = skrub.X(X_a), skrub.y(y_a)
>>> selector = SelectKBest(k=skrub.choose_int(4, 20, log=True, name='k'))
>>> logistic = LogisticRegression(C=skrub.choose_float(0.1, 10.0, log=True, name="C"))
>>> rf = RandomForestClassifier(
...     n_estimators=skrub.choose_int(3, 30, log=True, name="N 🌴"),
...     random_state=0,
... )
>>> classifier = skrub.choose_from(
...     {"logistic": logistic, "rf": rf, "dummy": DummyClassifier()}, name="classifier"
... )
>>> pred = X.skb.apply(selector, y=y).skb.apply(classifier, y=y)
>>> print(pred.skb.describe_param_grid())
- k: choose_int(4, 20, log=True, name='k')
  classifier: 'logistic'
  C: choose_float(0.1, 10.0, log=True, name='C')
- k: choose_int(4, 20, log=True, name='k')
  classifier: 'rf'
  N 🌴: choose_int(3, 30, log=True, name='N 🌴')
- k: choose_int(4, 20, log=True, name='k')
  classifier: 'dummy'

>>> search = pred.skb.make_randomized_search(fitted=True, random_state=0)
>>> search.results_
    k         C  N 🌴 classifier  mean_test_score
0   4  4.626363  NaN   logistic             0.92
1  16       NaN  6.0         rf             0.90
2  11       NaN  7.0         rf             0.88
3   7  3.832217  NaN   logistic             0.87
4  10  4.881255  NaN   logistic             0.85
5  20  3.965675  NaN   logistic             0.80
6  14       NaN  3.0         rf             0.77
7   4       NaN  NaN      dummy             0.50
8  10       NaN  NaN      dummy             0.50
9   5       NaN  NaN      dummy             0.50
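The multi-metric example referenced in the scoring parameter: a short continuation sketch using the pred plan above, where refit names the metric used to select best_learner_ (the metric names are standard scikit-learn scorers):

>>> multi = pred.skb.make_randomized_search(
...     fitted=True,
...     scoring=["accuracy", "roc_auc"],
...     refit="roc_auc",  # with several metrics, name the one used to refit
...     random_state=0,
... )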
Please refer to the examples gallery for an in-depth explanation.