Tuning DataOps with Optuna#
This example shows how to use Optuna to tune the hyperparameters of a
skrub DataOp. As seen in the previous example, skrub DataOps can contain
“choices”, objects created with choose_from(), choose_int(),
choose_float(), etc. and we can use hyperparameter search techniques to
pick the best outcome for each choice. Performing this search with Optuna
allows us to benefit from its many features, such as state-of-the-art search
strategies, monitoring and visualization, stopping and resuming searches, and
parallel or distributed computation.
Optuna must be installed before it can be used with skrub. This can be done with pip:
pip install optuna
A simple regressor and example data.#
We will fit a regressor containing a few choices on a toy dataset. We try two regressors, extra trees and ridge, each with a hyperparameter that we want to tune.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
import skrub
extra_tree = ExtraTreesRegressor(
    min_samples_leaf=skrub.choose_int(1, 32, log=True, name="min_samples_leaf"),
)
ridge = Ridge(alpha=skrub.choose_float(0.01, 10.0, log=True, name="α"))
regressor = skrub.choose_from(
    {"extra_tree": extra_tree, "ridge": ridge}, name="regressor"
)
data = skrub.var("data")
X = data.drop(columns="MedHouseVal", errors="ignore").skb.mark_as_X()
y = data["MedHouseVal"].skb.mark_as_y()
pred = X.skb.apply(regressor, y=y)
print(pred.skb.describe_param_grid())
- regressor: 'extra_tree'
min_samples_leaf: choose_int(1, 32, log=True, name='min_samples_leaf')
- regressor: 'ridge'
α: choose_float(0.01, 10.0, log=True, name='α')
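Both numeric choices above are declared with log=True, which asks for values spread evenly on a logarithmic scale rather than a linear one. The sketch below only illustrates what that scale means, using the standard library; it is not skrub's or Optuna's implementation, and the helper name sample_log_uniform is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 10.0, rng) for _ in range(10_000)]

# The range [0.01, 10] spans three decades; on a log scale each decade
# should receive roughly one third of the draws.
decade_edges = [0.01, 0.1, 1.0, 10.0]
fractions = [
    sum(lo <= s < hi for s in samples) / len(samples)
    for lo, hi in zip(decade_edges[:-1], decade_edges[1:])
]
print([round(f, 2) for f in fractions])
```

With a linear scale, about 90% of the draws would instead fall between 1 and 10, so small values of α would rarely be tried.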
Load data for the example
import pandas as pd
from sklearn.model_selection import KFold
# (We subsample the dataset by half to make the example run faster)
file_path = skrub.datasets.fetch_california_housing().path
df = pd.read_csv(file_path).sample(10_000, random_state=0)
# The environment we will use to fit the learners created by our DataOp.
env = {"data": df}
cv = KFold(n_splits=4, shuffle=True, random_state=0)
Selecting the best hyperparameters with Optuna.#
The simplest way to use Optuna is to pass backend='optuna' to
DataOp.skb.make_randomized_search(). The resulting search is used in much
the same way as with the default backend
(sklearn.model_selection.RandomizedSearchCV). Additional
parameters are available to control the Optuna sampler, storage, study
name, and timeout.
Note that in order to persist the study and resume it later, the storage
parameter must be set to a valid database URL (e.g., a SQLite file). Refer to
the User Guide for an example.
# Reconstructed call: the original code cell is missing from this page.
# n_iter=10 is inferred from the ten trials in the log below.
search = pred.skb.make_randomized_search(backend="optuna", cv=cv, n_iter=10)
search.fit(env)
Running optuna search for study skrub_randomized_search_4f565a49-58c4-414b-aa7f-678b267be9f4 in storage /tmp/tmptgc_2gfb_skrub_optuna_search_storage/optuna_storage
[I 2026-03-25 12:55:39,496] A new study created in Journal with name: skrub_randomized_search_4f565a49-58c4-414b-aa7f-678b267be9f4
[I 2026-03-25 12:55:39,772] Trial 0 finished with value: 0.5998958893326463 and parameters: {'2:regressor': '1:ridge', '1:α': 0.08737056333717456}. Best is trial 0 with value: 0.5998958893326463.
[I 2026-03-25 12:55:40,057] Trial 1 finished with value: 0.5998965729709885 and parameters: {'2:regressor': '1:ridge', '1:α': 0.17777676175509427}. Best is trial 1 with value: 0.5998965729709885.
[I 2026-03-25 12:55:40,332] Trial 2 finished with value: 0.5998956300607201 and parameters: {'2:regressor': '1:ridge', '1:α': 0.05323908243249831}. Best is trial 1 with value: 0.5998965729709885.
[I 2026-03-25 12:55:40,601] Trial 3 finished with value: 0.5998953213503151 and parameters: {'2:regressor': '1:ridge', '1:α': 0.012709575518456051}. Best is trial 1 with value: 0.5998965729709885.
[I 2026-03-25 12:55:45,826] Trial 4 finished with value: 0.7908334327788022 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 4 with value: 0.7908334327788022.
[I 2026-03-25 12:55:54,101] Trial 5 finished with value: 0.7934509506206493 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}. Best is trial 5 with value: 0.7934509506206493.
[I 2026-03-25 12:55:54,392] Trial 6 finished with value: 0.5998987110257884 and parameters: {'2:regressor': '1:ridge', '1:α': 0.464456825115831}. Best is trial 5 with value: 0.7934509506206493.
[I 2026-03-25 12:55:57,453] Trial 7 finished with value: 0.7586110016584814 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 7}. Best is trial 5 with value: 0.7934509506206493.
[I 2026-03-25 12:56:01,127] Trial 8 finished with value: 0.7752663630420862 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 4}. Best is trial 5 with value: 0.7934509506206493.
[I 2026-03-25 12:56:06,405] Trial 9 finished with value: 0.7905512830967987 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 5 with value: 0.7934509506206493.
[I 2026-03-25 12:56:06,407] A new study created in memory with name: skrub_randomized_search_4f565a49-58c4-414b-aa7f-678b267be9f4
The usual results_, detailed_results_ and plot_results() are
still available.
The Optuna Study that was used to run the
hyperparameter search is available in the attribute study_:
<optuna.study.study.Study object at 0x707870f4a780>
{'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}
This allows us to use Optuna’s reporting capabilities provided in optuna.visualization or optuna-dashboard.
import optuna
optuna.visualization.plot_slice(search.study_, params=["0:min_samples_leaf"])
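The parameter names stored in the study carry a numeric prefix ('0:min_samples_leaf', '2:regressor'), as the log and the plot_slice call show. When a more readable view is wanted, the prefix can be stripped with a small helper. The function strip_choice_prefixes below is hypothetical and not part of skrub; it assumes, based only on the output shown on this page, that the prefix is always a run of digits followed by a colon.

```python
def strip_choice_prefixes(params):
    """Return a copy of `params` with '<digits>:' removed from keys and string values."""

    def strip(value):
        if isinstance(value, str):
            head, sep, tail = value.partition(":")
            if sep and head.isdigit():
                return tail
        # Non-string values (e.g. integers) and unprefixed strings are kept as-is.
        return value

    return {strip(key): strip(value) for key, value in params.items()}


best = {"2:regressor": "0:extra_tree", "0:min_samples_leaf": 1}
readable = strip_choice_prefixes(best)
print(readable)
```

The prefixed names remain the ones to pass to Optuna functions such as plot_slice, since those are the names the study knows.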
Using Optuna directly for more advanced use cases#
Often we may want more control over the use of Optuna, or access to
functionality not available through DataOp.skb.make_randomized_search(),
such as the ask-and-tell interface, trial pruning, callbacks, or
multi-objective optimization.
Using Optuna directly is also easy, as we show now. This is possible
because we can pass an Optuna Trial to
DataOp.skb.make_learner(), in which case the parameters suggested by the
trial are used to create the learner.
We revisit the example above, following the typical Optuna workflow.
The optuna.Study runs the hyperparameter search.
Its method optimize is given an objective function. The objective must accept a Trial object (which is produced by the study and picks the parameters for a given evaluation of the objective) and return the value to maximize (or minimize).
To use Optuna with a DataOp, we just need to pass the Trial object to DataOp.skb.make_learner(). This creates a SkrubLearner initialized with the parameters picked by the Optuna Trial.
We can then cross-validate the SkrubLearner, or score it however we prefer, and return the score so that the Optuna Study can take it into account.
Here we return a single score (R²), but multi-objective optimization is also possible. Please refer to the Optuna documentation for more information.
def objective(trial):
    learner = pred.skb.make_learner(choose=trial)
    cv_results = skrub.cross_validate(learner, environment=env, cv=cv)
    return cv_results["test_score"].mean()
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
study.best_params
[I 2026-03-25 12:56:14,176] A new study created in memory with name: no-name-87ea7dbc-99a8-4f1f-94c0-c04a3d6ae17f
[I 2026-03-25 12:56:14,443] Trial 0 finished with value: 0.5999076062359522 and parameters: {'2:regressor': '1:ridge', '1:α': 1.7286435255824895}. Best is trial 0 with value: 0.5999076062359522.
[I 2026-03-25 12:56:14,685] Trial 1 finished with value: 0.5998955361320245 and parameters: {'2:regressor': '1:ridge', '1:α': 0.04089486856167892}. Best is trial 0 with value: 0.5999076062359522.
[I 2026-03-25 12:56:14,926] Trial 2 finished with value: 0.5998965424363789 and parameters: {'2:regressor': '1:ridge', '1:α': 0.17372600992084639}. Best is trial 0 with value: 0.5999076062359522.
[I 2026-03-25 12:56:18,644] Trial 3 finished with value: 0.7760420089320459 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 4}. Best is trial 3 with value: 0.7760420089320459.
[I 2026-03-25 12:56:18,901] Trial 4 finished with value: 0.5999151382712157 and parameters: {'2:regressor': '1:ridge', '1:α': 2.9080836673387958}. Best is trial 3 with value: 0.7760420089320459.
[I 2026-03-25 12:56:19,150] Trial 5 finished with value: 0.5998954939135777 and parameters: {'2:regressor': '1:ridge', '1:α': 0.03535008113240401}. Best is trial 3 with value: 0.7760420089320459.
[I 2026-03-25 12:56:19,391] Trial 6 finished with value: 0.5999016017732786 and parameters: {'2:regressor': '1:ridge', '1:α': 0.8620480133429969}. Best is trial 3 with value: 0.7760420089320459.
[I 2026-03-25 12:56:23,650] Trial 7 finished with value: 0.7824535414765221 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 3}. Best is trial 7 with value: 0.7824535414765221.
[I 2026-03-25 12:56:25,736] Trial 8 finished with value: 0.7102358130141629 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 25}. Best is trial 7 with value: 0.7824535414765221.
[I 2026-03-25 12:56:28,050] Trial 9 finished with value: 0.7266278916176526 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 17}. Best is trial 7 with value: 0.7824535414765221.
{'2:regressor': '0:extra_tree', '0:min_samples_leaf': 3}
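We can also use Optuna’s visualization capabilities to inspect the study, for example by passing study to optuna.visualization.plot_slice() as we did with search.study_ above.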
Now we build a learner with the best hyperparameters and fit it on the full dataset:
best_learner = pred.skb.make_learner(choose=study.best_trial)
# This would achieve the same result:
# best_learner = pred.skb.make_learner()
# best_learner.set_params(**study.best_params)
best_learner.fit(env)
print(best_learner.describe_params())
{'min_samples_leaf': 3, 'regressor': 'extra_tree'}