Tuning DataOps with Optuna#

This example shows how to use Optuna to tune the hyperparameters of a skrub DataOp. As seen in the previous example, skrub DataOps can contain “choices”, objects created with choose_from(), choose_int(), choose_float(), etc. and we can use hyperparameter search techniques to pick the best outcome for each choice. Performing this search with Optuna allows us to benefit from its many features, such as state-of-the-art search strategies, monitoring and visualization, stopping and resuming searches, and parallel or distributed computation.

In order to use Optuna with skrub, the package must be installed first. This can be done with pip:

pip install optuna

A simple regressor and example data.#

We will fit a regressor containing a few choices on a toy dataset. We try two regressors: extra trees and ridge regression. Both have hyperparameters that we want to tune.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge

import skrub

extra_tree = ExtraTreesRegressor(
    min_samples_leaf=skrub.choose_int(1, 32, log=True, name="min_samples_leaf"),
)
ridge = Ridge(alpha=skrub.choose_float(0.01, 10.0, log=True, name="α"))

regressor = skrub.choose_from(
    {"extra_tree": extra_tree, "ridge": ridge}, name="regressor"
)
data = skrub.var("data")
X = data.drop(columns="MedHouseVal", errors="ignore").skb.mark_as_X()
y = data["MedHouseVal"].skb.mark_as_y()
pred = X.skb.apply(regressor, y=y)
print(pred.skb.describe_param_grid())
- regressor: 'extra_tree'
  min_samples_leaf: choose_int(1, 32, log=True, name='min_samples_leaf')
- regressor: 'ridge'
  α: choose_float(0.01, 10.0, log=True, name='α')

Load data for the example

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold

# (We subsample the dataset by half to make the example run faster)
df = fetch_california_housing(as_frame=True).frame.sample(10_000, random_state=0)

# The environment we will use to fit the learners created by our DataOp.
env = {"data": df}
cv = KFold(n_splits=4, shuffle=True, random_state=0)

Selecting the best hyperparameters with Optuna.#

The simplest way to use Optuna is to pass backend='optuna' to DataOp.skb.make_randomized_search(). It is used in much the same way as the default backend (sklearn.model_selection.RandomizedSearchCV). Additional parameters are available to control the Optuna sampler, the storage and study name, and the timeout. Note that in order to persist the study and resume it later, the storage parameter must be set to a valid database URL (e.g., a SQLite file); refer to the User Guide for an example.
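For illustration, here is a minimal sketch of such a persistent search (only the storage parameter mentioned above is shown; this sketch is not run here):

# A sketch of a persistent, resumable search: `storage` points at a SQLite URL
# (any database URL supported by Optuna works).
persistent_search = pred.skb.make_randomized_search(
    backend="optuna",
    cv=cv,
    n_iter=10,
    storage="sqlite:///skrub_optuna_example.db",
)
persistent_search.fit(env)

In this example we simply keep the default, temporary storage: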

search = pred.skb.make_randomized_search(backend="optuna", cv=cv, n_iter=10)
search.fit(env)
search.results_
Running optuna search for study skrub_randomized_search_19659252-79ca-4770-85cd-8e8efc7db0d8 in storage /tmp/tmpkuvzr0n0_skrub_optuna_search_storage/optuna_storage
[I 2025-12-11 22:22:00,571] A new study created in Journal with name: skrub_randomized_search_19659252-79ca-4770-85cd-8e8efc7db0d8
[I 2025-12-11 22:22:04,070] Trial 0 finished with value: 0.7652881473233559 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 6}. Best is trial 0 with value: 0.7652881473233559.
[I 2025-12-11 22:22:06,436] Trial 1 finished with value: 0.7166275246081899 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 21}. Best is trial 0 with value: 0.7652881473233559.
[I 2025-12-11 22:22:07,634] Trial 2 finished with value: 0.5999128192700078 and parameters: {'2:regressor': '1:ridge', '1:α': 2.53257860062555}. Best is trial 0 with value: 0.7652881473233559.
[I 2025-12-11 22:22:09,236] Trial 3 finished with value: 0.5998983778241221 and parameters: {'2:regressor': '1:ridge', '1:α': 0.4193784471459371}. Best is trial 0 with value: 0.7652881473233559.
[I 2025-12-11 22:22:18,144] Trial 4 finished with value: 0.794182310175938 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}. Best is trial 4 with value: 0.794182310175938.
[I 2025-12-11 22:22:22,136] Trial 5 finished with value: 0.7748953283529009 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 4}. Best is trial 4 with value: 0.794182310175938.
[I 2025-12-11 22:22:27,796] Trial 6 finished with value: 0.790706770010962 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 4 with value: 0.794182310175938.
[I 2025-12-11 22:22:30,705] Trial 7 finished with value: 0.7526033083987798 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 9}. Best is trial 4 with value: 0.794182310175938.
[I 2025-12-11 22:22:34,088] Trial 8 finished with value: 0.766132414324147 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 6}. Best is trial 4 with value: 0.794182310175938.
[I 2025-12-11 22:22:37,899] Trial 9 finished with value: 0.7759210273415043 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 4}. Best is trial 4 with value: 0.794182310175938.
[I 2025-12-11 22:22:37,901] A new study created in memory with name: skrub_randomized_search_19659252-79ca-4770-85cd-8e8efc7db0d8
   min_samples_leaf         α   regressor  mean_test_score
0               1.0       NaN  extra_tree         0.794182
1               2.0       NaN  extra_tree         0.790707
2               4.0       NaN  extra_tree         0.775921
3               4.0       NaN  extra_tree         0.774895
4               6.0       NaN  extra_tree         0.766132
5               6.0       NaN  extra_tree         0.765288
6               9.0       NaN  extra_tree         0.752603
7              21.0       NaN  extra_tree         0.716628
8               NaN  2.532579       ridge         0.599913
9               NaN  0.419378       ridge         0.599898


The usual results_, detailed_results_ and plot_results() are still available.
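For example (a quick sketch using the attributes and method named above):

search.detailed_results_  # a more detailed dataframe of the search results
search.plot_results()     # an interactive plot of the search results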



The Optuna Study that was used to run the hyperparameter search is available in the attribute study_:

print(search.study_)
print(search.study_.best_params)
<optuna.study.study.Study object at 0x7fe7f8c695b0>
{'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}

This allows us to use Optuna’s reporting and visualization capabilities, such as those provided by optuna.visualization or optuna-dashboard.

import optuna

optuna.visualization.plot_slice(search.study_, params=["0:min_samples_leaf"])


Using Optuna directly for more advanced use cases#

Often we may want more control over the use of Optuna, or access to functionality not available through DataOp.skb.make_randomized_search(), such as the ask-and-tell interface, trial pruning, callbacks, or multi-objective optimization.

Using Optuna directly ourselves is also easy, as we show now. What makes this possible is that we can pass an Optuna Trial to DataOp.skb.make_learner(), in which case the parameters suggested by the trial are used to create the learner.

We revisit the example above, following the typical Optuna workflow:

- The optuna.Study runs the hyperparameter search.
- Its method optimize is given an objective function. The objective must accept a Trial object (which is produced by the study and picks the parameters for a given evaluation of the objective) and return the value to maximize (or minimize).
- To use Optuna with a DataOp, we just need to pass the Trial object to DataOp.skb.make_learner(). This creates a SkrubLearner initialized with the parameters picked by the Optuna Trial.
- We can then cross-validate the SkrubLearner, or score it however we prefer, and return the score so that the Optuna Study can take it into account.

Here we return a single score (R²), but multi-objective optimization is also possible. Please refer to the Optuna documentation for more information.

def objective(trial):
    learner = pred.skb.make_learner(choose=trial)
    cv_results = skrub.cross_validate(learner, environment=env, cv=cv)
    return cv_results["test_score"].mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
study.best_params
[I 2025-12-11 22:22:41,019] A new study created in memory with name: no-name-f53d1bdb-4add-4fb9-984e-145bcf9ab405
[I 2025-12-11 22:22:41,317] Trial 0 finished with value: 0.5999194317008708 and parameters: {'2:regressor': '1:ridge', '1:α': 3.6369000053665252}. Best is trial 0 with value: 0.5999194317008708.
[I 2025-12-11 22:22:41,605] Trial 1 finished with value: 0.5998997665729485 and parameters: {'2:regressor': '1:ridge', '1:α': 0.6082664624078673}. Best is trial 0 with value: 0.5999194317008708.
[I 2025-12-11 22:22:44,587] Trial 2 finished with value: 0.7543675538031376 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 8}. Best is trial 2 with value: 0.7543675538031376.
[I 2025-12-11 22:22:44,882] Trial 3 finished with value: 0.5999217869726455 and parameters: {'2:regressor': '1:ridge', '1:α': 4.057625148517793}. Best is trial 2 with value: 0.7543675538031376.
[I 2025-12-11 22:22:53,360] Trial 4 finished with value: 0.7931217834262223 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}. Best is trial 4 with value: 0.7931217834262223.
[I 2025-12-11 22:22:53,622] Trial 5 finished with value: 0.5999079641826593 and parameters: {'2:regressor': '1:ridge', '1:α': 1.7822078537393546}. Best is trial 4 with value: 0.7931217834262223.
[I 2025-12-11 22:22:53,886] Trial 6 finished with value: 0.5999297003331783 and parameters: {'2:regressor': '1:ridge', '1:α': 5.6072175983763595}. Best is trial 4 with value: 0.7931217834262223.
[I 2025-12-11 22:22:54,176] Trial 7 finished with value: 0.599912335844857 and parameters: {'2:regressor': '1:ridge', '1:α': 2.455763940595294}. Best is trial 4 with value: 0.7931217834262223.
[I 2025-12-11 22:22:59,611] Trial 8 finished with value: 0.7886027689298862 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 4 with value: 0.7931217834262223.
[I 2025-12-11 22:22:59,878] Trial 9 finished with value: 0.5998953016904454 and parameters: {'2:regressor': '1:ridge', '1:α': 0.010132538390380303}. Best is trial 4 with value: 0.7931217834262223.

{'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}
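
Because make_learner(choose=trial) works with any Optuna Trial, the other features mentioned above fit in naturally. As a minimal sketch, here is the same search written with the ask-and-tell interface (ask_tell_study is a fresh Study created purely for illustration):

ask_tell_study = optuna.create_study(direction="maximize")
for _ in range(10):
    trial = ask_tell_study.ask()  # the study proposes parameters for this trial
    learner = pred.skb.make_learner(choose=trial)  # build a learner for the proposed parameters
    score = skrub.cross_validate(learner, environment=env, cv=cv)["test_score"].mean()
    ask_tell_study.tell(trial, score)  # report the score back to the study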

We can also use Optuna’s visualization capabilities to inspect the study:
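For example, mirroring the plot_slice call from the first section:

optuna.visualization.plot_slice(study, params=["0:min_samples_leaf"])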



Now we build a learner with the best hyperparameters and fit it on the full dataset:

best_learner = pred.skb.make_learner(choose=study.best_trial)

# This would achieve the same result:
# best_learner = pred.skb.make_learner()
# best_learner.set_params(**study.best_params)

best_learner.fit(env)
print(best_learner.describe_params())
{'min_samples_leaf': 1, 'regressor': 'extra_tree'}
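
The fitted learner can then be used like any other estimator. As a sketch of getting predictions (assuming predict() accepts an environment dict like fit(), and that new_houses is a hypothetical dataframe of unseen rows, which thanks to errors="ignore" above does not need a MedHouseVal column):

# `new_houses` is a hypothetical dataframe of unseen rows (no target column needed).
predictions = best_learner.predict({"data": new_houses})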

Total running time of the script: (1 minutes 3.567 seconds)
