.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/data_ops/1131_optuna_choices.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code or to run this example in your browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_data_ops_1131_optuna_choices.py:

.. currentmodule:: skrub

.. _example_optuna_choices:

Tuning DataOps with Optuna
==========================

This example shows how to use `Optuna `_ to tune the hyperparameters of a skrub
:class:`DataOp`. As seen in the previous example, skrub DataOps can contain
"choices", objects created with :func:`choose_from`, :func:`choose_int`,
:func:`choose_float`, etc., and we can use hyperparameter search techniques to
pick the best outcome for each choice. Performing this search with Optuna lets
us benefit from its many features, such as state-of-the-art search strategies,
monitoring and visualization, stopping and resuming searches, and parallel or
distributed computation.

To use Optuna with skrub, the ``optuna`` package must be installed first, for
example with pip:

.. code-block:: bash

    pip install optuna

.. GENERATED FROM PYTHON SOURCE LINES 28-34

A simple regressor and example data.
------------------------------------

We will fit a regressor containing a few choices on a toy dataset. We try 2
regressors: extra trees and ridge. They both have hyperparameters that we want
to tune.

.. GENERATED FROM PYTHON SOURCE LINES 36-55

.. code-block:: Python

    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.linear_model import Ridge

    import skrub

    extra_tree = ExtraTreesRegressor(
        min_samples_leaf=skrub.choose_int(1, 32, log=True, name="min_samples_leaf"),
    )
    ridge = Ridge(alpha=skrub.choose_float(0.01, 10.0, log=True, name="α"))

    regressor = skrub.choose_from(
        {"extra_tree": extra_tree, "ridge": ridge}, name="regressor"
    )

    data = skrub.var("data")
    X = data.drop(columns="MedHouseVal", errors="ignore").skb.mark_as_X()
    y = data["MedHouseVal"].skb.mark_as_y()

    pred = X.skb.apply(regressor, y=y)
    print(pred.skb.describe_param_grid())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    - regressor: 'extra_tree'
      min_samples_leaf: choose_int(1, 32, log=True, name='min_samples_leaf')
    - regressor: 'ridge'
      α: choose_float(0.01, 10.0, log=True, name='α')

.. GENERATED FROM PYTHON SOURCE LINES 56-57

Load data for the example

.. GENERATED FROM PYTHON SOURCE LINES 59-70

.. code-block:: Python

    from sklearn.model_selection import KFold

    # (We subsample the dataset by half to make the example run faster)
    df = skrub.datasets.fetch_california_housing().california_housing.sample(
        10_000, random_state=0
    )

    # The environment we will use to fit the learners created by our DataOp.
    env = {"data": df}
    cv = KFold(n_splits=4, shuffle=True, random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 71-83

Selecting the best hyperparameters with Optuna.
-----------------------------------------------

The simplest way to use Optuna is to pass ``backend='optuna'`` to
:meth:`DataOp.skb.make_randomized_search()`. It is used in much the same way as
the default backend (:class:`sklearn.model_selection.RandomizedSearchCV`).
Additional parameters are available to control the Optuna sampler, storage,
study name, and timeout.

Note that in order to persist the study and resume it later, the ``storage``
parameter must be set to a valid database URL (e.g., a SQLite file). Refer to
the User Guide for an example.

.. GENERATED FROM PYTHON SOURCE LINES 85-91

.. code-block:: Python

    search = pred.skb.make_randomized_search(
        backend="optuna", cv=cv, n_iter=10, random_state=10
    )
    search.fit(env)
    search.results_

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Running optuna search for study skrub_randomized_search_9308ffce-a740-429b-9506-1ce0c8c8f488 in storage /tmp/tmpw_twijg1_skrub_optuna_search_storage/optuna_storage
    [I 2026-02-10 10:24:59,185] A new study created in Journal with name: skrub_randomized_search_9308ffce-a740-429b-9506-1ce0c8c8f488
    [I 2026-02-10 10:24:59,426] Trial 0 finished with value: 0.5998958893326463 and parameters: {'2:regressor': '1:ridge', '1:α': 0.08737056333717456}. Best is trial 0 with value: 0.5998958893326463.
    [I 2026-02-10 10:24:59,658] Trial 1 finished with value: 0.5998965729709885 and parameters: {'2:regressor': '1:ridge', '1:α': 0.17777676175509427}. Best is trial 1 with value: 0.5998965729709885.
    [I 2026-02-10 10:24:59,897] Trial 2 finished with value: 0.5998956300607201 and parameters: {'2:regressor': '1:ridge', '1:α': 0.05323908243249831}. Best is trial 1 with value: 0.5998965729709885.
    [I 2026-02-10 10:25:00,128] Trial 3 finished with value: 0.5998953213503151 and parameters: {'2:regressor': '1:ridge', '1:α': 0.012709575518456051}. Best is trial 1 with value: 0.5998965729709885.
    [I 2026-02-10 10:25:05,181] Trial 4 finished with value: 0.7897936635651363 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 4 with value: 0.7897936635651363.
    [I 2026-02-10 10:25:13,174] Trial 5 finished with value: 0.7930210657931196 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}. Best is trial 5 with value: 0.7930210657931196.
    [I 2026-02-10 10:25:13,404] Trial 6 finished with value: 0.5998987110257884 and parameters: {'2:regressor': '1:ridge', '1:α': 0.464456825115831}. Best is trial 5 with value: 0.7930210657931196.
    [I 2026-02-10 10:25:16,241] Trial 7 finished with value: 0.7586118873804404 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 7}. Best is trial 5 with value: 0.7930210657931196.
    [I 2026-02-10 10:25:19,752] Trial 8 finished with value: 0.7745354458535161 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 4}. Best is trial 5 with value: 0.7930210657931196.
    [I 2026-02-10 10:25:24,730] Trial 9 finished with value: 0.7910036605628203 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 5 with value: 0.7930210657931196.
    [I 2026-02-10 10:25:24,732] A new study created in memory with name: skrub_randomized_search_9308ffce-a740-429b-9506-1ce0c8c8f488

.. raw:: html
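
    <table>
      <thead>
        <tr><th></th><th>min_samples_leaf</th><th>α</th><th>regressor</th><th>mean_test_score</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>1.0</td><td>NaN</td><td>extra_tree</td><td>0.793021</td></tr>
        <tr><th>1</th><td>2.0</td><td>NaN</td><td>extra_tree</td><td>0.791004</td></tr>
        <tr><th>2</th><td>2.0</td><td>NaN</td><td>extra_tree</td><td>0.789794</td></tr>
        <tr><th>3</th><td>4.0</td><td>NaN</td><td>extra_tree</td><td>0.774535</td></tr>
        <tr><th>4</th><td>7.0</td><td>NaN</td><td>extra_tree</td><td>0.758612</td></tr>
        <tr><th>5</th><td>NaN</td><td>0.464457</td><td>ridge</td><td>0.599899</td></tr>
        <tr><th>6</th><td>NaN</td><td>0.177777</td><td>ridge</td><td>0.599897</td></tr>
        <tr><th>7</th><td>NaN</td><td>0.087371</td><td>ridge</td><td>0.599896</td></tr>
        <tr><th>8</th><td>NaN</td><td>0.053239</td><td>ridge</td><td>0.599896</td></tr>
        <tr><th>9</th><td>NaN</td><td>0.012710</td><td>ridge</td><td>0.599895</td></tr>
      </tbody>
    </table>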


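The α values tried for the ridge regressor in the results above are spread
across more than one order of magnitude because the choice was declared with
``log=True``. As a minimal sketch of what sampling on a log scale means, the
snippet below draws values uniformly in log space and exponentiates them. It
uses only the standard library and is an illustration, not the sampler that
skrub or Optuna actually use; the helper name ``sample_log_uniform`` is made up
for this example.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)].

    Hypothetical helper for illustration only; not part of skrub or Optuna.
    """
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 10.0, rng) for _ in range(10_000)]

# Each decade ([0.01, 0.1), [0.1, 1), [1, 10]) receives roughly a third of the
# draws, whereas plain uniform sampling on [0.01, 10] would put about 90% of
# them above 1.
below_one_tenth = sum(s < 0.1 for s in samples) / len(samples)
above_one = sum(s >= 1.0 for s in samples) / len(samples)
print(f"share below 0.1: {below_one_tenth:.2f}, share above 1: {above_one:.2f}")
```

Sampling this way gives small regularization strengths as much attention as
large ones, which is usually what is wanted for a parameter such as ``alpha``.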
.. GENERATED FROM PYTHON SOURCE LINES 92-94

The usual ``results_``, ``detailed_results_`` and ``plot_results()`` are still
available.

.. GENERATED FROM PYTHON SOURCE LINES 96-98

.. code-block:: Python

    search.plot_results()

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 99-101

The Optuna :class:`Study ` that was used to run the hyperparameter search is
available in the attribute ``study_``:

.. GENERATED FROM PYTHON SOURCE LINES 103-105

.. code-block:: Python

    search.study_

.. GENERATED FROM PYTHON SOURCE LINES 106-108

.. code-block:: Python

    search.study_.best_params

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}

.. GENERATED FROM PYTHON SOURCE LINES 109-114

This allows us to use Optuna's reporting capabilities provided in
`optuna.visualization `_
or `optuna-dashboard `_.

.. GENERATED FROM PYTHON SOURCE LINES 116-120

.. code-block:: Python

    import optuna

    optuna.visualization.plot_slice(search.study_, params=["0:min_samples_leaf"])

.. raw:: html


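The parameter names reported by the study, such as ``'2:regressor'`` and
``'0:min_samples_leaf'``, start with a number and a colon, and so do
categorical values such as ``'0:extra_tree'``. These prefixed names are the
ones to pass to Optuna functions, as in the ``plot_slice`` call above. For
display purposes, a small helper can strip the prefixes. The sketch below
assumes the ``<digits>:`` pattern seen in this example's output and nothing
more; the function name ``strip_index_prefixes`` is hypothetical and not part
of skrub or Optuna.

```python
import re

# Assumption: names and categorical values look like "<digits>:<label>",
# as in the study output shown in this example.
_PREFIX = re.compile(r"^\d+:")


def strip_index_prefixes(params):
    """Return a copy of ``params`` without the leading "<digits>:" markers."""
    cleaned = {}
    for name, value in params.items():
        if isinstance(value, str):
            value = _PREFIX.sub("", value)
        cleaned[_PREFIX.sub("", name)] = value
    return cleaned


best_params = {"2:regressor": "0:extra_tree", "0:min_samples_leaf": 1}
print(strip_index_prefixes(best_params))
# {'regressor': 'extra_tree', 'min_samples_leaf': 1}
```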
.. GENERATED FROM PYTHON SOURCE LINES 121-155

Using Optuna directly for more advanced use cases
-------------------------------------------------

Often we may want more control over the use of Optuna, or to access
functionality not available through :meth:`DataOp.skb.make_randomized_search`,
such as the ask-and-tell interface, trial pruning, callbacks, or
multi-objective optimization. Using Optuna directly is also easy, as we will
show now.

What makes this possible is that we can pass an Optuna Trial to
:meth:`DataOp.skb.make_learner`, in which case the parameters suggested by the
trial are used to create the learner.

We revisit the example above, following the typical Optuna workflow. The
:class:`optuna.Study ` runs the hyperparameter search. Its method
:meth:`optimize ` is given an ``objective`` function. The ``objective`` must
accept a :class:`~optuna.trial.Trial` object (which is produced by the study
and picks the parameters for a given evaluation of the objective) and return
the value to maximize (or minimize).

To use Optuna with a :class:`DataOp`, we just need to pass the Trial object to
:meth:`DataOp.skb.make_learner`. This creates a :class:`SkrubLearner`
initialized with the parameters picked by the Optuna Trial.

We can then cross-validate the SkrubLearner, or score it however we prefer, and
return the score so that the Optuna Study can take it into account.

Here we return a single score (R²), but multi-objective optimization is also
possible. Please refer to the Optuna documentation for more information.

.. GENERATED FROM PYTHON SOURCE LINES 158-170

.. code-block:: Python

    def objective(trial):
        learner = pred.skb.make_learner(choose=trial)
        cv_results = skrub.cross_validate(learner, environment=env, cv=cv)
        return cv_results["test_score"].mean()


    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=10)
    study.best_params

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    [I 2026-02-10 10:25:30,185] A new study created in memory with name: no-name-d2c3799e-f1f7-475f-9ab5-fc8eb285afc5
    [I 2026-02-10 10:25:30,415] Trial 0 finished with value: 0.5999388375487643 and parameters: {'2:regressor': '1:ridge', '1:α': 7.778421443913227}. Best is trial 0 with value: 0.5999388375487643.
    [I 2026-02-10 10:25:38,452] Trial 1 finished with value: 0.7921475110890374 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}. Best is trial 1 with value: 0.7921475110890374.
    [I 2026-02-10 10:25:38,665] Trial 2 finished with value: 0.5998968471096364 and parameters: {'2:regressor': '1:ridge', '1:α': 0.21419820409047396}. Best is trial 1 with value: 0.7921475110890374.
    [I 2026-02-10 10:25:43,628] Trial 3 finished with value: 0.791341354843085 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 2}. Best is trial 1 with value: 0.7921475110890374.
    [I 2026-02-10 10:25:43,839] Trial 4 finished with value: 0.5998984275446252 and parameters: {'2:regressor': '1:ridge', '1:α': 0.42609550480521413}. Best is trial 1 with value: 0.7921475110890374.
    [I 2026-02-10 10:25:51,816] Trial 5 finished with value: 0.795523106005046 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}. Best is trial 5 with value: 0.795523106005046.
    [I 2026-02-10 10:25:52,026] Trial 6 finished with value: 0.5998955359166415 and parameters: {'2:regressor': '1:ridge', '1:α': 0.040866575407271455}. Best is trial 5 with value: 0.795523106005046.
    [I 2026-02-10 10:25:56,040] Trial 7 finished with value: 0.7813008825764698 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 3}. Best is trial 5 with value: 0.795523106005046.
    [I 2026-02-10 10:25:56,252] Trial 8 finished with value: 0.5999058083279392 and parameters: {'2:regressor': '1:ridge', '1:α': 1.462978585285932}. Best is trial 5 with value: 0.795523106005046.
    [I 2026-02-10 10:25:58,365] Trial 9 finished with value: 0.7201796077975192 and parameters: {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 19}. Best is trial 5 with value: 0.795523106005046.

    {'2:regressor': '0:extra_tree', '0:min_samples_leaf': 1}

.. GENERATED FROM PYTHON SOURCE LINES 171-172

We can also use Optuna's visualization capabilities to inspect the study:

.. GENERATED FROM PYTHON SOURCE LINES 172-174

.. code-block:: Python

    optuna.visualization.plot_optimization_history(study)

.. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 175-177

Now we build a learner with the best hyperparameters and fit it on the full
dataset:

.. GENERATED FROM PYTHON SOURCE LINES 179-188

.. code-block:: Python

    best_learner = pred.skb.make_learner(choose=study.best_trial)

    # This would achieve the same result:
    # best_learner = pred.skb.make_learner()
    # best_learner.set_params(**study.best_params)

    best_learner.fit(env)
    print(best_learner.describe_params())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'min_samples_leaf': 1, 'regressor': 'extra_tree'}

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minutes 4.544 seconds)

**Estimated memory usage:** 613 MB

.. _sphx_glr_download_auto_examples_data_ops_1131_optuna_choices.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.7.2?urlpath=lab/tree/notebooks/auto_examples/data_ops/1131_optuna_choices.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/data_ops/1131_optuna_choices.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 1131_optuna_choices.ipynb <1131_optuna_choices.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 1131_optuna_choices.py <1131_optuna_choices.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 1131_optuna_choices.zip <1131_optuna_choices.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_