Tuning DataOps with Optuna#

Optuna is a powerful hyperparameter optimization framework for efficiently searching for the best hyperparameters of machine learning models. It provides both sophisticated search algorithms and tools to monitor and visualize the optimization process.

There are two main ways of using Optuna with skrub DataOps: either by using Optuna as a backend in the make_randomized_search() method, or by creating an Optuna study directly and using it to pick values for skrub choices when calling DataOp.skb.make_learner().

Note

To use Optuna with skrub, you need to have Optuna installed in your Python environment. You can install it using pip:

pip install optuna

Setting a storage for the Optuna study#

When using Optuna as a backend for hyperparameter search, we can persist the study and its results by passing the storage parameter to make_randomized_search(). This allows us to resume the search later or analyze the results after it completes.

search = pred.skb.make_randomized_search(
    fitted=True,
    random_state=0,
    backend="optuna",
    storage="sqlite:///optuna_study.db",  # Use a SQLite database file
)

If no storage is provided, a temporary storage is used during optimization; once the search completes, the study is moved to an in-memory storage so that the resulting search object is self-contained.
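
For example, once the search has completed, the persisted study can be reloaded for analysis. A minimal sketch: optuna.load_study() accepts study_name=None when the storage contains a single study.

import optuna

# Reload the study persisted in the SQLite file used above.
study = optuna.load_study(study_name=None, storage="sqlite:///optuna_study.db")
print(study.best_params)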

Using Optuna directly#

It is also possible to use Optuna directly with skrub DataOps. This allows for more flexibility and control over the optimization process, as we can define custom objectives and leverage Optuna’s advanced features, such as the ask-and-tell interface, trial pruning, and multi-objective optimization.

In this case, rather than running the hyperparameter search through make_randomized_search(), we let an optuna.Study drive the search: we define an objective function that evaluates a skrub learner with hyperparameters suggested by Optuna.

Study.optimize() is given an objective function. The objective must accept a Trial object (produced by the study, it picks the parameters for a given evaluation of the objective) and return the value to maximize (or minimize).

To use Optuna with a DataOp, we just need to pass the Trial object to DataOp.skb.make_learner(). This creates a SkrubLearner initialized with the parameters picked by the optuna Trial.

We can then cross-validate the SkrubLearner, or score it however we prefer, and return the score so that the optuna Study can take it into account.

Here we return a single score (R²), but multi-objective optimization is also possible; a short sketch is given at the end of this section, and the Optuna documentation has more details.

>>> import optuna
>>> import skrub
>>> def objective(trial):
...     # Build a learner whose hyperparameters are chosen by the Optuna trial.
...     learner = pred.skb.make_learner(choose=trial)
...     cv_results = skrub.cross_validate(learner, environment=pred.skb.get_data(), cv=4)
...     # Report the mean cross-validated test score to the study.
...     return cv_results["test_score"].mean()
>>> study = optuna.create_study(direction="maximize")
>>> study.optimize(objective, n_trials=16)
>>> best_params = study.best_params
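
The same search can also be written with the ask-and-tell interface mentioned above, which gives finer control over the loop. A minimal sketch, equivalent to the objective-based version:

>>> study = optuna.create_study(direction="maximize")
>>> for _ in range(16):
...     # Ask the study for a new trial; it suggests the hyperparameter values.
...     trial = study.ask()
...     learner = pred.skb.make_learner(choose=trial)
...     score = skrub.cross_validate(learner, environment=pred.skb.get_data(), cv=4)["test_score"].mean()
...     # Report the score back so the sampler can guide later trials.
...     study.tell(trial, score)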

Then, we can create the best learner using the best trial found by Optuna:

>>> best_learner = pred.skb.make_learner(choose=study.best_trial)

Alternatively, the learner can be built with its default choices and then updated with the best parameters:

>>> best_learner = pred.skb.make_learner()
>>> best_learner.set_params(**study.best_params)
SkrubLearner(data_op=<Apply HistGradientBoostingClassifier>)

Then, we can inspect the parameters as usual:

>>> best_learner.describe_params()
{'k': 12, 'learning_rate': 0.06401143720094754, 'classifier': 'hgb'}
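
As noted earlier, multi-objective optimization is also possible. A minimal sketch, assuming we additionally minimize the mean fit time (here taken from a fit_time entry in the cross-validation results, mirroring scikit-learn; the exact key is an assumption):

>>> def objective_multi(trial):
...     learner = pred.skb.make_learner(choose=trial)
...     cv_results = skrub.cross_validate(learner, environment=pred.skb.get_data(), cv=4)
...     # Two values: mean test score to maximize, mean fit time to minimize.
...     return cv_results["test_score"].mean(), cv_results["fit_time"].mean()
>>> study_multi = optuna.create_study(directions=["maximize", "minimize"])
>>> study_multi.optimize(objective_multi, n_trials=16)
>>> pareto_trials = study_multi.best_trials  # Pareto-optimal trials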

You can find a more complete example in Tuning DataOps with Optuna.

Parallelism#

Optuna’s optuna.study.Study.optimize() uses thread-based parallelism. When we use DataOp.skb.make_randomized_search() with the Optuna backend, both threading and multiprocessing can be used. Skrub chooses based on the joblib configuration: if joblib is configured to use processes (the default), parallelization is done with joblib; if joblib is configured to use the “threading” backend, Optuna’s built-in thread-based parallelism is used instead.
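
For example, to opt into Optuna’s thread-based parallelism we can select joblib’s threading backend around the search. A sketch using joblib.parallel_config(), available since joblib 1.3:

from joblib import parallel_config

# With the threading backend active, skrub defers to Optuna's built-in
# thread-based parallelism instead of joblib processes.
with parallel_config(backend="threading"):
    search = pred.skb.make_randomized_search(
        fitted=True,
        backend="optuna",
    )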

When the timeout parameter is used, Optuna’s built-in, thread-based parallelization is always used regardless of the joblib configuration.
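
For instance, assuming timeout is given in seconds as in Optuna’s Study.optimize():

# Stop the search after 10 minutes, regardless of the joblib configuration.
search = pred.skb.make_randomized_search(
    fitted=True,
    backend="optuna",
    timeout=600,
)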

Using the Optuna dashboard#

Optuna provides a dashboard that allows us to visualize and monitor the optimization process in real-time. This can be especially useful for long-running hyperparameter searches. To use the Optuna Dashboard, we need to install it first:

pip install optuna-dashboard

We can then start the dashboard by running the following command in the terminal:

optuna-dashboard STORAGE_URL

where STORAGE_URL is the same storage URL used in the Optuna study.
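
For example, with the SQLite storage used earlier:

optuna-dashboard sqlite:///optuna_study.db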

We can then access the dashboard in our web browser at http://localhost:8080 (by default). The dashboard provides various visualizations and tools to analyze the optimization process, such as parameter importance, optimization history, and parallel coordinate plots.