Tuning DataOps with Optuna#

Optuna is a powerful hyperparameter optimization framework for efficiently searching for the best hyperparameters of machine learning models. It provides both sophisticated search algorithms and tools to monitor and visualize the optimization process.

There are two main ways of using Optuna with skrub DataOps: either by using Optuna as a backend in the make_randomized_search() method, or by creating an Optuna study directly and using it to pick values for skrub choices when calling DataOp.skb.make_learner().

Note

To use Optuna with skrub, you need to have Optuna installed in your Python environment. You can install it using pip:

pip install optuna

Setting a storage for the Optuna study#

When using Optuna as a backend for hyperparameter search, we can persist the study and its results by passing the storage parameter to make_randomized_search(). This allows us to resume the search later or analyze the results after it completes.

search = pred.skb.make_randomized_search(
    fitted=True,
    random_state=0,
    backend="optuna",
    storage="sqlite:///optuna_study.db",  # Use a SQLite database file
)

If no storage is provided, a temporary storage is used during optimization; once the search completes, the study is moved to an in-memory storage so that the resulting search object is self-contained.
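
For example, once the search has completed, the persisted study can be reloaded for analysis. A minimal sketch: optuna.load_study() accepts study_name=None when the storage contains a single study.

import optuna

# Reload the study persisted in the SQLite file used above.
study = optuna.load_study(study_name=None, storage="sqlite:///optuna_study.db")
print(study.best_params)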

Using Optuna directly#

It is also possible to use Optuna directly with skrub DataOps. This allows for more flexibility and control over the optimization process, as we can define custom objectives and leverage Optuna’s advanced features, such as the ask-and-tell interface, trial pruning, and multi-objective optimization.

In this case, rather than running the hyperparameter search through make_randomized_search(), we let an optuna.Study drive the search: we define an objective function that evaluates a skrub learner with hyperparameters suggested by Optuna.

Study.optimize() is given an objective function. The objective must accept a Trial object (produced by the study, it picks the parameters for a given evaluation of the objective) and return the value to maximize (or minimize).

To use Optuna with a DataOp, we just need to pass the Trial object to DataOp.skb.make_learner(). This creates a SkrubLearner initialized with the parameters picked by the optuna Trial.

We can then cross-validate the SkrubLearner, or score it however we prefer, and return the score so that the optuna Study can take it into account.

Here we return a single score (R²), but multi-objective optimization is also possible; a short sketch is given at the end of this section, and the Optuna documentation has more details.

>>> import optuna
>>> import skrub
>>> def objective(trial):
...     # Build a learner whose hyperparameters are chosen by the Optuna trial.
...     learner = pred.skb.make_learner(choose=trial)
...     cv_results = skrub.cross_validate(learner, environment=pred.skb.get_data(), cv=4)
...     # Report the mean cross-validated test score to the study.
...     return cv_results["test_score"].mean()
>>> study = optuna.create_study(direction="maximize")
>>> study.optimize(objective, n_trials=16)
>>> best_params = study.best_params
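
The same search can also be written with the ask-and-tell interface mentioned above, which gives finer control over the loop. A minimal sketch, equivalent to the objective-based version:

>>> study = optuna.create_study(direction="maximize")
>>> for _ in range(16):
...     # Ask the study for a new trial; it suggests the hyperparameter values.
...     trial = study.ask()
...     learner = pred.skb.make_learner(choose=trial)
...     score = skrub.cross_validate(learner, environment=pred.skb.get_data(), cv=4)["test_score"].mean()
...     # Report the score back so the sampler can guide later trials.
...     study.tell(trial, score)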

Then, we can create the best learner using the best trial found by Optuna:

>>> best_learner = pred.skb.make_learner(choose=study.best_trial)

Alternatively, the learner can be built with its default choices and then updated with the best parameters:

>>> best_learner = pred.skb.make_learner()
>>> best_learner.set_params(**study.best_params)
SkrubLearner(data_op=<Apply HistGradientBoostingClassifier>)

Then, we can inspect the parameters as usual:

>>> best_learner.describe_params()
{'k': 12, 'learning_rate': 0.06401143720094754, 'classifier': 'hgb'}
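
As noted earlier, multi-objective optimization is also possible. A minimal sketch, assuming we additionally minimize the mean fit time (here taken from a fit_time entry in the cross-validation results, mirroring scikit-learn; the exact key is an assumption):

>>> def objective_multi(trial):
...     learner = pred.skb.make_learner(choose=trial)
...     cv_results = skrub.cross_validate(learner, environment=pred.skb.get_data(), cv=4)
...     # Two values: mean test score to maximize, mean fit time to minimize.
...     return cv_results["test_score"].mean(), cv_results["fit_time"].mean()
>>> study_multi = optuna.create_study(directions=["maximize", "minimize"])
>>> study_multi.optimize(objective_multi, n_trials=16)
>>> pareto_trials = study_multi.best_trials  # Pareto-optimal trials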

You can find a more complete example in Tuning DataOps with Optuna.

Parallelism#

Optuna’s optuna.study.Study.optimize() uses thread-based parallelism. When we use DataOp.skb.make_randomized_search() with the Optuna backend, both threading and multiprocessing can be used. Skrub chooses based on the joblib configuration: if joblib is configured to use processes (the default), parallelization is done with joblib; if joblib is configured to use the “threading” backend, Optuna’s built-in thread-based parallelism is used instead.
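
For example, to opt into Optuna’s thread-based parallelism we can select joblib’s threading backend around the search. A sketch using joblib.parallel_config(), available since joblib 1.3:

from joblib import parallel_config

# With the threading backend active, skrub defers to Optuna's built-in
# thread-based parallelism instead of joblib processes.
with parallel_config(backend="threading"):
    search = pred.skb.make_randomized_search(
        fitted=True,
        backend="optuna",
    )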

When the timeout parameter is used, Optuna’s built-in, thread-based parallelization is always used regardless of the joblib configuration.
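
For instance, assuming timeout is given in seconds as in Optuna’s Study.optimize():

# Stop the search after 10 minutes, regardless of the joblib configuration.
search = pred.skb.make_randomized_search(
    fitted=True,
    backend="optuna",
    timeout=600,
)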

Using the Optuna dashboard#

Optuna provides a dashboard that allows us to visualize and monitor the optimization process in real-time. This can be especially useful for long-running hyperparameter searches. To use the Optuna Dashboard, we need to install it first:

pip install optuna-dashboard

We can then start the dashboard by running the following command in the terminal:

optuna-dashboard STORAGE_URL

where STORAGE_URL is the same storage URL used in the Optuna study.
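
For example, with the SQLite storage used earlier:

optuna-dashboard sqlite:///optuna_study.db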

We can then access the dashboard in our web browser at http://localhost:8080 (by default). The dashboard provides various visualizations and tools to analyze the optimization process, such as parameter importance, optimization history, and parallel coordinate plots.