Example datasets, utilities, and customization

Customizing the default configuration

Skrub includes a configuration manager that allows setting various parameters (see the set_config() documentation for more detail).

It is possible to change configuration options using the set_config() function:

from skrub import set_config
set_config(use_table_report=True)

Each configuration parameter can also be set through its corresponding environment variable.

The config_context() context manager is also provided; it temporarily alters the configuration inside a with block:

import skrub
with skrub.config_context(max_plot_columns=1):
    ...
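
As a rough illustration of the temporary override, the sketch below reads max_plot_columns before, inside, and after a config_context() block. It assumes that skrub.get_config() returns the current settings as a dictionary keyed by parameter name, which is not stated above. The import guard is only there so the snippet still runs where skrub is not installed.

```python
try:
    import skrub
except ImportError:  # skrub is not available in this environment
    skrub = None

if skrub is not None:
    # Assumption: get_config() returns a dict keyed by parameter name.
    before = skrub.get_config()["max_plot_columns"]

    with skrub.config_context(max_plot_columns=1):
        # The override is only active inside the with block.
        inside = skrub.get_config()["max_plot_columns"]

    # On exit, the previous value should be back in place.
    after = skrub.get_config()["max_plot_columns"]

    print("before:", before, "inside:", inside, "after:", after)
```

Unlike set_config(), which changes the setting until it is changed again, the context manager limits the change to the enclosed block, which is convenient for a single report or plot.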

Fetching the example datasets used in skrub

Skrub includes several datasets used for running examples. Each dataset can be downloaded with its fetch_* function, available in the skrub.datasets namespace:

from skrub.datasets import fetch_employee_salaries
data = fetch_employee_salaries()

Each fetch_* function returns a Bunch object, which includes the full data, an X feature matrix, and a y target column with type pd.DataFrame. Some datasets may use a different format depending on the use case.

Modifying the download location of skrub datasets

By default, datasets are stored in ~/skrub_data, where ~ expands to the (OS-dependent) home directory of the user. The function get_data_dir() returns the location that skrub uses to store data.

If needed, this location can be changed by setting the environment variable SKRUB_DATA_DIRECTORY to an absolute directory path.
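
The variable can be set in the shell before starting Python, or from Python itself. The sketch below takes the second route. The directory name my_skrub_data is a hypothetical placeholder, and the sketch assumes that get_data_dir() can be called without arguments and that the variable is set before skrub looks up its data directory. The import guard only keeps the snippet runnable where skrub is not installed.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical target directory for this sketch; any absolute path works.
custom_dir = Path(tempfile.gettempdir()).resolve() / "my_skrub_data"
custom_dir.mkdir(parents=True, exist_ok=True)

# The value must be an absolute path, as noted above.
os.environ["SKRUB_DATA_DIRECTORY"] = str(custom_dir)

try:
    from skrub.datasets import get_data_dir

    # Assumption: called without arguments, this reports the active data directory.
    print(get_data_dir())
except ImportError:
    print("skrub is not installed; SKRUB_DATA_DIRECTORY =", os.environ["SKRUB_DATA_DIRECTORY"])
```

Setting the variable this way only affects the current Python process and its children. To make the change persistent, define it in the shell profile or in the environment of the job that runs the code.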