Working with the example datasets provided by skrub

skrub includes a number of datasets used for running examples. Each dataset can be downloaded using its fetch_* function, provided in the skrub.datasets namespace:

from skrub.datasets import fetch_employee_salaries
data = fetch_employee_salaries()

Datasets are stored as Bunch objects, which include the full data, an X feature matrix, and a y target column with type pd.DataFrame. Some datasets may have a different format depending on the use case.

Modifying the download location of skrub datasets

By default, datasets are stored in ~/skrub_data, where ~ expands to the user's home directory (which depends on the operating system). The function get_data_dir returns the location that skrub uses to store data.
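As a small illustration of how ~ expands, the standard library can resolve it to the home directory. This sketch only mirrors the default path named above and does not call skrub; the variable name default_dir is made up for the example.

```python
from pathlib import Path

# "~" is replaced by the current user's home directory, which differs by OS.
default_dir = Path("~/skrub_data").expanduser()
print(default_dir)
```

The printed path depends on the machine and user running the code.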

To change this location, set the environment variable SKRUB_DATA_DIRECTORY to an absolute directory path.
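A minimal sketch of setting the variable from Python might look like the following. The target directory (a skrub_data_example folder under the system temporary directory) is a hypothetical choice for illustration. Setting the variable before skrub is imported is a cautious assumption, made in case the value is read at import time. The import is guarded so the sketch also runs where skrub is not installed.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical location for this example; any absolute directory path works.
custom_dir = Path(tempfile.gettempdir()).resolve() / "skrub_data_example"
custom_dir.mkdir(parents=True, exist_ok=True)

# Set the variable before importing skrub (assumption: safest ordering).
os.environ["SKRUB_DATA_DIRECTORY"] = str(custom_dir)

try:
    from skrub.datasets import get_data_dir
except ImportError:
    get_data_dir = None  # skrub is not installed in this environment

if get_data_dir is not None:
    # Should report the directory configured above.
    print(get_data_dir())
```

The same effect can be obtained by exporting the variable in the shell before starting Python, which avoids any dependence on import order.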

See Customizing the global configuration for more info on the global skrub configuration.