Working with the example datasets provided by skrub

Skrub includes a number of datasets used for running examples. Each dataset can be downloaded using its fetch_* function, provided in the skrub.datasets namespace:

from skrub.datasets import fetch_employee_salaries
data = fetch_employee_salaries()

Datasets are returned as Bunch objects, which hold a path to each table in the dataset. For a dataset with a single table, load the table from the path attribute:

import pandas as pd
df = pd.read_csv(data.path)

Some datasets include multiple tables. In this case, path isn’t available; instead, each table has its own path attribute and is loaded separately:

from skrub.datasets import fetch_credit_fraud
data = fetch_credit_fraud()
baskets = pd.read_csv(data.baskets_path)
products = pd.read_csv(data.products_path)
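
When a dataset has several tables, it can be convenient to collect all of the table paths at once rather than naming each attribute by hand. The following is a minimal sketch, not part of skrub: table_paths is a hypothetical helper, and it assumes that the Bunch behaves like a dictionary whose table paths are stored under keys ending in _path, as baskets_path and products_path do above. A plain dict with made-up file names stands in for the real dataset so that the snippet runs without downloading anything.

```python
from collections.abc import Mapping


def table_paths(dataset: Mapping) -> dict:
    """Map each table name to its path, using the keys that end in '_path'.

    Hypothetical helper: assumes the dataset is dict-like and that every
    table path is stored under a key of the form '<table>_path'.
    """
    suffix = "_path"
    return {
        key[: -len(suffix)]: value
        for key, value in dataset.items()
        if key.endswith(suffix) and key != suffix
    }


# Stand-in for a multi-table dataset; the file names are made up.
example = {
    "baskets_path": "baskets.csv",
    "products_path": "products.csv",
    "name": "credit_fraud",
}

paths = table_paths(example)
print(paths)
```

With a real dataset, each returned path could then be passed to pd.read_csv, for example in a dictionary comprehension over table_paths(data).items().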

Modifying the download location of skrub datasets

By default, datasets are stored in ~/skrub_data, where ~ expands to the (OS-dependent) home directory of the user. The function get_data_dir() returns the location that skrub uses to store data.

If needed, this location can be changed by setting the environment variable SKB_DATA_DIRECTORY to an absolute directory path.
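
The variable can be set in the shell before starting Python, or from Python itself. The sketch below takes the second route using only the standard library. The directory name my_skrub_data is a hypothetical example, and setting the variable before importing skrub is an assumption about the safest ordering, not a documented requirement.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical target directory; any absolute path you can write to works.
custom_dir = Path(tempfile.gettempdir()).resolve() / "my_skrub_data"

# Set the variable before importing skrub, so that it is already in place
# whenever skrub reads it (assumed here to be the safe ordering).
os.environ["SKB_DATA_DIRECTORY"] = str(custom_dir)

print(os.environ["SKB_DATA_DIRECTORY"])
```

A value set this way only affects the current Python process and any processes it starts. To make the change persistent, set the variable in the shell profile or the system environment instead. Calling get_data_dir() afterwards is a way to check which location skrub actually uses.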

See How to configure and customize the default behavior of skrub for more info on the global skrub configuration.