Working with the example datasets provided by skrub#
Skrub includes a number of datasets used for running examples. Each dataset
can be downloaded using its fetch_* function, provided in the skrub.datasets
namespace:
from skrub.datasets import fetch_employee_salaries
data = fetch_employee_salaries()
Datasets are stored as Bunch objects, which include a path
to each table in the dataset. Datasets should be loaded using the path:
import pandas as pd
df = pd.read_csv(data.path)
Some datasets include multiple tables: in this case, path isn’t available and
instead each table should be loaded with its own path:
from skrub.datasets import fetch_credit_fraud
data = fetch_employee_salaries()
baskets = pd.read_csv(data.baskets_path)
products = pd.read_csv(data.products_path)
Modifying the download location of skrub datasets#
By default, datasets are stored in ~/skrub_data, where ~ is expanded as
the (OS dependent) home directory of the user. The function
get_data_dir() shows
the location that skrub uses to store data.
If needed, it is possible to change this location by modifying the environment
variable SKB_DATA_DIRECTORY to an absolute directory path.
See How to configure and customize the default behavior of skrub for more info on the global skrub configuration.