fetch_toxicity#

skrub.datasets.fetch_toxicity(data_home=None)[source]#

Fetch the toxicity dataset (classification) available at https://github.com/skrub-data/skrub-data-files

This is a balanced binary classification use-case, where the single table consists in only two columns:

  • text: the text of the comment

  • is_toxic: whether or not the comment is toxic

Size on disk: 220KB.

Parameters:
data_homestr or path-like, default=None

The directory where to download and unzip the files.

Returns:
bunchBunch

A dictionary-like object with the following keys:

toxicityDataFrame of shape (1000, 2)

The dataframe.

XDataFrame of shape (1000, 1)

Features, i.e. the dataframe without the target.

yDataFrame of shape (1000, 1)

Target labels.

metadatadict

A dictionary containing the name, description, source and target.

pathstr

The path to the toxicity CSV file.