make_deduplication_data
- skrub.datasets.make_deduplication_data(examples, entries_per_example, prob_mistake_per_letter=0.2, random_state=None)
Duplicates examples with spelling mistakes.
Characters are misspelled with probability prob_mistake_per_letter.
- Parameters:
- examples : list of str
Examples to duplicate.
- entries_per_example : list of int
Number of duplications per example.
- prob_mistake_per_letter : float in [0, 1], default=0.2
Probability of misspelling a character in duplications. With the default, about 1/5 of the characters are misspelled on average.
- random_state : int, RandomState instance, optional
Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.
- Returns:
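To make the behaviour described above concrete, here is a minimal pure-Python sketch of the idea: each example is copied the requested number of times, and every character is replaced with probability prob_mistake_per_letter. The helper name `sketch_deduplication_data` is hypothetical and this is not skrub's implementation. The choice of lowercase ASCII replacement letters, the length check, and accepting only an int or None for `random_state` are all assumptions made for illustration.

```python
import random
import string


def sketch_deduplication_data(
    examples, entries_per_example, prob_mistake_per_letter=0.2, random_state=None
):
    """Hypothetical stand-in illustrating the documented behaviour.

    Assumption: replacement characters are drawn from lowercase ASCII
    letters, so a "misspelled" character can occasionally equal the original.
    Assumption: random_state is an int seed or None (no RandomState support).
    """
    if len(examples) != len(entries_per_example):
        raise ValueError("examples and entries_per_example must have equal length")
    rng = random.Random(random_state)
    data = []
    for example, n_entries in zip(examples, entries_per_example):
        for _ in range(n_entries):
            # Decide independently for each character whether to replace it.
            chars = [
                rng.choice(string.ascii_lowercase)
                if rng.random() < prob_mistake_per_letter
                else char
                for char in example
            ]
            data.append("".join(chars))
    return data


# Three noisy copies of "london" followed by two noisy copies of "paris".
duplicated = sketch_deduplication_data(
    ["london", "paris"], [3, 2], prob_mistake_per_letter=0.2, random_state=0
)

# With a zero mistake probability, the copies are exact duplicates.
exact = sketch_deduplication_data(
    ["london", "paris"], [3, 2], prob_mistake_per_letter=0.0, random_state=0
)
```

In this sketch the result always has sum(entries_per_example) strings, each the same length as the example it was copied from, and reusing the same integer seed reproduces the same output.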
Gallery examples
Deduplicating misspelled categories