make_deduplication_data
- skrub.datasets.make_deduplication_data(examples, entries_per_example, prob_mistake_per_letter=0.2, random_state=None)
Duplicates examples with spelling mistakes.
Characters are misspelled with probability prob_mistake_per_letter.
- Parameters:
- examples
list of str Examples to duplicate.
- entries_per_example
list of int Number of duplications per example.
- prob_mistake_per_letter
float in [0, 1], default=0.2 Probability of misspelling a character in duplications. By default, 1/5 of the characters will be misspelled.
- random_state
int, RandomState instance, optional Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.
- Returns:
list of str The duplicated examples, some of them with spelling mistakes.
Examples
>>> from skrub.datasets import make_deduplication_data
>>> make_deduplication_data(["string1", "string2"],
...                         entries_per_example=[4, 5],
...                         random_state=9)
['btrwng1', 'string1', 'string1', 'string1', 'saoing2', 'string2', 'string2', 'string2', 'string2']
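To make the roles of entries_per_example, prob_mistake_per_letter and a fixed seed concrete, here is a minimal standard-library sketch of the general idea. It is not skrub's implementation: the helper name noisy_duplicates and its seed argument are hypothetical, and its output will not match the strings shown above.

```python
import random
import string


def noisy_duplicates(examples, entries_per_example,
                     prob_mistake_per_letter=0.2, seed=None):
    """Hypothetical stand-in: copy each example, randomly replacing characters."""
    if len(examples) != len(entries_per_example):
        raise ValueError("examples and entries_per_example must have the same length")
    rng = random.Random(seed)  # a fixed seed gives reproducible output
    out = []
    for example, n_copies in zip(examples, entries_per_example):
        for _ in range(n_copies):
            # Each character is swapped for a random lowercase letter with
            # probability prob_mistake_per_letter (the draw may, by chance,
            # pick the same letter again).
            chars = [
                rng.choice(string.ascii_lowercase)
                if rng.random() < prob_mistake_per_letter
                else ch
                for ch in example
            ]
            out.append("".join(chars))
    return out


duplicates = noisy_duplicates(["string1", "string2"], [4, 5], seed=0)
print(len(duplicates))  # 9: one entry per requested duplication (4 + 5)
```

In this sketch, the result has one entry per requested duplication, and calling the helper twice with the same seed yields the same list, which mirrors the purpose of passing an int as random_state.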