make_deduplication_data

skrub.datasets.make_deduplication_data(examples, entries_per_example, prob_mistake_per_letter=0.2, random_state=None)

Duplicates examples with spelling mistakes.

Characters are misspelled with probability prob_mistake_per_letter.
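
To make the per-character behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not skrub's implementation: the helper name misspell, the use of the standard-library random module, and the choice of lowercase ASCII letters as replacements are all assumptions made for illustration.

```python
import random
import string


def misspell(text, prob_mistake_per_letter=0.2, rng=None):
    """Hypothetical helper, not part of skrub.

    Each character is replaced, independently of the others, by a random
    lowercase letter with probability ``prob_mistake_per_letter``.
    """
    if rng is None:
        rng = random.Random()
    return "".join(
        rng.choice(string.ascii_lowercase)
        if rng.random() < prob_mistake_per_letter
        else char
        for char in text
    )


# Three noisy copies of one example, using a seeded generator.
rng = random.Random(0)
noisy_copies = [misspell("chocolate", 0.2, rng) for _ in range(3)]
print(noisy_copies)
```

In this sketch a replacement letter can happen to equal the original character, so the share of visibly changed characters can be slightly below the stated probability.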

Parameters:

examples : list of str
    Examples to duplicate.
entries_per_example : list of int
    Number of duplicates to generate for each example, with one count per entry in examples.
prob_mistake_per_letter : float in [0, 1], default=0.2
    Probability of misspelling each character in the duplicates. With the default, about 1/5 of the characters are misspelled on average.
random_state : int, RandomState instance, optional
    Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.

Returns:

list of str
    The duplicated examples, with spelling mistakes.
