
skrub.datasets.make_deduplication_data(examples, entries_per_example, prob_mistake_per_letter=0.2, random_state=None)

Duplicates examples with spelling mistakes.

Characters are misspelled with probability prob_mistake_per_letter.

examples : list of str

Examples to duplicate.

entries_per_example : list of int

Number of duplications per example.

prob_mistake_per_letter : float in [0, 1], default=0.2

Probability of misspelling a character in duplications. By default, about 1/5 of the characters are misspelled on average.

random_state : int, RandomState instance, optional

Determines random number generation for dataset noise. Pass an int for reproducible output across multiple function calls.

list of str

List of duplicated examples with spelling mistakes.
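The behaviour described above can be illustrated with a minimal standard-library sketch. It is not skrub's implementation: the function name `sketch_deduplication_data` is hypothetical, and replacing a misspelled character with a random lowercase ASCII letter is an assumption made only for illustration.

```python
import random
import string


def sketch_deduplication_data(examples, entries_per_example,
                              prob_mistake_per_letter=0.2, random_state=None):
    """Hypothetical stand-in for illustration; not skrub's implementation."""
    rng = random.Random(random_state)
    duplicated = []
    for example, n_entries in zip(examples, entries_per_example):
        for _ in range(n_entries):
            # Assumption: a "mistake" swaps the character for a random
            # lowercase letter (which may happen to equal the original).
            noisy = "".join(
                rng.choice(string.ascii_lowercase)
                if rng.random() < prob_mistake_per_letter
                else char
                for char in example
            )
            duplicated.append(noisy)
    return duplicated


# Three noisy copies of "london" followed by two noisy copies of "paris".
noisy_names = sketch_deduplication_data(
    ["london", "paris"],
    entries_per_example=[3, 2],
    prob_mistake_per_letter=0.2,
    random_state=0,
)
print(noisy_names)
```

In this sketch, the returned list has one entry per requested duplication, so its length equals the sum of `entries_per_example`, and passing the same integer `random_state` yields the same output on repeated calls.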

Examples using skrub.datasets.make_deduplication_data

Deduplicating misspelled categories
