.. |deduplicate| replace:: :func:`~skrub.deduplicate`

.. _user_guide_deduplicate:

Deduplicating categorical data with |deduplicate|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a series or list that contains strings with typos, the
|deduplicate| function can be used to correct them. It does so by building a
mapping from the misspelled strings to the correct ones.

Deduplication first computes the n-gram distance between the unique categories
in the data, then performs hierarchical clustering on this distance matrix, and
finally chooses the most frequent element in each cluster as the 'correct'
spelling. This method works best if the true number of categories is
significantly smaller than the number of observed spellings.

>>> from skrub.datasets import make_deduplication_data
>>> duplicated = make_deduplication_data(examples=['black', 'white'],
...                                      entries_per_example=[5, 5],
...                                      prob_mistake_per_letter=0.3,
...                                      random_state=42)
>>> duplicated # doctest: +SKIP
['blacs', 'black', 'black', 'black', 'black', \
'uhibe', 'white', 'white', 'white', 'white']

To deduplicate the data, we can build a correspondence between each observed
string and its corrected spelling:

>>> from skrub import deduplicate
>>> deduplicate_correspondence = deduplicate(duplicated)
>>> deduplicate_correspondence
blacs    black
black    black
black    black
black    black
black    black
uhibe    white
white    white
white    white
white    white
white    white
dtype: object
>>> deduplicated = list(deduplicate_correspondence)
>>> deduplicated # doctest: +SKIP
['black', 'black', 'black', 'black', 'black', \
'white', 'white', 'white', 'white', 'white']

See the |deduplicate| documentation for caveats and more detail.
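
The three steps described above (n-gram distance, hierarchical clustering, most
frequent spelling per cluster) can be illustrated with a minimal, standard-library
sketch. This is not the skrub implementation: the names ``ngram_counts``,
``ngram_distance`` and ``sketch_deduplicate`` are hypothetical, the distance is a
simple multiset overlap chosen for illustration, the clustering is a naive
average-linkage loop, and the number of clusters is assumed to be known in advance.

```python
from collections import Counter
from itertools import combinations


def ngram_counts(word, n_min=2, n_max=4):
    """Count the character n-grams of a word padded with one space on each side."""
    padded = f" {word} "
    return Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def ngram_distance(a, b):
    """Illustrative distance: 1 - (shared n-grams / union of n-grams)."""
    counts_a, counts_b = ngram_counts(a), ngram_counts(b)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    total = sum((counts_a | counts_b).values())   # multiset union
    return 1.0 - shared / total


def sketch_deduplicate(values, n_clusters):
    """Map each unique spelling to the most frequent spelling of its cluster."""
    counts = Counter(values)
    # Start with one cluster per unique spelling.
    clusters = [[word] for word in counts]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(ngram_distance(a, b) for a, b in pairs) / len(pairs)

    # Naive agglomerative clustering with average linkage:
    # repeatedly merge the two closest clusters.
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)

    # The most frequent spelling in each cluster is taken as the correct one.
    mapping = {}
    for cluster in clusters:
        canonical = max(cluster, key=lambda word: counts[word])
        for word in cluster:
            mapping[word] = canonical
    return mapping


duplicated = ['blacs', 'black', 'black', 'black', 'black',
              'uhibe', 'white', 'white', 'white', 'white']
mapping = sketch_deduplicate(duplicated, n_clusters=2)
deduplicated = [mapping[word] for word in duplicated]
```

On this small example the sketch groups ``'blacs'`` with ``'black'`` and
``'uhibe'`` with ``'white'``, because each typo shares more n-grams with its
source word than with any other spelling. For real data, prefer |deduplicate|
itself.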