deduplicate

skrub.deduplicate(X, *, n_clusters=None, ngram_range=(2, 4), analyzer='char_wb', linkage_method='average', n_jobs=None)

Deduplicate categorical data by hierarchically clustering similar strings.

This works best if there are a number of underlying categories that sometimes appear in the data with small variations and/or misspellings.

Parameters:
X : sequence of str

The data to be deduplicated.

n_clusters : int, default=None

Number of clusters to use for hierarchical clustering. If None, the number of clusters that leads to the highest silhouette score is used.

ngram_range : 2-tuple of int, default=(2, 4)

The lower and upper boundaries of the range of n-values for different n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.

analyzer : {'word', 'char', 'char_wb'}, default='char_wb'

Analyzer parameter for the CountVectorizer used to compute the string similarities. Determines whether the similarity is computed over word counts or character n-gram counts. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

linkage_method : {'single', 'complete', 'average', 'centroid', 'median', 'ward'}, default='average'

Linkage method to use for merging clusters via scipy.cluster.hierarchy.linkage(). Option 'average' calculates the distance between two clusters as the average distance between data points in the first and second cluster.

n_jobs : int, default=None

The number of jobs to run in parallel.

Returns:
pandas.Series of str

The deduplicated data: a Series of the same length as X, giving the deduplicated value for each entry and indexed by the original values.

See also

GapEncoder

Encodes dirty categories (strings) by constructing latent topics with continuous encoding.

MinHashEncoder

Encode string columns as a numeric array with the minhash method.

SimilarityEncoder

Encode string columns as a numeric array with n-gram string similarity.

Notes

Deduplication is done by first computing the n-gram distance between unique categories in data, then performing hierarchical clustering on this distance matrix, and choosing the most frequent element in each cluster as the ‘correct’ spelling. This method works best if the true number of categories is significantly smaller than the number of observed spellings.
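
As a rough illustration of these steps, a minimal sketch could look like the following (this is not the exact skrub implementation: the toy word list, the Euclidean n-gram distance and the fixed number of clusters are simplifying assumptions):

>>> import numpy as np
>>> from scipy.cluster.hierarchy import fcluster, linkage
>>> from scipy.spatial.distance import pdist
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> words = ['black', 'blacs', 'black', 'white', 'white', 'whitr']
>>> unique_words, counts = np.unique(words, return_counts=True)
>>> # character n-gram counts for each unique spelling
>>> vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))
>>> ngrams = vectorizer.fit_transform(unique_words).toarray()
>>> # hierarchical clustering on the pairwise n-gram distances
>>> Z = linkage(pdist(ngrams), method='average')
>>> clusters = fcluster(Z, t=2, criterion='maxclust')
>>> # keep the most frequent spelling in each cluster as the 'correct' one
>>> corrections = {c: unique_words[clusters == c][counts[clusters == c].argmax()]
...                for c in np.unique(clusters)}

Here corrections maps each cluster label to the spelling kept for that cluster; deduplicate additionally chooses the number of clusters automatically when n_clusters is None.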

Examples

>>> from skrub.datasets import make_deduplication_data
>>> duplicated = make_deduplication_data(examples=['black', 'white'],
...                                      entries_per_example=[5, 5],
...                                      prob_mistake_per_letter=0.3,
...                                      random_state=42)
>>> duplicated
['blacs', 'black', 'black', 'black', 'black', 'uhibe', 'white', 'white', 'white', 'white']

To deduplicate the data, we can build a correspondence table:

>>> from skrub import deduplicate
>>> deduplicate_correspondence = deduplicate(duplicated)
>>> deduplicate_correspondence
blacs    black
black    black
black    black
black    black
black    black
uhibe    white
white    white
white    white
white    white
white    white
dtype: object

The translation table above is actually a pandas Series, giving the deduplicated values and indexed by the original ones. A deduplicated version of the initial list can easily be created:

>>> deduplicated = list(deduplicate_correspondence)
>>> deduplicated
['black', 'black', 'black', 'black', 'black', 'white', 'white', 'white', 'white', 'white']
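
Since the correspondence Series is indexed by the original values (so the index contains repeated labels), it can also be used as a lookup table after dropping the duplicated index entries. A small sketch, keeping only the first occurrence of each label:

>>> import pandas as pd
>>> mapping = deduplicate_correspondence[
...     ~deduplicate_correspondence.index.duplicated()]
>>> pd.Series(['blacs', 'uhibe']).map(mapping)
0    black
1    white
dtype: object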