deduplicate
- skrub.deduplicate(X, *, n_clusters=None, ngram_range=(2, 4), analyzer='char_wb', linkage_method='average', n_jobs=None)
Deduplicate categorical data by hierarchically clustering similar strings.
This works best if there are a number of underlying categories that sometimes appear in the data with small variations and/or misspellings.
- Parameters:
- X : sequence of str
  The data to be deduplicated.
- n_clusters : int, default=None
  Number of clusters to use for hierarchical clustering. If None, use the number of clusters that leads to the highest silhouette score.
- ngram_range : 2-tuple of int, default=(2, 4)
  The lower and upper boundaries of the range of n-values for the n-grams used in the string similarity. All values of n such that min_n <= n <= max_n will be used.
- analyzer : {'word', 'char', 'char_wb'}, default='char_wb'
  Analyzer parameter for the CountVectorizer used for the string similarities. Determines whether the count matrix should be made of word counts or character n-gram counts. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
- linkage_method : {'single', 'complete', 'average', 'centroid', 'median', 'ward'}, default='average'
  Linkage method parameter to use for merging clusters via scipy.cluster.hierarchy.linkage(). Option 'average' calculates the distance between two clusters as the average distance between data points in the first and second cluster.
- n_jobs : int, default=None
  The number of jobs to run in parallel.
- Returns:
  - Series
    The deduplicated data, indexed by the original values.
See also
GapEncoder
Encodes dirty categories (strings) by constructing latent topics with continuous encoding.
MinHashEncoder
Encode string columns as a numeric array with the minhash method.
SimilarityEncoder
Encode string columns as a numeric array with n-gram string similarity.
Notes
Deduplication is done by first computing the n-gram distance between unique categories in data, then performing hierarchical clustering on this distance matrix, and choosing the most frequent element in each cluster as the ‘correct’ spelling. This method works best if the true number of categories is significantly smaller than the number of observed spellings.
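The three steps described above can be illustrated with a small standard-library sketch. It is not skrub's implementation: it uses plain character n-grams rather than 'char_wb', assumes a Euclidean distance between n-gram count vectors, requires n_clusters to be given explicitly (no silhouette search), and uses a naive average-linkage loop in place of scipy.cluster.hierarchy.linkage(). The names char_ngrams, ngram_distance and toy_deduplicate are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_ngrams(text, ngram_range=(2, 4)):
    """Count the character n-grams of `text` for every n in the inclusive range."""
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def ngram_distance(a, b):
    """Euclidean distance between two n-gram count vectors (assumed metric)."""
    keys = set(a) | set(b)
    return sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def toy_deduplicate(values, n_clusters):
    """Map each value to the most frequent spelling in its cluster."""
    frequency = Counter(values)
    unique = sorted(frequency)

    # Step 1: n-gram distance between every pair of unique spellings.
    grams = {u: char_ngrams(u) for u in unique}
    dist = {
        frozenset((u, v)): ngram_distance(grams[u], grams[v])
        for u, v in combinations(unique, 2)
    }

    def average_linkage(c1, c2):
        # Average distance between members of the two clusters.
        total = sum(dist[frozenset((u, v))] for u in c1 for v in c2)
        return total / (len(c1) * len(c2))

    # Step 2: agglomerative clustering, merging the two closest clusters
    # until only n_clusters remain.
    clusters = [[u] for u in unique]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected

    # Step 3: the most frequent member of each cluster is the 'correct' spelling.
    mapping = {}
    for cluster in clusters:
        representative = max(cluster, key=lambda u: frequency[u])
        for u in cluster:
            mapping[u] = representative
    return [mapping[v] for v in values]


data = ['blacs', 'black', 'black', 'black', 'black',
        'uhibe', 'white', 'white', 'white', 'white']
result = toy_deduplicate(data, n_clusters=2)
print(result)
# ['black', 'black', 'black', 'black', 'black', 'white', 'white', 'white', 'white', 'white']
```

In this toy data 'blacs' and 'black' share six of their nine n-grams, so they merge first; 'uhibe' and 'white' share only 'hi' but are still closer to each other than to either 'black' spelling.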
Examples
>>> from skrub.datasets import make_deduplication_data
>>> duplicated = make_deduplication_data(examples=['black', 'white'],
...                                      entries_per_example=[5, 5],
...                                      prob_mistake_per_letter=0.3,
...                                      random_state=42)
>>> duplicated
['blacs', 'black', 'black', 'black', 'black', 'uhibe', 'white', 'white', 'white', 'white']
To deduplicate the data, we can build a correspondence table:
>>> from skrub import deduplicate
>>> deduplicate_correspondence = deduplicate(duplicated)
>>> deduplicate_correspondence
blacs    black
black    black
black    black
black    black
black    black
uhibe    white
white    white
white    white
white    white
white    white
dtype: object
The translation table above is a series that gives the deduplicated values, indexed by the original values. A deduplicated version of the initial list can be created from it directly:
>>> deduplicated = list(deduplicate_correspondence)
>>> deduplicated
['black', 'black', 'black', 'black', 'black', 'white', 'white', 'white', 'white', 'white']
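Since each original spelling is paired with its deduplicated form, the same pairing can serve as a lookup table for other data that reuses those spellings. The sketch below is one possible way to do this with a plain dict built from the two lists shown in this example. The names lookup, new_values and cleaned are illustrative and not part of skrub, and leaving unseen spellings unchanged is an assumption of this sketch.

```python
# The original and deduplicated lists from the example, copied as literals
# so that this snippet runs without skrub installed.
duplicated = ['blacs', 'black', 'black', 'black', 'black',
              'uhibe', 'white', 'white', 'white', 'white']
deduplicated = ['black', 'black', 'black', 'black', 'black',
                'white', 'white', 'white', 'white', 'white']

# Build a lookup from each observed spelling to its deduplicated form.
lookup = dict(zip(duplicated, deduplicated))

# Hypothetical new observations; spellings not seen before are left as-is.
new_values = ['uhibe', 'black', 'blacs', 'green']
cleaned = [lookup.get(value, value) for value in new_values]
print(cleaned)  # ['white', 'black', 'black', 'green']
```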