Cleaning

deduplicate(): merging variants of the same entry

deduplicate() merges multiple variants of the same entry into one, for instance variants introduced by typos. Such cleaning is needed before dataframe operations that require exact matches, such as counting elements. It is typically not needed when the entries are fed directly to a machine-learning model, as a dirty-category encoder can suffice.
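To illustrate why exact-match operations such as counting need this kind of cleaning, here is a minimal sketch that uses only the Python standard library. It is not how deduplicate() is implemented: the helper `merge_variants`, its `cutoff` parameter, and the sample data are hypothetical, and `difflib` similarity stands in for whatever string distance and grouping the real function uses.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical sample data: two cities, each with one misspelled variant.
entries = ["london", "london", "london", "lodnon", "paris", "paris", "pariss"]

# Exact-match counting treats every misspelling as its own category.
raw_counts = Counter(entries)


def merge_variants(values, cutoff=0.8):
    """Map each value to a more frequent, similar spelling when one exists.

    Illustrative stand-in only; not the algorithm behind deduplicate().
    """
    canonical = []  # spellings kept as-is, most frequent first
    mapping = {}
    for value, _ in Counter(values).most_common():
        match = get_close_matches(value, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[value] = match[0]  # fold the variant into its close match
        else:
            canonical.append(value)
            mapping[value] = value
    return [mapping[v] for v in values]


cleaned = merge_variants(entries)
clean_counts = Counter(cleaned)

print(raw_counts)    # "lodnon" and "pariss" are counted separately
print(clean_counts)  # variants are folded into "london" and "paris"
```

Before merging, the counts are split across four keys; after merging, only the two intended categories remain, which is what a subsequent group-by or value count would need.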