Cleaning
Cleaner: sanitizing a dataframe
Cleaner sanitizes a dataframe, transforming it into a more
consistent data representation that is easier to work with: it detects
null values represented as strings, parses dates, and removes
uninformative columns (see the Cleaner docstring).
To make the transformations reproducible, it is implemented as a scikit-learn transformer.
To sanitize a given dataframe df:
>>> from skrub import Cleaner
>>> cleaner = Cleaner(drop_if_constant=True)
>>> clean_df = cleaner.fit_transform(df)
To apply exactly the same operations to a new dataframe df_test (new rows with the same columns):
>>> clean_df_test = cleaner.transform(df_test)
Reusing the cleaner to transform new data ensures that if columns were dropped on the first dataframe, they are dropped on the second.
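The fit/transform behavior described above can be illustrated with a small standard-library sketch. `ToyConstantColumnDropper` is a hypothetical stand-in written for this illustration, not skrub's Cleaner, and the "dataframe" is simplified to a dict mapping column names to lists of values. It only shows why the decision about which columns to drop is made once, at fit time, and then replayed on new data.

```python
class ToyConstantColumnDropper:
    """Hypothetical toy transformer (NOT skrub's Cleaner).

    It decides which columns to drop when fitted, then applies that same
    decision to any table passed to transform().
    """

    def fit(self, table):
        # table: dict mapping column name -> list of values.
        # A column is considered constant if it holds at most one distinct value.
        self.columns_to_drop_ = [
            name for name, values in table.items() if len(set(values)) <= 1
        ]
        return self

    def transform(self, table):
        # Reuse the decision made at fit time; do not re-inspect the new data.
        return {
            name: values
            for name, values in table.items()
            if name not in self.columns_to_drop_
        }

    def fit_transform(self, table):
        return self.fit(table).transform(table)


# "region" is constant in the first table but not in the second one.
train = {"city": ["Paris", "Rome", "Oslo"], "region": ["EU", "EU", "EU"]}
test = {"city": ["Lima", "Cairo"], "region": ["SA", "AF"]}

dropper = ToyConstantColumnDropper()
clean_train = dropper.fit_transform(train)
clean_test = dropper.transform(test)

print(list(clean_train))  # columns kept in the first table
print(list(clean_test))   # same columns kept in the second table
```

Because `transform` replays the choice made during `fit`, both outputs have the same columns even though "region" varies in the second table. Fitting a fresh object on the second table would instead keep "region" and yield mismatched columns.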
deduplicate(): merging variants of the same entry
deduplicate() merges multiple variants of the same entry
into one, for instance variants caused by typos. Such cleaning is needed
before dataframe operations that require exact matches, such as counting
elements. It is typically not needed when the entries are fed directly to
a machine-learning model, as a dirty-category encoder
can suffice.
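To see why exact-match operations such as counting are sensitive to typos, here is a minimal standard-library sketch. It does not call deduplicate() and does not reproduce its method: the `merge_variant` helper is hypothetical and uses `difflib` string similarity with an arbitrary cutoff of 0.8, purely as a stand-in for the merging step.

```python
from collections import Counter
import difflib

# Entries containing two misspelled variants ("londn", "pariss").
entries = ["london", "londn", "london", "paris", "pariss", "paris", "london"]

# Counting the raw entries treats every typo as a separate value.
raw_counts = Counter(entries)

# Distinct spellings observed in the data.
known_spellings = list(raw_counts)


def merge_variant(entry):
    """Hypothetical helper: map an entry to its most frequent close spelling."""
    close = difflib.get_close_matches(
        entry, known_spellings, n=len(known_spellings), cutoff=0.8
    )
    # `close` always contains the entry itself (similarity 1.0).
    return max(close, key=lambda spelling: raw_counts[spelling])


merged_entries = [merge_variant(entry) for entry in entries]
merged_counts = Counter(merged_entries)

print(raw_counts)
print(merged_counts)
```

In the raw counts, "london" and "londn" are tallied separately, so the total for London is underestimated. After merging, each variant is counted under its most frequent spelling.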