.. |DropUninformative| replace:: :class:`~skrub.DropUninformative` .. |Cleaner| replace:: :class:`~skrub.Cleaner` .. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer` .. |ToDatetime| replace:: :class:`~skrub.ToDatetime` .. _user_guide_cleaning_dataframes: |Cleaner|: sanitizing a dataframe --------------------------------- Very often, the first steps in preparing a dataframe for further use involve understanding the datatypes in the data and changing them into a more suitable format (e.g., from string to number or datetime). The |Cleaner| aids with this by running the following set of transformations on each column: - Clean null strings: Replace strings typically used to represent missing values with a null value suitable for the column under consideration. - |DropUninformative|: Drop the column if it is considered "uninformative." A column is considered "uninformative" if it contains only missing values (``drop_null_fraction``), only a constant value (``drop_if_constant``), or if all values are distinct (``drop_if_unique``). By default, the |Cleaner| keeps all columns unless they contain only missing values. Refer to :ref:`user_guide_drop_uninformative` for more detail on this operation. .. note:: Setting ``drop_if_unique`` to ``True`` may lead to dropping columns that contain text or IDs. Numeric columns are never dropped by ``drop_if_unique``. - |ToDatetime|: Parse datetimes represented as strings and return them as actual datetimes with the correct dtype. If ``datetime_format`` is provided, it is forwarded to |ToDatetime|. Otherwise, the format is guessed according to common datetime formats. - Clean categories: If the dtype of the column is detected as "Categorical", process it based on the dataframe library (Pandas or Polars) to ensure consistent typing and avoid downstream issues. - Convert to strings: Convert columns to strings unless they have a more informative dtype, such as numeric, categorical, or datetime. If ``numeric_dtype`` is set to ``float32``, the ``Cleaner`` will also convert numeric columns to ``np.float32`` dtype, ensuring a consistent representation of numbers and missing values. This can be useful if the ``Cleaner`` is used as a preprocessing step at the beginning of an ML pipeline. The |Cleaner| is a scikit-learn compatible transformer: >>> from skrub import Cleaner >>> import pandas as pd >>> df = pd.DataFrame({ ... "id": [1, 2, 3], ... "all_missing": ["", "", ""], ... "date": ["2024-05-05", "2024-05-06", "2024-05-07"], ... }) >>> df_clean = Cleaner().fit_transform(df) >>> df_clean id date 0 1 2024-05-05 1 2 2024-05-06 2 3 2024-05-07 >>> df_clean.dtypes id int64 date datetime64[ns] dtype: object Note that the ``"all_missing"`` column has been dropped, and that the ``"date"`` column has been correctly parsed as a datetime column. Converting numeric dtypes to ``float32`` with the |Cleaner| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, when the |Cleaner| encounters numeric dtypes (e.g., ``int8``, ``float64``), it leaves them as-is. In some cases, it may be beneficial to have the same numeric dtype for all numeric columns to guarantee compatibility between values. The |Cleaner| allows conversion of numeric features to ``float32`` by setting the ``numeric_dtype`` parameter: >>> from skrub import Cleaner >>> cleaner = Cleaner(numeric_dtype="float32") >>> import pandas as pd >>> df = pd.DataFrame({ ... "id": [1, 2, 3], ... }) >>> df.dtypes id int64 dtype: object >>> df_cleaned = cleaner.fit_transform(df) >>> df_cleaned.dtypes id float32 dtype: object Setting the dtype to ``float32`` reduces RAM footprint for most use cases and ensures that all missing values have the same representation. This also ensures compatibility with scikit-learn transformers.