.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/05_deduplication.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_05_deduplication.py>`
        to download the full example code or to run this example in your browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_05_deduplication.py:


.. _examples_deduplication:

===================================
Deduplicating misspelled categories
===================================

Real-world datasets often contain misspellings, for instance
in manually entered categorical variables.
Such misspellings break data analysis steps that require
exact matching, such as a ``GROUP BY`` operation.
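
As a small illustration of how a single typo fragments an exact-match
aggregation, the sketch below counts a few made-up prescriptions with the
standard library. The data is hypothetical and only stands in for a real
``GROUP BY``.

.. code-block:: Python

    from collections import Counter

    # Hypothetical entries: "Cvntrivan" is a typo of "Contrivan".
    prescriptions = ["Contrivan", "Contrivan", "Cvntrivan", "Zipholan"]

    # Exact matching treats the typo as a separate category.
    counts = Counter(prescriptions)
    print(counts)

The typo ends up in its own group, so the count for "Contrivan" is lower than
the number of times the medication was actually prescribed.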

Merging multiple variants of the same category is known as
*deduplication*. It is implemented in skrub with the |deduplicate| function.

Deduplication relies on *unsupervised learning*: it finds structure in
the data without requiring labels or categories that are known a priori.
Specifically, measuring the distance between strings lets us
find clusters of strings that are similar to each other (e.g. that differ only
by a misspelling), and hence flag and regroup potentially
misspelled category names in an unsupervised manner.
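
To make the notion of string distance concrete, here is a minimal sketch of
one possible distance: the Jaccard distance between sets of character
trigrams. The helper names ``char_ngrams`` and ``ngram_distance`` are
hypothetical, and the exact metric used inside skrub may differ.

.. code-block:: Python

    def char_ngrams(text, n=3):
        """Return the set of character n-grams of ``text``, padded with spaces."""
        padded = f" {text.lower()} "
        return {padded[i : i + n] for i in range(len(padded) - n + 1)}


    def ngram_distance(a, b, n=3):
        """Jaccard distance between the n-gram sets of two strings (0 = identical)."""
        grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
        union = grams_a | grams_b
        if not union:
            return 0.0
        return 1.0 - len(grams_a & grams_b) / len(union)


    # A one-letter typo versus a genuinely different name.
    typo = ngram_distance("Contrivan", "Cvntrivan")
    other = ngram_distance("Contrivan", "Zipholan")
    print(f"typo: {typo:.2f}, different name: {other:.2f}")

With such a distance, a misspelling stays much closer to its original name than
to an unrelated name, which is what makes clustering possible.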


.. |deduplicate| replace::
    :func:`~skrub.deduplicate`

.. |Gap| replace::
     :class:`~skrub.GapEncoder`

.. |MinHash| replace::
     :class:`~skrub.MinHashEncoder`

.. GENERATED FROM PYTHON SOURCE LINES 35-49

A typical use case
------------------

Let's take an example:
as a data scientist, our job is to analyze the data from a hospital ward.
In the data, we notice that in most cases, the doctor prescribes
one of the following three medications:
"Contrivan", "Genericon" or "Zipholan".

However, data entry is manual and, either because the doctor's
handwriting was hard to decipher or because of mistakes during input,
there are multiple spelling mistakes in the dataset.

Let's generate this example dataset:

.. GENERATED FROM PYTHON SOURCE LINES 49-64

.. code-block:: Python


    import numpy as np
    import pandas as pd

    from skrub.datasets import make_deduplication_data

    duplicated_names = make_deduplication_data(
        examples=["Contrivan", "Genericon", "Zipholan"],  # our three medication names
        entries_per_example=[500, 100, 1500],  # their respective number of occurrences
        prob_mistake_per_letter=0.05,  # 5% probability of typo per letter
        random_state=42,  # set seed for reproducibility
    )

    duplicated_names[:5]


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    ['Contrivan', 'Cvntrivan', 'Contrivan', 'Coqtrivan', 'Contriian']


.. GENERATED FROM PYTHON SOURCE LINES 65-67

We then extract the unique medication names in the data and
visualize how often they appear:

.. GENERATED FROM PYTHON SOURCE LINES 67-78

.. code-block:: Python


    import matplotlib.pyplot as plt

    unique_examples, counts = np.unique(duplicated_names, return_counts=True)

    plt.figure(figsize=(10, 15))
    plt.barh(unique_examples, counts)
    plt.ylabel("Medication name")
    plt.xlabel("Count")
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_05_deduplication_001.png
   :alt: 05 deduplication
   :srcset: /auto_examples/images/sphx_glr_05_deduplication_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 79-88

We clearly see the structure of the data:
the three original medications ("Contrivan", "Genericon" and "Zipholan")
are the most common ones, but there are many spelling mistakes or
slight variations of the original names.

The idea behind |deduplicate| is that, in terms of string distance,
each misspelled name is closest to its original (most frequent)
medication name, so that the variants of a given name
form a cluster around it.
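
The sketch below illustrates this idea on hypothetical toy data: each name is
mapped to the closest of the most frequent spellings. It is a deliberate
simplification that uses the standard library's ``difflib`` similarity rather
than the clustering performed by |deduplicate|, and it assumes we already know
how many correct names there are.

.. code-block:: Python

    from collections import Counter
    from difflib import SequenceMatcher

    # Hypothetical toy data: two correct names and a few typos.
    names = ["Contrivan"] * 5 + ["Zipholan"] * 4 + ["Cvntrivan", "Zipholen", "Contriian"]

    counts = Counter(names)
    # Assumption for this sketch: the 2 most frequent spellings are the correct ones.
    references = [name for name, _ in counts.most_common(2)]


    def closest_reference(name):
        """Return the reference spelling most similar to ``name``."""
        return max(references, key=lambda ref: SequenceMatcher(None, name, ref).ratio())


    # Map every distinct spelling to its suggested correction.
    corrections = {name: closest_reference(name) for name in counts}
    print(corrections)

Frequent names map to themselves, while each rare variant is attached to the
frequent name it resembles most.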

.. GENERATED FROM PYTHON SOURCE LINES 90-97

Deduplication: suggest corrections of misspelled names
------------------------------------------------------

The |deduplicate| function uses clustering based on
string similarities to group duplicated names.

Let's deduplicate our data:

.. GENERATED FROM PYTHON SOURCE LINES 97-104

.. code-block:: Python


    from skrub import deduplicate

    deduplicated_data = deduplicate(duplicated_names)

    deduplicated_data[:5]


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Contrivan    Contrivan
    Cvntrivan    Contrivan
    Contrivan    Contrivan
    Coqtrivan    Contrivan
    Contriian    Contrivan
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 105-113

And that's it! We now have the deduplicated data.

.. topic:: Note:

   The number of clusters will need some adjustment depending on the data.
   If no fixed number of clusters is given, |deduplicate| tries to set it
   automatically via the
   `silhouette score <https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient>`_.

.. GENERATED FROM PYTHON SOURCE LINES 115-116

We can visualize the distribution of categories in the deduplicated data:

.. GENERATED FROM PYTHON SOURCE LINES 116-128

.. code-block:: Python


    deduplicated_unique_examples, deduplicated_counts = np.unique(
        deduplicated_data, return_counts=True
    )
    deduplicated_series = pd.Series(deduplicated_counts, index=deduplicated_unique_examples)

    plt.figure(figsize=(10, 5))
    plt.barh(deduplicated_unique_examples, deduplicated_counts)
    plt.xlabel("Count")
    plt.ylabel("Medication name")
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_05_deduplication_002.png
   :alt: 05 deduplication
   :srcset: /auto_examples/images/sphx_glr_05_deduplication_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 129-136

Here, the silhouette score finds the ideal number of
clusters (3) and groups the spelling mistakes.
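
To give an intuition for what the silhouette score measures, here is a minimal
pure-Python sketch computed from a square distance matrix. The function
``silhouette_score`` below is a hypothetical helper written for this
illustration, not the implementation used by skrub or scikit-learn, and it
assumes at least two clusters.

.. code-block:: Python

    def silhouette_score(distances, labels):
        """Mean silhouette coefficient, given a square distance matrix and labels."""
        clusters = {}
        for index, label in enumerate(labels):
            clusters.setdefault(label, []).append(index)

        scores = []
        for i, label in enumerate(labels):
            own = clusters[label]
            if len(own) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(distances[i][j] for j in own if j != i) / (len(own) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(distances[i][j] for j in members) / len(members)
                for other, members in clusters.items()
                if other != label
            )
            denominator = max(a, b)
            scores.append((b - a) / denominator if denominator > 0 else 0.0)
        return sum(scores) / len(scores)


    # Toy data: two well-separated groups of points on a line.
    points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    distances = [[abs(p - q) for q in points] for p in points]

    two_clusters = silhouette_score(distances, [0, 0, 0, 1, 1, 1])
    three_clusters = silhouette_score(distances, [0, 0, 1, 1, 2, 2])
    print(two_clusters > three_clusters)

A labeling that matches the natural grouping gets a higher score, so comparing
scores across candidate numbers of clusters gives a way to pick one.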

In practice, the deduplication will often be imperfect
and require some tweaks.
In such cases, we can construct and update a translation table based on the
data returned by |deduplicate|.

.. GENERATED FROM PYTHON SOURCE LINES 136-145

.. code-block:: Python


    # create a table that maps original to corrected categories
    translation_table = pd.Series(deduplicated_data, index=duplicated_names)

    # remove duplicates in the original data
    translation_table = translation_table[~translation_table.index.duplicated(keep="first")]

    translation_table.head()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Contrivan    Contrivan
    Cvntrivan    Contrivan
    Coqtrivan    Contrivan
    Contriian    Contrivan
    Contaivan    Contrivan
    dtype: object


.. GENERATED FROM PYTHON SOURCE LINES 146-150

In this table, we have the category name on the left,
and the cluster it was translated to on the right.
If we want to adapt the translation table, we can
modify it manually.
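
As a sketch of such a manual adjustment, the example below overrides one
suggestion and then applies the table to some raw entries. The small
``translation_table`` defined here is a hypothetical stand-in for the one
built above, and the wrong suggestion it contains is made up for the
illustration.

.. code-block:: Python

    import pandas as pd

    # Hypothetical stand-in for the table built above:
    # index = original spelling, value = suggested correction.
    translation_table = pd.Series(
        {"Contrivan": "Contrivan", "Cvntrivan": "Contrivan", "Zipholen": "Contrivan"}
    )

    # Manually override a suggestion we disagree with.
    translation_table["Zipholen"] = "Zipholan"

    # Apply the table to raw entries; names missing from the table are kept as-is.
    raw = pd.Series(["Cvntrivan", "Zipholen", "Genericon"])
    cleaned = raw.map(translation_table).fillna(raw)
    print(cleaned.tolist())

Keeping the corrections in an explicit table makes them easy to review and to
reuse on new data.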

.. GENERATED FROM PYTHON SOURCE LINES 152-159

Visualizing pairwise string distances between names
---------------------------------------------------

Below, we use a heatmap to visualize the pairwise distances between medication
names. A darker color means that two medication names are closer together
(i.e. more similar), while a lighter color means a larger distance.


.. GENERATED FROM PYTHON SOURCE LINES 159-175

.. code-block:: Python


    import seaborn as sns
    from scipy.spatial.distance import squareform

    from skrub import compute_ngram_distance

    # pairwise n-gram distances between the unique names, in condensed form
    ngram_distances = compute_ngram_distance(unique_examples)
    # convert the condensed distances to a square, symmetric matrix
    square_distances = squareform(ngram_distances)

    fig, ax = plt.subplots(figsize=(14, 12))
    sns.heatmap(
        square_distances, yticklabels=unique_examples, xticklabels=unique_examples, ax=ax
    )
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_05_deduplication_003.png
   :alt: 05 deduplication
   :srcset: /auto_examples/images/sphx_glr_05_deduplication_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 176-178

Three clusters appear: each one consists of an original medication
name surrounded by its misspellings.

.. GENERATED FROM PYTHON SOURCE LINES 180-192

Conclusion
----------

In this example, we have seen how to use the |deduplicate| function to
automatically detect and correct misspelled category names.

Note that deduplication is especially useful when we either
know our ground truth (e.g. the original medication names),
or when the similarity across strings does not
carry useful information for our machine learning task.
Otherwise, we prefer using encoding methods such as |Gap|
or |MinHash|.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.198 seconds)


.. _sphx_glr_download_auto_examples_05_deduplication.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/0.5.4?urlpath=lab/tree/notebooks/auto_examples/05_deduplication.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/05_deduplication.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 05_deduplication.ipynb <05_deduplication.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 05_deduplication.py <05_deduplication.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 05_deduplication.zip <05_deduplication.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_