Various string encoders: a sentiment analysis example#

In this example, we explore the performance of string and categorical encoders available in skrub.

The Toxicity dataset#

We focus on the toxicity dataset, a corpus of 1,000 tweets evenly balanced between the binary labels “Toxic” and “Not Toxic”. Our goal is to classify each tweet as one of these two labels, using only its text as a feature.

from skrub.datasets import fetch_toxicity

# Fetch the corpus of 1,000 tweets and their "Toxic" / "Not Toxic" labels
dataset = fetch_toxicity()
X, y = dataset.X, dataset.y
# Append the target to X so it appears alongside the text in the report below
X["is_toxic"] = y

When it comes to displaying large chunks of text, the TableReport is especially useful! Click on any cell below to expand and read the tweet in full.

from skrub import TableReport

TableReport(X)


GapEncoder#

First, let’s vectorize our text column using the GapEncoder, one of the high-cardinality categorical encoders provided by skrub. As introduced in the previous example, the GapEncoder performs matrix factorization for topic modeling: it builds latent topics by capturing combinations of substrings that frequently co-occur, and the encoded vectors correspond to the activations of these topics.

To interpret these latent topics, we pick, for each topic, a few entries from the input data with the highest activations to serve as labels. In the example below, we select 3 labels to summarize each topic.

from skrub import GapEncoder

# Fit a GapEncoder with 30 latent topics on the raw tweet text
gap = GapEncoder(n_components=30)
X_trans = gap.fit_transform(X["text"])
# Add the original text as a first column
X_trans.insert(0, "text", X["text"])
TableReport(X_trans)
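
The topic labels are visible directly in the column names of X_trans: each encoded column is named after the 3 labels summarizing its topic. A minimal way to list them, assuming the pandas DataFrame output shown above:

# Print each topic summary, skipping the raw "text" column we inserted first
for topic_name in X_trans.columns[1:]:
    print(topic_name)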