Various string encoders: a sentiment analysis example#

In this example, we explore the performance of string and categorical encoders available in skrub.

The Toxicity dataset#

We focus on the toxicity dataset, a corpus of 1,000 tweets, evenly balanced between the binary labels “Toxic” and “Not Toxic”. Our goal is to classify each entry as one of these two labels, using only the text of the tweets as features.

import pandas as pd

from skrub.datasets import fetch_toxicity

We load the dataset from the path using pandas.

file_path = fetch_toxicity().path

X = pd.read_csv(file_path)

Downloading 'toxicity_v1' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/toxicity_v1.zip (attempt 1/3)

When it comes to displaying large chunks of text, the TableReport is especially useful! Click on any cell below to expand and read the tweet in full.

from skrub import TableReport

TableReport(X)

	text	is_toxic
0	Two-Minute Mysteries by Donald J. Sobol? I read those in middle school. Might be a different think though.	Not Toxic
1	Are you feeling it now, mr mark?	Not Toxic
2	Can we get some pro environment conspiracy theories going? What about the gay frogs?! Surely this substance will threaten male virility!	Not Toxic
3	so many haters on the internet . dident get what so bad about this video.... i think u just jealous logan paul.... fuck humens	Toxic
4	I do this with store-bought breaded shrimp all the time. The way you are better than be is that the seasoning is IN the batter, not dusted on top. Play! Experiment! You'll do something wonderful if you are already putting in that effort.	Not Toxic

995	ILLEGITIMATE INCOMPETENT TYRANT TALIBIDEN	Toxic
996	From the theories I’ve read I’m pretty sure you’re right and he’s singed	Not Toxic
997	I can't bear how nice this is. I guess its bearnessities. I'll see my self out	Not Toxic
998	I’m 70 and I agree.	Not Toxic
999	Especially Today's Politicians #traderjoe #delouse #fuckliberals #wethepewple #iwillnotcomply #defiant #fafo #hardinasoftworld #hellandback #irrepressible #shepherdoffire ⌖ #america #freedom #2A #shallnotbeinfringed #noquarter #nosurrender #molonlabe #letfreedomring #keepthepowderdry #patriot #veteran #warrior	Toxic

Column	Column name	dtype	Is sorted	Null values	Unique values	Mean	Std	Min	Median	Max
0	text	StringDtype	False	0 (0.0%)	999 (99.9%)
1	is_toxic	StringDtype	False	0 (0.0%)	2 (0.2%)

Column 1	Column 2	Cramér's V	Pearson's Correlation
text	is_toxic	0.100

We prepare the target variable by mapping the binary labels “Toxic” and “Not Toxic” to 1 and 0, respectively. The target is reused throughout the example.

y = X.pop("is_toxic").map({"Toxic": 1, "Not Toxic": 0})

GapEncoder#

First, let’s vectorize our text column using the GapEncoder, one of the high cardinality categorical encoders provided by skrub. As introduced in the previous example, the GapEncoder performs matrix factorization for topic modeling. It builds latent topics by capturing combinations of substrings that frequently co-occur, and encoded vectors correspond to topic activations.
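As a rough illustration of the substrings such an encoder works from, the sketch below splits two strings into character 3-grams and lists the ones they share. The helper name `char_ngrams` and the n-gram size of 3 are assumptions made for this illustration; the sketch shows only the kind of input representation involved, not the matrix factorization that the GapEncoder performs.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i : i + n] for i in range(len(text) - n + 1)]


# Two entries built on the stem "respect" share most of their 3-grams,
# which is the kind of co-occurrence a topic model can pick up on.
a = Counter(char_ngrams("respectful"))
b = Counter(char_ngrams("disrespects"))
shared = sorted(set(a) & set(b))
print(shared)  # ['ect', 'esp', 'pec', 'res', 'spe']
```

Strings that share many n-grams in this way tend to activate the same latent topic, which is why related word forms can end up grouped under one topic label.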

To interpret these latent topics, we select for each of them a few labels from the input data with the highest activations. In the example below we select 3 labels to summarize each topic.

from skrub import GapEncoder

gap = GapEncoder(n_components=30)
X_trans = gap.fit_transform(X["text"])
# Add the original text as a first column
X_trans.insert(0, "text", X["text"])
TableReport(X_trans)

Column	Column name	dtype	Is sorted	Unique values	Mean	Std	Min	Median	Max
0	text	StringDtype	False	999 (99.9%)
1	text: beautifuls, beautiful, painfully	Float64DType	False	999 (99.9%)	0.0146	0.0584	0.000261	0.000682	0.819
2	text: disrespectful, disrespects, respectful	Float64DType	False	999 (99.9%)	0.0291	0.115	0.000261	0.00202	2.86
3	text: suicide, joking, laughing	Float64DType	False	999 (99.9%)	0.0495	0.209	0.000254	0.00461	6.13
4	text: nvrseqrvrqrqrr, movement, satsuki	Float64DType	False	999 (99.9%)	0.0277	0.0827	0.000267	0.00119	1.52
5	text: bitchymitchy, vague, enough	Float64DType	False	999 (99.9%)	0.0256	0.0817	0.000259	0.000897	0.803
6	text: kiwikid, actually, mistype	Float64DType	False	999 (99.9%)	0.0200	0.0704	0.000262	0.000960	0.721
7	text: dumbfucks, fuckzuckerberg, fucktard	Float64DType	False	999 (99.9%)	0.0193	0.0845	0.000259	0.000767	1.91
8	text: morgana, melancholy, stories	Float64DType	False	999 (99.9%)	0.0475	0.246	0.000276	0.00335	7.29
9	text: recommended, recommend, noodles	Float64DType	False	999 (99.9%)	0.0373	0.156	0.000259	0.00248	4.21
10	text: constituents, definitely, considerable	Float64DType	False	999 (99.9%)	0.0434	0.211	0.000255	0.00363	6.17
11	text: ability, choosing, research	Float64DType	False	999 (99.9%)	0.0359	0.137	0.000255	0.00144	3.13
12	text: governments, destroying, commandments	Float64DType	False	999 (99.9%)	0.0518	0.421	0.000252	0.00355	13.0
13	text: pseudoscience, yourselves, ourselves	Float64DType	False	999 (99.9%)	0.0595	0.455	0.000252	0.00263	14.0
14	text: interesting, supporters, interested	Float64DType	False	999 (99.9%)	0.0329	0.125	0.000259	0.00163	3.22
15	text: minorities, negligent, baldwins	Float64DType	False	999 (99.9%)	0.0313	0.118	0.000262	0.00161	2.88
16	text: administration, murderers, murderer	Float64DType	False	999 (99.9%)	0.0237	0.0866	0.000256	0.00115	1.06
17	text: emblazoned, confederate, changing	Float64DType	False	999 (99.9%)	0.0512	0.368	0.000253	0.00414	11.4
18	text: fentanyl, georgefloyd, floyed	Float64DType	False	999 (99.9%)	0.0190	0.125	0.000257	0.000634	2.07
19	text: bahiyyih, independent, guidance	Float64DType	False	999 (99.9%)	0.0329	0.104	0.000262	0.00164	2.46
20	text: afghanistan, withdrawal, campaigned	Float64DType	False	999 (99.9%)	0.0365	0.197	0.000258	0.00235	5.89
21	text: unrelentless, relentlessly, intentionally	Float64DType	False	999 (99.9%)	0.0527	0.296	0.000254	0.00645	8.98
22	text: republican, mythology, republicans	Float64DType	False	999 (99.9%)	0.0366	0.191	0.000257	0.00314	5.50
23	text: lackluster, previously, shyvana	Float64DType	False	999 (99.9%)	0.0513	0.452	0.000253	0.00371	14.1
24	text: ƞỉဌဌᕦѓ, psycho, psychopathic	Float64DType	False	999 (99.9%)	0.0139	0.0640	0.000257	0.000714	1.07
25	text: conservatives, indoctrination, consequences	Float64DType	False	999 (99.9%)	0.0421	0.190	0.000254	0.00279	5.46
26	text: americans, qualified, marxists	Float64DType	False	999 (99.9%)	0.0449	0.250	0.000254	0.00375	7.42
27	text: pizzaaaa, cuntmila, libcunts	Float64DType	False	999 (99.9%)	0.0176	0.0788	0.000255	0.000614	1.61
28	text: michelle, abrahamic, transfer	Float64DType	False	999 (99.9%)	0.0164	0.0929	0.000264	0.000727	1.61
29	text: difficult, netcode, rollback	Float64DType	False	999 (99.9%)	0.0259	0.0816	0.000261	0.00143	1.46
30	text: nxhcplzrecw, friends, unfriend	Float64DType	False	999 (99.9%)	0.0168	0.0592	0.000259	0.000926	0.696

We can use a heatmap to highlight the highest activations, making them more visible for comparison against the original text and vectors above.

import numpy as np
from matplotlib import pyplot as plt


def plot_gap_feature_importance(X_trans):
    x_samples = X_trans.pop("text")

    # We slightly format the topics and labels for them to fit on the plot.
    topic_labels = [x.replace("text: ", "") for x in X_trans.columns]
    labels = x_samples.str[:50].values + "..."

    # We clip large outliers to make activations more visible.
    X_trans = np.clip(X_trans, a_min=None, a_max=200)

    plt.figure(figsize=(10, 10), dpi=200)
    plt.imshow(X_trans.T)

    plt.yticks(
        range(len(topic_labels)),
        labels=topic_labels,
        ha="right",
        size=12,
    )
    plt.xticks(range(len(labels)), labels=labels, size=12, rotation=50, ha="right")

    plt.colorbar().set_label(label="Topic activations", size=13)
    plt.ylabel("Latent topics", size=14)
    plt.xlabel("Data entries", size=14)
    plt.tight_layout()
    plt.show()


plot_gap_feature_importance(X_trans.head())

Now that we have an understanding of the vectors produced by the GapEncoder, let’s evaluate its performance in toxicity classification. The GapEncoder excels at handling categorical columns with high cardinality, but here the column consists of free-form text. Sentences are generally longer than category labels and contain more unique n-grams.

To benchmark the performance of the GapEncoder against the toxicity dataset, we integrate it into a TableVectorizer, as introduced in the previous example, and create a Pipeline by appending a HistGradientBoostingClassifier, which consumes the vectors produced by the GapEncoder.

We set n_components to 30; however, to achieve the best performance, we would need to find the optimal value for this hyperparameter using either GridSearchCV or RandomizedSearchCV. We skip this part to keep the computation time short for this small example.

Recall that the ROC AUC is a metric that quantifies the ranking power of estimators, where a random estimator scores 0.5 and an oracle, which provides perfect predictions, scores 1.
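To make the ranking interpretation concrete, here is a minimal pure-Python sketch that computes the ROC AUC as the fraction of (positive, negative) pairs ranked in the right order. The function name `roc_auc` and the toy labels and scores are assumptions for illustration; the example itself relies on scikit-learn's `"roc_auc"` scorer.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
# Every toxic entry is scored above every non-toxic one.
oracle = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])
# Uninformative scores: every pair is a tie.
constant = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])
# One of the four (positive, negative) pairs is ranked wrongly.
partial = roc_auc(y_true, [0.1, 0.8, 0.2, 0.9])
print(oracle, constant, partial)  # 1.0 0.5 0.75
```

A constant scorer lands exactly on 0.5 here; a truly random scorer reaches 0.5 only on average over many draws.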

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

from skrub import TableVectorizer


def plot_box_results(named_results):
    fig, ax = plt.subplots()
    names, scores = zip(
        *[(name, result["test_score"]) for name, result in named_results]
    )
    ax.boxplot(scores)
    ax.set_xticks(range(1, len(names) + 1), labels=list(names), size=12)
    ax.set_ylabel("ROC AUC", size=14)
    plt.title(
        "AUC distribution across folds (higher is better)",
        size=14,
    )
    plt.show()


results = []

Now we can evaluate the performance of the GapEncoder in toxicity classification.

gap_pipe = make_pipeline(
    TableVectorizer(high_cardinality=GapEncoder(n_components=30)),
    HistGradientBoostingClassifier(),
)
gap_results = cross_validate(gap_pipe, X, y, scoring="roc_auc")
results.append(("GapEncoder", gap_results))

plot_box_results(results)

Figure: box plot of ROC AUC per encoder, titled "AUC distribution across folds (higher is better)".

MinHashEncoder#

We now compare these results with the MinHashEncoder, which is faster and produces vectors better suited for tree-based estimators like HistGradientBoostingClassifier. To do this, we build the same pipeline as before with the MinHashEncoder in place of the GapEncoder; calling set_params() on the previous pipeline could achieve the same swap.

from skrub import MinHashEncoder

minhash_pipe = make_pipeline(
    TableVectorizer(high_cardinality=MinHashEncoder(n_components=30)),
    HistGradientBoostingClassifier(),
)
minhash_results = cross_validate(minhash_pipe, X, y, scoring="roc_auc")
results.append(("MinHashEncoder", minhash_results))

plot_box_results(results)

Remarkably, the vectors produced by the MinHashEncoder offer less predictive power than those from the GapEncoder on this dataset.
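For intuition about what a min-hash vector contains, the following is a simplified sketch of the general technique over character 3-grams. It is not skrub's implementation: the hash function (seeded MD5), the helper names, and the handling of very short strings are all assumptions chosen to keep the example self-contained.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def seeded_hash(ngram, seed):
    """Deterministic integer hash of an n-gram for a given seed."""
    digest = hashlib.md5(f"{seed}:{ngram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, n_components=30, n=3):
    """One minimum per seeded hash function, taken over the string's n-grams."""
    grams = char_ngrams(text, n)
    if not grams:
        # Hypothetical convention for strings shorter than n characters.
        return [0] * n_components
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


def agreement(u, v):
    """Fraction of components on which two min-hash vectors coincide."""
    return sum(x == y for x, y in zip(u, v)) / len(u)


a = minhash("republican")
b = minhash("republicans")
c = minhash("noodles")
# Strings sharing most n-grams agree on most components; unrelated ones do not.
print(agreement(a, b), agreement(a, c))
```

Each component only records which n-gram happened to hash lowest, so the encoding is cheap to compute and needs no fitting. It also carries no notion of topics, which is one plausible reason it can trail the GapEncoder on free-form text.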

TextEncoder#

Let’s now shift our focus to pre-trained deep learning encoders. Our previous encoders are syntactic models that we trained directly on the toxicity dataset. To generate more powerful vector representations for free-form text and diverse entries, we can instead use semantic models, such as BERT, which have been trained on very large datasets.

TextEncoder enables you to integrate any Sentence Transformer model from the Hugging Face Hub (or from your local disk) into your Pipeline to transform a text column in a dataframe. By default, TextEncoder uses the e5-small-v2 model.

from skrub import TextEncoder

text_encoder = TextEncoder(
    "sentence-transformers/paraphrase-albert-small-v2",
    device="cpu",
)

text_encoder_pipe = make_pipeline(
    TableVectorizer(high_cardinality=text_encoder),
    HistGradientBoostingClassifier(),
)
text_encoder_results = cross_validate(text_encoder_pipe, X, y, scoring="roc_auc")
results.append(("TextEncoder", text_encoder_results))

plot_box_results(results)

StringEncoder#

TextEncoder embeddings are very strong, but they are also quite expensive to use. A simpler, faster alternative for encoding strings is the StringEncoder, which first applies a tf-idf vectorization (computing vectors of rescaled word counts of the text) and then uses TruncatedSVD to reduce the number of dimensions to, in this case, 30.
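The first of these two steps can be sketched in pure Python as follows. This is an illustrative tf-idf over character 3-grams under stated assumptions: the helper names are hypothetical, the smoothed idf formula is the one scikit-learn documents for `smooth_idf=True`, and the dimensionality reduction step is deliberately left out.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def tfidf(docs, n=3):
    """Return the sorted vocabulary and one L2-normalised tf-idf row per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram appears.
    df = {g: sum(g in c for c in counts) for g in vocab}
    # Smoothed inverse document frequency: rarer n-grams get larger weights.
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    rows = []
    for c in counts:
        row = [c[g] * idf[g] for g in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity of two already L2-normalised rows."""
    return sum(x * y for x, y in zip(u, v))


docs = ["recommend", "recommended", "noodles"]
vocab, rows = tfidf(docs)
# Related word forms stay close; strings with no shared n-gram are orthogonal.
print(len(vocab), cosine(rows[0], rows[1]), cosine(rows[0], rows[2]))
```

On a real corpus the resulting matrix has one column per distinct n-gram, often many thousands of columns, which is what the truncated SVD step then compresses into a small dense vector.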

from skrub import StringEncoder

string_encoder = StringEncoder(ngram_range=(3, 4), analyzer="char_wb", random_state=0)

string_encoder_pipe = make_pipeline(
    TableVectorizer(high_cardinality=string_encoder),
    HistGradientBoostingClassifier(),
)

string_encoder_results = cross_validate(string_encoder_pipe, X, y, scoring="roc_auc")
results.append(("StringEncoder", string_encoder_results))

plot_box_results(results)

The performance of the TextEncoder is significantly stronger than that of the syntactic encoders, which is expected. But how long does it take to load and vectorize text on a CPU using a Sentence Transformer model? Below, we display the tradeoff between predictive accuracy and training time. Note that since we are not training the Sentence Transformer model, the “fitting time” refers to the time taken for vectorization.

def plot_performance_tradeoff(results):
    fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
    markers = ["s", "o", "^", "x"]
    for idx, (name, result) in enumerate(results):
        ax.scatter(
            result["fit_time"],
            result["test_score"],
            label=name,
            marker=markers[idx],
        )
        mean_fit_time = np.mean(result["fit_time"])
        mean_score = np.mean(result["test_score"])
        ax.scatter(
            mean_fit_time,
            mean_score,
            color="k",
            marker=markers[idx],
        )
        std_fit_time = np.std(result["fit_time"])
        std_score = np.std(result["test_score"])
        ax.errorbar(
            x=mean_fit_time,
            y=mean_score,
            yerr=std_score,
            fmt="none",
            c="k",
            capsize=2,
        )
        ax.errorbar(
            x=mean_fit_time,
            y=mean_score,
            xerr=std_fit_time,
            fmt="none",
            c="k",
            capsize=2,
        )
    ax.set_xscale("log")

    ax.set_xlabel("Time to fit (seconds)")
    ax.set_ylabel("ROC AUC")
    ax.set_title("Prediction performance / training time trade-off")

    ax.annotate(
        "Best time / \nperformance trade-off",
        xy=(0.05, 0.95),
        xycoords="axes fraction",
        xytext=(0.2, 0.8),
        textcoords="axes fraction",
        arrowprops=dict(arrowstyle="->", lw=1.5, mutation_scale=15),
    )
    ax.legend(bbox_to_anchor=(1.02, 0.3))
    plt.show()


plot_performance_tradeoff(results)

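Figure: scatter plot of ROC AUC against time to fit (seconds, log scale) for each encoder, titled "Prediction performance / training time trade-off".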

The black points represent the average time to fit and the average AUC for each vectorizer, and the error bars extend one standard deviation on either side of these averages.
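As a small worked example of how such a point and its error bars are derived, the sketch below summarizes per-fold values with their mean and standard deviation. The numbers are hypothetical and only mimic the shape of the `"fit_time"` and `"test_score"` entries returned by `cross_validate`; they are not results from this example.

```python
import statistics

# Hypothetical per-fold values, shaped like the "fit_time" and "test_score"
# entries returned by cross_validate; these are not results from this example.
fold_results = {
    "fit_time": [2.0, 4.0, 4.0, 6.0, 4.0],
    "test_score": [0.70, 0.72, 0.74, 0.72, 0.72],
}


def summarize(values):
    """Mean and population standard deviation, matching np.mean and np.std defaults."""
    return statistics.fmean(values), statistics.pstdev(values)


mean_time, std_time = summarize(fold_results["fit_time"])
mean_score, std_score = summarize(fold_results["test_score"])

# Each error bar spans mean - std to mean + std along its axis.
time_bar = (mean_time - std_time, mean_time + std_time)
score_bar = (mean_score - std_score, mean_score + std_score)
print(time_bar, score_bar)
```

With only five folds these standard deviations are rough estimates, so small differences between encoders should be read with caution.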

The green outlier dot on the right side of the plot corresponds to the first time the Sentence Transformers model was downloaded and loaded into memory. During the subsequent cross-validation iterations, the model is simply copied, which reduces computation time for the remaining folds.

Interestingly, StringEncoder has a performance remarkably similar to that of GapEncoder, while being significantly faster.

Conclusion#

In conclusion, TextEncoder provides powerful vectorization for text, but at the cost of longer computation times and the need for additional dependencies, such as torch. StringEncoder represents a simpler alternative that can provide good performance at a fraction of the cost of more complex methods.

Total running time of the script: (3 minutes 39.477 seconds)

Estimated memory usage: 1270 MB

	text	text: beautifuls, beautiful, painfully	text: disrespectful, disrespects, respectful	text: suicide, joking, laughing	text: nvrseqrvrqrqrr, movement, satsuki	text: bitchymitchy, vague, enough	text: kiwikid, actually, mistype	text: dumbfucks, fuckzuckerberg, fucktard	text: morgana, melancholy, stories	text: recommended, recommend, noodles	text: constituents, definitely, considerable	text: ability, choosing, research	text: governments, destroying, commandments	text: pseudoscience, yourselves, ourselves	text: interesting, supporters, interested	text: minorities, negligent, baldwins	text: administration, murderers, murderer	text: emblazoned, confederate, changing	text: fentanyl, georgefloyd, floyed	text: bahiyyih, independent, guidance	text: afghanistan, withdrawal, campaigned	text: unrelentless, relentlessly, intentionally	text: republican, mythology, republicans	text: lackluster, previously, shyvana	text: ƞỉဌဌᕦѓ, psycho, psychopathic	text: conservatives, indoctrination, consequences	text: americans, qualified, marxists	text: pizzaaaa, cuntmila, libcunts	text: michelle, abrahamic, transfer	text: difficult, netcode, rollback	text: nxhcplzrecw, friends, unfriend
0	Two-Minute Mysteries by Donald J. Sobol? I read those in middle school. Might be a different think though.	0.000811	0.00262	0.00171	0.00225	0.199	0.00257	0.00266	0.114	0.0763	0.0403	0.0429	0.00693	0.00422	0.0112	0.00462	0.00243	0.00202	0.00232	0.00308	0.00229	0.00147	0.00645	0.0153	0.00124	0.00144	0.112	0.00119	0.00320	0.114	0.00866
1	Are you feeling it now, mr mark?	0.000354	0.000476	0.00290	0.000490	0.000505	0.000502	0.000413	0.0357	0.000691	0.000526	0.00326	0.00138	0.0506	0.000730	0.000710	0.000407	0.00121	0.000557	0.000478	0.00129	0.000553	0.000980	0.00135	0.108	0.0152	0.00128	0.000340	0.000411	0.00164	0.000459
2	Can we get some pro environment conspiracy theories going? What about the gay frogs?! Surely this substance will threaten male virility!	0.000950	0.000964	0.0659	0.0107	0.0714	0.00114	0.00108	0.0697	0.000869	0.0651	0.0129	0.0361	0.00182	0.000957	0.0258	0.0124	0.102	0.00111	0.0793	0.00225	0.0841	0.0629	0.00111	0.00108	0.0414	0.00871	0.000587	0.000985	0.0401	0.212
3	so many haters on the internet . dident get what so bad about this video.... i think u just jealous logan paul.... fuck humens	0.000553	0.00446	0.170	0.00109	0.00201	0.000937	0.00918	0.0266	0.00174	0.0882	0.000916	0.00198	0.0252	0.00260	0.00117	0.00500	0.0115	0.000480	0.00322	0.0282	0.121	0.0536	0.00224	0.000442	0.00118	0.0225	0.344	0.000481	0.00862	0.000666
4	I do this with store-bought breaded shrimp all the time. The way you are better than be is that the seasoning is IN the batter, not dusted on top. Play! Experiment! You'll do something wonderful if you are already putting in that effort.	0.00211	0.126	0.146	0.00146	0.00451	0.00104	0.00282	0.0890	0.0976	0.296	0.00605	0.0266	0.343	0.0597	0.00198	0.0628	0.0460	0.000665	0.00700	0.00455	0.0298	0.00612	0.251	0.0337	0.0719	0.0504	0.000599	0.000698	0.00138	0.00261

995	ILLEGITIMATE INCOMPETENT TYRANT TALIBIDEN	0.00193	0.0140	0.000513	0.000458	0.00110	0.000620	0.000554	0.000665	0.00125	0.00103	0.0425	0.00250	0.00146	0.00196	0.000730	0.0868	0.00135	0.00218	0.0179	0.0330	0.0304	0.0134	0.00106	0.000607	0.00140	0.00109	0.000566	0.0330	0.00151	0.00506
996	From the theories I’ve read I’m pretty sure you’re right and he’s singed	0.000410	0.00104	0.000780	0.000695	0.000918	0.000526	0.000433	0.0726	0.0201	0.00271	0.0447	0.00153	0.0112	0.00221	0.00157	0.00103	0.00527	0.000484	0.193	0.0947	0.00216	0.00179	0.0680	0.000611	0.00179	0.00116	0.000376	0.000450	0.000709	0.000731
997	I can't bear how nice this is. I guess its bearnessities. I'll see my self out	0.00584	0.0224	0.000569	0.000954	0.0206	0.102	0.000614	0.147	0.000663	0.000762	0.0414	0.00104	0.0168	0.00160	0.00402	0.000627	0.000601	0.000523	0.00181	0.00270	0.0512	0.0183	0.130	0.000620	0.000738	0.00204	0.000860	0.000543	0.00208	0.000720
998	I’m 70 and I agree.	0.000310	0.000341	0.000340	0.00236	0.000350	0.000384	0.000319	0.0777	0.000417	0.000546	0.000468	0.000520	0.000381	0.000386	0.000371	0.000547	0.000356	0.000365	0.0448	0.000367	0.000378	0.000424	0.000376	0.000427	0.000355	0.000347	0.000338	0.000333	0.000337	0.000320
999	Especially Today's Politicians #traderjoe #delouse #fuckliberals #wethepewple #iwillnotcomply #defiant #fafo #hardinasoftworld #hellandback #irrepressible #shepherdoffire ⌖ #america #freedom #2A #shallnotbeinfringed #noquarter #nosurrender #molonlabe #letfreedomring #keepthepowderdry #patriot #veteran #warrior	0.00195	0.00207	0.00191	0.000690	0.000689	0.00236	0.0137	0.00192	0.00216	0.00622	0.00155	0.0990	0.00163	0.00339	0.115	0.0248	0.0381	1.28	0.0136	0.00140	0.0230	0.139	0.00296	0.00128	0.00269	0.0854	0.000635	0.456	0.00151	0.00106
