Various string encoders: a sentiment analysis example
In this example, we explore the performance of string and categorical encoders available in skrub.
The Toxicity dataset
We focus on the toxicity dataset, a corpus of 1,000 tweets, evenly balanced between the binary labels “Toxic” and “Not Toxic”. Our goal is to assign each entry one of these two labels, using only the text of the tweets as features.
import pandas as pd
from skrub.datasets import fetch_toxicity
We load the dataset from the path using pandas.
file_path = fetch_toxicity().path
X = pd.read_csv(file_path)
When it comes to displaying large chunks of text, the TableReport is especially
useful! Click on any cell below to expand and read the tweet in full.
from skrub import TableReport
TableReport(X)
| | text | is_toxic |
|---|---|---|
| 0 | Two-Minute Mysteries by Donald J. Sobol? I read those in middle school. Might be a different think though. | Not Toxic |
| 1 | Are you feeling it now, mr mark? | Not Toxic |
| 2 | Can we get some pro environment conspiracy theories going? What about the gay frogs?! Surely this substance will threaten male virility! | Not Toxic |
| 3 | so many haters on the internet . dident get what so bad about this video.... i think u just jealous logan paul.... fuck humens | Toxic |
| 4 | I do this with store-bought breaded shrimp all the time. The way you are better than be is that the seasoning is IN the batter, not dusted on top. Play! Experiment! You'll do something wonderful if you are already putting in that effort. | Not Toxic |
| 995 | ILLEGITIMATE INCOMPETENT TYRANT TALIBIDEN | Toxic |
| 996 | From the theories I’ve read I’m pretty sure you’re right and he’s singed | Not Toxic |
| 997 | I can't bear how nice this is. I guess its bearnessities. I'll see my self out | Not Toxic |
| 998 | I’m 70 and I agree. | Not Toxic |
| 999 | Especially Today's Politicians #traderjoe #delouse #fuckliberals #wethepewple #iwillnotcomply #defiant #fafo #hardinasoftworld #hellandback #irrepressible #shepherdoffire ⌖ #america #freedom #2A #shallnotbeinfringed #noquarter #nosurrender #molonlabe #letfreedomring #keepthepowderdry #patriot #veteran #warrior | Toxic |
We prepare the target variable by mapping the binary labels “Toxic” and “Not Toxic” to 1 and 0, respectively. The target is reused throughout the example.
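The code for this step is not shown above. As a minimal sketch of the mapping, the snippet below runs on a small hypothetical frame (`toy_X`, with made-up rows) that mimics the two columns of the dataset. On the real data, the same `pop` and `map` calls would presumably be applied to `X`, leaving only the `text` column as features.

```python
import pandas as pd

# Hypothetical stand-in for the toxicity dataframe (same column names,
# made-up rows), used only to illustrate the label mapping.
toy_X = pd.DataFrame(
    {
        "text": ["I agree.", "you are all idiots", "Nice recipe!"],
        "is_toxic": ["Not Toxic", "Toxic", "Not Toxic"],
    }
)

# Remove the label column from the features and encode it as 1 / 0.
toy_y = toy_X.pop("is_toxic").map({"Toxic": 1, "Not Toxic": 0})

print(toy_y.tolist())
print(list(toy_X.columns))
```

Using `pop` rather than plain indexing keeps the label out of the feature table, so it cannot leak into the encoders during cross-validation.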
GapEncoder
First, let’s vectorize our text column using the GapEncoder, one of the
high cardinality categorical encoders
provided by skrub.
As introduced in the previous example, the GapEncoder
performs matrix factorization for topic modeling. It builds latent topics by
capturing combinations of substrings that frequently co-occur, and encoded vectors
correspond to topic activations.
To interpret these latent topics, we select for each of them a few labels from the input data with the highest activations. In the example below we select 3 labels to summarize each topic.
from skrub import GapEncoder
gap = GapEncoder(n_components=30)
X_trans = gap.fit_transform(X["text"])
# Add the original text as a first column
X_trans.insert(0, "text", X["text"])
TableReport(X_trans)
| | text | text: beautifuls, beautiful, painfully | text: disrespectful, disrespects, respectful | text: suicide, joking, laughing | text: nvrseqrvrqrqrr, movement, satsuki | text: bitchymitchy, vague, enough | text: kiwikid, actually, mistype | text: dumbfucks, fuckzuckerberg, fucktard | text: morgana, melancholy, stories | text: recommended, recommend, noodles | text: constituents, definitely, considerable | text: ability, choosing, research | text: governments, destroying, commandments | text: pseudoscience, yourselves, ourselves | text: interesting, supporters, interested | text: minorities, negligent, baldwins | text: administration, murderers, murderer | text: emblazoned, confederate, changing | text: fentanyl, georgefloyd, floyed | text: bahiyyih, independent, guidance | text: afghanistan, withdrawal, campaigned | text: unrelentless, relentlessly, intentionally | text: republican, mythology, republicans | text: lackluster, previously, shyvana | text: ƞỉဌဌᕦѓ, psycho, psychopathic | text: conservatives, indoctrination, consequences | text: americans, qualified, marxists | text: pizzaaaa, cuntmila, libcunts | text: michelle, abrahamic, transfer | text: difficult, netcode, rollback | text: nxhcplzrecw, friends, unfriend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Two-Minute Mysteries by Donald J. Sobol? I read those in middle school. Might be a different think though. | 0.000811 | 0.00262 | 0.00171 | 0.00225 | 0.199 | 0.00257 | 0.00266 | 0.114 | 0.0763 | 0.0403 | 0.0429 | 0.00693 | 0.00422 | 0.0112 | 0.00462 | 0.00243 | 0.00202 | 0.00232 | 0.00308 | 0.00229 | 0.00147 | 0.00645 | 0.0153 | 0.00124 | 0.00144 | 0.112 | 0.00119 | 0.00320 | 0.114 | 0.00866 |
| 1 | Are you feeling it now, mr mark? | 0.000354 | 0.000476 | 0.00290 | 0.000490 | 0.000505 | 0.000502 | 0.000413 | 0.0357 | 0.000691 | 0.000526 | 0.00326 | 0.00138 | 0.0506 | 0.000730 | 0.000710 | 0.000407 | 0.00121 | 0.000557 | 0.000478 | 0.00129 | 0.000553 | 0.000980 | 0.00135 | 0.108 | 0.0152 | 0.00128 | 0.000340 | 0.000411 | 0.00164 | 0.000459 |
| 2 | Can we get some pro environment conspiracy theories going? What about the gay frogs?! Surely this substance will threaten male virility! | 0.000950 | 0.000964 | 0.0659 | 0.0107 | 0.0714 | 0.00114 | 0.00108 | 0.0697 | 0.000869 | 0.0651 | 0.0129 | 0.0361 | 0.00182 | 0.000957 | 0.0258 | 0.0124 | 0.102 | 0.00111 | 0.0793 | 0.00225 | 0.0841 | 0.0629 | 0.00111 | 0.00108 | 0.0414 | 0.00871 | 0.000587 | 0.000985 | 0.0401 | 0.212 |
| 3 | so many haters on the internet . dident get what so bad about this video.... i think u just jealous logan paul.... fuck humens | 0.000553 | 0.00446 | 0.170 | 0.00109 | 0.00201 | 0.000937 | 0.00918 | 0.0266 | 0.00174 | 0.0882 | 0.000916 | 0.00198 | 0.0252 | 0.00260 | 0.00117 | 0.00500 | 0.0115 | 0.000480 | 0.00322 | 0.0282 | 0.121 | 0.0536 | 0.00224 | 0.000442 | 0.00118 | 0.0225 | 0.344 | 0.000481 | 0.00862 | 0.000666 |
| 4 | I do this with store-bought breaded shrimp all the time. The way you are better than be is that the seasoning is IN the batter, not dusted on top. Play! Experiment! You'll do something wonderful if you are already putting in that effort. | 0.00211 | 0.126 | 0.146 | 0.00146 | 0.00451 | 0.00104 | 0.00282 | 0.0890 | 0.0976 | 0.296 | 0.00605 | 0.0266 | 0.343 | 0.0597 | 0.00198 | 0.0628 | 0.0460 | 0.000665 | 0.00700 | 0.00455 | 0.0298 | 0.00612 | 0.251 | 0.0337 | 0.0719 | 0.0504 | 0.000599 | 0.000698 | 0.00138 | 0.00261 |
| 995 | ILLEGITIMATE INCOMPETENT TYRANT TALIBIDEN | 0.00193 | 0.0140 | 0.000513 | 0.000458 | 0.00110 | 0.000620 | 0.000554 | 0.000665 | 0.00125 | 0.00103 | 0.0425 | 0.00250 | 0.00146 | 0.00196 | 0.000730 | 0.0868 | 0.00135 | 0.00218 | 0.0179 | 0.0330 | 0.0304 | 0.0134 | 0.00106 | 0.000607 | 0.00140 | 0.00109 | 0.000566 | 0.0330 | 0.00151 | 0.00506 |
| 996 | From the theories I’ve read I’m pretty sure you’re right and he’s singed | 0.000410 | 0.00104 | 0.000780 | 0.000695 | 0.000918 | 0.000526 | 0.000433 | 0.0726 | 0.0201 | 0.00271 | 0.0447 | 0.00153 | 0.0112 | 0.00221 | 0.00157 | 0.00103 | 0.00527 | 0.000484 | 0.193 | 0.0947 | 0.00216 | 0.00179 | 0.0680 | 0.000611 | 0.00179 | 0.00116 | 0.000376 | 0.000450 | 0.000709 | 0.000731 |
| 997 | I can't bear how nice this is. I guess its bearnessities. I'll see my self out | 0.00584 | 0.0224 | 0.000569 | 0.000954 | 0.0206 | 0.102 | 0.000614 | 0.147 | 0.000663 | 0.000762 | 0.0414 | 0.00104 | 0.0168 | 0.00160 | 0.00402 | 0.000627 | 0.000601 | 0.000523 | 0.00181 | 0.00270 | 0.0512 | 0.0183 | 0.130 | 0.000620 | 0.000738 | 0.00204 | 0.000860 | 0.000543 | 0.00208 | 0.000720 |
| 998 | I’m 70 and I agree. | 0.000310 | 0.000341 | 0.000340 | 0.00236 | 0.000350 | 0.000384 | 0.000319 | 0.0777 | 0.000417 | 0.000546 | 0.000468 | 0.000520 | 0.000381 | 0.000386 | 0.000371 | 0.000547 | 0.000356 | 0.000365 | 0.0448 | 0.000367 | 0.000378 | 0.000424 | 0.000376 | 0.000427 | 0.000355 | 0.000347 | 0.000338 | 0.000333 | 0.000337 | 0.000320 |
| 999 | Especially Today's Politicians #traderjoe #delouse #fuckliberals #wethepewple #iwillnotcomply #defiant #fafo #hardinasoftworld #hellandback #irrepressible #shepherdoffire ⌖ #america #freedom #2A #shallnotbeinfringed #noquarter #nosurrender #molonlabe #letfreedomring #keepthepowderdry #patriot #veteran #warrior | 0.00195 | 0.00207 | 0.00191 | 0.000690 | 0.000689 | 0.00236 | 0.0137 | 0.00192 | 0.00216 | 0.00622 | 0.00155 | 0.0990 | 0.00163 | 0.00339 | 0.115 | 0.0248 | 0.0381 | 1.28 | 0.0136 | 0.00140 | 0.0230 | 0.139 | 0.00296 | 0.00128 | 0.00269 | 0.0854 | 0.000635 | 0.456 | 0.00151 | 0.00106 |
We can use a heatmap to highlight the highest activations, making them more visible for comparison against the original text and vectors above.
import numpy as np
from matplotlib import pyplot as plt
def plot_gap_feature_importance(X_trans):
    x_samples = X_trans.pop("text")
    # We slightly format the topics and labels for them to fit on the plot.
    topic_labels = [x.replace("text: ", "") for x in X_trans.columns]
    labels = x_samples.str[:50].values + "..."
    # We clip large outliers to make activations more visible.
    X_trans = np.clip(X_trans, a_min=None, a_max=200)
    plt.figure(figsize=(10, 10), dpi=200)
    plt.imshow(X_trans.T)
    plt.yticks(
        range(len(topic_labels)),
        labels=topic_labels,
        ha="right",
        size=12,
    )
    plt.xticks(range(len(labels)), labels=labels, size=12, rotation=50, ha="right")
    plt.colorbar().set_label(label="Topic activations", size=13)
    plt.ylabel("Latent topics", size=14)
    plt.xlabel("Data entries", size=14)
    plt.tight_layout()
    plt.show()
plot_gap_feature_importance(X_trans.head())

Now that we have an understanding of the vectors produced by the GapEncoder,
let’s evaluate its performance in toxicity classification. The GapEncoder excels
at handling categorical columns with high cardinality, but here the column consists
of free-form text. Sentences are generally longer, with more unique n-grams than
high-cardinality categories.
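To make this contrast concrete, the sketch below counts the unique character 3-grams in a short category-like string and in the opening sentence of one of the tweets shown earlier. The helper `char_ngrams`, the choice of 3-grams, and the example category are assumptions made for illustration; this is not the tokenization used internally by the GapEncoder.

```python
def char_ngrams(text, n=3):
    """Return the set of unique character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


# Hypothetical high-cardinality category value (short, label-like).
category = "Police Officer III"
# Opening sentence of one tweet from the dataset preview above.
tweet = "I do this with store-bought breaded shrimp all the time."

n_category = len(char_ngrams(category))
n_tweet = len(char_ngrams(tweet))

print(f"category: {n_category} unique 3-grams")
print(f"tweet:    {n_tweet} unique 3-grams")
```

Even a single sentence yields several times more distinct n-grams than a typical category label, which gives the encoder a much larger and sparser vocabulary to factorize.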
To benchmark the performance of the GapEncoder against the toxicity dataset,
we integrate it into a TableVectorizer, as introduced in the
previous example,
and create a Pipeline by appending a HistGradientBoostingClassifier, which
consumes the vectors produced by the GapEncoder.
We set n_components to 30; however, to achieve the best performance, we would
need to find the optimal value for this hyperparameter using either GridSearchCV
or RandomizedSearchCV. We skip this step to keep the computation time of this
example small.
Recall that the ROC AUC is a metric that quantifies the ranking power of estimators, where a random estimator scores 0.5, and an oracle (providing perfect predictions) scores 1.
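As a reminder of what this metric measures, here is a minimal pure-Python sketch that computes the ROC AUC as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as one half. The helper `pairwise_auc` and the toy scores are hypothetical and written for illustration only; the pipelines in this example rely on scikit-learn's `"roc_auc"` scorer.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]

perfect = pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.9])   # every positive on top
constant = pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # uninformative scores
inverted = pairwise_auc(y_true, [0.9, 0.8, 0.2, 0.1])  # ranking reversed

print(perfect, constant, inverted)
```

The quadratic loop is only meant to expose the definition; it would be too slow for large datasets.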
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer
def plot_box_results(named_results):
    fig, ax = plt.subplots()
    names, scores = zip(
        *[(name, result["test_score"]) for name, result in named_results]
    )
    ax.boxplot(scores)
    ax.set_xticks(range(1, len(names) + 1), labels=list(names), size=12)
    ax.set_ylabel("ROC AUC", size=14)
    plt.title(
        "AUC distribution across folds (higher is better)",
        size=14,
    )
    plt.show()
results = []
Now we can evaluate the performance of the GapEncoder in toxicity classification.
gap_pipe = make_pipeline(
    TableVectorizer(high_cardinality=GapEncoder(n_components=30)),
    HistGradientBoostingClassifier(),
)
gap_results = cross_validate(gap_pipe, X, y, scoring="roc_auc")
results.append(("GapEncoder", gap_results))
plot_box_results(results)
MinHashEncoder#
We now compare these results with the MinHashEncoder, which is faster
and produces vectors better suited for tree-based estimators like
HistGradientBoostingClassifier. To do this, we rebuild the previous
pipeline with the MinHashEncoder in place of the GapEncoder. The same
swap could also be done on the existing pipeline using set_params().
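As a reminder of how set_params() swaps a component, here is a small sketch on a generic scikit-learn pipeline. The scalers and classifier are placeholders chosen for illustration, not the estimators of this example. For a nested component such as the encoder inside the TableVectorizer, the key would presumably be a double-underscore path like tablevectorizer__high_cardinality, which is an assumption here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name.
print(list(pipe.named_steps))  # ['standardscaler', 'logisticregression']

# Replace the whole first step in place; the step keeps its original name.
pipe.set_params(standardscaler=MinMaxScaler())

# Parameters of a step are reached with <step name>__<parameter name>.
pipe.set_params(logisticregression__C=0.1)

first_step_name, first_step = pipe.steps[0]
print(first_step_name, type(first_step).__name__)
```

Note that set_params() modifies the pipeline in place, so the original configuration is lost unless the pipeline is cloned first.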
from skrub import MinHashEncoder
minhash_pipe = make_pipeline(
    TableVectorizer(high_cardinality=MinHashEncoder(n_components=30)),
    HistGradientBoostingClassifier(),
)
minhash_results = cross_validate(minhash_pipe, X, y, scoring="roc_auc")
results.append(("MinHashEncoder", minhash_results))
plot_box_results(results)

Remarkably, the vectors produced by the MinHashEncoder offer less predictive
power than those from the GapEncoder on this dataset.
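As background, the general min-hash idea can be sketched in a few lines of pure Python. This is a simplified illustration and not skrub's implementation: the 3-gram size, the salted hashlib.md5 hash functions, and all helper names are assumptions. Each output dimension keeps the minimum hash value over the character n-grams of the string, so strings that share many n-grams tend to agree on many dimensions.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of character n-grams, with spaces padded around the text."""
    padded = f" {text} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def salted_hash(ngram, salt):
    """One hash function per salt value, built from md5 (illustrative choice)."""
    digest = hashlib.md5(f"{salt}:{ngram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, n_components=30):
    """For each hash function, keep the minimum hash over the n-grams."""
    grams = char_ngrams(text)
    return [
        min(salted_hash(gram, salt) for gram in grams)
        for salt in range(n_components)
    ]


def signature_overlap(sig_a, sig_b):
    """Fraction of dimensions on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash("you are wonderful")
sig_b = minhash("you are wonderfull")  # near-duplicate with a typo
sig_c = minhash("completely unrelated sentence")

print(signature_overlap(sig_a, sig_b), signature_overlap(sig_a, sig_c))
```

The near-duplicate pair agrees on most dimensions, while the unrelated pair shares no 3-gram and therefore agrees on essentially none. Individual dimensions carry this information through thresholds on hash values, which is the kind of feature tree-based models split on.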
TextEncoder#
Let’s now shift our focus to pre-trained deep learning encoders. Our previous encoders are syntactic models that we trained directly on the toxicity dataset. To generate more powerful vector representations for free-form text and diverse entries, we can instead use semantic models, such as BERT, which have been trained on very large datasets.
TextEncoder enables you to integrate any Sentence Transformer model from the
Hugging Face Hub (or from your local disk) into your Pipeline to transform a text
column in a dataframe. By default, TextEncoder uses the e5-small-v2 model.
from skrub import TextEncoder
text_encoder = TextEncoder(
    "sentence-transformers/paraphrase-albert-small-v2",
    device="cpu",
)
text_encoder_pipe = make_pipeline(
    TableVectorizer(high_cardinality=text_encoder),
    HistGradientBoostingClassifier(),
)
text_encoder_results = cross_validate(text_encoder_pipe, X, y, scoring="roc_auc")
results.append(("TextEncoder", text_encoder_results))
plot_box_results(results)

StringEncoder#
TextEncoder embeddings are very strong, but they are also quite expensive to
use. A simpler, faster alternative for encoding strings is the StringEncoder,
which first computes a tf-idf representation (vectors of rescaled n-gram
counts of the text) and then applies TruncatedSVD to reduce the number of
dimensions to, in this case, 30.
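The two steps described above (tf-idf, then TruncatedSVD) can be approximated with plain scikit-learn components. This is a rough sketch of the idea rather than the StringEncoder's actual implementation; the toy corpus and the choice of 5 components (instead of 30) are assumptions made to keep the example tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus of 8 short texts.
corpus = [
    "I love this video",
    "I really love this channel",
    "what a wonderful recipe",
    "this recipe is wonderful",
    "you are a terrible person",
    "what a terrible video",
    "nobody likes your channel",
    "thanks for sharing this",
]

tfidf_svd = make_pipeline(
    # Step 1: rescaled counts of character 3- and 4-grams.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    # Step 2: project the sparse tf-idf matrix onto a few dense dimensions.
    TruncatedSVD(n_components=5, random_state=0),
)
embeddings = tfidf_svd.fit_transform(corpus)
print(embeddings.shape)  # (8, 5)
```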
from skrub import StringEncoder
string_encoder = StringEncoder(ngram_range=(3, 4), analyzer="char_wb", random_state=0)
string_encoder_pipe = make_pipeline(
    TableVectorizer(high_cardinality=string_encoder),
    HistGradientBoostingClassifier(),
)
string_encoder_results = cross_validate(string_encoder_pipe, X, y, scoring="roc_auc")
results.append(("StringEncoder", string_encoder_results))
plot_box_results(results)

The performance of the TextEncoder is significantly stronger than that of
the syntactic encoders, which is expected. But how long does it take to load
and vectorize text on a CPU using a Sentence Transformer model? Below, we display
the tradeoff between predictive accuracy and training time. Note that since we are
not training the Sentence Transformer model, the “fitting time” refers to the
time taken for vectorization.
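The timings used for this comparison come straight from cross_validate, which records a fit time and a test score for every fold. The sketch below shows the structure of that result on a synthetic dataset; the data and the LogisticRegression estimator are placeholders, not part of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, only to show the shape of the output.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

cv_results = cross_validate(
    LogisticRegression(), X_toy, y_toy, scoring="roc_auc", cv=5
)
print(sorted(cv_results))  # ['fit_time', 'score_time', 'test_score']

# One value per fold, which is what gets summarized per encoder.
mean_fit_time = np.mean(cv_results["fit_time"])
mean_score = np.mean(cv_results["test_score"])
print(f"fit time: {mean_fit_time:.4f}s, ROC AUC: {mean_score:.3f}")
```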
def plot_performance_tradeoff(results):
    fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
    markers = ["s", "o", "^", "x"]
    for idx, (name, result) in enumerate(results):
        ax.scatter(
            result["fit_time"],
            result["test_score"],
            label=name,
            marker=markers[idx],
        )
        mean_fit_time = np.mean(result["fit_time"])
        mean_score = np.mean(result["test_score"])
        ax.scatter(
            mean_fit_time,
            mean_score,
            color="k",
            marker=markers[idx],
        )
        std_fit_time = np.std(result["fit_time"])
        std_score = np.std(result["test_score"])
        ax.errorbar(
            x=mean_fit_time,
            y=mean_score,
            yerr=std_score,
            fmt="none",
            c="k",
            capsize=2,
        )
        ax.errorbar(
            x=mean_fit_time,
            y=mean_score,
            xerr=std_fit_time,
            fmt="none",
            c="k",
            capsize=2,
        )
    ax.set_xscale("log")
    ax.set_xlabel("Time to fit (seconds)")
    ax.set_ylabel("ROC AUC")
    ax.set_title("Prediction performance / training time trade-off")
    ax.annotate(
        "Best time / \nperformance trade-off",
        xy=(0.05, 0.95),
        xycoords="axes fraction",
        xytext=(0.2, 0.8),
        textcoords="axes fraction",
        arrowprops=dict(arrowstyle="->", lw=1.5, mutation_scale=15),
    )
    ax.legend(bbox_to_anchor=(1.02, 0.3))
    plt.show()
plot_performance_tradeoff(results)

The black points represent the average fit time and AUC for each vectorizer, and the error bars extend one standard deviation on either side of the mean.
The green outlier dot on the right side of the plot corresponds to the first time the Sentence Transformers model was downloaded and loaded into memory. During the subsequent cross-validation iterations, the model is simply copied, which reduces computation time for the remaining folds.
Interestingly, StringEncoder has a performance remarkably similar to that of
GapEncoder, while being significantly faster.
Conclusion#
In conclusion, TextEncoder provides powerful vectorization for text, but at
the cost of longer computation times and the need for additional dependencies,
such as torch. StringEncoder represents a simpler alternative that can provide
good performance at a fraction of the cost of more complex methods.
Total running time of the script: (3 minutes 39.477 seconds)
Estimated memory usage: 1270 MB