Various string encoders: a sentiment analysis example#
In this example, we explore the performance of string and categorical encoders available in skrub.
The Toxicity dataset#
We focus on the toxicity dataset, a corpus of 1,000 tweets, evenly balanced between the binary labels “Toxic” and “Not Toxic”. Our goal is to assign each entry one of these two labels, using only the text of the tweets as features.
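Because the two classes are evenly balanced, a predictor that always outputs the same label is right about half the time, which gives a reference point for any accuracy reported later. The sketch below illustrates this with stand-in labels that only mirror the stated 500/500 split; the `labels` list is an assumption for illustration, not the dataset itself.

```python
from collections import Counter

# Stand-in labels mirroring the stated class balance (500 per class).
# The real labels come from the dataset's `is_toxic` column.
labels = ["Toxic"] * 500 + ["Not Toxic"] * 500

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a constant predictor that always outputs the majority label.
baseline_accuracy = majority_count / len(labels)

print(counts)
print(f"constant-prediction accuracy: {baseline_accuracy:.2f}")
```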
When it comes to displaying large chunks of text, the TableReport is especially useful! Click on any cell below to expand and read the tweet in full.
from skrub import TableReport
# X is assumed to be the DataFrame holding the tweet text and the is_toxic label
TableReport(X)
| | text | is_toxic |
---|---|---|
0 | Elon Musk is a piece of shit, greedy capitalist who exploits workers, and offers nothing of real benefit to the world. All he’s done is make a name for himself on the backs of other people, using dirty money from his family’s emerald mine they acquired during apartheid. I don’t care that he’s autistic. He thinks we should be cured with his company’s AI chip. He is not a representation of our community. Don’t celebrate him on this page. | Toxic |
1 | The senile credit card shrill from Delaware needs to resign!! | Toxic |
2 | He does that a lot -- makes everyone look good but him...I guess it's also probably the Dems and the Media that force him to compulsively tweet abject bullshit like a lying bitch. They're tricky, them libs. | Toxic |
3 | F*ck Lizzo | Toxic |
4 | Epstein and trump were best buds!!! Pedophiles who play together!! | Toxic |
995 | My maternal abuelita taught me how to make plantain empanadas 🥺 and my paternal abuelita needed me to help her brush her dentures 😌 I love them so much 🥰 | Not Toxic |
996 | Funnily enough I was looking online last week and wondering why nobody has opened an eSports/Gaming bar round here. Can’t wait to pop in at some point :) | Not Toxic |
997 | I can't bear how nice this is. I guess its bearnessities. I'll see my self out | Not Toxic |
998 | Going to buy a share of Tesla just to ensure it starts going back down | Not Toxic |
999 | I only saw a couple of these throughout the month and tried to figure out what all of them were. Only ones I missed were Star Guardian Seraphine (thought it was Heartbreaker) and I couldn't figure out the 2nd Soraka was Victorious. So all in all, you did a really good job nailing the characters and the theme presented! I think my faves are KDA Neeko and CCN Xayah. | Not Toxic |
text
ObjectDType
- Null values: 0 (0.0%)
- Unique values: 999 (99.9%)
Most frequent values
Post was funny, but this took it to another level.
I’m 70 and I agree.
Anyone have this gif without the text?
He is definitely a maggot...
I only saw a couple of these throughout the month and tried to figure out what all of them were. Only ones I missed were Star Guardian Seraphine (thought it was Heartbreaker) and I couldn't figure out the 2nd Soraka was Victorious. So all in all, you did a really good job nailing the characters and the theme presented! I think my faves are KDA Neeko and CCN Xayah.
Elon Musk is a piece of shit, greedy capitalist who exploits workers, and offers nothing of real benefit to the world. All he’s done is make a name for himself on the backs of other people, using dirty money from his family’s emerald mine they acquired during apartheid. I don’t care that he’s autistic. He thinks we should be cured with his company’s AI chip. He is not a representation of our community. Don’t celebrate him on this page.
The senile credit card shrill from Delaware needs to resign!!
WE MANAGED TO FIND AN ASSHOLE WHO'S A BIGGER SCUMBAG THAN CUOMO!
After all the leftist said and did to DT, you can take your disgraceful comment and shove it up your ***
This is for all you Crybaby Transpo-People , NYC Police and NYC Firefighters and Cry Baby I Think I'm all that and then some Airline Pilots ( # 1 you drive a BUS --lets get that straight ) 1 Weeks of training and Any Sesna Pilot can fly a Jet --Heck Even John Travolta and Indian Jones can Fly a Jet FOR REAL they can ...with that being said " GET Covid Vaccinated - It's the LAW " !!!You ALL NEED to Act like Grown Adults not a premature Sissy La La - GET the VACCINE and Like it..You Voted for Biden now Reap the Whirl-Wind of your actions ..Rough Travel you say ,,I say SORRY I don't wanna Fly Covid Airlines or I'm a lil Drunk Airlines ...
is_toxic
ObjectDType
- Null values: 0 (0.0%)
- Unique values: 2 (0.2%)
Most frequent values
Toxic
Not Toxic
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | text | ObjectDType | 0 (0.0%) | 999 (99.9%) | | | | | |
1 | is_toxic | ObjectDType | 0 (0.0%) | 2 (0.2%) | | | | | |
Column 1 | Column 2 | Cramér's V |
---|---|---|
text | is_toxic | 0.100 |
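Cramér's V measures the strength of association between two categorical variables on a scale from 0 (no association) to 1 (perfect association). As a minimal sketch of the textbook definition, the function below computes it from a contingency table of counts. The helper `cramers_v` and the small tables are illustrative assumptions; how the report groups or bins values before computing the statistic is not shown here, so this is not necessarily the exact procedure behind the number above.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes every row and every column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-squared statistic: compare observed and expected counts.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalize by the sample size and the smaller table dimension minus one.
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


perfect = cramers_v([[50, 0], [0, 50]])        # labels fully determined
independent = cramers_v([[25, 25], [25, 25]])  # no relationship at all
print(perfect, independent)
```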
GapEncoder#
First, let’s vectorize our text column using the GapEncoder, one of the high cardinality categorical encoders provided by skrub.
As introduced in the previous example, the GapEncoder performs matrix factorization for topic modeling. It builds latent topics by capturing combinations of substrings that frequently co-occur, and the encoded vectors correspond to topic activations.
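To make "substrings that frequently co-occur" more concrete, here is a minimal sketch that counts overlapping character n-grams in two related words. The helper `char_ngrams` and the choice of `n=3` are assumptions for illustration only; they are not the GapEncoder's internal code or its actual n-gram settings.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Count n-grams across two words that share most of their substrings.
counts = Counter()
for word in ["government", "governments"]:
    counts.update(char_ngrams(word))

# N-grams that appear in both words.
shared = [gram for gram, count in counts.items() if count >= 2]
print(shared)
```

Words that share most of their n-grams produce similar count vectors, which is one reason near-duplicates such as "government" and "governments" can end up describing the same topic.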
To interpret these latent topics, we select for each of them a few labels from the input data with the highest activations. In the example below we select 3 labels to summarize each topic.
from skrub import GapEncoder
gap = GapEncoder(n_components=30)
X_trans = gap.fit_transform(X["text"])
# Add the original text as a first column
X_trans.insert(0, "text", X["text"])
TableReport(X_trans)
| | text | text: better, tears, enjoy | text: governments, government, destroying | text: legitimate, legitimacy, keywords | text: qkcuk6, awwwww, nxhcplzrecw | text: pseudoscience, yourselves, ourselves | text: pedophiles, approached, pedophile | text: previously, lackluster, survivability | text: congratulations, financial, healthcare | text: emblazoned, gargantuan, repainting | text: documentary, lawsuits, ƞỉဌဌᕦѓ | text: qualified, marxists, blackness | text: suicide, joking, quirky | text: thought, though, xayah | text: conservatives, consequences, indoctrination | text: unrelentless, relentlessly, intentionally | text: illegally, economy, hometown | text: congress, blackmailed, soooooo | text: investigation, libcunts, campaign | text: tarrasques, icingdeath, drizzt | text: beautiful, beautifuls, melancholy | text: importantly, exterminates, activities | text: expressing, unhappy, pressured | text: terrorism, terrorist, sharpton | text: ridiculous, absolutely, absolute | text: spiraled, maranzano, corleone | text: goinggoinggoing, productive, recharging | text: administration, involved, traitorous | text: definitely, constituents, considerable | text: liberalhypocrisy, politicalmemes, fucksocialism | text: departments, employees, deteriorated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Elon Musk is a piece of shit, greedy capitalist who exploits workers, and offers nothing of real benefit to the world. All he’s done is make a name for himself on the backs of other people, using dirty money from his family’s emerald mine they acquired during apartheid. I don’t care that he’s autistic. He thinks we should be cured with his company’s AI chip. He is not a representation of our community. Don’t celebrate him on this page. | 0.20487707402066938 | 66.51400760580285 | 194.41214715853206 | 0.13257213895914916 | 0.32923233500680393 | 3.4857291241186905 | 1.0182520147185214 | 4.4132544718113165 | 7.132507358399246 | 0.29563006575061657 | 2.315487801177916 | 15.358056412371871 | 0.14469801864743698 | 30.898860507310857 | 36.149547070027616 | 27.308213202273073 | 17.126549270036453 | 0.6772141764625574 | 0.5548947752548391 | 3.24733262642551 | 0.706168789311077 | 0.5346757759295748 | 27.79659259703832 | 11.513450146253865 | 41.54586682260514 | 3.4292470015262975 | 6.1791237122631095 | 63.758489300797976 | 0.17264825355332594 | 89.64467267951369 |
1 | The senile credit card shrill from Delaware needs to resign!! | 0.6507922834141054 | 0.23398265793286926 | 4.2581418829970685 | 0.49662020944799773 | 0.28187268789360875 | 4.894167379185239 | 0.6471777128791791 | 1.4297931460136974 | 1.10772932146673 | 3.5347983662979052 | 0.49428951656202075 | 1.1218633396954618 | 0.3880686312157403 | 1.7085110641654901 | 0.3532647457832634 | 0.1556783122464315 | 7.547571372588799 | 10.06160523801411 | 0.678157354293835 | 0.34020802731284233 | 42.20408368163191 | 1.1641415514001676 | 0.14708758883464113 | 3.5577071416196864 | 0.537769215608751 | 0.1976855630050288 | 0.150497475726663 | 0.20879479971318426 | 0.13903164292271 | 1.3089068465847105 |
2 | He does that a lot -- makes everyone look good but him...I guess it's also probably the Dems and the Media that force him to compulsively tweet abject bullshit like a lying bitch. They're tricky, them libs. | 0.2807588504450278 | 5.148199850942162 | 0.5669228716725552 | 0.16041298264649392 | 0.2929732063163021 | 0.20682425560628262 | 20.614710981406382 | 0.1290497863716711 | 7.841979967193057 | 0.12988907583985546 | 0.324768629424562 | 106.50711135136874 | 0.16641826978387037 | 0.6675076190267866 | 0.516185768398228 | 6.036325715039121 | 1.030719386367655 | 9.742059887627313 | 5.144718362454984 | 5.639591078337295 | 36.36122984418417 | 30.74612048329125 | 1.23811762998521 | 8.717855557834373 | 0.22114447873343787 | 2.0976796775201656 | 24.793832643147827 | 23.896409627766456 | 0.30486688564296244 | 7.9756134741201254 |
3 | F*ck Lizzo | 0.10538131526451576 | 0.08476417108892823 | 0.0612937382310121 | 7.918582130585352 | 0.08993835762452396 | 0.07292373250795166 | 0.1071718664533843 | 0.06635655561934549 | 0.13742228928688827 | 0.13800063023342557 | 0.09546244590484787 | 0.9967391356137169 | 0.08241479028946849 | 0.06614749185975935 | 0.09138866556405345 | 0.07588858695895181 | 0.0966892460818606 | 0.12475958429632418 | 1.4301390711318707 | 0.0679831182631348 | 0.08082008977198359 | 0.05711248056562944 | 0.05761466511487959 | 0.16704709309808305 | 0.06854682120637402 | 0.2078669244517348 | 0.09094055867963194 | 0.061022814521784895 | 0.7388537226442164 | 0.06072696647574033 |
4 | Epstein and trump were best buds!!! Pedophiles who play together!! | 0.5780167614443014 | 0.5922409442891932 | 0.12133182556176199 | 2.076814693888415 | 8.483333790095077 | 54.925645419293126 | 0.17260408604936442 | 0.10163675125573536 | 0.7421456243071587 | 0.13697057007844385 | 1.4092375201662606 | 0.10246436413403107 | 3.736442116798167 | 0.1400375203206744 | 0.14641000435726037 | 0.19960207483504766 | 3.359929519710686 | 0.08700781356483867 | 6.512284934088411 | 1.3119018659287554 | 0.2942682500765905 | 0.1065486193253424 | 0.1440707875709255 | 0.1564646344512262 | 0.103717264728697 | 0.1700764543288445 | 0.17119609838841526 | 0.2325575276493633 | 8.556407145377417 | 2.62863393804147 |
995 | My maternal abuelita taught me how to make plantain empanadas 🥺 and my paternal abuelita needed me to help her brush her dentures 😌 I love them so much 🥰 | 52.904999251210775 | 6.011620160639497 | 3.108221737928149 | 0.18898843820532826 | 0.25436435151213665 | 28.559123037333414 | 9.477997228207919 | 0.36621718229845485 | 4.481038932952624 | 0.1533008515540044 | 0.4042974733046432 | 0.1874350434759499 | 17.0225807275087 | 0.618396710208841 | 20.459120235116607 | 0.20178685243716557 | 0.19681830952584095 | 0.2231986956722892 | 15.809827553643068 | 22.577357229754803 | 18.430206906227742 | 16.031498386725673 | 0.15175816383324717 | 0.3271348026433227 | 0.2629961727017036 | 0.23281076686727292 | 0.32622120714848957 | 0.6051663386889193 | 0.23936959089846874 | 8.18614573743057 |
996 | Funnily enough I was looking online last week and wondering why nobody has opened an eSports/Gaming bar round here. Can’t wait to pop in at some point :) | 0.13599087380930694 | 4.464673505486932 | 0.10650519905907391 | 0.13698888555808691 | 0.24451386018962212 | 3.3072688348526933 | 2.843785843640717 | 0.23866521174531563 | 0.5042785046893428 | 0.6735714359407274 | 1.5642672069978953 | 20.92690506516233 | 51.27222821584647 | 0.6817664766824609 | 26.355580297832834 | 0.36282456005644725 | 13.47669018211348 | 0.287721336002448 | 0.8762668376442926 | 0.5643766011444453 | 6.486666758902317 | 0.30026465828125204 | 2.6764790084859573 | 8.086482067073998 | 2.378991935227819 | 49.46729855735972 | 0.29171696740165015 | 6.816602855191974 | 0.13271144157957115 | 22.337915320766832 |
997 | I can't bear how nice this is. I guess its bearnessities. I'll see my self out | 2.1063479987208993 | 0.18588560367049248 | 0.19663236225523914 | 0.11544507102360967 | 8.237412935105183 | 0.19660751441376317 | 23.14102827371346 | 0.15129725495684093 | 0.12941090564288119 | 14.234847961260133 | 0.520933640084734 | 0.11131332109082681 | 0.10812579837644551 | 0.1395006195049655 | 7.0452212594010994 | 10.35388831855344 | 3.4325276683782464 | 0.17436781388327816 | 2.643604788537644 | 25.35697856103536 | 0.12374507496789308 | 0.38673019180213974 | 0.131422237599987 | 15.267152864309336 | 0.12711266894241607 | 0.1757152099074506 | 0.1461453441811395 | 0.148377150025054 | 0.09154917361577095 | 0.3206711969899611 |
998 | Going to buy a share of Tesla just to ensure it starts going back down | 0.16751773954149496 | 0.26290162920201543 | 0.13606818711814272 | 0.11129560821531442 | 0.11340504276111685 | 0.23305270322462623 | 13.438388033300292 | 0.13812615426290434 | 18.207160450995456 | 0.10445248569877212 | 0.6774496543626161 | 17.636488368142892 | 0.24837544700820108 | 1.0758278262578347 | 0.2073866043484381 | 8.688135736827652 | 0.20616947763204307 | 0.10306413073536336 | 1.9472233494455673 | 2.172877791230051 | 0.15519990582207638 | 3.0074271426676464 | 9.739618564507989 | 0.4063439502776392 | 10.290643833756278 | 8.60782544064041 | 0.10780409108932447 | 3.785421051738413 | 0.08743093181550451 | 1.4369175872359223 |
999 | I only saw a couple of these throughout the month and tried to figure out what all of them were. Only ones I missed were Star Guardian Seraphine (thought it was Heartbreaker) and I couldn't figure out the 2nd Soraka was Victorious. So all in all, you did a really good job nailing the characters and the theme presented! I think my faves are KDA Neeko and CCN Xayah. | 0.6969362328803699 | 4.093066978091479 | 5.458626983640971 | 0.1343015374604244 | 6.472983624631346 | 19.002199193351224 | 27.88045074813128 | 12.891887362488868 | 35.148291323633856 | 32.35533438388597 | 28.244570155214394 | 2.8450447580440086 | 146.07541375914536 | 5.313340606145776 | 0.8920131798244387 | 0.27581420649492244 | 52.66999466441927 | 0.29078589937184623 | 47.77571271327902 | 38.430208215824784 | 0.20530294427879656 | 4.123030038352285 | 0.23192995248443457 | 11.815913699805796 | 5.888842565084842 | 34.36142092317045 | 4.616771365978665 | 18.024479898447535 | 0.11300554169262177 | 1.1723250777125178 |
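Each column name in the table above summarizes a topic by the three labels with the highest activations. A minimal sketch of that selection step follows; the `top_labels` helper and the activation values are hypothetical and chosen only to illustrate the ranking, not taken from the fitted encoder.

```python
def top_labels(activations, labels, n_labels=3):
    """For each topic (column), return the labels with the largest activations."""
    n_topics = len(activations[0])
    summaries = []
    for topic in range(n_topics):
        # Rank label indices by their activation on this topic, highest first.
        ranked = sorted(
            range(len(labels)),
            key=lambda i: activations[i][topic],
            reverse=True,
        )
        summaries.append([labels[i] for i in ranked[:n_labels]])
    return summaries


# Hypothetical activations: one row per candidate label, one column per topic.
labels = ["better", "tears", "enjoy", "government", "governments", "destroying"]
activations = [
    [0.90, 0.00],
    [0.80, 0.10],
    [0.70, 0.00],
    [0.10, 0.90],
    [0.00, 0.95],
    [0.20, 0.60],
]

summaries = top_labels(activations, labels)
for topic_summary in summaries:
    print("text: " + ", ".join(topic_summary))
```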
Column | dtype | Null values | Unique values | Mean ± Std | Median ± IQR | Min | Max |
---|---|---|---|---|---|---|---|
text: better, tears, enjoy | Float64DType | 0 (0.0%) | 999 (99.9%) | 3.41 ± 11.4 | 0.185 ± 0.534 | 0.0547 | 131. |
text: governments, government, destroying | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.33 ± 83.5 | 0.563 ± 5.22 | 0.0501 | 2.59e+03 |
text: legitimate, legitimacy, keywords | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.03 ± 26.7 | 0.347 ± 3.50 | 0.0507 | 717. |
text: qkcuk6, awwwww, nxhcplzrecw | Float64DType | 0 (0.0%) | 999 (99.9%) | 3.57 ± 17.8 | 0.138 ± 0.195 | 0.0529 | 418. |
text: pseudoscience, yourselves, ourselves | Float64DType | 0 (0.0%) | 999 (99.9%) | 11.7 ± 90.8 | 0.360 ± 6.93 | 0.0502 | 2.79e+03 |
text: pedophiles, approached, pedophile | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.47 ± 42.7 | 0.447 ± 3.91 | 0.0503 | 1.30e+03 |
text: previously, lackluster, survivability | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.56 ± 90.1 | 0.537 ± 5.66 | 0.0502 | 2.81e+03 |
text: congratulations, financial, healthcare | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.36 ± 22.3 | 0.258 ± 1.77 | 0.0513 | 457. |
text: emblazoned, gargantuan, repainting | Float64DType | 0 (0.0%) | 999 (99.9%) | 8.79 ± 72.9 | 0.655 ± 5.70 | 0.0502 | 2.27e+03 |
text: documentary, lawsuits, ƞỉဌဌᕦѓ | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.86 ± 18.9 | 0.278 ± 2.55 | 0.0523 | 395. |
text: qualified, marxists, blackness | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.36 ± 48.5 | 0.577 ± 4.13 | 0.0504 | 1.48e+03 |
text: suicide, joking, quirky | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.41 ± 41.4 | 1.07 ± 9.98 | 0.0503 | 1.22e+03 |
text: thought, though, xayah | Float64DType | 0 (0.0%) | 999 (99.9%) | 4.99 ± 15.6 | 0.164 ± 0.806 | 0.0524 | 152. |
text: conservatives, consequences, indoctrination | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.70 ± 37.6 | 0.433 ± 4.78 | 0.0504 | 1.09e+03 |
text: unrelentless, relentlessly, intentionally | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.33 ± 58.6 | 0.852 ± 7.57 | 0.0503 | 1.79e+03 |
text: illegally, economy, hometown | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.12 ± 16.3 | 0.232 ± 1.34 | 0.0523 | 267. |
text: congress, blackmailed, soooooo | Float64DType | 0 (0.0%) | 999 (99.9%) | 8.15 ± 69.4 | 0.718 ± 5.06 | 0.0502 | 2.16e+03 |
text: investigation, libcunts, campaign | Float64DType | 0 (0.0%) | 999 (99.9%) | 4.18 ± 14.7 | 0.171 ± 0.668 | 0.0523 | 227. |
text: tarrasques, icingdeath, drizzt | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.11 ± 18.7 | 0.272 ± 2.32 | 0.0510 | 386. |
text: beautiful, beautifuls, melancholy | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.21 ± 48.8 | 0.611 ± 7.77 | 0.0552 | 1.45e+03 |
text: importantly, exterminates, activities | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.14 ± 27.8 | 0.299 ± 3.25 | 0.0519 | 694. |
text: expressing, unhappy, pressured | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.01 ± 21.7 | 0.309 ± 2.99 | 0.0510 | 488. |
text: terrorism, terrorist, sharpton | Float64DType | 0 (0.0%) | 999 (99.9%) | 4.25 ± 16.2 | 0.245 ± 1.61 | 0.0519 | 299. |
text: ridiculous, absolutely, absolute | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.61 ± 24.6 | 0.328 ± 3.21 | 0.0508 | 593. |
text: spiraled, maranzano, corleone | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.58 ± 22.4 | 0.315 ± 2.95 | 0.0511 | 501. |
text: goinggoinggoing, productive, recharging | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.57 ± 27.5 | 0.419 ± 4.46 | 0.0523 | 679. |
text: administration, involved, traitorous | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.07 ± 22.4 | 0.334 ± 3.00 | 0.0527 | 535. |
text: definitely, constituents, considerable | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.54 ± 41.4 | 0.546 ± 4.66 | 0.0504 | 1.23e+03 |
text: liberalhypocrisy, politicalmemes, fucksocialism | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.18 ± 28.2 | 0.115 ± 0.263 | 0.0508 | 469. |
text: departments, employees, deteriorated | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.36 ± 27.4 | 0.315 ± 2.48 | 0.0506 | 771. |

Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | text | ObjectDType | 0 (0.0%) | 999 (99.9%) | | | | | |
1 | text: better, tears, enjoy | Float64DType | 0 (0.0%) | 999 (99.9%) | 3.41 | 11.4 | 0.0547 | 0.185 | 131. |
2 | text: governments, government, destroying | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.33 | 83.5 | 0.0501 | 0.563 | 2.59e+03 |
3 | text: legitimate, legitimacy, keywords | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.03 | 26.7 | 0.0507 | 0.347 | 717. |
4 | text: qkcuk6, awwwww, nxhcplzrecw | Float64DType | 0 (0.0%) | 999 (99.9%) | 3.57 | 17.8 | 0.0529 | 0.138 | 418. |
5 | text: pseudoscience, yourselves, ourselves | Float64DType | 0 (0.0%) | 999 (99.9%) | 11.7 | 90.8 | 0.0502 | 0.360 | 2.79e+03 |
6 | text: pedophiles, approached, pedophile | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.47 | 42.7 | 0.0503 | 0.447 | 1.30e+03 |
7 | text: previously, lackluster, survivability | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.56 | 90.1 | 0.0502 | 0.537 | 2.81e+03 |
8 | text: congratulations, financial, healthcare | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.36 | 22.3 | 0.0513 | 0.258 | 457. |
9 | text: emblazoned, gargantuan, repainting | Float64DType | 0 (0.0%) | 999 (99.9%) | 8.79 | 72.9 | 0.0502 | 0.655 | 2.27e+03 |
10 | text: documentary, lawsuits, ƞỉဌဌᕦѓ | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.86 | 18.9 | 0.0523 | 0.278 | 395. |
11 | text: qualified, marxists, blackness | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.36 | 48.5 | 0.0504 | 0.577 | 1.48e+03 |
12 | text: suicide, joking, quirky | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.41 | 41.4 | 0.0503 | 1.07 | 1.22e+03 |
13 | text: thought, though, xayah | Float64DType | 0 (0.0%) | 999 (99.9%) | 4.99 | 15.6 | 0.0524 | 0.164 | 152. |
14 | text: conservatives, consequences, indoctrination | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.70 | 37.6 | 0.0504 | 0.433 | 1.09e+03 |
15 | text: unrelentless, relentlessly, intentionally | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.33 | 58.6 | 0.0503 | 0.852 | 1.79e+03 |
16 | text: illegally, economy, hometown | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.12 | 16.3 | 0.0523 | 0.232 | 267. |
17 | text: congress, blackmailed, soooooo | Float64DType | 0 (0.0%) | 999 (99.9%) | 8.15 | 69.4 | 0.0502 | 0.718 | 2.16e+03 |
18 | text: investigation, libcunts, campaign | Float64DType | 0 (0.0%) | 999 (99.9%) | 4.18 | 14.7 | 0.0523 | 0.171 | 227. |
19 | text: tarrasques, icingdeath, drizzt | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.11 | 18.7 | 0.0510 | 0.272 | 386. |
20 | text: beautiful, beautifuls, melancholy | Float64DType | 0 (0.0%) | 999 (99.9%) | 9.21 | 48.8 | 0.0552 | 0.611 | 1.45e+03 |
21 | text: importantly, exterminates, activities | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.14 | 27.8 | 0.0519 | 0.299 | 694. |
22 | text: expressing, unhappy, pressured | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.01 | 21.7 | 0.0510 | 0.309 | 488. |
23 | text: terrorism, terrorist, sharpton | Float64DType | 0 (0.0%) | 999 (99.9%) | 4.25 | 16.2 | 0.0519 | 0.245 | 299. |
24 | text: ridiculous, absolutely, absolute | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.61 | 24.6 | 0.0508 | 0.328 | 593. |
25 | text: spiraled, maranzano, corleone | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.58 | 22.4 | 0.0511 | 0.315 | 501. |
26 | text: goinggoinggoing, productive, recharging | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.57 | 27.5 | 0.0523 | 0.419 | 679. |
27 | text: administration, involved, traitorous | Float64DType | 0 (0.0%) | 999 (99.9%) | 6.07 | 22.4 | 0.0527 | 0.334 | 535. |
28 | text: definitely, constituents, considerable | Float64DType | 0 (0.0%) | 999 (99.9%) | 7.54 | 41.4 | 0.0504 | 0.546 | 1.23e+03 |
29 | text: liberalhypocrisy, politicalmemes, fucksocialism | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.18 | 28.2 | 0.0508 | 0.115 | 469. |
30 | text: departments, employees, deteriorated | Float64DType | 0 (0.0%) | 999 (99.9%) | 5.36 | 27.4 | 0.0506 | 0.315 | 771. |
Column 1 | Column 2 | Cramér's V |
---|---|---|
text | text: emblazoned, gargantuan, repainting | 1.00 |
text | text: suicide, joking, quirky | 1.00 |
text: spiraled, maranzano, corleone | text: definitely, constituents, considerable | 0.707 |
text: pedophiles, approached, pedophile | text: illegally, economy, hometown | 0.500 |
text: legitimate, legitimacy, keywords | text: definitely, constituents, considerable | 0.500 |
text: legitimate, legitimacy, keywords | text: spiraled, maranzano, corleone | 0.408 |
text: tarrasques, icingdeath, drizzt | text: beautiful, beautifuls, melancholy | 0.408 |
text | text: liberalhypocrisy, politicalmemes, fucksocialism | 0.378 |
text: thought, though, xayah | text: beautiful, beautifuls, melancholy | 0.353 |
text: conservatives, consequences, indoctrination | text: illegally, economy, hometown | 0.353 |
text | text: investigation, libcunts, campaign | 0.289 |
text: legitimate, legitimacy, keywords | text: conservatives, consequences, indoctrination | 0.288 |
text: better, tears, enjoy | text: importantly, exterminates, activities | 0.275 |
text: documentary, lawsuits, ƞỉဌဌᕦѓ | text: beautiful, beautifuls, melancholy | 0.266 |
text: legitimate, legitimacy, keywords | text: congratulations, financial, healthcare | 0.254 |
text: thought, though, xayah | text: tarrasques, icingdeath, drizzt | 0.192 |
text | text: thought, though, xayah | 0.189 |
text: legitimate, legitimacy, keywords | text: departments, employees, deteriorated | 0.181 |
text: better, tears, enjoy | text: qkcuk6, awwwww, nxhcplzrecw | 0.178 |
text: congratulations, financial, healthcare | text: definitely, constituents, considerable | 0.170 |
We can use a heatmap to highlight the highest activations, making them more visible for comparison against the original text and vectors above.
import numpy as np
from matplotlib import pyplot as plt


def plot_gap_feature_importance(X_trans):
    x_samples = X_trans.pop("text")
    # We slightly format the topics and labels for them to fit on the plot.
    topic_labels = [x.replace("text: ", "") for x in X_trans.columns]
    labels = x_samples.str[:50].values + "..."
    # We clip large outliers to make activations more visible.
    X_trans = np.clip(X_trans, a_min=None, a_max=200)
    plt.figure(figsize=(10, 10), dpi=200)
    plt.imshow(X_trans.T)
    plt.yticks(
        range(len(topic_labels)),
        labels=topic_labels,
        ha="right",
        size=12,
    )
    plt.xticks(range(len(labels)), labels=labels, size=12, rotation=50, ha="right")
    plt.colorbar().set_label(label="Topic activations", size=13)
    plt.ylabel("Latent topics", size=14)
    plt.xlabel("Data entries", size=14)
    plt.tight_layout()
    plt.show()


plot_gap_feature_importance(X_trans.head())
Now that we have an understanding of the vectors produced by the GapEncoder, let’s evaluate its performance in toxicity classification. The GapEncoder excels at handling categorical columns with high cardinality, but here the column consists of free-form text. Sentences are generally longer, with more unique ngrams than high cardinality categories.
To benchmark the performance of the GapEncoder on the toxicity dataset, we integrate it into a TableVectorizer, as introduced in the previous example, and create a Pipeline by appending a HistGradientBoostingClassifier, which consumes the vectors produced by the GapEncoder.
We set n_components to 30; however, to achieve the best performance, we would need to find the optimal value for this hyperparameter using either GridSearchCV or RandomizedSearchCV. We skip this part to keep the computation time for this example small.
Recall that the ROC AUC is a metric that quantifies the ranking power of estimators: a random estimator scores 0.5, and an oracle, which provides perfect predictions, scores 1.
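To make the "ranking power" interpretation concrete, the ROC AUC can be computed as the fraction of (positive, negative) pairs in which the positive entry receives the higher score, with ties counted as one half. The function below is a small pure-Python sketch of that definition for illustration; the name roc_auc and the toy labels and scores are made up here, and in practice scoring="roc_auc" in scikit-learn does this work.

```python
from itertools import product


def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.2, 0.1]))  # every positive ranked first: 1.0
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores: 0.5
```

Because only the ordering of the scores matters, the metric is unaffected by any monotonic rescaling of the classifier outputs.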
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

from skrub import TableVectorizer


def plot_box_results(named_results):
    fig, ax = plt.subplots()
    names, scores = zip(
        *[(name, result["test_score"]) for name, result in named_results]
    )
    ax.boxplot(scores)
    ax.set_xticks(range(1, len(names) + 1), labels=list(names), size=12)
    ax.set_ylabel("ROC AUC", size=14)
    plt.title(
        "AUC distribution across folds (higher is better)",
        size=14,
    )
    plt.show()


results = []

y = X.pop("is_toxic").map({"Toxic": 1, "Not Toxic": 0})

gap_pipe = make_pipeline(
    TableVectorizer(high_cardinality=GapEncoder(n_components=30)),
    HistGradientBoostingClassifier(),
)
gap_results = cross_validate(gap_pipe, X, y, scoring="roc_auc")
results.append(("GapEncoder", gap_results))

plot_box_results(results)
MinHashEncoder#
We now compare these results with the MinHashEncoder, which is faster and produces vectors better suited for tree-based estimators like HistGradientBoostingClassifier. To do this, we can simply replace the GapEncoder with the MinHashEncoder in the previous pipeline using set_params().
from sklearn.base import clone

from skrub import MinHashEncoder

minhash_pipe = clone(gap_pipe).set_params(
    **{"tablevectorizer__high_cardinality": MinHashEncoder(n_components=30)}
)
minhash_results = cross_validate(minhash_pipe, X, y, scoring="roc_auc")
results.append(("MinHashEncoder", minhash_results))

plot_box_results(results)
Remarkably, the vectors produced by the MinHashEncoder offer less predictive power than those from the GapEncoder on this dataset.
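For some intuition about what a min-hash encoding captures, here is a toy sketch of the general idea: split a string into character n-grams, hash each n-gram with several differently salted hash functions, and keep the minimum per hash function. This is not skrub's implementation; the helper names, the choice of blake2b, the padding and the n-gram size of 3 are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of a lowercased, padded string."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(max(1, len(padded) - n + 1))}


def salted_hash(ngram, seed):
    """Deterministic 64-bit hash of an n-gram, salted with an integer seed."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text, n_components=30, n=3):
    """One component per salted hash function: the minimum hash over all n-grams."""
    ngrams = char_ngrams(text, n)
    return [min(salted_hash(g, seed) for g in ngrams) for seed in range(n_components)]


def estimated_similarity(a, b, n_components=30):
    """Share of matching components, which estimates n-gram set overlap (Jaccard)."""
    va, vb = minhash(a, n_components), minhash(b, n_components)
    return sum(x == y for x, y in zip(va, vb)) / n_components


print(estimated_similarity("government", "governments"))
print(estimated_similarity("government", "xyzzy"))
```

Strings that share many n-grams agree on many components, while each component taken alone is an arbitrary hash value. Features of this kind can be split on by decision trees but carry little meaning for models that rely on distances or linear combinations.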
TextEncoder#
Let’s now shift our focus to pre-trained deep learning encoders. Our previous encoders are syntactic models that we trained directly on the toxicity dataset. To generate more powerful vector representations for free-form text and diverse entries, we can instead use semantic models, such as BERT, which have been trained on very large datasets.
TextEncoder enables you to integrate any Sentence Transformer model from the Hugging Face Hub (or from your local disk) into your Pipeline to transform a text column in a dataframe. By default, TextEncoder uses the e5-small-v2 model.
from skrub import TextEncoder

text_encoder = TextEncoder(
    "sentence-transformers/paraphrase-albert-small-v2",
    device="cpu",
)
text_encoder_pipe = clone(gap_pipe).set_params(
    **{"tablevectorizer__high_cardinality": text_encoder}
)
text_encoder_results = cross_validate(text_encoder_pipe, X, y, scoring="roc_auc")
results.append(("TextEncoder", text_encoder_results))

plot_box_results(results)
The performance of the TextEncoder is significantly stronger than that of the syntactic encoders, which is expected. But how long does it take to load and vectorize text on a CPU using a Sentence Transformer model? Below, we display the tradeoff between predictive accuracy and training time. Note that since we are not training the Sentence Transformer model, the “fitting time” refers to the time taken for vectorization.
def plot_performance_tradeoff(results):
    fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
    markers = ["s", "o", "^"]
    for idx, (name, result) in enumerate(results):
        ax.scatter(
            result["fit_time"],
            result["test_score"],
            label=name,
            marker=markers[idx],
        )
        mean_fit_time = np.mean(result["fit_time"])
        mean_score = np.mean(result["test_score"])
        ax.scatter(
            mean_fit_time,
            mean_score,
            color="k",
            marker=markers[idx],
        )
        std_fit_time = np.std(result["fit_time"])
        std_score = np.std(result["test_score"])
        ax.errorbar(
            x=mean_fit_time,
            y=mean_score,
            yerr=std_score,
            fmt="none",
            c="k",
            capsize=2,
        )
        ax.errorbar(
            x=mean_fit_time,
            y=mean_score,
            xerr=std_fit_time,
            fmt="none",
            c="k",
            capsize=2,
        )

    ax.set_xlabel("Time to fit (seconds)")
    ax.set_ylabel("ROC AUC")
    ax.set_title("Prediction performance / training time trade-off")

    ax.annotate(
        "",
        xy=(1.5, 0.98),
        xytext=(8.5, 0.90),
        arrowprops=dict(arrowstyle="->", mutation_scale=15),
    )
    ax.text(8, 0.86, "Best time / \nperformance trade-off")
    ax.legend(bbox_to_anchor=(1, 0.3))
    plt.show()


plot_performance_tradeoff(results)
The black points represent the average fit time and AUC for each vectorizer, and the error bars extend one standard deviation on each side of the mean.
The green outlier dot on the right side of the plot corresponds to the first time the Sentence Transformers model was downloaded and loaded into memory. During the subsequent cross-validation iterations, the model is simply copied, which reduces computation time for the remaining folds.
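The quantities drawn in the plot can also be collected as plain numbers. The sketch below assumes results has the same structure as in this example, a list of (name, cross-validation result) pairs with "fit_time" and "test_score" entries. The helper name summarize and the toy values are made up for illustration and are not results from this example.

```python
from statistics import fmean, pstdev


def summarize(results):
    """Mean and standard deviation of fit time and score for each encoder."""
    summary = {}
    for name, result in results:
        summary[name] = {
            "mean_fit_time": fmean(result["fit_time"]),
            # pstdev is the population standard deviation, like np.std's default.
            "std_fit_time": pstdev(result["fit_time"]),
            "mean_score": fmean(result["test_score"]),
            "std_score": pstdev(result["test_score"]),
        }
    return summary


# Made-up numbers for illustration only.
toy_results = [
    ("EncoderA", {"fit_time": [1.0, 2.0, 3.0], "test_score": [0.70, 0.80, 0.90]}),
]

for name, stats in summarize(toy_results).items():
    print(name, stats)
```

A single slow fold, such as the one that includes the model download, inflates both the mean and the standard deviation of the fit time, so a median may be a more robust summary in that situation.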
Conclusion#
In conclusion, TextEncoder provides powerful vectorization for text, but at the cost of longer computation times and the need for additional dependencies, such as torch.
Total running time of the script: (3 minutes 51.226 seconds)