skrub.datasets.fetch_ken_types

skrub.datasets.fetch_ken_types(search=None, *, exclude=None, embedding_table_id='all_entities')

Helper function to search for KEN entity types.

The result can then be used with fetch_ken_embeddings.

Parameters:
search : str, optional

Substring pattern that filters the types of entities.

exclude : str, optional

Substring pattern to exclude from the search.

embedding_table_id : str, default='all_entities'

Table of embedded entities from which to extract the embeddings. Get the supported tables with fetch_ken_table_aliases. It is NOT possible to pass a custom figshare ID.

Returns:
DataFrame

The types of entities containing the substring.

See also

fetch_ken_embeddings

Download Wikipedia embeddings by type.

Notes

Best used in conjunction with fetch_ken_embeddings.

This function requires pyarrow to be installed.

References

For more details, see Cvetkov-Iliev, A., Allauzen, A. & Varoquaux, G.: Relational data embeddings for feature enrichment with background information.

Examples

To get all the existing KEN types of entities:

>>> from skrub.datasets import fetch_ken_types
>>> embedding_types = fetch_ken_types()
>>> embedding_types.head() 
                                                Type
0                 wikicat_italian_male_screenwriters
1  wikicat_21st-century_roman_catholic_archbishop...
2                 wikicat_2000s_romantic_drama_films
3                  wikicat_music_festivals_in_france
4        wikicat_20th-century_american_women_artists

Let’s search for all KEN types with the strings “dance” or “music”:

>>> embedding_filtered_types = fetch_ken_types(search="dance|music") 
>>> embedding_filtered_types.head() 
                                                Type
0                  wikicat_music_festivals_in_france
1  wikicat_films_scored_by_bharadwaj_(music_direc...
2                  wikicat_english_music_journalists
3       wikicat_20th-century_american_male_musicians
4  wikicat_alumni_of_the_london_academy_of_music_...
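
The "dance|music" pattern above matches either word, which suggests that search and exclude behave like regular expressions matched anywhere in the type name. The offline sketch below mimics that filtering on a few type names copied from the outputs above, so it runs without downloading anything. The helper filter_types and the SAMPLE_TYPES list are hypothetical and not part of skrub, and the regular-expression semantics are an assumption drawn from the example rather than a statement of the library's implementation.

```python
import re

# A few type names copied from the example outputs above (not a real download).
SAMPLE_TYPES = [
    "wikicat_italian_male_screenwriters",
    "wikicat_2000s_romantic_drama_films",
    "wikicat_music_festivals_in_france",
    "wikicat_20th-century_american_women_artists",
    "wikicat_english_music_journalists",
    "wikicat_20th-century_american_male_musicians",
]


def filter_types(types, search=None, exclude=None):
    """Hypothetical stand-in for the search/exclude filtering.

    Assumes both patterns are regular expressions matched anywhere in the name.
    """
    kept = list(types)
    if search is not None:
        # Keep only the names that contain the search pattern.
        kept = [t for t in kept if re.search(search, t)]
    if exclude is not None:
        # Drop the names that contain the exclude pattern.
        kept = [t for t in kept if not re.search(exclude, t)]
    return kept


music_types = filter_types(SAMPLE_TYPES, search="dance|music")
music_not_festivals = filter_types(
    SAMPLE_TYPES, search="dance|music", exclude="festivals"
)
```

Under these assumptions, music_types keeps the three names containing "music" (including "musicians"), and passing exclude="festivals" removes the festival entry from that result.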

Examples using skrub.datasets.fetch_ken_types

Wikipedia embeddings to enrich the data
