fetch_ken_embeddings

skrub.datasets.fetch_ken_embeddings(search_types=None, *, exclude=None, embedding_table_id='all_entities', embedding_type_id=None, pca_components=None, suffix='')

Download Wikipedia embeddings by type.

More details on the embeddings can be found on https://soda-inria.github.io/ken_embeddings/.

Parameters:

search_types (str, optional): Substring pattern used to filter entity types. Every entity type containing the substring is kept. Write it in lowercase. If None, all types are kept.
exclude (str, optional): Type of embeddings to exclude from the types search.
embedding_table_id (str, default='all_entities'): Table of embedded entities from which to extract the embeddings. Get the supported tables with fetch_ken_table_aliases. It is also possible to pass a custom figshare ID.
embedding_type_id (str, optional): Figshare ID of the file containing the type of embeddings. Get the supported tables with fetch_ken_types. Ignored unless a custom embedding_table_id is provided.
pca_components (int, optional): Dimension of the space onto which the embeddings are projected by a principal component analysis. If None, the default dimension of the embeddings (200) is kept.
suffix (str, default=''): Suffix to add to the column names of the embeddings.
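To make pca_components and suffix more concrete, here is a minimal sketch on synthetic data. It is not the function's own implementation: the PCA is done with a plain NumPy SVD, the helper pca_project is hypothetical, and the "X0", "X1", ... column naming is an assumption based on the "X199" column shown in the examples of this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: 50 entities, 200 dimensions.
embeddings = rng.normal(size=(50, 200))


def pca_project(X, n_components):
    """Project the rows of X onto its first n_components principal axes."""
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


# Analogous to pca_components=10: keep 10 dimensions instead of 200.
projected = pca_project(embeddings, n_components=10)

# Assumed naming scheme: the suffix is appended to each embedding column.
suffix = "_games"
columns = [f"X{i}{suffix}" for i in range(projected.shape[1])]

print(projected.shape)
print(columns[:3])
```

A smaller pca_components gives a narrower table at the cost of some information; a suffix helps keep column names distinct when several embedding tables are combined.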

Returns:

DataFrame: The embeddings of the Wikipedia entities that match the specified types.

See also

fetch_ken_table_aliases: Get the supported aliases of embedded entities tables.
fetch_ken_types: Helper function to search for entity types.
fuzzy_join: Join two tables (dataframes) based on approximate column matching.
Joiner: Transformer to enrich a given table via one or more fuzzy joins to external resources.

Notes

The files are read and returned in parquet format, so this function needs pyarrow installed to run correctly.
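Since pyarrow is required, it can be useful to check for it before triggering a download. The following is a small sketch using only the standard library; the helper name is_installed is hypothetical and not part of skrub.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("pyarrow"):
    print("pyarrow is missing; install it before calling fetch_ken_embeddings.")
```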

The search_types parameter filters the types by the input string pattern. If the input is “music”, all types containing this string are included (e.g. “wikicat_musician_from_france”, “wikicat_music_label” etc.). Passing an exact type name (e.g. “wikicat_rock_music_bands”) is possible, but the result may be incomplete, as some relevant bands may be listed under other similar types. To explore the available types, use the fetch_ken_types function.
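The substring behaviour described above can be sketched in plain Python. This is an illustration only: filter_types is a hypothetical helper, "wikicat_video_games" is a made-up type name, and treating exclude as a substring match as well is an assumption rather than documented behaviour.

```python
# Type names in the style quoted above; the last one is made up.
types = [
    "wikicat_musician_from_france",
    "wikicat_music_label",
    "wikicat_rock_music_bands",
    "wikicat_video_games",
]


def filter_types(types, search_types=None, exclude=None):
    """Keep types containing search_types, then drop those containing exclude."""
    kept = [t for t in types if search_types is None or search_types in t]
    if exclude is not None:
        kept = [t for t in kept if exclude not in t]
    return kept


# A broad pattern catches every type that mentions music.
print(filter_types(types, search_types="music"))

# An exact type name returns that type only, which may miss related ones.
print(filter_types(types, search_types="wikicat_rock_music_bands"))

# Assumed behaviour of exclude: remove matches from the kept types.
print(filter_types(types, search_types="music", exclude="label"))
```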

References

For more details, see Cvetkov-Iliev, A., Allauzen, A. & Varoquaux, G.: Relational data embeddings for feature enrichment with background information.

Examples

fetch_ken_embeddings allows you to extract embeddings you are interested in. For instance, if we are interested in video games:

>>> from skrub.datasets import fetch_ken_embeddings
>>> games_embedding = fetch_ken_embeddings(search_types="video_games")
>>> games_embedding.head()
                         Entity  ...      X199
0             A_Little_Princess  ...  0.04...
1                 The_Dark_Half  ... -0.00...
2                  Frankenstein  ... -0.11...
3                 Albert_Wesker  ... -0.16...
4  Harukanaru_Toki_no_Naka_de_3  ...  0.14...

This extracts the embeddings of all entities whose type contains “video_games”. For the list of existing types, see fetch_ken_types.

Some tables are available pre-filtered for us using the embedding_table_id parameter:

>>> games_embedding_fast = fetch_ken_embeddings(embedding_table_id="games")
>>> games_embedding_fast.head()
                     Entity  ...      X199
0              R-Type_Delta  ...  0.04...
1  Just_Add_Water_(company)  ... -0.02...
2                 Li_Xiayan  ...  0.00...
3             Vampire_Night  ... -0.14...
4               Shatterhand  ...  0.19...

Loading a pre-filtered table takes less time and is more precise, as its types have been carefully selected. For a list of pre-filtered tables, see fetch_ken_table_aliases.
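Once a table of embeddings is loaded, a typical next step is to attach it to another table by entity name. The sketch below uses an exact pandas merge on toy data as a stand-in; it is not the fuzzy_join or Joiner approach listed under "See also". The embedding values, the sales table, and its column names are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for a fetched table: made-up values, two dimensions only.
embeddings = pd.DataFrame(
    {
        "Entity": ["R-Type_Delta", "Vampire_Night", "Shatterhand"],
        "X0": [0.1, -0.2, 0.3],
        "X1": [0.0, 0.5, -0.4],
    }
)

# Hypothetical table to enrich, keyed by a name column.
sales = pd.DataFrame(
    {
        "Name": ["Shatterhand", "R-Type_Delta", "Unknown_Game"],
        "Units": [10, 20, 30],
    }
)

# A left merge keeps every row of sales; unmatched names get NaN embeddings.
enriched = sales.merge(embeddings, how="left", left_on="Name", right_on="Entity")
print(enriched[["Name", "Units", "X0", "X1"]])
```

An exact merge only matches names that are spelled identically in both tables, which is why approximate matching is often preferable on real data.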

Gallery examples

Wikipedia embeddings to enrich the data
