fetch_ken_embeddings
- skrub.datasets.fetch_ken_embeddings(search_types=None, *, exclude=None, embedding_table_id='all_entities', embedding_type_id=None, pca_components=None, suffix='')
Download Wikipedia embeddings by type.
More details on the embeddings can be found on https://soda-inria.github.io/ken_embeddings/.
- Parameters:
- search_types : str, optional
  Substring pattern that filters the types of entities. All entity types containing the substring are kept. Write it in lowercase. If None, all types are kept.
- exclude : str, optional
  Type of embeddings to exclude from the types search.
- embedding_table_id : str, default='all_entities'
  Table of embedded entities from which to extract the embeddings. Get the supported tables with fetch_ken_table_aliases. It is also possible to pass a custom figshare ID.
- embedding_type_id : str, optional
  Figshare ID of the file containing the type of embeddings. Get the supported types with fetch_ken_types. Ignored unless a custom embedding_table_id is provided.
- pca_components : int, optional
  Number of dimensions onto which the embeddings are projected by principal component analysis (PCA). If None, the embeddings keep their default dimension (200).
- suffix : str, default=''
  Suffix to add to the column names of the embeddings.
- Returns:
DataFrame
The embeddings of the Wikipedia entities that match the specified types.
See also
fetch_ken_table_aliases
Get the supported aliases of embedded entities tables.
fetch_ken_types
Helper function to search for entity types.
fuzzy_join
Join two tables (dataframes) based on approximate column matching.
Joiner
Transformer to enrich a given table via one or more fuzzy joins to external resources.
Notes
The files are read in Parquet format, so pyarrow must be installed for this function to run correctly.
The search_types parameter filters the entity types by substring. If the input is “music”, every type containing this string is included (e.g. “wikicat_musician_from_france”, “wikicat_music_label”, etc.). Passing an exact type name (e.g. “wikicat_rock_music_bands”) is possible, but the result may be incomplete, since some relevant bands may be listed under other, similar types. To explore the available types, use the fetch_ken_types function.
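The substring behaviour described above can be illustrated without downloading anything. The following is a minimal sketch in plain Python, not skrub's actual implementation: the helper name filter_types and the “wikicat_video_games” entry are assumptions made for illustration, and skrub itself may match patterns differently (for example with regular expressions).

```python
def filter_types(type_names, search_types=None, exclude=None):
    """Keep names containing `search_types` and not containing `exclude`.

    Hypothetical helper that mimics the documented behaviour with plain
    substring tests; it is not part of skrub.
    """
    kept = []
    for name in type_names:
        # A None pattern means "no filtering on this criterion".
        if search_types is not None and search_types not in name:
            continue
        if exclude is not None and exclude in name:
            continue
        kept.append(name)
    return kept


# Type names quoted in the note above, plus one hypothetical non-music type.
available = [
    "wikicat_musician_from_france",
    "wikicat_music_label",
    "wikicat_rock_music_bands",
    "wikicat_video_games",  # hypothetical entry, for illustration only
]

# A broad pattern keeps every type that contains "music".
music_types = filter_types(available, search_types="music")

# `exclude` drops matching types from that selection.
no_labels = filter_types(available, search_types="music", exclude="label")

# An exact type name keeps only that type, so related types are missed.
exact_only = filter_types(available, search_types="wikicat_rock_music_bands")
```

Here the broad pattern keeps the three music-related names, while the exact name keeps a single type, which is why a broad pattern combined with exclude is often the safer starting point.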
References
For more details, see Cvetkov-Iliev, A., Allauzen, A. & Varoquaux, G.: Relational data embeddings for feature enrichment with background information.
Examples
fetch_ken_embeddings allows you to extract embeddings you are interested in. For instance, if we are interested in video games:
>>> games_embedding = fetch_ken_embeddings(search_types="video_games")
>>> games_embedding.head()
                         Entity  ...      X199
0             A_Little_Princess  ...   0.04...
1                 The_Dark_Half  ...  -0.00...
2                  Frankenstein  ...  -0.11...
3                 Albert_Wesker  ...  -0.16...
4  Harukanaru_Toki_no_Naka_de_3  ...   0.14...
This extracts all embeddings whose type contains “video_games”. For the list of existing types, see fetch_ken_types.
Some pre-filtered tables are available through the embedding_table_id parameter:
>>> games_embedding_fast = fetch_ken_embeddings(embedding_table_id="games")
>>> games_embedding_fast.head()
                     Entity  ...      X199
0              R-Type_Delta  ...   0.04...
1  Just_Add_Water_(company)  ...  -0.02...
2                 Li_Xiayan  ...   0.00...
3             Vampire_Night  ...  -0.14...
4               Shatterhand  ...   0.19...
This takes less time to load and is more precise, as the types have been carefully filtered. For a list of pre-filtered tables, see fetch_ken_table_aliases.
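The remaining parameters can be combined with a pre-filtered table. The sketch below shows one possible way to request reduced, suffixed embeddings. The helper load_game_embeddings and the values chosen (20 components, a "_game" suffix) are assumptions for illustration. The skrub import is deferred to call time, so defining the helper needs neither skrub nor a network connection, while calling it needs both, plus pyarrow.

```python
# Keyword arguments drawn from the signature documented above.
# The specific values (20 components, "_game" suffix) are illustrative assumptions.
GAME_EMBEDDING_KWARGS = {
    "embedding_table_id": "games",  # pre-filtered table used in the example above
    "pca_components": 20,           # project the 200-dimensional embeddings down to 20
    "suffix": "_game",              # appended to the embedding column names
}


def load_game_embeddings(**overrides):
    """Hypothetical helper: fetch reduced game embeddings.

    Calling it requires skrub, pyarrow and network access.
    """
    # Imported lazily so that defining the helper does not require skrub.
    from skrub.datasets import fetch_ken_embeddings

    # Caller-supplied keyword arguments take precedence over the defaults.
    return fetch_ken_embeddings(**{**GAME_EMBEDDING_KWARGS, **overrides})
```

A distinct suffix can help tell embedding columns apart once they are joined to another table, for instance with fuzzy_join or Joiner.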
Gallery examples
Wikipedia embeddings to enrich the data