fetch_ken_embeddings

skrub.datasets.fetch_ken_embeddings(search_types=None, *, exclude=None, embedding_table_id='all_entities', embedding_type_id=None, pca_components=None, suffix='')

Download Wikipedia embeddings by type.

More details on the embeddings can be found on https://soda-inria.github.io/ken_embeddings/.

Parameters:
search_types : str, optional

Substring pattern used to filter the entity types: every entity type containing the substring is kept. Write it in lowercase. If None, all types are kept.

exclude : str, optional

Type of embeddings to exclude from the types search.

embedding_table_id : str, default='all_entities'

Table of embedded entities from which to extract the embeddings. Get the supported tables with fetch_ken_table_aliases. It is also possible to pass a custom figshare ID.

embedding_type_id : str, optional

Figshare ID of the file containing the type of embeddings. Get the supported types with fetch_ken_types. Ignored unless a custom embedding_table_id is provided.

pca_components : int, optional

Number of dimensions onto which the embeddings are projected by a principal component analysis (PCA). If None, the embeddings keep their default dimension (200).

suffix : str, optional, default=''

Suffix to add to the column names of the embeddings.

Returns:
DataFrame

The embeddings of the Wikipedia entities matching the specified types.

See also

fetch_ken_table_aliases

Get the supported aliases of embedded entities tables.

fetch_ken_types

Helper function to search for entity types.

fuzzy_join

Join two tables (dataframes) based on approximate column matching.

Joiner

Transformer to enrich a given table via one or more fuzzy joins to external resources.

Notes

The files are stored and read in Parquet format, so this function needs pyarrow installed to run correctly.

The search_types parameter filters the entity types by the given string pattern. For example, if the input is “music”, all types containing this string are included (e.g. “wikicat_musician_from_france”, “wikicat_music_label”, etc.). Passing an exact type name (e.g. “wikicat_rock_music_bands”) is possible, but the result may be incomplete, as some relevant bands may be listed under other, similar types. To explore the available types, use the fetch_ken_types function.
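The filtering behaviour described above can be sketched in plain Python. This is only an illustration of substring matching, not the library's implementation: the helper filter_types and the list of type names are hypothetical, the library may interpret the pattern differently (for instance as a regular expression), and treating exclude as a substring is an assumption.

```python
def filter_types(type_names, search_types=None, exclude=None):
    """Keep the names containing `search_types`; drop those containing `exclude`.

    Hypothetical helper: plain substring matching, used for illustration only.
    """
    kept = []
    for name in type_names:
        # A search pattern of None keeps every type.
        if search_types is not None and search_types not in name:
            continue
        # Assumption: `exclude` is also matched as a substring.
        if exclude is not None and exclude in name:
            continue
        kept.append(name)
    return kept


# Hypothetical type names, modelled on the examples quoted in the note above.
types = [
    "wikicat_musician_from_france",
    "wikicat_music_label",
    "wikicat_rock_music_bands",
    "wikicat_video_games",
]

broad = filter_types(types, search_types="music")
without_labels = filter_types(types, search_types="music", exclude="label")
exact = filter_types(types, search_types="wikicat_rock_music_bands")

print(broad)
print(without_labels)
print(exact)
```

The broad pattern keeps the three music-related types, while the exact type name keeps only one of them, which is why an exact name may miss relevant entities.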

References

For more details, see Cvetkov-Iliev, A., Allauzen, A. & Varoquaux, G.: Relational data embeddings for feature enrichment with background information.

Examples

fetch_ken_embeddings allows you to extract embeddings you are interested in. For instance, if we are interested in video games:

>>> games_embedding = fetch_ken_embeddings(search_types="video_games") 
>>> games_embedding.head() 
                         Entity  ...      X199
0             A_Little_Princess  ...  0.04...
1                 The_Dark_Half  ... -0.00...
2                  Frankenstein  ... -0.11...
3                 Albert_Wesker  ... -0.16...
4  Harukanaru_Toki_no_Naka_de_3  ...  0.14...

This extracts all embeddings whose type contains “video_games”. For the list of existing types, see fetch_ken_types.

Some pre-filtered tables are available through the embedding_table_id parameter:

>>> games_embedding_fast = fetch_ken_embeddings(embedding_table_id="games") 
>>> games_embedding_fast.head() 
                     Entity  ...      X199
0              R-Type_Delta  ...  0.04...
1  Just_Add_Water_(company)  ... -0.02...
2                 Li_Xiayan  ...  0.00...
3             Vampire_Night  ... -0.14...
4               Shatterhand  ...  0.19...

This takes less time to load and is more precise, as the types have been carefully filtered. For a list of pre-filtered tables, see fetch_ken_table_aliases.
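The remaining keyword arguments can be combined with a pre-filtered table. The sketch below wraps such a call in a hypothetical helper, load_game_embeddings; the values 20 and "_game" are arbitrary choices for illustration. It assumes an installed skrub version that provides this function, pyarrow, and network access, so the download only happens when the helper is called.

```python
def load_game_embeddings(n_components=20, suffix="_game"):
    """Fetch the pre-filtered "games" table with reduced, suffixed embeddings.

    Hypothetical helper; the default values are arbitrary illustrations.
    """
    # Imported lazily so that defining this helper needs neither skrub nor pyarrow.
    from skrub.datasets import fetch_ken_embeddings

    return fetch_ken_embeddings(
        embedding_table_id="games",     # pre-filtered table used in the example above
        pca_components=n_components,    # project the 200-dimensional embeddings by PCA
        suffix=suffix,                  # appended to the embedding column names
    )
```

Reducing the dimension with pca_components can be convenient when the embeddings are later joined to another table as features, and a suffix helps tell embedding columns from different tables apart.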