Assembling: joining multiple tables#
Assembling is the process of collecting and joining together tables. Good analytics requires including as much information as possible, often from different sources.
skrub allows you to join tables on keys of different types (string, numerical, datetime) with imprecise correspondence.
Fuzzy joining tables#
Joining two dataframes can be hard as the corresponding keys may be different.
fuzzy_join()
uses similarities in entries to join tables on one or more
related columns. Furthermore, it chooses the type of fuzzy matching used based
on the column type (string, numerical or datetime). It also outputs a similarity
score, to single out bad matches, so that they can be dropped or replaced.
In sum, equivalent to pandas.merge()
, the fuzzy_join()
has no need for pre-cleaning.
Joining external tables for machine learning#
Joining is straightforward for two tables because you only need to identify the common key.
In addition, skrub also enable more advanced analysis:
Joiner
: fuzzy-joins an external table using a scikit-learn transformer, which can be used in a scikit-learnPipeline
. Pipelines are useful for cross-validation and hyper-parameter search, but also for model deployment.AggJoiner
: instead of performing 1:1 joins likeJoiner
,AggJoiner
aggregates the external table first, then joins it on the main table. Alternatively, it can aggregate the main table and then join it back onto itself.AggTarget
: in some settings, one can derive powerful features from the target y itself. AggTarget aggregates the target without risking data leakage, then joins the result back on the main table, similar to AggJoiner.MultiAggJoiner
: extension of theAggJoiner
that joins multiple auxiliary tables onto the main table.
Column selection inside a pipeline#
Besides joins, another common operation on a dataframe is to select a subset of its columns (also known as a projection).
We sometimes need to perform such a selection in the middle of a pipeline, for example if we need a column for a join (with Joiner
), but in a subsequent step we want to drop that column before fitting an estimator.
skrub provides transformers to perform such an operation:
SelectCols
allows specifying the columns we want to keep.Conversely
DropCols
allows specifying the columns we want to discard.
Going further: embeddings for better analytics#
Data collection comes before joining, but is also an essential process of table assembling. Although many datasets are available on the internet, it is not always easy to find the right one for your analysis.
skrub has some very helpful methods that gives you easy
access to embeddings, or vectorial representations of an entity,
of all common entities from Wikipedia.
You can use datasets.get_ken_embeddings()
to search for the right
embeddings and download them.
Other methods, such as datasets.fetch_world_bank_indicator()
to
fetch data of a World Bank indicator can also help you retrieve
useful data that will be joined to another table.