Inria
I have a table I want to train a model on. I also have access to a large collection of tables.
How do I combine the two to train a better model?
Warning
This terminology is slightly different from that used in the Skrub documentation
Jaccard containment is a “normalized” intersection: containment(Q, X) = |Q ∩ X| / |Q|
Important
What fraction of query set Q is in candidate column X?
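As a minimal illustration, containment can be computed with plain Python sets and used to rank candidate columns (the names and data below are illustrative, not from the benchmark):

```python
def containment(query, candidate):
    """Jaccard containment: fraction of the query's values found in the candidate."""
    q, x = set(query), set(candidate)
    if not q:
        return 0.0
    return len(q & x) / len(q)

# Rank candidate columns by how much of the query they cover.
query = ["US", "FR", "IT", "DE"]
candidates = {
    "country_code": ["US", "FR", "IT", "JP", "CN"],
    "city": ["Paris", "Rome"],
}
ranked = sorted(
    candidates, key=lambda c: containment(query, candidates[c]), reverse=True
)
```

Note that containment is asymmetric: a huge candidate column that happens to cover the query scores 1.0 even if most of its values are irrelevant.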
If you don’t know the tables, everyone is sus
| ML Model | Platform | Total compute time |
|---|---|---|
| RidgeCV | CPU | 4y 3m 10d 7h |
| CatBoost | CPU | 1y 3m 29d 21h |
| ResNet | GPU | 5y 6m 23d 0h |
| RealMLP | GPU | 10y 7m 23d 3h |
| Total | Both | 21y 9m 26d 8h |
Research code…
Discover

- The `AggJoiner` or `MultiAggJoiner` object can replace (part of) the retrieval step.
- `MultiAggJoiner` is an additional baseline.
- `TableVectorizer` can handle automated preprocessing of the tables.
- `TableReport` helps explore the tables.
```python
import polars as pl
from tqdm import tqdm

# Join every retrieved candidate table onto the source table ("full join" baseline).
merged = source_table.clone()
hashes = []
for hash_, mdata in tqdm(
    index_cand.items(),
    total=len(index_cand),
    leave=False,
    desc="Full Join",
    position=2,
):
    cnd_md = mdata.candidate_metadata
    hashes.append(cnd_md["hash"])
    candidate_table = pl.read_parquet(cnd_md["full_path"])
    left_on = mdata.left_on
    right_on = mdata.right_on
    # Aggregate the candidate on its join key so the left join stays one-to-one.
    aggr_right = aggregate_table(
        candidate_table, right_on, aggregation_method=aggregation
    )
    # Left-join the aggregated candidate; a hash prefix disambiguates column names.
    merged = execute_join(
        merged,
        aggr_right,
        left_on=left_on,
        right_on=right_on,
        how="left",
        suffix="_" + hash_[:10],
    )
```
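The helpers `aggregate_table` and `execute_join` are defined elsewhere in the research code. As a rough pure-Python sketch of what they do — assuming mean aggregation of numeric columns and a left join, and using lists of dicts in place of polars frames — they might look like:

```python
from collections import defaultdict

def aggregate_table(rows, right_on, suffix=""):
    # Group candidate rows by the join key(s) and average numeric columns,
    # so each key appears at most once before the left join (mean assumed).
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in right_on)].append(row)
    aggregated = []
    for key, grp in groups.items():
        out = dict(zip(right_on, key))
        for col in grp[0]:
            if col in right_on:
                continue
            vals = [r[col] for r in grp if isinstance(r[col], (int, float))]
            if vals:
                out[col + suffix] = sum(vals) / len(vals)
        aggregated.append(out)
    return aggregated

def execute_join(left, right, left_on, right_on):
    # Left join: keep every left row, attach the matching aggregated columns.
    index = {tuple(r[c] for c in right_on): r for r in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[c] for c in left_on), {})
        extra = {k: v for k, v in match.items() if k not in right_on}
        merged.append({**row, **extra})
    return merged
```

In the loop above, the `suffix` plays the role of the `"_" + hash_[:10]` argument: each candidate contributes its own disambiguated columns to the growing `merged` table.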
Authors:
https://github.com/skrub-data/skrub