Discover object

The Discover object should:

  • Give a high-level overview of the contents of a data lake
  • Build an (approximate) schema of the data
  • Suggest tables that are relevant to what the user provides

Planned features

If no query is provided:

  • Given a collection of tables, profile them and produce aggregated statistics.
  • Statistics include dtypes, null-value counts, and the shape of each table.
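The profiling step above can be sketched with plain pandas. This is only an illustration under assumptions: `profile_tables` and the names of its output columns are hypothetical and are not part of skrub or of the planned Discover interface.

```python
import pandas as pd


def profile_tables(tables):
    """Return one row per (table, column) with basic statistics.

    `tables` maps a table name to a pandas DataFrame. The function name and
    the output column names are hypothetical, chosen for this sketch only.
    """
    rows = []
    for name, df in tables.items():
        n_rows, n_cols = df.shape
        for col in df.columns:
            n_nulls = int(df[col].isna().sum())
            rows.append(
                {
                    "table": name,
                    "column": col,
                    "dtype": str(df[col].dtype),
                    "n_rows": n_rows,
                    "n_columns": n_cols,
                    "n_nulls": n_nulls,
                    # Guard against empty tables to avoid dividing by zero
                    "null_fraction": n_nulls / n_rows if n_rows else 0.0,
                }
            )
    return pd.DataFrame(rows)


# Toy collection of tables standing in for the data lake
tables = {
    "cities": pd.DataFrame({"city": ["Paris", "Rome", None], "pop": [2.1, 2.8, 0.5]}),
    "countries": pd.DataFrame({"country": ["France", "Italy"]}),
}
stats = profile_tables(tables)
print(stats)
```

Returning a single long table (one row per column) keeps the statistics easy to filter or aggregate later, for example by table name or by null fraction.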

If a query table is provided:

  • Measure various pairwise metrics between the columns of the query table and the columns of the collection of tables.
  • Rank the columns by these metrics to find those that are most relevant.
  • Jaccard containment will be the first metric.
  • The profiling statistics remain available for feature selection.
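The ranking step above can be sketched as follows. Jaccard containment of a query column Q in a candidate column C is taken here as |Q ∩ C| / |Q|, computed over distinct non-null values. The helper names `jaccard_containment` and `rank_columns`, and the layout of the returned table, are assumptions made for this sketch and not the planned Discover output.

```python
import pandas as pd


def jaccard_containment(query_values, candidate_values):
    """Fraction of distinct non-null query values found in the candidate."""
    q = set(pd.Series(query_values).dropna().unique())
    c = set(pd.Series(candidate_values).dropna().unique())
    if not q:
        return 0.0
    return len(q & c) / len(q)


def rank_columns(query_table, tables):
    """Score every (query column, candidate column) pair, best first."""
    columns = ["query_column", "table", "candidate_column", "containment"]
    rows = []
    for name, df in tables.items():
        for q_col in query_table.columns:
            for c_col in df.columns:
                score = jaccard_containment(query_table[q_col], df[c_col])
                rows.append((q_col, name, c_col, score))
    ranking = pd.DataFrame(rows, columns=columns)
    return ranking.sort_values("containment", ascending=False, ignore_index=True)


# Toy query table and candidate collection
query_table = pd.DataFrame({"city": ["Paris", "Rome", "Oslo", "Rome"]})
tables = {
    "cities": pd.DataFrame(
        {"name": ["Paris", "Rome", "Berlin"], "code": ["FR", "IT", "DE"]}
    ),
}
ranking = rank_columns(query_table, tables)
print(ranking)
```

Containment is asymmetric: it measures how much of the query column is covered by the candidate, so a large candidate column is not penalized for holding extra values. This exhaustive pairwise loop is quadratic in the number of columns and is only meant to show the metric, not a scalable implementation.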

Mock-up of the code: no query table

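from skrub import Discover  # mock-up of the planned class

# Directory holding the collection of tables to profile
path_to_tables = "./many_tables/"
discover = Discover(path_to_tables)

# Without a query table, the result holds the aggregated statistics
dataframe_stats = discover.fit_transform()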

Mock-up of the code: with a query table

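from skrub import Discover  # mock-up of the planned class
import pandas as pd

path_to_tables = "./many_tables/"
# Table whose columns are compared against the collection
query_table = pd.read_csv("this_table.csv")

discover = Discover(path_to_tables)

# With a query table, the result ranks candidate columns by relevance
ranking_by_column = discover.fit_transform(query_table)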

Mock-up of the code: joining the ranked candidates

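from skrub import Discover, MultiAggJoiner  # Discover is a mock-up of the planned class
import pandas as pd

path_to_tables = "./many_tables/"
query_table = pd.read_csv("this_table.csv")

discover = Discover(path_to_tables)

# Rank candidate columns against the query table
ranking_by_column = discover.fit_transform(query_table)

# Mock-up: the joiner uses the ranking to decide what to join onto the query table
joiner = MultiAggJoiner(ranking_by_column)
joined_table = joiner.fit_transform(query_table)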

Interface with the data

  • The initial implementation will read from a path/glob
  • Later versions will target SQL databases
  • What other technologies should we consider?
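For the initial path/glob reader mentioned in the first bullet, a minimal sketch could look like the following. It assumes the tables are CSV files and keys each table by its file stem; `load_tables` and its `pattern` parameter are hypothetical names, not a proposed skrub API.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(root, pattern="*.csv"):
    """Read every file under `root` matching `pattern` into a dict of DataFrames.

    Tables are keyed by file stem, so files sharing a stem (possible with a
    recursive pattern such as "**/*.csv") would overwrite each other.
    """
    root = Path(root)
    return {path.stem: pd.read_csv(path) for path in sorted(root.glob(pattern))}


# Build a throwaway directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [1, 2]}).to_csv(Path(tmp) / "first.csv", index=False)
    pd.DataFrame({"b": ["x"]}).to_csv(Path(tmp) / "second.csv", index=False)
    (Path(tmp) / "notes.txt").write_text("not a table")
    tables = load_tables(tmp)

print(sorted(tables))
```

Loading everything eagerly is the simplest option; a lazy reader that only opens a file when its columns are needed may be preferable for large collections.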