TableReport
- class skrub.TableReport(dataframe, n_rows=10, order_by=None, title=None, column_filters=None, verbose=None, plot_distributions='auto', compute_associations='auto', open_tab='table', max_plot_columns=None, max_association_columns=None)
Summarize the contents of a dataframe.
This class summarizes a dataframe or numpy array, providing information such as the type and summary statistics (mean, number of missing values, etc.) for each column. Numpy arrays are converted to pandas DataFrame or Series.
- Parameters:
- dataframe
pandas or polars Series or DataFrame The dataframe or series to summarize.
- n_rows
int, default=10 Maximum number of rows to show in the sample table. Half will be taken from the beginning (head) of the dataframe and half from the end (tail). Note this is only for display. Summary statistics, histograms etc. are computed using the whole dataframe.
- order_by
str Column name to use for sorting. Other numerical columns will be plotted as a function of the sorting column. Must be of numerical or datetime type.
- title
str Title for the report.
- column_filters
dict A dict for adding custom entries to the column filter dropdown menu. Each key is the filter name to be displayed in the dropdown menu (e.g. "first_10"), and the value is the desired filter. Allowed formats for the filter values are a list of column names, a list of column indices, or a Selector object. See the end of the “Examples” section below for details.
- verbose
int, default = 1 Whether to print progress information while the report is being generated.
verbose = 1 prints how many columns have been processed so far.
verbose = 0 silences the output.
- plot_distributions
bool or “auto”, default=“auto” Whether to plot the distributions of the columns.
True: always generate plots, regardless of column count.
False: never generate plots.
"auto" (default): generate plots only when the number of columns does not exceed the configured table_report_plots_threshold (see set_config()).
- compute_associations
bool or “auto”, default=“auto” Whether to compute associations between columns.
True: always compute associations, regardless of column count.
False: never compute associations.
"auto" (default): compute associations only when the number of columns does not exceed the configured table_report_associations_threshold (see set_config()).
- max_plot_columns
int or “all”, deprecated Deprecated in favor of plot_distributions. This parameter overrides the value chosen for plot_distributions when it is not None.
Deprecated since version 0.9.0.
- max_association_columns
int or “all”, deprecated Deprecated in favor of compute_associations. This parameter overrides the value chosen for compute_associations when it is not None.
Deprecated since version 0.9.0.
- open_tab
str, default=“table” The tab that will be displayed by default when the report is opened. Must be one of “table”, “stats”, “distributions”, or “associations”.
“table”: Shows a sample of the dataframe rows
“stats”: Shows summary statistics for all columns
“distributions”: Shows plots of column distributions
“associations”: Shows column associations and similarities
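The three-state behaviour described for plot_distributions and compute_associations can be sketched in plain Python. This is only an illustration of the documented rule, not skrub's implementation: the helper name resolve_auto and the threshold value used in the calls are hypothetical, and the real thresholds come from skrub's configuration.

```python
def resolve_auto(setting, n_columns, threshold):
    """Return True if the optional step (plots or associations) should run.

    Hypothetical helper, not part of skrub. `setting` mirrors the
    documented values: True, False, or "auto".
    """
    if setting is True or setting is False:
        # Explicit booleans win regardless of the column count.
        return setting
    if setting == "auto":
        # Run only while the column count does not exceed the threshold.
        return n_columns <= threshold
    raise ValueError(f"expected True, False or 'auto', got {setting!r}")


# The threshold of 30 is chosen for illustration only.
print(resolve_auto("auto", n_columns=12, threshold=30))  # True
print(resolve_auto("auto", n_columns=45, threshold=30))  # False
print(resolve_auto(True, n_columns=45, threshold=30))    # True
```

Passing True or False explicitly is the way to get the same behaviour on every table, whatever its width.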
See also
patch_display
Replace the default DataFrame HTML displays in the output of notebook cells with a TableReport.
Notes
You can see some example reports for a few datasets online. We also provide an experimental online demo that allows you to select a CSV or parquet file and generate a report directly in your web browser.
Examples
>>> import pandas as pd
>>> from skrub import TableReport
>>> df = pd.DataFrame(dict(a=[1, 2], b=['one', 'two'], c=[11.1, 11.1]))
>>> report = TableReport(df)
In a Jupyter notebook, make the report the last expression evaluated in a cell to display it in the cell’s output.
>>> report
<TableReport: use .open() to display>
(Note that above we only see the string representation, not the report itself, because we are not in a notebook.)
Whether you are using a notebook or not, you can always open the report as a full page in a separate browser tab with its open method: report.open().
You can also get the HTML report as a string. For a full, standalone web page:
>>> report.html()
'<!DOCTYPE html>\n<html lang="en-US">\n\n<head>\n <meta charset="utf-8"...'
For an HTML fragment that can be inserted into a page:
>>> report.html_snippet()
'\n<div id="report_...-wrapper" hidden>\n <template id="report_...'
Advanced configuration: you can add custom column filters that will appear in the report’s dropdown menu.
>>> filters = {
...     "display_name": ["a", "b"],
... }
>>> report = TableReport(df, column_filters=filters)
With the code above, in addition to the default filters such as “All columns”, “Numeric columns”, etc., the added “display_name” filter will be available in the report, selecting columns “a” and “b”.
Methods
html(): Get the report as a full HTML page.
html_snippet(): Get the report as an HTML fragment that can be inserted in a page.
json(): Get the report data in JSON format.
open(): Open the HTML report in a web browser.
write_html(file): Store the report into an HTML file.
Gallery examples
Tutorial: Using Data Ops to build a machine-learning pipeline
Various string encoders: a sentiment analysis example
Multiples tables: building machine learning pipelines with DataOps