Data scientists and engineers often spend the majority of their time on data acquisition, cleaning, and modeling. We meticulously tune hyperparameters and engineer complex features, only to output the results using a standard print(df) or a basic Jupyter Notebook rendering. While adequate for debugging, standard pandas output fails to convey authority or narrative flow. When the goal is to communicate insights to stakeholders, researchers, or the general public, the presentation layer becomes as critical as the analysis itself. While libraries like Matplotlib and Plotly handle graphical representation, the tabular display of data has historically lagged behind in the Python ecosystem. This changed with the introduction of Great Tables.

Great Tables is a port of the renowned R package gt. It introduces a philosophy known as the "Grammar of Tables," providing a structured, layer-based approach to constructing data tables. Instead of treating a table as a grid of text, Great Tables treats it as a collection of semantic components—headers, stubs, column spanners, and footers. This distinction allows developers to craft tables that are not merely readable but are publication-ready, supporting complex formatting, annotations, and hierarchical grouping with a declarative API.

The Architecture of a Great Table

To master Great Tables, one must first understand the structural anatomy it imposes on data. Unlike a spreadsheet, which is an effectively infinite grid, a Great Table is composed of six precise parts. Understanding these parts is essential because the API methods map directly onto these regions.

The Table Header contains the title and subtitle, setting the context for the data. Below this sits the Stub and the Column Labels. The Stub is the left-most section (conceptually similar to a pandas Index) that identifies the rows, while Column Labels identify the variables. Crucially, Great Tables allows for Spanner Column Labels, which group multiple columns under a unified category, effectively creating a multi-index hierarchy for columns without the pain of managing a MultiIndex in pandas. The Table Body contains the actual cells, which can be formatted extensively. Finally, the Table Footer contains source notes and footnotes, providing necessary citations or clarifications for specific data points.

Let us begin by installing the library and preparing a standard dataset for transformation. We will use a subset of a typical sales dataset to demonstrate these capabilities.

from great_tables import GT, md, html
import pandas as pd

# Sample dataset creation
data = {
    "region": ["North", "North", "South", "South", "East", "East"],
    "product_line": ["Electronics", "Furniture", "Electronics", "Furniture", "Electronics", "Furniture"],
    "revenue": [15000.50, 12000.00, 18500.75, 9000.25, 21000.00, 14500.50],
    "growth": [0.052, -0.015, 0.081, 0.023, 0.125, 0.041],
    "date_reported": ["2023-10-01", "2023-10-01", "2023-10-02", "2023-10-02", "2023-10-03", "2023-10-03"]
}

df = pd.DataFrame(data)

# The simplest invocation
basic_table = GT(df)
basic_table

Formatting Data for Cognitive Ease

Raw data is rarely reader-friendly. A floating-point number representing currency should look like currency, and a fraction representing growth should look like a percentage. In standard pandas, achieving this often means converting columns to their string representations, which destroys the numerical integrity of the dataframe. Great Tables solves this by applying a formatting layer on top of the data, leaving the underlying values untouched.
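To see the difference concretely, consider what the pandas-side workaround does to the data. A minimal sketch (the figures mirror the sample revenues from the dataset above):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [15000.50, 12000.00]})

# The common pandas workaround: overwrite values with formatted strings
pretty = df["revenue"].map(lambda v: f"${v:,.2f}")

# The column is now dtype 'object' (strings), so arithmetic is lost:
# pretty.sum() concatenates text instead of adding numbers
print(pretty.dtype)
```

Great Tables sidesteps this entirely: the fmt_* calls shown below describe how values render, while the dataframe keeps its numeric dtypes.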

Handling Numerics and Currencies

The library provides a suite of fmt_* functions. For financial data, fmt_currency is indispensable. It handles currency symbols, thousands separators, and decimal precision automatically. You can target specific columns and even apply localization standards.

formatted_table = (
    GT(df)
    .tab_header(
        title="Q4 Regional Sales Report",
        subtitle="Revenue and Growth analysis by Product Line"
    )
    .fmt_currency(
        columns="revenue",
        currency="USD"
    )
    .fmt_percent(
        columns="growth",
        decimals=1
    )
    .fmt_date(
        columns="date_reported",
        date_style="day_m_year"
    )
)
formatted_table

In the example above, we target the revenue column to display as USD. The growth column is automatically multiplied by 100 and appended with a percent sign, rounded to one decimal place. The date column is transformed from an ISO string into a human-readable format. This declarative approach keeps the code clean and the data separate from the presentation logic.

Structuring Columns and Rows

A key differentiator between a dataframe dump and a publication table is the grouping of information. High-quality tables often require grouping related columns under a "Spanner" label to indicate a shared category. Similarly, rows often need to be grouped to show hierarchical relationships, such as regions within a country or departments within a division.

Implementing Column Spanners

The tab_spanner method creates a label that spans across multiple columns. This is particularly effective when you have metrics that belong to a specific timeframe or logical category. Instead of renaming columns to "Q1_Revenue" and "Q1_Growth", you keep clean column names and add a spanner labeled "Q1 Performance".

Row Grouping

Great Tables utilizes the concept of row groups. By specifying a grouping key during the initialization or using tab_row_group, the table is physically segmented. This eliminates the need for repeated values in a categorical column, reducing visual clutter and emphasizing the structure of the data.

grouped_table = (
    GT(df, rowname_col="product_line", groupname_col="region")
    .tab_header(
        title="Sales Performance",
        subtitle=md("Breakdown by **Region** and **Product Line**")
    )
    .tab_spanner(
        label="Financial Metrics",
        columns=["revenue", "growth"]
    )
    .fmt_currency(columns="revenue")
    .fmt_percent(columns="growth")
    .cols_label(
        revenue="Total Revenue",
        growth="YoY Growth",
        date_reported="Report Date"
    )
)
grouped_table

In this example, we moved the "region" column out of the data body and used it to create sectional headers for the table rows. The "product_line" column became the stub (the row label). We also introduced markdown in the subtitle using the md() helper, allowing for bolded or italicized text within the headers.

Adding Context: Source Notes and Footnotes

In academic and professional reporting, data rarely stands alone. It requires context. Where did the data come from? Are there caveats to specific numbers? Standard pandas displays have no native mechanism for this, often forcing analysts to add text blocks below the image or screenshot.

Great Tables integrates these elements directly into the table object. Source Notes appear at the very bottom of the table and are generally used for citations. Footnotes are more granular; they link specific cells or column labels to explanatory text in the footer via automatic numbering or symbol assignment.

Targeting Cells for Annotation

To attach a footnote, you must specify the location. This can be done by targeting specific columns and rows. The API also allows for conditional targeting, meaning you can attach a footnote only to values that meet certain criteria (e.g., "growth is negative").

from great_tables import loc

annotated_table = (
    grouped_table
    .tab_source_note(
        source_note="Source: Internal Sales Database, Q4 Export."
    )
    .tab_footnote(
        footnote="Furniture sales were impacted by supply chain delays in the North region.",
        locations=loc.body(
            columns="revenue",
            rows=1  # Row index 1 corresponds to North / Furniture
        )
    )
    .tab_footnote(
        footnote="Projected growth based on Q3 trends.",
        locations=loc.column_labels(columns="growth")
    )
)

The loc module is powerful. It provides selectors for the body, the stub, column labels, and more. By embedding footnotes programmatically, the table remains self-contained. If you export this table to HTML or capture it as an image, the context travels with the data.

Styling and Visual Customization

While structure and formatting ensure correctness, styling ensures engagement. A "Great Table" should align with your organization's brand identity or the publication's style guide. The library offers both high-level presets and granular control over CSS-like attributes.

Using opt_stylize

For rapid development, Great Tables includes a set of pre-defined styles accessed via opt_stylize. These come in numbered variations and color schemes. It is the fastest way to turn a plain table into a professional grid with alternating row colors and distinct headers.

# Applying a pre-set style
styled_table = annotated_table.opt_stylize(style=6, color="blue")

Granular Styling with tab_style

For specific requirements, such as highlighting a specific cell that exceeds a threshold, tab_style is used. This method pairs a style definition (what it looks like) with a location (where it applies). You can change background colors, font weights, borders, and text alignment.

This capability effectively allows for "Conditional Formatting" similar to Excel's, but reproducible in code. For instance, you might render any growth value below zero in bold red text.

from great_tables import style, loc

# Pre-compute the row indices that satisfy each condition
negative_growth_rows = df.index[df["growth"] < 0].tolist()
high_revenue_rows = df.index[df["revenue"] > 20000].tolist()

final_table = (
    annotated_table
    .tab_style(
        style=style.text(color="red", weight="bold"),
        locations=loc.body(
            columns="growth",
            rows=negative_growth_rows
        )
    )
    .tab_style(
        style=style.fill(color="#e6f3ff"),
        locations=loc.body(
            columns="revenue",
            rows=high_revenue_rows
        )
    )
)
final_table

In this final example, we see the convergence of logic and design. The boolean masks identify every row where growth is negative or revenue exceeds 20,000, and the styles are applied to exactly those cells. If the underlying data changes, re-running the code recomputes the targeted rows, so the highlighting adjusts to the new values without manual intervention.