Wrap `Reader`s with an in-memory caching layer by georgestagg · Pull Request #496 · posit-dev/ggsql

georgestagg · 2026-06-29T15:18:53Z

Add an in-memory caching layer for readers

This PR adds a caching layer that wraps any reader in an in-memory caching reader; for now the cache backend can be duckdb or sqlite.
This lets us visualise data from read-only databases, and caching improves the experience with very slow remote databases.

A CachingReader wraps a primary reader, splitting the API into two surfaces:

source (execute_sql) — base reads of the user's data, hitting the primary.
compute (execute_sql_cached) — all dialect-generated / derived SQL over __ggsql_* tables, hitting the cache.

When caching is off, the default execute_sql_cached just calls execute_sql, so the single execution path works unchanged.

Cached reads are tracked in a __ggsql_cache_meta__ table inside the cache reader, itself queryable for introspection.
The cache is bounded by a TTL and an LRU budget. Entries older than the TTL are re-fetched, and once the total size exceeds the budget the least-recently-used entries are evicted until it fits.
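
As a rough illustration of the TTL and LRU policy described above, here is a minimal sketch. It is not the code in this PR: the `Entry` and `CacheIndex` types and their fields are hypothetical, and it keeps its bookkeeping in a `HashMap` rather than in the `__ggsql_cache_meta__` table.

```rust
use std::collections::HashMap;

/// Hypothetical bookkeeping for one cached read (not ggsql's actual schema).
#[derive(Debug, Clone)]
struct Entry {
    bytes: u64,
    fetched_at: u64, // seconds; when the data was fetched from the primary
    last_used: u64,  // seconds; when the entry was last read
}

/// Hypothetical in-memory index standing in for the cache meta table.
struct CacheIndex {
    ttl_secs: u64,
    max_bytes: u64,
    entries: HashMap<String, Entry>,
}

impl CacheIndex {
    fn new(ttl_secs: u64, max_bytes: u64) -> Self {
        Self { ttl_secs, max_bytes, entries: HashMap::new() }
    }

    /// True if the entry exists and is no older than the TTL.
    /// A hit also refreshes the entry's recency.
    fn is_fresh(&mut self, key: &str, now: u64) -> bool {
        let ttl = self.ttl_secs;
        if let Some(e) = self.entries.get_mut(key) {
            if now.saturating_sub(e.fetched_at) <= ttl {
                e.last_used = now;
                return true;
            }
        }
        false
    }

    fn total_bytes(&self) -> u64 {
        self.entries.values().map(|e| e.bytes).sum()
    }

    /// Record a (re-)fetched entry, then evict until the budget fits.
    /// Returns the keys that were evicted.
    fn insert(&mut self, key: &str, bytes: u64, now: u64) -> Vec<String> {
        self.entries.insert(
            key.to_string(),
            Entry { bytes, fetched_at: now, last_used: now },
        );
        let mut evicted = Vec::new();
        while self.total_bytes() > self.max_bytes {
            // Least-recently-used entry; ties are broken by key for determinism.
            let victim = self
                .entries
                .iter()
                .min_by(|a, b| a.1.last_used.cmp(&b.1.last_used).then_with(|| a.0.cmp(b.0)))
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    self.entries.remove(&k);
                    evicted.push(k);
                }
                None => break,
            }
        }
        evicted
    }
}
```

In this sketch a stale or missing key simply reports `false` from `is_fresh`, and the caller is expected to re-fetch from the primary and call `insert` again.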

Usage

Opt in with the --cache flag when using the ggsql CLI, or otherwise set a composite <primary>+<cache>:// connection string:

# simplest: cache the default duckdb reader
ggsql exec "<query>" --cache duckdb

# composite URI, with per-connection tuning
ggsql exec "<query>" --reader "odbc+duckdb://<dsn>?cache_ttl=600&cache_max_bytes=256mb"

In Jupyter et al., use a composite connection string:

-- @connect: odbc+duckdb://DSN=example_db_dsn
SELECT region, revenue FROM sales
VISUALISE region AS x, revenue AS y
DRAW bar

Force clear the cache mid-session with a meta-command:

-- @uncache

Configuration comes from env vars (GGSQL_CACHE_DISABLED, GGSQL_CACHE_TTL, GGSQL_CACHE_MAX_BYTES)
or connection URI query params (?cache_ttl=300&cache_max_bytes=32mb&cache_disabled=0).
Defaults: enabled, 300s TTL, 512 MB.
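
To make the URI form concrete, here is a hedged sketch of parsing a composite connection string and its cache parameters. The `CacheConfig` struct and both functions are hypothetical names, not ggsql's API. Treating `mb` as 1024 × 1024 bytes, accepting only `1` or `true` for `cache_disabled`, and splitting at the first `?` are assumptions; environment-variable handling is left out.

```rust
/// Hypothetical parsed form of a `<primary>+<cache>://` connection string.
#[derive(Debug, PartialEq)]
struct CacheConfig {
    primary: String,
    cache: String,
    rest: String, // everything between "://" and the first "?"
    ttl_secs: u64,
    max_bytes: u64,
    disabled: bool,
}

/// Parse sizes such as "256mb" or "1024". Binary multiples are an assumption.
fn parse_bytes(s: &str) -> Option<u64> {
    let s = s.trim().to_ascii_lowercase();
    let (digits, mult) = if let Some(n) = s.strip_suffix("kb") {
        (n, 1u64 << 10)
    } else if let Some(n) = s.strip_suffix("mb") {
        (n, 1u64 << 20)
    } else if let Some(n) = s.strip_suffix("gb") {
        (n, 1u64 << 30)
    } else {
        (s.as_str(), 1)
    };
    digits.trim().parse::<u64>().ok()?.checked_mul(mult)
}

/// Returns None when the URI is not of the composite form or a value is malformed.
fn parse_composite(uri: &str) -> Option<CacheConfig> {
    let (scheme, tail) = uri.split_once("://")?;
    let (primary, cache) = scheme.split_once('+')?;
    let (rest, query) = match tail.split_once('?') {
        Some((r, q)) => (r, q),
        None => (tail, ""),
    };
    // Defaults as stated in the description: enabled, 300s TTL, 512 MB.
    let mut cfg = CacheConfig {
        primary: primary.to_string(),
        cache: cache.to_string(),
        rest: rest.to_string(),
        ttl_secs: 300,
        max_bytes: 512 << 20,
        disabled: false,
    };
    for pair in query.split('&').filter(|p| !p.is_empty()) {
        let Some((key, value)) = pair.split_once('=') else { continue };
        match key {
            "cache_ttl" => cfg.ttl_secs = value.parse().ok()?,
            "cache_max_bytes" => cfg.max_bytes = parse_bytes(value)?,
            "cache_disabled" => cfg.disabled = matches!(value, "1" | "true"),
            _ => {} // parameters outside the cache namespace are ignored here
        }
    }
    Some(cfg)
}
```

With these assumptions, `odbc+duckdb://DSN=example_db_dsn?cache_ttl=600&cache_max_bytes=256mb` yields primary `odbc`, cache `duckdb`, a 600 second TTL, and a 268435456 byte budget.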

Reader trait changes

There are four new Reader methods, all with defaults so existing drivers are unaffected:

execute_sql_cached -- the compute surface. Defaults to just execute_sql.
materialize_table -- materialize a query body. Default is CREATE … TEMP TABLE on the reader.
caches_sources -- defaults to false, true for CachingReader.
clear_cache -- no-op by default, backs the -- @uncache meta-command.
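
A simplified sketch of why existing drivers are unaffected: every new method has a provided default, so a driver that implements only `execute_sql` still compiles and behaves as before. The signatures, the `Rows` alias, and the `LegacyDriver` type are hypothetical simplifications, not the real trait. The exact default statement is elided in the list above, so the plain `CREATE TEMP TABLE ... AS` used here is an assumed stand-in.

```rust
/// Stand-in for the real result type; a hypothetical simplification.
type Rows = Vec<Vec<String>>;

trait Reader {
    /// Source surface: base reads against the primary.
    fn execute_sql(&mut self, sql: &str) -> Result<Rows, String>;

    /// Compute surface: falls back to the source surface by default.
    fn execute_sql_cached(&mut self, sql: &str) -> Result<Rows, String> {
        self.execute_sql(sql)
    }

    /// Materialize a query body under `name`.
    /// Assumed statement shape; the real default is not spelled out above.
    fn materialize_table(&mut self, name: &str, body: &str) -> Result<(), String> {
        self.execute_sql(&format!("CREATE TEMP TABLE {name} AS {body}"))
            .map(|_| ())
    }

    /// Only a caching reader would override this to return true.
    fn caches_sources(&self) -> bool {
        false
    }

    /// No-op unless the reader actually holds a cache.
    fn clear_cache(&mut self) -> Result<(), String> {
        Ok(())
    }
}

/// A driver written before this change: it implements only `execute_sql`
/// and records what it was asked to run.
struct LegacyDriver {
    log: Vec<String>,
}

impl Reader for LegacyDriver {
    fn execute_sql(&mut self, sql: &str) -> Result<Rows, String> {
        self.log.push(sql.to_string());
        Ok(vec![])
    }
}
```

Calling `execute_sql_cached` on `LegacyDriver` goes through the same `execute_sql` path, which is the single execution path used when caching is off.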

Because derived SQL runs on cache-resident __ggsql_* tables via the compute surface, ggsql must emit it in the cache backend's dialect, not the primary's, so CachingReader::dialect() returns the cache dialect.

Read-only guarantee

When the cache is active, the primary connection is never written to:

materialize_table is overridden by CachingReader to register() the resulting data frame into the cache, rather than using temp tables.
All dialect-generated/derived SQL runs on the compute surface using execute_sql_cached, i.e. the cache backend.
The primary sees only base reads of the user's data.
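
The routing behind this guarantee can be sketched with two recording backends. This is illustrative only: `Recorder`, its methods, and the `__ggsql_source` table name are hypothetical, and the `CachingReader` here is a toy stand-in for the real type.

```rust
/// Hypothetical backend that records what it is asked to do.
#[derive(Default)]
struct Recorder {
    statements: Vec<String>,
    tables: Vec<String>,
}

impl Recorder {
    /// Pretend to run SQL and return some rows.
    fn run(&mut self, sql: &str) -> Vec<String> {
        self.statements.push(sql.to_string());
        vec![format!("rows of: {sql}")]
    }

    /// Pretend to register an in-memory result under a table name.
    fn register(&mut self, name: &str, _rows: Vec<String>) {
        self.tables.push(name.to_string());
    }
}

/// Toy version of the wrapper: one primary, one cache.
struct CachingReader {
    primary: Recorder,
    cache: Recorder,
}

impl CachingReader {
    /// Source surface: base reads go to the primary.
    fn execute_sql(&mut self, sql: &str) -> Vec<String> {
        self.primary.run(sql)
    }

    /// Compute surface: derived SQL goes to the cache backend.
    fn execute_sql_cached(&mut self, sql: &str) -> Vec<String> {
        self.cache.run(sql)
    }

    /// Read the body from the primary, then register the rows in the cache.
    /// No CREATE statement is ever sent to the primary.
    fn materialize_table(&mut self, name: &str, body: &str) {
        let rows = self.execute_sql(body);
        self.cache.register(name, rows);
    }
}
```

After materializing a table and running a derived query, the primary's log contains only the base `SELECT`, while the registered table and the derived SQL appear only on the cache side.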

Grammar

The grammar has been changed to parse joins into structured nodes, so every joined table is discoverable. This allows for parser-based SQL rewriting.

CTE- and source-reference rewriting now runs off the tree-sitter parse tree instead of ad-hoc regex. This correctly handles comma joins as well as quoted, schema-qualified, whitespace-variant, and case-variant names.
Moved the SQL-structural helpers out of reader/data.rs into a new file parser/sql.rs.

Mixed-residency staging

When a query body mixes cache-resident tables (CTEs, ggsql: builtins) with primary base tables, the primary tables are
staged into the cache so the whole body can run on the compute surface. This allows one to run queries like,

-- @connect: odbc+duckdb://DSN=example_db_dsn
CREATE TEMP TABLE sales AS SELECT * FROM (VALUES
  ('North', 150), ('South', 250), ('East', 180)) AS t(region, revenue);
WITH targets AS (
  SELECT * FROM (VALUES ('North', 140), ('South', 260), ('East', 160)) AS v(region, target)
)
SELECT sales.region, sales.revenue - targets.target AS delta
FROM sales JOIN targets ON sales.region = targets.region
VISUALISE region AS x, delta AS y
DRAW point

sales lives on the primary (via the DSN). targets is a CTE, materialised in the local cache. The join mixes the two, so sales is transparently staged into the cache and the join runs locally.
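
The staging decision for a query body can be sketched as a small planning step. The `Plan` enum and `plan` function are hypothetical. The sketch assumes the list of referenced tables has already been extracted from the parse tree, and that a body touching only primary tables is run as a plain base read.

```rust
use std::collections::BTreeSet;

/// Hypothetical outcome of inspecting a query body's table references.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Every referenced table lives on the primary: run as a base read.
    RunOnPrimary,
    /// Every referenced table is cache-resident: run on the compute surface.
    RunOnCache,
    /// Mixed residency: stage these primary tables into the cache first,
    /// then run the whole body on the compute surface.
    StageThenRunOnCache(Vec<String>),
}

/// `referenced` holds the table names used by the body;
/// `cache_resident` holds names already in the cache (CTEs, builtins).
fn plan(referenced: &[&str], cache_resident: &BTreeSet<&str>) -> Plan {
    let to_stage: Vec<String> = referenced
        .iter()
        .filter(|t| !cache_resident.contains(*t))
        .map(|t| t.to_string())
        .collect();

    if to_stage.is_empty() {
        Plan::RunOnCache
    } else if to_stage.len() == referenced.len() {
        Plan::RunOnPrimary
    } else {
        Plan::StageThenRunOnCache(to_stage)
    }
}
```

For the example above, `plan(&["sales", "targets"], ...)` with `targets` cache-resident yields `StageThenRunOnCache(["sales"])`.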

Similarly, setup (CREATE/INSERT/UPDATE/DELETE) queries run before CTE materialisation and staging, so a query that creates a table and then uses it works correctly with caching enabled.
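
As a loose illustration of that ordering, the sketch below separates setup statements from the remaining query text. It deliberately uses naive keyword and `;` matching instead of the tree-sitter parse this PR relies on, so it would mis-split a `;` inside a string literal. Both function names are hypothetical.

```rust
/// True if the statement starts with one of the setup keywords named above.
fn is_setup(stmt: &str) -> bool {
    let first = stmt.split_whitespace().next().unwrap_or("");
    ["CREATE", "INSERT", "UPDATE", "DELETE"]
        .iter()
        .any(|k| first.eq_ignore_ascii_case(k))
}

/// Split a script on ';' into (setup statements, everything else),
/// keeping the original order within each group.
fn split_setup(script: &str) -> (Vec<&str>, Vec<&str>) {
    script
        .split(';')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .partition(|s| is_setup(s))
}
```

A caller would run the first group before materialising CTEs or staging tables, then handle the second group.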

The same routing applies to per-layer sources. Each layer can bring its own FROM, and with caching enabled those sources are resolved against the cache rather than the primary,
so you can point one layer at a 'local_file.csv' even when the primary is a remote database:

-- @connect: odbc+duckdb://DSN=example_db_dsn
VISUALISE
DRAW line  MAPPING month AS x, revenue AS y FROM sales
DRAW point MAPPING month AS x, target  AS y FROM 'targets.csv'

sales is read from the remote primary, targets.csv is read locally by the cache backend. Both layers render together; nothing is written back to the primary. Without caching, FROM 'targets.csv' would only work if the primary reader itself could read that path.

Bookkeeping

Initially based on #423 and jimhester#2. Jim should be added as a co-author on the final merge commit.

georgestagg added 15 commits June 29, 2026 14:23

Initial caching layer implementation

ae2f411

Explicit compute/source split for caching layer

6bf435e

Update changelog

5a43599

Apply changes from code review

589f179

Store cache metadata in the cache reader

78c856e

Back cache memo with a meta table

bd2b155

Add TTL + LRU byte-budget eviction to the cache

9e6958e

Add -- @uncache and make cell meta-commands line-oriented

fc162a6

Configure the cache via URI query parameters

4f2cd55

Namespace cache query params

55b04c2

Remove now-irrelevant test

e8b6991

Switch ordering in drop_entry

70d3872

Stage primary sources when executing a query with mixed residency tables

e750c14

Update ggsql grammar for JOIN and qualified window function

536a392

Improve SQL table-reference parsing

a423501

georgestagg marked this pull request as ready for review July 2, 2026 12:48

georgestagg wants to merge 15 commits into main from caching-layer
