Wrap Readers with an in-memory caching layer #496

Open
georgestagg wants to merge 15 commits into main from caching-layer

@georgestagg georgestagg commented Jun 29, 2026

Collaborator

Add an in-memory caching layer for readers

This PR adds a caching layer that wraps any reader with an in-memory caching reader, for now supporting duckdb or sqlite.
This allows us to visualise with read-only databases, and caching improves the experience for very slow remote databases.

A CachingReader wraps a primary reader, splitting the API into two surfaces:

  • source (execute_sql) — base reads of the user's data, hitting the primary.
  • compute (execute_sql_cached) — all dialect-generated / derived SQL over __ggsql_* tables, hitting the cache.

When caching is off, the default execute_sql_cached just calls execute_sql, so the single execution path works unchanged.

Cached reads are tracked in a __ggsql_cache_meta__ table inside the cache reader, itself queryable for introspection.
The cache is bounded by a TTL and an LRU budget. Entries older than the TTL are re-fetched, and once the total size exceeds the budget the least-recently-used entries are evicted until it fits.
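The TTL and LRU rules above can be sketched roughly as follows. This is a minimal in-memory model, not the PR's implementation: `CacheMeta`, `CacheEntry`, and their fields are hypothetical names and do not reflect the real `__ggsql_cache_meta__` schema. Time is passed in as plain seconds to keep the sketch deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical bookkeeping for one cached read (illustrative fields only).
#[derive(Debug, Clone)]
struct CacheEntry {
    bytes: u64,
    fetched_at: u64, // seconds: when the entry was last (re)fetched
    last_used: u64,  // seconds: when the entry was last read
}

/// Hypothetical in-memory model of the cache metadata.
struct CacheMeta {
    ttl_secs: u64,
    max_bytes: u64,
    entries: HashMap<String, CacheEntry>,
}

impl CacheMeta {
    fn new(ttl_secs: u64, max_bytes: u64) -> Self {
        CacheMeta { ttl_secs, max_bytes, entries: HashMap::new() }
    }

    fn total_bytes(&self) -> u64 {
        self.entries.values().map(|e| e.bytes).sum()
    }

    /// True when `key` is cached and no older than the TTL; a hit also
    /// refreshes the entry's recency. A false result means "re-fetch".
    fn is_fresh(&mut self, key: &str, now: u64) -> bool {
        let ttl = self.ttl_secs;
        if let Some(e) = self.entries.get_mut(key) {
            if now.saturating_sub(e.fetched_at) <= ttl {
                e.last_used = now;
                return true;
            }
        }
        false
    }

    /// Record a (re)fetch, then evict least-recently-used entries until the
    /// total size fits the budget. Returns the evicted keys in eviction order.
    fn insert(&mut self, key: &str, bytes: u64, now: u64) -> Vec<String> {
        self.entries.insert(
            key.to_string(),
            CacheEntry { bytes, fetched_at: now, last_used: now },
        );
        let mut evicted = Vec::new();
        while self.total_bytes() > self.max_bytes {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    self.entries.remove(&k);
                    evicted.push(k);
                }
                None => break,
            }
        }
        evicted
    }
}
```

In this sketch an entry that is larger than the whole budget ends up evicting itself; whether the real cache behaves that way is not stated here.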

Usage

Opt in with the --cache flag when using the ggsql CLI, or otherwise set a composite <primary>+<cache>:// connection string:

# simplest: cache the default duckdb reader
ggsql exec "<query>" --cache duckdb

# composite URI, with per-connection tuning
ggsql exec "<query>" --reader "odbc+duckdb://<dsn>?cache_ttl=600&cache_max_bytes=256mb"

In Jupyter et al., use a composite connection string:

-- @connect: odbc+duckdb://DSN=example_db_dsn
SELECT region, revenue FROM sales
VISUALISE region AS x, revenue AS y
DRAW bar

Force clear the cache mid-session with a meta-command:

-- @uncache

Configuration comes from env vars (GGSQL_CACHE_DISABLED, GGSQL_CACHE_TTL, GGSQL_CACHE_MAX_BYTES)
or connection URI query params (?cache_ttl=300&cache_max_bytes=32mb&cache_disabled=0).
Defaults: enabled, 300s TTL, 512 MB.
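As a rough illustration of how these settings could be resolved, the sketch below splits a composite URI and layers the options. Several things here are assumptions rather than documented behaviour: the precedence order (defaults, then env vars, then URI params), binary size units (1 MB = 1024 * 1024 bytes), the accepted truthy values, and the names `CacheConfig`, `parse_bytes`, `truthy`, and `parse_config`. Env vars are passed in as a map so the sketch stays deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical resolved configuration for one connection.
#[derive(Debug, PartialEq)]
struct CacheConfig {
    primary: String,
    cache: String,
    disabled: bool,
    ttl_secs: u64,
    max_bytes: u64,
}

/// Parse sizes such as "32mb" or "1048576" (binary units are an assumption).
fn parse_bytes(s: &str) -> Option<u64> {
    let s = s.trim().to_ascii_lowercase();
    let (digits, mult) = if let Some(n) = s.strip_suffix("kb") {
        (n, 1024)
    } else if let Some(n) = s.strip_suffix("mb") {
        (n, 1024 * 1024)
    } else if let Some(n) = s.strip_suffix("gb") {
        (n, 1024 * 1024 * 1024)
    } else {
        (s.as_str(), 1)
    };
    digits.trim().parse::<u64>().ok()?.checked_mul(mult)
}

/// Assumed truthy spellings for the "disabled" switch.
fn truthy(v: &str) -> bool {
    matches!(v, "1" | "true")
}

/// Resolve a `<primary>+<cache>://...` URI. Returns None for malformed input
/// or for a URI without a `+<cache>` part.
fn parse_config(uri: &str, env: &HashMap<String, String>) -> Option<CacheConfig> {
    let (scheme, rest) = uri.split_once("://")?;
    let (primary, cache) = scheme.split_once('+')?;
    // Defaults as stated in the description: enabled, 300s TTL, 512 MB.
    let mut cfg = CacheConfig {
        primary: primary.to_string(),
        cache: cache.to_string(),
        disabled: false,
        ttl_secs: 300,
        max_bytes: 512 * 1024 * 1024,
    };
    if let Some(v) = env.get("GGSQL_CACHE_DISABLED") {
        cfg.disabled = truthy(v);
    }
    if let Some(v) = env.get("GGSQL_CACHE_TTL") {
        cfg.ttl_secs = v.parse().ok()?;
    }
    if let Some(v) = env.get("GGSQL_CACHE_MAX_BYTES") {
        cfg.max_bytes = parse_bytes(v)?;
    }
    // URI params are applied last (assumed precedence).
    if let Some((_, query)) = rest.split_once('?') {
        for pair in query.split('&') {
            let (k, v) = pair.split_once('=')?;
            match k {
                "cache_disabled" => cfg.disabled = truthy(v),
                "cache_ttl" => cfg.ttl_secs = v.parse().ok()?,
                "cache_max_bytes" => cfg.max_bytes = parse_bytes(v)?,
                _ => {} // unrelated params are ignored in this sketch
            }
        }
    }
    Some(cfg)
}
```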

Reader trait changes

There are four new Reader methods, all with defaults so existing drivers are unaffected:

  • execute_sql_cached -- the compute surface. Defaults to just execute_sql.
  • materialize_table -- materialize a query body. Default is CREATE … TEMP TABLE on the reader.
  • caches_sources -- defaults to false, true for CachingReader.
  • clear_cache -- no-op by default, backs the -- @uncache meta-command.

Because the compute surface runs SQL on cache-resident __ggsql_* tables, ggsql must emit that SQL in the cache backend's dialect, not the primary's, so CachingReader::dialect() returns the cache dialect.

Read-only guarantee

When the cache is active, the primary connection is never written to:

  • materialize_table is overridden by CachingReader to register() the resulting data frame into the cache, rather than using temp tables.
  • All dialect-generated/derived SQL runs on the compute surface using execute_sql_cached, i.e. the cache backend.
  • The primary sees only base reads of the user's data.
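The trait defaults and the CachingReader overrides described above can be modelled with a simplified sketch. The signatures here are stand-ins and will differ from the real trait: results are plain strings, `LogReader` is a hypothetical test double that records what it receives, and its `register` takes a string instead of a data frame. The default `materialize_table` statement is a placeholder, since the exact `CREATE … TEMP TABLE` form is not spelled out in the description.

```rust
use std::cell::RefCell;

/// Simplified stand-in for the Reader trait (illustrative signatures).
trait Reader {
    fn dialect(&self) -> &str;
    /// Source surface: base reads of the user's data.
    fn execute_sql(&self, sql: &str) -> Result<String, String>;
    /// Compute surface: falls back to the source surface by default.
    fn execute_sql_cached(&self, sql: &str) -> Result<String, String> {
        self.execute_sql(sql)
    }
    /// Default: materialise via a temp table on this reader (placeholder SQL).
    fn materialize_table(&self, name: &str, body: &str) -> Result<(), String> {
        self.execute_sql(&format!("CREATE TEMP TABLE {} AS {}", name, body))
            .map(|_| ())
    }
    fn caches_sources(&self) -> bool {
        false
    }
    fn clear_cache(&self) {}
}

/// Hypothetical test double: records statements and registered tables.
struct LogReader {
    name: &'static str,
    statements: RefCell<Vec<String>>,
    tables: RefCell<Vec<String>>,
}

impl LogReader {
    fn new(name: &'static str) -> Self {
        LogReader { name, statements: RefCell::new(vec![]), tables: RefCell::new(vec![]) }
    }
    /// Stand-in for register(): load already-fetched data under a table name.
    fn register(&self, name: &str, _frame: String) {
        self.tables.borrow_mut().push(name.to_string());
    }
}

impl Reader for LogReader {
    fn dialect(&self) -> &str {
        self.name
    }
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        self.statements.borrow_mut().push(sql.to_string());
        Ok(format!("rows from {}", self.name))
    }
}

/// Wrapper that routes the two surfaces to different backends.
struct CachingReader {
    primary: LogReader,
    cache: LogReader,
}

impl Reader for CachingReader {
    fn dialect(&self) -> &str {
        self.cache.dialect() // derived SQL targets the cache backend
    }
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        self.primary.execute_sql(sql)
    }
    fn execute_sql_cached(&self, sql: &str) -> Result<String, String> {
        self.cache.execute_sql(sql)
    }
    fn materialize_table(&self, name: &str, body: &str) -> Result<(), String> {
        // Read on the source surface, then register into the cache: no temp
        // table is ever created on the primary.
        let frame = self.execute_sql(body)?;
        self.cache.register(name, frame);
        Ok(())
    }
    fn caches_sources(&self) -> bool {
        true
    }
    fn clear_cache(&self) {
        self.cache.tables.borrow_mut().clear();
    }
}
```

With this shape, a plain reader never needs to know the cache exists, and the only statements the primary's log ever contains are the base reads.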

Grammar

The grammar has been changed to parse joins as structured nodes, so every joined table is discoverable. This allows for parser-based SQL rewriting.

  • CTE- and source-reference rewriting now runs off the tree-sitter parse tree instead of ad-hoc regex. This correctly handles comma joins as well as quoted, schema-qualified, and whitespace- or case-variant names.

  • Moved the SQL-structural helpers out of reader/data.rs into a new file parser/sql.rs.

Mixed-residency staging

When a query body mixes cache-resident tables (CTEs, ggsql: builtins) with primary base tables, the primary tables are
staged into the cache so the whole body can run on the compute surface. This makes queries such as the following possible:

-- @connect: odbc+duckdb://DSN=example_db_dsn
CREATE TEMP TABLE sales AS SELECT * FROM (VALUES
  ('North', 150), ('South', 250), ('East', 180)) AS t(region, revenue);
WITH targets AS (
  SELECT * FROM (VALUES ('North', 140), ('South', 260), ('East', 160)) AS v(region, target)
)
SELECT sales.region, sales.revenue - targets.target AS delta
FROM sales JOIN targets ON sales.region = targets.region
VISUALISE region AS x, delta AS y
DRAW point

sales lives on the primary (via the DSN). targets is a CTE, materialised in the local cache. The join mixes the two, so sales is transparently staged into the cache and the join runs locally.
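The staging decision can be sketched as a small planning step over the tables a query body references. This is an illustration under assumptions, not the PR's code: `Plan` and `plan_body` are hypothetical names, the table names are taken to be already extracted from the parse tree, and sending a body with no cache-resident tables to the primary is an assumed reading of the source surface.

```rust
use std::collections::BTreeSet;

/// Hypothetical outcome of planning one query body.
#[derive(Debug, PartialEq)]
enum Plan {
    /// No cache-resident tables referenced: a plain base read.
    RunOnPrimary,
    /// Only cache-resident tables referenced.
    RunOnCache,
    /// Mixed: stage these primary tables into the cache, then run there.
    StageThenRunOnCache(Vec<String>),
}

/// Decide where a body runs, given the tables it references and the set of
/// names already resident in the cache (CTEs, builtins, staged tables).
fn plan_body(referenced: &[&str], cache_resident: &BTreeSet<&str>) -> Plan {
    // Primary tables are those not found in the cache; the set removes
    // duplicates and gives a stable order.
    let primary: BTreeSet<&str> = referenced
        .iter()
        .copied()
        .filter(|t| !cache_resident.contains(*t))
        .collect();
    let uses_cache = referenced.iter().any(|t| cache_resident.contains(*t));
    let staged: Vec<String> = primary.into_iter().map(String::from).collect();
    match (uses_cache, staged.is_empty()) {
        (true, false) => Plan::StageThenRunOnCache(staged),
        (true, true) => Plan::RunOnCache,
        (false, _) => Plan::RunOnPrimary,
    }
}
```

For the query above, the body references sales and targets, and only targets is cache-resident, so the sketch yields a plan that stages sales before running the join on the cache.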

Similarly, setup (CREATE/INSERT/UPDATE/DELETE) queries run before CTE materialisation and staging, so a query that creates a table and then uses it works with caching enabled.

The same routing applies to per-layer sources. Each layer can bring its own FROM, and with caching enabled those sources are resolved against the cache rather than the primary,
so you can point one layer at a local file such as 'local_file.csv' even when the primary is a remote database:

-- @connect: odbc+duckdb://DSN=example_db_dsn
VISUALISE
DRAW line  MAPPING month AS x, revenue AS y FROM sales
DRAW point MAPPING month AS x, target  AS y FROM 'targets.csv'

sales is read from the remote primary, targets.csv is read locally by the cache backend. Both layers render together; nothing is written back to the primary. Without caching, FROM 'targets.csv' would only work if the primary reader itself could read that path.

Bookkeeping

Initially based on #423 and jimhester#2. Jim should be added as a co-author on the final merge commit.

@georgestagg georgestagg marked this pull request as ready for review July 2, 2026 12:48