Wrap Readers with an in-memory caching layer #496
Open
georgestagg wants to merge 15 commits into
Add an in-memory caching layer for readers
This PR adds a caching layer that wraps any reader with an in-memory caching reader, for now supporting duckdb or sqlite.
This allows us to visualise with read-only databases, and caching improves the experience for very slow remote databases.
A `CachingReader` wraps a primary reader, splitting the API into two surfaces:
- `execute_sql`: base reads of the user's data, hitting the primary.
- `execute_sql_cached`: the compute surface, covering all dialect-generated / derived SQL over `__ggsql_*` tables, hitting the cache.
When caching is off, the default `execute_sql_cached` just calls `execute_sql`, so the single execution path works unchanged.
Cached reads are tracked in a `__ggsql_cache_meta__` table inside the cache reader, itself queryable for introspection.
The cache is bounded by a TTL and an LRU budget. Entries older than the TTL are re-fetched, and once the total size exceeds the budget the least-recently-used entries are evicted until it fits.
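The two-surface split described above can be pictured with a minimal, std-only sketch. The trait here is a simplified stand-in, not the PR's actual `Reader` trait: results are plain strings, and the `Labelled` helper type is hypothetical and exists only to make the routing visible.

```rust
/// Simplified stand-in for the `Reader` trait (assumption: string results).
trait Reader {
    /// Base reads of the user's data.
    fn execute_sql(&self, sql: &str) -> String;

    /// Compute surface for derived SQL. By default it is the same path,
    /// which is what keeps drivers without a cache working unchanged.
    fn execute_sql_cached(&self, sql: &str) -> String {
        self.execute_sql(sql)
    }
}

/// Hypothetical reader that tags its output with a label.
struct Labelled(&'static str);

impl Reader for Labelled {
    fn execute_sql(&self, sql: &str) -> String {
        format!("{}: {}", self.0, sql)
    }
}

/// Wraps a primary reader and a cache reader.
struct CachingReader<P: Reader, C: Reader> {
    primary: P,
    cache: C,
}

impl<P: Reader, C: Reader> Reader for CachingReader<P, C> {
    /// Base reads go to the primary.
    fn execute_sql(&self, sql: &str) -> String {
        self.primary.execute_sql(sql)
    }

    /// Derived SQL goes to the cache backend instead.
    fn execute_sql_cached(&self, sql: &str) -> String {
        self.cache.execute_sql(sql)
    }
}
```

With a bare `Labelled("primary")`, both methods end up on the same reader. Wrapped in a `CachingReader`, the same two calls are routed to different backends without the caller changing.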
Usage
Opt in with the `--cache` flag when using the ggsql CLI, or otherwise set a composite `<primary>+<cache>://` connection string.
In Jupyter et al., use a composite connection string.
Force clear the cache mid-session with a meta-command:
`-- @uncache`
Configuration comes from env vars (`GGSQL_CACHE_DISABLED`, `GGSQL_CACHE_TTL`, `GGSQL_CACHE_MAX_BYTES`) or connection URI query params (`?cache_ttl=300&cache_max_bytes=32mb&cache_disabled=0`).
Defaults: enabled, 300s TTL, 512 MB.
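To make the two tunables concrete, here is a rough sketch of how a TTL and a byte budget could bound a cache. It is not the PR's implementation: `CacheIndex`, `Entry`, and their fields are hypothetical names, and time is passed in as plain seconds to keep the example deterministic.

```rust
use std::collections::HashMap;

/// Metadata for one cached entry (hypothetical fields).
struct Entry {
    bytes: u64,
    fetched_at: u64, // seconds: when the data was read from the primary
    last_used: u64,  // seconds: most recent access
}

/// Hypothetical index enforcing a TTL and an LRU byte budget.
struct CacheIndex {
    ttl_secs: u64,
    max_bytes: u64,
    entries: HashMap<String, Entry>,
}

impl CacheIndex {
    fn new(ttl_secs: u64, max_bytes: u64) -> Self {
        CacheIndex { ttl_secs, max_bytes, entries: HashMap::new() }
    }

    /// True if `key` is cached and no older than the TTL; marks it as used.
    /// A `false` result means the caller should re-fetch from the primary.
    fn is_fresh(&mut self, key: &str, now: u64) -> bool {
        let ttl = self.ttl_secs;
        if let Some(e) = self.entries.get_mut(key) {
            if now.saturating_sub(e.fetched_at) <= ttl {
                e.last_used = now;
                return true;
            }
        }
        false
    }

    /// Records a (re-)fetched entry, then evicts least-recently-used
    /// entries until the total fits the budget. Returns the evicted keys.
    fn insert(&mut self, key: &str, bytes: u64, now: u64) -> Vec<String> {
        self.entries.insert(
            key.to_string(),
            Entry { bytes, fetched_at: now, last_used: now },
        );
        let mut evicted = Vec::new();
        while self.total_bytes() > self.max_bytes {
            // The victim is the entry with the oldest `last_used` timestamp.
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    self.entries.remove(&k);
                    evicted.push(k);
                }
                None => break,
            }
        }
        evicted
    }

    fn total_bytes(&self) -> u64 {
        self.entries.values().map(|e| e.bytes).sum()
    }
}
```

In this sketch an entry larger than the whole budget evicts everything, itself included. Whether the PR behaves that way is not stated, so treat it as an assumption.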
Reader trait changes
There are four new `Reader` methods, all with defaults so existing drivers are unaffected:
- `execute_sql_cached` -- the compute surface. Defaults to just `execute_sql`.
- `materialize_table` -- materialize a query body. Default is `CREATE … TEMP TABLE` on the reader.
- `caches_sources` -- defaults to `false`, `true` for `CachingReader`.
- `clear_cache` -- no-op by default, backs the `-- @uncache` meta-command.
Because SQL runs on cache-resident `__ggsql_*` tables via the compute surface, it must be emitted by ggsql in the cache backend's dialect, not the primary's, so `CachingReader::dialect()` returns the cache dialect.
Read-only guarantee
When the cache is active, the primary connection is never written to:
- `materialize_table` is overridden by `CachingReader` to `register()` the resulting data frame into the cache, rather than using temp tables.
- `execute_sql_cached`, i.e. the cache backend.
Grammar
The grammar has been changed to parse joins into structured nodes, so every joined table is discoverable. This allows for parser-based SQL rewriting.
CTE- and source-reference rewriting now runs off the tree-sitter parse tree instead of ad-hoc regex. This correctly handles comma joins as well as quoted, schema-qualified, whitespace-variant, and case-variant names.
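The benefit of rewriting from a parse tree can be illustrated with span-based edits. The sketch below does not use tree-sitter: it assumes the byte ranges of table references have already been obtained from parse-tree nodes and only shows the replacement step. `Edit`, `apply_edits`, and the `__ggsql_src_*` names are hypothetical.

```rust
/// One rewrite: replace the bytes in `start..end` with `replacement`.
/// The ranges stand in for parse-tree node spans (assumption).
struct Edit {
    start: usize,
    end: usize,
    replacement: String,
}

/// Applies non-overlapping edits to `sql`, working back to front so that
/// earlier byte offsets stay valid while later spans are replaced.
fn apply_edits(sql: &str, mut edits: Vec<Edit>) -> String {
    edits.sort_by_key(|e| e.start);
    let mut out = sql.to_string();
    for e in edits.iter().rev() {
        out.replace_range(e.start..e.end, &e.replacement);
    }
    out
}
```

Because each edit targets an exact span, a reference such as `sales` in a comma join can be renamed without touching a quoted name like `"My Sales"` that a pattern match might also hit.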
Moved the SQL-structural helpers out of `reader/data.rs` into a new file `parser/sql.rs`.
Mixed-residency staging
When a query body mixes cache-resident tables (CTEs, `ggsql:` builtins) with primary base tables, the primary tables are staged into the cache so the whole body can run on the compute surface. This allows one to run queries that join the two. For example, when `sales` lives on the primary (via the DSN) and `targets` is a CTE materialised in the local cache, the join mixes the two, so `sales` is transparently staged into the cache and the join runs locally.
Similarly, setup (CREATE/INSERT/UPDATE/DELETE) queries run before CTE materialisation and staging, so a query that creates a table and then uses it works OK with caching enabled.
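The staging decision described in this section can be sketched as a small routing function. This is an illustration under assumptions, not the PR's code: `Plan` and `plan` are hypothetical names, the table references are assumed to come from the parser, and sending a body with no cache-resident tables straight to the primary is a guess at the non-mixed case.

```rust
use std::collections::HashSet;

/// Where a query body should run (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Plan {
    /// No cache-resident tables involved: run on the primary as-is.
    Primary,
    /// Run on the cache after staging the listed primary tables into it.
    Cache { stage: Vec<String> },
}

/// Decides routing from the table names a query body references and the
/// set of names already resident in the cache (CTEs, builtins).
fn plan(referenced: &[&str], cache_resident: &HashSet<&str>) -> Plan {
    let any_cached = referenced.iter().any(|t| cache_resident.contains(t));
    if !any_cached {
        return Plan::Primary;
    }
    // Mixed or fully cached body: stage each primary table once.
    let mut stage: Vec<String> = Vec::new();
    for t in referenced {
        let already_listed = stage.iter().any(|s| s.as_str() == *t);
        if !cache_resident.contains(t) && !already_listed {
            stage.push((*t).to_string());
        }
    }
    Plan::Cache { stage }
}
```

For the `sales` and `targets` case above, with `targets` resident in the cache, the function returns `Plan::Cache` with `sales` as the only table to stage.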
The same routing applies to per-layer sources. Each layer can bring its own `FROM`, and with caching enabled those sources are resolved against the cache rather than the primary, so you can point one layer at a local CSV file even when the primary is a remote database. For example, with one layer reading `sales` and another using `FROM 'targets.csv'`, `sales` is read from the remote primary and `targets.csv` is read locally by the cache backend. Both layers render together; nothing is written back to the primary. Without caching, `FROM 'targets.csv'` would only work if the primary reader itself could read that path.
Bookkeeping
Initially based on #423 and jimhester#2. Jim should be added as a co-author on the final merge commit.