feat: support CLUSTER BY [AUTO, NONE] for Databricks #5846
…id clustering

Adds parser, validator, and Databricks adapter support for the keyword forms of liquid clustering. Bare AUTO/NONE (unquoted VAR tokens) are recognised as keywords; backtick-quoted `auto`/`none` and parenthesised forms remain real column references.

- Add LIQUID_CLUSTERING_KEYWORDS constant to avoid repeating the sentinel set across dialect, meta, definition, and adapter
- Parser (dialect.py): detect VAR-token AUTO/NONE on clustered_by; strip Paren from single-column clustered_by to match partitioned_by normalisation
- Validator (meta.py): normalise single string input to list; restore keyword sentinels from JSON strings on deserialisation; skip column-count check for keywords, gated on clustered_by + databricks
- validate_definition (definition.py): skip keyword sentinels in the column-existence check, same gate
- Adapter (databricks.py): emit CLUSTER BY AUTO / CLUSTER BY NONE without a tuple wrapper; raise ValueError on unexpected bare Var
- Tests: parser round-trips, Python API (exp.Var and plain string), backtick-quoted columns, render_definition, JSON round-trip, non-Databricks rejection, mixed-list behaviour, adapter SQL emission

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: EhabEasee <ehab.elbadrawi@easee.com>
6f3e9a9 to 4f29141
@EhabEasee Thanks for this PR! Not trying to be nit-picky, but here's a few items:
Let me know if I'm missing anything!

@StuffbyYuki both comments make sense and I've made the updates. However, the comment in the docs feels misplaced and easy to miss. I was considering adding it in the Databricks engine docs but couldn't find a reasonable place to add it. Do you have any suggestions on a more relevant place to add that note? The StarRocks docs seem to have something similar so I could imitate that?

@EhabEasee thanks! Yeah I don't think it has to be that big block like starrocks docs do, but I just figured adding something somewhere in the docs might be helpful! I'll let you decide where and how to put it on the docs
… clustered_by docs" This reverts commit bb70305.
@StuffbyYuki I added a new section to the databricks integration docs. Let me know if you have any more feedback |
Description
Databricks supports two keyword forms of liquid clustering that don't take column arguments:

- `CLUSTER BY AUTO`: lets Databricks automatically select clustering columns
- `CLUSTER BY NONE`: disables liquid clustering on a table

Previously, SQLMesh had no way to express these in a model definition. This PR adds support for both.

`constants.py`: Adds `LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})` as a shared constant used across the parser, validator, and adapter.

Parsing (`dialect.py`): The `clustered_by` property parser now recognises bare `AUTO` and `NONE` tokens (unquoted `VAR` tokens) as liquid clustering keywords rather than column references. Backtick-quoted `auto` / `none` are still treated as regular column names, preserving backwards compatibility for columns that happen to share those names.

Validation (`meta.py`): A single string passed to `clustered_by` is normalised to a list before processing. The validator then skips the column-count check for `exp.Var(AUTO|NONE)`, but only when the field is `clustered_by` and the dialect is `databricks`. On deserialisation from JSON, keyword strings are restored to `exp.Var` sentinels before `list_of_fields_validator` can normalise them into quoted columns.

Validation (`definition.py`): The `validate_definition` column-existence check skips keyword sentinels for the same `clustered_by` + `databricks` scope.

Code generation (`databricks.py`): `_build_table_properties_exp` detects a single `exp.Var` in `clustered_by` (guarded by a `ValueError` if the Var holds an unexpected value), and emits `CLUSTER BY AUTO` / `CLUSTER BY NONE` without wrapping in a tuple. Multi-column paths are unchanged.

Usage:
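
As a rough, self-contained illustration of the behaviour described above, the sketch below models how a `clustered_by` value could be classified and then rendered for Databricks. It is not the PR's code: `Keyword`, `Column`, `parse_clustered_by`, and `render_cluster_by` are hypothetical stand-ins for `exp.Var`, column expressions, the `dialect.py` parser, and `_build_table_properties_exp`. Case-insensitive keyword matching and backtick-quoted column output are assumptions, and the `clustered_by` + `databricks` gate is left out for brevity.

```python
from dataclasses import dataclass

# Same value as the shared constant named in the description.
LIQUID_CLUSTERING_KEYWORDS = frozenset({"AUTO", "NONE"})


@dataclass(frozen=True)
class Keyword:
    """Hypothetical stand-in for an exp.Var keyword sentinel."""
    name: str


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column reference."""
    name: str


def parse_clustered_by(raw: str) -> list:
    """Classify a clustered_by value as it might be written in a model."""
    text = raw.strip()
    # A bare, unquoted AUTO / NONE is a keyword (case-insensitivity is assumed).
    if text.upper() in LIQUID_CLUSTERING_KEYWORDS:
        return [Keyword(text.upper())]
    # Anything else is a column list; parentheses are optional for one column.
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    return [Column(part.strip().strip("`")) for part in text.split(",")]


def render_cluster_by(entries: list) -> str:
    """Render the CLUSTER BY clause for the classified entries."""
    if len(entries) == 1 and isinstance(entries[0], Keyword):
        name = entries[0].name
        # Guard against a sentinel that is not a known keyword.
        if name not in LIQUID_CLUSTERING_KEYWORDS:
            raise ValueError(f"Unexpected clustering keyword: {name}")
        # Keyword forms are emitted without a tuple wrapper.
        return f"CLUSTER BY {name}"
    columns = ", ".join(f"`{entry.name}`" for entry in entries)
    return f"CLUSTER BY ({columns})"


print(render_cluster_by(parse_clustered_by("AUTO")))
print(render_cluster_by(parse_clustered_by("`auto`")))
print(render_cluster_by(parse_clustered_by("(a, b)")))
```

In this sketch, quoting or parenthesising a name is what turns it back into a column, so a mixed list such as `(a, AUTO)` yields two columns rather than a column plus a keyword.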

Via the Python API, both a plain string and `exp.Var` are accepted:

Columns with the names `auto` or `none` are still supported via backtick quoting:

Test Plan
- `tests/core/test_dialect.py`: parser round-trips for `AUTO`/`NONE` keywords, backtick-quoted columns, paren-wrapped single columns, multi-column lists, mixed list `(a, AUTO)`, non-Databricks dialect
- `tests/core/test_model.py`: model DDL; Python API with both `exp.Var` and plain string; backtick-quoted column names; `render_definition` output; JSON serialisation round-trip; non-Databricks dialect rejection; mixed-list column treatment
- `tests/core/engine_adapter/test_databricks.py`: adapter emits `CLUSTER BY AUTO` / `CLUSTER BY NONE` without column parens

Checklist
- Ran `make style` and fixed any issues
- Tests pass (`make fast-test`)
- Commits signed off (`git commit -s`) per the DCO