Skip to content

refactor(hash-aggr): Simplify aggregate hash table#23309

Open
2010YOUY01 wants to merge 3 commits into
apache:mainfrom
2010YOUY01:split-aggr-simplify-ht
Open

refactor(hash-aggr): Simplify aggregate hash table#23309
2010YOUY01 wants to merge 3 commits into
apache:mainfrom
2010YOUY01:split-aggr-simplify-ht

Conversation

@2010YOUY01

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

Part of #22710

Addressing @alamb ’s comment from #23181 (review)

The motivation behind the EPIC was that the existing aggregation code had too many internal flags and was reused by many different execution paths, making it hard to maintain. The previous refactor made two main changes:

  1. Split different aggregation modes, such as partial and final, into separate streams.
  2. Split the hash tables used by different aggregation modes into separate implementations.

Now that the refactor is almost done, with only single aggregation mode and spilling behavior left, I think decision 1 was still solid, while decision 2 was an overcorrection.

A module usually becomes hard to maintain when it has many tightly coupled internal flags. In the old GroupsHashAggregateStream case, there were ~10 such flags. But a small number of flags, I think <= 3, is still clean and manageable.

This PR unifies the aggregate hash tables used by different modes into a single implementation. The behavioral differences are controlled by one internal mode flag. See the comments in common.rs: struct AggregateHashTable, for details.

Since this refactor is close to completion, I can confirm the internal complexity of AggregateHashTable is not likely to bloat in the long term, so unifying these implementations seems like the better design.

What changes are included in this PR?

Replace separated aggregate hash table implementation with a unified one

Are these changes tested?

Existing tests

Are there any user-facing changes?

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01 -- I think this looks like a nice cleanup / consolidation to me

}

/// Marker for ordered raw rows -> partial state aggregation.
pub(in crate::aggregates) struct PartialMarker;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you plan to remove this PartialMarker for the ordered streams too? Maybe we could do the same simplification

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this can be done similarly.

But I think there is a better fast path to implement for fully ordered case (query group by a,b, input order order by a,b), this idea can be implemented in a completely separated path without depending on the aggregate hash tables; while the partially ordered case still relies on hash table, and only do early emit optimization.

So I think probably its easier to do after we have further optimized the full order fast path.

Filed #23318 to track.

// table semantics. Consider remove `AggregateMode` and only use `AggregateTableMode`
// after the refactor has finished.
//
// Issue: <https://github.com/apache/datafusion/pull/22729>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a PR (not an issue) -- is it the link you intended?

Note I agree with the fact that AggregateMode seems overly complicated. I vaguely remember trying to remove it once but I can't find the PR now (maybe I never pushed it)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a wrong link. Fixed

.iter_mut()
.zip(evaluated_batch.accumulator_args.iter())
{
match mode {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it probably doesn't matter as this is one match per column, but I wonder if we should do the match outside and the loop inside 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this way seems a little bit easier to read: 6871dc7


/// Hash table used only for converting raw input rows directly into partial
/// aggregate state rows after partial aggregation has been skipped.
pub(in crate::aggregates) struct PartialSkipHashTable {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like that this has a named newtype wrapper

@alamb

alamb commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

run benchmarks

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4879357464-843-5qvcm 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing split-aggr-simplify-ht (7f68c23) to e4aa41d (merge-base) diff using: tpch
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4879357464-842-kqn89 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing split-aggr-simplify-ht (7f68c23) to e4aa41d (merge-base) diff using: tpcds
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4879357464-841-jrdm8 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing split-aggr-simplify-ht (7f68c23) to e4aa41d (merge-base) diff using: clickbench_partitioned
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

Comparing HEAD and split-aggr-simplify-ht
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                           HEAD ┃         split-aggr-simplify-ht ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 38.84 / 41.07 ±1.41 / 42.40 ms │ 38.64 / 39.49 ±1.06 / 41.53 ms │     no change │
│ QQuery 2  │ 19.52 / 20.54 ±0.75 / 21.51 ms │ 19.25 / 19.69 ±0.32 / 20.08 ms │     no change │
│ QQuery 3  │ 31.49 / 33.63 ±1.32 / 34.79 ms │ 31.58 / 34.32 ±1.53 / 36.22 ms │     no change │
│ QQuery 4  │ 17.63 / 18.12 ±0.28 / 18.47 ms │ 17.97 / 18.36 ±0.51 / 19.36 ms │     no change │
│ QQuery 5  │ 40.19 / 42.56 ±1.41 / 44.13 ms │ 38.63 / 41.15 ±1.31 / 42.33 ms │     no change │
│ QQuery 6  │ 16.53 / 16.79 ±0.20 / 17.11 ms │ 16.45 / 16.54 ±0.06 / 16.62 ms │     no change │
│ QQuery 7  │ 45.55 / 48.36 ±2.17 / 51.38 ms │ 43.79 / 45.84 ±1.48 / 48.29 ms │ +1.05x faster │
│ QQuery 8  │ 44.16 / 44.31 ±0.16 / 44.53 ms │ 43.60 / 43.76 ±0.13 / 43.97 ms │     no change │
│ QQuery 9  │ 50.66 / 51.30 ±0.58 / 52.24 ms │ 50.66 / 51.42 ±0.52 / 51.94 ms │     no change │
│ QQuery 10 │ 43.09 / 43.82 ±1.04 / 45.84 ms │ 43.12 / 43.44 ±0.31 / 43.85 ms │     no change │
│ QQuery 11 │ 14.04 / 14.70 ±0.80 / 16.24 ms │ 13.91 / 13.97 ±0.06 / 14.05 ms │     no change │
│ QQuery 12 │ 24.41 / 26.10 ±2.41 / 30.84 ms │ 24.58 / 24.76 ±0.19 / 25.10 ms │ +1.05x faster │
│ QQuery 13 │ 34.31 / 36.23 ±1.66 / 39.33 ms │ 32.50 / 34.47 ±1.16 / 35.80 ms │     no change │
│ QQuery 14 │ 24.58 / 25.34 ±1.11 / 27.51 ms │ 24.59 / 25.15 ±0.54 / 26.09 ms │     no change │
│ QQuery 15 │ 31.86 / 32.68 ±0.77 / 33.99 ms │ 32.03 / 32.70 ±0.50 / 33.56 ms │     no change │
│ QQuery 16 │ 14.04 / 14.40 ±0.26 / 14.77 ms │ 14.61 / 14.76 ±0.12 / 14.96 ms │     no change │
│ QQuery 17 │ 74.24 / 75.68 ±1.10 / 76.95 ms │ 76.21 / 77.13 ±1.30 / 79.71 ms │     no change │
│ QQuery 18 │ 59.40 / 60.83 ±1.27 / 62.74 ms │ 60.90 / 62.32 ±1.46 / 64.28 ms │     no change │
│ QQuery 19 │ 33.73 / 33.97 ±0.21 / 34.25 ms │ 33.90 / 34.10 ±0.21 / 34.49 ms │     no change │
│ QQuery 20 │ 33.20 / 33.31 ±0.12 / 33.50 ms │ 33.00 / 33.25 ±0.42 / 34.08 ms │     no change │
│ QQuery 21 │ 57.04 / 58.10 ±1.01 / 59.81 ms │ 57.58 / 58.61 ±0.93 / 60.15 ms │     no change │
│ QQuery 22 │ 14.42 / 14.55 ±0.09 / 14.63 ms │ 14.08 / 14.75 ±0.48 / 15.49 ms │     no change │
└───────────┴────────────────────────────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary                     ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 786.41ms │
│ Total Time (split-aggr-simplify-ht)   │ 779.96ms │
│ Average Time (HEAD)                   │  35.75ms │
│ Average Time (split-aggr-simplify-ht) │  35.45ms │
│ Queries Faster                        │        2 │
│ Queries Slower                        │        0 │
│ Queries with No Change                │       20 │
│ Queries with Failure                  │        0 │
└───────────────────────────────────────┴──────────┘

Resource Usage

tpch — base (merge-base)

Metric Value
Wall time 5.0s
Peak memory 1.1 GiB
Avg memory 507.1 MiB
CPU user 22.8s
CPU sys 1.8s
Peak spill 0 B

tpch — branch

Metric Value
Wall time 5.0s
Peak memory 1.1 GiB
Avg memory 501.2 MiB
CPU user 23.0s
CPU sys 1.6s
Peak spill 0 B

File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

Comparing HEAD and split-aggr-simplify-ht
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃                split-aggr-simplify-ht ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │           5.52 / 6.00 ±0.86 / 7.72 ms │           5.58 / 6.08 ±0.86 / 7.79 ms │     no change │
│ QQuery 2  │        82.18 / 82.38 ±0.15 / 82.59 ms │        80.74 / 81.00 ±0.21 / 81.25 ms │     no change │
│ QQuery 3  │        29.51 / 29.76 ±0.24 / 30.22 ms │        29.55 / 29.83 ±0.18 / 30.06 ms │     no change │
│ QQuery 4  │     485.95 / 492.28 ±3.32 / 495.00 ms │    545.86 / 566.81 ±10.66 / 575.63 ms │  1.15x slower │
│ QQuery 5  │        51.91 / 52.16 ±0.31 / 52.75 ms │        51.60 / 52.96 ±1.44 / 55.56 ms │     no change │
│ QQuery 6  │        36.54 / 37.18 ±0.37 / 37.57 ms │        36.19 / 36.67 ±0.25 / 36.92 ms │     no change │
│ QQuery 7  │        94.56 / 95.36 ±0.67 / 96.37 ms │       93.47 / 96.31 ±2.59 / 100.82 ms │     no change │
│ QQuery 8  │        37.01 / 38.27 ±1.76 / 41.73 ms │        36.56 / 36.99 ±0.52 / 38.01 ms │     no change │
│ QQuery 9  │        53.01 / 55.13 ±1.76 / 58.34 ms │        52.87 / 53.92 ±0.82 / 55.17 ms │     no change │
│ QQuery 10 │        62.91 / 63.38 ±0.29 / 63.72 ms │        63.82 / 64.55 ±0.67 / 65.77 ms │     no change │
│ QQuery 11 │     305.88 / 315.55 ±6.94 / 326.16 ms │     297.00 / 301.90 ±2.66 / 305.12 ms │     no change │
│ QQuery 12 │        29.31 / 29.61 ±0.19 / 29.78 ms │        28.55 / 28.81 ±0.14 / 28.96 ms │     no change │
│ QQuery 13 │     119.63 / 120.33 ±0.89 / 122.06 ms │     118.62 / 120.18 ±1.21 / 122.34 ms │     no change │
│ QQuery 14 │     413.99 / 419.88 ±3.84 / 425.70 ms │     413.96 / 418.42 ±3.74 / 425.29 ms │     no change │
│ QQuery 15 │        57.85 / 58.62 ±0.64 / 59.56 ms │        57.10 / 58.67 ±1.10 / 60.16 ms │     no change │
│ QQuery 16 │           6.84 / 6.96 ±0.16 / 7.28 ms │           6.75 / 6.93 ±0.21 / 7.35 ms │     no change │
│ QQuery 17 │        80.93 / 82.30 ±1.95 / 86.15 ms │        80.99 / 82.38 ±1.28 / 84.08 ms │     no change │
│ QQuery 18 │     124.44 / 125.71 ±1.02 / 127.04 ms │     123.43 / 125.26 ±1.42 / 127.13 ms │     no change │
│ QQuery 19 │        41.54 / 41.66 ±0.10 / 41.82 ms │        41.67 / 42.59 ±0.94 / 44.40 ms │     no change │
│ QQuery 20 │        35.80 / 36.97 ±1.83 / 40.62 ms │        35.84 / 37.01 ±0.99 / 38.43 ms │     no change │
│ QQuery 21 │        17.59 / 18.04 ±0.31 / 18.48 ms │        18.02 / 18.18 ±0.15 / 18.44 ms │     no change │
│ QQuery 22 │        62.39 / 62.90 ±0.42 / 63.66 ms │        62.30 / 64.02 ±1.21 / 65.42 ms │     no change │
│ QQuery 23 │     348.97 / 352.54 ±2.80 / 356.84 ms │     341.57 / 347.49 ±3.94 / 351.81 ms │     no change │
│ QQuery 24 │     226.10 / 231.53 ±6.39 / 241.59 ms │     225.63 / 227.34 ±2.82 / 232.94 ms │     no change │
│ QQuery 25 │     118.74 / 119.43 ±0.45 / 120.15 ms │     110.17 / 111.75 ±1.12 / 113.56 ms │ +1.07x faster │
│ QQuery 26 │        62.07 / 63.03 ±0.60 / 63.94 ms │        58.68 / 59.59 ±0.70 / 60.83 ms │ +1.06x faster │
│ QQuery 27 │           7.11 / 7.21 ±0.17 / 7.55 ms │           6.38 / 6.51 ±0.17 / 6.84 ms │ +1.11x faster │
│ QQuery 28 │        63.64 / 64.12 ±0.57 / 65.18 ms │        61.00 / 61.70 ±0.68 / 62.90 ms │     no change │
│ QQuery 29 │     102.26 / 107.95 ±6.86 / 120.47 ms │        97.00 / 97.57 ±0.68 / 98.89 ms │ +1.11x faster │
│ QQuery 30 │        36.18 / 36.86 ±0.55 / 37.84 ms │        33.04 / 34.74 ±1.95 / 38.56 ms │ +1.06x faster │
│ QQuery 31 │     117.80 / 120.84 ±1.74 / 122.86 ms │     111.98 / 112.23 ±0.20 / 112.54 ms │ +1.08x faster │
│ QQuery 32 │        21.78 / 22.68 ±0.68 / 23.83 ms │        20.53 / 21.00 ±0.53 / 21.85 ms │ +1.08x faster │
│ QQuery 33 │        38.92 / 39.55 ±0.65 / 40.46 ms │        37.87 / 38.33 ±0.33 / 38.64 ms │     no change │
│ QQuery 34 │        10.56 / 10.97 ±0.43 / 11.75 ms │         9.93 / 11.72 ±2.70 / 17.04 ms │  1.07x slower │
│ QQuery 35 │        77.97 / 78.29 ±0.19 / 78.51 ms │        72.44 / 73.35 ±0.70 / 74.48 ms │ +1.07x faster │
│ QQuery 36 │           6.29 / 6.43 ±0.15 / 6.72 ms │           6.03 / 6.15 ±0.16 / 6.47 ms │     no change │
│ QQuery 37 │           7.53 / 7.58 ±0.05 / 7.68 ms │           7.04 / 7.17 ±0.11 / 7.35 ms │ +1.06x faster │
│ QQuery 38 │        62.93 / 65.41 ±2.01 / 67.84 ms │        61.85 / 62.52 ±0.57 / 63.38 ms │     no change │
│ QQuery 39 │        90.78 / 91.98 ±0.84 / 93.17 ms │        89.98 / 91.18 ±0.91 / 92.76 ms │     no change │
│ QQuery 40 │        23.69 / 24.10 ±0.30 / 24.39 ms │        23.73 / 24.24 ±0.37 / 24.83 ms │     no change │
│ QQuery 41 │        11.55 / 11.66 ±0.12 / 11.89 ms │        11.57 / 11.79 ±0.24 / 12.25 ms │     no change │
│ QQuery 42 │        23.75 / 24.13 ±0.24 / 24.41 ms │        23.77 / 24.20 ±0.35 / 24.77 ms │     no change │
│ QQuery 43 │           4.91 / 5.02 ±0.16 / 5.34 ms │           5.12 / 5.22 ±0.15 / 5.52 ms │     no change │
│ QQuery 44 │           9.34 / 9.42 ±0.07 / 9.51 ms │           9.50 / 9.67 ±0.11 / 9.85 ms │     no change │
│ QQuery 45 │        39.77 / 42.79 ±3.12 / 48.83 ms │        37.80 / 38.44 ±0.88 / 40.17 ms │ +1.11x faster │
│ QQuery 46 │        11.89 / 12.49 ±0.50 / 13.17 ms │        11.88 / 12.13 ±0.17 / 12.38 ms │     no change │
│ QQuery 47 │     231.00 / 233.21 ±2.95 / 238.89 ms │    228.03 / 261.78 ±26.13 / 293.45 ms │  1.12x slower │
│ QQuery 48 │        96.83 / 97.63 ±0.57 / 98.58 ms │      98.69 / 101.32 ±2.00 / 104.57 ms │     no change │
│ QQuery 49 │        75.98 / 77.31 ±0.93 / 78.91 ms │        79.49 / 81.60 ±2.52 / 86.50 ms │  1.06x slower │
│ QQuery 50 │        58.86 / 59.35 ±0.35 / 59.88 ms │        61.97 / 62.93 ±0.61 / 63.81 ms │  1.06x slower │
│ QQuery 51 │        91.42 / 94.01 ±2.92 / 99.21 ms │       97.24 / 99.03 ±1.40 / 100.82 ms │  1.05x slower │
│ QQuery 52 │        23.87 / 24.75 ±0.85 / 25.92 ms │        25.42 / 26.15 ±0.89 / 27.86 ms │  1.06x slower │
│ QQuery 53 │        29.65 / 29.96 ±0.29 / 30.49 ms │        30.38 / 30.84 ±0.41 / 31.52 ms │     no change │
│ QQuery 54 │        55.55 / 55.88 ±0.33 / 56.46 ms │        57.37 / 57.84 ±0.38 / 58.26 ms │     no change │
│ QQuery 55 │        23.24 / 23.56 ±0.38 / 24.29 ms │        24.18 / 24.44 ±0.24 / 24.82 ms │     no change │
│ QQuery 56 │        38.66 / 39.02 ±0.21 / 39.21 ms │        40.47 / 40.58 ±0.10 / 40.73 ms │     no change │
│ QQuery 57 │     177.64 / 179.94 ±2.34 / 184.17 ms │     176.51 / 183.06 ±6.13 / 191.30 ms │     no change │
│ QQuery 58 │     118.49 / 119.65 ±1.09 / 121.24 ms │     114.88 / 118.06 ±2.48 / 122.34 ms │     no change │
│ QQuery 59 │     118.48 / 118.90 ±0.37 / 119.38 ms │     117.34 / 117.83 ±0.37 / 118.32 ms │     no change │
│ QQuery 60 │        39.87 / 41.31 ±1.41 / 43.66 ms │        39.59 / 40.27 ±0.69 / 41.58 ms │     no change │
│ QQuery 61 │        12.35 / 12.52 ±0.21 / 12.93 ms │        12.60 / 12.73 ±0.15 / 13.02 ms │     no change │
│ QQuery 62 │        46.45 / 47.03 ±0.51 / 47.79 ms │        46.22 / 46.43 ±0.22 / 46.82 ms │     no change │
│ QQuery 63 │        29.79 / 30.03 ±0.27 / 30.50 ms │        29.90 / 30.14 ±0.13 / 30.27 ms │     no change │
│ QQuery 64 │    410.82 / 423.58 ±12.27 / 443.75 ms │     408.39 / 415.09 ±6.76 / 425.01 ms │     no change │
│ QQuery 65 │     154.74 / 159.81 ±4.59 / 167.76 ms │     141.85 / 144.49 ±2.32 / 148.57 ms │ +1.11x faster │
│ QQuery 66 │        85.40 / 86.47 ±0.67 / 87.32 ms │        79.06 / 79.50 ±0.37 / 79.93 ms │ +1.09x faster │
│ QQuery 67 │    265.23 / 273.79 ±10.47 / 294.15 ms │     237.81 / 242.42 ±4.56 / 250.95 ms │ +1.13x faster │
│ QQuery 68 │        11.97 / 12.17 ±0.25 / 12.66 ms │        11.91 / 12.08 ±0.19 / 12.46 ms │     no change │
│ QQuery 69 │        57.43 / 57.96 ±0.49 / 58.83 ms │        58.78 / 60.17 ±2.14 / 64.42 ms │     no change │
│ QQuery 70 │     105.29 / 110.36 ±7.11 / 124.16 ms │     104.78 / 107.04 ±2.30 / 111.39 ms │     no change │
│ QQuery 71 │        35.49 / 36.11 ±0.52 / 37.08 ms │        35.20 / 35.82 ±0.54 / 36.64 ms │     no change │
│ QQuery 72 │ 2098.62 / 2148.50 ±50.13 / 2236.60 ms │ 2144.19 / 2247.86 ±82.17 / 2394.52 ms │     no change │
│ QQuery 73 │          9.65 / 9.92 ±0.22 / 10.24 ms │         9.74 / 10.01 ±0.29 / 10.45 ms │     no change │
│ QQuery 74 │     170.45 / 174.40 ±3.77 / 179.43 ms │     166.95 / 171.09 ±4.80 / 180.15 ms │     no change │
│ QQuery 75 │     149.41 / 151.02 ±1.61 / 154.05 ms │     148.47 / 149.74 ±1.02 / 150.88 ms │     no change │
│ QQuery 76 │        35.17 / 35.65 ±0.31 / 36.10 ms │        35.03 / 35.54 ±0.57 / 36.39 ms │     no change │
│ QQuery 77 │        61.42 / 65.04 ±3.90 / 71.61 ms │        61.13 / 61.49 ±0.34 / 62.11 ms │ +1.06x faster │
│ QQuery 78 │     218.67 / 224.47 ±5.22 / 233.96 ms │     194.52 / 198.27 ±4.18 / 205.00 ms │ +1.13x faster │
│ QQuery 79 │        72.20 / 72.85 ±0.35 / 73.21 ms │        67.06 / 68.06 ±1.46 / 70.96 ms │ +1.07x faster │
│ QQuery 80 │     106.93 / 113.35 ±4.85 / 121.48 ms │      98.68 / 100.35 ±1.25 / 102.43 ms │ +1.13x faster │
│ QQuery 81 │        28.75 / 29.08 ±0.46 / 29.95 ms │        25.94 / 26.11 ±0.18 / 26.45 ms │ +1.11x faster │
│ QQuery 82 │        18.10 / 18.29 ±0.15 / 18.49 ms │        16.62 / 16.67 ±0.05 / 16.76 ms │ +1.10x faster │
│ QQuery 83 │        42.77 / 43.29 ±0.71 / 44.65 ms │        40.40 / 41.06 ±0.90 / 42.85 ms │ +1.05x faster │
│ QQuery 84 │        31.79 / 32.01 ±0.18 / 32.33 ms │        30.47 / 30.74 ±0.19 / 30.97 ms │     no change │
│ QQuery 85 │     112.76 / 115.84 ±3.78 / 122.97 ms │     107.16 / 110.48 ±3.46 / 117.11 ms │     no change │
│ QQuery 86 │        26.61 / 26.84 ±0.27 / 27.36 ms │        24.84 / 25.07 ±0.12 / 25.18 ms │ +1.07x faster │
│ QQuery 87 │        68.02 / 68.95 ±1.02 / 70.89 ms │        62.92 / 63.41 ±0.36 / 63.79 ms │ +1.09x faster │
│ QQuery 88 │        65.42 / 66.28 ±0.73 / 67.43 ms │        62.99 / 64.18 ±0.70 / 65.19 ms │     no change │
│ QQuery 89 │        37.84 / 38.36 ±0.62 / 39.46 ms │        36.33 / 36.87 ±0.52 / 37.82 ms │     no change │
│ QQuery 90 │        18.30 / 18.46 ±0.13 / 18.67 ms │        17.35 / 17.59 ±0.18 / 17.85 ms │     no change │
│ QQuery 91 │        46.87 / 48.23 ±0.75 / 49.17 ms │        46.82 / 47.19 ±0.21 / 47.45 ms │     no change │
│ QQuery 92 │        29.78 / 31.76 ±3.02 / 37.77 ms │        29.76 / 30.39 ±0.52 / 31.14 ms │     no change │
│ QQuery 93 │        50.74 / 51.67 ±0.85 / 53.16 ms │        50.55 / 51.03 ±0.32 / 51.45 ms │     no change │
│ QQuery 94 │        38.98 / 39.29 ±0.24 / 39.69 ms │        41.21 / 43.52 ±2.76 / 48.81 ms │  1.11x slower │
│ QQuery 95 │        81.49 / 82.27 ±0.41 / 82.62 ms │        86.47 / 87.49 ±0.57 / 88.01 ms │  1.06x slower │
│ QQuery 96 │        24.43 / 26.14 ±2.70 / 31.51 ms │        25.46 / 25.57 ±0.07 / 25.65 ms │     no change │
│ QQuery 97 │        47.12 / 47.81 ±0.63 / 48.94 ms │        49.07 / 49.89 ±0.87 / 51.55 ms │     no change │
│ QQuery 98 │        42.35 / 43.27 ±1.04 / 45.20 ms │        46.85 / 49.09 ±3.30 / 55.64 ms │  1.13x slower │
│ QQuery 99 │        70.76 / 71.14 ±0.37 / 71.77 ms │        73.10 / 73.81 ±0.46 / 74.27 ms │     no change │
└───────────┴───────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 10164.89ms │
│ Total Time (split-aggr-simplify-ht)   │ 10189.70ms │
│ Average Time (HEAD)                   │   102.68ms │
│ Average Time (split-aggr-simplify-ht) │   102.93ms │
│ Queries Faster                        │         22 │
│ Queries Slower                        │         10 │
│ Queries with No Change                │         67 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘

Resource Usage

tpcds — base (merge-base)

Metric Value
Wall time 55.0s
Peak memory 2.0 GiB
Avg memory 1.4 GiB
CPU user 228.8s
CPU sys 5.8s
Peak spill 0 B

tpcds — branch

Metric Value
Wall time 55.0s
Peak memory 2.2 GiB
Avg memory 1.5 GiB
CPU user 233.8s
CPU sys 5.7s
Peak spill 0 B

File an issue against this benchmark runner

@adriangbot

Copy link
Copy Markdown

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

Comparing HEAD and split-aggr-simplify-ht
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃                split-aggr-simplify-ht ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.23 / 4.05 ±5.49 / 15.03 ms │          1.24 / 3.97 ±5.40 / 14.77 ms │     no change │
│ QQuery 1  │        12.66 / 13.02 ±0.19 / 13.16 ms │        12.34 / 12.63 ±0.15 / 12.78 ms │     no change │
│ QQuery 2  │        35.82 / 36.11 ±0.24 / 36.38 ms │        35.83 / 36.19 ±0.28 / 36.56 ms │     no change │
│ QQuery 3  │        30.77 / 31.62 ±1.12 / 33.83 ms │        30.51 / 30.92 ±0.46 / 31.72 ms │     no change │
│ QQuery 4  │     223.84 / 227.07 ±2.41 / 230.03 ms │     223.49 / 226.15 ±2.47 / 230.33 ms │     no change │
│ QQuery 5  │     273.31 / 273.96 ±0.54 / 274.91 ms │     271.76 / 272.72 ±1.12 / 274.60 ms │     no change │
│ QQuery 6  │           1.31 / 1.47 ±0.23 / 1.91 ms │           1.28 / 1.43 ±0.22 / 1.86 ms │     no change │
│ QQuery 7  │        13.86 / 14.04 ±0.12 / 14.23 ms │        13.76 / 13.98 ±0.13 / 14.15 ms │     no change │
│ QQuery 8  │     323.12 / 326.73 ±2.02 / 329.19 ms │     316.74 / 322.16 ±2.75 / 324.43 ms │     no change │
│ QQuery 9  │    438.72 / 465.95 ±14.14 / 478.29 ms │     453.60 / 461.13 ±6.62 / 471.50 ms │     no change │
│ QQuery 10 │        69.97 / 70.81 ±0.83 / 71.85 ms │        71.15 / 72.41 ±1.30 / 74.76 ms │     no change │
│ QQuery 11 │        80.88 / 82.73 ±2.48 / 87.64 ms │        82.22 / 83.65 ±1.26 / 85.93 ms │     no change │
│ QQuery 12 │     267.15 / 271.64 ±4.61 / 277.44 ms │     266.16 / 274.36 ±8.00 / 284.98 ms │     no change │
│ QQuery 13 │     367.13 / 377.31 ±9.65 / 394.53 ms │    364.32 / 380.35 ±14.23 / 402.66 ms │     no change │
│ QQuery 14 │     281.15 / 283.77 ±2.21 / 287.64 ms │     278.90 / 283.77 ±5.89 / 294.76 ms │     no change │
│ QQuery 15 │     274.29 / 281.90 ±7.93 / 295.76 ms │     268.77 / 274.53 ±3.27 / 277.48 ms │     no change │
│ QQuery 16 │     612.00 / 619.45 ±6.46 / 629.63 ms │     614.30 / 618.34 ±3.44 / 624.36 ms │     no change │
│ QQuery 17 │     628.63 / 634.48 ±4.92 / 643.40 ms │     614.06 / 626.81 ±9.07 / 641.75 ms │     no change │
│ QQuery 18 │ 1238.98 / 1265.03 ±19.70 / 1294.62 ms │ 1240.74 / 1260.59 ±15.76 / 1283.65 ms │     no change │
│ QQuery 19 │        27.70 / 29.25 ±2.75 / 34.74 ms │        27.46 / 30.29 ±4.64 / 39.46 ms │     no change │
│ QQuery 20 │     517.07 / 528.49 ±6.79 / 535.75 ms │     515.30 / 519.37 ±2.74 / 522.02 ms │     no change │
│ QQuery 21 │     512.11 / 518.18 ±5.81 / 529.24 ms │     513.94 / 519.96 ±5.72 / 529.95 ms │     no change │
│ QQuery 22 │    985.77 / 996.83 ±8.63 / 1008.67 ms │    979.96 / 988.81 ±9.32 / 1005.15 ms │     no change │
│ QQuery 23 │ 3044.81 / 3070.47 ±20.65 / 3099.53 ms │ 3083.92 / 3107.22 ±24.64 / 3151.60 ms │     no change │
│ QQuery 24 │        41.03 / 47.91 ±8.30 / 60.86 ms │       42.10 / 50.69 ±10.43 / 66.91 ms │  1.06x slower │
│ QQuery 25 │     111.43 / 112.34 ±1.01 / 114.24 ms │     111.74 / 113.16 ±1.57 / 116.17 ms │     no change │
│ QQuery 26 │        41.73 / 42.32 ±0.41 / 43.00 ms │        41.47 / 41.95 ±0.57 / 43.00 ms │     no change │
│ QQuery 27 │     665.83 / 673.12 ±5.60 / 682.99 ms │     673.06 / 676.68 ±2.71 / 679.75 ms │     no change │
│ QQuery 28 │  3037.33 / 3045.50 ±6.08 / 3055.04 ms │ 3021.32 / 3051.62 ±17.60 / 3076.16 ms │     no change │
│ QQuery 29 │        40.47 / 44.60 ±7.69 / 59.97 ms │        40.65 / 40.84 ±0.13 / 41.03 ms │ +1.09x faster │
│ QQuery 30 │    305.50 / 315.80 ±10.20 / 334.07 ms │    302.12 / 314.55 ±16.25 / 346.64 ms │     no change │
│ QQuery 31 │    287.86 / 296.56 ±10.55 / 312.63 ms │     281.08 / 295.23 ±8.23 / 306.84 ms │     no change │
│ QQuery 32 │    918.23 / 949.74 ±20.64 / 969.99 ms │    911.77 / 926.81 ±15.75 / 954.52 ms │     no change │
│ QQuery 33 │ 1439.46 / 1464.72 ±23.17 / 1507.76 ms │ 1428.11 / 1461.49 ±24.67 / 1491.91 ms │     no change │
│ QQuery 34 │  1475.01 / 1481.97 ±4.32 / 1486.36 ms │ 1478.44 / 1513.12 ±29.95 / 1558.43 ms │     no change │
│ QQuery 35 │    278.56 / 295.84 ±23.63 / 342.62 ms │    276.20 / 289.44 ±14.85 / 316.56 ms │     no change │
│ QQuery 36 │        65.78 / 71.55 ±4.17 / 78.28 ms │        66.51 / 70.28 ±4.34 / 77.41 ms │     no change │
│ QQuery 37 │        35.91 / 40.60 ±4.08 / 46.64 ms │        35.27 / 38.76 ±3.21 / 44.53 ms │     no change │
│ QQuery 38 │        40.98 / 43.78 ±2.75 / 48.74 ms │        41.27 / 43.90 ±2.78 / 48.66 ms │     no change │
│ QQuery 39 │     136.07 / 147.42 ±7.73 / 157.86 ms │     142.77 / 153.15 ±8.31 / 165.88 ms │     no change │
│ QQuery 40 │        14.09 / 16.36 ±3.13 / 22.45 ms │        14.02 / 16.88 ±5.24 / 27.36 ms │     no change │
│ QQuery 41 │        13.65 / 13.93 ±0.22 / 14.19 ms │        13.58 / 14.14 ±0.55 / 14.98 ms │     no change │
│ QQuery 42 │        13.44 / 15.57 ±4.08 / 23.72 ms │        13.25 / 15.26 ±3.76 / 22.78 ms │     no change │
└───────────┴───────────────────────────────────────┴───────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 19543.95ms │
│ Total Time (split-aggr-simplify-ht)   │ 19549.70ms │
│ Average Time (HEAD)                   │   454.51ms │
│ Average Time (split-aggr-simplify-ht) │   454.64ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          1 │
│ Queries with No Change                │         41 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘

Resource Usage

clickbench_partitioned — base (merge-base)

Metric Value
Wall time 100.0s
Peak memory 12.2 GiB
Avg memory 4.6 GiB
CPU user 1006.4s
CPU sys 65.8s
Peak spill 0 B

clickbench_partitioned — branch

Metric Value
Wall time 100.0s
Peak memory 12.0 GiB
Avg memory 4.4 GiB
CPU user 1007.0s
CPU sys 66.1s
Peak spill 0 B

File an issue against this benchmark runner

.zip(evaluated_batch.accumulator_args.iter())
{
match mode {
AggregateTableMode::Partial | AggregateTableMode::Single => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned codes here seem switching to the way like row_hash.rs: all logic in one place, and using if else to decide where we go.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would help to factor these decisions into other abstractions? Like maybe a method on AggregateTableMode 🤔

// Filters apply only when the table consumes raw input rows. Final and
// partial-reduce modes consume partial states, so their filters are not
// applicable.
let filters = match mode {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern like what in aggregate_batch

/// - `Final`: merge_batch + evaluate
/// - `PartialReduce`: merge_batch + state
/// - `Single`: update_batch + evaluate
pub(in crate::aggregates) struct AggregateHashTable {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we should keep different AggregateHashTables for different usage, and remove duplicated codes based on this?
Due to :

  • Concerns in aggregate_batch and other methods.
  • Some method like partial_skip_table, it is logic for partial aggr and seems not common.

@2010YOUY01 2010YOUY01 Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Proposed alternative

I have considered this approach: separating aggregate hash tables by mode and extracting shared internal utilities.

My concern is that this would create a shallow module. To fully understand the implementation, we would still need to reason about the same underlying mode differences, while the external AggregateHashTable variants would add complexity with almost no functional contribution.

So I think keeping the mode as an internal flag is simpler overall.

Complexity concern

For the concern about growing internal complexity, see the rationale in the issue. I think this should remain manageable in the long term.

Partial skip table

Yes, this is a valid concern. The partial skip table logic could be moved out of this module entirely to make the structure cleaner. However, I think this can be delayed a bit, and I plan to do it as a follow-up.
#23113

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that this would create a shallow module. To fully understand the implementation, we would still need to reason about the same underlying mode differences, while the external AggregateHashTable variants would add complexity with almost no functional contribution.

One idea could be to leave the structure that is already on main, and refactor the common behavior out as templated functions on the base class

So for example, leave AggregateHashTable<AggrMode>

But then in the parts that are very similar like aggregate_batch instead of

impl AggregateHashTable<PartialReduceMarker> {
...
   pub(in crate::aggregates) fn aggregate_batch(
        &mut self,
        batch: &RecordBatch,
    ) -> Result<()> {
        let evaluated_batch = self.evaluate_batch(batch)?;
        let state = self.state.building_mut();

        let timer = self.group_by_metrics.aggregation_time.timer();
        for group_values in &evaluated_batch.grouping_set_args {
            state
                .group_values
                .intern(group_values, &mut state.batch_group_indices)?;
            let group_indices = &state.batch_group_indices;
            let total_num_groups = state.group_values.len();

            for (acc, values) in state
                .accumulators
                .iter_mut()
                .zip(evaluated_batch.accumulator_args.iter())
            {
                acc.merge_batch(values, group_indices, total_num_groups)?;
            }
        }
        drop(timer);

        Ok(())
    }
..
}

It could be structured something so that the specialization was called

impl AggregateHashTable<PartialReduceMarker> {
...
   pub(in crate::aggregates) fn aggregate_batch(
        &mut self,
        batch: &RecordBatch,
    ) -> Result<()> {
     // call a templated function with a closure with the hash table specific functionality
      self.aggregate_batch_inner(batch, |acc, values, group_indices, total_num_groups| {
                acc.merge_batch(values, group_indices, total_num_groups)
       })?;
    }
...
}

And then move the common logic into the generic method

/// Implement in "base" class (not the specialization)
impl AggregateHashTable<AggrMode> {
...
    fn <F: Fn(...) -> Result<()>)aggregate_batch_inner(
        batch: &RecordBatch,
        agg_function: F,
    ) -> Result<()> {
      let evaluated_batch = self.evaluate_batch(batch)?;
        let state = self.state.building_mut();

        let timer = self.group_by_metrics.aggregation_time.timer();
        for group_values in &evaluated_batch.grouping_set_args {
            state
                .group_values
                .intern(group_values, &mut state.batch_group_indices)?;
            let group_indices = &state.batch_group_indices;
            let total_num_groups = state.group_values.len();

            for (acc, values) in state
                .accumulators
                .iter_mut()
                .zip(evaluated_batch.accumulator_args.iter())
            {
              // *** Note here call the generic function agg_function: F
               agg_function(values, group_indices, total_num_groups)?; <---------
            }
        }
        drop(timer);

        Ok(())
    }
  }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The templated-function approach seems like a good idea. I’ll give it a try and see how it looks.

If the templated function is not expressive enough in some places/at some point, we can still fully separate the implementation by aggregation variant.

@2010YOUY01

Copy link
Copy Markdown
Contributor Author

Thanks for the review @Rachelint and @alamb

If there are other ideas for making the structure simpler and easier to understand, I’d be happy to discuss and experiment further. I think managing complexity is a hard but important problem, and it’s interesting to figure out methodologies to address it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants