IN LIST: add direct-probe hash filter for large primitive lists#23015
Draft
geoffreyclaude wants to merge 4 commits into
Build Int8 and Int16 IN-list bitmap filters by reinterpreting the input buffers as UInt8 or UInt16 with the same byte width. This avoids copying or numeric conversion while preserving signed integer equality semantics.
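The reinterpretation idea in this commit can be illustrated with a minimal sketch. `ByteBitmap` and its methods are hypothetical names, not the PR's code, and the per-value `as u8` cast here stands in for reinterpreting a whole input buffer. The point is only that a signed value and its unsigned bit pattern identify the same bitmap slot, so equality semantics are unchanged.

```rust
/// 256-bit membership bitmap for one-byte values.
/// Hypothetical illustration type, not the filter added by this PR.
pub struct ByteBitmap {
    bits: [u64; 4],
}

impl ByteBitmap {
    /// Build the bitmap from a signed list by viewing each value as u8.
    /// `v as u8` keeps the bit pattern, so -1 maps to slot 255.
    pub fn from_i8_list(list: &[i8]) -> Self {
        let mut bits = [0u64; 4];
        for &v in list {
            let idx = v as u8 as usize;
            bits[idx >> 6] |= 1u64 << (idx & 63);
        }
        Self { bits }
    }

    /// Probe with the same reinterpretation used at build time.
    pub fn contains_i8(&self, v: i8) -> bool {
        let idx = v as u8 as usize;
        (self.bits[idx >> 6] >> (idx & 63)) & 1 == 1
    }
}
```

Because two distinct `i8` values never share a bit pattern, the lookup cannot produce false positives across the signed/unsigned boundary.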
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
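A rough sketch of what a const-generic comparison chain can look like, assuming nothing about the PR's actual types: `contains_branchless` and `branchless_limit` are hypothetical names. The thresholds in `branchless_limit` are the ones quoted in the commit message; whether the compiler emits fully branch-free code for the loop depends on the optimizer and target.

```rust
/// Compare `x` against every constant and OR the results together.
/// There is no early exit, so the loop body has no data-dependent branch,
/// and the fixed length `N` lets the compiler unroll it.
pub fn contains_branchless<T: Copy + PartialEq, const N: usize>(list: &[T; N], x: T) -> bool {
    let mut hit = false;
    for v in list {
        hit |= *v == x;
    }
    hit
}

/// Largest list size handled by the comparison chain for a given value width,
/// using the thresholds stated in the commit message.
/// Returns None for widths the message does not mention.
pub fn branchless_limit(byte_width: usize) -> Option<usize> {
    match byte_width {
        4 => Some(32),
        8 => Some(16),
        16 => Some(4),
        _ => None,
    }
}
```

For example, `contains_branchless(&[1i32, 5, 9, 42], 9)` evaluates all four comparisons before returning `true`.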
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
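The lookup scheme described here (open addressing, linear probing, 25% load factor) can be sketched as follows. This is an illustration under assumptions, not the PR's `DirectProbeFilter`: the `ProbeTable` name, the `Option<i64>` slot layout, and the multiplicative hash constant are all choices made for this example.

```rust
/// Open-addressing set of i64 values with linear probing.
/// Hypothetical sketch; slot layout and hash are assumptions.
pub struct ProbeTable {
    slots: Vec<Option<i64>>,
    mask: usize,
}

impl ProbeTable {
    /// Size the table to at least 4x the list length (rounded up to a
    /// power of two), which keeps the load factor at or below 25%.
    pub fn new(values: &[i64]) -> Self {
        let cap = (values.len().max(1) * 4).next_power_of_two();
        let mut table = Self { slots: vec![None; cap], mask: cap - 1 };
        for &v in values {
            table.insert(v);
        }
        table
    }

    /// Multiplicative hash; the upper bits are used to pick the start slot.
    fn start_slot(&self, v: i64) -> usize {
        let h = (v as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        (h >> 32) as usize & self.mask
    }

    fn insert(&mut self, v: i64) {
        let mut i = self.start_slot(v);
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some(v);
                    return;
                }
                Some(existing) if existing == v => return, // duplicate constant
                Some(_) => i = (i + 1) & self.mask,        // step to next slot
            }
        }
    }

    /// Walk forward from the start slot until the value or an empty slot is found.
    /// The low load factor guarantees an empty slot exists, so the loop terminates.
    pub fn contains(&self, v: i64) -> bool {
        let mut i = self.start_slot(v);
        loop {
            match self.slots[i] {
                None => return false,
                Some(existing) if existing == v => return true,
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }
}
```

Storing the values directly in one flat array is what removes the indirection mentioned above: a probe touches the slot array and nothing else.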
Which issue does this PR close?
`IN` performance with specialized implementations #19390.
Rationale for this change
#23014 handles tiny primitive `IN` lists by comparing against each constant. That stops being a good tradeoff once the list gets larger.
For larger primitive lists, this PR uses a purpose-built lookup table. The mental model is:
`x IN (...)`.
This is still a hash-table style lookup, but it is simpler than the generic fallback because primitive values are fixed-width and can be stored directly. There is no need for the generic Arrow comparator path for each candidate.
The earlier bitmap and branchless filters remain in place for the cases where they are cheaper.
What changes are included in this PR?
- `DirectProbeFilter`, a compact open-addressing lookup table with linear probing.
- `IN` lists to direct probing after the branchless thresholds.
Are these changes tested?
Yes.
- `cargo fmt --all --check`
- `cargo test -p datafusion-physical-expr direct_probe --lib`
- `cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings`
Are there any user-facing changes?
No. This is an internal performance optimization only.
Local benchmark snapshot
Benchmark command:
Method: compare adjacent saved baselines using raw Criterion sample minima (`min(time / iters)`). Lower is better; changes within +/-5% are treated as noise.
Compared baselines: #23014 -> #23015
Relevant scope: large primitive-list rows.
Summary: 13 relevant rows, 13 faster, 0 slower, 0 within +/-5%.
- `f32/large_list/list=64/match=0%`
- `f32/large_list/list=64/match=50%`
- `nulls/primitive/i32/large_list/list=64/match=50%/nulls=20%`
- `primitive/i32/large_list/list=256/match=0%`
- `primitive/i32/large_list/list=256/match=50%`
- `primitive/i32/large_list/list=64/match=0%`
- `primitive/i32/large_list/list=64/match=50%`
- `primitive/i64/large_list/list=128/match=0%`
- `primitive/i64/large_list/list=128/match=50%`
- `primitive/i64/large_list/list=32/match=0%`
- `primitive/i64/large_list/list=32/match=50%`
- `timestamp_ns/large_list/list=32/match=0%`
- `timestamp_ns/large_list/list=32/match=50%`
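
The comparison method stated in the benchmark snapshot (per-sample `min(time / iters)`, with a +/-5% noise band) could be computed along these lines. This is a sketch only: `min_per_iter`, `classify`, and `Verdict` are hypothetical names, and it assumes each Criterion sample is available as a total time plus an iteration count.

```rust
/// Outcome of comparing a new baseline against the previous one.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Smallest per-iteration time across samples: min(time / iters).
/// Returns None if there are no samples or the slices differ in length.
pub fn min_per_iter(times_ns: &[f64], iters: &[f64]) -> Option<f64> {
    if times_ns.is_empty() || times_ns.len() != iters.len() {
        return None;
    }
    let min = times_ns
        .iter()
        .zip(iters)
        .map(|(t, n)| t / n)
        .fold(f64::INFINITY, f64::min);
    Some(min)
}

/// Classify the relative change; anything within +/-5% counts as noise.
/// Lower is better, so a negative change means the new baseline is faster.
pub fn classify(base: f64, new: f64) -> Verdict {
    let change = (new - base) / base;
    if change < -0.05 {
        Verdict::Faster
    } else if change > 0.05 {
        Verdict::Slower
    } else {
        Verdict::Noise
    }
}
```

Using the minimum rather than the mean is a common way to reduce the influence of scheduling jitter on short benchmarks, since interference can only make a sample slower.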