Skip to content

bug: clear map contents after emit in GroupValuesColumns#23262

Open
Rich-T-kid wants to merge 3 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/fix-map-clear
Open

bug: clear map contents after emit in GroupValuesColumns#23262
Rich-T-kid wants to merge 3 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/fix-map-clear

Conversation

@Rich-T-kid

@Rich-T-kid Rich-T-kid commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

I was implementing dictionary support for this trait when I noticed the emit(emit_to::All) path doesn't update the contents of the the hashtable/map

What changes are included in this PR?

This PR adds test to catch future regressions as well as a simple self.map.clear() to clear the contents of the map on emit

Are these changes tested?

yes, this PR adds two test

Are there any user-facing changes?

no

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 30, 2026
@Rich-T-kid

Copy link
Copy Markdown
Contributor Author

From my understanding spilling causes emitTo::All followed by clear_shrink(). But I think its worth making this explicit, to make emit self-consistent.

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Rich-T-kid

let fresh = Self::build_group_columns(&self.schema)?;
let group_values = mem::replace(&mut self.group_values, fresh);

self.map.clear();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment here?

What is the end user implication of this change? less peak memory usage?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly,the map's capacity is preserved. This gives the caller a correctness guarantee: emit will never leave GroupValuesColumn in an intermediate or inconsistent state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can emit ever be called more than once with EmitTo::All ? That seems like it would be a bug right there

@Rich-T-kid Rich-T-kid Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EmitTo::All gets called in two cases

  1. memory is filling up, leading to a spill. EmitTo::All is called here and then its written to disk. after EmitTo::All is called GroupValues::clear_shrink() gets called and thats when the map content gets cleared. iirc EmitTo::All can be called multiple times for spills. This wont matter as each call is paired with GroupValues::clear_shrink()

  2. all input has been seen, begin producing output. EmitTo::all is called here is called to produced all of the arrays and then gets repeatedly sliced. After this the exec_state is set to producing output so intern is never called again.

Currently in both cases due to the ordering of surrounding code the state of groupValuesColumns is logically correct.

🤔 It may not be worth adding this map.clear() in this case. My thought was to try to prevent possible invariant violations, but if clear_shrink(0) isn't called, or if more input comes in after the state changes to producing_output, there may be larger logical inconsistencies.

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Rich-T-kid

let fresh = Self::build_group_columns(&self.schema)?;
let group_values = mem::replace(&mut self.group_values, fresh);

self.map.clear();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can emit ever be called more than once with EmitTo::All ? That seems like it would be a bug right there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants