Fix get_data livelock in protocol_transaction_out_106#1070
Open
echennells wants to merge 1 commit into
Open
Conversation
handle_receive_get_data returned false (desubscribe) while send_transaction's completion branch re-subscribed (push_back) into the same unsubscriber list being iterated. notify erases the current node and erase returns the just-pushed node, re-delivering the same get_data forever (the // BUGBUG: registration race site). A block-only get_data to a synced node spins a worker thread at 100%; enable_relay defaults true, so one peer message wedges a thread and N peers exhaust the pool. Stay subscribed and drop the per-request resubscribe, matching protocol_block_out_106.
This was referenced Jul 3, 2026
Member
|
The subscription drop is the mechanism that prevents handling additional messages of the same protocol while one is already being processed. The intent is to resubscribe once the message is fully handled, and without gapping the strand after the last send. The ensures that the caller may feed another request following receipt of the last request response without it getting dropped in a resubscribe gap. Removing this behavior may fix the observed behavior, but unless the message is handled synchronously (which requests for multiples will not be, leaving the subscription open is a DoS bug. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A
get_datafor blocks only (no tx items) sent to a synced node makeshandle_receive_get_dataspin a worker thread indefinitely.Mechanism
The
get_datasubscription is a singleunsubscriber<get_data::cptr>backed by astd::list.notifyiteratesif (!(*it)(...)) it = queue_.erase(it);.handle_receive_get_datareturnsfalse→ erase, and during that same callsend_transaction's completion branch re-subscribes (SUBSCRIBE_CHANNEL→push_back) into the list being iterated.erasereturns the just-pushed node, so the sameget_datais re-delivered to the handler forever (consume-one / produce-one). This is the existing// BUGBUG: registration race.site.The trigger is ordinary peer behavior: a block-only
get_data(the common in-flight block-download batch) sent to a node that holds those blocks —protocol_block_out_106serves them and keeps the channel alive, sotx_outloops.enable_relaydefaults true (parser.cpp), so it is reachable on a default node; a single peer message wedges a worker thread and N peers exhaust the pool.Fix
Stay subscribed (
return true) and drop the per-request resubscribe — the subscribe-once patternprotocol_block_out_106already uses. Two lines (plus a now-stale comment).Verification (testnet3, current master)
protocol_transaction_out_106::handle_receive_get_data.get_data(three interleaved multi-item tx batches with a block-only batch, sent without waiting) → patched serves every requested tx and the block at 0% CPU and stays responsive; unpatched serves only the first batch. Confirms the stay-subscribed change is correct under concurrency —send_transactionis fully parameterized and the type's only member isconst bool node_witness_, so overlapping serve chains share no mutable state.Note: the underlying
unsubscriber::notifyThis fix removes the caller that triggers the loop, but the primitive itself will re-enter any handler that re-subscribes during its own notification (the consume-one/produce-one above), silently and unbounded.
protocol_filter_out_70015independently makes the same mistake (fixed identically in #1071), so the pattern is not a one-off. Re-subscribing duringnotifyis arguably invalid usage — but the failure mode is a silent infinite loop rather than a fail-fast. Worth considering whethernotifyshould assert against a mid-notify re-subscription, so future misuse surfaces immediately instead of hanging a thread. Raised as a question, not addressed here.