Skip to content

Fix get_data livelock in protocol_transaction_out_106#1070

Open
echennells wants to merge 1 commit into
libbitcoin:masterfrom
echennells:fix-transaction-out-106-getdata-livelock
Open

Fix get_data livelock in protocol_transaction_out_106#1070
echennells wants to merge 1 commit into
libbitcoin:masterfrom
echennells:fix-transaction-out-106-getdata-livelock

Conversation

@echennells

@echennells echennells commented Jul 3, 2026

Copy link
Copy Markdown

Summary

A get_data for blocks only (no tx items) sent to a synced node makes handle_receive_get_data spin a worker thread indefinitely.

Mechanism

The get_data subscription is a single unsubscriber<get_data::cptr> backed by a std::list. notify iterates if (!(*it)(...)) it = queue_.erase(it);. handle_receive_get_data returns false → erase, and during that same call send_transaction's completion branch re-subscribes (SUBSCRIBE_CHANNELpush_back) into the list being iterated. erase returns the just-pushed node, so the same get_data is re-delivered to the handler forever (consume-one / produce-one). This is the existing // BUGBUG: registration race. site.

The trigger is ordinary peer behavior: a block-only get_data (the common in-flight block-download batch) sent to a node that holds those blocks — protocol_block_out_106 serves them and keeps the channel alive, so tx_out loops. enable_relay defaults true (parser.cpp), so it is reachable on a default node; a single peer message wedges a worker thread and N peers exhaust the pool.

Fix

Stay subscribed (return true) and drop the per-request resubscribe — the subscribe-once pattern protocol_block_out_106 already uses. Two lines (plus a now-stale comment).

Verification (testnet3, current master)

  • Unpatched: trigger → one thread pegged 75–84% CPU, backtrace in protocol_transaction_out_106::handle_receive_get_data.
  • Patched: same trigger → 0% CPU, block still served.
  • No regression: peer requests a tx by id → served byte-identical to unpatched, and again on a repeat request (validates the stay-subscribed change).
  • Concurrent requests: overlapping/pipelined get_data (three interleaved multi-item tx batches with a block-only batch, sent without waiting) → patched serves every requested tx and the block at 0% CPU and stays responsive; unpatched serves only the first batch. Confirms the stay-subscribed change is correct under concurrency — send_transaction is fully parameterized and the type's only member is const bool node_witness_, so overlapping serve chains share no mutable state.

Note: the underlying unsubscriber::notify

This fix removes the caller that triggers the loop, but the primitive itself will re-enter any handler that re-subscribes during its own notification (the consume-one/produce-one above), silently and unbounded. protocol_filter_out_70015 independently makes the same mistake (fixed identically in #1071), so the pattern is not a one-off. Re-subscribing during notify is arguably invalid usage — but the failure mode is a silent infinite loop rather than a fail-fast. Worth considering whether notify should assert against a mid-notify re-subscription, so future misuse surfaces immediately instead of hanging a thread. Raised as a question, not addressed here.

handle_receive_get_data returned false (desubscribe) while
send_transaction's completion branch re-subscribed (push_back) into the
same unsubscriber list being iterated. notify erases the current node and
erase returns the just-pushed node, re-delivering the same get_data
forever (the // BUGBUG: registration race site). A block-only get_data to
a synced node spins a worker thread at 100%; enable_relay defaults true,
so one peer message wedges a thread and N peers exhaust the pool. Stay
subscribed and drop the per-request resubscribe, matching
protocol_block_out_106.
@evoskuil

evoskuil commented Jul 3, 2026

Copy link
Copy Markdown
Member

The subscription drop is the mechanism that prevents handling additional messages of the same protocol while one is already being processed. The intent is to resubscribe once the message is fully handled, and without gapping the strand after the last send. The ensures that the caller may feed another request following receipt of the last request response without it getting dropped in a resubscribe gap.

Removing this behavior may fix the observed behavior, but unless the message is handled synchronously (which requests for multiples will not be, leaving the subscription open is a DoS bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants