fix(telemetry): bound Prometheus distribution buffer (fork pin)#4663
Open
robacourt wants to merge 1 commit into
Open
fix(telemetry): bound Prometheus distribution buffer (fork pin)#4663robacourt wants to merge 1 commit into
robacourt wants to merge 1 commit into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4663 +/- ##
==========================================
+ Coverage 60.03% 60.20% +0.17%
==========================================
Files 395 410 +15
Lines 43747 44343 +596
Branches 12579 12583 +4
==========================================
+ Hits 26262 26698 +436
- Misses 17407 17567 +160
Partials 78 78
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
When ELECTRIC_PROMETHEUS_PORT is set but the /metrics endpoint is scraped infrequently or never (e.g. OTel-only deployments), telemetry_metrics_prometheus_core buffers one ETS row per distribution observation and only drains on scrape, so the dist table grows without bound (the per-transaction receive_lag metric dominates). An 8GB+ ETS table and eventual OOM was observed in the field. Pin telemetry_metrics_prometheus_core to a fork that bounds the buffer by aggregating automatically on a size threshold (default 10k samples) and a time fallback (default 60s), in addition to on scrape. Only affects the MIX_TARGET=application build (the standalone sync-service / Docker image); the telemetry deps are target-gated out of the Hex package, so this git dep does not affect publishing of `electric`. Upstream PR: beam-telemetry/telemetry_metrics_prometheus_core#77 Revert to the Hex release once it lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01YW7Njz5ZpBDaoGviW1eVR8
bf688eb to
6987ca4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes unbounded ETS growth in the Prometheus metrics reporter. When
ELECTRIC_PROMETHEUS_PORTis set but the/metricsendpoint is scraped infrequently — or never, as on OpenTelemetry-only deployments —telemetry_metrics_prometheus_corebuffers one ETS row per distribution observation and only drains it on scrape. The per-transactionreceive_lagdistribution dominates, so theprometheus_metrics_disttable grows without bound. A customer hit an 8 GB+ ETS table and eventual OOM this way.This pins
telemetry_metrics_prometheus_coreto a fork that bounds the buffer by aggregating automatically — on a size threshold (default 10k buffered samples) and a time fallback (default 60s) — in addition to on scrape, and serializes aggregation in the registry process (also fixing a pre-existing overlapping-scrape race).Upstream PR: beam-telemetry/telemetry_metrics_prometheus_core#77
Why a git dep is safe here
The telemetry deps (the
electric_telemetrypath dep and its transitivetelemetry_metrics_prometheus_core) are gated behindMIX_TARGET=applicationinsync-service/mix.exs(telemetry_deps(_) -> []otherwise). The Hex-publishedelectricpackage is built with the default target, so this dependency is absent from the published dep tree — the git pin cannot affectmix hex.publish. It only affects the standalone sync-service / Docker / CI build, wheremix deps.getfetches git deps fine (git is installed in the builder image).electric-telemetryitself is not published to Hex.Changes
packages/electric-telemetry/mix.exs— pin to the fork branch, with a comment to revert to~> 1.2once upstream lands on Hex.packages/electric-telemetry/mix.lock+packages/sync-service/mix.lock— locked to the fork commit (only that one entry changes).@core/sync-servicepatch).Verified
electric-telemetrycompiles against the fork. Revert this pin to the Hex release after #77 merges and ships.🤖 Generated with Claude Code