docs: add monitoring, troubleshooting, and upgrading operations guides #7660
Open
CharlieTLe wants to merge 2 commits into
Surface the bundled Grafana dashboards, Prometheus alerts, recording rules, and Jsonnet mixin that already live in docs/getting-started/ so operators can find and install them. Add a symptom-driven troubleshooting decision tree covering the write path, read path, storage, hash ring, alertmanager, ruler, and noisy-neighbour scenarios. Add an upgrade procedure that documents the canonical component ordering, ingester drain semantics, and downgrade caveats. Rework docs/operations/_index.md so the new pages are navigable from the operations landing page.
Signed-off-by: Charlie Le <charlie_le@apple.com>
51a41fc to 85ae646
Removed CI Modernization section from operations index.
Signed-off-by: Charlie Le <charlie_le@apple.com>
SungJin1212 (Member) approved these changes on Jul 2, 2026 and left a comment:
Great work! Thanks.
Not blocking this PR, but could we also add a zone awareness guide?
Summary

Adds three operator-facing guides to `docs/operations/` to close documentation gaps identified against comparable Prometheus long-term-storage projects.

- `monitoring-cortex.md`: Surfaces the bundled Grafana dashboards, Prometheus alert rules, recording rules, and Jsonnet mixin that already ship under `docs/getting-started/` but were undocumented. Inventories all 12 dashboards and the 10 alert groups (50+ alerts), and walks through installation via raw YAML/JSON or the mixin.
- `troubleshooting.md`: Symptom-driven decision tree covering the write path, read path, storage, hash ring / KV store, alertmanager, ruler, and multi-tenant noisy-neighbour scenarios. Each branch cross-references the relevant bundled alert and existing guide.
- `upgrading.md`: Documents the canonical layered upgrade order (compactor/store-gateway → query layer → ingester → distributor → ruler/alertmanager), ingester drain semantics, validation between layers, and downgrade caveats.

Also reworks `docs/operations/_index.md` (previously a stub with no content) to surface core operator guides separately from specialized tools.

No code changes. All 18 cross-references to existing pages in `docs/guides/`, `docs/configuration/`, and `docs/operations/` were verified to resolve.

Test plan

- `make doc` runs without errors (or confirm not required: no config/flag changes).
- `{{< relref >}}` links resolve in the rendered site.
- […] `monitoring-cortex.md` match the current contents of `docs/getting-started/dashboards/` and `docs/getting-started/alerts.yaml`.
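
The relref check in the test plan could be approximated locally before rendering the site. The following is a minimal sketch, not part of this PR: the helper names `relref_targets` and `unresolved` and the `docs_root` parameter are hypothetical, and the lookup is deliberately simpler than Hugo's real resolution (it only tries the target relative to the page and relative to an assumed docs root), so a reported miss is a prompt to check the rendered site rather than a definite broken link.

```python
import re
from pathlib import Path

# Matches the positional form {{< relref "target" >}} (and the {{% ... %}} variant)
# and captures the quoted target. Named-parameter forms are not handled here.
RELREF = re.compile(r'\{\{[<%]\s*relref\s+"([^"]+)"\s*[>%]\}\}')


def relref_targets(markdown: str) -> list[str]:
    """Return every relref target found in the text, in order of appearance."""
    return RELREF.findall(markdown)


def unresolved(page: Path, docs_root: Path) -> list[str]:
    """List relref targets on `page` that match no file (simplified lookup).

    Assumption: a target resolves if it exists relative to the page's
    directory or relative to `docs_root`. Hugo itself accepts more forms.
    """
    missing = []
    for target in relref_targets(page.read_text(encoding="utf-8")):
        path = target.split("#", 1)[0]  # drop any anchor fragment
        if not path:
            continue  # same-page anchor, nothing to look up
        candidates = [page.parent / path, docs_root / path.lstrip("/")]
        if not any(c.is_file() for c in candidates):
            missing.append(target)
    return missing
```

Under these assumptions, calling `unresolved(Path("docs/operations/upgrading.md"), Path("docs"))` from a checkout would return the targets worth inspecting by hand; an empty list suggests, but does not prove, that the links resolve.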