
docs: add monitoring, troubleshooting, and upgrading operations guides #7660

Open: CharlieTLe wants to merge 2 commits into cortexproject:master from CharlieTLe:docs/operations-monitoring-troubleshooting-upgrading

@CharlieTLe

Member

Summary

Adds three operator-facing guides to docs/operations/ to close documentation gaps identified against comparable Prometheus long-term-storage projects.

  • monitoring-cortex.md — Surfaces the bundled Grafana dashboards, Prometheus alert rules, recording rules, and Jsonnet mixin that already ship under docs/getting-started/ but were undocumented. Inventories all 12 dashboards and the 10 alert groups (50+ alerts), and walks through installation via raw YAML/JSON or the mixin.
  • troubleshooting.md — Symptom-driven decision tree covering the write path, read path, storage, hash ring / KV store, alertmanager, ruler, and multi-tenant noisy-neighbour scenarios. Each branch cross-references the relevant bundled alert and existing guide.
  • upgrading.md — Documents the canonical layered upgrade order (compactor/store-gateway → query layer → ingester → distributor → ruler/alertmanager), ingester drain semantics, validation between layers, and downgrade caveats.
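
The layered order summarized for upgrading.md can be illustrated with a small sketch. This is only an illustration under stated assumptions: the layer list is copied from the summary above, while the function names and the literal "query layer" label are hypothetical and are not part of Cortex or of the guide itself.

```python
# Layered upgrade order as listed in the PR summary for upgrading.md.
# Components in the same tuple belong to the same layer.
UPGRADE_LAYERS = [
    ("compactor", "store-gateway"),
    ("query layer",),  # placeholder label taken verbatim from the summary
    ("ingester",),
    ("distributor",),
    ("ruler", "alertmanager"),
]


def layer_of(component: str) -> int:
    """Return the index of the layer that contains the component."""
    for index, layer in enumerate(UPGRADE_LAYERS):
        if component in layer:
            return index
    raise ValueError(f"unknown component: {component}")


def is_valid_order(plan: list[str]) -> bool:
    """True if no component is upgraded before one from an earlier layer."""
    indices = [layer_of(component) for component in plan]
    return indices == sorted(indices)
```

A rollout plan that starts with the ingester and then moves to the compactor would be rejected by `is_valid_order`, while any ordering within a single layer is accepted.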

Also reworks docs/operations/_index.md (previously a stub with no content) to surface core operator guides separately from specialized tools.

No code changes. All 18 cross-references to existing pages in docs/guides/, docs/configuration/, and docs/operations/ were verified to resolve.
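
A cross-reference check of this kind could be scripted roughly as follows. This is a minimal sketch, not the method used for this PR: it assumes links are written as `{{< relref "target" >}}` with a quoted target, and it compares targets against a caller-supplied set of known paths rather than reproducing Hugo's own resolution rules. The function names are hypothetical.

```python
import re

# Matches {{< relref "target" >}} and {{% relref "target" %}} shortcodes.
RELREF = re.compile(r'\{\{[<%]\s*relref\s+"([^"]+)"\s*[>%]\}\}')


def extract_relref_targets(markdown: str) -> list[str]:
    """Return every relref target in the order it appears."""
    return RELREF.findall(markdown)


def unresolved_targets(markdown: str, known_paths: set[str]) -> list[str]:
    """Return targets whose path (ignoring any #anchor) is not in known_paths."""
    missing = []
    for target in extract_relref_targets(markdown):
        path = target.split("#", 1)[0]
        if path not in known_paths:
            missing.append(target)
    return missing
```

An empty result from `unresolved_targets` only shows that each target appears in the supplied set; the rendered Hugo site remains the authoritative check.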

Test plan

  • make doc runs without errors (or confirm not required — no config/flag changes).
  • Spot-check rendered pages in the Hugo site preview.
  • Verify the {{< relref >}} links resolve in the rendered site.
  • Confirm dashboard filenames and alert names referenced in monitoring-cortex.md match the current contents of docs/getting-started/dashboards/ and docs/getting-started/alerts.yaml.
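
The last test-plan item could be partly automated with a sketch like the one below. It rests on assumptions that should be checked against the real files: that alert names are written in backticks in monitoring-cortex.md, that they share a common prefix (`Cortex` is assumed here), and that each rule in alerts.yaml declares its name on an `alert:` line. The function names are hypothetical, and a regular expression stands in for a real YAML parser to keep the sketch dependency-free.

```python
import re

# Matches lines such as "  - alert: SomeAlertName" in a Prometheus rules file.
ALERT_LINE = re.compile(r"^\s*-?\s*alert:\s*(\S+)\s*$", re.MULTILINE)


def defined_alerts(rules_yaml: str) -> set[str]:
    """Collect alert names declared in the rules file text."""
    return {name.strip("\"'") for name in ALERT_LINE.findall(rules_yaml)}


def referenced_alerts(doc_markdown: str, prefix: str = "Cortex") -> set[str]:
    """Collect backticked identifiers in the doc that start with the prefix."""
    pattern = re.compile(r"`(" + re.escape(prefix) + r"[A-Za-z0-9_]+)`")
    return set(pattern.findall(doc_markdown))


def missing_alerts(doc_markdown: str, rules_yaml: str, prefix: str = "Cortex") -> set[str]:
    """Alert names mentioned in the doc but not declared in the rules file."""
    return referenced_alerts(doc_markdown, prefix) - defined_alerts(rules_yaml)
```

In practice the two arguments would be the contents of monitoring-cortex.md and docs/getting-started/alerts.yaml, and any name returned by `missing_alerts` would need a manual look, since it may be a non-alert identifier that happens to share the prefix.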

Surface the bundled Grafana dashboards, Prometheus alerts, recording rules,
and Jsonnet mixin that already live in docs/getting-started/ so operators can
find and install them. Add a symptom-driven troubleshooting decision tree
covering the write path, read path, storage, hash ring, alertmanager, ruler,
and noisy-neighbour scenarios. Add an upgrade procedure that documents the
canonical component ordering, ingester drain semantics, and downgrade caveats.
Rework docs/operations/_index.md so the new pages are navigable from the
operations landing page.

Signed-off-by: Charlie Le <charlie_le@apple.com>
@CharlieTLe force-pushed the docs/operations-monitoring-troubleshooting-upgrading branch from 51a41fc to 85ae646 on June 30, 2026 23:32
@CharlieTLe marked this pull request as ready for review on June 30, 2026 23:37
@dosubot (bot) added the component/documentation and type/production (Issues related to the production use of Cortex, inc. configuration, alerting and operating.) labels on Jun 30, 2026
Removed CI Modernization section from operations index.

Signed-off-by: Charlie Le <charlie_le@apple.com>

@SungJin1212 SungJin1212 left a comment

Member

Great work, thanks!
Not blocking this PR, but could we also add a zone awareness guide?

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jul 2, 2026

Labels

component/documentation lgtm This PR has been approved by a maintainer size/L type/production Issues related to the production use of Cortex, inc. configuration, alerting and operating.

2 participants