Skip to content

feat(mutation): route serial add mutations through coordinator#156

Open
gwokhou wants to merge 12 commits into
VectifyAI:mainfrom
gwokhou:pr/add-mutation-coordinator
Open

feat(mutation): route serial add mutations through coordinator#156
gwokhou wants to merge 12 commits into
VectifyAI:mainfrom
gwokhou:pr/add-mutation-coordinator

Conversation

@gwokhou

@gwokhou gwokhou commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Background

#142 made serial openkb add crash-safe with journals, touched-path snapshots, staged publish, rollback, and recovery drain. The recovery model is now in place, but its orchestration still lives inline in the CLI add paths.

Before OpenKB can safely add prepare workers or --jobs, official KB mutation ownership needs to be explicit. Future workers may prepare disposable artifacts, but one serial owner must remain responsible for snapshotting rollback-critical paths, publishing staged artifacts, updating the registry, marking the commit point, and handling rollback or post-commit cleanup.

This PR introduces that owner for add-like mutations. The coordinator lifts the crash-safe lifecycle from #142 into a higher-level abstraction: it owns transaction mechanics, while the add paths keep document-specific publish, compile, and register logic.

This is the Stage 3 bridge from the parallel architecture roadmap: #151. It keeps ingestion serial, makes the mutation boundary reviewable, and gives future non-add commands a pattern to adopt when they need the same recovery guarantees.

Summary

  • add AddMutationPlan and run_add_mutation as the serial coordinator for add-like KB mutations
  • route local file add through the coordinator
  • route URL add through the coordinator after the URL is fetched into raw/
  • route PageIndex Cloud import through the same coordinator
  • make touched paths, mutation body, commit point, rollback behavior, staging cleanup, and post-commit hooks explicit
  • preserve the crash-safe recovery behavior introduced in feat: add serial crash-safe mutation recovery #142
  • keep the .openkb/files rollback surface intact without eager snapshotting — the long-doc blob is registered lazily via snapshot.track_new(), and index_long_document deletes the PageIndex doc if it fails after col.add() so a half-applied add leaves no orphan blob (review-driven fix for a gap left by perf(mutation): don't snapshot the whole blob store on every add #155)

Scope

This is the Stage 3 slice from the parallel architecture roadmap: #151

In scope:

  • existing openkb add <file>
  • existing openkb add <URL> after URL fetch
  • existing openkb add --from-pageindex-cloud <DOC_ID>
  • add mutation snapshot/rollback orchestration
  • post-commit ingest log hooks
  • tests for coordinator routing, rollback behavior, URL add regression coverage, and interrupted cloud import rollback

Out of scope:

  • coordinator ownership of URL fetching itself
  • openkb add <dir> --jobs N
  • prepare/commit split
  • worker pools
  • deterministic parallel commit queues
  • migration of remove, recompile, lint --fix, reports, chat sessions, or watch batching

Why this matters

After this PR, future parallel ingestion work has a clear boundary:

  • prepare workers can produce private staging output only
  • the serial coordinator remains the only path that mutates official KB state
  • final names, touched paths, registry writes, commit signals, rollback, and cleanup stay under one owner

That boundary is what lets later PRs split prepare from commit without weakening the recovery guarantees from #142.

Verification

  • UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_add_command.py -q
  • UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/test_mutation.py tests/test_locks.py tests/test_add_command.py tests/test_url_ingest.py tests/test_indexer.py -q
  • git diff --check main...HEAD

@gwokhou gwokhou force-pushed the pr/add-mutation-coordinator branch from 802d0e6 to 0401b81 Compare July 1, 2026 03:35
@gwokhou gwokhou changed the title feat(add): route serial add mutations through coordinator feat(mutation): route serial add mutations through coordinator Jul 1, 2026
@gwokhou gwokhou marked this pull request as ready for review July 1, 2026 03:38
@gwokhou

gwokhou commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Pushed three follow-up commits to handle the conflict, the review fix, and the final coverage gaps separately:

  1. d804e80 merges the latest upstream/main into this PR branch and resolves the conflict around the new add mutation coordinator plus the upstream blob-store snapshot optimization. The resolved path now passes the mutation snapshot into the add body and uses snapshot.track_new(...) for newly created long-doc blobs instead of snapshotting .openkb/files.

  2. 9a87b81 fixes a cloud import race found during review. Cloud imports now re-check the synthetic hash and resolve the final doc_name while holding the ingest lock, so a concurrent import cannot remove or overwrite an existing registry entry with the same display name.

  3. 619e78e improves test coverage for the remaining edge cases:

  • same PageIndex Cloud doc_id registered after fetch but before the second ingest-lock check now skips without compiling or mutating;
  • directory add stops immediately if a dirty rollback propagates;
  • kb_ingest_lock_held() is verified as exclusive-lock-only and thread-local.

Validation:

  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_add_command.py tests/test_mutation.py tests/test_locks.py tests/test_url_ingest.py

  • Result: 116 passed

  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev ruff check tests/test_add_command.py tests/test_locks.py

  • Result: All checks passed!

Coverage-mode validation:

  • UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev --with coverage coverage run --branch --source=openkb -m pytest tests/test_add_command.py tests/test_mutation.py tests/test_locks.py tests/test_url_ingest.py

  • Result: 116 passed

  • PR diff coverage for changed executable production lines: 113/119 (95.0%)

@gwokhou gwokhou marked this pull request as draft July 1, 2026 09:50
gwokhou added 2 commits July 1, 2026 18:05
…col.add

Since the add mutation stopped eagerly snapshotting .openkb/files (it registers the new blob via track_new only on success), a blob written by col.add() leaks if index_long_document raises afterward: pageindex.db is rolled back by the snapshot but the blob file is not, and nothing reclaims it. Delete the doc on the failure path so the indexer owns cleanup of a half-applied add, restoring the .openkb/files rollback-surface guarantee.
… prep

run_add_mutation handles snapshot/body failures itself and returns False, so the broad except only ever catches pre-mutation errors (name resolution, registry read, plan construction). Echo the real exception instead of the old 'Failed to prepare mutation snapshot' label, which hid the cause at DEBUG level.
@gwokhou

gwokhou commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Found two more issues in self-review. All fixed.

b5ebf52

  • Leak: col.add() writes the blob; track_new() registers it only on success → a later raise orphans it (pageindex.db rolled back, file unreclaimed).
  • Contract: breaks Parallel Architecture Roadmap #151 — ".openkb/files … affected paths must be in the rollback surface."
  • Fix: the indexer owns cleanup of a half-applied add — try / except BaseException best-effort col.delete_document(doc_id) before re-raising (guarded; matches run_add_mutation).
  • Test: test_deletes_pageindex_doc_when_a_post_add_step_fails

cf53d5f

  • Problem: run_add_mutation returns False on body failures, so the broad except only catches pre-mutation errors (name resolution, registry, plan build) — yet echoed "Failed to prepare mutation snapshot", hiding the cause at DEBUG.
  • Fix: except Exception as exc → "[ERROR] Cloud import failed for {doc_id}: {exc}".
  • Test: test_cloud_import_failure_message_names_real_cause

@gwokhou gwokhou marked this pull request as ready for review July 1, 2026 10:20
gwokhou added 5 commits July 2, 2026 18:00
…rdinator

# Conflicts:
#	openkb/cli.py
#	openkb/indexer.py
cli._cleanup_staging duplicated add_coordinator._cleanup_staging_dirs (both guard None then shutil.rmtree with ignore_errors). Drop the cli copy and route its two pre-coordinator call sites through the coordinator helper.
AddMutationPlan.rollback_error_message had zero callers — every construction site (cli.py x2, tests x6) used the default. Remove the field and inline the literal at its single read site.
@gwokhou

gwokhou commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@KylinMountain The PR is ready for review🚀

Synced with main, CI is green, and MERGEABLE / CLEAN. Would appreciate a review of the Stage 3 coordinator extraction and the crash-safety fixes — happy to split, restructure, or make changes. Details of today's push below.

Update — synced with main + CI fixes + two cleanups

Sync (e0acade): merged upstream/main#159 (agent-first harness + ruff CI gate) and #160 (image basename disambiguation). Resolved conflicts in cli.py / indexer.py, keeping this PR's coordinator and blob-cleanup code. PR is now MERGEABLE / CLEAN.

CI#159's new ruff gates flagged the branch; fixed in two commits:

  • d9ddc29ruff check E501 (3 lines > 100 chars): wrapped.
  • bdb686druff format --check: ran ruff format on add_coordinator.py + test_add_command.py.

Cleanups — two review follow-ups on code this PR introduces:

  • ac80040 — drop the duplicate cli._cleanup_staging; reuse add_coordinator._cleanup_staging_dirs.
  • 08523de — drop the unused AddMutationPlan.rollback_error_message field (zero callers); inline the literal.

Both behavior-preserving. Full suite green (935 passed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant