feat(providers): prompt caching for Anthropic + Azure-Anthropic#5101

Open
waleedlatif1 wants to merge 6 commits into staging from feat/provider-prompt-caching
Conversation

@waleedlatif1

@waleedlatif1 waleedlatif1 commented Jun 16, 2026

Collaborator

Summary

  • Marks the static request prefix (system prompt + tools) with an ephemeral cache_control breakpoint for Anthropic (and Azure-Anthropic, which shares the core), so repeated calls — agent tool-loops and multi-turn chats — reuse the cached prefix: ~90% cheaper cached input + lower latency.
  • The tagging lives in one directly-tested helper, applyAnthropicPromptCache(payload, tools, systemPrompt) (anthropic/utils.ts), which gates on whether caching is worthwhile and mutates the system block + last tool.
  • Always on — there is no feature flag. Caching is transparent to outputs, so it runs for every eligible request.
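For orientation, here is a minimal sketch of what a tagged request might look like after the helper runs. The `cache_control: { type: 'ephemeral' }` marker and the block shapes follow Anthropic's Messages API; the interface names, the model string, the tool definitions, and `countBreakpoints` are illustrative assumptions, not code from this PR.

```typescript
// Illustrative types only; a real project would use the Anthropic SDK's own types.
type CacheControl = { type: 'ephemeral' }

interface TextBlock {
  type: 'text'
  text: string
  cache_control?: CacheControl
}

interface ToolDef {
  name: string
  description: string
  input_schema: Record<string, unknown>
  cache_control?: CacheControl
}

interface RequestPayload {
  model: string
  max_tokens: number
  system?: string | TextBlock[]
  tools?: ToolDef[]
  messages: { role: 'user' | 'assistant'; content: string }[]
}

// The system string has become a single cached text block, and only the
// last tool carries a breakpoint, so the prefix up to that tool is cacheable.
const taggedPayload: RequestPayload = {
  model: 'example-claude-model', // placeholder, not a real model id
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'You are a workflow agent. (long static instructions go here)',
      cache_control: { type: 'ephemeral' },
    },
  ],
  tools: [
    { name: 'search', description: 'Search docs', input_schema: { type: 'object', properties: {} } },
    {
      name: 'fetch',
      description: 'Fetch a URL',
      input_schema: { type: 'object', properties: {} },
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: 'Hello' }],
}

// Hypothetical helper: counts how many cache breakpoints a payload uses.
function countBreakpoints(payload: RequestPayload): number {
  const systemBlocks = Array.isArray(payload.system) ? payload.system : []
  const candidates = [...systemBlocks, ...(payload.tools ?? [])]
  return candidates.filter((block) => block.cache_control !== undefined).length
}
```

In this sketch the payload uses two breakpoints (system block and last tool), which leaves headroom under the per-request breakpoint limit.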

When it caches (the gate)

providers/prompt-cache.ts only applies breakpoints when the static prefix is large enough to be cacheable and likely reused (tools present, or a large system prompt). A one-shot, tool-less call is skipped so it never pays the cache-write surcharge for a prefix that's never read back. The gate is sized on the larger of the final payload.system (which may include appended structured-output schema) and the original request.systemPrompt (non-empty even when the no-messages path relocates it into a user message).
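The gate described above could be sketched roughly as follows. This is not the code in providers/prompt-cache.ts: the function name, the input shape, and the 4-characters-per-token estimate are assumptions for illustration, and only the ~1,024-token threshold and the "tools present, or a large system prompt" rule come from this PR.

```typescript
// Threshold taken from the PR discussion (~1,024 estimated tokens).
const MIN_CACHEABLE_TOKENS = 1024
// Assumption: a rough chars-per-token heuristic; the real estimator may differ.
const APPROX_CHARS_PER_TOKEN = 4

function estimateTokens(chars: number): number {
  return Math.ceil(chars / APPROX_CHARS_PER_TOKEN)
}

interface GateInput {
  systemPrompt: string | undefined
  hasTools: boolean
  toolsApproxChars: number
}

// Hypothetical stand-in for shouldCacheStaticPrefix.
function shouldCacheStaticPrefixSketch(input: GateInput): boolean {
  const systemChars = input.systemPrompt?.trim().length ?? 0
  // No system prompt at all: treat as a one-shot call and skip caching.
  if (systemChars === 0) return false

  const toolChars = input.hasTools ? input.toolsApproxChars : 0
  const prefixTokens = estimateTokens(systemChars + toolChars)
  // Too small to be cacheable: tagging would add no benefit.
  if (prefixTokens < MIN_CACHEABLE_TOKENS) return false

  // Likely reused: either a tool loop, or a system prompt that is large on its own.
  return input.hasTools || estimateTokens(systemChars) >= MIN_CACHEABLE_TOKENS
}
```

Under these assumptions, a short system prompt with no tools is skipped, while the same short prompt combined with a large tool list passes the gate.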

Why this is safe

  • Outputs are identical — prompt caching only reuses the model's computed prefix; it never changes generated responses.
  • Faster + cheaper on Claude (cached input ~0.1×).
  • Cost accounting stays accurate — Anthropic already reads cache_read_input_tokens / cache_creation_input_tokens (buildAnthropicSegmentTokens).

Standard practice

Matches the AI SDK / LangChain / Spring AI / Pydantic AI / LiteLLM convention: explicit cache breakpoints for Claude (Anthropic/Bedrock), automatic server-side caching for OpenAI/Gemini/etc. We auto-place breakpoints on the system+tools prefix (the convergent "SYSTEM_AND_TOOLS" strategy), so users don't hand-mark anything.

Type of Change

  • Performance/cost optimization (no behavioral change to outputs)

Testing

  • bun run type-check clean
  • 12 unit tests (gate logic + the applyAnthropicPromptCache payload mutation across all paths: system→cached block, last-tool tagged, relocated/blanked system, schema-appended system, below-threshold/tool-less no-op), verified on vitest 4.1.8
  • bun run lint clean · bun run check:api-validation passed

Follow-ups (not in this PR)

  • Bedrock (cachePoint) and OpenRouter (cache_control passthrough for Claude) — these need cached-token accounting added alongside (Bedrock doesn't read cacheReadInputTokens/cacheWriteInputTokens), so shipping caching there without it would mis-report cost.
  • Optional prompt_cache_key for OpenAI/Azure.

Checklist

  • Code follows project style guidelines
  • Self-reviewed
  • Tests added/updated and passing
  • No new warnings
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Mark the static request prefix (system prompt + tools) with an ephemeral
cache_control breakpoint so repeated calls — agent tool-loops and multi-turn —
reuse the cached prefix (~90% cheaper cached input + lower latency). Azure-
Anthropic inherits this via the shared core.

- New providers/prompt-cache.ts gate: only caches when the static prefix is
  large enough to be cacheable AND likely reused (tools present, or a large
  system prompt), so a one-shot tool-less call never pays the cache-write
  surcharge. Kill switch: PROMPT_CACHE_DISABLED=true.
- anthropic/core.ts: convert system string -> a cached text block (after the
  structured-output concat, which assumes a string) and tag the last tool. Uses
  2 of Anthropic's 4 breakpoints; the tool-loop reuses the tagged payload.
- Outputs are unchanged; cost accounting already reads cache_read/creation
  tokens (buildAnthropicSegmentTokens), so usage stays accurate.

Matches the AI SDK / LangChain / Spring AI convention (explicit breakpoints for
Claude; automatic for OpenAI/Gemini). Bedrock + OpenRouter to follow (they need
cache-token accounting alongside).
@vercel

vercel Bot commented Jun 16, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 16, 2026 10:59pm

@cursor

cursor Bot commented Jun 16, 2026


PR Summary

Low Risk
Request-shape-only optimization with explicit gating; existing cache token accounting in usage handling is unchanged and outputs are not altered.

Overview
Adds automatic Anthropic prompt caching for the shared Anthropic/Azure-Anthropic request path by tagging the static prefix (system + tool definitions) with ephemeral cache_control when reuse is likely, without changing model outputs.

A new shouldCacheStaticPrefix gate (~1,024 estimated tokens, tools or large system) skips small one-shot calls so they do not pay cache-write surcharges. applyAnthropicPromptCache runs in executeAnthropicProviderRequest after structured-output changes to payload.system, converts a non-empty system string into a cached text block, and sets cache_control on the last tool only. Sizing uses the larger of final payload.system (e.g. appended JSON schema) and the original request.systemPrompt (including when the no-messages path blanks payload.system but tools remain).
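The mutation described here can be illustrated with a small sketch. It is an approximation, not the helper in anthropic/utils.ts: the types, the injected `gate` parameter, `sketchGate`, and the use of `JSON.stringify` to approximate tool size are all assumptions made so the example is self-contained.

```typescript
type CacheControl = { type: 'ephemeral' }

interface SketchTextBlock {
  type: 'text'
  text: string
  cache_control?: CacheControl
}

interface SketchTool {
  name: string
  description: string
  cache_control?: CacheControl
}

interface SketchPayload {
  system?: string | SketchTextBlock[]
  tools?: SketchTool[]
}

type Gate = (system: string, hasTools: boolean, toolsApproxChars: number) => boolean

// Assumed gate for the example: non-empty system and roughly 1,024+ tokens at 4 chars/token.
const sketchGate: Gate = (system, hasTools, toolsApproxChars) => {
  const prefixChars = system.length + (hasTools ? toolsApproxChars : 0)
  return system.length > 0 && prefixChars / 4 >= 1024
}

// Hypothetical stand-in for applyAnthropicPromptCache(payload, tools, systemPrompt).
function applyPromptCacheSketch(
  payload: SketchPayload,
  tools: SketchTool[] | undefined,
  requestSystemPrompt: string | undefined,
  gate: Gate
): void {
  const payloadSystem = typeof payload.system === 'string' ? payload.system : ''
  const requestSystem = requestSystemPrompt ?? ''
  // Size the gate on whichever text is larger: the final payload.system may carry an
  // appended schema, while the request prompt survives when payload.system is blanked.
  const gateSystem = payloadSystem.length >= requestSystem.length ? payloadSystem : requestSystem

  const hasTools = tools !== undefined && tools.length > 0
  const toolsApproxChars = hasTools ? JSON.stringify(tools).length : 0
  if (!gate(gateSystem, hasTools, toolsApproxChars)) return

  // Convert a non-empty system string into one cached text block.
  if (payloadSystem.length > 0) {
    payload.system = [{ type: 'text', text: payloadSystem, cache_control: { type: 'ephemeral' } }]
  }

  // Tag only the last tool, so the breakpoint covers the tool list before it.
  const lastTool = tools && tools.length > 0 ? tools[tools.length - 1] : undefined
  if (lastTool) lastTool.cache_control = { type: 'ephemeral' }
}
```

In the relocated case (blank `payload.system`, tools present), this sketch leaves the system field alone and still tags the last tool, which mirrors the behavior the overview describes.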

Unit tests cover the gate, payload mutation edge cases, and end-to-end payload capture on the streaming/no-tools path.

Reviewed by Cursor Bugbot for commit b9a453d. Configure here.

Comment thread apps/sim/providers/anthropic/core.ts Outdated
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Contributor

Greptile Summary

This PR enables Anthropic prompt caching for the Anthropic and Azure-Anthropic providers by stamping cache_control: { type: 'ephemeral' } on the static request prefix (system prompt + last tool definition). A gating function avoids the cache-write surcharge on small, tool-less one-shot calls.

  • providers/prompt-cache.ts introduces shouldCacheStaticPrefix, which gates caching on a ≥1,024-token combined prefix estimate and requires either tools or a system prompt large enough to clear that threshold on its own.
  • providers/anthropic/utils.ts adds applyAnthropicPromptCache, called once after schema mutation in core.ts, that converts the system string to a cached block and tags the last tool; 12 unit tests cover all paths including the no-messages relocation edge case.

Confidence Score: 4/5

Safe to merge with awareness that cache token fees are still excluded from the top-level ProviderResponse cost totals.

The caching logic itself is correct and well-tested. However, now that caching is always on, every warm-cache call reports an inaccurate cost: cache_creation_input_tokens (billed at ~1.25× input rate) and cache_read_input_tokens (billed at ~0.1× input rate) are never added to the accumulated tokens or cost objects returned in ProviderResponse. The per-segment trace handles this correctly via buildAnthropicSegmentTokens, but the response-level totals that callers use for billing display will undercount on every cached request.

apps/sim/providers/anthropic/core.ts — all three token-accumulation sites (streaming-no-tools path ~line 422, non-streaming initial response ~line 860, and tool-loop iteration ~line 1141) need cache token accounting added to match what buildAnthropicSegmentTokens already does correctly.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/providers/anthropic/core.ts | Single-line insertion of applyAnthropicPromptCache is correct in placement (after schema mutation, before thinking config), but the accumulated tokens/cost returned in ProviderResponse still exclude cache_creation_input_tokens and cache_read_input_tokens, causing systematic cost underreporting on every warm-cache call. |
| apps/sim/providers/anthropic/utils.ts | Adds applyAnthropicPromptCache, which correctly handles system-string-to-block conversion, last-tool tagging, and the no-messages relocation edge case. Logic and tests are thorough. |
| apps/sim/providers/prompt-cache.ts | New gate function shouldCacheStaticPrefix: well-designed with correct token estimation, clear invariant (require non-empty system prompt to avoid one-shot write surcharges), and comprehensive unit tests covering all branches. |
| apps/sim/providers/anthropic/utils.test.ts | 12 unit tests covering all applyAnthropicPromptCache paths: large/small system, with/without tools, schema-appended system, relocated/blanked system, and below-threshold no-op. |
| apps/sim/providers/prompt-cache.test.ts | Gate tests are complete and use vi.stubEnv correctly (addressing the prior env-coercion concern). |
| apps/sim/providers/anthropic/core.test.ts | Integration-level request-capture tests verify the system block is tagged for large prompts and left as a plain string for small ones. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeAnthropicProviderRequest] --> B[Build payload\nsystem = systemPrompt]
    B --> C{responseFormat?}
    C -->|prompt-based| D[Append schema to\npayload.system]
    C -->|native / none| E[No mutation]
    D --> F[applyAnthropicPromptCache\npayload, tools, request.systemPrompt]
    E --> F
    F --> G{shouldCacheStaticPrefix\ngateSystem, hasTools, toolsApproxChars}
    G -->|prefixTokens < 1024\nor no system| H[No-op: return]
    G -->|prefixTokens >= 1024\nhasTools or large system| I{payloadSystem\nnon-empty?}
    I -->|yes| J[payload.system = TextBlockParam\nwith cache_control: ephemeral]
    I -->|no - relocated| K[Skip system block]
    J --> L{tools present?}
    K --> L
    L -->|yes| M[tools lastIndex.cache_control = ephemeral]
    L -->|no| N[Done]
    M --> N
    N --> O[Add thinking config if requested]
    O --> P[API call]

Comments Outside Diff (1)

  1. apps/sim/providers/anthropic/core.ts, line 860-876 (link)

    P1 Accumulated tokens and cost still omit cache-token fees

    Now that caching is always-on, every cached request produces non-zero cache_creation_input_tokens (billed at ~1.25× the regular input rate) and cache_read_input_tokens (billed at ~0.1×). Neither field is added to the accumulated tokens object, and calculateCost is called without the useCached flag, so the top-level cost returned in ProviderResponse silently undercounts. The per-segment trace path (buildAnthropicSegmentTokens → calculateCost(..., useCached)) handles this correctly, but the accumulated response totals do not.

    The same gap exists in all three code paths: the initial non-streaming token block (lines 860–865), the streaming-only path (~line 422), and the tool-loop accumulation (~line 1141). Before this PR, cache tokens were always zero so it was harmless; it now produces systematic underreporting on every warm-cache call.
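To make the size of the gap concrete, here is a rough sketch of cost accounting that includes cache tokens. It is not the project's calculateCost: the `SketchPricing` shape, the function name, and the per-million-token prices in the usage note are hypothetical, while the ~1.25× and ~0.1× multipliers are the ones quoted in this comment. It also assumes that `input_tokens` in Anthropic's usage object excludes tokens written to or read from the cache.

```typescript
// Usage fields as reported by the Anthropic Messages API.
interface SketchUsage {
  input_tokens: number
  output_tokens: number
  cache_creation_input_tokens?: number
  cache_read_input_tokens?: number
}

// Hypothetical pricing shape: USD per million tokens.
interface SketchPricing {
  inputPerMTok: number
  outputPerMTok: number
}

// Multipliers quoted in the review comment (approximate).
const CACHE_WRITE_MULTIPLIER = 1.25
const CACHE_READ_MULTIPLIER = 0.1

function estimateCostWithCache(usage: SketchUsage, pricing: SketchPricing): number {
  const cacheWrite = usage.cache_creation_input_tokens ?? 0
  const cacheRead = usage.cache_read_input_tokens ?? 0

  // Uncached input, cache writes, and cache reads are each billed at a different rate.
  const inputCost =
    (usage.input_tokens +
      cacheWrite * CACHE_WRITE_MULTIPLIER +
      cacheRead * CACHE_READ_MULTIPLIER) *
    pricing.inputPerMTok
  const outputCost = usage.output_tokens * pricing.outputPerMTok

  return (inputCost + outputCost) / 1_000_000
}

// Cost as it would be reported if cache tokens were ignored entirely.
function estimateCostIgnoringCache(usage: SketchUsage, pricing: SketchPricing): number {
  return (usage.input_tokens * pricing.inputPerMTok + usage.output_tokens * pricing.outputPerMTok) / 1_000_000
}
```

With made-up prices of 3 and 15 per million tokens and a warm-cache call of 200 uncached input tokens, 2,000 cache-read tokens, and 100 output tokens, the cache-aware estimate is about 0.0027 while the cache-blind estimate is about 0.0021, which is the kind of undercount described above.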

Reviews (7): Last reviewed commit: "test(providers): add request-capture tes..." | Re-trigger Greptile

Comment thread apps/sim/providers/prompt-cache.test.ts Outdated
…tubEnv

- anthropic/core.ts: gate on request.systemPrompt instead of payload.system, so
  the no-messages path (where the system text is relocated into a user message
  and payload.system is blanked) still caches the tools prefix. (Cursor review)
- prompt-cache.test.ts: manage the kill-switch env via vi.stubEnv/unstubAllEnvs
  instead of assigning undefined (which coerces to "undefined" and leaks across
  workers). Addresses the Greptile finding while satisfying biome's noDelete rule.
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 3a44936. Configure here.

…elper

- Remove the PROMPT_CACHE_DISABLED kill switch — prompt caching is always on.
- Extract the Anthropic tagging into applyAnthropicPromptCache(payload, tools,
  systemPrompt) in anthropic/utils.ts: one place that gates and mutates the
  system block + last tool, replacing the two inline blocks in core.ts.
- Add direct unit tests for the helper (system→cached block, last-tool tagged,
  relocated/blanked-system still tags tools, below-threshold and tool-less cases
  untouched) so the actual payload mutation is verified, not just the gate.

No behavior change to outputs; verified on vitest 4.1.8 (CI's version).
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

Comment thread apps/sim/providers/anthropic/utils.ts
Comment thread apps/sim/providers/prompt-cache.ts
…m and request prompt

Gate on max(final payload.system, request.systemPrompt) so caching fires both
when the no-messages path blanks payload.system (size via the request prompt)
and when prompt-based structured output appends a large schema to payload.system
(size via the final system string). Add a test for the schema-appended case.

Caught by Cursor Bugbot.
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 38140c7. Configure here.

Drop the inline // comments in favor of TSDoc on the helper/gate. The gate-sizing
and call-ordering rationale now lives in applyAnthropicPromptCache's TSDoc; no
behavior change.
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 5e90631. Configure here.

Comment thread apps/sim/providers/anthropic/core.ts
Drives the real executeAnthropicProviderRequest down the streaming path with only
the client injected via the createClient seam (real models/utils/attachments),
and asserts the request payload handed to messages.create carries a
cache_control-tagged system block for a large prompt and a plain string for a
small one. Closes the end-to-end wiring gap (AI-SDK-style request-body capture).
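The request-capture approach can be sketched without any test framework. Everything here is an illustrative assumption rather than the actual core.test.ts: `createCapturingClient` stands in for the client injected through the createClient seam, and `sendRequestSketch` is a simplified stand-in for executeAnthropicProviderRequest with an assumed 4-characters-per-token size check.

```typescript
type SystemParam =
  | string
  | { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }[]

interface CreateParams {
  system?: SystemParam
  messages: { role: 'user'; content: string }[]
}

interface MessagesClient {
  messages: { create(params: CreateParams): Promise<void> }
}

// Fake client that records every payload handed to messages.create.
function createCapturingClient(): { client: MessagesClient; captured: CreateParams[] } {
  const captured: CreateParams[] = []
  const client: MessagesClient = {
    messages: {
      create(params) {
        // Deep-copy so later mutations by the caller cannot alter the record.
        captured.push(JSON.parse(JSON.stringify(params)) as CreateParams)
        return Promise.resolve()
      },
    },
  }
  return { client, captured }
}

// Hypothetical provider entry point: tags large system prompts, leaves small ones as strings.
function sendRequestSketch(client: MessagesClient, systemPrompt: string, userText: string): Promise<void> {
  const isLarge = systemPrompt.length / 4 >= 1024 // assumed size check
  const system: SystemParam = isLarge
    ? [{ type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } }]
    : systemPrompt
  return client.messages.create({ system, messages: [{ role: 'user', content: userText }] })
}
```

A test built this way asserts on the captured payload rather than on the helper's return value, so it also catches the case where the helper is correct but never wired into the request path.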
@waleedlatif1

Collaborator Author

@greptile review

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit b9a453d. Configure here.

@waleedlatif1 waleedlatif1 deleted the branch staging July 1, 2026 05:43
@waleedlatif1 waleedlatif1 reopened this Jul 1, 2026