feat(providers): prompt caching for Anthropic + Azure-Anthropic by waleedlatif1 · Pull Request #5101 · simstudioai/sim

waleedlatif1 · 2026-06-16T21:12:48Z

Summary

Marks the static request prefix (system prompt + tools) with an ephemeral cache_control breakpoint for Anthropic (and Azure-Anthropic, which shares the core), so repeated calls — agent tool-loops and multi-turn chats — reuse the cached prefix: ~90% cheaper cached input + lower latency.
The tagging lives in one directly-tested helper, applyAnthropicPromptCache(payload, tools, systemPrompt) (anthropic/utils.ts), which gates on whether caching is worthwhile and mutates the system block + last tool.
Always on — there is no feature flag. Caching is transparent to outputs, so it runs for every eligible request.

When it caches (the gate)

providers/prompt-cache.ts only applies breakpoints when the static prefix is large enough to be cacheable and likely reused (tools present, or a large system prompt). A one-shot, tool-less call is skipped so it never pays the cache-write surcharge for a prefix that's never read back. The gate is sized on the larger of the final payload.system (which may include appended structured-output schema) and the original request.systemPrompt (non-empty even when the no-messages path relocates it into a user message).
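
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in providers/prompt-cache.ts: the function name shouldCacheStaticPrefixSketch, the StaticPrefixInfo shape, the 4-characters-per-token estimate, and the exact way "large system prompt" is decided are all assumptions. Only the ~1,024-token threshold and the inputs (system text, hasTools, toolsApproxChars) come from the review summaries on this PR.

```typescript
// Hypothetical sketch of the caching gate. Constants and heuristics are assumptions.
const MIN_CACHEABLE_PREFIX_TOKENS = 1024 // ~1,024 estimated tokens, per the review summaries
const APPROX_CHARS_PER_TOKEN = 4 // assumed rough estimate, not a real tokenizer

interface StaticPrefixInfo {
  systemPrompt: string | undefined
  hasTools: boolean
  toolsApproxChars: number
}

function estimateTokens(chars: number): number {
  return Math.ceil(chars / APPROX_CHARS_PER_TOKEN)
}

function shouldCacheStaticPrefixSketch(info: StaticPrefixInfo): boolean {
  const systemChars = info.systemPrompt?.trim().length ?? 0
  // No system prompt at all: skip, so a one-shot call never pays the write surcharge.
  if (systemChars === 0) return false

  const toolChars = info.hasTools ? info.toolsApproxChars : 0
  const prefixTokens = estimateTokens(systemChars + toolChars)
  // Prefix too small to be worth caching.
  if (prefixTokens < MIN_CACHEABLE_PREFIX_TOKENS) return false

  // Reuse signal: tools imply a tool loop; otherwise the system prompt
  // must be large enough on its own.
  return info.hasTools || estimateTokens(systemChars) >= MIN_CACHEABLE_PREFIX_TOKENS
}

// Usage: a short tool-less prompt is skipped, a tool-bearing request is cached.
const skipSmall = shouldCacheStaticPrefixSketch({
  systemPrompt: 'You are helpful.',
  hasTools: false,
  toolsApproxChars: 0,
})
const cacheWithTools = shouldCacheStaticPrefixSketch({
  systemPrompt: 'You are helpful.',
  hasTools: true,
  toolsApproxChars: 8000,
})
```

Keeping the decision in one pure function with no I/O is what makes it cheap to unit-test every branch in isolation.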

Why this is safe

Outputs are identical — prompt caching only reuses the model's computed prefix; it never changes generated responses.
Faster + cheaper on Claude (cached input ~0.1×).
Cost accounting stays accurate: the Anthropic provider already reads cache_read_input_tokens / cache_creation_input_tokens (buildAnthropicSegmentTokens).

Standard practice

Matches the AI SDK / LangChain / Spring AI / Pydantic AI / LiteLLM convention: explicit cache breakpoints for Claude (Anthropic/Bedrock), automatic server-side caching for OpenAI/Gemini/etc. We auto-place breakpoints on the system+tools prefix (the convergent "SYSTEM_AND_TOOLS" strategy), so users don't hand-mark anything.

Type of Change

Performance/cost optimization (no behavioral change to outputs)

Testing

bun run type-check clean
12 unit tests (gate logic + the applyAnthropicPromptCache payload mutation across all paths: system→cached block, last-tool tagged, relocated/blanked system, schema-appended system, below-threshold/tool-less no-op), verified on vitest 4.1.8
bun run lint clean · bun run check:api-validation passed

Follow-ups (not in this PR)

Bedrock (cachePoint) and OpenRouter (cache_control passthrough for Claude): both need cached-token accounting added alongside (the Bedrock provider doesn't yet read cacheReadInputTokens/cacheWriteInputTokens), so shipping caching there without it would mis-report cost.
Optional prompt_cache_key for OpenAI/Azure.

Checklist

Code follows project style guidelines
Self-reviewed
Tests added/updated and passing
No new warnings
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Mark the static request prefix (system prompt + tools) with an ephemeral cache_control breakpoint so repeated calls (agent tool-loops and multi-turn) reuse the cached prefix (~90% cheaper cached input + lower latency). Azure-Anthropic inherits this via the shared core.

- New providers/prompt-cache.ts gate: only caches when the static prefix is large enough to be cacheable AND likely reused (tools present, or a large system prompt), so a one-shot tool-less call never pays the cache-write surcharge. Kill switch: PROMPT_CACHE_DISABLED=true.
- anthropic/core.ts: convert system string -> a cached text block (after the structured-output concat, which assumes a string) and tag the last tool. Uses 2 of Anthropic's 4 breakpoints; the tool-loop reuses the tagged payload.
- Outputs are unchanged; cost accounting already reads cache_read/creation tokens (buildAnthropicSegmentTokens), so usage stays accurate.

Matches the AI SDK / LangChain / Spring AI convention (explicit breakpoints for Claude; automatic for OpenAI/Gemini). Bedrock + OpenRouter to follow (they need cache-token accounting alongside).

cursor · 2026-06-16T21:13:01Z

PR Summary

Low Risk
Request-shape-only optimization with explicit gating; existing cache token accounting in usage handling is unchanged and outputs are not altered.

Overview
Adds automatic Anthropic prompt caching for the shared Anthropic/Azure-Anthropic request path by tagging the static prefix (system + tool definitions) with ephemeral cache_control when reuse is likely, without changing model outputs.

A new shouldCacheStaticPrefix gate (~1,024 estimated tokens, tools or large system) skips small one-shot calls so they do not pay cache-write surcharges. applyAnthropicPromptCache runs in executeAnthropicProviderRequest after structured-output changes to payload.system, converts a non-empty system string into a cached text block, and sets cache_control on the last tool only. Sizing uses the larger of final payload.system (e.g. appended JSON schema) and the original request.systemPrompt (including when the no-messages path blanks payload.system but tools remain).
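
The payload mutation described above might look roughly like the sketch below. It is self-contained and uses simplified stand-in types instead of the Anthropic SDK types. The name applyPromptCacheSketch, the PayloadSketch/ToolDef/TextBlock shapes, the inlined gate, and the JSON.stringify-based tool size estimate are assumptions for illustration. The behaviors it mirrors (size the gate on the larger of the two system strings, convert a non-empty system string to one cached text block, tag only the last tool, otherwise no-op) are taken from this PR's description.

```typescript
// Hypothetical sketch of the tagging helper. Types and the gate are simplified stand-ins.
type CacheControl = { type: 'ephemeral' }

interface TextBlock {
  type: 'text'
  text: string
  cache_control?: CacheControl
}

interface ToolDef {
  name: string
  description?: string
  input_schema: Record<string, unknown>
  cache_control?: CacheControl
}

interface PayloadSketch {
  system?: string | TextBlock[]
  tools?: ToolDef[]
}

const MIN_PREFIX_TOKENS = 1024 // threshold from the review summaries
const CHARS_PER_TOKEN = 4 // assumed estimate

function applyPromptCacheSketch(
  payload: PayloadSketch,
  tools: ToolDef[] | undefined,
  systemPrompt: string | undefined
): void {
  const payloadSystem = typeof payload.system === 'string' ? payload.system : ''
  const requestSystem = systemPrompt ?? ''
  // Size the gate on whichever system string is larger: the final payload.system
  // (may include an appended schema) or the original request prompt
  // (still non-empty when payload.system was blanked).
  const gateSystem = payloadSystem.length >= requestSystem.length ? payloadSystem : requestSystem

  const toolList = tools ?? []
  const toolsApproxChars = toolList.length > 0 ? JSON.stringify(toolList).length : 0
  const prefixTokens = Math.ceil((gateSystem.length + toolsApproxChars) / CHARS_PER_TOKEN)
  if (gateSystem.trim().length === 0 || prefixTokens < MIN_PREFIX_TOKENS) return // no-op

  const cacheControl: CacheControl = { type: 'ephemeral' }

  // Breakpoint 1: the system prompt, only if it is still present in the payload.
  if (payloadSystem.trim().length > 0) {
    payload.system = [{ type: 'text', text: payloadSystem, cache_control: cacheControl }]
  }

  // Breakpoint 2: the last tool only, which covers every tool definition before it.
  const lastTool = toolList[toolList.length - 1]
  if (lastTool) lastTool.cache_control = cacheControl
}

// Usage: a large system prompt with two tools gets both breakpoints.
const demoTools: ToolDef[] = [
  { name: 'search', input_schema: { type: 'object' } },
  { name: 'fetch', input_schema: { type: 'object' } },
]
const demoPayload: PayloadSketch = { system: 'x'.repeat(5000), tools: demoTools }
applyPromptCacheSketch(demoPayload, demoTools, 'x'.repeat(5000))
```

Tagging only the last tool is enough because a breakpoint caches everything in the prefix up to and including the tagged block.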

Unit tests cover the gate, payload mutation edge cases, and end-to-end payload capture on the streaming/no-tools path.

Reviewed by Cursor Bugbot for commit b9a453d.

greptile-apps · 2026-06-16T21:16:53Z

Greptile Summary

This PR enables Anthropic prompt caching for the Anthropic and Azure-Anthropic providers by stamping cache_control: { type: 'ephemeral' } on the static request prefix (system prompt + last tool definition). A gating function avoids the cache-write surcharge on small, tool-less one-shot calls.

providers/prompt-cache.ts introduces shouldCacheStaticPrefix, which gates caching on a combined prefix estimate of ≥1,024 tokens and requires either tools or a system prompt that is large enough on its own.
providers/anthropic/utils.ts adds applyAnthropicPromptCache, called once after schema mutation in core.ts, that converts the system string to a cached block and tags the last tool; 12 unit tests cover all paths including the no-messages relocation edge case.

Confidence Score: 4/5

Safe to merge with awareness that cache token fees are still excluded from the top-level ProviderResponse cost totals.

The caching logic itself is correct and well-tested. However, now that caching is always on, every warm-cache call reports an inaccurate cost: cache_creation_input_tokens (billed at ~1.25× input rate) and cache_read_input_tokens (billed at ~0.1× input rate) are never added to the accumulated tokens or cost objects returned in ProviderResponse. The per-segment trace handles this correctly via buildAnthropicSegmentTokens, but the response-level totals that callers use for billing display will undercount on every cached request.

apps/sim/providers/anthropic/core.ts — all three token-accumulation sites (streaming-no-tools path ~line 422, non-streaming initial response ~line 860, and tool-loop iteration ~line 1141) need cache token accounting added to match what buildAnthropicSegmentTokens already does correctly.

Important Files Changed

Filename	Overview
apps/sim/providers/anthropic/core.ts	Single-line insertion of `applyAnthropicPromptCache` is correct in placement (after schema mutation, before thinking config), but the accumulated `tokens`/`cost` returned in `ProviderResponse` still exclude `cache_creation_input_tokens` and `cache_read_input_tokens`, causing systematic cost underreporting on every warm-cache call.
apps/sim/providers/anthropic/utils.ts	Adds `applyAnthropicPromptCache` — correctly handles system-string-to-block conversion, last-tool tagging, and the no-messages relocation edge case. Logic and tests are thorough.
apps/sim/providers/prompt-cache.ts	New gate function `shouldCacheStaticPrefix` — well-designed with correct token estimation, clear invariant (require non-empty system prompt to avoid one-shot write surcharges), and comprehensive unit tests covering all branches.
apps/sim/providers/anthropic/utils.test.ts	12 unit tests covering all `applyAnthropicPromptCache` paths: large/small system, with/without tools, schema-appended system, relocated/blanked system, and below-threshold no-op.
apps/sim/providers/prompt-cache.test.ts	Gate tests are complete and use `vi.stubEnv` correctly (addressing the prior env-coercion concern).
apps/sim/providers/anthropic/core.test.ts	Integration-level request-capture tests verify the system block is tagged for large prompts and left as a plain string for small ones.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[executeAnthropicProviderRequest] --> B[Build payload\nsystem = systemPrompt]
    B --> C{responseFormat?}
    C -->|prompt-based| D[Append schema to\npayload.system]
    C -->|native / none| E[No mutation]
    D --> F[applyAnthropicPromptCache\npayload, tools, request.systemPrompt]
    E --> F
    F --> G{shouldCacheStaticPrefix\ngateSystem, hasTools, toolsApproxChars}
    G -->|prefixTokens < 1024\nor no system| H[No-op: return]
    G -->|prefixTokens >= 1024\nhasTools or large system| I{payloadSystem\nnon-empty?}
    I -->|yes| J[payload.system = TextBlockParam\nwith cache_control: ephemeral]
    I -->|no - relocated| K[Skip system block]
    J --> L{tools present?}
    K --> L
    L -->|yes| M[tools lastIndex.cache_control = ephemeral]
    L -->|no| N[Done]
    M --> N
    N --> O[Add thinking config if requested]
    O --> P[API call]

Comments Outside Diff (1)

apps/sim/providers/anthropic/core.ts, lines 860-876

Accumulated tokens and cost still omit cache-token fees

Now that caching is always-on, every cached request produces non-zero cache_creation_input_tokens (billed at ~1.25× the regular input rate) and cache_read_input_tokens (billed at ~0.1×). Neither field is added to the accumulated tokens object, and calculateCost is called without the useCached flag, so the top-level cost returned in ProviderResponse silently undercounts. The per-segment trace path (buildAnthropicSegmentTokens → calculateCost(..., useCached)) handles this correctly, but the accumulated response totals do not.

The same gap exists in all three code paths: the initial non-streaming token block (lines 860–865), the streaming-only path (~line 422), and the tool-loop accumulation (~line 1141). Before this PR, cache tokens were always zero so it was harmless; it now produces systematic underreporting on every warm-cache call.
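
The size of the underreporting can be illustrated with a small cost calculation. This is a sketch under stated assumptions, not the project's calculateCost: the function estimateCostWithCache, the PricingSketch shape, and the example prices are hypothetical. The usage field names are the ones Anthropic returns, and the ~1.25× write and ~0.1× read multipliers are the approximate figures quoted in this comment. It also assumes input_tokens excludes cached tokens, so the three input counts are summed rather than subtracted.

```typescript
// Hypothetical sketch of cache-aware cost accounting. Pricing values are illustrative only.
interface UsageSketch {
  input_tokens: number
  output_tokens: number
  cache_creation_input_tokens?: number | null
  cache_read_input_tokens?: number | null
}

interface PricingSketch {
  inputPerMTok: number // price per million uncached input tokens (assumed unit)
  outputPerMTok: number // price per million output tokens (assumed unit)
}

const CACHE_WRITE_MULTIPLIER = 1.25 // approximate, as quoted in the review
const CACHE_READ_MULTIPLIER = 0.1 // approximate, as quoted in the review

function estimateCostWithCache(
  usage: UsageSketch,
  pricing: PricingSketch
): { input: number; output: number; total: number } {
  const cacheWrite = usage.cache_creation_input_tokens ?? 0
  const cacheRead = usage.cache_read_input_tokens ?? 0

  // Weight each class of input token by its billing multiplier.
  const weightedInputTokens =
    usage.input_tokens +
    cacheWrite * CACHE_WRITE_MULTIPLIER +
    cacheRead * CACHE_READ_MULTIPLIER

  const input = (weightedInputTokens * pricing.inputPerMTok) / 1e6
  const output = (usage.output_tokens * pricing.outputPerMTok) / 1e6
  return { input, output, total: input + output }
}

// Usage: a warm-cache call where most of the prefix was read from cache.
const examplePricing: PricingSketch = { inputPerMTok: 3, outputPerMTok: 15 }
const warmCall: UsageSketch = {
  input_tokens: 100,
  output_tokens: 50,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 2000,
}
const withCache = estimateCostWithCache(warmCall, examplePricing)
// Dropping the cache fields reproduces the undercount described above.
const ignoringCache = estimateCostWithCache(
  { input_tokens: warmCall.input_tokens, output_tokens: warmCall.output_tokens },
  examplePricing
)
```

With these example numbers the cache-aware input cost is higher than the one computed from input_tokens alone, which is the gap the accumulated totals would miss.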

Reviews (7): Last reviewed commit: "test(providers): add request-capture tes..."

…tubEnv

- anthropic/core.ts: gate on request.systemPrompt instead of payload.system, so the no-messages path (where the system text is relocated into a user message and payload.system is blanked) still caches the tools prefix. (Cursor review)
- prompt-cache.test.ts: manage the kill-switch env via vi.stubEnv/unstubAllEnvs instead of assigning undefined (which coerces to "undefined" and leaks across workers). Addresses the Greptile finding while satisfying biome's noDelete rule.

waleedlatif1 · 2026-06-16T21:55:58Z

@greptile review

waleedlatif1 · 2026-06-16T21:55:59Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 3a44936.

…elper

- Remove the PROMPT_CACHE_DISABLED kill switch; prompt caching is always on.
- Extract the Anthropic tagging into applyAnthropicPromptCache(payload, tools, systemPrompt) in anthropic/utils.ts: one place that gates and mutates the system block + last tool, replacing the two inline blocks in core.ts.
- Add direct unit tests for the helper (system→cached block, last-tool tagged, relocated/blanked-system still tags tools, below-threshold and tool-less cases untouched) so the actual payload mutation is verified, not just the gate.

No behavior change to outputs; verified on vitest 4.1.8 (CI's version).

waleedlatif1 · 2026-06-16T22:14:04Z

@greptile review

waleedlatif1 · 2026-06-16T22:14:05Z

@cursor review

…m and request prompt

Gate on max(final payload.system, request.systemPrompt) so caching fires both when the no-messages path blanks payload.system (size via the request prompt) and when prompt-based structured output appends a large schema to payload.system (size via the final system string). Add a test for the schema-appended case. Caught by Cursor Bugbot.

waleedlatif1 · 2026-06-16T22:22:36Z

@greptile review

waleedlatif1 · 2026-06-16T22:22:37Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 38140c7.

Drop the inline // comments in favor of TSDoc on the helper/gate. The gate-sizing and call-ordering rationale now lives in applyAnthropicPromptCache's TSDoc; no behavior change.

waleedlatif1 · 2026-06-16T22:36:08Z

@greptile review

waleedlatif1 · 2026-06-16T22:36:09Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 5e90631.

Drives the real executeAnthropicProviderRequest down the streaming path with only the client injected via the createClient seam (real models/utils/attachments), and asserts the request payload handed to messages.create carries a cache_control-tagged system block for a large prompt and a plain string for a small one. Closes the end-to-end wiring gap (AI-SDK-style request-body capture).

waleedlatif1 · 2026-06-16T22:59:51Z

@greptile review

waleedlatif1 · 2026-06-16T22:59:52Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit b9a453d.