
Spent some time digging into why two developers on the same Claude Max subscription can have completely different experiences — one hits rate limits by noon, the other runs a dozen sessions without issue.
Anthropic published an engineering post that explains it. The answer is prompt caching, and specifically cache hit rate.
Every request Claude Code sends is structured as a prefix chain: system prompt and tool definitions first, then project docs (CLAUDE.md), then session context, then messages. The API caches from the front. If the prefix matches on the next request, those tokens are billed at about 1/10 of the normal input price. A single changed byte invalidates the cache from that point onward, so the earlier in the chain a change lands, the more you pay to rebuild.
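To make the prefix-chain idea concrete, here is a rough Python sketch. It is a toy model, not the real API: the Segment type, the token counts, and the segment-level matching are all my own assumptions for illustration (the real cache works on token prefixes, not named segments). The only number taken from above is the 1/10 price for cached tokens.

```python
from dataclasses import dataclass

# Cached tokens are billed at about 1/10 of the normal input price (from the post).
CACHE_READ_MULTIPLIER = 0.1


@dataclass(frozen=True)
class Segment:
    """One link in the prefix chain. Names and token counts are hypothetical."""
    name: str
    text: str
    tokens: int


def matching_prefix(prev: list[Segment], nxt: list[Segment]) -> int:
    """Count the leading segments that are identical in both requests."""
    count = 0
    for a, b in zip(prev, nxt):
        if a != b:
            break  # the first mismatch ends the reusable prefix
        count += 1
    return count


def billed_input_tokens(prev: list[Segment], nxt: list[Segment]) -> float:
    """Input cost of `nxt`, expressed in full-price token equivalents."""
    hit = matching_prefix(prev, nxt)
    cached = sum(s.tokens for s in nxt[:hit])
    uncached = sum(s.tokens for s in nxt[hit:])
    return cached * CACHE_READ_MULTIPLIER + uncached


system = Segment("system+tools", "system prompt v1", 12_000)
docs = Segment("CLAUDE.md", "project docs v1", 3_000)
turn1 = Segment("message", "first question", 500)
turn2 = Segment("message", "follow-up", 500)

first_request = [system, docs, turn1]
appended = [system, docs, turn1, turn2]  # only adds to the end

edited_docs = Segment("CLAUDE.md", "project docs v2", 3_000)
after_edit = [system, edited_docs, turn1, turn2]  # changes an early link

print(billed_input_tokens(first_request, appended))    # roughly 2050
print(billed_input_tokens(first_request, after_edit))  # roughly 5200
```

Both follow-up requests carry the same 16,000 input tokens. In this toy model, the one that only appends to the chain is billed for about 2,050 of them, while the one that edits CLAUDE.md is billed for about 5,200, because everything after the edit is paid for at full price again.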
The habits that destroy the cache without you realizing it:
Model switching mid-conversation. Cache is bound to the model. Using /model to swap to Haiku for a quick task and back wipes everything you've accumulated. The rebuild cost usually exceeds what you saved.
Opening new sessions constantly. Every fresh claude invocation starts from zero. If you're doing short sessions and restarting frequently, you're always paying full price for the first turns.
Adding MCP tools mid-session. Tool definitions are part of the cached prefix. Any change breaks the chain.
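A way to picture why these habits hurt: treat the cache key as a hash over the model, the tool definitions, and the message prefix. The sketch below is my own toy model in Python, not how the API is implemented. ToyPrefixCache, prefix_key, the model names, and the tool names are all hypothetical.

```python
import hashlib
import json


def prefix_key(model: str, tools: list[str], prefix: list[str]) -> str:
    """Hash everything that must match for a cache hit in this toy model."""
    payload = json.dumps({"model": model, "tools": tools, "prefix": prefix})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyPrefixCache:
    """Remembers every (model, tools, message-prefix) combination it has seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def request(self, model: str, tools: list[str], messages: list[str]) -> int:
        """Return how many leading messages were already cached, then store this request."""
        hit = 0
        # Look for the longest prefix that was stored by an earlier request.
        for n in range(len(messages), 0, -1):
            if prefix_key(model, tools, messages[:n]) in self._seen:
                hit = n
                break
        # Store every prefix of this request for future lookups.
        for n in range(1, len(messages) + 1):
            self._seen.add(prefix_key(model, tools, messages[:n]))
        return hit


cache = ToyPrefixCache()
tools = ["Read", "Edit"]

print(cache.request("model-a", tools, ["m1"]))              # 0: fresh session, nothing cached
print(cache.request("model-a", tools, ["m1", "m2"]))        # 1: "m1" is reused
print(cache.request("model-b", tools, ["m1", "m2", "m3"]))  # 0: model switch, no shared key
print(cache.request("model-a", tools + ["mcp_search"], ["m1", "m2", "m3"]))  # 0: tool list changed
```

Because the model and the tool list are part of every key, changing either one means none of the stored prefixes can match, however much history the session had built up.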
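The fix that surprised me most: claude --resume. It reopens an earlier session (claude --continue jumps straight to the most recent one) and picks up the cache chain where it left off, provided the cached prefix has not expired yet. I'd been ignoring this flag entirely.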
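Also counterintuitive: long conversations get cheaper per token over time, not more expensive. As the session grows, a larger share of each request is cached prefix billed at the reduced rate. Claude Code's compaction is designed to reuse the same prefix, so the cache chain survives even when history gets compressed.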
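Some back-of-the-envelope arithmetic for that claim, as a Python sketch. The function name, the token counts, and the 1.25x surcharge for writing new tokens into the cache are my assumptions, not figures from the source post, and compaction is ignored entirely.

```python
def effective_price_ratio(
    turn: int,
    prefix_tokens: int,
    tokens_per_turn: int,
    read_mult: float = 0.1,    # cached tokens at about 1/10 price (from the post)
    write_mult: float = 1.25,  # assumed surcharge for writing new tokens to the cache
) -> float:
    """Billed input cost of one turn divided by what it would cost with no caching."""
    if turn < 1:
        raise ValueError("turn numbers start at 1")
    if turn == 1:
        cached = 0
        new = prefix_tokens + tokens_per_turn  # everything is new on the first turn
    else:
        cached = prefix_tokens + (turn - 1) * tokens_per_turn  # all earlier content
        new = tokens_per_turn  # only the latest turn is uncached
    billed = cached * read_mult + new * write_mult
    return billed / (cached + new)


# Hypothetical session: 15,000-token fixed prefix, 1,000 new tokens per turn.
for turn in (1, 2, 5, 20):
    print(turn, round(effective_price_ratio(turn, 15_000, 1_000), 3))
```

Under these assumptions the first turn costs slightly more than an uncached request, and every later turn is billed at a shrinking fraction of list price that approaches the cached rate. The absolute bill per turn still grows with context length; what drops is the price per input token.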
Full source: Anthropic Engineering — Lessons from building Claude Code
If you're using Claude Code through a gateway, EvoLink handles model routing at the infrastructure level so switching models doesn't break your cache chain: docs.evolink.ai