Cache lifecycle
- The system prompt, tool definitions, and conversation history are assembled into a single prompt
- The API checks if any prefix of this prompt matches a cached entry
- Matched prefix tokens are read from cache at 10% of standard input pricing
- Unmatched tokens (the suffix) are processed at 125% of standard pricing (cache write)
- The new, longer prefix is cached for future turns
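The pricing multipliers above imply a simple per-turn cost model. A minimal sketch, assuming the 10% read and 125% write multipliers and a hypothetical base price (function name and prices are illustrative, not an API):

```python
def turn_cost(cached_tokens: int, new_tokens: int, base_price_per_mtok: float) -> float:
    """Input cost (dollars) for one turn: cached prefix tokens are read at
    10% of the base input price; the new suffix is written at 125%."""
    read_cost = cached_tokens * 0.10 * base_price_per_mtok / 1_000_000
    write_cost = new_tokens * 1.25 * base_price_per_mtok / 1_000_000
    return read_cost + write_cost

# Example: 50k cached prefix tokens, 2k new suffix tokens, $3/MTok base price.
cost = turn_cost(50_000, 2_000, 3.0)          # 0.015 + 0.0075 = 0.0225
uncached = (50_000 + 2_000) * 3.0 / 1_000_000  # 0.156 with no cache at all
```

The gap between `cost` and `uncached` is why keeping the prefix cacheable matters: on a long session almost all input tokens land in the 10% bucket.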
Cache TTL: Cached prefixes auto-refresh on each hit but expire after ~5 minutes of inactivity. Long pauses between turns (user thinking, waiting for approval) can cause cache expiry.
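The refresh-on-hit behavior can be modeled as a sliding expiry window. An illustrative sketch only; the real TTL bookkeeping happens server-side, and the 5-minute constant is the approximate value stated above:

```python
import time

CACHE_TTL_SECONDS = 300  # ~5 minutes, per the text above (assumed exact here)

class CacheEntry:
    """Tracks whether a cached prefix is still warm."""
    def __init__(self):
        self.last_hit = time.monotonic()

    def hit(self):
        # Each cache hit refreshes the TTL, restarting the expiry window.
        self.last_hit = time.monotonic()

    def is_expired(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_hit > CACHE_TTL_SECONDS
```

This is why long pauses are the failure mode: each turn refreshes the window, but a single gap longer than the TTL (user away, approval pending) expires the entry regardless of how many hits preceded it.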
What stays stable across turns: System prompt, tool definitions, SKILL.md content, agent instructions, CLAUDE.md content. These form the cacheable prefix.
What changes each turn: New user messages, assistant responses, tool call results. These extend the suffix.
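The stable/changing split above can be sketched as a prompt assembled from a fixed prefix and a growing suffix. A simplified list-of-blocks model (real APIs pass tools as a separate parameter, but the caching logic is the same):

```python
def assemble_prompt(system_prompt, tool_defs, skill_md, turns):
    prefix = [system_prompt, *tool_defs, skill_md]  # identical every turn
    suffix = list(turns)                            # grows each turn
    return prefix + suffix

base = assemble_prompt("sys", ["tool_a", "tool_b"], "skill", ["u1", "a1"])
longer = assemble_prompt("sys", ["tool_a", "tool_b"], "skill", ["u1", "a1", "u2"])

# The longer prompt contains the entire earlier prompt as a prefix, so every
# token of `base` can be served from cache on the next turn.
assert longer[:len(base)] == base
```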
Cache-safe compaction reuses the exact same prefix (system prompt + tools + skill instructions) and appends the compaction summary as a new user message. The prefix cache is preserved because the prefix bytes are identical.
Cache-breaking compaction would rebuild the prompt differently — reordering tools, changing system prompt content, or injecting the summary into the system prompt. This invalidates the entire cache.
Cache-safe (prefix preserved):
[system prompt] ← identical to pre-compaction (CACHED)
[tool definitions] ← identical to pre-compaction (CACHED)
[skill instructions] ← identical to pre-compaction (CACHED)
[user]: "Session summary: Previously we implemented auth module,
fixed 3 bugs, and updated tests. Continuing with..."
[assistant]: "I'll continue from where we left off..."
Cache-breaking (prefix changed):
[modified system prompt] ← different content (CACHE MISS)
[tool definitions] ← even if identical, miss propagates
[skill instructions] ← miss propagates to all downstream tokens
[user]: "Continue working..."
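The cache-safe layout above can be sketched as a compaction function that never touches the prefix. A minimal sketch, assuming messages are dicts and using `"tools"`/`"skill"` as stand-in roles for blocks that real APIs represent differently:

```python
PREFIX_ROLES = ("system", "tools", "skill")

def compact(messages, summary):
    """Keep the prefix byte-identical; replace the old conversation
    with a single user message carrying the summary."""
    prefix = [m for m in messages if m["role"] in PREFIX_ROLES]
    return prefix + [{"role": "user", "content": f"Session summary: {summary}"}]

history = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "tools", "content": "[tool definitions]"},
    {"role": "skill", "content": "[SKILL.md instructions]"},
    {"role": "user", "content": "Implement auth."},
    {"role": "assistant", "content": "Done."},
]
compacted = compact(history, "implemented auth module, updated tests")

# The first three messages are unchanged, so the cached prefix still matches.
assert compacted[:3] == history[:3]
```

The cache-breaking variant would be any version of `compact` that edits the system message or reorders the prefix; the summary must land in the suffix.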
Implication for orchestrators: Orchestrators that run long sessions (implementation-orchestrator, finalization-orchestrator) benefit most from cache-safe compaction. Keep system prompts and tool definitions identical across the session. Do not inject turn-count, elapsed-time, or progress-percentage into the system prompt.
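The turn-count warning above can be made concrete. A hypothetical pair of system-prompt builders, showing why per-turn state in the prefix defeats caching:

```python
def bad_system_prompt(turn: int) -> str:
    # Anti-pattern: per-turn state baked into the prefix.
    return f"You are an orchestrator. Turn {turn} of the session."

def good_system_prompt(turn: int) -> str:
    # Turn count, elapsed time, and progress belong in the suffix instead.
    return "You are an orchestrator."

assert bad_system_prompt(1) != bad_system_prompt(2)    # cache miss every turn
assert good_system_prompt(1) == good_system_prompt(2)  # prefix stays cacheable
```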
The `context: fork` directive in SKILL.md spawns an isolated sub-session with its own context window. Key cache behaviors:
- Isolated cache: The forked session builds its own cache from scratch. It does not inherit the parent's cached prefix.
- Parent cache preserved: Forking does not modify the parent's prompt, so the parent's cache remains intact.
- Short-lived: Fork sessions typically complete in a few turns, so their cache investment is small.
- No cross-session sharing: Cache entries are per-session. Two forks of the same skill each pay their own cache write costs.
Practical guidance: Fork sessions are already cache-efficient by design — they start fresh, run briefly, and return results. The main concern is the parent session: receiving fork results adds new content to the conversation suffix, not the prefix, so the parent's cache is unaffected.
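The isolation properties above can be sketched with sessions as plain message lists. `fork_session` and `receive_result` are illustrative names, not a real API:

```python
def fork_session(skill_md, task):
    """Spawn an isolated sub-session: fresh context, no inherited cache."""
    return [
        {"role": "system", "content": skill_md},  # child builds its own prefix
        {"role": "user", "content": task},
    ]

def receive_result(parent_messages, result):
    """Fork results append to the parent's suffix; the prefix is untouched."""
    return parent_messages + [{"role": "user", "content": f"Fork result: {result}"}]

parent = [{"role": "system", "content": "orchestrator system prompt"}]
child = fork_session("skill instructions", "run the linter")
updated = receive_result(parent, "lint passed")

assert child[0] != parent[0]            # no shared prefix between sessions
assert updated[:len(parent)] == parent  # parent prefix (and cache) preserved
```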
Model switching: prompt caches are per-model, so switching models mid-conversation starts from a cold cache.
- Switching from Opus to Haiku mid-conversation cannot reuse the Opus cache; the Haiku turns start cold and pay fresh cache-write costs.
- Switching back to Opus may or may not hit the earlier Opus cache, depending on whether its TTL has expired and whether the prompt prefix changed in the meantime.
Recommendation: Use subagents (Task tool) instead of switching models mid-conversation. Subagents run in isolated sessions with their own caches, leaving the parent session's cache intact. This aligns with the model tier optimization approach (SPEC-017) where Haiku handles lightweight subtasks without disrupting the Opus parent cache.
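The subagent pattern above can be sketched as follows. `run_subagent` is a hypothetical stand-in for a Task-tool dispatch; the point is that only the result text re-enters the parent, and only in the suffix:

```python
def run_subagent(model: str, task: str) -> dict:
    # Stand-in for dispatching a subtask to an isolated session on another
    # model; only the result is returned to the caller.
    return {"model": model, "result": f"[{model} output for: {task}]"}

parent_messages = [{"role": "system", "content": "opus orchestrator"}]

# Delegate to Haiku without touching the parent's prompt or model.
outcome = run_subagent("haiku", "summarize the diff")
parent_after = parent_messages + [
    {"role": "user", "content": outcome["result"]},  # lands in the suffix
]

assert parent_after[0] == parent_messages[0]  # Opus prefix cache intact
```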