Claude Code Prompt Cache TTL Changed from 1h to 5m — Quota Burn Rate Spike and Cost Impact
Around March 6, 2026, Claude Code users began seeing a sudden spike in quota consumption and extra-usage billing. Analysis of ~120K API calls from user session JSONL logs indicated that Anthropic had changed the default prompt-cache TTL behavior: instead of a 1-hour TTL by default, the TTL is now chosen per request, and more requests land on the 5-minute tier. This caused a 17-26% cost increase for long coding sessions, because whenever a session paused for more than 5 minutes, cache_create operations (charged at the write rate, $3.75-$6.25/MTok) replaced cheaper cache_read hits ($0.30-$0.50/MTok). A client-side bug in versions before v2.1.90 exacerbated the issue: sessions that exhausted their subscription quota stayed on the 5m TTL until the process was restarted. Keywords: Claude Code cache TTL, prompt caching ephemeral_5m, quota burn rate, cache_create vs cache_read, Anthropic API pricing, v2.1.90 fix, JSONL session analysis, Max plan quota exhaustion.
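To make the price gap concrete, here is a minimal Python sketch of what one resumed turn costs when the cached context is read versus re-created. The per-MTok prices are the lower-bound figures quoted above; the 150K-token context size and the function name are hypothetical, chosen only for illustration.

```python
def cache_cost_usd(context_tokens, usd_per_mtok):
    """Cost in USD of processing `context_tokens` at a given per-million-token price."""
    return context_tokens / 1_000_000 * usd_per_mtok

# Hypothetical context size for a long coding session (an assumption, not measured data).
CONTEXT_TOKENS = 150_000

# Prices quoted in this article: cache write $3.75/MTok, cache read $0.30/MTok.
write_cost = cache_cost_usd(CONTEXT_TOKENS, 3.75)
read_cost = cache_cost_usd(CONTEXT_TOKENS, 0.30)

print(f"resume after a >5 min pause (cache write): ${write_cost:.4f}")
print(f"resume within the TTL (cache read):        ${read_cost:.4f}")
print(f"write/read ratio: {write_cost / read_cost:.1f}x")
```

Under these assumptions every expired pause costs roughly 12.5 times what a cache hit would have cost for the same context.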
Symptoms
- Quota limit reached much faster than before — users hitting 5-hour limits for the first time in March 2026 despite similar usage patterns
- Extra usage credits burning rapidly while Max/Pro plan quota shows high remaining capacity (e.g. 86%+ weekly capacity unused but $200 in extra credits consumed)
- Cache-creation token counts spike dramatically in session logs (ephemeral_5m_input_tokens replacing ephemeral_1h_input_tokens)
- Long coding sessions become disproportionately expensive — cost grows super-linearly with session length due to repeated cache re-creation after 5-minute pauses
- Sessions that exhausted subscription quota become stuck on 5m TTL until process restart (fixed in v2.1.90)
Error signatures
ephemeral_5m_input_tokens > 0 and ephemeral_1h_input_tokens == 0 in Claude Code JSONL session logs
API Error 400: 'You're out of extra usage' despite plan dashboard showing available quota
cache_creation tokens in usage object at 5m tier repeatedly for same context blocks
Possible causes
- Server-side per-request TTL optimization activated March 6, 2026: Claude Code client selects 5m vs 1h cache TTL per API request based on expected cache-reuse patterns. Anthropic tuned this heuristic so that one-shot or rarely-revisited requests use cheaper 5m writes (~1.25× base), while frequently re-accessed context uses 1h writes (~2× base with amortized reads). The net effect was more requests landing on 5m TTL, which penalizes long coding sessions with pauses.
- Client-side bug (pre-v2.1.90): sessions that exhausted subscription quota at startup and switched to overage billing became permanently stuck on 5m TTL regardless of request pattern
- Misunderstanding of pricing model: 1h cache writes cost ~2× base input price while 5m writes cost ~1.25× — so '1h everywhere' is NOT cheaper for one-shot or rarely-revisited requests
- Long coding sessions inherently trigger 5m re-creation penalty: any context block not re-accessed within 5 minutes triggers a full-price cache write instead of cheap cache read
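The trade-off between the two write tiers can be checked with a little arithmetic. The sketch below uses the multipliers quoted above (5m write about 1.25× base, 1h write about 2× base) and assumes a cache read costs about 0.1× base, which is what the $0.30 read and $3.75 write figures in this article imply. It also assumes each cache hit refreshes the TTL, so only the idle gap before each follow-up turn matters. The function names are illustrative.

```python
# Relative prices in units of the base input price.
WRITE_5M = 1.25   # 5-minute cache write (from the article)
WRITE_1H = 2.0    # 1-hour cache write (from the article)
READ = 0.10       # cache read (assumption derived from $0.30 / $3.00)

def relative_cost(gaps_min, ttl_min, write_price):
    """Cost of sending the same context once, then once more after each idle gap.

    gaps_min: idle minutes before each follow-up turn.
    A gap shorter than the TTL is a cache read; otherwise the context is re-written.
    """
    cost = write_price  # the first turn always writes the cache
    for gap in gaps_min:
        cost += READ if gap < ttl_min else write_price
    return cost

def cost_5m(gaps_min):
    return relative_cost(gaps_min, ttl_min=5, write_price=WRITE_5M)

def cost_1h(gaps_min):
    return relative_cost(gaps_min, ttl_min=60, write_price=WRITE_1H)

# One-shot request: the 5m tier is cheaper.
print("one-shot:", cost_5m([]), "vs", cost_1h([]))
# One follow-up after a 10-minute pause: the 1h tier is cheaper.
print("10 min pause:", cost_5m([10]), "vs", round(cost_1h([10]), 2))
```

With these numbers, a single follow-up after a pause between 5 and 60 minutes is already enough to make the 1h tier the cheaper choice, while a request that is never revisited is cheaper on the 5m tier.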
Solutions
Diagnose TTL Behavior from Session JSONL Logs Before Taking Action
Claude Code stores per-request API usage data in ~/.claude/projects/**/*.jsonl files. Analyzing these logs reveals whether your sessions are on 5m or 1h TTL, the cache hit rate, and whether the v2.1.90 fix is working. This diagnosis step confirms the issue before applying solutions.
- Locate session logs: `ls ~/.claude/projects/*/`. Each project directory contains JSONL files with per-message API usage data
- Extract cache_creation breakdown: filter for assistant messages and inspect ephemeral_5m vs ephemeral_1h token counts
- Compute cache hit rate: compare cache_read_input_tokens total vs cache_create total across a day's sessions
- Check for exclusively-5m pattern: if ephemeral_1h_input_tokens is always 0 while ephemeral_5m is non-zero, you may be hitting the v2.1.89 bug
- Compare pre-March 6 and post-March 6 data if you have historical logs: the shift from 1h-dominant to 5m-dominant should be visible
Commands
# Find session log directories:
ls -d ~/.claude/projects/*/
# Check TTL tier distribution:
jq -r 'select(.type=="assistant" and .message.usage.cache_creation) | [.message.usage.cache_creation.ephemeral_5m_input_tokens // 0, .message.usage.cache_creation.ephemeral_1h_input_tokens // 0] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | awk '{s5+=$1; s1+=$2} END {if (s5+s1 > 0) printf "5m total: %.0f\n1h total: %.0f\n5m ratio: %.1f%%\n", s5, s1, s5/(s5+s1)*100; else print "No cache_creation data found"}'
# Check for v2.1.89 bug pattern (stuck on 5m after quota exhaustion):
grep -l 'ephemeral_5m' ~/.claude/projects/**/*.jsonl 2>/dev/null | while IFS= read -r f; do jq -r 'select(.type=="assistant") | .message.usage.cache_creation.ephemeral_1h_input_tokens // 0' "$f"; done | awk '{s+=$1} END {if (s==0) print "WARNING: No 1h cache usage detected; may be stuck on 5m TTL"; else print "OK: 1h cache usage present"}'
Risks
- JSONL files may contain sensitive code context — sanitize before sharing
- Large JSONL files may be slow to process with jq; use head/tail for sampling
Verification
- Run the TTL distribution command → expected: both 5m and 1h columns show non-zero values (mixed TTL is normal operation)
- If 5m ratio > 90% consistently: you may benefit from the solutions below
- Run the v2.1.89 bug check → if 'WARNING' appears, upgrade immediately
- After applying fixes, re-run diagnosis → 1h ratio should increase measurably
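For logs that are awkward to handle with jq, the same TTL-tier diagnosis can be sketched in Python. This is a minimal sketch that assumes the record layout implied by the jq commands in this section (a top-level `type` field and `message.usage.cache_creation.ephemeral_5m_input_tokens` / `ephemeral_1h_input_tokens`); the function name is illustrative and malformed lines are skipped.

```python
import glob
import json
import os

def ttl_distribution(lines):
    """Sum 5m and 1h cache-creation tokens over an iterable of JSONL lines."""
    totals = {"5m": 0, "1h": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(rec, dict) or rec.get("type") != "assistant":
            continue
        message = rec.get("message")
        if not isinstance(message, dict):
            continue
        usage = message.get("usage") or {}
        creation = usage.get("cache_creation") or {}
        totals["5m"] += creation.get("ephemeral_5m_input_tokens") or 0
        totals["1h"] += creation.get("ephemeral_1h_input_tokens") or 0
    written = totals["5m"] + totals["1h"]
    # None means no cache writes were found at all.
    totals["5m_ratio"] = totals["5m"] / written if written else None
    return totals

if __name__ == "__main__":
    pattern = os.path.expanduser("~/.claude/projects/**/*.jsonl")

    def all_lines():
        for path in glob.glob(pattern, recursive=True):
            with open(path, encoding="utf-8", errors="replace") as fh:
                yield from fh

    print(ttl_distribution(all_lines()))
```

A `5m_ratio` close to 1.0 with a `1h` total of 0 corresponds to the exclusively-5m pattern described above.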
Optimize CLAUDE.md to Front-Load Critical Context
CLAUDE.md is loaded as part of the cached prompt. Putting the most-referenced content first and keeping the file small helps critical context stay available via cache_read hits and reduces the number of tokens that must be re-created when the cache expires.
- Audit CLAUDE.md: identify which sections are used in every request vs. occasionally
- Move high-frequency content (build commands, code style, project structure) to the top
- Move rarely-used content (detailed API docs, historical notes) to the bottom or separate files
- Keep CLAUDE.md concise — every token in it is part of the cache write on each session start
- Use @file references for large documentation blocks instead of inlining them
Commands
wc -c CLAUDE.md
# Sections in CLAUDE.md:
grep -c '^#' CLAUDE.md 2>/dev/null || echo '0'
# Check if JSONL logs exist before querying:
ls ~/.claude/projects/**/*.jsonl 2>/dev/null | head -3
# Analyze cache read vs write ratio:
jq -r 'select(.type=="assistant" and .message.usage) | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | head -50
Config examples
# Good CLAUDE.md structure (high-value first):
# 1. Build/lint/test commands
# 2. Project architecture overview (5-10 lines)
# 3. Coding conventions
# 4. File structure guide
# 5. @docs/detailed-api.md ← reference, don't inline
Risks
- Overly aggressive CLAUDE.md trimming may reduce Claude Code's project understanding
- May need to manually provide context that was previously auto-available
Verification
- Measure cache hit rate before optimization: save CLAUDE.md backup, record baseline hit rate from JSONL
- After restructuring CLAUDE.md: `jq -r 'select(.type=="assistant" and .message.usage) | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null` → compute read/write ratio
- Confirm Claude Code can still answer: `claude -p "what are the build commands for this project?"` → should return correct answer without asking clarifying questions
- Check quota burn rate on Anthropic console → should show measurable decrease over 2-3 days
- Token efficiency: as a rule of thumb, keep CLAUDE.md under about 5000 bytes (check with `wc -c CLAUDE.md`, which counts bytes rather than characters)
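When auditing CLAUDE.md, it helps to see how many bytes each section contributes before deciding what to move into separate files. The following is a minimal sketch, assuming sections are delimited by lines starting with `#` (the same heuristic as the `grep -c '^#'` command above, so a `#` comment inside a code fence would also be counted as a heading). The function name is illustrative.

```python
def section_sizes(text):
    """Return [(heading, size_in_bytes), ...] for a Markdown document, in order."""
    sections = []
    heading, size = "(preamble)", 0
    for line in text.splitlines(keepends=True):
        if line.startswith("#"):
            if size:
                sections.append((heading, size))
            heading, size = line.strip(), 0
        size += len(line.encode("utf-8"))  # count bytes, like `wc -c`
    if size:
        sections.append((heading, size))
    return sections

if __name__ == "__main__":
    sample = "intro\n# Build\nmake\n# Docs\nlong text\n"
    # Largest sections first: these are the candidates for @file references.
    for name, size in sorted(section_sizes(sample), key=lambda s: -s[1]):
        print(f"{size:6d}  {name}")
```

To run it on a real file, read the file with `open("CLAUDE.md", encoding="utf-8").read()` and pass the result to `section_sizes`.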
Keep Sessions Short and Task-Focused to Reduce Cache Re-creation Penalty
The 5m cache TTL means that any pause longer than 5 minutes forces the session context to be re-cached at write pricing (about 12.5× the cache_read price). Structuring work into shorter, single-task sessions avoids the idle-period penalty and reduces overall cache_create volume.
- Break complex work into discrete tasks — one Claude Code session per task
- Use /compact before pausing a session to reduce cached context size on next resume
- Start a fresh session for each new feature/bug rather than continuing an existing one
- Avoid leaving Claude Code sessions idle for more than 5 minutes mid-task
- If you must pause, save context manually (e.g., write current state to a file) and start fresh on return
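To judge how much a given working pattern is exposed to the idle penalty, one can count the pauses in a session that outlast the TTL. The sketch below takes a list of turn times and returns the gaps longer than 5 minutes. How those times are obtained is left open; the function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta

def expired_gaps(turn_times, ttl=timedelta(minutes=5)):
    """Return the idle gaps between consecutive turns that exceed the cache TTL."""
    ordered = sorted(turn_times)
    return [
        later - earlier
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > ttl
    ]

if __name__ == "__main__":
    start = datetime(2026, 3, 10, 9, 0)
    # Hypothetical session: turns at 0, 2, 10, 12 and 30 minutes after the start.
    turns = [start + timedelta(minutes=m) for m in (0, 2, 10, 12, 30)]
    for gap in expired_gaps(turns):
        print("cache likely expired after an idle gap of", gap)
```

Each reported gap corresponds to one turn that would pay cache-write pricing instead of cache-read pricing on the 5m tier.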
Commands
claude -p 'implement X feature' --model claude-sonnet-4-6
# Quick cache hit rate check (single session):
jq -r 'select(.type=="assistant") | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | head -50
# Batch compute cache hit rate across all sessions:
jq -r 'select(.type=="assistant" and .message.usage) | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | awk '{cr+=$1; cw+=$2; n++} END {if(n>0) printf "cache hit rate: %.1f%% (n=%d)\n", cr/(cr+cw+1)*100, n; else print "No data: check JSONL path"}'
Config examples
# Instead of one long session:
# claude (3-hour session with pauses) → high cache_create cost
# Do this:
# claude -p 'task 1' (5 min) → exit
# claude -p 'task 2' (10 min) → exit
# claude -p 'task 3' (8 min) → exit
Risks
- Shorter sessions mean more context re-establishment overhead — each new session starts with a cold cache
- May reduce Claude Code's ability to understand project-wide context from session history
- If jq returns 'No data': verify JSONL files exist at ~/.claude/projects/**/*.jsonl; try ls ~/.claude/projects/ first
Verification
- Before change, record baseline (where `/tmp/marker` is a file you created with `touch` at the start of the measurement window): `find ~/.claude/projects -name '*.jsonl' -newer /tmp/marker -exec jq -r 'select(.type=="assistant") | .message.usage.cache_creation.ephemeral_5m_input_tokens // 0' {} \; | awk '{s+=$1} END {print s+0}' > /tmp/cache_before.txt`
- After adopting short-session strategy for 1-2 days: re-run same command, compare counts
- Expected: 5m cache_create token counts should drop 40-60% for same workload volume
- Monitor plan dashboard at console.anthropic.com → check if quota consumption rate decreases
- Run `jq -c 'select(.type=="assistant") | .message.usage | {cache_create: ((.cache_creation.ephemeral_5m_input_tokens // 0) + (.cache_creation.ephemeral_1h_input_tokens // 0)), cache_read: (.cache_read_input_tokens // 0)}' ~/.claude/projects/**/*.jsonl 2>/dev/null | jq -s '(map(.cache_read) | add) as $r | (map(.cache_create) | add) as $w | if ($r + $w) > 0 then $r / ($r + $w) else "no data" end'` → cache hit rate should improve
Upgrade to Claude Code v2.1.90+ to Fix Client-Side TTL Bug
A bug in Claude Code versions before v2.1.90 caused sessions that exhausted subscription quota to become permanently stuck on 5m TTL. Upgrading to v2.1.90 or later fixes this, ensuring proper per-request TTL selection even after quota exhaustion.
- Check current Claude Code version: claude --version
- Upgrade via npm: npm update -g @anthropic-ai/claude-code
- Verify version >= 2.1.90: claude --version
- Restart all active Claude Code sessions to pick up new version
- Monitor cache behavior: check ~/.claude/projects/**/*.jsonl for ephemeral_1h_input_tokens values
Commands
claude --version
npm install -g @anthropic-ai/claude-code@latest
npm view @anthropic-ai/claude-code version
claude --version # verify upgrade took effect
Risks
- npm update may pull a newer version with different behavior
- Some features may differ between minor versions
Verification
- Run `claude --version` → expected output: "2.1.90" or higher (e.g., "2.1.170")
- Run `npm view @anthropic-ai/claude-code version` → expected: latest version number (>= 2.1.90)
- Start a test session: `cd /tmp && mkdir test-ttl && cd test-ttl && git init && echo test > f.txt && git add . && git commit -m "test" && claude -p "say hello" --model claude-sonnet-4-6` → should complete without "out of extra usage" error
- After session: `jq -c 'select(.type=="assistant") | (.message.usage.cache_creation // {}) | {e5m: .ephemeral_5m_input_tokens, e1h: .ephemeral_1h_input_tokens}' ~/.claude/projects/**/*.jsonl 2>/dev/null | head -20` → should show non-zero e1h values on at least some turns (not e1h = 0 or null on every turn)
- For sessions that previously exhausted quota: restart session, verify ephemeral_1h_input_tokens values are non-zero on subsequent turns
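The version comparison is easy to get wrong by comparing strings ("2.1.9" sorts after "2.1.90" lexically in some tools), so a numeric comparison is safer. This is a minimal sketch that assumes the output of `claude --version` contains a dotted `x.y.z` version somewhere in the text, as in the expected output listed above. The function names are illustrative.

```python
import re

MINIMUM = (2, 1, 90)  # first version with the TTL fix, per this article

def parse_version(text):
    """Extract the first x.y.z version from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())

def has_ttl_fix(version_output, minimum=MINIMUM):
    """True if the reported version is at least `minimum` (numeric comparison)."""
    return parse_version(version_output) >= minimum

if __name__ == "__main__":
    for candidate in ("2.1.89", "2.1.90", "2.1.170", "2.1.9"):
        print(candidate, "->", "ok" if has_ttl_fix(candidate) else "upgrade needed")
```

In practice the input would be the captured standard output of `claude --version`, for example via `subprocess.run(["claude", "--version"], capture_output=True, text=True).stdout`.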
Agent JSON
Canonical machine-readable representation of this issue:
{
"issue_id": "a30cd249-acf2-45b7-bdcb-c90615538d08",
"slug": "claude-code-prompt-cache-ttl-changed-from-1h-to-5m-quota-burn-rate-spike-and-cost-impact-xbwqd7",
"verification_status": "unverified",
"canonical_json": "https://codekb.dev/v1/issues/claude-code-prompt-cache-ttl-changed-from-1h-to-5m-quota-burn-rate-spike-and-cost-impact-xbwqd7"
}