Unverified

Claude Code Prompt Cache TTL Changed from 1h to 5m — Quota Burn Rate Spike and Cost Impact

Around March 6, 2026, Claude Code users began experiencing a sudden spike in quota consumption and extra-usage billing. Analysis of ~120K API calls from user session JSONL logs indicated that Anthropic had changed the default prompt-cache TTL from 1 hour to 5 minutes as part of a per-request TTL optimization. This caused a 17-26% cost increase for long coding sessions, because cache_create operations (billed at the write rate, $3.75-$6.25/MTok) replaced cheaper cache_read hits ($0.30-$0.50/MTok) whenever a session paused for more than 5 minutes. A client-side bug in versions before v2.1.90 made the problem worse: sessions that exhausted their subscription quota stayed on 5m TTL until the process was restarted.

Keywords: Claude Code cache TTL, prompt caching ephemeral_5m, quota burn rate, cache_create vs cache_read, Anthropic API pricing, v2.1.90 fix, JSONL session analysis, Max plan quota exhaustion.

Symptoms

Quota limit reached much faster than before — users hitting 5-hour limits for the first time in March 2026 despite similar usage patterns
Extra usage credits burning rapidly while Max/Pro plan quota shows high remaining capacity (e.g. 86%+ weekly capacity unused but $200 in extra credits consumed)
Cache-creation token counts spike dramatically in session logs (ephemeral_5m_input_tokens replacing ephemeral_1h_input_tokens)
Long coding sessions become disproportionately expensive — cost grows super-linearly with session length due to repeated cache re-creation after 5-minute pauses
Sessions that exhausted subscription quota become stuck on 5m TTL until process restart (fixed in v2.1.90)

Error signatures

ephemeral_5m_input_tokens > 0 and ephemeral_1h_input_tokens == 0 in Claude Code JSONL session logs

API Error 400: 'You're out of extra usage' despite plan dashboard showing available quota

cache_creation tokens in usage object at 5m tier repeatedly for same context blocks

Possible causes

Server-side per-request TTL optimization activated March 6, 2026: Claude Code client selects 5m vs 1h cache TTL per API request based on expected cache-reuse patterns. Anthropic tuned this heuristic so that one-shot or rarely-revisited requests use cheaper 5m writes (~1.25× base), while frequently re-accessed context uses 1h writes (~2× base with amortized reads). The net effect was more requests landing on 5m TTL, which penalizes long coding sessions with pauses.
Client-side bug (pre-v2.1.90): sessions that exhausted subscription quota at startup and switched to overage billing became permanently stuck on 5m TTL regardless of request pattern
Misunderstanding of pricing model: 1h cache writes cost ~2× base input price while 5m writes cost ~1.25× — so '1h everywhere' is NOT cheaper for one-shot or rarely-revisited requests
Long coding sessions inherently trigger 5m re-creation penalty: any context block not re-accessed within 5 minutes triggers a full-price cache write instead of cheap cache read
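
The pricing trade-off described in the causes above can be checked with a little arithmetic. The sketch below uses the write multipliers quoted in this article (5m ≈ 1.25× base, 1h ≈ 2× base). The read multiplier of 0.1× base is an assumption inferred from the $0.30 vs $3.75 figures, and the $3.00/MTok base price and the `cost_per_mtok` helper are purely illustrative.

```python
# Multipliers relative to the base input price.
WRITE_5M = 1.25  # from the article: 5m cache write ~1.25x base
WRITE_1H = 2.0   # from the article: 1h cache write ~2x base
READ = 0.1       # assumption: inferred from $0.30 read vs $3.75 write


def cost_per_mtok(base_price, ttl, rewrites, reads):
    """Cost of one cached block: initial write + rewrites after expiry + reads."""
    write = WRITE_5M if ttl == "5m" else WRITE_1H
    return base_price * (write * (1 + rewrites) + READ * reads)


base = 3.00  # hypothetical base input price in $/MTok

# One-shot request: written once, never reused.
one_shot_5m = cost_per_mtok(base, "5m", rewrites=0, reads=0)
one_shot_1h = cost_per_mtok(base, "1h", rewrites=0, reads=0)

# Same block reused 4 times, each time after a pause longer than 5 minutes
# but within the hour: 5m pays a rewrite per reuse, 1h pays a read per reuse.
paused_5m = cost_per_mtok(base, "5m", rewrites=4, reads=0)
paused_1h = cost_per_mtok(base, "1h", rewrites=0, reads=4)

print(f"one-shot: 5m=${one_shot_5m:.2f}  1h=${one_shot_1h:.2f}")
print(f"paused:   5m=${paused_5m:.2f}  1h=${paused_1h:.2f}")
```

Under these assumptions the 5m tier is cheaper when a block is never reused or is reused continuously, while the 1h tier becomes cheaper as soon as a single reuse follows a pause longer than 5 minutes.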

Solutions

Diagnose TTL Behavior from Session JSONL Logs Before Taking Action

risk: low | github | pending_review

Claude Code stores per-request API usage data in ~/.claude/projects/**/*.jsonl files. Analyzing these logs reveals whether your sessions are on 5m or 1h TTL, the cache hit rate, and whether the v2.1.90 fix is working. This diagnosis step confirms the issue before applying solutions.

Locate session logs: `ls ~/.claude/projects/*/`. Each project directory contains JSONL files with per-message API usage data
Extract cache_creation breakdown: filter for assistant messages and inspect ephemeral_5m vs ephemeral_1h token counts
Compute cache hit rate: compare cache_read_input_tokens total vs cache_create total across a day's sessions
Check for exclusively-5m pattern: if ephemeral_1h_input_tokens is always 0 while ephemeral_5m is non-zero, you may be hitting the v2.1.89 bug
Compare pre-March 6 and post-March 6 data if you have historical logs: the shift from 1h-dominant to 5m-dominant should be visible
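
The diagnosis steps above can also be sketched as a small script, which may be easier to adapt than a one-liner. The record layout (`type`, `message.usage.cache_creation.ephemeral_5m_input_tokens`, and so on) is taken from the jq commands in this article and is not an officially documented schema; the function names are hypothetical.

```python
import json
from pathlib import Path


def summarize_usage(lines):
    """Sum cache token counters over an iterable of JSONL lines."""
    totals = {"e5m": 0, "e1h": 0, "read": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        if not isinstance(rec, dict) or rec.get("type") != "assistant":
            continue
        message = rec.get("message")
        if not isinstance(message, dict):
            continue
        usage = message.get("usage") or {}
        creation = usage.get("cache_creation") or {}
        totals["e5m"] += creation.get("ephemeral_5m_input_tokens") or 0
        totals["e1h"] += creation.get("ephemeral_1h_input_tokens") or 0
        totals["read"] += usage.get("cache_read_input_tokens") or 0
    return totals


def report(totals):
    """Derive the 5m share of cache writes and the overall cache hit rate."""
    writes = totals["e5m"] + totals["e1h"]
    touched = writes + totals["read"]
    return {
        "ratio_5m": totals["e5m"] / writes if writes else None,
        "hit_rate": totals["read"] / touched if touched else None,
        "only_5m": totals["e5m"] > 0 and totals["e1h"] == 0,
    }


def iter_log_lines(root):
    """Yield every line of every <root>/<project>/*.jsonl file."""
    for path in Path(root).expanduser().glob("*/*.jsonl"):
        with open(path, encoding="utf-8", errors="replace") as fh:
            yield from fh
```

Calling `report(summarize_usage(iter_log_lines("~/.claude/projects")))` would return the 5m share of cache writes, the hit rate, and an `only_5m` flag corresponding to the exclusively-5m pattern described above.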

Commands

# Find session log directories:

ls -d ~/.claude/projects/*/

# Check TTL tier distribution:

jq -r 'select(.type=="assistant" and .message.usage.cache_creation) | [.message.usage.cache_creation.ephemeral_5m_input_tokens // 0, .message.usage.cache_creation.ephemeral_1h_input_tokens // 0] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | awk '{s5+=$1; s1+=$2} END {t=s5+s1; if (t>0) printf "5m total: %.0f\n1h total: %.0f\n5m ratio: %.1f%%\n", s5, s1, s5/t*100; else print "No cache_creation data found; check JSONL path"}'

# Check for v2.1.89 bug pattern (stuck on 5m after quota exhaustion):

grep -l 'ephemeral_5m' ~/.claude/projects/**/*.jsonl 2>/dev/null | while IFS= read -r f; do jq -r 'select(.type=="assistant") | .message.usage.cache_creation.ephemeral_1h_input_tokens // 0' "$f"; done | awk '{s+=$1; n++} END {if (n==0) print "No data; check JSONL path"; else if (s==0) print "WARNING: No 1h cache usage detected; may be stuck on 5m TTL"; else print "OK: 1h cache usage present"}'

Risks

JSONL files may contain sensitive code context — sanitize before sharing
Large JSONL files may be slow to process with jq; use head/tail for sampling

Verification

Run the TTL distribution command → expected: both 5m and 1h columns show non-zero values (mixed TTL is normal operation)
If the 5m ratio is consistently above 90%: you may benefit from the solutions below
Run the v2.1.89 bug check → if 'WARNING' appears, upgrade immediately
After applying fixes, re-run diagnosis → 1h ratio should increase measurably

Optimize CLAUDE.md to Front-Load Critical Context

risk: low | github | pending_review

CLAUDE.md is loaded as part of the cached prompt. Structuring it so that the most-referenced content comes first is intended to keep critical context served by cache_read hits rather than being re-created.

Audit CLAUDE.md: identify which sections are used in every request vs. occasionally
Move high-frequency content (build commands, code style, project structure) to the top
Move rarely-used content (detailed API docs, historical notes) to the bottom or separate files
Keep CLAUDE.md concise — every token in it is part of the cache write on each session start
Use @file references for large documentation blocks instead of inlining them
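
To support the audit step above, a per-section size breakdown shows where the bytes in CLAUDE.md actually go. The following is a minimal sketch, assuming sections are delimited by Markdown `#` headings; the `section_sizes` helper is hypothetical and only measures size, not how often a section is used.

```python
import re


def section_sizes(markdown_text):
    """Return (heading, char_count) pairs for each '#'-headed section, in file order."""
    sections = []
    title, size = "(preamble)", 0
    in_fence = False
    for line in markdown_text.splitlines(keepends=True):
        # Track fenced code blocks so '#' comments inside them are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and re.match(r"#{1,6}\s", line):
            if size:
                sections.append((title, size))
            title, size = line.strip(), 0
        size += len(line)
    if size:
        sections.append((title, size))
    return sections


example = "intro\n# Build\nmake\n# API\n" + "x" * 50 + "\n"
for heading, chars in sorted(section_sizes(example), key=lambda s: -s[1]):
    print(f"{chars:6d}  {heading}")
```

Running it on the real file (for example `section_sizes(open("CLAUDE.md", encoding="utf-8").read())`) lists the largest sections first, which are the natural candidates for moving into separate files.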

Commands

wc -c CLAUDE.md

# Sections in CLAUDE.md:

grep -c '^#' CLAUDE.md 2>/dev/null || true

# Check if JSONL logs exist before querying:

ls ~/.claude/projects/**/*.jsonl 2>/dev/null | head -3

# Analyze cache read vs write ratio:

jq -r 'select(.type=="assistant" and .message.usage) | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | head -50

Config examples

# Good CLAUDE.md structure (high-value first):

# 1. Build/lint/test commands

# 2. Project architecture overview (5-10 lines)

# 3. Coding conventions

# 4. File structure guide

# 5. @docs/detailed-api.md  ← reference, don't inline

Risks

Overly aggressive CLAUDE.md trimming may reduce Claude Code's project understanding
May need to manually provide context that was previously auto-available

Verification

Measure cache hit rate before optimization: save CLAUDE.md backup, record baseline hit rate from JSONL
After restructuring CLAUDE.md: `jq -r 'select(.type=="assistant" and .message.usage) | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null` → compute the read/write ratio and compare it with the baseline
Confirm Claude Code can still answer: `claude -p "what are the build commands for this project?"` → should return correct answer without asking clarifying questions
Check quota burn rate on Anthropic console → should show measurable decrease over 2-3 days
Token efficiency: as a rough guideline, keep CLAUDE.md under about 5000 characters (check with `wc -c CLAUDE.md`, which reports bytes)

Keep Sessions Short and Task-Focused to Reduce Cache Re-creation Penalty

risk: low | github | pending_review

The 5m cache TTL means that any pause longer than 5 minutes forces the context to be cached again at the write rate, which is ~12.5× the cache_read rate. Structuring work into shorter, single-task sessions avoids the idle-period penalty and reduces overall cache_create volume.

Break complex work into discrete tasks — one Claude Code session per task
Use /compact before pausing a session to reduce cached context size on next resume
Start a fresh session for each new feature/bug rather than continuing an existing one
Avoid leaving Claude Code sessions idle for more than 5 minutes mid-task
If you must pause, save context manually (e.g., write current state to a file) and start fresh on return
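
The effect of pauses on cache re-creation can be illustrated with a simplified model. This sketch assumes a fixed-size cached prefix, treats each cache hit as refreshing the TTL, and reuses the multipliers quoted in this article (1.25× and 2× for writes) together with an assumed 0.1× read multiplier; the request timestamps are made up for illustration.

```python
def cache_events(request_minutes, ttl_minutes):
    """Classify each request as a cache 'write' or 'read'.

    Simplified model: the first request writes the cache; a later request is
    a read if it arrives within ttl_minutes of the previous request (each hit
    refreshes the TTL), otherwise the entry has expired and is written again.
    """
    events = []
    previous = None
    for t in request_minutes:
        if previous is not None and t - previous <= ttl_minutes:
            events.append("read")
        else:
            events.append("write")
        previous = t
    return events


def relative_cost(events, write_mult, read_mult=0.1):
    """Cost of the cached prefix in units of (base price x prefix size)."""
    return sum(write_mult if e == "write" else read_mult for e in events)


# Hypothetical session: bursts of activity separated by 7- and 10-minute pauses.
session = [0, 1, 2, 9, 10, 20, 21]

events_5m = cache_events(session, ttl_minutes=5)
events_1h = cache_events(session, ttl_minutes=60)
print("5m:", events_5m.count("write"), "writes, cost", round(relative_cost(events_5m, 1.25), 2))
print("1h:", events_1h.count("write"), "writes, cost", round(relative_cost(events_1h, 2.0), 2))
```

In this toy session the 5m tier pays for three writes where the 1h tier pays for one. Real sessions differ because the cached prefix grows with every turn, so the numbers should be read as a direction rather than a prediction.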

Commands

claude -p 'implement X feature' --model claude-sonnet-4-6

# Quick cache hit rate check (single session):

jq -r 'select(.type=="assistant") | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | head -50

# Batch compute cache hit rate across all sessions:

jq -r 'select(.type=="assistant" and .message.usage) | [(.message.usage.cache_read_input_tokens // 0), ((.message.usage.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.message.usage.cache_creation // {}).ephemeral_1h_input_tokens // 0)] | @tsv' ~/.claude/projects/**/*.jsonl 2>/dev/null | awk '{cr+=$1; cw+=$2; n++} END {if (n>0 && cr+cw>0) printf "cache hit rate: %.1f%% (n=%d)\n", cr/(cr+cw)*100, n; else print "No data: check JSONL path"}'

Config examples

# Instead of one long session:

# claude (3-hour session with pauses) → high cache_create cost

# Do this:

# claude -p 'task 1'  (5 min) → exit

# claude -p 'task 2'  (10 min) → exit

# claude -p 'task 3'  (8 min) → exit

Risks

Shorter sessions mean more context re-establishment overhead — each new session starts with a cold cache
May reduce Claude Code's ability to understand project-wide context from session history
If jq returns 'No data': verify JSONL files exist at ~/.claude/projects/**/*.jsonl; try ls ~/.claude/projects/ first

Verification

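Before the change, record a baseline (this assumes /tmp/marker already exists as a timestamp reference, e.g. created with `touch /tmp/marker` at the start of the baseline period): `find ~/.claude/projects -name '*.jsonl' -newer /tmp/marker -exec jq -r 'select(.type=="assistant") | .message.usage.cache_creation.ephemeral_5m_input_tokens // 0' {} \; | awk '{s+=$1} END {print s+0}' > /tmp/cache_before.txt`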
After adopting short-session strategy for 1-2 days: re-run same command, compare counts
Expected: 5m cache_create token counts should drop 40-60% for same workload volume
Monitor plan dashboard at console.anthropic.com → check if quota consumption rate decreases
Run `jq -c 'select(.type=="assistant" and .message.usage) | .message.usage | {cache_create: (((.cache_creation // {}).ephemeral_5m_input_tokens // 0) + ((.cache_creation // {}).ephemeral_1h_input_tokens // 0)), cache_read: (.cache_read_input_tokens // 0)}' ~/.claude/projects/**/*.jsonl 2>/dev/null | jq -s '(map(.cache_read) | add // 0) as $r | (map(.cache_create) | add // 0) as $w | if ($r + $w) > 0 then $r / ($r + $w) else null end'` → cache hit rate (a value between 0 and 1) should improve

Upgrade to Claude Code v2.1.90+ to Fix Client-Side TTL Bug

risk: low | official | pending_review

A bug in Claude Code versions before v2.1.90 caused sessions that exhausted subscription quota to become permanently stuck on 5m TTL. Upgrading to v2.1.90 or later fixes this, ensuring proper per-request TTL selection even after quota exhaustion.

Check current Claude Code version: claude --version
Upgrade via npm: npm update -g @anthropic-ai/claude-code
Verify version >= 2.1.90: claude --version
Restart all active Claude Code sessions to pick up new version
Monitor cache behavior: check ~/.claude/projects/**/*.jsonl for ephemeral_1h_input_tokens values
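
When scripting the version check from the steps above, compare version components numerically: as plain text, "2.1.170" sorts before "2.1.90". The sketch below assumes the output of `claude --version` contains a dotted `x.y.z` number somewhere in it (the exact output format is an assumption); the helper names are hypothetical.

```python
import re

FIXED_VERSION = (2, 1, 90)  # first release with the fix, per this article


def parse_version(text):
    """Extract the first x.y.z triple from a version string as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_ttl_fix(version_output):
    """True if the reported version is at least FIXED_VERSION."""
    return parse_version(version_output) >= FIXED_VERSION


for candidate in ["2.1.89", "2.1.90", "2.1.170"]:
    print(candidate, has_ttl_fix(candidate))
```

The input string could come from `subprocess.run(["claude", "--version"], capture_output=True, text=True).stdout`.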

Commands

claude --version

npm install -g @anthropic-ai/claude-code@latest

npm view @anthropic-ai/claude-code version

claude --version  # verify upgrade took effect

Risks

npm update may pull a newer version with different behavior
Some features may differ between minor versions

Verification

Run `claude --version` → expected output: "2.1.90" or higher (e.g., "2.1.170")
Run `npm view @anthropic-ai/claude-code version` → expected: latest version number (>= 2.1.90)
Start a test session: `cd /tmp && mkdir test-ttl && cd test-ttl && git init && echo test > f.txt && git add . && git commit -m "test" && claude -p "say hello" --model claude-sonnet-4-6` → should complete without "out of extra usage" error
After session: `jq -c 'select(.type=="assistant") | (.message.usage.cache_creation // {}) | {e5m: .ephemeral_5m_input_tokens, e1h: .ephemeral_1h_input_tokens}' ~/.claude/projects/**/*.jsonl 2>/dev/null | head -20` → should show non-zero e1h values on at least some rows (not e1h = 0 or null on every row)
For sessions that previously exhausted quota: restart session, verify ephemeral_1h_input_tokens values are non-zero on subsequent turns

Agent JSON

Canonical machine-readable representation of this issue:

{
  "issue_id": "a30cd249-acf2-45b7-bdcb-c90615538d08",
  "slug": "claude-code-prompt-cache-ttl-changed-from-1h-to-5m-quota-burn-rate-spike-and-cost-impact-xbwqd7",
  "verification_status": "unverified",
  "canonical_json": "https://codekb.dev/v1/issues/claude-code-prompt-cache-ttl-changed-from-1h-to-5m-quota-burn-rate-spike-and-cost-impact-xbwqd7"
}