Unverified

Claude Code Cache TTL Regression: 1h→5m Causes 20-32% Quota Inflation (v2.1.90/v2.1.108 Fix)

Analysis of ~120K API calls across two machines indicates that the prompt cache TTL applied to Claude Code sessions silently regressed from 1 hour to 5 minutes around March 6-8, 2026. The result was a 20-32% increase in cache creation costs and sharp quota consumption spikes for subscription users: Pro and Max users who previously never hit limits began exhausting a 5-hour quota window in about 1.5 hours. Two Claude Code bugs were confirmed as root causes: an overage-latch bug (fixed in v2.1.90) and a telemetry-disabled fallback (fixed in v2.1.108). Upgrading to v2.1.108+ resolves both known bugs. Per staff comments, Anthropic is also introducing env vars for manual TTL control.

Symptoms

  • Sudden 20-32% increase in cache creation token consumption without changes to usage patterns
  • Subscription quota exhausted 3-5x faster than before (e.g., 5-hour limit hit in 1.5 hours on the Max 5x plan)
  • Cache read token counts drop significantly while cache creation tokens dominate usage breakdown
  • ephemeral_5m_input_tokens surge back after disappearing during the 1h-only period (Feb 1 - Mar 5)
  • ephemeral_1h_input_tokens drop to zero or near-zero after March 6-8
  • Users who never hit subscription limits before suddenly encounter rate-limiting and quota exhaustion
  • Long-running sessions with subagents (>5 min tool calls) experience cache invalidation between every turn

Error signatures

ephemeral_5m_input_tokens > 0 appearing in usage after March 6 where previously zero
ephemeral_1h_input_tokens = 0 consistently after March 6 on the main conversation loop
cache_creation tokens >> cache_read tokens indicating TTL mismatch
tool definitions (~24K tokens) shipped without cache_control headers in request inspection

Possible causes

  • Most likely (check first): Overage-latch bug — seats in quota-overage state permanently fall back to 5-minute TTL instead of intended 1-hour TTL for subscribers (fixed in v2.1.90, April 1, 2026)
  • Most likely (check first): Telemetry-disabled fallback — users who disabled telemetry were classified as non-subscription API users and given 5-minute TTL (fixed in v2.1.108, April 13, 2026)
  • Server-side TTL optimization change: Anthropic introduced per-request TTL selection heuristics around March 6-8, 2026 that prioritized 5-minute TTL for most users to reduce costs on one-shot cache writes
  • Architectural cache bypass: System tool catalog (~24K tokens) consistently shipped without cache_control headers, forcing full input charges on every turn regardless of TTL
  • The 'Msg 0' cache escape: First message in each session/subagent turn fails to attach cache_control headers, causing guaranteed cache miss on the largest payload
  • Subagent one-shot architecture: a 5-minute TTL is economical for subagents, whose turns arrive less than 5 minutes apart, but main-agent turns separated by gaps of more than 5 minutes (code review, thinking, long tool calls) rewrite the cache on every turn
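
The trade-off in the last bullet can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the published API price multipliers (5-minute cache write 1.25x base input, 1-hour cache write 2x, cache read 0.1x) and expresses cost in base-input-token equivalents. How subscription quota is actually metered is not confirmed by this article, and `cache_cost` is a hypothetical helper name.

```shell
# Compare cache cost for a prefix of PREFIX tokens reused across TURNS turns,
# where every pause between turns is longer than 5 minutes but under 1 hour.
# Multipliers are stored in hundredths so the arithmetic stays in integers.
WRITE_5M=125   # 1.25x: 5-minute cache write (assumed from published API pricing)
WRITE_1H=200   # 2.00x: 1-hour cache write (assumed)
READ=10        # 0.10x: cache read (assumed)

cache_cost() {
  prefix=$1
  turns=$2
  # 5m TTL: the entry expires during every pause, so each turn rewrites it.
  cost_5m=$(( prefix * turns * WRITE_5M / 100 ))
  # 1h TTL: one write on the first turn, then cheap reads on the remaining turns.
  cost_1h=$(( (prefix * WRITE_1H + prefix * (turns - 1) * READ) / 100 ))
  echo "5m=$cost_5m 1h=$cost_1h"
}

cache_cost 100000 10   # prints: 5m=1250000 1h=290000
```

With a single turn (`cache_cost 100000 1`) the 5-minute write is the cheaper option, which is consistent with the one-shot rationale above; the 1-hour TTL only pays off once a cached prefix is reused after a long pause.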

Solutions

Session Optimization: Shorter Sessions and Cache-Aware Workflows

risk: low · github · pending_review

While waiting for or after applying fixes, reduce the impact of 5-minute TTL by keeping sessions shorter (one task per session), front-loading CLAUDE.md with critical context, and using the /compact command before pausing. This is a mitigation, not a fix — upgrade remains the primary solution.

  1. Structure work into single-task sessions rather than marathon coding sessions spanning multiple hours
  2. Move the most frequently needed context to the top of CLAUDE.md for better cache hit probability (the cache matches on the request prefix, so stable content should come first)
  3. Use `/compact` proactively before pausing for > 5 minutes to reduce the context that will need to be re-cached on return
  4. Avoid using cache-keepalive ping tools — they consume real tokens for cache reads that may never be used, and often cost more than they save

Commands

# Run inside Claude Code session before pausing:
/compact

Config examples

# CLAUDE.md structure for cache efficiency (cache reads from start of message):
# TOP (~500 words): Project architecture overview, key file paths and their purposes, current task context, active constraints
# BOTTOM: Detailed coding conventions, long-form documentation references, historical notes

Risks

  • Shorter sessions mean more context-switching overhead — test whether the reduced cache pressure outweighs the overhead for your workflow
  • Cache-keepalive ping tools (e.g., claude-code-cache-keepalive) are counterproductive: each ping is a full cache-read on the prefix plus response tokens, consuming quota for cache that may never be used

Verification

  • After adopting shorter sessions, track daily quota usage with `/cost` — EXPECTED: burn rate should decrease compared to pre-optimization baseline
  • Ensure session-to-session context is sufficient: can you resume work without re-explaining the codebase? If not, add more context to CLAUDE.md TOP section

Diagnose Cache TTL Status from JSONL Session Logs

risk: low · github · pending_review

Use jq queries against Claude Code's JSONL session logs (~/.claude/projects/) to determine whether your sessions are using 1-hour or 5-minute TTL caching, identify which sessions and versions are affected, and verify the fix is working after upgrade.

  1. Pre-check: verify JSONL files exist under ~/.claude/projects/ (session logs sit in per-project subdirectories, so search recursively) with `find ~/.claude/projects/ -name '*.jsonl' 2>/dev/null | head -5`
  2. Run the TTL distribution query to see which dates/versions are affected
  3. Run the 1h vs 5m summary query to get total token counts per TTL tier
  4. Compare results before and after upgrade: 1h tokens should be non-zero post-upgrade

Commands

# Pre-check: verify JSONL files exist (searched recursively; `grep .` makes the fallback fire when nothing is found)
find ~/.claude/projects/ -name '*.jsonl' 2>/dev/null | head -5 | grep . || echo 'NO_JSONL_FILES_FOUND'
# TTL distribution by date and version (keeps only main-loop, non-Haiku calls that wrote 5m cache tokens).
# Note: `.isSidechain == false` is used because jq's `//` treats false like null, so `(.isSidechain // true)` is always true.
grep -h -r --include='*.jsonl' -E 'ephemeral_.*_input_tokens' ~/.claude/projects/ 2>/dev/null | jq -r 'select(.isSidechain == false and ((.message.model // "") | startswith("claude-haiku") | not) and (.message.usage.cache_creation.ephemeral_5m_input_tokens // 0) > 0) | (.timestamp // "unknown") + "," + (.version // "unknown")' 2>/dev/null | sed 's/T.*,/,/' | sort | uniq -c
# 1h vs 5m token summary:
find ~/.claude/projects/ -name '*.jsonl' -exec cat {} + 2>/dev/null | jq -s 'map(select(.message.usage.cache_creation)) | {total_1h_tokens: (map(.message.usage.cache_creation.ephemeral_1h_input_tokens // 0) | add), total_5m_tokens: (map(.message.usage.cache_creation.ephemeral_5m_input_tokens // 0) | add), api_call_count: length}' 2>/dev/null

Risks

  • JSONL files can be very large (multiple GBs) — the summary query reads ALL files; on large datasets, use `find ... -name '*.jsonl' -newer <date-file>` to limit scope to recent sessions
  • jq queries on large files may be slow; run in background with `&` and check output later if dataset exceeds 100MB
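
For very large log directories, a rough streaming alternative to the slurping summary query can avoid loading everything into memory. The sketch below is an assumption-laden approximation, not part of the original analysis: `ttl_token_sums` is a hypothetical helper, it relies on the two field names quoted in this article appearing as `"name": <integer>` in the logs, and it counts every occurrence of a field, so it does no sidechain or model filtering.

```shell
# Sum 5m and 1h cache-creation tokens without jq, streaming through the logs.
# $1 = log directory, $2 = only consider files modified in the last N days (default 30).
ttl_token_sums() {
  dir=$1
  days=${2:-30}
  find "$dir" -name '*.jsonl' -mtime "-$days" -exec cat {} + 2>/dev/null |
    # Pull out each `"ephemeral_<tier>_input_tokens": <n>` occurrence on its own line.
    grep -oE '"ephemeral_(5m|1h)_input_tokens"[[:space:]]*:[[:space:]]*[0-9]+' |
    # Reduce each match to "<tier> <n>".
    sed -E 's/^"ephemeral_(5m|1h)_input_tokens"[[:space:]]*:[[:space:]]*([0-9]+)$/\1 \2/' |
    # Accumulate per tier; tiers that never appear print as 0.
    awk '{ sum[$1] += $2 } END { printf "5m=%d 1h=%d\n", sum["5m"], sum["1h"] }'
}

# Example: ttl_token_sums ~/.claude/projects 30
```

Because it skips the filters used in the jq queries, the totals are best read as an upper bound and used to spot the 1h-to-5m shift rather than to compute exact costs.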

Verification

  • Run `find ~/.claude/projects/ -name '*.jsonl' 2>/dev/null | head -5 | grep . || echo 'NO_JSONL_FILES_FOUND'`. EXPECTED OUTPUT: at least one .jsonl file path, or 'NO_JSONL_FILES_FOUND' if no session logs exist
  • Run the TTL distribution query: dates after your upgrade (>= April 2026) should show fewer/no lines in output (indicating 5m tokens are rare)
  • Run the 1h vs 5m summary: `total_1h_tokens` should be > 0 after upgrade — non-zero 1h tokens confirm the fix is active

Enable Telemetry to Prevent TTL Fallback

risk: low · official · pending_review

Users who have disabled telemetry in Claude Code may be incorrectly classified as non-subscription users and given the 5-minute TTL. Enabling telemetry lets the server identify subscription status correctly and apply the 1-hour TTL where appropriate. This is mainly a workaround for versions before v2.1.108, which is reported to fix the fallback; if 1h cache tokens stay at zero after upgrading with telemetry still disabled, re-enabling it is worth testing.

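  1. Check if telemetry is disabled: look for telemetry-related env vars such as DISABLE_TELEMETRY or CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC (the opt-out variables listed in Claude Code's settings documentation), as well as any CLAUDE_CODE_DISABLE_TELEMETRY or CLAUDE_CODE_TELEMETRY entries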
  2. Remove or comment out any telemetry-disabling environment variables from shell profiles (~/.bashrc, ~/.zshrc, ~/.config/fish/config.fish)
  3. Remove telemetry-disable flags from any Claude Code configuration files in ~/.claude/
  4. Restart Claude Code and verify telemetry status is no longer suppressed

Commands

env | grep -iE 'telemetry|CLAUDE_CODE_DISABLE' 2>/dev/null || echo 'NO_TELEMETRY_VARS_FOUND'
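# Search config files only (skips the large .jsonl session logs); `grep .` makes the fallback fire when nothing matches
grep -r -i --include='*.json' 'telemetry' ~/.claude/ 2>/dev/null | head -10 | grep . || echo 'NO_TELEMETRY_CONFIG_FOUND'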
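
The `env` check only covers the current shell. To find where a variable is set persistently, the shell profiles named in step 2 can be scanned for active (uncommented) assignments. This is a minimal sketch: `scan_profiles_for_telemetry` is a hypothetical helper, and it matches any variable whose name contains TELEMETRY rather than a confirmed list of Claude Code variables.

```shell
# List uncommented lines that set a *TELEMETRY* variable in the given files.
# Handles `export NAME=...`, bare `NAME=...`, and fish-style `set -gx NAME ...`.
# Prints matches as file:line:content; returns 0 if anything matched, 1 otherwise.
scan_profiles_for_telemetry() {
  found=1
  for f in "$@"; do
    [ -f "$f" ] || continue   # skip profiles that do not exist
    if grep -nHiE '^[[:space:]]*(export[[:space:]]+|set[[:space:]]+-[a-z]+[[:space:]]+)?[a-z0-9_]*telemetry' "$f"; then
      found=0
    fi
  done
  return "$found"
}

# Example (paths taken from the step list above):
# scan_profiles_for_telemetry ~/.bashrc ~/.zshrc ~/.config/fish/config.fish
```

Lines that are already commented out are ignored, so a non-empty result points at settings that are still in effect for new shells.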

Config examples

# BEFORE (remove whichever of these lines appear in ~/.bashrc or ~/.zshrc):
# export DISABLE_TELEMETRY=1              ← DELETE
# export CLAUDE_CODE_DISABLE_TELEMETRY=1  ← DELETE
# export CLAUDE_CODE_TELEMETRY=false      ← DELETE

# AFTER (no telemetry-disabling vars should remain):
# (these lines should be absent from your shell profile)

Risks

  • Enabling telemetry sends usage data to Anthropic — review privacy implications before proceeding
  • Some users may prefer higher costs over data sharing — this is a tradeoff, not mandatory

Verification

  • Run `env | grep -iE 'telemetry|CLAUDE_CODE_DISABLE'` — EXPECTED OUTPUT: empty (no matches) or 'NO_TELEMETRY_VARS_FOUND'
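  • After a session, check 1h cache activity: `grep -h -r --include='*.jsonl' 'ephemeral_1h_input_tokens' ~/.claude/projects/ 2>/dev/null | jq -c 'select((.message.usage.cache_creation.ephemeral_1h_input_tokens // 0) > 0)' 2>/dev/null | wc -l`. EXPECTED OUTPUT: number > 0 (the parentheses matter: without them jq parses `// 0 > 0` as `// (0 > 0)` and also selects records whose 1h count is 0; `-c` keeps one record per line so `wc -l` counts records)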

Upgrade Claude Code to v2.1.108+ (Primary Fix)

risk: low · official · pending_review

Upgrade to v2.1.108 or later. Releases from that version onward include both the overage-latch fix (shipped in v2.1.90) and the telemetry-disabled fallback fix (shipped in v2.1.108), along with additional cache optimizations from Anthropic.

  1. Check current Claude Code version: `claude --version` or `npm list -g @anthropic-ai/claude-code`
  2. Upgrade to latest: `npm install -g @anthropic-ai/claude-code@latest`
  3. Verify upgrade: `claude --version` should report >= 2.1.108
  4. Restart all running Claude Code sessions after upgrade
  5. Run a test session and check /cost to verify cache_read tokens are non-trivial

Commands

npm list -g @anthropic-ai/claude-code 2>/dev/null | grep claude-code
npm install -g @anthropic-ai/claude-code@latest
claude --version
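
Checking "reports >= 2.1.108" by eye is error-prone because dotted versions do not sort as plain strings (2.1.90 sorts after 2.1.108 lexically). The following sketch compares the three numeric components instead. `version_ge` is a hypothetical helper, and the commented usage assumes that `claude --version` prints a three-part dotted version somewhere in its output.

```shell
# Return 0 when dotted version $1 is >= $2, comparing major.minor.patch numerically.
# Assumes both arguments have the form X.Y.Z with numeric components.
version_ge() {
  have=$1
  want=$2
  for i in 1 2 3; do
    h=$(printf '%s' "$have" | cut -d. -f"$i")
    w=$(printf '%s' "$want" | cut -d. -f"$i")
    h=${h:-0}
    w=${w:-0}
    if [ "$h" -gt "$w" ]; then return 0; fi
    if [ "$h" -lt "$w" ]; then return 1; fi
  done
  return 0   # all three components are equal
}

# Example (assumes the version string appears in the command's output):
# installed=$(claude --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
# version_ge "$installed" 2.1.108 && echo 'fixes included' || echo 'upgrade needed'
```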

Config examples

# package.json (project-local install)
{
  "devDependencies": {
    "@anthropic-ai/claude-code": "^2.1.108"
  }
}

Risks

  • Version 2.1.108+ may have breaking changes in MCP or tool API — review changelog before upgrading in production pipelines
  • If using custom MCP servers, verify compatibility with the new Claude Code version

Verification

  • Run `claude --version` — EXPECTED OUTPUT: version number >= 2.1.108 (e.g., '2.1.170')
  • Start a new Claude Code session, work for >= 10 minutes with pauses between prompts
  • In Claude Code, run `/cost` — EXPECTED: cache_read tokens > 0 (non-zero indicates cache hits are working)
  • Check JSONL for 1h cache activity: `grep -h -r --include='*.jsonl' 'ephemeral_1h_input_tokens' ~/.claude/projects/ 2>/dev/null | jq -c 'select((.message.usage.cache_creation.ephemeral_1h_input_tokens // 0) > 0)' 2>/dev/null | head -3`. EXPECTED OUTPUT: at least one JSON object with non-zero ephemeral_1h_input_tokens (the parentheses around `// 0` are required; without them jq also selects records whose 1h count is 0)

Agent JSON

Canonical machine-readable representation of this issue:

{
  "issue_id": "78037c97-9a41-4817-ba06-d8ff35f07a09",
  "slug": "claude-code-cache-ttl-regression-1h-5m-causes-20-32-quota-inflation-v2-1-90-v2-1-108-fix-2tra50",
  "verification_status": "unverified",
  "canonical_json": "https://codekb.dev/v1/issues/claude-code-cache-ttl-regression-1h-5m-causes-20-32-quota-inflation-v2-1-90-v2-1-108-fix-2tra50"
}