
Prompt caching at 1h TTL: cutting LLM spend by 80%

Most teams underuse prompt caching. Here's the exact pattern for placing cache breakpoints, the mistakes that silently disable caching, and how we cut one workload's bill by 80% with a single line of config.

Tags: prompt-caching · cost-optimization · claude · llm-ops

Prompt caching is the single biggest cost lever in an LLM application. Cache reads cost about 10% of fresh input. For any workload where most of your input is repeated across calls, the math is obvious and the savings are big — if you set it up right.

Most teams set it up wrong.

The two TTLs and what they're for

Claude supports two cache lifetimes:

| TTL | Cost of cache write | Use case |
|---|---|---|
| 5 minutes | ~1.25× input | Active conversation, session-scoped state |
| 1 hour | ~2× input | System prompts, knowledge bases, cross-session context |

Cache reads cost ~0.1× input regardless of TTL. The only difference is how long the cache persists and what the write costs.

Rule: use 1h when the cache is reused across independent requests or users; use 5m for conversations inside a single session.
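The table's economics can be sanity-checked in a few lines. A minimal sketch using the multipliers above (0.1× reads, 1.25× and 2× writes); `effectiveMultiplier` is our own helper, not an SDK function:

```typescript
// Effective input-token multiplier for one call, given a cache hit rate
// and the write premium for the chosen TTL (1.25x for 5m, 2x for 1h).
// A hit reads the prefix at ~0.1x; a miss re-writes it at the premium.
function effectiveMultiplier(hitRate: number, writePremium: number): number {
  return hitRate * 0.1 + (1 - hitRate) * writePremium;
}

// At a 90% hit rate the 1h TTL's pricier writes barely matter:
const fiveMin = effectiveMultiplier(0.9, 1.25); // ≈ 0.215x of fresh input
const oneHour = effectiveMultiplier(0.9, 2.0);  // ≈ 0.29x of fresh input
```

Both land far below 1×, which is why the choice between TTLs is about reuse window, not cost.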

The breakpoint model

You get up to 4 cache breakpoints per request. Everything up to a breakpoint is cached as a prefix. The typical production layout:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  system: [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral", ttl: "1h" } },
  ],
  tools: [
    // ... all tool definitions ...
    { ...lastTool, cache_control: { type: "ephemeral", ttl: "1h" } },
  ],
  messages: [
    {
      role: "user",
      content: [{ type: "text", text: bigDocument, cache_control: { type: "ephemeral", ttl: "1h" } }],
    },
    {
      role: "assistant",
      content: [{ type: "text", text: "Understood.", cache_control: { type: "ephemeral", ttl: "5m" } }],
    },
    { role: "user", content: currentQuestion },
  ],
});

Four breakpoints: end of system prompt, end of tools, end of document/history, end of last assistant turn. The current user turn is not cached — it changes every call.
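A cheap pre-flight guard for the 4-breakpoint limit. A sketch; `countBreakpoints` is a hypothetical helper that walks the same request shape as the call above:

```typescript
// Count cache_control markers across system, tools, and message content.
// The API allows at most 4 per request.
type Block = { cache_control?: { type: string; ttl?: string } };

function countBreakpoints(req: {
  system?: Block[];
  tools?: Block[];
  messages?: { content: Block[] | string }[];
}): number {
  const blocks: Block[] = [
    ...(req.system ?? []),
    ...(req.tools ?? []),
    ...(req.messages ?? []).flatMap(m =>
      Array.isArray(m.content) ? m.content : []),
  ];
  return blocks.filter(b => b.cache_control !== undefined).length;
}
```

Asserting `countBreakpoints(req) <= 4` before every `create` call turns a silent pricing mistake into a loud one.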

What silently disables caching

Caching requires byte-exact prefix match up to the breakpoint. Five things break it:

  1. A timestamp or UUID in the system prompt — moves every call, cache miss every call
  2. Reordering tools — same tools in different order = different prefix
  3. Changing any earlier message — even whitespace invalidates
  4. Different model — caches are per-model
  5. Prefix below minimum size — 1024 tokens for most, 2048 for Haiku. Under that, cache_control is ignored

Every team we've debugged has hit at least one of these. The Date.now() someone dropped into a system prompt 6 months ago is costing the company tens of thousands per quarter.
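One way to catch prefix drift before the bill does: fingerprint the cacheable prefix of every outgoing request and alert when it changes between calls that should share a cache. A sketch; the JSON serialization is an assumption (it is deterministic for an unchanged object, which is all this check needs):

```typescript
import { createHash } from "node:crypto";

// Hash the would-be cached prefix (system + tools), serialized as-is.
// If this hash differs between two requests that should share a cache,
// something in the prefix is shifting per call.
function prefixFingerprint(system: unknown, tools: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify({ system, tools }))
    .digest("hex");
}

const a = prefixFingerprint("You are a support agent.", [{ name: "lookup" }]);
const b = prefixFingerprint(
  `You are a support agent. Now: ${Date.now()}`,
  [{ name: "lookup" }],
);
// a !== b: the embedded timestamp guarantees a cache miss on every call.
```

Logging this fingerprint alongside the usage metrics makes "why did hit rate drop at 3pm" a one-line diff instead of an archaeology project.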

Measuring hit rate

Every call returns usage metrics. Log them every time:

const usage = response.usage;
const hitRate = usage.cache_read_input_tokens /
  (usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens);

Target: > 80% hit rate for steady-state production workloads.

Three diagnostic patterns when it's lower:

  • cache_creation > 0 every call: prefix is changing. Diff consecutive requests byte-by-byte
  • cache_read = 0 after first call: cache expired, different model, or different parameter
  • Both zero: cache_control missing, or prefix below minimum size
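Those patterns can be folded into a small classifier over the usage object. A sketch; the field names match the usage metrics above, the labels are ours, and it assumes the call is steady-state (not the first call after a deploy):

```typescript
type Usage = {
  input_tokens: number;
  cache_read_input_tokens: number;
  cache_creation_input_tokens: number;
};

// Map one call's usage to a likely diagnosis.
function diagnoseCache(u: Usage): string {
  const read = u.cache_read_input_tokens;
  const write = u.cache_creation_input_tokens;
  if (read === 0 && write === 0)
    return "no caching: cache_control missing or prefix below minimum size";
  if (write > 0 && read === 0)
    return "full miss: prefix changed, cache expired, or different model";
  if (write > 0)
    return "partial miss: early breakpoints hit, a later prefix segment changed";
  return "healthy: reading from cache";
}
```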

A real cost win

A workload we measured: customer support assistant, 50K token system prompt (brand voice, tool descriptions, FAQ), ~200K queries/day, Sonnet 4.6.

Before caching:

  • 200K × 50K input tokens = 10B input tokens/day
  • At $3/M = $30,000/day = $900K/month

After adding a 1h breakpoint on the system prompt, assuming a ~95% cache hit rate (conservative for a prefix shared across 200K daily calls):

  • Cache hit (95% of calls): 50K × 0.1 = 5K effective input tokens
  • Cache miss (5% of calls): 50K × 2 = 100K effective tokens, since the miss re-writes the 1h cache
  • Weighted average: 0.95 × 5K + 0.05 × 100K = 9.75K effective tokens per call (vs 50K)
  • Daily: 200K × 9.75K = 1.95B effective tokens; at $3/M = $5,850/day ≈ $176K/month

Savings: ~$724K/month, an 80% cut. One cache_control line.

Numbers will vary — query volume, prompt size, hit rate all matter. But the magnitude of the win is not subtle. If you have a stable multi-KB system prompt and > 1000 requests per hour, caching is free money you haven't picked up.
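The same arithmetic as a reusable sketch; the function and its parameter names are ours, and the prices and token counts are this example's assumptions:

```typescript
// Monthly input cost in dollars for a workload with a cached prefix.
// Hits read the prefix at 0.1x; misses re-write the 1h cache at 2x.
function monthlyInputCost(opts: {
  callsPerDay: number;
  promptTokens: number;
  hitRate: number;
  pricePerMTok: number;
}): number {
  const { callsPerDay, promptTokens, hitRate, pricePerMTok } = opts;
  const perCall =
    hitRate * promptTokens * 0.1 + (1 - hitRate) * promptTokens * 2;
  return (callsPerDay * perCall * 30 * pricePerMTok) / 1e6;
}

const before = (200_000 * 50_000 * 30 * 3) / 1e6; // $900,000/month uncached
const after = monthlyInputCost({
  callsPerDay: 200_000, promptTokens: 50_000, hitRate: 0.95, pricePerMTok: 3,
}); // ≈ $175,500/month
```

Plugging in your own volume and hit rate takes seconds and usually ends the caching-priority debate.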

The multi-turn pattern

For conversations, move a 5-minute breakpoint to the last assistant message each turn. The cache accumulates history cheaply:

function addBreakpoint(messages) {
  // Remove the breakpoint placed on the previous turn, so repeated calls
  // move the marker instead of accumulating past the 4-breakpoint limit.
  for (const m of messages) {
    if (m.role === "assistant" && Array.isArray(m.content)) {
      for (const block of m.content) delete block.cache_control;
    }
  }
  const last = [...messages].reverse().find(m => m.role === "assistant");
  if (last && Array.isArray(last.content)) {
    const tail = last.content[last.content.length - 1];
    if (tail.type === "text") tail.cache_control = { type: "ephemeral", ttl: "5m" };
  }
  return messages;
}

Each call reuses the cache from the previous turn plus new content. For a 20-turn conversation, this is the difference between paying full cost on 20 accumulated histories vs paying cache-read cost on 19 of them.
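To put a rough number on that, a sketch under the simplifying assumption that every turn adds a flat 1K tokens of history, using the 0.1× read and 1.25× 5m-write multipliers from the pricing table:

```typescript
// Total effective input tokens over an N-turn conversation where each
// turn appends `turnTokens` of history. Uncached: every turn re-pays the
// whole accumulated history at full price. Cached: each turn reads the
// prior history at 0.1x and writes only the new turn at 1.25x.
function conversationTokens(turns: number, turnTokens: number, cached: boolean): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    const history = (t - 1) * turnTokens;
    total += cached ? history * 0.1 + turnTokens * 1.25 : history + turnTokens;
  }
  return total;
}

const uncached = conversationTokens(20, 1_000, false); // 210,000 effective tokens
const withCache = conversationTokens(20, 1_000, true); // ≈ 44,000 effective tokens
```

Under these assumptions the cached conversation costs under a quarter of the uncached one, and the gap widens with longer turns and longer conversations.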

Combining with 1M context

1M context is expensive per call. Caching is what makes it affordable. Cache the codebase or document once at 1h TTL; ask many questions against it for the next hour at cache-read rates. The first call is ~2× input cost for the write; breakeven hits around the second question.
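The breakeven claim follows from the same multipliers. A sketch, assuming the whole corpus is the cached prefix and ignoring the (comparatively tiny) question tokens:

```typescript
// Effective tokens for Q questions against a large cached context:
// one 2x cache write, then Q-1 reads at 0.1x.
function cachedCorpusTokens(contextTokens: number, questions: number): number {
  return contextTokens * 2 + (questions - 1) * contextTokens * 0.1;
}

// Uncached: every question re-pays the full context.
function uncachedCorpusTokens(contextTokens: number, questions: number): number {
  return contextTokens * questions;
}

// For a 1M-token corpus: 1 question costs 2x vs 1x (caching loses);
// 3 questions cost 2.2x vs 3x (caching already wins). Breakeven lands
// just past the second question: 2 + 0.1(Q-1) < Q once Q > 2.11.
```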

Without caching, 1M context is usually too expensive to be the right choice. With caching, it often beats RAG for "I have a large corpus and many questions" workloads.

What to do tomorrow

Three concrete steps:

  1. Add a usage logger that tracks cache_read_input_tokens and cache_creation_input_tokens per call. Plot hit rate per endpoint
  2. Pick your most-called endpoint. Add a 1h breakpoint at the end of the system prompt. Measure before/after bill
  3. Hunt for cache-breakers: grep your prompt assembly code for Date.now(), crypto.randomUUID(), and anything else that shifts per call. Move them to the user message

If your hit rate is already > 80%, you've nailed it. If it's lower, every percentage point up is money you're leaving on the table.

For deeper patterns — four-breakpoint layouts, multi-user cache isolation, tool definition caching — see our prompt caching TTL skill.
