Context Caching: The Feature Nobody Talks About


How to cut your AI costs by 90% with one API parameter


We Were Burning Through Tokens

At CloneForce, we were automating everything — planning, iterating, running the Ralph Wiggum loop, building entire features through AI pipelines. Every step in the process was an API call. And every API call included the same information: the system prompt, the project context, the tool definitions, the architectural guidelines.

The stuff that changed between calls — the actual task, the new error log, the user's question — was maybe 20% of the tokens. The other 80% was identical every time. And we were paying full price for all of it, on every single call.

When you're running agent loops that make dozens of calls per session, that adds up fast. We were burning through tokens at a rate that wasn't sustainable if we wanted to scale.

That's when I found context caching. And honestly, it felt like discovering a cheat code that was just sitting in the documentation the whole time.

How I Found It

I was looking at our token usage trying to figure out how to bring it down. I did what I always do — I asked Claude. "How can we reduce our token costs?" It suggested context caching. I'd never heard of it, so I dug into the Anthropic API docs and found cache_control in the messages API reference.

The idea is dead simple: if you're sending the same block of text at the beginning of every request — your system prompt, project context, tool definitions — you can tell the API to cache that prefix. The first request pays full price. Every request after that gets charged a fraction of the cost for those cached tokens.

The token costs for cached content dropped by roughly 90%. For our agent loops that were making dozens of calls with the same prefix, that was a massive difference.
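To make the savings concrete, here's a back-of-the-envelope cost model. The multipliers are assumptions based on Anthropic's published pricing at the time of writing (cache writes cost about 1.25x the base input rate, cache reads about 0.1x) — check the current pricing page before relying on them, and the token counts are made up for illustration:

```python
# Back-of-the-envelope cost model for prompt caching.
# Assumed multipliers: cache writes ~1.25x base input rate, reads ~0.1x.

def input_cost(cached_tokens, fresh_tokens, num_calls,
               price_per_mtok=3.00,   # base input price, $/million tokens
               write_mult=1.25,       # premium to write the cache (call 1)
               read_mult=0.10):       # discount on cache hits (calls 2..n)
    per_tok = price_per_mtok / 1_000_000
    first = cached_tokens * write_mult * per_tok + fresh_tokens * per_tok
    rest = (num_calls - 1) * (cached_tokens * read_mult * per_tok
                              + fresh_tokens * per_tok)
    return first + rest

# A hypothetical agent loop: 8,000-token cached prefix,
# 2,000 fresh tokens per call, 50 calls in one session.
with_cache = input_cost(8_000, 2_000, 50)
no_cache = input_cost(8_000, 2_000, 50, write_mult=1.0, read_mult=1.0)
print(f"without caching: ${no_cache:.2f}")
print(f"with caching:    ${with_cache:.2f}")
```

Note the cached portion itself drops by ~90%, but your total savings depend on the ratio of cached to fresh tokens in each request.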

But there's a gotcha: the cache only lasts about five minutes. If your calls are spaced out — say a user asks a question, wanders off for ten minutes, and comes back — the cache is cold and you're paying full price again. For our use case at CloneForce, this was fine. The agent loops were making rapid-fire calls, so the cache stayed warm. But for something like a chatbot where users take their time between messages, you have to design around that five-minute window.

And one more thing: the cached prefix has to be exactly the same. One character difference — a timestamp, a dynamic variable, anything — and the cache is blown. You're paying full price again and you might not even notice. We learned to be very careful about keeping the cached portion truly static and pushing anything dynamic below the cache breakpoint.
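One way to keep the prefix byte-identical is to build every request through a single helper that hard-codes the static block and only accepts the dynamic parts as arguments. A minimal sketch — the prompt text, model name, and function name here are illustrative, not CloneForce's actual code:

```python
# Keep the cached prefix byte-identical by isolating it in one constant.
# Anything dynamic (timestamps, task details) goes below the breakpoint.

STATIC_SYSTEM = (
    "You are the build agent. Follow the architectural guidelines.\n"
    "Project context and tool definitions go here, unchanged per call."
)

def build_request(task: str, model: str = "claude-sonnet-4-20250514"):
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM,  # identical bytes on every call
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content lives here, below the cache breakpoint.
        "messages": [{"role": "user", "content": task}],
    }

a = build_request("Fix the failing test in auth.py")
b = build_request("Now run the linter")
assert a["system"] == b["system"]  # cached block identical; only messages differ
```

If a timestamp ever sneaks into `STATIC_SYSTEM`, every call becomes a cache miss — centralizing the prefix in one place makes that much harder to do by accident.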

It's not a set-it-and-forget-it feature — you have to think about your request patterns.

How Context Caching Actually Works

Here's the mental model. When you make an API call to Claude, you're sending a sequence of messages — system prompt first, then the conversation. Context caching lets you draw a line in that sequence and say, "Everything above this line? Cache it."

In Anthropic's API, you do this by adding a cache_control field with type: "ephemeral" to the content block you want cached. That's it. That's the whole implementation.

{
  "system": [
    {
      "type": "text",
      "text": "You are BJT's career assistant. Here is everything you need to know about BJT...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "What's BJT's experience with React?" }
  ]
}

The first call processes that system block and writes it to the cache — cache writes carry a small premium, roughly 25% over the base input rate. For the next five minutes or so, any subsequent call with the same prefix hits the cache, and cached input tokens cost roughly 10% of what regular input tokens cost. One caveat: there's a minimum cacheable size (on the order of 1,024 tokens for most models), so very short prompts won't cache at all. The cache refreshes automatically — if you keep making calls, it stays warm.
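With the official anthropic Python SDK, the same request looks like the sketch below. The usage fields on the response are how you verify caching is actually working. This assumes the current SDK interface and a valid API key, so treat it as a sketch rather than copy-paste production code:

```python
import os

# Same request as the JSON above, expressed as keyword arguments.
payload = dict(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are BJT's career assistant. Here is everything you need to know about BJT...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What's BJT's experience with React?"}],
)

# Only hit the API when a key is configured.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(**payload)
    # cache_creation_input_tokens: tokens written to the cache (first call)
    # cache_read_input_tokens: tokens served from the cache (later calls)
    print(response.usage.cache_creation_input_tokens,
          response.usage.cache_read_input_tokens)
```

On the first call you should see a nonzero `cache_creation_input_tokens`; on repeat calls within the window, `cache_read_input_tokens` should be nonzero instead. If both stay zero, your prefix isn't caching.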

You can set multiple cache breakpoints too. Say you have a system prompt plus a big chunk of retrieved documents. Cache both. The model only needs to process the new stuff — the user's actual question.
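Two breakpoints might look like this — one after the system prompt, one after the retrieved documents. The document text and question are placeholders; Anthropic's API currently allows up to four breakpoints per request:

```python
# Two cache breakpoints: everything above each breakpoint is cached;
# only the user's actual question is processed fresh each time.

retrieved_docs = "Placeholder for a large block of retrieved documents."

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a research assistant.",
            "cache_control": {"type": "ephemeral"},  # breakpoint 1
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_docs,
                    "cache_control": {"type": "ephemeral"},  # breakpoint 2
                },
                {"type": "text", "text": "What do these documents conclude?"},
            ],
        }
    ],
}
```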

When This Changes Everything

Context caching isn't equally useful for every use case. It shines brightest in specific patterns:

AI assistants with rich system prompts. If your assistant has a 4,000-token system prompt and handles 100 conversations an hour, you're re-sending 400,000 tokens of identical content every hour. With caching, you pay full price once and get the 90% discount on the rest. That changes your unit economics completely.

RAG pipelines with overlapping context. If your retrieval step pulls similar documents across queries — and it often does — caching the shared prefix means you're only paying for the delta. The documents that are genuinely new to each request.

Agent loops. This is the big one — and the one that saved us at CloneForce. When you're running an agent that makes dozens of API calls in a loop — planning, executing, testing, fixing — the system prompt and tool definitions stay the same across every iteration. Without caching, you're paying for that scaffolding on every single step. With caching, the scaffolding is basically free after the first call. The five-minute window works perfectly here because agent loops fire calls in rapid succession — the cache never goes cold.

Multi-turn conversations. The conversation history grows with each turn, but the system prompt at the top stays the same. Cache it, and you only pay full price for the new messages.
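For the multi-turn case, one pattern is to move a cache breakpoint to the newest message on every call, so the entire conversation prefix up to that point gets cached incrementally. A sketch — the helper function here is illustrative, not an SDK feature:

```python
# Multi-turn pattern: mark the newest user message as a cache breakpoint.
# Each turn, the whole prefix up to that message is served from cache,
# and you only pay fresh for what's new since last turn.

def with_conversation_cache(system_text, history, new_user_message):
    messages = [dict(m) for m in history]  # don't mutate the caller's list
    messages.append({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": new_user_message,
                "cache_control": {"type": "ephemeral"},  # moves forward each turn
            }
        ],
    })
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_text,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": messages,
    }
```

Each new turn's request shares the previous turn's prefix byte-for-byte, so the cache hit covers the system prompt plus all earlier messages.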

What This Really Means

Here's my actual take on this: context caching is a litmus test.

It's not a flashy feature. It's not a new model. It won't get you Twitter engagement. But it's the kind of thing that separates people who are building real AI applications from people who are following tutorials.

When I talk to other engineers building on LLMs, the ones who know about caching are almost always the ones who've actually shipped something to production and had to look at the bill. The ones who don't know about it are usually still in the "vibe coding" phase where costs don't matter because nobody's actually using their thing yet.

And that's fine — everyone starts somewhere. But if you're serious about building AI products that scale, you need to care about the plumbing. Context caching is plumbing. It's the difference between a prototype that costs a fortune at 100 users and a product that stays viable at 10,000.

The bigger lesson for me was this: be curious and ask a lot of questions. Even stupid questions work sometimes. I found context caching because I asked Claude a simple question about token costs. That's it. No deep research strategy. Just curiosity and a willingness to ask.

There's something I learned in sales a long time ago, and I teach it to my kids: you don't ask, you don't get. When my kids tell me they didn't get something, I ask them — did you ask? Usually the answer is no. They were too scared, or they assumed the answer would be no. But if you don't ask at all, you will never get it. You have a 50/50 chance when you ask. You have a 0% chance when you don't.

That applies to AI engineering too. Ask the model. Ask the docs. Ask stupid questions. Ask "is there a cheaper way to do this?" and see what comes back. The most valuable features in any API are usually the boring ones buried three pages deep in the docs — the ones that don't get a launch blog post. But you'll never find them if you don't ask.

Your Move

Next time you're building something on Claude (or any LLM API), look at your request pattern. If you're sending the same prefix more than a few times a minute, you're leaving money on the table.

Go read the caching docs. Add the parameter. Check your usage dashboard the next day.

Then ask yourself what other features are sitting in the documentation that you haven't looked at yet.

— Bill John Tran

© 2026 Bill John Tran. All rights reserved.
