Token Budgeting: The Engineering Skill Nobody Talks About

Most developers think token optimization means shorter prompts. In 2026, the biggest costs come from bloated chat history, unused tool schemas, cache misses, and overusing expensive models. This guide covers five high-impact levers, with pricing, cost breakdowns, and a case study that cut a Claude bill from $2,400/month to $680.

1. The Misconception That's Costing You Money

Ask a developer how to reduce their LLM API costs and they'll say: "write shorter prompts." Remove adjectives. Cut explanations. Use fewer examples. Trim the system prompt.

This advice is not wrong. It's just the lowest-leverage version of the right idea, and it misses the places where 80% of the money actually goes.

Token optimization is a context engineering problem. The question is not "how many words is my prompt?" The questions are:

What is in your context that does not need to be there?
How is your context structured — and does that structure let the cache work?
How does your context grow across a conversation — and is that growth controlled?
Is the model you're paying for the right model for this specific task?
Does this request need to be real-time at all, or could it run in a batch?

Answer those five questions and you can often reduce your bill by 60–80%. Shorten your prompts and you'll typically save closer to 5%.

This guide covers the five levers, in order of impact, with real 2026 numbers.

2. The 2026 Pricing Landscape: What You're Actually Paying

Before optimizing, know what you're optimizing. Here is the current pricing landscape as of June 2026:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Cached Input	Context
Claude Opus 4.8	$5.00	$25.00	$0.50	1M
Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M
Claude Haiku 4.5	$0.80	$4.00	$0.08	200K
GPT-5.5	$5.00	$30.00	$0.50	1M
GPT-4.1	$2.00	$8.00	$1.00	1M
GPT-4.1 Nano	$0.10	$0.40	$0.05	1M
Gemini 3.1 Pro	$2.00	$12.00	$0.20/hr storage	1M
DeepSeek V4 Flash	$0.14	$0.28	$0.003	1M

Two things jump out immediately from this table.

First: output tokens cost several times more than input tokens. This is the most important single fact in token economics. It is not symmetrical. In the table above, the output-to-input ratio runs from 2× (DeepSeek V4 Flash) to 6× (GPT-5.5 and Gemini 3.1 Pro), with all three Claude models at 5×. The practical implication: a verbose model response is far more expensive than a verbose prompt. Controlling output length matters more than controlling input length.

Second: the spread between models is 89× on output. DeepSeek V4 Flash at $0.28/1M output vs Claude Opus 4.8 at $25.00/1M. On input, GPT-4.1 Nano at $0.10/1M vs Claude Opus 4.8 at $5.00/1M is a 50× gap. Every task routed to Opus when Haiku would do is burning 6× the necessary budget. Every task routed to a frontier model when a mid-tier would do is burning 3–7× the budget.

xychart-beta
    title "Output Token Price per 1M (USD) — June 2026"
    x-axis ["DeepSeek V4 Flash", "GPT-4.1 Nano", "Haiku 4.5", "GPT-4.1", "Gemini 3.1 Pro", "Sonnet 4.6", "Opus 4.8", "GPT-5.5"]
    y-axis "$/1M output tokens" 0 --> 32
    bar [0.28, 0.40, 4.00, 8.00, 12.00, 15.00, 25.00, 30.00]

And the savings from caching:

Anthropic: 90% off cached input (cache_control: { type: 'ephemeral' }) → $3.00/1M becomes $0.30/1M on Sonnet 4.6.
OpenAI: 50% off cached input on GPT-4.1 (automatic, no code changes) → $2.00/1M becomes $1.00/1M. The table lists a deeper cached discount for GPT-5.5.
Batch API (both providers): 50% off all tokens for async requests.

These multipliers stack. A cached batch request at Anthropic gets both the 90% cache discount and the 50% batch discount on the cached portion.
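As a rough check on how the discounts combine, the sketch below computes an effective input price. The helper name `effectiveInputPrice` and its parameters are illustrative, not part of any SDK, and it assumes the cache and batch discounts simply multiply, as described above.

```typescript
// Hypothetical helper: effective per-1M input price after discounts.
// Assumption: cache and batch discounts multiply together.
interface Discounts {
  cacheDiscount: number; // fraction off cached input, e.g. 0.9 at Anthropic
  batchDiscount: number; // fraction off batch requests, e.g. 0.5
}

function effectiveInputPrice(
  basePerMillion: number,
  cachedShare: number, // fraction of input tokens read from cache (0..1)
  useBatch: boolean,
  d: Discounts = { cacheDiscount: 0.9, batchDiscount: 0.5 },
): number {
  const cachedPrice = basePerMillion * (1 - d.cacheDiscount);
  const blended = cachedShare * cachedPrice + (1 - cachedShare) * basePerMillion;
  return useBatch ? blended * (1 - d.batchDiscount) : blended;
}

// Sonnet 4.6 input at $3.00/1M, fully cached and batched
console.log(effectiveInputPrice(3.0, 1, true).toFixed(2)); // "0.15"
```

A `cachedShare` below 1 is the realistic case: only the static prefix is cached, so the blended price sits between the two extremes.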

3. Where Your Tokens Actually Go

Most teams optimize the wrong things because they haven't diagnosed where their tokens actually go. Here are the real cost drivers, in rough order of impact:

pie title "Typical Token Distribution — Multi-Turn Agent Session"
    "Conversation history (replayed each turn)" : 42
    "System prompt (replayed each turn)" : 18
    "Tool schemas (loaded but unused)" : 15
    "RAG retrieved context" : 14
    "Actual user message" : 7
    "Model output" : 4

Conversation history is usually the largest line item. Every turn replays the entire prior conversation — every message, every response, every tool result. A session that runs 20 turns is not paying for 20 messages; it's paying for the sum of 1 + 2 + 3 + ... + 20 = 210 message-equivalents. We'll explore the math below.

Tool schemas are the hidden tax nobody talks about. An agent loaded with a full tool library — search, database queries, email, calendar, file system, code execution — can carry 55,000 to 134,000 tokens of tool schema overhead on every single request, regardless of which tools it actually calls.

RAG pipelines routinely over-retrieve. Teams pass 10,000+ tokens of retrieved context into a prompt by default because the context window can hold it — not because it improves results. In many cases, 2–3 focused chunks outperform 10 broad ones, both in quality and cost.

The system prompt baseline. Even an empty session starts with overhead. An empty Claude message triggers approximately 5,000 tokens of baseline processing (system instructions, safety scaffolding). Attempts to reduce the user-facing prompt by a few tokens have minimal impact against this fixed baseline. The only ways to reduce it: cache it, or solve more per session so the fixed cost is amortized across more value.

4. The Quadratic Conversation Problem

This is the mechanism most teams don't realize is happening until the bill arrives.

LLMs have no persistent memory between API calls. Every time you send a new message, the model receives the entire conversation history — every prior message, every prior response — as raw input and processes it from scratch. The practical consequence: token usage grows with conversation length, and it grows faster than linearly.

Consider a conversation with n turns, where each turn averages t tokens (user message + assistant response):

Turn 1:  1t tokens processed
Turn 2:  2t tokens processed  (history + new)
Turn 3:  3t tokens processed
...
Turn n:  nt tokens processed

Total:   t × (1 + 2 + 3 + ... + n) = t × n(n+1)/2

xychart-beta
    title "Cumulative Token Cost — 200-Token Average Turn (input only)"
    x-axis ["Turn 5", "Turn 10", "Turn 20", "Turn 30", "Turn 50"]
    y-axis "Cumulative tokens (thousands)" 0 --> 260
    bar [3, 11, 42, 93, 255]

At turn 50, you have paid for 255,000 tokens of input to process what is nominally a 50-turn conversation with 200 tokens per turn. The naive estimate would be 50 × 200 = 10,000 tokens. The actual cost is 25.5× that.

This is not a rounding error. It is the dominant cost driver for any application with extended conversations — support agents, coding assistants, research tools, anything that maintains a thread. And it compounds with model cost: at Sonnet 4.6 rates ($3.00/1M), a 50-turn conversation costs $0.77 in input tokens alone. A 100-turn conversation in the same session costs $3.03. The quadratic growth is real and measurable.
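The growth formula is easy to turn into a small estimator. This is a minimal sketch under the same simplification used above (every turn adds the same number of tokens and the full history is resent each time); the function names are illustrative.

```typescript
// Total input tokens processed over an n-turn conversation in which each
// turn adds t tokens and the whole history is replayed: t * n(n+1)/2.
function cumulativeInputTokens(turns: number, tokensPerTurn: number): number {
  return (tokensPerTurn * turns * (turns + 1)) / 2;
}

// Convert a token count to USD at a given per-1M price.
function inputCostUsd(tokens: number, pricePerMillion: number): number {
  return (tokens / 1_000_000) * pricePerMillion;
}

for (const n of [5, 10, 20, 30, 50, 100]) {
  const tokens = cumulativeInputTokens(n, 200);
  console.log(`turns=${n} tokens=${tokens} cost=$${inputCostUsd(tokens, 3.0).toFixed(3)}`);
}
```

Running it against your own average turn size gives a quick sense of where a session length cap or a compaction trigger starts to pay for itself.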

The fix is context pruning and compaction — covered in Lever 2.

5. Lever 1: Prompt Caching — 90% Off Your Static Content

Prompt caching is the single highest-impact optimization available in 2026 for applications with consistent system prompts, tool definitions, or reference documents. At Anthropic, cached input tokens cost 10% of the normal rate — a 90% discount. At OpenAI, cached input is 50% off automatically.

The mechanism: when Claude processes a prompt, it generates a KV (key-value) cache — a computed representation of the tokens already processed. Prompt caching stores that computed state so it does not need to be recalculated on the next request. If the next request uses the same prefix, Claude retrieves the cached computation instead of running the full calculation again.

How to Use It (Anthropic)

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Your static system prompt — ideally 1,024+ tokens to qualify for caching
const SYSTEM_PROMPT = `You are an expert customer support agent for Acme Corp.

## Product Catalogue
[... 2,000 words of product documentation ...]

## Policies
[... return policy, shipping policy, warranty terms ...]

## Response Guidelines
[... tone, escalation rules, formatting requirements ...]
`;

// Static tool definitions — also cacheable
const TOOLS = [
  {
    name: 'lookup_order',
    description: 'Look up an order by order ID',
    input_schema: { /* schema */ },
  },
  {
    name: 'check_inventory',
    description: 'Check product inventory levels',
    input_schema: { /* schema */ },
  },
  // ... more tools
];

async function chat(conversationHistory: Anthropic.MessageParam[], userMessage: string) {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: SYSTEM_PROMPT,
        // ← This block is cached. On cache hit: 90% cheaper.
        cache_control: { type: 'ephemeral' },
      },
    ],
    tools: TOOLS.map((tool, index) =>
      // Cache the last tool definition to cache the whole tool block
      index === TOOLS.length - 1
        ? { ...tool, cache_control: { type: 'ephemeral' } }
        : tool
    ),
    messages: [
      ...conversationHistory,
      { role: 'user', content: userMessage },
    ],
  });

  // Inspect cache performance in the response
  const usage = response.usage;
  console.log({
    inputTokens:         usage.input_tokens,
    cacheReadTokens:     usage.cache_read_input_tokens,    // ← paid 10%
    cacheCreationTokens: usage.cache_creation_input_tokens, // ← paid 125%
    outputTokens:        usage.output_tokens,
  });

  return response;
}

The Cache Rules You Must Know

Minimum token threshold: Claude requires a minimum prompt length before it will cache a block, typically 1,024 tokens (some models set a higher minimum, so check the limit for the model you use). If your system prompt is 800 tokens, caching will not activate. This is a nudge toward being thorough with static context: consolidate instructions, tool background, and reference material into a single cacheable block.

Order is everything: The cache is prefix-based. Cached content must come before dynamic content in your prompt. Anthropic assembles the prefix as tool definitions first, then the system prompt, then messages, so keep reference documents in the system blocks, conversation history after them, and the current user message last of all. Any dynamic content before your static content breaks the cache.

Default TTL is 5 minutes. For workflows with longer gaps between turns — a human approval step, an overnight batch — you need the 1-hour cache extension (available at additional cost but cheaper than cache misses across long gaps).

Up to 4 cache breakpoints per request. You can cache the system prompt, the tool definitions, a reference document, and one more block independently. Use them strategically on your four largest static blocks.

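Cache writes cost 25% more on the first request (you're paying for Anthropic to store it). Each cache hit then costs 10% of the normal rate, so the write premium is recovered on the first hit: two requests cost 1.25 + 0.10 = 1.35× one uncached pass instead of 2×. After that, every cache hit pays dividends.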
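Using the 1.25× write and 0.1× read multipliers quoted in this section, a short sketch can locate the break-even point. It assumes the cache stays warm between requests (no TTL expiry) and measures cost in units of one uncached pass over the static prefix; the function names are illustrative.

```typescript
const WRITE_MULTIPLIER = 1.25; // first request writes the cache
const READ_MULTIPLIER = 0.1;   // later requests read it

// Cost of sending the same prefix `requests` times with caching enabled.
function cachedCost(requests: number): number {
  if (requests <= 0) return 0;
  return WRITE_MULTIPLIER + (requests - 1) * READ_MULTIPLIER;
}

// Cost of sending the same prefix `requests` times without caching.
function uncachedCost(requests: number): number {
  return Math.max(0, requests);
}

// Smallest request count at which caching is strictly cheaper, or -1.
function breakEvenRequests(maxRequests = 100): number {
  for (let n = 1; n <= maxRequests; n++) {
    if (cachedCost(n) < uncachedCost(n)) return n;
  }
  return -1;
}

console.log(breakEvenRequests()); // 2
```

If requests arrive further apart than the TTL, every request becomes a write and caching costs 25% more than doing nothing, which is the case the 1-hour option is meant to cover.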

OpenAI Automatic Caching

OpenAI handles this automatically — no code changes needed:

import OpenAI from 'openai';

const openai = new OpenAI();

// OpenAI: caching is automatic. Same prefix = cache hit. 50% discount.
// No cache_control markers required.
// Check usage.prompt_tokens_details.cached_tokens in the response to verify.
const response = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages: [
    { role: 'system', content: LARGE_STATIC_SYSTEM_PROMPT }, // auto-cached
    ...conversationHistory,
    { role: 'user', content: userMessage },
  ],
});

console.log(response.usage?.prompt_tokens_details?.cached_tokens);

The trade-off: OpenAI's caching gives 50% off (vs Anthropic's 90%) but requires zero implementation effort.

6. Lever 2: Context Pruning and Compaction

Caching helps with static content. Context pruning handles the dynamic problem — the quadratic growth of conversation history.

There are three distinct tools here, used at different stages of a session:

flowchart TD
    A["New Turn"] --> B{"History\nsize?"}
    B -->|"Small\n< 20% of window"| C["Pass through\nno changes"]
    B -->|"Medium\n20–70% of window"| D["Context Editing\nPrune stale tool results\nand completed subtasks"]
    B -->|"Large\n> 70% of window"| E{"Has old turns\nwith full detail?"}
    E -->|"Yes"| F["Summarize old turns\ninto static cached block"]
    E -->|"Approaching limit"| G["Context Compaction\nbeta: compact-2026-01-12\nServer-side condensation"]
    D --> H["Continue"]
    F --> H
    G --> H
    C --> H
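The decision flow above can be written as a small pure function. This is a sketch only: the strategy names are made up for illustration, and the 20% and 70% thresholds are taken directly from the diagram rather than from any provider recommendation.

```typescript
type Strategy = 'pass_through' | 'prune' | 'summarize' | 'compact';

// Pick a context-management strategy from the share of the window in use.
function chooseStrategy(
  historyTokens: number,
  contextWindow: number,
  hasDetailedOldTurns: boolean,
): Strategy {
  const share = historyTokens / contextWindow;
  if (share < 0.2) return 'pass_through';   // small: leave history alone
  if (share <= 0.7) return 'prune';         // medium: drop stale tool results
  // large: summarize if old turns still carry detail, otherwise compact
  return hasDetailedOldTurns ? 'summarize' : 'compact';
}

console.log(chooseStrategy(100_000, 200_000, true)); // "prune"
```

Keeping the decision separate from the pruning and summarizing code makes the thresholds easy to tune against real traffic.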

Tool 1: Context Editing (Pruning)

Over a long agent run, the transcript fills with content that no longer informs the next step: tool results from completed sub-tasks, intermediate reasoning from earlier nodes, exploratory searches that were dead ends. Remove them:

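// Message shape assumed by this example (timestamp and isPinned are illustrative fields)
interface Message {
  role:      'system' | 'user' | 'assistant' | 'tool';
  content:   string;
  timestamp: number;    // epoch milliseconds
  isPinned?: boolean;   // e.g. the original user goal
}

const STALE_TOOL_RESULT_MS = 15 * 60 * 1000;

function pruneConversationHistory(
  messages: Message[],
  keepLastN: number = 10
): Message[] {
  if (messages.length <= keepLastN) return messages;

  const isKept = (m: Message) => m.role === 'system' || m.isPinned === true;
  const older  = messages.slice(0, -keepLastN);
  const recent = messages.slice(-keepLastN);

  // Always keep: system messages, the original user goal, recent messages.
  // Pinned messages come from the older slice only, so nothing is duplicated.
  const pinned = older.filter(isKept);
  const stale  = older.filter(m => !isKept(m));

  // Replace stale tool results with a placeholder (they've been acted on;
  // the action is in recent history)
  const cutoff = Date.now() - STALE_TOOL_RESULT_MS;
  const prunedStale = stale.map(msg =>
    msg.role === 'tool' && msg.timestamp < cutoff
      ? { ...msg, content: '[tool result pruned, see summary]' }
      : msg
  );

  return [...pinned, ...prunedStale, ...recent];
}

// In your agent loop:
state.messages = pruneConversationHistory(state.messages, 12);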

Tool 2: Summarize Into a Cached Block

When the conversation has accumulated substantial history, summarize older turns into a static block — then cache it:

async function compressOldHistory(
  messages: Message[],
  threshold: number = 20
): Promise<{ summary: string; recentMessages: Message[] }> {
  if (messages.length < threshold) {
    return { summary: '', recentMessages: messages };
  }

  const old    = messages.slice(0, -10);
  const recent = messages.slice(-10);

  // Summarize the old turns into a compact representation
  const summaryResponse = await client.messages.create({
    model:      'claude-haiku-4-5',   // ← use a cheap model for summarization
    max_tokens: 500,
    messages: [
      {
        role:    'user',
        content: `Summarize the following conversation history into a compact factual summary.
Include: decisions made, key findings, completed actions, and current state.
Exclude: detailed reasoning, intermediate steps, failed attempts.

<history>
${old.map(m => `${m.role}: ${m.content}`).join('\n')}
</history>`,
      },
    ],
  });

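  // content is a union of block types; read text only from a text block
  const first = summaryResponse.content[0];

  return {
    summary:        first?.type === 'text' ? first.text : '',
    recentMessages: recent,
  };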
}

// Build the next request with summary cached + recent messages dynamic
const { summary, recentMessages } = await compressOldHistory(state.messages);

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  system: [
    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
    // The summary becomes a second cached block — static, stable, cheap
    ...(summary ? [{
      type:          'text' as const,
      text:          `## Conversation Summary\n${summary}`,
      cache_control: { type: 'ephemeral' as const },
    }] : []),
  ],
  messages: recentMessages,
  max_tokens: 1024,
});

Tool 3: Context Compaction (Beta)

Anthropic's server-side context compaction (compact-2026-01-12) condenses earlier history into a summary automatically when a conversation approaches the context window limit. It is a beta feature but available on all current Claude models:

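// `betas` is accepted on the SDK's beta namespace, not on client.messages
const response = await client.beta.messages.create({
  model:      'claude-sonnet-4-6',
  max_tokens: 1024,
  messages:   fullHistory,
  betas:      ['compact-2026-01-12'],   // enable server-side compaction
  // The beta may also require a context-management setting in the request;
  // check the current compaction docs for the exact shape.
});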

Warning: Over-compressing context degrades answers and triggers retries that cost more than the context you saved. The goal is not the smallest possible context but the smallest sufficient one. Validate quality on representative traffic after any aggressive context change before shipping.

7. Lever 3: Model Routing — Pay for Reasoning Only When You Need It

This is the highest-leverage structural change most teams never make. The 89× price spread between DeepSeek V4 Flash ($0.28/1M output) and Claude Opus 4.8 ($25/1M output) is budget left on the table by every team that uses one model for everything.

The principle: match model capability to task complexity. Not every task needs the frontier. A routing classifier costs a fraction of a cent and decides which model handles the work.

// src/router.ts — simple but effective model router
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

type TaskComplexity = 'simple' | 'medium' | 'complex';

// Light classifier — uses the cheapest model to categorize the task
async function classifyTask(userMessage: string): Promise<TaskComplexity> {
  const response = await client.messages.create({
    model:      'claude-haiku-4-5',  // $0.80/1M — cheapest Anthropic model
    max_tokens: 10,
    messages: [{
      role:    'user',
      content: `Classify this task. Reply with ONLY one word: simple, medium, or complex.

simple: extraction, formatting, classification, yes/no questions, short factual lookups
medium: summarization, translation, structured analysis, multi-step but predictable tasks
complex: deep reasoning, code architecture, multi-turn problem solving, research synthesis

Task: "${userMessage}"`,
    }],
  });

  const classification = (response.content[0] as { text: string }).text
    .trim()
    .toLowerCase() as TaskComplexity;

  return ['simple', 'medium', 'complex'].includes(classification)
    ? classification
    : 'medium';  // safe default
}

// Route to the right model based on complexity
const MODEL_MAP: Record<TaskComplexity, string> = {
  simple:  'claude-haiku-4-5',   // $0.80/$4.00 per 1M — fast, cheap
  medium:  'claude-sonnet-4-6',  // $3.00/$15.00 per 1M — balanced
  complex: 'claude-opus-4-8',    // $5.00/$25.00 per 1M — frontier reasoning
};

export async function routedCompletion(
  userMessage: string,
  systemPrompt: string,
  conversationHistory: Anthropic.MessageParam[]
) {
  const complexity = await classifyTask(userMessage);
  const model      = MODEL_MAP[complexity];

  console.log(`Routed to: ${model} (${complexity})`);

  const response = await client.messages.create({
    model,
    max_tokens: complexity === 'simple' ? 256 : complexity === 'medium' ? 1024 : 4096,
    system:     [{ type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } }],
    messages:   [...conversationHistory, { role: 'user', content: userMessage }],
  });

  return { response, model, complexity };
}

Real-world routing distribution for a typical support application:

~60% of requests are simple (classification, lookup, formatting)
~30% are medium (summarization, structured responses)
~10% are complex (edge cases, escalations, deep problem solving)

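Routing to Haiku for 60% of requests and Sonnet for 30%, instead of Opus for 100%, reduces cost by roughly 60% at the prices above when requests in each tier use similar token counts (a blended input price of $1.88/1M vs $5.00/1M), with minimal quality impact on the simple and medium tasks.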
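The routing savings can be estimated from the traffic split and the price table. The sketch below is a simplification: it assumes every request uses the same number of tokens regardless of tier and ignores the small cost of the classifier call, so the real figure depends heavily on how tokens are distributed across tiers. The `Tier` type and function names are illustrative.

```typescript
interface Tier {
  share: number;       // fraction of requests routed to this tier
  inputPrice: number;  // USD per 1M input tokens
  outputPrice: number; // USD per 1M output tokens
}

const ROUTED: Tier[] = [
  { share: 0.6, inputPrice: 0.8, outputPrice: 4.0 },  // Haiku 4.5
  { share: 0.3, inputPrice: 3.0, outputPrice: 15.0 }, // Sonnet 4.6
  { share: 0.1, inputPrice: 5.0, outputPrice: 25.0 }, // Opus 4.8
];

// Traffic-weighted average price across tiers.
function blendedPrice(tiers: Tier[], key: 'inputPrice' | 'outputPrice'): number {
  return tiers.reduce((sum, t) => sum + t.share * t[key], 0);
}

// Fractional saving relative to sending everything to one baseline price.
function savingsVs(baselinePrice: number, blended: number): number {
  return 1 - blended / baselinePrice;
}

const inputSavings = savingsVs(5.0, blendedPrice(ROUTED, 'inputPrice'));
const outputSavings = savingsVs(25.0, blendedPrice(ROUTED, 'outputPrice'));
console.log(`${(inputSavings * 100).toFixed(1)}% / ${(outputSavings * 100).toFixed(1)}%`);
```

Replacing the request shares with token-weighted shares from your own logs gives a more realistic estimate.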

For agent teams: subagent model selection matters even more. Agent sessions use approximately 7× more tokens than standard sessions, so model choice compounds dramatically.

8. Lever 4: Batch API — 50% Off Everything Async

Not every LLM request needs to be real-time. Many common workloads — content generation, classification, embedding, extraction, data enrichment — can run asynchronously. Both Anthropic and OpenAI offer batch APIs that cost 50% less than real-time endpoints, with no quality difference.

// src/batch.ts — Anthropic Message Batches API
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

interface BatchItem {
  id:      string;
  content: string;
  context: Record<string, string>;
}

// Create a batch — send thousands of requests at once, 50% cheaper
async function processBatch(items: BatchItem[]) {
  const batch = await client.beta.messages.batches.create({
    requests: items.map(item => ({
      custom_id: item.id,
      params: {
        model:      'claude-sonnet-4-6',
        max_tokens: 512,
        system: [{
          type:          'text' as const,
          text:          CLASSIFICATION_SYSTEM_PROMPT,
          cache_control: { type: 'ephemeral' as const },
        }],
        messages: [{
          role:    'user' as const,
          content: item.content,
        }],
      },
    })),
  });

  console.log(`Batch created: ${batch.id}`);
  console.log(`Status: ${batch.processing_status}`);

  return batch.id;
}

// Poll for results (or use a webhook)
async function getBatchResults(batchId: string) {
  // Wait for completion. The status is 'in_progress', 'canceling', or 'ended';
  // results are only available once it reaches 'ended'.
  let batch = await client.beta.messages.batches.retrieve(batchId);

  while (batch.processing_status !== 'ended') {
    await new Promise(resolve => setTimeout(resolve, 5000));
    batch = await client.beta.messages.batches.retrieve(batchId);
  }

  // Stream the results
  const results = [];
  for await (const result of await client.beta.messages.batches.results(batchId)) {
    if (result.result.type === 'succeeded') {
      results.push({
        id:      result.custom_id,
        content: result.result.message.content[0],
        usage:   result.result.message.usage,
      });
    }
  }

  return results;
}

Best workloads for batch processing:

Workload	Real-time needed?	Batch saves
Content moderation	Sometimes	50%
Document classification	Rarely	50%
Bulk summarization	Never	50%
Data extraction from documents	Never	50%
Embedding generation	Never	~free via alternatives
Email drafts (generated nightly)	Never	50%
Report generation	Never	50%
A/B prompt testing	Never	50%

Stack with caching: batch + cached input on a repeated system prompt → a request that would cost $3.00/1M normally costs $0.30/1M (cache read) × 0.50 (batch) = $0.15/1M — a 95% reduction.

9. Lever 5: Surgical RAG — Stop Dumping Documents

Retrieval-Augmented Generation is the most common source of context bloat in production AI applications. Teams retrieve 10,000+ tokens of context by default — because the context window can hold it, not because it improves results.

The research is clear: the context window does not drive quality. The placement and precision of context do. More context, placed poorly, can actively hurt performance due to the "lost in the middle" effect: models perform worse when critical information is buried in the middle of a long context window.

// src/rag.ts — surgical retrieval, not document dumping

interface RetrievedChunk {
  content:   string;
  score:     number;
  tokenCount: number;
  source:    string;
}

// BAD: pass everything the retriever returns
async function naiveRag(query: string): Promise<string> {
  const chunks = await vectorDB.search(query, { limit: 10 });
  return chunks.map(c => c.content).join('\n\n');
  // Typically: 8,000–15,000 tokens. Costs $0.024–$0.045 per call on Sonnet 4.6.
}

// GOOD: targeted retrieval with a token budget
async function surgicalRag(
  query:        string,
  tokenBudget:  number = 2000,   // hard cap on context retrieved
  minScore:     number = 0.75,   // filter out weak matches
): Promise<string> {
  // 1. Retrieve more candidates than you'll use
  const candidates = await vectorDB.search(query, { limit: 20 });

  // 2. Filter by relevance threshold — weak matches add noise, not signal
  const relevant = candidates.filter(c => c.score >= minScore);

  // 3. Rerank: MMR (Maximal Marginal Relevance) for diversity + relevance
  const reranked = maximalMarginalRelevance(relevant, query, 5);

  // 4. Fill up to the token budget, most relevant first
  let usedTokens = 0;
  const selected: RetrievedChunk[] = [];

  for (const chunk of reranked) {
    if (usedTokens + chunk.tokenCount > tokenBudget) break;
    selected.push(chunk);
    usedTokens += chunk.tokenCount;
  }

  // 5. Format with source attribution (helps the model use context correctly)
  return selected
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.content}`)
    .join('\n\n---\n\n');
  // Result: ~1,500–2,000 tokens. Costs $0.005–$0.006. Same or better quality.
}

Common RAG token traps:

// ❌ Retrieving entire documents
const context = await fetchDocument(docId);  // might be 50,000 tokens

// ✅ Retrieving targeted chunks with a hard token cap
const context = await surgicalRag(query, 1500);  // tokenBudget

// ❌ Passing raw retrieval with no filtering
const context = chunks.map(c => c.content).join('\n\n');

// ✅ Filtering by score, deduplicating, trimming to budget
const context = await surgicalRag(query, 2000, 0.75);  // tokenBudget, minScore

// ❌ Re-retrieving the same documents on every turn
// ✅ Cache stable reference documents (product catalogue, policies, FAQs)
//    using prompt caching — retrieve once, cache for the session

The savings from surgical RAG are significant. Limiting retrieval to 2–3 shorter chunks instead of 8–10 full documents can cut input tokens by more than half, with no loss in — and often an improvement in — answer precision.
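The reranking step above calls a `maximalMarginalRelevance` helper that is not shown. Below is one possible sketch of the technique, not the author's implementation. It assumes each candidate carries its embedding vector and that the query embedding is already available, so its signature differs from the call in `surgicalRag`; `EmbeddedChunk`, `cosine`, and `mmr` are illustrative names.

```typescript
interface EmbeddedChunk {
  id: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy MMR: repeatedly pick the candidate that is most relevant to the
// query and least similar to what has already been selected.
function mmr<T extends EmbeddedChunk>(
  candidates: T[],
  queryEmbedding: number[],
  k: number,
  lambda = 0.7, // 1 = pure relevance, 0 = pure diversity
): T[] {
  const remaining = [...candidates];
  const selected: T[] = [];
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    remaining.forEach((c, i) => {
      const relevance = cosine(c.embedding, queryEmbedding);
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map(s => cosine(c.embedding, s.embedding)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    });
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected;
}

const picked = mmr(
  [
    { id: 'a', embedding: [1, 0, 0] },
    { id: 'b', embedding: [1, 0.05, 0] }, // near-duplicate of a
    { id: 'c', embedding: [0, 1, 0] },
  ],
  [1, 1, 0],
  2,
);
console.log(picked.map(p => p.id)); // ["b", "c"]
```

In the example, `a` and `b` are near-duplicates, so after `b` is chosen the second slot goes to `c` even though `a` is equally relevant to the query.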

10. Cache Structure: The Ordering That Makes or Breaks Cache Hits

Prompt caching is prefix-based: the cache must match from the beginning of the prompt. A single dynamic element placed before your static content breaks every cache hit.

// ❌ WRONG ORDER — cache never hits because timestamp is before static content
const messages = [
  { role: 'user', content: `Current time: ${new Date().toISOString()}` }, // changes every call
  { role: 'assistant', content: LARGE_STATIC_CONTEXT },  // cache broken
];

// ✅ CORRECT ORDER — static content first, dynamic content last
const system = [
  {
    type:          'text',
    text:          SYSTEM_PROMPT,      // static — cache this
    cache_control: { type: 'ephemeral' },
  },
  {
    type:          'text',
    text:          TOOL_CONTEXT,       // static — cache this too
    cache_control: { type: 'ephemeral' },
  },
  {
    type:          'text',
    text:          REFERENCE_DOCS,     // static — third cache breakpoint
    cache_control: { type: 'ephemeral' },
  },
];

const messages = [
  // Dynamic: conversation history (grows with turns)
  ...conversationHistory,
  // Dynamic: current user message (changes every turn)
  { role: 'user', content: `[${new Date().toISOString()}] ${userMessage}` },
  // ↑ Put timestamps in the user message, not in the system block
];

The correct structure — every time:

1. Tool definitions                 ← cache_control: ephemeral
2. System prompt                    ← cache_control: ephemeral
3. Reference documents              ← cache_control: ephemeral
4. Summarized conversation history  ← cache_control: ephemeral (if summarized; this is the 4th and last breakpoint)
────────────────────────────────── cache boundary ──
5. Recent conversation history      ← dynamic, no cache
6. Current user message             ← dynamic, no cache

Common ways cache hits break in practice:

Misplaced or missing cache_control markers (many third-party LLM clients don't implement this)
Conversation history restructured or compressed differently on each call (changing the prefix)
Dynamic data (timestamps, session IDs, personalization) injected into static blocks
Tool schemas sorted differently across requests (different order = different prefix = cache miss)
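One way to guard against the last failure mode is to normalize tool definitions before every request. This is a sketch under the assumption that tools are plain JSON-like objects; `ToolDef`, `canonicalize`, and `stableTools` are illustrative names, not SDK functions.

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
}

// Recursively sort object keys so serialization is deterministic.
// Array order is preserved because it can be meaningful.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    return Object.keys(obj).sort().reduce<Record<string, unknown>>((acc, k) => {
      acc[k] = canonicalize(obj[k]);
      return acc;
    }, {});
  }
  return value;
}

// Sort tools by name and normalize key order inside each definition.
function stableTools(tools: ToolDef[]): ToolDef[] {
  return [...tools]
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .map(t => canonicalize(t) as ToolDef);
}

const lookup: ToolDef = {
  name: 'lookup_order',
  description: 'Look up an order by order ID',
  input_schema: { type: 'object', properties: {} },
};
const inventory: ToolDef = {
  input_schema: { properties: {}, type: 'object' },
  description: 'Check product inventory levels',
  name: 'check_inventory',
};

// Same serialized prefix regardless of input order or key order
console.log(
  JSON.stringify(stableTools([lookup, inventory])) ===
  JSON.stringify(stableTools([inventory, lookup])),
); // true
```

Logging a hash of the serialized tool block alongside `cache_read_input_tokens` makes it easy to confirm that a drop in cache hits lines up with a change in the prefix.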

11. Measuring First: You Can't Optimize What You Can't See

Before touching a single line of optimization code, instrument what you actually have.

Token Counting Before Sending

import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();

// Count tokens in a request before you send it — no inference cost
const tokenCount = await client.messages.countTokens({
  model:  'claude-sonnet-4-6',
  system: SYSTEM_PROMPT,
  tools:  TOOLS,
  messages: conversationHistory,
});

console.log(`This request will use ${tokenCount.input_tokens} input tokens`);
// At $3/1M input:
console.log(`Estimated input cost: $${(tokenCount.input_tokens * 0.000003).toFixed(6)} per call`);

Reading Cache Performance From Responses

const response = await client.messages.create({ ... });

// The cache fields can be null when caching is not in play, so default them to 0
const usage       = response.usage;
const input       = usage.input_tokens;
const cacheRead   = usage.cache_read_input_tokens ?? 0;      // paid 10% (cache hit)
const cacheCreate = usage.cache_creation_input_tokens ?? 0;  // paid 125% (cache write, first time)
const totalInput  = input + cacheRead + cacheCreate;

const cacheHitRate = totalInput > 0 ? cacheRead / totalInput : 0;

const costWithoutCache = totalInput * 0.000003;  // Sonnet 4.6 input price

const actualCost = (input * 0.000003)
  + (cacheRead * 0.0000003)     // 10% of normal
  + (cacheCreate * 0.00000375); // 125% of normal

const savings = costWithoutCache - actualCost;

console.log({
  cacheHitRate:    `${(cacheHitRate * 100).toFixed(1)}%`,
  costSaved:       `$${savings.toFixed(6)}`,
  effectivePricePerToken: totalInput > 0
    ? `$${(actualCost / totalInput * 1_000_000).toFixed(2)}/1M`
    : 'n/a',
});

A Simple Token Budget Tracker

// src/token-tracker.ts: per-session budget tracking
import Anthropic from '@anthropic-ai/sdk';

class TokenBudget {
  private inputTokens      = 0;
  private outputTokens     = 0;
  private cacheReadTokens  = 0;  // tokens served from cache (hits)
  private cacheWriteTokens = 0;  // tokens written to cache (misses)

  // Prices for Claude Sonnet 4.6
  private readonly INPUT_PRICE       = 3.00 / 1_000_000;
  private readonly OUTPUT_PRICE      = 15.00 / 1_000_000;
  private readonly CACHE_READ_PRICE  = 0.30 / 1_000_000;  // 90% off
  private readonly CACHE_WRITE_PRICE = 3.75 / 1_000_000;  // 125% of input

  record(usage: Anthropic.Usage) {
    this.inputTokens      += usage.input_tokens;
    this.outputTokens     += usage.output_tokens;
    this.cacheReadTokens  += usage.cache_read_input_tokens ?? 0;
    this.cacheWriteTokens += usage.cache_creation_input_tokens ?? 0;
  }

  report() {
    const inputCost      = this.inputTokens      * this.INPUT_PRICE;
    const outputCost     = this.outputTokens     * this.OUTPUT_PRICE;
    const cacheReadCost  = this.cacheReadTokens  * this.CACHE_READ_PRICE;
    // Cache writes are billed too; leaving them out understates the total
    const cacheWriteCost = this.cacheWriteTokens * this.CACHE_WRITE_PRICE;
    const totalCost      = inputCost + outputCost + cacheReadCost + cacheWriteCost;
    const totalInput     = this.inputTokens + this.cacheReadTokens + this.cacheWriteTokens;

    return {
      tokens: {
        input:      this.inputTokens,
        output:     this.outputTokens,
        cached:     this.cacheReadTokens,
        cacheWrite: this.cacheWriteTokens,
      },
      cost: {
        input:      `$${inputCost.toFixed(4)}`,
        output:     `$${outputCost.toFixed(4)}`,
        cached:     `$${cacheReadCost.toFixed(4)}`,
        cacheWrite: `$${cacheWriteCost.toFixed(4)}`,
        total:      `$${totalCost.toFixed(4)}`,
      },
      // Guard against division by zero before any usage is recorded
      cacheHitRate: totalInput > 0
        ? `${(this.cacheReadTokens / totalInput * 100).toFixed(1)}%`
        : 'n/a',
    };
  }
}

The Monthly Cost Formula

Before optimizing, calculate your actual baseline cost:

Monthly Cost =
  (Daily Requests × Avg Input Tokens  × Input Price/1M  × 30) +
  (Daily Requests × Avg Output Tokens × Output Price/1M × 30)

Example: 10,000 daily requests, 3,000 avg input tokens, 500 avg output tokens, Claude Sonnet 4.6:

Input:  10,000 × 3,000 × ($3.00/1M)  × 30 = $2,700/month
Output: 10,000 × 500   × ($15.00/1M) × 30 = $2,250/month
Total:  $4,950/month

After caching (assume 80% cache hit rate on input):
Cached input:  10,000 × 2,400 × ($0.30/1M) × 30 = $216/month  ← 90% off
Fresh input:   10,000 ×   600 × ($3.00/1M)  × 30 = $540/month
Output (unchanged):                                 $2,250/month
Total:  $3,006/month  ← 39% reduction from caching alone
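The same arithmetic as a reusable function, for plugging in your own traffic numbers. This is a minimal sketch: the `Workload` shape and function name are illustrative, and like the worked example it ignores the 25% premium on cache writes.

```typescript
interface Workload {
  dailyRequests: number;
  avgInputTokens: number;
  avgOutputTokens: number;
  inputPricePerM: number;       // USD per 1M input tokens
  outputPricePerM: number;      // USD per 1M output tokens
  cachedInputPricePerM: number; // USD per 1M cached input tokens
  cacheHitRate: number;         // fraction of input tokens read from cache (0..1)
}

function monthlyCostUsd(w: Workload, days = 30): number {
  const inputTokens = w.dailyRequests * w.avgInputTokens * days;
  const outputTokens = w.dailyRequests * w.avgOutputTokens * days;
  const cached = inputTokens * w.cacheHitRate;
  const fresh = inputTokens - cached;
  return (
    (fresh / 1e6) * w.inputPricePerM +
    (cached / 1e6) * w.cachedInputPricePerM +
    (outputTokens / 1e6) * w.outputPricePerM
  );
}

const sonnet: Workload = {
  dailyRequests: 10_000,
  avgInputTokens: 3_000,
  avgOutputTokens: 500,
  inputPricePerM: 3.0,
  outputPricePerM: 15.0,
  cachedInputPricePerM: 0.3,
  cacheHitRate: 0,
};

console.log(monthlyCostUsd(sonnet).toFixed(0));                            // "4950"
console.log(monthlyCostUsd({ ...sonnet, cacheHitRate: 0.8 }).toFixed(0)); // "3006"
```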

12. Real Case Study: $2,400 to $680 in 3 Months

A 6-person engineering team logged 6 months of Claude Code usage across their development workflow. Month 1: $2,400 in API spend (approximately 240 million tokens at typical Sonnet rates). After systematically applying the optimizations in this guide across months 2 and 3 — prompt caching for static context, token budget enforcement, and model switching (routing simple tasks to Haiku) — month 3 spend was $680. A 72% reduction.

The breakdown of where the savings came from:

pie title "Source of 72% Cost Reduction"
    "Model routing (Haiku for simple tasks)" : 41
    "Prompt caching (system prompts, tool schemas)" : 33
    "Context pruning (removing stale history)" : 18
    "Output length budgeting (max_tokens)" : 8

The most surprising finding: the largest single source of savings was model routing — not prompt caching, which most developers assume is the big lever. This reflects a team that was routing almost everything to Sonnet (or higher) when a significant portion of requests were routine enough for Haiku.

The second surprise: output length budgeting (setting max_tokens deliberately low for simple tasks) accounted for 8% of the savings. Many teams leave max_tokens at 4,096 as a default, even for tasks where a 200-token answer is sufficient. On Claude models, output tokens cost 5× as much as input tokens, so every unnecessary output token is expensive.
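
To put the pie chart in dollar terms, the shares can be applied to the total monthly saving. The sketch below assumes the chart's percentages are shares of the dollars saved, which the chart title suggests but the case study does not state outright; `savingsByLever` is an illustrative helper name.

```typescript
// Figures from the case study; shares are the pie chart's percentages.
const MONTH_1_SPEND = 2400;
const MONTH_3_SPEND = 680;

const SAVINGS_SHARES: Record<string, number> = {
  'Model routing': 41,
  'Prompt caching': 33,
  'Context pruning': 18,
  'Output length budgeting': 8,
};

// Split a total dollar saving across levers in proportion to their shares.
function savingsByLever(
  totalSaved: number,
  shares: Record<string, number>,
): Record<string, number> {
  const totalShare = Object.values(shares).reduce((sum, s) => sum + s, 0);
  const result: Record<string, number> = {};
  for (const [lever, share] of Object.entries(shares)) {
    result[lever] = (totalSaved * share) / totalShare;
  }
  return result;
}

const totalSaved   = MONTH_1_SPEND - MONTH_3_SPEND;        // 1720
const reductionPct = (totalSaved / MONTH_1_SPEND) * 100;   // about 71.7, reported as 72%
const perLever     = savingsByLever(totalSaved, SAVINGS_SHARES);

for (const [lever, dollars] of Object.entries(perLever)) {
  console.log(`${lever}: about $${dollars.toFixed(0)}/month`);
}
```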

13. The Anti-Patterns: What Not to Do

Anti-pattern 1: Optimizing prompt length before diagnosing cost drivers

Spending hours trimming 200 tokens from a system prompt while a 20,000-token conversation history is replayed on every turn, so cumulative input cost grows quadratically. Measure first.

Anti-pattern 2: Using max_tokens: 4096 as a default everywhere

Output tokens cost roughly 4–6× more than input on most models in the pricing table (5× on the Claude models). Set max_tokens deliberately:

// Simple classification: 10 tokens is enough
max_tokens: 10

// Factual lookup: 256 tokens
max_tokens: 256

// Structured analysis: 1024 tokens
max_tokens: 1024

// Full document generation: 4096
max_tokens: 4096
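
One way to keep these caps in a single place is a lookup table keyed by task type, paired with a truncation check. This is a sketch under assumptions: the task names and helper functions are illustrative, and `ResponseLike` models only the `stop_reason` field of a Messages API response. Because max_tokens is a hard cap that cuts a response off rather than making it shorter, it is worth watching for `stop_reason === 'max_tokens'` and raising the cap where truncation is frequent.

```typescript
type TaskType = 'classification' | 'lookup' | 'analysis' | 'generation';

// Caps mirror the values listed above; tune them against real traffic.
const MAX_TOKENS_BY_TASK: Record<TaskType, number> = {
  classification: 10,
  lookup:         256,
  analysis:       1024,
  generation:     4096,
};

function maxTokensFor(task: TaskType): number {
  return MAX_TOKENS_BY_TASK[task];
}

// Only the field this sketch reads from a Messages API response.
interface ResponseLike {
  stop_reason: string | null;
}

// True when the model hit the cap instead of finishing on its own.
function wasTruncated(response: ResponseLike): boolean {
  return response.stop_reason === 'max_tokens';
}

console.log(maxTokensFor('lookup'));                      // 256
console.log(wasTruncated({ stop_reason: 'max_tokens' })); // true
console.log(wasTruncated({ stop_reason: 'end_turn' }));   // false
```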

Anti-pattern 3: Breaking cache prefixes without knowing it

Injecting timestamps, session IDs, or personalization into the system prompt block instead of the user message block. Costs full price every time.
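
A cheap way to catch this during development is to serialize the part of the request that is supposed to be static and compare it across calls. The sketch below is a local diagnostic only: `serializePrefix` and `firstDivergence` are hypothetical helpers, and JSON serialization is a stand-in for whatever the provider actually matches on, not a description of it.

```typescript
interface TextBlock {
  type: 'text';
  text: string;
}

// Serialize the parts of a request that are meant to be cacheable.
// Tool order is preserved on purpose: reordering tools changes the prefix.
function serializePrefix(system: TextBlock[], toolNames: string[]): string {
  return JSON.stringify({ toolNames, system });
}

// Index of the first differing character, or -1 if the prefixes are identical.
function firstDivergence(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  return a.length === b.length ? -1 : n;
}

const SYSTEM_TEXT = 'You are a support agent for Acme Corp.';
const tools = ['lookup_order', 'check_inventory'];

// Two calls where the timestamp lives in the user message: the prefix is stable.
const callA = serializePrefix([{ type: 'text', text: SYSTEM_TEXT }], tools);
const callB = serializePrefix([{ type: 'text', text: SYSTEM_TEXT }], tools);

// A call where a timestamp was injected into the system block: the prefix changes.
const callC = serializePrefix(
  [{ type: 'text', text: `Time: 2026-06-01T10:00:00Z. ${SYSTEM_TEXT}` }],
  tools,
);

console.log(firstDivergence(callA, callB)); // -1
console.log(firstDivergence(callA, callC) >= 0); // true
```

Logging the divergence index alongside the surrounding characters usually points straight at the offending field.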

Anti-pattern 4: Over-compressing context

Trimming or summarizing past the point of sufficiency degrades answers and triggers retries. A retry on Sonnet 4.6 costs 3,000+ tokens just to redo what the compressed context should have handled. The goal is the smallest sufficient context, not the smallest possible one.

Anti-pattern 5: Routing everything to the frontier model

The default mental model, "use the best model to be safe", is incorrect for production economics. In a typical support application, roughly 60% of requests are simple enough for Haiku to handle correctly. Routing those to Opus costs about 6× as much with no quality gain.
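
The effect of routing on unit price can be estimated with a weighted average. The sketch below assumes a 60/30/10 split across Haiku, Sonnet, and Opus, the same token volume per request in every tier, and the per-1M prices from the pricing table; it ignores the classifier call and any difference in output length between tiers, so treat the result as a rough lower bound on the savings rather than a forecast.

```typescript
interface Tier {
  share: number;      // fraction of requests routed to this tier
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

// Assumed routing mix with prices from the pricing table.
const TIERS: Tier[] = [
  { share: 0.6, inputPerM: 0.8, outputPerM: 4 },  // Haiku 4.5
  { share: 0.3, inputPerM: 3.0, outputPerM: 15 }, // Sonnet 4.6
  { share: 0.1, inputPerM: 5.0, outputPerM: 25 }, // Opus 4.8
];

// Share-weighted average price across tiers.
function blendedPrice(tiers: Tier[]): { inputPerM: number; outputPerM: number } {
  let inputPerM = 0;
  let outputPerM = 0;
  for (const t of tiers) {
    inputPerM  += t.share * t.inputPerM;
    outputPerM += t.share * t.outputPerM;
  }
  return { inputPerM, outputPerM };
}

const blended = blendedPrice(TIERS);
const opusInputPerM = 5.0;
const inputReduction = 1 - blended.inputPerM / opusInputPerM;

console.log(`Blended input:  $${blended.inputPerM.toFixed(2)}/1M`);   // $1.88/1M
console.log(`Blended output: $${blended.outputPerM.toFixed(2)}/1M`);  // $9.40/1M
console.log(`Versus all-Opus: ${(inputReduction * 100).toFixed(0)}% lower`); // 62% lower
```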

Anti-pattern 6: Ignoring the Opus 4.7+ tokenizer change

Anthropic's models from Opus 4.7 onward use a new tokenizer that produces up to 35% more tokens for the same text compared to Opus 4.6, particularly for code and structured data. If you migrated from Opus 4.6 to Opus 4.8 and your costs increased more than expected, this is likely why. Recalibrate your token estimates and budgets after model migrations.

14. Tokenizer Quirks You Should Know in 2026

Tokenization is not standardized across providers. The same text yields different token counts depending on the model's tokenizer — and the differences have direct billing implications.

// The same text can produce materially different token counts across models
const sampleText = `
function calculateTax(income: number, rate: number): number {
  const deduction = Math.min(income * 0.1, 5000);
  return (income - deduction) * rate;
}
`;

// Approximate token counts (varies by tokenizer):
// GPT-4 (tiktoken):        ~45 tokens
// Claude Sonnet 4.6:       ~48 tokens
// Claude Opus 4.8 (new tokenizer): ~58–62 tokens  ← up to 35% more for code

// Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words (English prose)
// Code, JSON, and structured data tokenize less efficiently than prose
// Non-Latin scripts (Chinese, Arabic) can produce one or more tokens per character

Key quirks in 2026:

Opus 4.7+ tokenizer: Up to 35% more tokens than Opus 4.6 for code-heavy inputs, negligible difference for English prose. Factor this into cost estimates when migrating models.
Cross-provider variation: A prompt that registers as 140 tokens in GPT-4 may exceed 180 tokens in Claude or Gemini. Always count tokens on the target model before estimating costs.
Structured data bloat: JSON, XML, and YAML tokenize poorly — many tokens for structural characters. If you're passing large JSON payloads into context, consider extracting only the fields the model needs.
Polite language: "Please" adds a token. "Thank you" adds two. "I was wondering if you could possibly help me with" adds about ten, roughly one per word. At scale, conversational padding in system prompts is measurable.
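
For quick planning before a real count is available, the 4-characters-per-token rule of thumb can be combined with an inflation factor for less efficient tokenizers. This is a rough sketch only: `estimateTokens` and `fitsBudget` are illustrative names, the 1.35 factor is the worst case quoted above for code on the newer Opus tokenizer, and none of this replaces counting tokens on the target model.

```typescript
// Rough estimate: 1 token per 4 characters of English prose,
// scaled by an optional inflation factor for code or newer tokenizers.
function estimateTokens(text: string, inflation: number = 1.0): number {
  return Math.ceil((text.length / 4) * inflation);
}

// True when the estimated size stays within a per-request token budget.
function fitsBudget(text: string, budget: number, inflation: number = 1.0): boolean {
  return estimateTokens(text, inflation) <= budget;
}

const prompt = 'x'.repeat(8_000); // stand-in for an 8,000-character prompt

console.log(estimateTokens(prompt));          // 2000
console.log(estimateTokens(prompt, 1.5));     // 3000
console.log(fitsBudget(prompt, 2_500));       // true
console.log(fitsBudget(prompt, 2_500, 1.35)); // false
```

A budget that passes at factor 1.0 but fails at 1.35 is a signal to count tokens on the target model before migrating.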

15. FAQ

Q: Should I optimize for input tokens or output tokens first? Output tokens. They cost roughly 4–6× more than input on most models and are where most teams have the most untapped leverage. Set max_tokens deliberately per task type, ask the model for concise responses when appropriate, and use structured output (JSON schemas) to eliminate verbose prose when you need data.

Q: How do I know if my prompt caching is actually working? Check cache_read_input_tokens in the API response. If it's 0 on every call, caching is not activating. Common reasons: the content is below the minimum cacheable length (1,024 tokens here; the minimum varies by model), cache_control markers are missing or misplaced, dynamic content is breaking the prefix before the cached block, or calls are spaced further apart than the cache lifetime (five minutes by default).
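
That check can be turned into a small classifier that runs on every response. A minimal sketch, assuming only the usage field names mentioned above; `UsageLike` and `cacheStatus` are illustrative names rather than SDK types.

```typescript
// The usage fields this sketch reads; cache fields may be absent or null.
interface UsageLike {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

type CacheStatus = 'hit' | 'write' | 'inactive';

function cacheStatus(usage: UsageLike): CacheStatus {
  const read    = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;

  if (read > 0) return 'hit';      // at least part of the prefix came from cache
  if (written > 0) return 'write'; // cache was populated; later calls should hit
  return 'inactive';               // nothing read or written: caching is not engaging
}

console.log(cacheStatus({ input_tokens: 120, cache_read_input_tokens: 2400 }));     // hit
console.log(cacheStatus({ input_tokens: 120, cache_creation_input_tokens: 2400 })); // write
console.log(cacheStatus({ input_tokens: 2520 }));                                   // inactive
```

A run of `write` results with no `hit` in between usually means the prefix is changing on every call.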

Q: Is it worth implementing model routing for a small application? Yes, if you're running more than ~1,000 requests per day. The classifier call (using Haiku) costs a fraction of a cent and pays for itself immediately by routing 50–60% of calls to cheaper models. The breakeven is typically within the first day.

Q: What's the right max_tokens for my use case? Measure your actual output token distribution across production traffic (or a sample). Set max_tokens at the 95th percentile of your real outputs for that task type. For most classification and extraction tasks, 256 is generous. For summaries, 512. For generation tasks, 1,024–2,048. Leaving max_tokens at 4,096 for simple tasks on a frontier model is a common source of silent waste.
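
The 95th-percentile rule is straightforward to compute from logged `output_tokens` values. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the `percentile` helper and the sample data are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample with at least
// a fraction p of all samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile() needs at least one sample');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank   = Math.ceil(p * sorted.length);
  const index  = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Stand-in for output_tokens values logged for one task type.
const observedOutputTokens = [40, 55, 62, 48, 300, 51, 47, 66, 58, 44];

const median = percentile(observedOutputTokens, 0.5);
const p95    = percentile(observedOutputTokens, 0.95);

console.log(`median: ${median}, p95: ${p95}`);
```

With a small sample like this one, a single outlier sets the p95, so collect enough traffic per task type before committing to a cap.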

Q: How does the Batch API affect latency? Batch requests are processed within 24 hours, and typically much faster, often within minutes for small batches. That rules the Batch API out for real-time user-facing features but makes it ideal for background processing, nightly jobs, content pipelines, and evaluation runs. If your users are not waiting for the response, use the Batch API.

Q: We're hitting rate limits, not spending limits. How does that interact with token budgeting? Rate limits are measured in tokens per minute (TPM), not dollars. Reducing tokens per request — via pruning, surgical RAG, and concise output — directly increases how many requests you can fit within your TPM limit. Token budgeting and rate limit management are the same problem from different angles.
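
The relationship is simple division, which makes the payoff easy to estimate. In the sketch below the 400,000 TPM limit and the per-request token counts are made-up illustrative numbers, and the limit is assumed to count input and output tokens together; providers that meter input and output separately need the same calculation done per limit.

```typescript
// Requests per minute that fit under a tokens-per-minute limit.
function maxRequestsPerMinute(tpmLimit: number, tokensPerRequest: number): number {
  if (tokensPerRequest <= 0) {
    throw new Error('tokensPerRequest must be positive');
  }
  return Math.floor(tpmLimit / tokensPerRequest);
}

const TPM_LIMIT = 400_000; // hypothetical account limit

const before = maxRequestsPerMinute(TPM_LIMIT, 3_500); // unpruned request
const after  = maxRequestsPerMinute(TPM_LIMIT, 1_400); // after pruning and a tighter RAG budget

console.log(`Before: ${before} requests/min`); // 114
console.log(`After:  ${after} requests/min`);  // 285
```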

Start with measurement — run client.messages.countTokens() on your most frequent request type before changing anything. The number will almost certainly surprise you, and it will tell you exactly which lever to pull first.

1. The Misconception That's Costing You Money

Ask a developer how to reduce their LLM API costs and they'll say: "write shorter prompts." Remove adjectives. Cut explanations. Use fewer examples. Trim the system prompt.

This advice is not wrong. It's just the lowest-leverage version of the right idea, and it misses the places where 80% of the money actually goes.

Token optimization is a context engineering problem. The question is not "how many words is my prompt?" The questions are:

What is in your context that does not need to be there?
How is your context structured — and does that structure let the cache work?
How does your context grow across a conversation — and is that growth controlled?
Is the model you're paying for the right model for this specific task?
Does this request need to be real-time at all, or could it run in a batch?

Answer those five questions and you'll reduce your bill by 60–80%. Shorten your prompts and you'll reduce it by 5%.

This guide covers the five levers, in order of impact, with real 2026 numbers.

2. The 2026 Pricing Landscape: What You're Actually Paying

Before optimizing, know what you're optimizing. Here is the current pricing landscape as of June 2026:

Model	Input (per 1M tokens)	Output (per 1M tokens)	Cached Input	Context
Claude Opus 4.8	$5.00	$25.00	$0.50	1M
Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M
Claude Haiku 4.5	$0.80	$4.00	$0.08	200K
GPT-5.5	$5.00	$30.00	$0.50	1M
GPT-4.1	$2.00	$8.00	$1.00	1M
GPT-4.1 Nano	$0.10	$0.40	$0.05	1M
Gemini 3.1 Pro	$2.00	$12.00	$0.20/hr storage	1M
DeepSeek V4 Flash	$0.14	$0.28	$0.003	1M

Two things jump immediately from this table.

xychart-beta
    title "Output Token Price per 1M (USD) — June 2026"
    x-axis ["DeepSeek V4 Flash", "GPT-4.1 Nano", "Haiku 4.5", "GPT-4.1", "Gemini 3.1 Pro", "Sonnet 4.6", "Opus 4.8", "GPT-5.5"]
    y-axis "$/1M output tokens" 0 --> 32
    bar [0.28, 0.40, 4.00, 8.00, 12.00, 15.00, 25.00, 30.00]

And the savings from caching:

These multipliers stack. A cached batch request at Anthropic gets both the 90% cache discount and the 50% batch discount on the cached portion.

3. Where Your Tokens Actually Go

Most teams optimize the wrong things because they haven't diagnosed where their tokens actually go. Here are the real cost drivers, in rough order of impact:

pie title "Typical Token Distribution — Multi-Turn Agent Session"
    "Conversation history (replayed each turn)" : 42
    "System prompt (replayed each turn)" : 18
    "Tool schemas (loaded but unused)" : 15
    "RAG retrieved context" : 14
    "Actual user message" : 7
    "Model output" : 4

4. The Quadratic Conversation Problem

This is the mechanism most teams don't realize is happening until the bill arrives.

Consider a conversation with n turns, where each turn averages t tokens (user message + assistant response):

Turn 1:  1t tokens processed
Turn 2:  2t tokens processed  (history + new)
Turn 3:  3t tokens processed
...
Turn n:  nt tokens processed

Total:   t × (1 + 2 + 3 + ... + n) = t × n(n+1)/2

xychart-beta
    title "Cumulative Token Cost — 200-Token Average Turn (input only)"
    x-axis ["Turn 5", "Turn 10", "Turn 20", "Turn 30", "Turn 50"]
    y-axis "Cumulative tokens (thousands)" 0 --> 260
    bar [3, 11, 42, 93, 255]

The fix is context pruning and compaction — covered in Lever 2.

5. Lever 1: Prompt Caching — 90% Off Your Static Content

How to Use It (Anthropic)

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Your static system prompt — ideally 1,024+ tokens to qualify for caching
const SYSTEM_PROMPT = `You are an expert customer support agent for Acme Corp.

## Product Catalogue
[... 2,000 words of product documentation ...]

## Policies
[... return policy, shipping policy, warranty terms ...]

## Response Guidelines
[... tone, escalation rules, formatting requirements ...]
`;

// Static tool definitions — also cacheable
const TOOLS = [
  {
    name: 'lookup_order',
    description: 'Look up an order by order ID',
    input_schema: { /* schema */ },
  },
  {
    name: 'check_inventory',
    description: 'Check product inventory levels',
    input_schema: { /* schema */ },
  },
  // ... more tools
];

async function chat(conversationHistory: Anthropic.MessageParam[], userMessage: string) {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: SYSTEM_PROMPT,
        // ← This block is cached. On cache hit: 90% cheaper.
        cache_control: { type: 'ephemeral' },
      },
    ],
    tools: TOOLS.map((tool, index) =>
      // Cache the last tool definition to cache the whole tool block
      index === TOOLS.length - 1
        ? { ...tool, cache_control: { type: 'ephemeral' } }
        : tool
    ),
    messages: [
      ...conversationHistory,
      { role: 'user', content: userMessage },
    ],
  });

  // Inspect cache performance in the response
  const usage = response.usage;
  console.log({
    inputTokens:         usage.input_tokens,
    cacheReadTokens:     usage.cache_read_input_tokens,    // ← paid 10%
    cacheCreationTokens: usage.cache_creation_input_tokens, // ← paid 125%
    outputTokens:        usage.output_tokens,
  });

  return response;
}

The Cache Rules You Must Know

Cache writes cost 25% more on the first request (you're paying for Anthropic to store it). The breakeven is typically 3–4 requests. After that, every cache hit pays dividends.

OpenAI Automatic Caching

OpenAI handles this automatically — no code changes needed:

// OpenAI: caching is automatic. Same prefix = cache hit. 50% discount.
// No cache_control markers required.
// Check usage.prompt_tokens_details.cached_tokens in the response to verify.
const response = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages: [
    { role: 'system', content: LARGE_STATIC_SYSTEM_PROMPT }, // auto-cached
    ...conversationHistory,
    { role: 'user', content: userMessage },
  ],
});

console.log(response.usage?.prompt_tokens_details?.cached_tokens);

The trade-off: OpenAI's caching gives 50% off (vs Anthropic's 90%) but requires zero implementation effort.

6. Lever 2: Context Pruning and Compaction

Caching helps with static content. Context pruning handles the dynamic problem — the quadratic growth of conversation history.

There are three distinct tools here, used at different stages of a session:

flowchart TD
    A["New Turn"] --> B{"History\nsize?"}
    B -->|"Small\n< 20% of window"| C["Pass through\nno changes"]
    B -->|"Medium\n20–70% of window"| D["Context Editing\nPrune stale tool results\nand completed subtasks"]
    B -->|"Large\n> 70% of window"| E{"Has old turns\nwith full detail?"}
    E -->|"Yes"| F["Summarize old turns\ninto static cached block"]
    E -->|"Approaching limit"| G["Context Compaction\nbeta: compact-2026-01-12\nServer-side condensation"]
    D --> H["Continue"]
    F --> H
    G --> H
    C --> H

Tool 1: Context Editing (Pruning)

function pruneConversationHistory(
  messages: Message[],
  keepLastN: number = 10
): Message[] {
  if (messages.length <= keepLastN) return messages;

  // Always keep: system messages, the original user goal, recent messages
  const pinned  = messages.filter(m => m.role === 'system' || m.isPinned);
  const recent  = messages.slice(-keepLastN);
  const stale   = messages.slice(0, -keepLastN).filter(m => !m.isPinned);

  // Remove stale tool results (they've been acted on; the action is in recent history)
  const prunedStale = stale.map(msg => {
    if (msg.role === 'tool' && isOlderThan(msg, minutesAgo(15))) {
      return { ...msg, content: '[tool result pruned — see summary]' };
    }
    return msg;
  });

  return [...pinned, ...prunedStale, ...recent];
}

// In your agent loop:
state.messages = pruneConversationHistory(state.messages, 12);

Tool 2: Summarize Into a Cached Block

When the conversation has accumulated substantial history, summarize older turns into a static block — then cache it:

async function compressOldHistory(
  messages: Message[],
  threshold: number = 20
): Promise<{ summary: string; recentMessages: Message[] }> {
  if (messages.length < threshold) {
    return { summary: '', recentMessages: messages };
  }

  const old    = messages.slice(0, -10);
  const recent = messages.slice(-10);

  // Summarize the old turns into a compact representation
  const summaryResponse = await client.messages.create({
    model:      'claude-haiku-4-5',   // ← use a cheap model for summarization
    max_tokens: 500,
    messages: [
      {
        role:    'user',
        content: `Summarize the following conversation history into a compact factual summary.
Include: decisions made, key findings, completed actions, and current state.
Exclude: detailed reasoning, intermediate steps, failed attempts.

<history>
${old.map(m => `${m.role}: ${m.content}`).join('\n')}
</history>`,
      },
    ],
  });

  return {
    summary:        summaryResponse.content[0].text,
    recentMessages: recent,
  };
}

// Build the next request with summary cached + recent messages dynamic
const { summary, recentMessages } = await compressOldHistory(state.messages);

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  system: [
    { type: 'text', text: SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
    // The summary becomes a second cached block — static, stable, cheap
    ...(summary ? [{
      type:          'text' as const,
      text:          `## Conversation Summary\n${summary}`,
      cache_control: { type: 'ephemeral' as const },
    }] : []),
  ],
  messages: recentMessages,
  max_tokens: 1024,
});

Tool 3: Context Compaction (Beta)

const response = await client.messages.create({
  model:      'claude-sonnet-4-6',
  max_tokens: 1024,
  messages:   fullHistory,
  betas:      ['compact-2026-01-12'],   // enable server-side compaction
});

Warning: Over-compressing context degrades answers and triggers retries that cost more than the context you saved. The goal is not the smallest possible context but the smallest sufficient one. Validate quality on representative traffic after any aggressive context change before shipping.

7. Lever 3: Model Routing — Pay for Reasoning Only When You Need It

The principle: match model capability to task complexity. Not every task needs the frontier. A routing classifier costs a fraction of a cent and decides which model handles the work.

// src/router.ts — simple but effective model router
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

type TaskComplexity = 'simple' | 'medium' | 'complex';

// Light classifier — uses the cheapest model to categorize the task
async function classifyTask(userMessage: string): Promise<TaskComplexity> {
  const response = await client.messages.create({
    model:      'claude-haiku-4-5',  // $0.80/1M — cheapest Anthropic model
    max_tokens: 10,
    messages: [{
      role:    'user',
      content: `Classify this task. Reply with ONLY one word: simple, medium, or complex.

simple: extraction, formatting, classification, yes/no questions, short factual lookups
medium: summarization, translation, structured analysis, multi-step but predictable tasks
complex: deep reasoning, code architecture, multi-turn problem solving, research synthesis

Task: "${userMessage}"`,
    }],
  });

  const classification = (response.content[0] as { text: string }).text
    .trim()
    .toLowerCase() as TaskComplexity;

  return ['simple', 'medium', 'complex'].includes(classification)
    ? classification
    : 'medium';  // safe default
}

// Route to the right model based on complexity
const MODEL_MAP: Record<TaskComplexity, string> = {
  simple:  'claude-haiku-4-5',   // $0.80/$4.00 per 1M — fast, cheap
  medium:  'claude-sonnet-4-6',  // $3.00/$15.00 per 1M — balanced
  complex: 'claude-opus-4-8',    // $5.00/$25.00 per 1M — frontier reasoning
};

export async function routedCompletion(
  userMessage: string,
  systemPrompt: string,
  conversationHistory: Anthropic.MessageParam[]
) {
  const complexity = await classifyTask(userMessage);
  const model      = MODEL_MAP[complexity];

  console.log(`Routed to: ${model} (${complexity})`);

  const response = await client.messages.create({
    model,
    max_tokens: complexity === 'simple' ? 256 : complexity === 'medium' ? 1024 : 4096,
    system:     [{ type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } }],
    messages:   [...conversationHistory, { role: 'user', content: userMessage }],
  });

  return { response, model, complexity };
}

Real-world routing distribution for a typical support application:

~60% of requests are simple (classification, lookup, formatting)
~30% are medium (summarization, structured responses)
~10% are complex (edge cases, escalations, deep problem solving)

Routing to Haiku for 60% of requests and Sonnet for 30% — instead of Opus for 100% — reduces cost by approximately 75% with minimal quality impact on the simple and medium tasks.

For agent teams: subagent model selection matters even more. Agent sessions use approximately 7× more tokens than standard sessions, so model choice compounds dramatically.

8. Lever 4: Batch API — 50% Off Everything Async

// src/batch.ts — Anthropic Message Batches API
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

interface BatchItem {
  id:      string;
  content: string;
  context: Record<string, string>;
}

// Create a batch — send thousands of requests at once, 50% cheaper
async function processBatch(items: BatchItem[]) {
  const batch = await client.beta.messages.batches.create({
    requests: items.map(item => ({
      custom_id: item.id,
      params: {
        model:      'claude-sonnet-4-6',
        max_tokens: 512,
        system: [{
          type:          'text' as const,
          text:          CLASSIFICATION_SYSTEM_PROMPT,
          cache_control: { type: 'ephemeral' as const },
        }],
        messages: [{
          role:    'user' as const,
          content: item.content,
        }],
      },
    })),
  });

  console.log(`Batch created: ${batch.id}`);
  console.log(`Status: ${batch.processing_status}`);

  return batch.id;
}

// Poll for results (or use a webhook)
async function getBatchResults(batchId: string) {
  // Wait for completion
  let batch = await client.beta.messages.batches.retrieve(batchId);

  while (batch.processing_status === 'in_progress') {
    await new Promise(resolve => setTimeout(resolve, 5000));
    batch = await client.beta.messages.batches.retrieve(batchId);
  }

  // Stream the results
  const results = [];
  for await (const result of await client.beta.messages.batches.results(batchId)) {
    if (result.result.type === 'succeeded') {
      results.push({
        id:      result.custom_id,
        content: result.result.message.content[0],
        usage:   result.result.message.usage,
      });
    }
  }

  return results;
}

Best workloads for batch processing:

Workload	Real-time needed?	Batch saves
Content moderation	Sometimes	50%
Document classification	Rarely	50%
Bulk summarization	Never	50%
Data extraction from documents	Never	50%
Embedding generation	Never	~free via alternatives
Email drafts (generated nightly)	Never	50%
Report generation	Never	50%
A/B prompt testing	Never	50%

Stack with caching: batch + cached input on a repeated system prompt → a request that would cost $3.00/1M normally costs $0.30/1M (cache read) × 0.50 (batch) = $0.15/1M — a 95% reduction.

9. Lever 5: Surgical RAG — Stop Dumping Documents

// src/rag.ts — surgical retrieval, not document dumping

interface RetrievedChunk {
  content:   string;
  score:     number;
  tokenCount: number;
  source:    string;
}

// BAD: pass everything the retriever returns
async function naiveRag(query: string): Promise<string> {
  const chunks = await vectorDB.search(query, { limit: 10 });
  return chunks.map(c => c.content).join('\n\n');
  // Typically: 8,000–15,000 tokens. Costs $0.024–$0.045 per call on Sonnet 4.6.
}

// GOOD: targeted retrieval with a token budget
async function surgicalRag(
  query:        string,
  tokenBudget:  number = 2000,   // hard cap on context retrieved
  minScore:     number = 0.75,   // filter out weak matches
): Promise<string> {
  // 1. Retrieve more candidates than you'll use
  const candidates = await vectorDB.search(query, { limit: 20 });

  // 2. Filter by relevance threshold — weak matches add noise, not signal
  const relevant = candidates.filter(c => c.score >= minScore);

  // 3. Rerank: MMR (Maximal Marginal Relevance) for diversity + relevance
  const reranked = maximalMarginalRelevance(relevant, query, 5);

  // 4. Fill up to the token budget, most relevant first
  let usedTokens = 0;
  const selected: RetrievedChunk[] = [];

  for (const chunk of reranked) {
    if (usedTokens + chunk.tokenCount > tokenBudget) break;
    selected.push(chunk);
    usedTokens += chunk.tokenCount;
  }

  // 5. Format with source attribution (helps the model use context correctly)
  return selected
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.content}`)
    .join('\n\n---\n\n');
  // Result: ~1,500–2,000 tokens. Costs $0.005–$0.006. Same or better quality.
}

Common RAG token traps:

// ❌ Retrieving entire documents
const context = await fetchDocument(docId);  // might be 50,000 tokens

// ✅ Retrieving targeted chunks with a hard token cap
const context = await surgicalRag(query, tokenBudget = 1500);

// ❌ Passing raw retrieval with no filtering
const context = chunks.map(c => c.content).join('\n\n');

// ✅ Filtering by score, deduplicating, trimming to budget
const context = await surgicalRag(query, tokenBudget = 2000, minScore = 0.75);

// ❌ Re-retrieving the same documents on every turn
// ✅ Cache stable reference documents (product catalogue, policies, FAQs)
//    using prompt caching — retrieve once, cache for the session

10. Cache Structure: The Ordering That Makes or Breaks Cache Hits

Prompt caching is prefix-based: the cache must match from the beginning of the prompt. A single dynamic element placed before your static content breaks every cache hit.

// ❌ WRONG ORDER — cache never hits because timestamp is before static content
const messages = [
  { role: 'user', content: `Current time: ${new Date().toISOString()}` }, // changes every call
  { role: 'assistant', content: LARGE_STATIC_CONTEXT },  // cache broken
];

// ✅ CORRECT ORDER — static content first, dynamic content last
const system = [
  {
    type:          'text',
    text:          SYSTEM_PROMPT,      // static — cache this
    cache_control: { type: 'ephemeral' },
  },
  {
    type:          'text',
    text:          TOOL_CONTEXT,       // static — cache this too
    cache_control: { type: 'ephemeral' },
  },
  {
    type:          'text',
    text:          REFERENCE_DOCS,     // static — third cache breakpoint
    cache_control: { type: 'ephemeral' },
  },
];

const messages = [
  // Dynamic: conversation history (grows with turns)
  ...conversationHistory,
  // Dynamic: current user message (changes every turn)
  { role: 'user', content: `[${new Date().toISOString()}] ${userMessage}` },
  // ↑ Put timestamps in the user message, not in the system block
];

The correct structure — every time:

1. System prompt                    ← cache_control: ephemeral
2. Tool definitions                 ← cache_control: ephemeral
3. Reference documents              ← cache_control: ephemeral (optional 4th breakpoint)
4. Summarized conversation history  ← cache_control: ephemeral (if summarized)
────────────────────────────────── cache boundary ──
5. Recent conversation history      ← dynamic, no cache
6. Current user message             ← dynamic, no cache

Common ways cache hits break in practice:

Misplaced or missing cache_control markers (many third-party LLM clients don't implement this)
Conversation history restructured or compressed differently on each call (changing the prefix)
Dynamic data (timestamps, session IDs, personalization) injected into static blocks
Tool schemas sorted differently across requests (different order = different prefix = cache miss)

11. Measuring First: You Can't Optimize What You Can't See

Before touching a single line of optimization code, instrument what you actually have.

Token Counting Before Sending

import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();

// Count tokens in a request before you send it — no inference cost
const tokenCount = await client.messages.countTokens({
  model:  'claude-sonnet-4-6',
  system: SYSTEM_PROMPT,
  tools:  TOOLS,
  messages: conversationHistory,
});

console.log(`This request will use ${tokenCount.input_tokens} input tokens`);
// At $3/1M: ${(tokenCount.input_tokens * 0.000003).toFixed(6)} per call

Reading Cache Performance From Responses

const response = await client.messages.create({ ... });

const {
  input_tokens,
  output_tokens,
  cache_read_input_tokens,    // paid 10% (cache hit)
  cache_creation_input_tokens // paid 125% (cache write, first time)
} = response.usage;

const cacheHitRate = cache_read_input_tokens /
  (cache_read_input_tokens + cache_creation_input_tokens + input_tokens);

const costWithoutCache = (input_tokens + cache_read_input_tokens + cache_creation_input_tokens)
  * 0.000003;  // Sonnet 4.6 input price

const actualCost = (input_tokens * 0.000003)
  + (cache_read_input_tokens * 0.0000003)   // 10% of normal
  + (cache_creation_input_tokens * 0.00000375); // 125% of normal

const savings = costWithoutCache - actualCost;

console.log({
  cacheHitRate:    `${(cacheHitRate * 100).toFixed(1)}%`,
  costSaved:       `$${savings.toFixed(6)}`,
  effectivePricePerToken: `$${(actualCost / (input_tokens + cache_read_input_tokens + cache_creation_input_tokens) * 1_000_000).toFixed(2)}/1M`,
});

A Simple Token Budget Tracker

// src/token-tracker.ts — per-session budget tracking
class TokenBudget {
  private inputTokens  = 0;
  private outputTokens = 0;
  private cacheHits    = 0;
  private cacheMisses  = 0;

  // Input/output prices for Claude Sonnet 4.6
  private readonly INPUT_PRICE  = 3.00 / 1_000_000;
  private readonly OUTPUT_PRICE = 15.00 / 1_000_000;
  private readonly CACHE_PRICE  = 0.30 / 1_000_000;  // 90% off

  record(usage: Anthropic.Usage) {
    this.inputTokens  += usage.input_tokens;
    this.outputTokens += usage.output_tokens;
    this.cacheHits    += usage.cache_read_input_tokens ?? 0;
    this.cacheMisses  += usage.cache_creation_input_tokens ?? 0;
  }

  report() {
    const inputCost  = this.inputTokens  * this.INPUT_PRICE;
    const outputCost = this.outputTokens * this.OUTPUT_PRICE;
    const cacheCost  = this.cacheHits    * this.CACHE_PRICE;
    const totalCost  = inputCost + outputCost + cacheCost;

    return {
      tokens: {
        input:   this.inputTokens,
        output:  this.outputTokens,
        cached:  this.cacheHits,
      },
      cost: {
        input:   `$${inputCost.toFixed(4)}`,
        output:  `$${outputCost.toFixed(4)}`,
        cached:  `$${cacheCost.toFixed(4)}`,
        total:   `$${totalCost.toFixed(4)}`,
      },
      cacheHitRate: `${(this.cacheHits / (this.cacheHits + this.cacheMisses + this.inputTokens) * 100).toFixed(1)}%`,
    };
  }
}

The Monthly Cost Formula

Before optimizing, calculate your actual baseline cost:

Monthly Cost =
  (Daily Requests × Avg Input Tokens  × Input Price/1M  × 30) +
  (Daily Requests × Avg Output Tokens × Output Price/1M × 30)

Example: 10,000 daily requests, 3,000 avg input tokens, 500 avg output tokens, Claude Sonnet 4.6:

Input:  10,000 × 3,000 × ($3.00/1M)  × 30 = $2,700/month
Output: 10,000 × 500   × ($15.00/1M) × 30 = $2,250/month
Total:  $4,950/month

After caching (assume 80% cache hit rate on input):
Cached input:  10,000 × 2,400 × ($0.30/1M) × 30 = $216/month  ← 90% off
Fresh input:   10,000 ×   600 × ($3.00/1M)  × 30 = $540/month
Output (unchanged):                                 $2,250/month
Total:  $3,006/month  ← 39% reduction from caching alone

12. Real Case Study: $2,400 to $680 in 3 Months

The breakdown of where the savings came from:

pie title "Source of 72% Cost Reduction"
    "Model routing (Haiku for simple tasks)" : 41
    "Prompt caching (system prompts, tool schemas)" : 33
    "Context pruning (removing stale history)" : 18
    "Output length budgeting (max_tokens)" : 8

13. The Anti-Patterns: What Not to Do

Anti-pattern 1: Optimizing prompt length before diagnosing cost drivers

Spending hours trimming 200 tokens from a system prompt is wasted effort when a 20,000-token conversation history is re-sent on every turn and cumulative input grows quadratically. Measure first.
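A rough simulation makes the gap concrete. The numbers here are hypothetical (a 2,000-token system prompt, 500 tokens added per turn, 40 turns, so the history reaches 20,000 tokens), and `cumulativeInputTokens` is an illustrative helper that assumes every turn re-sends the system prompt plus all retained history.

```typescript
// Sums the input tokens billed across a whole conversation.
// maxHistoryTurns caps how many turns of history are kept in context.
function cumulativeInputTokens(
  systemTokens: number,
  tokensPerTurn: number,
  turns: number,
  maxHistoryTurns: number = Infinity,
): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    const turnsInContext = Math.min(turn, maxHistoryTurns);
    total += systemTokens + turnsInContext * tokensPerTurn;
  }
  return total;
}

const untouched = cumulativeInputTokens(2000, 500, 40);        // 490000
const trimmedPrompt = cumulativeInputTokens(1800, 500, 40);    // 482000
const windowedHistory = cumulativeInputTokens(2000, 500, 40, 10); // 257500

console.log(`Trimming 200 prompt tokens saves ${untouched - trimmedPrompt} tokens`);
console.log(`Keeping only the last 10 turns saves ${untouched - windowedHistory} tokens`);
```

Under these assumptions the prompt trim saves 8,000 input tokens over the conversation, while windowing the history saves 232,500.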

Anti-pattern 2: Using max_tokens: 4096 as a default everywhere

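Output tokens cost 4–8× more than input tokens. max_tokens is a hard cap, not a target: you pay only for the tokens actually generated, but a deliberate cap bounds the worst case per request. Set it per task: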

// Simple classification: 10 tokens is enough
max_tokens: 10

// Factual lookup: 256 tokens
max_tokens: 256

// Structured analysis: 1024 tokens
max_tokens: 1024

// Full document generation: 4096
max_tokens: 4096
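One way to keep these limits out of individual call sites is a single lookup table. This is a minimal sketch: the `TaskKind` names and the `worstCaseOutputCost` helper are hypothetical, and the limits are simply the four values listed above.

```typescript
type TaskKind = "classification" | "lookup" | "analysis" | "document";

// Output caps per task type, mirroring the values above.
const MAX_TOKENS_BY_TASK: Record<TaskKind, number> = {
  classification: 10,
  lookup: 256,
  analysis: 1024,
  document: 4096,
};

// Upper bound on output spend for one request if the model hits the cap.
function worstCaseOutputCost(task: TaskKind, outputPricePerMTok: number): number {
  return (MAX_TOKENS_BY_TASK[task] / 1e6) * outputPricePerMTok;
}

// Using the $15.00 per 1M output price listed for Claude Sonnet 4.6.
console.log(worstCaseOutputCost("classification", 15).toFixed(5)); // 0.00015
console.log(worstCaseOutputCost("document", 15).toFixed(5));       // 0.06144
```

Passing `MAX_TOKENS_BY_TASK[task]` as `max_tokens` makes the cap a reviewed decision per task type instead of a copied default.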

Anti-pattern 3: Breaking cache prefixes without knowing it

Injecting timestamps, session IDs, or personalization into the system prompt block instead of the user message block changes the cached prefix on every request, so the cache never hits and you pay full input price every time.
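A sketch of the safer layout: the system block stays byte-identical across requests and carries the `cache_control` marker, while per-request data goes into the user message. The request shape follows the Anthropic Messages API, but the types are declared locally, `buildRequest` is a hypothetical helper, and `"MODEL_ID"` is a placeholder. A real prefix must also meet the provider's minimum cacheable length, which this short prompt would not.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface MessageRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Static content only: no timestamps, IDs, or user names in here.
const STATIC_SYSTEM_PROMPT = "You are a support assistant. Answer concisely.";

function buildRequest(question: string, sessionId: string, nowIso: string): MessageRequest {
  return {
    model: "MODEL_ID", // placeholder, substitute your model identifier
    max_tokens: 256,
    system: [
      { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    // Dynamic values live after the cached prefix, in the user turn.
    messages: [
      { role: "user", content: `[session ${sessionId}, time ${nowIso}]\n${question}` },
    ],
  };
}

const first = buildRequest("Where is my order?", "s-1", "2026-06-01T10:00:00Z");
const second = buildRequest("Cancel my plan.", "s-2", "2026-06-01T10:05:00Z");

// The cacheable prefix is identical even though the sessions differ.
console.log(JSON.stringify(first.system) === JSON.stringify(second.system)); // true
```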

Anti-pattern 4: Over-compressing context

Anti-pattern 5: Routing everything to the frontier model

Anti-pattern 6: Ignoring the Opus 4.7+ tokenizer change

14. Tokenizer Quirks You Should Know in 2026

Tokenization is not standardized across providers. The same text yields different token counts depending on the model's tokenizer — and the differences have direct billing implications.

// The same text can produce materially different token counts across models
const sampleText = `
function calculateTax(income: number, rate: number): number {
  const deduction = Math.min(income * 0.1, 5000);
  return (income - deduction) * rate;
}
`;

// Approximate token counts (varies by tokenizer):
// GPT-4 (tiktoken):        ~45 tokens
// Claude Sonnet 4.6:       ~48 tokens
// Claude Opus 4.8 (new tokenizer): ~58–62 tokens  ← up to 35% more for code

// Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words (English prose)
// Code, JSON, and structured data tokenize less efficiently than prose
// Non-Latin scripts (Chinese, Arabic) can produce 1 token per character

Key quirks in 2026:

Opus 4.7+ tokenizer: Up to 35% more tokens than Opus 4.6 for code-heavy inputs, negligible difference for English prose. Factor this into cost estimates when migrating models.
Cross-provider variation: A prompt that registers as 140 tokens in GPT-4 may exceed 180 tokens in Claude or Gemini. Always count tokens on the target model before estimating costs.
Structured data bloat: JSON, XML, and YAML tokenize poorly — many tokens for structural characters. If you're passing large JSON payloads into context, consider extracting only the fields the model needs.
Polite language: "Please" adds a token. "Thank you" adds two. "I was wondering if you could possibly help me with" adds eight. At scale, conversational padding in system prompts is measurable.
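For the structured data point, a small projection step before building the prompt is often enough. The helpers below are illustrative: `pickFields` and `roughTokenEstimate` are hypothetical names, the sample record is made up, and the estimate uses the 4-characters-per-token rule of thumb, which is tuned for English prose and tends to undercount JSON. Use the provider's token counting endpoint for real numbers.

```typescript
// Keeps only the listed keys from a record.
function pickFields<T extends Record<string, unknown>>(
  record: T,
  keys: (keyof T)[],
): Partial<T> {
  const out: Partial<T> = {};
  for (const key of keys) {
    if (key in record) out[key] = record[key];
  }
  return out;
}

// Rough estimate only: about 4 characters per token for English prose.
function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Made-up payload standing in for a large API response.
const order = {
  id: "ord_123",
  status: "shipped",
  customerEmail: "user@example.com",
  internalNotes: "Packed at warehouse 7, rechecked by QA, label reprinted twice.",
  auditTrail: ["created", "paid", "packed", "shipped"],
};

const fullJson = JSON.stringify(order);
const slimJson = JSON.stringify(pickFields(order, ["id", "status"]));

console.log(`full: ~${roughTokenEstimate(fullJson)} tokens`);
console.log(`slim: ~${roughTokenEstimate(slimJson)} tokens`);
```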

15. FAQ

16. Further Reading

Official Documentation

📖 Anthropic Prompt Caching Documentation
📖 Anthropic Message Batches API
📖 Anthropic Token Counting API
📖 OpenAI Batch API
📖 OpenAI Prompt Caching

Research and Deep Dives

📰 LLM Token Optimization Strategies: The Complete Guide — TokenOptimize (2026)
📰 Token Optimization and Cost Management for ChatGPT & Claude — IntuitionLabs
📰 Understanding LLM Cost Per Token: A 2026 Practical Guide — SiliconData
📰 Claude API Cost Optimization: 60% Token Reduction in Production — DEV Community
📰 Anthropic Prompt Caching and Token Efficiency Guide — Hidekazu Konishi
📰 Lost in the Middle: How Language Models Use Long Contexts — Stanford NLP (the research behind why context placement matters)

Tools

Sanju Singh
