Chapter 7

Context Economics — Tokens as Budget

Four compression mechanisms, reactive compact as API 413 fallback, the token budget system, and lazy injection for skills and memory.

The Core Problem

A context window is a fixed resource. Every token you put in is a token you can’t use for something else. In a coding session that spans hundreds of turns, reads dozens of files, and executes scores of tool calls, the naive approach — accumulate everything — runs out of room fast.

Claude Code treats context not as a buffer to fill but as a budget to manage. The question isn’t “how much can we fit?” but “what’s the most valuable thing to keep?”

Four Compression Mechanisms

Compression happens at four levels, applied in order of severity:

Snip — the gentlest pass. Individual tool results that are very long get truncated. A file read that returns 50,000 characters gets snipped to the most relevant portion. The full result is stored; the model sees a compressed version. This happens per-tool, automatically, without changing the overall context structure.
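A minimal sketch of a snip pass, keeping the head and tail of an oversized result. The function name, size limit, and head/tail split are illustrative; the real pass selects the most relevant portion, not just the edges:

```python
def snip(result: str, max_chars: int = 4000,
         marker: str = "\n[...snipped...]\n") -> str:
    """Truncate an over-long tool result, keeping head and tail."""
    if len(result) <= max_chars:
        return result
    keep = max_chars - len(marker)
    head = keep * 2 // 3          # bias toward the start of the output
    tail = keep - head
    return result[:head] + marker + result[-tail:]
```

The full result would be stored elsewhere; only this compressed view enters the context.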

Micro — compresses specific parts of the conversation history. Old turns that were primarily tool-heavy (lots of tool calls and results, minimal reasoning) get summarized in place. The turn stays in context but as a brief description rather than the full content. The model knows something happened there; it just doesn’t have the details.
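Micro compression might look like the following in-place rewrite. The `Turn` structure and the tool-heavy heuristic (several tool calls, little original prose) are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str                 # "user" | "assistant"
    text: str
    tool_calls: int = 0
    summarized: bool = False

def micro_compress(history: list[Turn], keep_recent: int = 10) -> None:
    """Summarize old, tool-heavy turns in place; recent turns stay intact."""
    for turn in history[:-keep_recent]:
        tool_heavy = turn.tool_calls >= 3 and len(turn.text) > 500
        if tool_heavy and not turn.summarized:
            # The turn stays in context, but as a brief description.
            turn.text = f"[tool-heavy turn: {turn.tool_calls} calls, results elided]"
            turn.summarized = True
```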

Context Collapse — larger-scale compression. Multiple earlier turns get collapsed into a summary paragraph that’s injected at the beginning of the context. The original turns are discarded. The model retains a high-level understanding of what happened earlier in the session but loses the specifics. This is destructive — the original content is gone from the context.
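A sketch of the collapse step, with a placeholder `summarize` standing in for the model call that writes the summary paragraph:

```python
def summarize(turns: list[str]) -> str:
    # Placeholder: the real pass prompts a model to write the summary.
    return f"{len(turns)} earlier turns elided"

def collapse(history: list[str], keep_recent: int = 5) -> list[str]:
    """Destructively replace older turns with one summary paragraph."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"[Earlier in this session] {summarize(old)}"] + recent
```

Note the destructiveness: the returned list no longer contains the old turns at all.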

Auto Compact — the heaviest mechanism. When the context is approaching limits, Auto Compact runs a full summarization pass over the entire session history and replaces it with a compact summary. This is a fork operation: a separate agent process reads the full context and produces a summary, which then becomes the new context. The main session continues from the summary.
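The fork can be modeled as handing the entire context to a separate summarizer and adopting its output as the new context. Here the `summarizer` callable stands in for the forked agent process:

```python
def auto_compact(context: list[str], summarizer) -> list[str]:
    """Replace the whole session history with a compact summary.

    `summarizer` stands in for the forked agent: it reads the full
    context and returns a prose summary to continue from.
    """
    summary = summarizer(context)
    return [f"[Session summary] {summary}"]
```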

The four mechanisms sit on a spectrum from “trim the edges” to “compress everything.” Claude Code applies them in order, escalating only when lighter passes aren’t enough.
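The escalation order can be sketched as a loop that applies passes by severity and stops as soon as the context fits. Character counts stand in for token counts here, and the pass list is whatever ordering the caller supplies:

```python
def fit_to_budget(ctx: list[str], limit: int, passes) -> list[str]:
    """Apply compression passes in severity order until the context fits."""
    for compress in passes:
        if sum(len(t) for t in ctx) <= limit:
            break                 # lighter passes were enough; stop escalating
        ctx = compress(ctx)
    return ctx
```

If every pass runs and the context still doesn't fit, the API call itself fails, which is where the 413 fallback takes over.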

Reactive Compact: The 413 Fallback

If a model API call returns a 413 (context too large) error, Reactive Compact triggers. This is different from the proactive mechanisms above — it’s a failure recovery path, not a scheduled optimization.

Reactive Compact is aggressive. It doesn’t wait for a clean summarization opportunity — it needs to get the context under the limit immediately. The resulting compression is less curated than Auto Compact but sufficient to unblock the request. The session continues, possibly with some context quality loss.
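A sketch of the recovery loop: retry the call, compacting harder after each failure. `ContextTooLarge` stands in for the API's 413 response, and the halving heuristic in `emergency_compact` is illustrative of "aggressive and uncurated," not the actual strategy:

```python
class ContextTooLarge(Exception):
    """Stand-in for an HTTP 413 from the model API."""

def emergency_compact(ctx: list[str]) -> list[str]:
    # Crude, uncurated: drop the oldest half and leave a marker.
    half = max(1, len(ctx) // 2)
    return [f"[{half} turns dropped to recover from a 413]"] + ctx[half:]

def call_with_reactive_compact(send, ctx: list[str], max_retries: int = 3):
    """Retry a model call, compacting the context after each 413."""
    for _ in range(max_retries):
        try:
            return send(ctx)
        except ContextTooLarge:
            ctx = emergency_compact(ctx)
    raise RuntimeError("could not shrink context below the API limit")
```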

This is a deliberate tradeoff: an imperfect continuation is better than a hard failure. The 413 path exists so that edge cases (unexpectedly large file reads, unusually verbose tool output, a particularly long reasoning chain) don’t terminate sessions.

Token Budget System

The token budget system is the proactive counterpart to reactive compression. It monitors token usage throughout the session and applies soft limits before hard limits are hit.

The budget has two thresholds:

Warning threshold: Lighter compression mechanisms kick in. Snip becomes more aggressive; Micro compression gets applied to older turns. The model is also informed (via the prompt) that context is limited, which influences its behavior — it becomes more concise, summarizes rather than quotes, avoids requesting large file reads when it doesn’t need them.

Critical threshold: Context Collapse or Auto Compact triggers. The session enters a recovery mode that prioritizes maintaining function over preserving detail.
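The two thresholds can be sketched as fractions of the window. The 0.75 and 0.92 values below are illustrative, not Claude Code's actual numbers:

```python
WARNING = 0.75    # illustrative fraction of the window
CRITICAL = 0.92   # illustrative fraction of the window

def budget_state(tokens_used: int, window: int) -> str:
    """Classify current usage against the soft thresholds."""
    ratio = tokens_used / window
    if ratio >= CRITICAL:
        return "critical"   # trigger Context Collapse or Auto Compact
    if ratio >= WARNING:
        return "warning"    # harder Snip/Micro; warn the model in-prompt
    return "ok"
```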

Token counting happens at every loop iteration (step 2 in the main loop from Ch2). This isn’t expensive — it’s a count, not a re-processing — but it means the budget system has up-to-date information on every turn.

Lazy Injection

Not everything is loaded upfront. Several categories of context are injected lazily — only when needed:

Skills: A skill’s instruction text is only injected into the system prompt when the skill is active. Skills that aren’t triggered in a session consume zero tokens.

MCP instructions: MCP server instruction text is injected when the MCP tool is first used in a session. If a session never calls an MCP tool, that server’s instructions don’t appear in the prompt.

Memory prefetch: If the agent has access to memory files (project-specific CLAUDE.md, user-level configs), these are loaded at session start but their content is only injected into the prompt at the point where they become relevant. The system tracks whether a memory has been “activated” — referenced in a way that makes it relevant — and injects it on first activation.

Tool result budget: Tool results themselves are subject to a budget. Large results get the Snip treatment (described above). Critically, tool results from older turns are subject to Micro compression, which means the model’s memory of what a file contained a hundred turns ago is a summary, not the original content.
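Lazy injection can be sketched as a registry of prompt sections that cost nothing until first activation. The class and method names here are hypothetical:

```python
class LazyPromptSections:
    """Hold prompt sections (skills, MCP instructions, memories) and
    inject each one only after its first activation."""

    def __init__(self) -> None:
        self._available: dict[str, str] = {}
        self._active: list[str] = []   # preserves activation order

    def register(self, name: str, text: str) -> None:
        self._available[name] = text   # costs zero tokens until activated

    def activate(self, name: str) -> None:
        if name in self._available and name not in self._active:
            self._active.append(name)  # first activation injects it

    def render(self) -> str:
        """Assemble only the activated sections into the prompt."""
        return "\n\n".join(self._available[n] for n in self._active)
```

A skill registered but never activated contributes nothing to `render()`, which is the "zero tokens" property described above.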

Why This Architecture

The alternative — aggressively compressing everything upfront to keep the context small — trades one problem for another. A context that’s been over-compressed loses the details needed for accurate reasoning. Bugs get introduced when the model doesn’t have the actual file content, only a summary of it.

Claude Code’s architecture tries to keep the most recently relevant information at full fidelity while compressing older, less relevant content. The lazy injection approach means the budget is spent on what’s actually being used. The tiered compression means degradation is gradual rather than cliff-like.

The 413 fallback exists because even with all of this, edge cases happen. The system is designed to recover gracefully rather than fail hard.


Reference: This chapter draws on Xiao Tan’s (@tvytlx) Claude Code Architecture Deep Dive V2.0 report.