Chapter 5 — The Agentic Loop
What You'll Learn
By the end of this chapter, you will be able to:
- Explain why an iterative loop — rather than a single function call — is the correct primitive for an AI agent that uses tools
- Trace a complete user prompt from entry point to terminal return value, naming every major decision point along the way
- Read the State struct and explain what each of its ten fields tracks and why
- Describe the pre-iteration preparation steps (including snip, microcompact, context collapse, and autocompact) and the order in which they run
- Explain deps.callModel() and what the streaming loop collects from each event
- Walk through all seven continue paths in queryLoop() and give a concrete real-world scenario where each one fires
- Understand what handleStopHooks() does after every turn that ends without tool calls
- Distinguish between runTools and StreamingToolExecutor and explain when each is active
- Explain the role of QueryConfig and QueryDeps in making the loop independently testable
- Read checkTokenBudget() and explain the two stopping conditions it enforces
5.1 Why a Loop? The Fundamental Design Insight
When you interact with a large language model in its simplest form, the exchange is a single round trip. You send a prompt, you receive a text completion, the interaction is over. That model is powerful, but it cannot act on the world. It can describe a shell command; it cannot run one. It can outline a plan to read a file; it cannot open the file and report back what it found.
Claude Code's central architectural insight is that an agent is not a single API call but a process that alternates between two modes: reasoning and acting. The model reasons by producing text. It acts by requesting tool executions — read this file, run this command, search this codebase. Each set of tool results is fed back to the model as new context, enabling the next round of reasoning. This alternation continues until the model produces a final response with no tool calls, at which point the turn is complete.
That alternation is the agentic loop. It is not a recursive function (though earlier versions of this codebase used recursion). It is a while (true) engine with a single mutable State struct, seven distinct paths that call continue to restart the engine, and a small set of conditions that return a terminal value to end it permanently.
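The shape of that engine can be sketched in a few lines. This is a toy model, not the real src/query.ts; the state fields and terminal reasons here are invented for illustration:

```typescript
// Toy sketch of the while(true) engine shape. The real loop has seven
// continue paths and far richer state; this keeps only the skeleton.
type Terminal = { reason: 'completed' | 'max_turns' }

type LoopState = { turnCount: number; pendingToolCalls: number }

function runLoop(initial: LoopState, maxTurns: number): Terminal {
  let state = initial
  while (true) {
    // Terminal conditions return a value and end the loop permanently.
    if (state.turnCount > maxTurns) return { reason: 'max_turns' }
    if (state.pendingToolCalls === 0) return { reason: 'completed' }
    // Otherwise: "execute tools", then restart the engine with a single
    // atomic construction of the next state.
    state = {
      turnCount: state.turnCount + 1,
      pendingToolCalls: state.pendingToolCalls - 1,
    }
    continue
  }
}
```

The essential property is that every path either builds a complete new state and continues, or returns a terminal value; there is no third option.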
The loop lives in src/query.ts, which at 1,730 lines is the largest and most important file in the codebase. Everything else — the React UI, the tool implementations, the permission system, the compaction subsystems — exists to serve or extend this loop.
5.2 query(): The Thin Outer Wrapper
The public entry point to the loop is query() at src/query.ts:219. Its signature is worth understanding in detail:
```typescript
// src/query.ts:219-239
export async function* query(
  params: QueryParams,
): AsyncGenerator<
  | StreamEvent
  | RequestStartEvent
  | Message
  | TombstoneMessage
  | ToolUseSummaryMessage,
  Terminal
> {
  const consumedCommandUuids: string[] = []
  const terminal = yield* queryLoop(params, consumedCommandUuids)
  // Only reached if queryLoop returned normally. Skipped on throw and .return()
  for (const uuid of consumedCommandUuids) {
    notifyCommandLifecycle(uuid, 'completed')
  }
  return terminal
}
```

query() is an async generator function. The yield* operator delegates to queryLoop, forwarding every yielded event to the caller and receiving the terminal return value when queryLoop finishes. This means query() is not just a wrapper — it participates in the generator protocol as a transparent conduit.
The only logic query() adds is the command lifecycle notification. When a user types a slash command that gets queued and later consumed as an attachment mid-turn, that command's UUID is tracked in consumedCommandUuids. When queryLoop completes normally (meaning the model reached a final response without being aborted or erroring), query() walks those UUIDs and fires notifyCommandLifecycle(uuid, 'completed'). The comment explains the asymmetry: if queryLoop throws, this code never runs, producing the "started but not completed" signal that the UI uses to detect interrupted command processing.
The QueryParams type that query() accepts deserves attention:
```typescript
// src/query.ts:181-199
export type QueryParams = {
  messages: Message[]
  systemPrompt: SystemPrompt
  userContext: { [k: string]: string }
  systemContext: { [k: string]: string }
  canUseTool: CanUseToolFn
  toolUseContext: ToolUseContext
  fallbackModel?: string
  querySource: QuerySource
  maxOutputTokensOverride?: number
  maxTurns?: number
  skipCacheWrite?: boolean
  taskBudget?: { total: number }
  deps?: QueryDeps
}
```

messages is the conversation history up to this point — everything the model should treat as prior context. systemPrompt is the structured system prompt, not a plain string; it carries caching annotations and is assembled differently depending on the query source. userContext and systemContext are key-value maps injected at the API level: userContext values are prepended to the first human turn, systemContext values are appended to the system prompt. This lets callers inject dynamic information (current directory, git branch, memory files) without modifying the base message array.
canUseTool is a function that gates individual tool invocations; it is called before each tool execution, not at configuration time, meaning permissions can change mid-turn. toolUseContext is the large context object (covered in Chapter 4) that threads the React store, the abort controller, the agent identity, and other session-scoped values through the loop.
querySource is a discriminated string identifying which code path initiated this query: 'repl_main_thread', 'sdk', an 'agent:...' variant, and so on. Many branch decisions inside the loop check querySource to decide whether to run background side effects, drain command queues, or surface certain events.
deps is the optional dependency injection override, discussed in detail in Section 5.9. When omitted, the production implementations are used.
5.3 The Loop Skeleton: State and while(true)
queryLoop() begins by snapshotting the immutable parts of params into local constants, then constructing the initial State value:
```typescript
// src/query.ts:268-279
let state: State = {
  messages: params.messages,
  toolUseContext: params.toolUseContext,
  maxOutputTokensOverride: params.maxOutputTokensOverride,
  autoCompactTracking: undefined,
  stopHookActive: undefined,
  maxOutputTokensRecoveryCount: 0,
  hasAttemptedReactiveCompact: false,
  turnCount: 1,
  pendingToolUseSummary: undefined,
  transition: undefined,
}
```

The full State type is:
```typescript
// src/query.ts:204-217
type State = {
  messages: Message[]
  toolUseContext: ToolUseContext
  autoCompactTracking: AutoCompactTrackingState | undefined
  maxOutputTokensRecoveryCount: number
  hasAttemptedReactiveCompact: boolean
  maxOutputTokensOverride: number | undefined
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> | undefined
  stopHookActive: boolean | undefined
  turnCount: number
  transition: Continue | undefined // why the previous iteration continued
}
```

Each field serves a specific role in loop coordination.
messages is the accumulated conversation history that grows across turns. It starts as params.messages and at each next_turn continuation it becomes [...messagesForQuery, ...assistantMessages, ...toolResults] — that is, the prior context plus the model's responses plus all tool results from this turn.
toolUseContext can be updated by tool execution. When a tool calls update.newContext, the loop carries the updated context forward. One practical case is the AgentTool: spawning a subagent updates the available agent definitions in the context, which must be visible on the next API call.
autoCompactTracking records whether proactive auto-compaction has fired and tracks how many turns have elapsed since it did. It is used by the autocompact subsystem to decide whether to compact again.
maxOutputTokensRecoveryCount counts how many consecutive recovery attempts have been made after the model hit its output token limit. The loop allows up to three recovery attempts before surfacing the error to the user.
hasAttemptedReactiveCompact is a boolean guard that prevents the reactive compaction path from running more than once per query. Without it, a persistent prompt-too-long error could trigger an infinite spiral of compact-and-retry cycles.
maxOutputTokensOverride controls a one-shot escalation. When a model hits the default 8,192-token output cap and a feature gate permits escalation, the loop retries once with this field set to 64,000. On the escalated retry, the field is set back to undefined.
pendingToolUseSummary is a Promise that resolves to a short human-readable summary of the tool calls that just completed. The summary is generated by a fast model (Haiku) in parallel with the next API call, so it adds zero latency to the turn. It is yielded at the start of the following iteration, just before the API call begins.
stopHookActive is a flag that tells handleStopHooks whether a stop hook already ran on a previous iteration and produced blocking errors. It prevents stop hooks from running again on what is logically a continuation of the same hook-triggered turn.
turnCount starts at 1 and increments at each next_turn continuation. It is compared against maxTurns to implement the API-level turn cap.
transition records why the previous iteration continued, using a discriminated union. On the first iteration it is undefined. On subsequent iterations it carries a reason string — 'next_turn', 'stop_hook_blocking', 'max_output_tokens_recovery', and so on. Tests can inspect transition to assert that specific recovery paths fired without parsing message contents.
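The transition union can be reconstructed from the reasons this chapter names. This is a hedged sketch; the real Continue type in src/query.ts may carry extra payload per reason:

```typescript
// Hedged reconstruction of the transition discriminated union, using only
// the reason strings described in this chapter.
type Continue =
  | { reason: 'next_turn' }
  | { reason: 'collapse_drain_retry' }
  | { reason: 'reactive_compact_retry' }
  | { reason: 'max_output_tokens_escalate' }
  | { reason: 'max_output_tokens_recovery' }
  | { reason: 'stop_hook_blocking' }
  | { reason: 'token_budget_continuation' }

// A test can assert on why the previous iteration continued without
// parsing message contents. Helper name is illustrative.
function isRecoveryPath(t: Continue | undefined): boolean {
  return t !== undefined && t.reason !== 'next_turn'
}
```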
The loop itself opens at src/query.ts:307:
```typescript
// eslint-disable-next-line no-constant-condition
while (true) {
  let { toolUseContext } = state
  const {
    messages,
    autoCompactTracking,
    maxOutputTokensRecoveryCount,
    hasAttemptedReactiveCompact,
    maxOutputTokensOverride,
    pendingToolUseSummary,
    stopHookActive,
    turnCount,
  } = state
  // ... iteration body ...
}
```
}The destructuring at the top of each iteration serves an important purpose. The fields of state are read once into local constants and a reassignable let toolUseContext. This means that within the iteration body, code reads plain names like messages and turnCount rather than state.messages and state.turnCount. At every continue site, the code writes state = { ... } as a single atomic object construction — there are no scattered state.field = value assignments. This pattern makes it trivially checkable whether any state mutation was missed.
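The read-once, write-once discipline can be distilled into a standalone sketch. The field names here are illustrative, not the real State fields:

```typescript
// Sketch of the state discipline described above: destructure once at the
// top of the iteration, then rebuild the whole state in one object literal.
type SketchState = { count: number; label: string }

function iterate(state: SketchState): SketchState {
  // Read fields once into locals...
  const { count, label } = state
  // ...the body works with plain names...
  const nextCount = count + 1
  // ...and the continue site builds the next state as ONE construction,
  // so a missed field is a compile error rather than a silent carry-over.
  return { count: nextCount, label }
}
```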
5.4 Pre-Iteration Preparation: Six Steps
Before making the API call, each loop iteration runs up to six preparation steps that trim or transform the message history. They run in a specific order because each one feeds into the next.
Step 1: Compact boundary extraction. getMessagesAfterCompactBoundary(messages) returns the slice of the conversation since the last auto-compaction event. Everything before the boundary has been replaced by a summary; the model should never see the pre-boundary raw messages again.
Step 2: Tool result budget. applyToolResultBudget() enforces a per-message size budget on tool result content. If a tool produced output larger than its configured maximum, the content is replaced with a truncation notice and the replacement is persisted to disk so that future resume sessions see the same compact version. This step runs before microcompact because microcompact operates on tool-use IDs and never inspects content directly.
Step 3: Snip (HISTORY_SNIP feature gate). When enabled, the snip module removes old turns that exceed a token budget. Unlike compaction, which summarizes content, snipping simply drops old messages from the context window while preserving recent ones. The number of tokens freed is forwarded to the autocompact threshold check so that autocompact's decision reflects what snip already removed.
Step 4: Microcompact. deps.microcompact() performs inline compression of recent tool results, typically by replacing verbose bash output with a condensed version. The compressed messages are cached by tool_use_id so that once a result is compressed it is never recompressed on subsequent iterations. The CACHED_MICROCOMPACT variant uses the Anthropic prompt caching API to delete cached entries server-side; its boundary message is deferred until after the API response so it can report the actual cache_deleted_input_tokens count rather than a client-side estimate.
Step 5: Context collapse projection (CONTEXT_COLLAPSE feature gate). Context collapse is a staged mechanism that marks sections of the conversation as collapsible. Before the API call, the projection step decides which staged collapses to apply, replacing expanded sections with compact placeholders. This is cheaper than full compaction and preserves the granular structure of the conversation.
Step 6: Autocompact. deps.autocompact() checks whether the accumulated token count has crossed the configured threshold. If it has, it triggers a full summarization: the conversation history is compressed into a summary message, a compact boundary is recorded, and the next API call sees only the summary plus recent turns. Autocompact has its own tracking state (autoCompactTracking) that records the turn counter and compact ID so that post-compaction analytics events carry the correct lineage.
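Because each step consumes the previous step's output, the six steps form a simple pipeline over the message array. The following is a shape-only sketch under that assumption; the stage names match the subsystems above, but the bodies are placeholders:

```typescript
// Shape-only sketch of the pre-iteration pipeline. Only stage 2 has a
// (simplified) body here, to show that stages transform and pass along
// the same message array.
type Msg = { id: string; text: string }
type Stage = (msgs: Msg[]) => Msg[]

const MAX_TOOL_RESULT_CHARS = 100 // illustrative budget, not the real one

const pipeline: Stage[] = [
  msgs => msgs, // 1. compact boundary extraction (slice after boundary)
  msgs =>
    msgs.map(m =>
      m.text.length > MAX_TOOL_RESULT_CHARS ? { ...m, text: '[truncated]' } : m,
    ), // 2. tool result budget
  msgs => msgs, // 3. snip (drop old turns over a token budget)
  msgs => msgs, // 4. microcompact (compress recent tool results)
  msgs => msgs, // 5. context collapse projection
  msgs => msgs, // 6. autocompact (full summarization past threshold)
]

function prepare(msgs: Msg[]): Msg[] {
  // Each stage feeds the next, in order.
  return pipeline.reduce((acc, stage) => stage(acc), msgs)
}
```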
5.5 The API Streaming Call
After preparation, the loop makes the API call via deps.callModel(). The call itself is a for await loop over an async generator — the API response streams in as a sequence of typed message events:
```typescript
// src/query.ts:659-863 (condensed for clarity)
for await (const message of deps.callModel({
  messages: prependUserContext(messagesForQuery, userContext),
  systemPrompt: fullSystemPrompt,
  thinkingConfig: toolUseContext.options.thinkingConfig,
  tools: toolUseContext.options.tools,
  signal: toolUseContext.abortController.signal,
  options: {
    model: currentModel,
    fallbackModel,
    querySource,
    maxOutputTokensOverride,
    taskBudget: params.taskBudget
      ? { total: params.taskBudget.total, remaining: taskBudgetRemaining }
      : undefined,
    // ... additional options ...
  },
})) {
  // Withhold recoverable errors
  let withheld = false
  if (contextCollapse?.isWithheldPromptTooLong(message, ...)) withheld = true
  if (reactiveCompact?.isWithheldPromptTooLong(message)) withheld = true
  if (reactiveCompact?.isWithheldMediaSizeError(message)) withheld = true
  if (isWithheldMaxOutputTokens(message)) withheld = true
  if (!withheld) yield yieldMessage
  if (message.type === 'assistant') {
    assistantMessages.push(message)
    const msgToolUseBlocks = message.message.content.filter(
      content => content.type === 'tool_use',
    ) as ToolUseBlock[]
    if (msgToolUseBlocks.length > 0) {
      toolUseBlocks.push(...msgToolUseBlocks)
      needsFollowUp = true
    }
  }
}
```

Several design decisions in this loop deserve careful attention.
prependUserContext wraps messagesForQuery with the dynamic key-value pairs from userContext. These are prepended to the first human turn so the model always sees current values (current working directory, memory file contents, etc.) without requiring the caller to mutate the message array.
The abort signal is threaded directly into callModel. If the user presses Ctrl+C, the abort controller fires, the HTTP request is cancelled, and the for await loop terminates immediately. The outer loop checks toolUseContext.abortController.signal.aborted after the streaming loop to decide whether to return { reason: 'aborted_streaming' }.
Withholding recoverable errors is a key pattern. When the API returns a prompt-too-long (HTTP 413), a media size error, or a max_output_tokens stop reason, the loop does not yield that error to the caller immediately. Instead it sets withheld = true, skips the yield, but still pushes the message into assistantMessages. After the streaming loop ends, the recovery logic inspects assistantMessages.at(-1) to decide whether to compact and retry. If recovery succeeds, the user never sees the error. If all recovery options are exhausted, the withheld error is finally yielded before the function returns.
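The withhold pattern separates what the caller sees from what the recovery logic inspects. A minimal sketch, with invented event shapes:

```typescript
// Sketch of withhold-and-recover: recoverable errors are kept out of the
// yielded stream but retained so post-stream logic can inspect the last
// message and decide whether to compact and retry.
type Event = { type: 'assistant' | 'error'; recoverable?: boolean; text: string }

function splitStream(events: Event[]): { yielded: Event[]; retained: Event[] } {
  const yielded: Event[] = []
  const retained: Event[] = []
  for (const e of events) {
    const withheld = e.type === 'error' && e.recoverable === true
    if (!withheld) yielded.push(e) // user-visible path
    retained.push(e) // recovery logic inspects retained[retained.length - 1]
  }
  return { yielded, retained }
}
```

If recovery succeeds, the retained error is simply dropped; if every recovery option is exhausted, it is yielded at the end, matching the behavior described above.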
needsFollowUp is the boolean that determines which branch the post-stream logic enters. If any assistant message in the stream contained at least one tool_use block, needsFollowUp is true and the loop will execute those tools and continue. If no tool calls arrived, needsFollowUp is false and the loop enters the stop-hook and terminal-check path.
StreamingToolExecutor integration. When the streamingToolExecution feature gate is enabled, tools are started while the model is still streaming. As each tool_use block arrives in the assistant message, streamingToolExecutor.addTool(toolBlock, message) is called immediately. By the time the stream ends, some tools may already have completed. This overlap reduces total latency because tool I/O happens in parallel with the remaining model output rather than sequentially after it.
Model fallback. If deps.callModel throws a FallbackTriggeredError and a fallbackModel was provided, the entire streaming attempt is discarded and retried with the fallback model. Previously yielded assistant messages receive tombstone events so the UI removes them from the display. The fallback mechanism is transparent to the caller: from the outside, the query succeeded with a different model.
5.6 The Seven Continue Paths
The agentic loop has seven distinct paths that call continue — meaning they construct a new State and restart the while (true) body. Each path encodes a specific recovery or continuation strategy. Understanding them collectively is essential to understanding what the loop does in non-trivial situations.
Path 1: Context Collapse Drain Retry
Transition reason: collapse_drain_retry
Source location: src/query.ts:1088-1116
Trigger: The model returned a prompt-too-long error, and the context collapse subsystem has staged collapses available that have not yet been committed.
Context collapse is a progressive mechanism. Over the course of a long conversation, sections of the message history are marked as candidates for collapsing — for example, a long tool output that produced a file listing the model already processed. These candidates accumulate in a "staged" queue. Normally they are committed lazily, one per iteration, to keep context granular. But when a prompt-too-long error occurs, the loop invokes contextCollapse.recoverFromOverflow(), which commits all staged collapses immediately and returns a reduced message array.
If at least one collapse was committed (drained.committed > 0), the loop restarts with the reduced messages and the transition set to collapse_drain_retry. The previous-transition check (state.transition?.reason !== 'collapse_drain_retry') ensures this path fires at most once: if a drain-retry still produced a 413, the loop falls through to the more aggressive reactive compact path on the next iteration.
Concrete scenario. A user asks Claude to analyze a repository and the model has spent ten turns reading large files. The accumulated tool results fill the context window. On the eleventh turn the API returns 413. Context collapse has staged summaries for seven of those file-reading results. The loop commits all seven immediately, the context drops below the limit, and the API call succeeds on retry without the user seeing any error message.
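The at-most-once guard on this path reduces to a small predicate. A hedged sketch of that check, with an illustrative function name:

```typescript
// Sketch of the "fire at most once" guard: the drain retry is taken only
// if staged collapses exist AND the previous iteration did not already
// continue for the same reason.
type Transition = { reason: string } | undefined

function shouldDrainRetry(prev: Transition, stagedCollapses: number): boolean {
  if (stagedCollapses === 0) return false
  // If the last continue was already a collapse_drain_retry, fall through
  // to the heavier reactive-compact path instead of retrying again.
  return prev?.reason !== 'collapse_drain_retry'
}
```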
Path 2: Reactive Compact Retry
Transition reason: reactive_compact_retry
Source location: src/query.ts:1119-1165
Trigger: A prompt-too-long error or a media-size error (oversized image or PDF), and the context collapse drain either failed or was already attempted.
Reactive compaction is the heavier fallback. When triggered, it invokes a full summarization of the conversation history — sending it to the model and asking for a condensed summary — and replaces the accumulated messages with that summary plus recent turns. This is the same mechanism as proactive autocompaction, but initiated reactively in response to an error rather than prophylactically at a token threshold.
The hasAttemptedReactiveCompact field in State prevents the loop from triggering reactive compaction a second time. If the first compact attempt still results in a 413 (which can happen if the preserved tail itself is too large, for example because of large images in recent messages), the loop surfaces the error rather than spiraling.
Concrete scenario. A user has pasted several high-resolution screenshots into the conversation. The cumulative image data exceeds the API's media size limit. The reactive compact path strips the images from historical messages and produces a text summary of what was in them. The API call on the compacted context succeeds.
Path 3: max_output_tokens Escalation
Transition reason: max_output_tokens_escalate
Source location: src/query.ts:1199-1221
Trigger: The model hit its output token limit (the API returned stop_reason: 'max_tokens'), this is the first time in this query, the escalation feature gate is enabled, and maxOutputTokensOverride has not already been set.
The default output cap is 8,192 tokens. When the model hits this limit and the tengu_otk_slot_v1 Statsig gate is enabled, the loop retries the same request with maxOutputTokensOverride set to ESCALATED_MAX_TOKENS (64,000). The key detail is that the message array is unchanged — this is the exact same request with a higher cap, not a continuation prompt. If the escalated request also hits the cap, the loop falls through to the multi-turn recovery path on the subsequent iteration because maxOutputTokensOverride is set to undefined when constructing the escalation state.
Concrete scenario. A user asks Claude to write a comprehensive test suite for a large module. The model begins generating tests and reaches 8,192 output tokens mid-suite. Rather than delivering an incomplete file, the loop silently retries at 64,000 tokens, and the complete test suite is delivered as a single response.
Path 4: max_output_tokens Multi-Turn Recovery
Transition reason: max_output_tokens_recovery
Source location: src/query.ts:1223-1252
Trigger: The model hit its output token limit, the escalation path already fired (or is not enabled), and the recovery count is below the limit of three.
If escalation failed or is not available, the loop injects a recovery prompt as a hidden user message: "Output token limit hit. Resume directly — no apology, no recap of what you were doing. Pick up mid-thought if that is where the cut happened. Break remaining work into smaller pieces." The isMeta: true flag hides this message from the UI. The loop then continues with the partial assistant response plus the recovery prompt as context, asking the model to continue from where it left off.
The maxOutputTokensRecoveryCount counter allows up to three such recovery injections. After three attempts, the withheld error is yielded and the loop exits normally (return { reason: 'completed' }).
Concrete scenario. A model asked to produce a long migration script hits the output limit mid-generation. The loop injects the recovery prompt. The model picks up exactly where it stopped — mid-line, if needed — and continues. This can happen up to three times before the loop gives up, enabling scripts that would otherwise require four separate prompts to be produced in a single turn.
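Taken together, Paths 3 and 4 amount to a small decision procedure: escalate once if permitted, then inject the resume prompt up to three times, then give up. A hedged sketch, with the limits taken from this chapter's description:

```typescript
// Sketch of the max_output_tokens decision order described in Paths 3-4.
// Function and option names are illustrative, not from the codebase.
const MAX_RECOVERY_ATTEMPTS = 3

function onMaxOutputTokens(opts: {
  escalationGateEnabled: boolean
  overrideAlreadySet: boolean
  recoveryCount: number
}): 'escalate' | 'inject_resume_prompt' | 'surface_error' {
  // First preference: the one-shot escalation to the 64k cap.
  if (opts.escalationGateEnabled && !opts.overrideAlreadySet) return 'escalate'
  // Otherwise: up to three hidden resume-prompt injections.
  if (opts.recoveryCount < MAX_RECOVERY_ATTEMPTS) return 'inject_resume_prompt'
  // All options exhausted: yield the withheld error.
  return 'surface_error'
}
```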
Path 5: Stop Hook Blocking Error
Transition reason: stop_hook_blocking
Source location: src/query.ts:1282-1306
Trigger: The model completed its turn without tool calls, handleStopHooks() ran, and at least one stop hook returned a blocking error.
Stop hooks are user-configurable shell scripts that run after every model turn. A hook can return a "blocking error" — output that should be shown back to the model so it can respond to or incorporate the feedback. When handleStopHooks() returns blockingErrors with at least one entry, the loop appends those errors to the message history and continues, presenting them to the model as new user messages.
Note that maxOutputTokensRecoveryCount is reset to 0 on this path, but hasAttemptedReactiveCompact is preserved. The comment in the source records the real bug that motivates this asymmetry: when hasAttemptedReactiveCompact was reset here, compact ran, the compacted context was still too long, the resulting API error triggered a stop-hook blocking retry, and the reset guard allowed compact to run again, producing the same result endlessly.
Concrete scenario. A user has configured a stop hook that runs their test suite after every turn. Claude writes a file and the hook reports that three tests are failing. The loop injects the test failure output as a hidden user message and continues, giving Claude an opportunity to see the failures and fix them without requiring the user to manually paste the test output.
Path 6: Token Budget Continuation
Transition reason: token_budget_continuation
Source location: src/query.ts:1308-1341
Trigger: The TOKEN_BUDGET feature is enabled, the model completed its turn without tool calls, and checkTokenBudget() returned action: 'continue' — meaning the model has used less than 90% of its allocated token budget and diminishing returns have not been detected.
This path is part of the budget-driven long-context completion feature, discussed in detail in Section 5.10. When the model's output token count is below the configured budget threshold, the loop injects a "nudge" message encouraging the model to continue and keep working. The nudge message content is generated by getBudgetContinuationMessage() and includes the current percentage of budget used.
Concrete scenario. A user is running a background summarization agent with a token budget of 500,000 tokens. After processing fifty files, the model returns a status update. The budget checker sees that only 40% of the budget has been used, injects a continuation nudge, and the model continues processing more files until the budget is exhausted or diminishing returns are detected.
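The two stopping conditions can be sketched directly. The real checkTokenBudget() (covered in Section 5.10) may have a different signature; this sketch keeps only the two conditions this chapter names, the 90% threshold and the diminishing-returns signal:

```typescript
// Hedged sketch of the two stopping conditions attributed to
// checkTokenBudget(): stop when the budget is nearly exhausted or when
// diminishing returns are detected; otherwise continue with a nudge.
type BudgetCheck = { action: 'continue' | 'stop'; percentUsed: number }

function checkTokenBudgetSketch(
  usedTokens: number,
  totalBudget: number,
  diminishingReturns: boolean,
): BudgetCheck {
  const percentUsed = (usedTokens / totalBudget) * 100
  if (percentUsed >= 90 || diminishingReturns) return { action: 'stop', percentUsed }
  return { action: 'continue', percentUsed }
}
```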
Path 7: Normal Next Turn
Transition reason: next_turn
Source location: src/query.ts:1715-1727
Trigger: The model's response contained tool calls (needsFollowUp === true), all tools have been executed successfully, and neither the abort signal nor a hook-stop flag prevented continuation.
This is the "happy path" continuation — the ordinary agentic loop cycle. After tools complete, the loop constructs the next state by merging the processed query messages, the assistant responses, and the tool results into a unified message array. The turnCount increments. If maxTurns is set and would be exceeded, the loop yields a max_turns_reached attachment and returns { reason: 'max_turns' } instead of continuing.
```typescript
// src/query.ts:1715-1727
const next: State = {
  messages: [...messagesForQuery, ...assistantMessages, ...toolResults],
  toolUseContext: toolUseContextWithQueryTracking,
  autoCompactTracking: tracking,
  turnCount: nextTurnCount,
  maxOutputTokensRecoveryCount: 0,
  hasAttemptedReactiveCompact: false,
  pendingToolUseSummary: nextPendingToolUseSummary,
  maxOutputTokensOverride: undefined,
  stopHookActive,
  transition: { reason: 'next_turn' },
}
state = next
```

Notice that maxOutputTokensRecoveryCount and hasAttemptedReactiveCompact are both reset on this path. A clean tool execution implies the model produced a valid response, so any previous output-cap recovery state is no longer relevant. The pendingToolUseSummary from this iteration is carried forward so the next iteration can yield it at its start.
5.7 Stop Hooks: End-of-Turn Bookkeeping
handleStopHooks() is called at the end of every turn that ends without tool calls and without an API error. Its responsibility is far broader than its name suggests: it is the post-turn bookkeeping hub that runs a range of background side effects and coordinates the user-configurable stop hook execution.
The function is an async generator in src/query/stopHooks.ts. It yields progress events and attachment messages for the UI to display, and it returns a StopHookResult:
```typescript
type StopHookResult = {
  blockingErrors: Message[]
  preventContinuation: boolean
}
```

The sequence of operations inside handleStopHooks() is:
1. Save cache-safe params. For repl_main_thread and sdk query sources, the current conversation state is serialized as "cache-safe params" and saved to a module-level variable. This snapshot is read by the /btw slash command and the SDK side_question control request, which need to fork a new query from the current context without being on the main loop's call stack.
2. Job classification (TEMPLATES feature gate). When running as a dispatched job (the CLAUDE_JOB_DIR environment variable is set), the full turn history is classified and a state.json file is written. This allows claude list to show current job state without polling.
3. Prompt suggestion. executePromptSuggestion() is fired as a fire-and-forget operation that suggests follow-up prompts for the UI. It is skipped in bare mode (--bare or SIMPLE env var) where scripted callers do not want background activity.
4. Memory extraction (EXTRACT_MEMORIES feature gate). executeExtractMemories() is fired as fire-and-forget in interactive mode, and as an awaitable that is drained before shutdown in non-interactive mode. It is skipped for subagents, which would pollute the main session's memory with sub-task-specific facts.
5. Auto-dream (executeAutoDream). Background consolidation of conversation history for long-running sessions. Skipped for subagents.
6. Computer use cleanup (CHICAGO_MCP feature gate). Releases the computer use process lock and un-hides the desktop after each turn. Skipped for subagents, which never start computer use sessions.
7. Stop hooks execution. executeStopHooks() runs the user-configured stop hook scripts in parallel. Each hook receives the full turn history and can return output (shown in the UI) or a blocking error (fed back to the model). A hook can also set preventContinuation: true, which causes handleStopHooks() to return { blockingErrors: [], preventContinuation: true } and the main loop to return { reason: 'stop_hook_prevented' }.
8. Teammate hooks (isTeammate() only). In team orchestration mode, executeTaskCompletedHooks() runs for any tasks this agent was working on, and executeTeammateIdleHooks() signals that the agent is now available. These hooks follow the same blocking error and prevent-continuation contract as stop hooks.
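How the main loop consumes a StopHookResult can be sketched as a three-way decision. This is an illustrative reduction of the contract described above, not the actual control flow in src/query.ts:

```typescript
// Sketch of the StopHookResult contract: preventContinuation wins over
// everything, blocking errors restart the loop, and an empty result ends
// the turn normally. Message type simplified to string for illustration.
type StopHookResult = { blockingErrors: string[]; preventContinuation: boolean }

function resolveTurnEnd(
  result: StopHookResult,
):
  | { kind: 'return'; reason: 'stop_hook_prevented' | 'completed' }
  | { kind: 'continue'; reason: 'stop_hook_blocking' } {
  if (result.preventContinuation) return { kind: 'return', reason: 'stop_hook_prevented' }
  if (result.blockingErrors.length > 0) return { kind: 'continue', reason: 'stop_hook_blocking' }
  return { kind: 'return', reason: 'completed' }
}
```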
5.8 Tool Execution: runTools and StreamingToolExecutor
Tool execution is gated on the config.gates.streamingToolExecution flag. The two code paths diverge at:
```typescript
// src/query.ts:1380-1382
const toolUpdates = streamingToolExecutor
  ? streamingToolExecutor.getRemainingResults()
  : runTools(toolUseBlocks, assistantMessages, canUseTool, toolUseContext)
```

runTools (sequential path). When streaming tool execution is disabled, runTools at src/services/tools/toolOrchestration.ts receives the complete list of toolUseBlocks that the model requested and executes them after the streaming loop ends. It is an async generator that yields typed update events as each tool completes. The canUseTool function is called before each tool execution; if it returns false, the tool produces an error result instead of running.
StreamingToolExecutor (concurrent path). When streaming tool execution is enabled, a StreamingToolExecutor is created before the API streaming loop begins. As each tool_use block arrives in the stream, streamingToolExecutor.addTool(toolBlock, message) is called immediately. The executor starts running the tool in the background while the model continues to stream. By the time the stream ends, some tools may already have results available. The streaming loop itself periodically calls streamingToolExecutor.getCompletedResults() to yield any results that arrived while the stream was still active. After the stream ends, streamingToolExecutor.getRemainingResults() is called to collect any tools that had not yet finished.
This overlap matters for latency. A model that calls five tools sequentially in its response will have the first tool complete before the model finishes outputting tool five's call parameters. Without streaming execution, all five tools wait until the entire stream ends. With streaming execution, tool one may finish before tool three even starts streaming.
If the streaming attempt fails mid-stream and falls back to a different model, streamingToolExecutor.discard() is called to abandon in-progress tool executions, and a fresh executor is created for the retry. This prevents orphaned tool results (with IDs from the failed attempt) from being appended to the retry's message array.
Both paths yield update objects with two optional fields: update.message (a typed message to yield to the caller and append to toolResults) and update.newContext (an updated ToolUseContext if the tool execution changed session state). The hook_stopped_continuation attachment type in update.message sets shouldPreventContinuation = true, which causes the loop to return { reason: 'hook_stopped' } after all tools complete.
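The concurrent pattern can be sketched as a minimal executor. Everything here is a simplified stand-in: the class name, method names, and `ToolResult` shape mirror the description above, but the real `StreamingToolExecutor` also threads messages, permission checks, and `ToolUseContext`.

```typescript
// Sketch of a concurrent tool executor (assumed shapes, not the real class).
type ToolResult = { toolUseId: string; output: string }

class StreamingExecutorSketch {
  private pending = new Map<string, Promise<ToolResult>>()
  private completed: ToolResult[] = []
  private discarded = false

  // Called as each tool_use block arrives in the stream: start work now,
  // while the model is still streaming the rest of its response.
  addTool(toolUseId: string, run: () => Promise<string>): void {
    const p = run().then(output => {
      const result = { toolUseId, output }
      if (!this.discarded) this.completed.push(result)
      return result
    })
    this.pending.set(toolUseId, p)
  }

  // Called periodically while the stream is active: drain whatever has
  // finished so results can be yielded before the stream ends.
  getCompletedResults(): ToolResult[] {
    const drained = this.completed
    this.completed = []
    return drained
  }

  // Called after the stream ends: wait for everything still in flight.
  async getRemainingResults(): Promise<ToolResult[]> {
    await Promise.all(this.pending.values())
    return this.getCompletedResults()
  }

  // Called on mid-stream model fallback: abandon in-flight results so they
  // are never appended to the retry's message array.
  discard(): void {
    this.discarded = true
  }
}
```

The key design point the sketch preserves: `addTool` begins execution immediately, so by the time `getRemainingResults()` is awaited, fast tools have already finished.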
5.9 QueryConfig and QueryDeps: Testable Dependencies
The loop uses two injected objects — QueryConfig and QueryDeps — to separate concerns and enable testing without mocking the world.
QueryConfig: Immutable Snapshot
// src/query/config.ts
export type QueryConfig = {
sessionId: SessionId
gates: {
streamingToolExecution: boolean
emitToolUseSummaries: boolean
isAnt: boolean
fastModeEnabled: boolean
}
}
export function buildQueryConfig(): QueryConfig {
return {
sessionId: getSessionId(),
gates: {
streamingToolExecution: checkStatsigFeatureGate_CACHED_MAY_BE_STALE(
'tengu_streaming_tool_execution2',
),
emitToolUseSummaries: isEnvTruthy(
process.env.CLAUDE_CODE_EMIT_TOOL_USE_SUMMARIES,
),
isAnt: process.env.USER_TYPE === 'ant',
fastModeEnabled: !isEnvTruthy(process.env.CLAUDE_CODE_DISABLE_FAST_MODE),
},
}
}

QueryConfig is snapshotted once at queryLoop() entry and never mutated. The comment in the source explains the design intent: separating the config snapshot from State and ToolUseContext makes a future "pure reducer" architecture tractable — a function that takes (state, event, config) where config is plain data and event is any of the stream events, with no side effects.
The comment also explains an important exclusion: feature() gates are explicitly kept out of QueryConfig. The feature() function is a compile-time tree-shaking boundary. For the bundler's dead-code elimination to work, the calls to feature('...') must appear inline at the guarded blocks, not be extracted into a config object. Moving them would break the external build that strips enterprise-only features.
The CACHED_MAY_BE_STALE suffix on checkStatsigFeatureGate_CACHED_MAY_BE_STALE acknowledges that Statsig values may be one fetch cycle stale. The comment notes that since these are already admitted as potentially stale, snapshotting them once per query() call stays within the existing staleness contract. Calling the gate-check function once per loop iteration (potentially hundreds of calls) versus once at entry produces no meaningful freshness improvement.
QueryDeps: I/O Dependency Injection
// src/query/deps.ts
export type QueryDeps = {
callModel: typeof queryModelWithStreaming
microcompact: typeof microcompactMessages
autocompact: typeof autoCompactIfNeeded
uuid: () => string
}
export function productionDeps(): QueryDeps {
return {
callModel: queryModelWithStreaming,
microcompact: microcompactMessages,
autocompact: autoCompactIfNeeded,
uuid: randomUUID,
}
}

QueryDeps captures the four I/O dependencies that tests most commonly need to stub: the model API call, the two compaction functions, and UUID generation. The comment explains the motivation: six to eight test files each used spyOn boilerplate to intercept these module-level functions. With QueryDeps, a test can pass a deps override directly into QueryParams and provide fake implementations without touching the module system.
The typeof fn pattern is deliberate. If the real function's signature changes, the QueryDeps type changes automatically because it is derived from the actual implementation type rather than a hand-written duplicate. A type mismatch caught by the compiler is far cheaper than a runtime failure in a test that was relying on a now-stale manual type declaration.
The comment notes that the scope is "intentionally narrow (4 deps)" as proof of the pattern. The architecture leaves room for future additions — runTools, handleStopHooks, logEvent — to be added as the test coverage requirements grow, without a big-bang refactor.
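The payoff shows up on the test side. The following is a hypothetical sketch of what a fake deps object looks like; the type and function signatures are simplified stand-ins, not the real module signatures.

```typescript
// Sketch: test-side deps with fakes (simplified shapes, assumed for
// illustration; the real QueryDeps derives its types via `typeof`).
type FakeModelEvent = { type: 'text'; text: string }

type QueryDepsSketch = {
  callModel: (prompt: string) => AsyncGenerator<FakeModelEvent>
  microcompact: (messages: string[]) => string[]
  autocompact: (messages: string[]) => string[]
  uuid: () => string
}

function testDeps(): QueryDepsSketch {
  let n = 0
  return {
    // Deterministic model: streams one canned text event per call.
    callModel: async function* (prompt) {
      yield { type: 'text', text: `echo:${prompt}` }
    },
    // Identity compaction: tests opt out of compaction behavior entirely.
    microcompact: messages => messages,
    autocompact: messages => messages,
    // Stable UUIDs make message snapshots reproducible across runs.
    uuid: () => `uuid-${n++}`,
  }
}
```

No spyOn, no module interception: the loop receives these fakes through the same parameter it receives production deps through.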
5.10 The Token Budget Module
The token budget feature (TOKEN_BUDGET feature gate) allows callers to specify a maximum number of output tokens for a query and instructs the loop to continue generating output until that budget is exhausted or diminishing returns are detected. It is used for background summarization agents that should generate as much output as possible within a given cost envelope.
// src/query/tokenBudget.ts (key constants)
const COMPLETION_THRESHOLD = 0.9 // continue if under 90% of budget
const DIMINISHING_THRESHOLD = 500 // stop if marginal gain < 500 tokens

The BudgetTracker type records state across continuations:
export type BudgetTracker = {
continuationCount: number // how many times the budget path has fired
lastDeltaTokens: number // token delta in the most recent continuation
lastGlobalTurnTokens: number // total tokens at the last check
startedAt: number // wall-clock time for duration tracking
}

The checkTokenBudget() function makes the continue/stop decision:
export function checkTokenBudget(
tracker: BudgetTracker,
agentId: string | undefined,
budget: number | null,
globalTurnTokens: number,
): TokenBudgetDecision {
// Subagents bypass budget continuation — they have their own turn limits
if (agentId || budget === null || budget <= 0) {
return { action: 'stop', completionEvent: null }
}
  const turnTokens = globalTurnTokens
const pct = Math.round((turnTokens / budget) * 100)
const deltaSinceLastCheck = globalTurnTokens - tracker.lastGlobalTurnTokens
// Diminishing returns: continuation count >= 3 AND both the last delta
// and this delta are below 500 tokens
const isDiminishing =
tracker.continuationCount >= 3 &&
deltaSinceLastCheck < DIMINISHING_THRESHOLD &&
tracker.lastDeltaTokens < DIMINISHING_THRESHOLD
if (!isDiminishing && turnTokens < budget * COMPLETION_THRESHOLD) {
// Under 90% budget and not diminishing: continue
tracker.continuationCount++
return { action: 'continue', nudgeMessage: ..., pct, turnTokens, budget }
}
// Over 90% budget, or diminishing: stop
return { action: 'stop', completionEvent: { ... } }
}

Two independent conditions stop the budget continuation.
The first is the completion threshold: when the accumulated output token count reaches 90% of the total budget, the loop stops continuing regardless of diminishing returns. This ensures the agent does not overspend the budget.
The second is the diminishing returns check: after at least three continuations, if both the most recent delta and the current delta are below 500 tokens, the agent is producing negligible additional output. Continuing would consume API quota without meaningfully increasing the result. The loop stops early.
The completionEvent on the stop decision carries analytics data (continuation count, percentage reached, whether it was a diminishing-returns stop, total duration) that is logged to Statsig for product analytics on how the feature is being used.
Subagents are explicitly excluded (if (agentId || ...) short-circuits). A subagent launched inside a budget-driven turn has its own turn limits via maxTurns; giving it an independent token budget continuation would create uncontrolled spending with no ceiling.
Key Takeaways
The agentic loop in src/query.ts is the central engine of Claude Code. Everything else in the codebase exists to serve it or extend it. Several architectural principles run through the entire implementation.
Iteration over recursion. The loop is a while (true) with a mutable State struct rather than a recursive function. This makes the call stack depth constant regardless of how many tool-use turns occur. It also makes state mutations visible as explicit state = { ... } assignments at continue sites, rather than implicit recursive arguments.
State as an atomic value. Every continue site constructs a complete new State value in a single expression. There are no scattered field mutations. This makes it straightforward to verify that no state was accidentally carried over or forgotten.
transition as observable intent. Recording why each iteration continued in state.transition.reason makes the loop's behavior testable without inspecting message contents. A test that needs to verify the max-output-tokens recovery path fired can check state.transition.reason === 'max_output_tokens_recovery' rather than parsing the injected user message.
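These three takeaways combine into a recognizable loop shape. The following toy version (invented names; the real State has ten fields) shows the while(true), the whole-value state replacement, and the recorded transition reason:

```typescript
// Toy version of the loop shape: while(true) plus atomic state replacement.
type LoopState = {
  iteration: number
  transition: { reason: string } | null
}

function runLoopSketch(maxIterations: number): LoopState {
  let state: LoopState = { iteration: 0, transition: null }
  while (true) {
    if (state.iteration >= maxIterations) {
      // Terminal path: return rather than continue.
      return state
    }
    // Every continue site builds a complete new State in one expression,
    // recording why the loop is going around again. No field is mutated
    // in place, so nothing stale can leak into the next iteration.
    state = {
      iteration: state.iteration + 1,
      transition: { reason: 'tool_results' },
    }
    continue
  }
}
```

A test can assert on `transition.reason` without inspecting any message contents, which is exactly the observability point made above.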
Withholding before surfacing. Recoverable errors — prompt-too-long, max-output-tokens, media size — are withheld from the caller during streaming, so recovery can happen silently. Only when all recovery paths are exhausted is the error yielded. This design keeps error handling internal to the loop rather than forcing callers to implement retry logic.
Dependency injection at the seam. QueryDeps captures the four I/O-touching operations that tests most commonly need to stub. The typeof fn pattern keeps types synchronized with implementations automatically. The scope is narrow by design — four deps covers the vast majority of test scenarios without over-engineering the interface.
Generator protocol as the delivery mechanism. The loop is an async generator. It yields intermediate events (progress messages, tool results, system messages) as they arrive, rather than buffering everything and returning at the end. This enables the UI to update in real time: the user sees the model's output character by character, sees tool results as they complete, and sees stop-hook progress as it runs.