Prompt → Context → Harness: three layers of LLM engineering

Engineering essay

Three layers of the LLM stack, each wrapping the one below.

For two years, “prompt engineering” was the whole job. Then the field grew up. Models got better at parsing imperfect instructions, and the harder problems moved up the stack — to what the model can actually see when it’s deciding, and what happens around the call. The names that stuck for those upper layers: context engineering and harness engineering.

This essay is a side-by-side. Same task, sliced by which layer owns it. Real code (Anthropic SDK in Python), and a visual for each layer that maps directly onto what changes in the code.

The same task, three layers

At each layer, the unit of work gets larger. At Layer 1, you have a single API call and a string. At Layer 2, that call is wrapped in retrieval and history. At Layer 3, it’s wrapped in a loop that runs tools and decides when to stop.

Visual: three nested layers. Prompt holds system, user, model. Context adds sys, tools, retrieved docs, and history around the user turn. Harness wraps it all in the agent loop: model ↔ tool ↔ model ↔ … → done.

Layer 1 · Words

Prompt Engineering

The single string the model reads on one call. Roles, instructions, examples, output format — stitched into one sequence of tokens.

The whole “prompt” is just a list of role-tagged messages. The provider serializes them and the model attends to all of them at once. There’s no memory between calls — every request is stateless. Whatever you want the model to know, it has to be in the messages array.

from anthropic import Anthropic

client = Anthropic()

# diff: the PR diff as a string, loaded elsewhere
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are a senior code reviewer. Be terse.",
    messages=[
        {"role": "user",
         "content": "Review this PR for race conditions:\n" + diff},
    ],
)

What you control here: the system string, the shape and content of messages, sampling params (temperature, top_p), and any structured output schema. That’s it — every other layer builds on top of these knobs.

A “well-engineered prompt” today usually means: a tight system message, a couple of few-shot examples when the task is underspecified, and a strict output schema. Reasoning scaffolds (CoT, ReAct) where they pay off. Nothing exotic.
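
Concretely, that fits in one call. A minimal sketch, assuming example_diff holds an earlier diff and diff the current one; the few-shot pair and the JSON contract are invented for illustration, and the schema here is enforced only by instruction (forcing a tool call against a JSON schema is the stricter route):

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    temperature=0.2,  # low temperature: review output should be stable
    system=(
        "You are a senior code reviewer. Be terse.\n"
        "Reply with JSON only: "
        '{"findings": [{"line": int, "issue": str, "severity": "low|med|high"}]}'
    ),
    messages=[
        # Few-shot pair: shows the model the exact shape we expect back.
        {"role": "user",
         "content": "Review this PR for race conditions:\n" + example_diff},
        {"role": "assistant",
         "content": '{"findings": [{"line": 42, "issue": '
                    '"TOCTOU between stat() and open()", "severity": "high"}]}'},
        # The real request, same format as the example.
        {"role": "user",
         "content": "Review this PR for race conditions:\n" + diff},
    ],
)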

Animation: assistant tokens streaming into the messages.json panel one by one—Looks, like, a, TOCTOU, on, line, 42—with system / user / assistant role chips on the left.

Layer 2 · Information

Context Engineering

What the model can see on a single call. The prompt is part of it. Retrieved documents, tool results, summarized history, and structured data are the rest. Even a 1M-token window is a finite budget you have to allocate.

Every call you make is implicitly answering: given everything I could put in this window, what should actually go in it? Add too little and the model hallucinates. Add too much and you hit lost-in-the-middle, latency, and cost. Order matters too — models attend more strongly to what’s near the user turn.

# Context engineering = building this list, every turn
context = [
    *summarize(history, max_tokens=2000),  # compaction: older turns -> one summary
    *retrieve(query, k=8),                 # RAG: top-K chunks for this query
    *recent_tool_results,                  # freshest first
    user_turn,                             # anchored last
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=system_prompt,                  # static; the API takes it separately
    messages=context,
    tools=tool_specs,
)

The retrieve step is doing real work — embedding the query, hitting a vector store, optionally re-ranking, returning the top-K chunks that might matter. The summarize step compacts older turns. Get either wrong and the model is answering with the wrong base rate — even though the prompt and the model haven’t changed.
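
Here is what those two helpers can look like, as a sketch under loud assumptions: embed() stands in for whatever embedding endpoint you call, doc_vectors / doc_chunks are a precomputed in-memory index with unit-norm vectors, and the summarization prompt is illustrative:

import numpy as np

# Assumed precomputed elsewhere:
#   doc_vectors: (n_docs, dim) array of unit-norm chunk embeddings
#   doc_chunks:  the matching list of n_docs text chunks
#   embed():     stand-in for your embedding endpoint

def retrieve(query: str, k: int = 8) -> list[dict]:
    """Top-K chunks by cosine similarity, packaged as one context message."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = doc_vectors @ q                 # cosine, since rows are unit-norm
    top = np.argsort(scores)[::-1][:k]
    body = "\n\n".join(doc_chunks[i] for i in top)
    return [{"role": "user",
             "content": f"<retrieved_docs>\n{body}\n</retrieved_docs>"}]

def summarize(history: list[dict], max_tokens: int = 2000) -> list[dict]:
    """Compact older turns into one summary; keep the recent tail verbatim."""
    head, tail = history[:-6], history[-6:]
    if not head:
        return tail
    summary = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=max_tokens,
        system="Summarize this conversation. Keep decisions, open questions, and file paths.",
        messages=[{"role": "user", "content": str(head)}],  # crude serialization; fine for a sketch
    ).content[0].text
    return [{"role": "user", "content": f"<summary>\n{summary}\n</summary>"}, *tail]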

Animation: a 200K-token context window bar with colored segments (system, tools, docs, history, user) resizing through four phases as retrieval and history grow and shrink.
Animation: a pulse traveling through the RAG path — query, embed, vector DB (cosine top-K), top 8, ctx (+44k tokens added).

Layer 3 · The runtime

Harness Engineering

The runtime around the call: the agent loop, tool execution, sub-agents, hooks, sandboxing, evals, observability. A chat completion API plus a harness becomes Claude Code, Cursor, Aider, OpenCode.

The harness turns “one model call” into “a session that does work.” It owns the loop (model → tool → result → model), the step budget, retries, the sandbox the tools run in, the hooks that intercept every step, and the trace you can replay later. Most production complexity lives here, not in the prompt.

messages = [user_turn]
turn = 0

while turn < max_turns:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=messages,
        tools=tool_specs,
    )
    messages.append({"role": "assistant", "content": resp.content})

    if resp.stop_reason != "tool_use":
        break  # the model ended its turn without requesting a tool

    results = []
    for block in resp.content:
        if block.type != "tool_use":
            continue
        try:
            run_hooks("pre_tool", block)   # hooks may veto by raising
            result = sandbox(tools[block.name])(**block.input)
        except Exception as e:
            result = error_payload(e)      # the model sees the failure and can adapt
        run_hooks("post_tool", block, result)
        results.append({"type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result})
    messages.append({"role": "user", "content": results})

    turn += 1

Notice what the harness owns that the model doesn’t: the while, the sandbox(...), the run_hooks(...), the budget check, the error recovery. Hooks let you intercept every tool call — for permission prompts, logging, secret scrubbing, cost limits. Sub-agents are this same loop nested with isolated context.
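
Here is the hook side as a sketch, since it carries most of the safety story. The registry, the ALLOWED_TOOLS set, and the dict-shaped tool results are assumptions for illustration, not any particular framework's API:

import re
from collections import defaultdict

hooks = defaultdict(list)  # event name -> list of callables

def on(event):
    def register(fn):
        hooks[event].append(fn)
        return fn
    return register

def run_hooks(event, *args):
    for fn in hooks[event]:
        fn(*args)  # a hook can observe, mutate its args, or raise to veto

@on("pre_tool")
def enforce_allowlist(block):
    # ALLOWED_TOOLS: a session-level policy set (assumed). Raising vetoes the
    # call; the loop above converts it into an error payload the model sees.
    if block.name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {block.name!r} is not allowed in this session")

@on("post_tool")
def scrub_secrets(block, result):
    # Assumes tool results are dicts with a "content" string.
    result["content"] = re.sub(r"sk-[A-Za-z0-9]{20,}", "[REDACTED]",
                               result["content"])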

Animation: the agent loop — a dot bouncing user → model ↔ tool / tool / tool → end_turn, with the model in the centre as the orange-bordered hub.
Animation: a claude session log fading in line by line — user request, llm thinking, tool reads/greps/edits, pytest passes, llm done.

Side-by-side

The same task, sliced by which layer owns it.

Prompt

  • Unit: one model call
  • Lever: messages.create()
  • Fails as: bad single output, format collapse, refusal
  • Mental model: writing instructions

Context

  • Unit: the token window
  • Lever: retrieve(), summarize()
  • Fails as: wrong / stale info, lost-in-the-middle
  • Mental model: filling a backpack

Harness

  • Unit: the full session
  • Lever: while turn < max_turns:
  • Fails as: loops, runaway cost, sandbox escape
  • Mental model: operating a kitchen

— Archit, May 2026
