Engineering essay
Prompt → Context → Harness
Three layers of the LLM stack, each wrapping the one below.
For two years, “prompt engineering” was the whole job. Then the field grew up. Models got better at parsing imperfect instructions, and the harder problems moved up the stack — to what the model can actually see when it’s deciding, and what happens around the call. The names that stuck for those upper layers: context engineering and harness engineering.
This essay is a side-by-side. Same task, sliced by which layer owns it. Real code (Anthropic SDK in Python), and a visual for each layer that maps directly onto what changes in the code.
The same task, three layers
At each layer, the unit of work gets larger. At Layer 1, you have a single API call and a string. At Layer 2, that call is wrapped in retrieval and history. At Layer 3, it’s wrapped in a loop that runs tools and decides when to stop.
Layer 1 · Words
Prompt Engineering
The single string the model reads on one call. Roles, instructions, examples, output format — stitched into one sequence of tokens.
The whole “prompt” is just a list of role-tagged messages. The provider serializes them and the model attends to all of them at once. There’s no memory between calls — every request is stateless. Whatever you want the model to know, it has to be in the messages array.
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system="You are a senior code reviewer. Be terse.",
    messages=[
        {"role": "user",
         "content": "Review this PR for race conditions:\n" + diff},
    ],
)
What you control here: the system string, the shape and content of messages, sampling params (temperature, top_p), and any structured output schema. That’s it — every other layer builds on top of these knobs.
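Statelessness is worth making concrete: the API never remembers a previous call, so the client has to carry and resend the full history itself. A minimal sketch, with a hypothetical `call_model` stub standing in for `client.messages.create`:

```python
# Statelessness in practice: the client, not the API, carries the history.
# `call_model` is a stand-in stub, not the real SDK call.
def call_model(messages):
    # pretend the model reports how many messages it can see this call
    return {"role": "assistant",
            "content": f"I can see {len(messages)} messages"}

history = []

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # the FULL history goes up on every call
    history.append(reply)
    return reply

send("First question")
reply = send("Follow-up")  # works only because history was resent
```

Drop the `history.append` calls and every request becomes a cold start: the model has no way to recover the earlier turns.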
A “well-engineered prompt” today usually means: a tight system message, a couple of few-shot examples when the instructions alone underspecify the task, and a strict output schema. Reasoning scaffolds (CoT, ReAct) where they pay off. Nothing exotic.
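That recipe can be sketched for the code-review task above. The JSON schema and the few-shot pair are illustrative assumptions, not part of the essay:

```python
import json

# Tight system message + strict output schema (schema is an assumption).
SYSTEM = (
    "You are a senior code reviewer. Be terse. "
    'Respond ONLY with JSON: {"severity": "low|med|high", "issue": "<str>"}'
)

# One few-shot pair showing the expected shape (contents are illustrative).
FEW_SHOT = [
    {"role": "user", "content": "Review: counter += 1 in two threads"},
    {"role": "assistant",
     "content": json.dumps({"severity": "high",
                            "issue": "unsynchronized increment"})},
]

def build_prompt(diff: str) -> list:
    """Few-shot examples first, the real task anchored last."""
    return FEW_SHOT + [
        {"role": "user",
         "content": "Review this PR for race conditions:\n" + diff},
    ]

messages = build_prompt("...")
# then: client.messages.create(system=SYSTEM, messages=messages, ...)
```

The few-shot pair pins down the output format more reliably than prose instructions alone; the schema in the system message makes deviations easy to detect and retry.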

Layer 2 · Information
Context Engineering
What the model can see on a single call. The prompt is part of it. Retrieved documents, tool results, summarized history, and structured data are the rest. Even a 1M-token window is a finite budget you have to allocate.
Every call you make is implicitly answering: given everything I could put in this window, what should actually go in it? Add too little and the model hallucinates. Add too much and you hit lost-in-the-middle, latency, and cost. Order matters too — models attend more strongly to what’s near the user turn.
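One way to make the finite-budget framing concrete is a hard per-section token allocation. The numbers and the 4-chars-per-token estimate below are assumptions for illustration; a real system would use the provider's tokenizer and smarter truncation than a hard cut:

```python
CHARS_PER_TOKEN = 4  # crude heuristic, NOT a real tokenizer

def trim(text: str, max_tokens: int) -> str:
    """Hard-truncate a section to its token allocation."""
    return text[: max_tokens * CHARS_PER_TOKEN]

# Insertion order mirrors the ordering rule: static first, user turn last.
# Per-section numbers are illustrative, not a recommendation.
BUDGET = {
    "system": 500,
    "history_summary": 2_000,
    "retrieved_docs": 4_000,
    "user_turn": 1_000,
}

def pack(sections: dict) -> list:
    return [trim(sections[name], BUDGET[name]) for name in BUDGET]

packed = pack({
    "system": "You are a senior code reviewer.",
    "history_summary": "H" * 100_000,  # oversized: gets cut to budget
    "retrieved_docs": "D" * 100_000,   # oversized: gets cut to budget
    "user_turn": "Review this diff.",
})
```

The point is not the truncation itself but that the allocation is explicit: when the window fills up, you decide which section gives ground, instead of letting the provider silently reject the request.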
# Context engineering = building this list, every turn
context = [
    *summarize(history, max_tokens=2000),  # compaction
    *retrieve(query, k=8),                 # RAG
    *recent_tool_results,                  # freshest first
    user_turn,                             # anchored last
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=system_prompt,  # static; Anthropic takes system separately
    messages=context,
    tools=tool_specs,
)
The retrieve step is doing real work — embedding the query, hitting a vector store, optionally re-ranking, returning the top-K chunks that might matter. The summarize step compacts older turns. Get either wrong and the model is answering with the wrong base rate — even though the prompt and the model haven’t changed.
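To show the shape of that work without standing up a vector store, here is a toy `retrieve` that scores by word overlap instead of embeddings. Everything about it is a stand-in: a real system would embed the query, hit a vector store, and optionally re-rank, as the paragraph above says:

```python
def retrieve(query: str, docs: list, k: int = 8) -> list:
    """Toy retrieval: rank docs by word overlap with the query.
    Stand-in for embed -> vector store -> re-rank."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

docs = [
    "mutex protects the shared counter",
    "CSS grid layout tips",
    "race condition in the counter increment",
]
top = retrieve("race condition counter", docs, k=2)
```

Swapping the scoring function is exactly where "get it wrong and the model answers with the wrong base rate" bites: the prompt and model are unchanged, but the top-K chunks the model sees are different.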


Layer 3 · The runtime
Harness Engineering
The runtime around the call: the agent loop, tool execution, sub-agents, hooks, sandboxing, evals, observability. A chat completion API plus a harness becomes Claude Code, Cursor, Aider, OpenCode.
The harness turns “one model call” into “a session that does work.” It owns the loop (model → tool → result → model), the step budget, retries, the sandbox the tools run in, the hooks that intercept every step, and the trace you can replay later. Most production complexity lives here, not in the prompt.
messages = [user_turn]
turn = 0
while turn < max_turns:
    resp = client.messages.create(
        model="claude-opus-4-7",
        messages=messages,
        tools=tool_specs,
    )
    messages.append(resp.assistant_msg)
    if resp.stop_reason == "end_turn":
        break
    for tc in resp.tool_calls:
        run_hooks("pre_tool", tc)
        try:
            result = sandbox(tools[tc.name])(**tc.args)
        except Exception as e:
            result = error_payload(e)
        run_hooks("post_tool", tc, result)
        messages.append(tool_msg(tc.id, result))
    turn += 1
Notice what the harness owns that the model doesn’t: the while, the sandbox(...), the run_hooks(...), the budget check, the error recovery. Hooks let you intercept every tool call — for permission prompts, logging, secret scrubbing, cost limits. Sub-agents are this same loop nested with isolated context.
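A minimal sketch of that hook mechanism: a registry of callbacks keyed by event name, run around every tool call. The secret-scrubbing hook and the `sk-` key pattern are illustrative assumptions, not from the essay:

```python
import re

# Registry of callbacks, keyed by event name.
HOOKS = {"pre_tool": [], "post_tool": []}

def register(event, fn):
    HOOKS[event].append(fn)

def run_hooks(event, *args):
    """Run every registered callback for this event, in order."""
    for fn in HOOKS[event]:
        fn(*args)

def scrub_secrets(tc, result):
    # post_tool hook: redact anything shaped like an API key (assumed pattern)
    result["output"] = re.sub(r"sk-[A-Za-z0-9]+", "[REDACTED]", result["output"])

register("post_tool", scrub_secrets)

result = {"output": "token is sk-abc123"}
run_hooks("post_tool", None, result)  # tc unused by this hook
```

The same registry carries permission prompts, logging, and cost limits: a `pre_tool` hook that raises aborts the call, which is how a harness says no without touching the model.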


Side-by-side
The same task, sliced by which layer owns it.
Prompt
- Unit: one model call
- Lever: messages.create()
- Fails as: bad single output, format collapse, refusal
- Mental model: writing instructions

Context
- Unit: the token window
- Lever: retrieve(), summarize()
- Fails as: wrong / stale info, lost-in-the-middle
- Mental model: filling a backpack

Harness
- Unit: the full session
- Lever: while turn < max_turns:
- Fails as: loops, runaway cost, sandbox escape
- Mental model: operating a kitchen
— Archit, May 2026