Prompt → Context → Harness: three layers of LLM engineering

Engineering essay · 8 min read

Prompt → Context → Harness

Why building with LLMs feels less like prompt writing and more like software engineering, every year.

TL;DR

  • Prompt is the string the model reads in one call.
  • Context is everything in the model’s window. Prompt plus retrieved docs, history, tool results.
  • Harness is the runtime around the call. Agent loop, tools, hooks, sandbox.
  • Each layer wraps the one below. Writing the prompt is roughly 20% of the work. The rest is software engineering.

Side-by-side

Same task, sliced by which layer owns it. The rest of the essay fills this in.

Prompt

  • Unit: one model call
  • Lever: prompt | model
  • Fails as: bad single output, format collapse, refusal
  • Mental model: writing instructions

Context

  • Unit: the token window
  • Lever: retriever.invoke(), summarize()
  • Fails as: wrong or stale info, lost-in-the-middle
  • Mental model: filling a backpack

Harness

  • Unit: the full session
  • Lever: while turn < max_turns:
  • Fails as: loops, runaway cost, sandbox escape
  • Mental model: operating a kitchen

In 2023, “prompt engineer” was a job title. Anthropic posted a listing in the $175K to $335K range, others followed at similar levels. Twitter filled with collections of golden prompts. Courses appeared promising the secret incantations. For a moment it looked like talking to a model was its own discipline, separate from software engineering, accessible to anyone who could write English well. The hype was real, and a lot of it was earned. The first time you saw a clever system prompt make a model behave the way you wanted, it felt like magic.

Two years on, the people I see actually shipping AI features for a living don’t spend most of their time on the prompt. They write code. They think about retrieval, context windows, tool execution, observability, evals, retries, sandboxing. The prompt is part of the work, the same way SQL is part of building a backend. Important. Worth getting right. But a slice of the practice, not the whole thing.

The same task, three layers

At each layer, the unit of work gets larger. At Layer 1, you have a single API call and a string. At Layer 2, that call is wrapped in retrieval and history. At Layer 3, it sits inside a loop that runs tools and decides when to stop.

Prompt
system
user
model
Context
sys
tools
retrieved docs
history
user
model
Harness
agent loop · model ↔ tool ↔ model ↔ …
tool
tool
done

Layer 1 · Words

Prompt Engineering

The single string the model reads on one call. Roles, instructions, examples, output format. One sequence of tokens.

This is the layer that everyone knows. The whole “prompt” is a list of role-tagged messages. The provider serializes them and the model attends to all of them at once. By default, every request is stateless: there’s no memory between calls. Whatever you want the model to know on a given turn, it has to be in the messages array for that turn.

In LangChain, this is a ChatPromptTemplate piped into a model.

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
("system", "You are a senior code reviewer. Be terse."),
("user", "Review this PR for race conditions:\n\n{diff}"),
])
model = ChatAnthropic(model="claude-opus-4-7", max_tokens=1024)
chain = prompt | model
response = chain.invoke({"diff": diff})
print(response.content)

What you control here: the prompt template (system and user messages with placeholder variables), model parameters in the ChatAnthropic constructor (temperature, top_p, max_tokens), and any structured output schema via model.with_structured_output(...). That’s it. Every other layer builds on top of these knobs.

“Prompt engineering” in the 2023 Twitter sense lives entirely inside this layer. A tight system message, a couple of few-shot examples when the task underspecifies, a strict output schema, a reasoning scaffold (CoT, ReAct) where it pays off. Roughly 20% of building anything that ships. The bigger 80% is what we get to next.

Prompts are roughly 20% of the job. The other 80% is software engineering.

Animation: assistant tokens streaming into the messages.json panel one by one with system / user / assistant role chips on the left.
A prompt is a flat token stream. The model attends to all of it at once; the visualization shows it left to right.

Layer 2 · Information

Context Engineering

What the model can see on a single call. The prompt is part of it. Retrieved documents, tool results, summarized history, structured data are the rest. Even a 1M-token window is a finite budget you have to allocate.

Every call you make is implicitly answering one question. Given everything I could put in this window, what should actually go in it? Add too little and the model hallucinates. Add too much and you hit lost-in-the-middle, latency, and cost. Order matters too. Models attend more strongly to what’s near the user turn.

This is the layer where software engineering really shows up. Retrieval pipelines, vector stores, embedding models, hybrid search, re-ranking, summarization, sliding windows, hierarchical memory, dealing with prompt injection inside retrieved content. Each of those is a sub-discipline with its own failure modes. None of them are about wording instructions. If you can build a backend, you can build this.

In LangChain, the work becomes a chain that fills a multi-message template from a retriever and a history compactor.

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
# Context engineering = filling this template, every turn
prompt = ChatPromptTemplate.from_messages([
("system", "{system_prompt}"),
MessagesPlaceholder("history"), # summarized prior turns
("system", "Relevant docs:\n{retrieved}"), # RAG output
("system", "Recent tool results:\n{tool_results}"), # freshest first
("user", "{user_turn}"), # anchored last
])
def format_docs(docs):
return "\n\n".join(d.page_content for d in docs)
chain = (
RunnablePassthrough.assign(
history=lambda x: summarize(x["raw_history"], max_tokens=2000),
retrieved=lambda x: format_docs(retriever.invoke(x["user_turn"])),
)
| prompt
| model.bind_tools(tool_specs)
)
response = chain.invoke({
"system_prompt": SYSTEM,
"raw_history": history,
"tool_results": recent_tool_results,
"user_turn": user_question,
})

The retriever is doing real work. Embedding the query, hitting a vector store, optionally re-ranking, returning the top-K chunks that might matter. The summarize step compacts older turns. Get either wrong and the model is answering with the wrong base rate, even though the prompt template and the model haven’t changed.

Animation: a 200K-token context window bar with colored segments resizing through four phases as retrieval and history grow and shrink.
The 200K-token window is a budget. Doubling retrieval doesn’t double accuracy past a point.
Animation: a pulse traveling through the RAG path: query, embed, vector DB (cosine top-K), top 8, ctx (+44k tokens added).
The RAG path: a query becomes an embedding, hits the vector store, top-K chunks land in the context.

Layer 3 · The runtime

Harness Engineering

The runtime around the call. The agent loop, tool execution, sub-agents, hooks, sandboxing, evals, observability. A model API plus a harness becomes Claude Code, Cursor, Aider, OpenCode.

The harness is, in software terms, a small operating system. It has scheduling (the loop, step budget, retries), I/O (tool calls, MCP servers, external APIs), security (sandboxing, secret scrubbing, permission gates), observability (tracing, replay, eval scores), and error handling (timeouts, fallbacks, graceful degradation). If you’ve designed any production system, every one of those concerns is familiar.

It also turns “one model call” into “a session that does work.” It owns the loop (model to tool to result to model), the step budget, the trace you can replay later, and the hooks that intercept every step. Most production complexity lives here, not in the prompt. MCP is what most modern harnesses speak to wire tools in. In LangChain, bind_tools() exposes the tool-calling protocol. The loop around it is yours to write.

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, ToolMessage
model = ChatAnthropic(model="claude-opus-4-7").bind_tools(tool_specs)
messages = [HumanMessage(content=user_request)]
tools_by_name = {t.name: t for t in tool_specs}
turn = 0
while turn < max_turns:
resp = model.invoke(messages)
messages.append(resp)
if not resp.tool_calls:
break # plain-text response, done
for tc in resp.tool_calls:
run_hooks("pre_tool", tc)
try:
result = sandbox(tools_by_name[tc["name"]]).invoke(tc["args"])
except Exception as e:
result = error_payload(e)
run_hooks("post_tool", tc, result)
messages.append(ToolMessage(
content=str(result),
tool_call_id=tc["id"],
))
turn += 1

Notice what the harness owns that the model doesn’t. The while, the sandbox(...), the run_hooks(...), the budget check, the error recovery. Hooks let you intercept every tool call. Permission prompts, logging, secret scrubbing, cost limits. Sub-agents are this same loop nested with isolated context. (For a higher-level alternative, LangGraph’s create_react_agent wraps this loop into one call. Useful when you want the runtime, less useful when you want to see the runtime.)

Animation: the agent loop. A dot bounces user to model and back to tool / tool / tool, then to end_turn. The model is in the centre as the orange-bordered hub.
The agent loop: model to tool to model, until the model emits a plain-text response. The harness owns the while.
Animation: a claude session log fading in line by line. User request, llm thinking, tool reads / greps / edits, pytest passes, llm done.
What a real session looks like: the harness orchestrates each tool call and logs every step.

So what?

The reason “prompt engineer” felt like a separate discipline two years ago was that nobody had named the layers yet. We had a single bag labeled “prompt” and threw retrieval, agent loops, tool execution, and instruction wording into it together. Now we have words for the parts. The discipline turns out to be software engineering. The work is software engineering. The model is one of the dependencies. The prompts are short pieces of input data.

If you’ve been writing software for years and feeling vaguely worried that “prompt engineering” is some adjacent skill you missed, you didn’t miss it. You already have most of the muscles. You will spend a few weeks getting fluent at writing system messages and tuning retrieval. The harder skills (how to structure a retry policy, when to compact context, how to design a tool taxonomy, how to reason about end-to-end evals) are the same skills you’ve been using to build production systems. They just have a stochastic component now.

And if you came in through prompt engineering and want to push deeper into agents, the way in is to learn a little more software engineering. Think about retries, idempotency, sandboxes, traces. The discipline is welcoming, and the on-ramp is shorter than it looks.


Archit Dwivedi

About the author

Archit Dwivedi builds AI agents and ships them into production. More at /about · what I’m working on · say hi.

    Archit · May 2026

    Leave a comment