Issue 05 · Context engineering

Context engineering, step by step.

An agent's tool-calling loop grows its message list every single turn. The context window does not grow with it. Context engineering is the craft of deciding what the model sees: what to keep, what to drop, what to compress, what to push out to disk, and what to never load in the first place. Get it wrong and the agent forgets the one fact it needed.

What you need going in: the agent loop, in which a model calls tools in a cycle and appends an AIMessage and one or more ToolMessages to a growing list on every turn. This essay is about the cost of that growth: a message history that outgrows the window. Comfort with the loop helps.

Ground truth: the framework code uses LangChain's memory and middleware APIs and the Deep Agents context tools, taken from their documentation. The interactive panel runs a small deterministic conversation with toy token counts, so every survivor, every dropped message, and every total is identical on each load. Spot something wrong? The colophon has my contact.

Step 1

1. The window is finite, and the loop keeps filling it

A model sees only its context window. Every loop iteration adds messages. The math does not end well on its own.

What is this? An LLM processes a fixed maximum number of tokens at once: its context window. Everything the model knows in a given call lives in that window: the system prompt, the whole conversation, every tool result. The agent loop appends an AIMessage and one or more ToolMessages on every pass. Left alone, the history grows without bound until it exceeds the window and the call fails.

The trouble starts before the hard limit. Even models that accept 128k or 200k tokens perform worse on very long contexts: they are slower, more expensive per call, and more easily distracted by stale content. A bloated window is not just a ceiling risk; it degrades quality and runs up the bill on every turn.

Context is a budget, not a free buffer. The loop spends it automatically; context engineering is how you spend it well.
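To put rough numbers on that budget, here is a minimal sketch of how many loop turns fit before the history overflows. The window size, system-prompt size, and per-turn cost are illustrative assumptions, not measurements of any real agent.

```python
def turns_until_overflow(window: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Return how many full loop turns fit in the window before the next one overflows."""
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    available = window - system_tokens
    if available < 0:
        return 0  # the system prompt alone already exceeds the window
    return available // tokens_per_turn


# Assumed figures: a 200k window, a 2k system prompt, and 3k tokens per turn
# (one AIMessage plus its ToolMessages).
turns = turns_until_overflow(window=200_000, system_tokens=2_000, tokens_per_turn=3_000)
print(turns)
```

Under these assumptions the loop gets a few dozen turns. A single large tool result can cut that sharply, which is why the later steps treat tool output separately.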

Step 2

2. What counts as context

Five different things compete for the same window. Each is managed differently.

What is this? "Context" is not just the chat history. The Deep Agents framing names five distinct kinds, each with its own lever. Knowing which kind a piece of information is tells you how to manage it.

Context type	What it controls	When it applies
Input context	What goes into the prompt at startup: your instructions, tool descriptions, always-loaded memory files.	Static, paid on every run. Keep it small.
Runtime context	Structured data handed to tools at call time (user id, credentials), invisible to the model.	Per run; passed via a context schema, so it never enters the window.
Context compression	Keeping the growing message history within the window.	As limits approach: trim, summarize, or offload (most of this essay).
Context isolation	Quarantining heavy work in a subagent that has its own window.	Per subagent, when you delegate (Step 8).
Long-term memory	Facts and preferences that persist across separate conversations.	Across threads, via a store (Step 3).

Most of this essay lives in the middle rows, compression and isolation, because that is where the agent loop's runaway history gets tamed. But the cheapest token saved is the one in the system prompt you never wrote.
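The runtime-context row is the easiest one to misread, so here is a framework-free sketch of the idea. The names RunContext, lookup_orders, and call_tool are hypothetical and do not belong to LangChain or Deep Agents; the point is only that the harness hands the context to the tool, while the message list the model sees never contains it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunContext:
    """Hypothetical per-run data that tools need but the model should not see."""
    user_id: str
    api_token: str


def lookup_orders(query: str, ctx: RunContext) -> str:
    """Toy tool: the model supplies `query`, the harness supplies `ctx`."""
    # A real tool would call a backend with ctx.api_token here.
    return f"1 order found for query {query!r}"


def call_tool(tool: Callable[..., str], model_args: dict, ctx: RunContext,
              messages: list[str]) -> None:
    """Run a tool with model-chosen arguments plus hidden context; record only the result."""
    result = tool(**model_args, ctx=ctx)
    messages.append(f"tool: {result}")


messages = ["user: where is my last order?"]
ctx = RunContext(user_id="u-42", api_token="secret-token")
call_tool(lookup_orders, {"query": "last order"}, ctx, messages)
print(messages)
```

Only the tool's return value lands in the message list; the credentials cost no tokens and cannot leak into a completion.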

Step 3

3. Short-term and long-term memory

Two different problems. One is "remember this conversation." The other is "remember this user."

Short-term memory: the checkpointer. Without it, every invoke is stateless: the agent forgets everything between messages. A checkpointer saves the full agent state after each step, keyed by a thread_id, so the next message in the same thread sees the whole history. This is what makes a conversation a conversation.

Long-term memory: the store. A checkpointer is scoped to one thread. Start a new conversation and it has nothing to load. A store persists structured facts (preferences, decisions) keyed by user_id, so they survive across sessions. The split is exact: thread_id changes per session; user_id stays the same.

The code

from langchain.agents import create_agent
from langchain.chat_models import init_chat_model
from langchain.messages import HumanMessage
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.store.memory import InMemoryStore

agent = create_agent(
    model=init_chat_model("openai:gpt-4o"),
    tools=[],
    system_prompt="You remember our conversation.",
    checkpointer=InMemorySaver(),   # short-term: per-thread history
    store=InMemoryStore(),          # long-term: cross-session facts
)

# thread_id links messages into one conversation; use a new one per session
config = {"configurable": {"thread_id": "thread-1"}}
agent.invoke({"messages": [HumanMessage(content="My name is Alice.")]}, config=config)
result = agent.invoke({"messages": [HumanMessage(content="What is my name?")]}, config=config)
print(result["messages"][-1].content)   # "Your name is Alice."

In production, swap the in-memory backends for persistent ones (PostgresSaver, PostgresStore), otherwise a restart wipes every conversation. A store organizes data by a namespace tuple and key, for example ("users", user_id, "facts"), and tools reach it through ToolRuntime.store.

The checkpointer gives continuity within a session; the store gives continuity across sessions. Neither solves the window problem, which is what happens when a single session's history gets too long.
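The thread_id / user_id split can be sketched without any framework. ToyCheckpointer and ToyStore below are hypothetical stand-ins written for this illustration, not the LangGraph classes; they only mimic the addressing scheme described above (history per thread, facts per namespace tuple and key).

```python
class ToyCheckpointer:
    """Stand-in for a checkpointer: one message history per thread_id."""

    def __init__(self) -> None:
        self._threads: dict[str, list[str]] = {}

    def append(self, thread_id: str, message: str) -> None:
        self._threads.setdefault(thread_id, []).append(message)

    def history(self, thread_id: str) -> list[str]:
        return list(self._threads.get(thread_id, []))


class ToyStore:
    """Stand-in for a store: values addressed by (namespace tuple, key)."""

    def __init__(self) -> None:
        self._data: dict[tuple[tuple[str, ...], str], dict] = {}

    def put(self, namespace: tuple[str, ...], key: str, value: dict) -> None:
        self._data[(namespace, key)] = value

    def get(self, namespace: tuple[str, ...], key: str):
        return self._data.get((namespace, key))


checkpointer, store = ToyCheckpointer(), ToyStore()
user_id = "alice"

# Session one: the history goes to the thread, the durable fact goes to the store.
checkpointer.append("thread-1", "My name is Alice.")
store.put(("users", user_id, "facts"), "name", {"value": "Alice"})

# Session two: new thread_id, same user_id.
new_thread_history = checkpointer.history("thread-2")
remembered = store.get(("users", user_id, "facts"), "name")
print(new_thread_history, remembered)
```

The new thread starts empty, while the fact written under the user's namespace is still there.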

Step 4

4. Three ways to keep the window lean

When the history grows too long, you trim it or you summarize it. The two make different trade-offs.

What is this? The history sits in state. Before each model call you can shrink it. Three approaches, not mutually exclusive: trim by message count (keep the last N), trim by token count (keep what fits a token budget), or summarize (compress old messages into a short summary and keep the recent ones verbatim).

If conversations...	Strategy	Trade-off
Are moderately long	Trim by count	Simple, but you lose old context entirely.
Have variable-length messages	Trim by tokens	Precise budget control, slightly more code.
Hold important old context	Summarize	Retains key facts, but costs an extra (cheap) model call.

Trimming, by count and by tokens

from langchain.messages import SystemMessage

def trim_by_count(messages, max_messages=20):
    """Keep the system message plus the last N messages."""
    system = [m for m in messages if isinstance(m, SystemMessage)]
    other  = [m for m in messages if not isinstance(m, SystemMessage)]
    if len(other) <= max_messages:
        return messages
    return system + other[-max_messages:]

def trim_by_tokens(messages, max_tokens=8000):
    """Walk backward, keep messages until the token budget runs out."""
    system = [m for m in messages if isinstance(m, SystemMessage)]
    other  = [m for m in messages if not isinstance(m, SystemMessage)]
    # Rough estimate: about 4 characters per token (assumes string content).
    total = sum(len(m.content) // 4 for m in system)
    kept = []
    for msg in reversed(other):
        cost = len(msg.content) // 4
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()                        # restore chronological order
    return system + kept

The len(content) // 4 rule is a rough token estimate for English. For production, count with a real tokenizer. The deeper problem is not the estimate: it is that both trimmers throw away whatever falls outside the window, regardless of how important it was. The next panel shows exactly what that costs.

Trimming is fast and simple, and it is blind. It cannot tell a throwaway "ok, thanks" from the one message that decided your whole architecture.
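One way to prepare for a real tokenizer is to make the counter a parameter. The sketch below works on plain strings rather than message objects to stay self-contained; trim_to_budget and count_words are hypothetical names, and count_words merely stands in for whatever tokenizer you would actually plug in.

```python
from typing import Callable, Sequence


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token for English text."""
    return len(text) // 4


def count_words(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer's count function."""
    return len(text.split())


def trim_to_budget(texts: Sequence[str], max_tokens: int,
                   count_tokens: Callable[[str], int] = estimate_tokens) -> list[str]:
    """Keep the most recent texts whose combined token count fits max_tokens."""
    kept: list[str] = []
    total = 0
    for text in reversed(texts):
        cost = count_tokens(text)
        if total + cost > max_tokens:
            break
        kept.append(text)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "decision: use Postgres for the orders service",
    "ok, thanks",
    "now add an index on created_at",
]
survivors = trim_to_budget(history, max_tokens=10)
print(survivors)
```

With a budget of 10, both counters keep the throwaway "ok, thanks" and drop the decision, which is the blindness described above.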

Step 5

5. Live: a filling context window

A real conversation, a tight window, and four policies. Watch what survives, and what is lost.

What is this? Below is a twelve-message coding session with toy token counts. The window holds only 80 tokens; the full history needs 125. Something must give. Pick a policy and watch which messages reach the model. Pay attention to message 2, the one that decided the database, and to the last message, where the user asks which database they chose.

Pick a policy

Start on "keep everything" to see the overflow, then trim, then summarize. The slider sets how many recent messages to keep (used by trim-by-count and summarize).

Trimming keeps the window legal but can delete the one message that mattered. Summarization keeps the window legal and carries the key decision forward. That difference is the whole point of context engineering.
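For readers without the panel, here is a deterministic sketch of the same experiment. The 80-token window and 125-token total come from the text; the message texts, the per-message counts, and the 10-token summary cost are invented for illustration and are not the panel's actual data. The "summary" is a stand-in that copies forward messages flagged as key facts, where a real one would come from a model call.

```python
from dataclasses import dataclass

WINDOW = 80          # window size from the text
SUMMARY_TOKENS = 10  # assumed toy cost of the summary message


@dataclass(frozen=True)
class Msg:
    n: int                  # 1-based position in the conversation
    text: str
    tokens: int             # toy token count
    key_fact: bool = False  # marks a message the summary must carry forward


# Hypothetical session; counts are chosen only so the total is 125.
HISTORY = [
    Msg(1, "user: help me plan the orders service", 8),
    Msg(2, "assistant: decision: we will use Postgres", 12, key_fact=True),
    Msg(3, "user: ok, sketch the schema", 10),
    Msg(4, "assistant: here is a draft schema", 14),
    Msg(5, "user: add a status column", 9),
    Msg(6, "assistant: status column added", 11),
    Msg(7, "user: write the migration", 13),
    Msg(8, "assistant: migration written", 10),
    Msg(9, "user: add tests for it", 12),
    Msg(10, "assistant: tests added", 9),
    Msg(11, "assistant: all tests pass", 10),
    Msg(12, "user: which database did we choose?", 7),
]


def total(msgs: list[Msg]) -> int:
    return sum(m.tokens for m in msgs)


def trim_by_count(history: list[Msg], keep: int) -> list[Msg]:
    return list(history[-keep:]) if keep > 0 else []


def trim_by_tokens(history: list[Msg], budget: int) -> list[Msg]:
    kept, used = [], 0
    for m in reversed(history):
        if used + m.tokens > budget:
            break
        kept.append(m)
        used += m.tokens
    return kept[::-1]


def summarize(history: list[Msg], keep: int) -> list[Msg]:
    recent = trim_by_count(history, keep)
    old = history[:len(history) - len(recent)]
    if not old:
        return list(history)
    facts = "; ".join(m.text for m in old if m.key_fact)
    return [Msg(0, f"[summary] {facts}", SUMMARY_TOKENS)] + recent


def remembers_database(msgs: list[Msg]) -> bool:
    return any("Postgres" in m.text for m in msgs)


policies = {
    "keep everything": list(HISTORY),
    "trim by count (6)": trim_by_count(HISTORY, 6),
    "trim by tokens": trim_by_tokens(HISTORY, WINDOW),
    "summarize (keep 6)": summarize(HISTORY, 6),
}
for name, visible in policies.items():
    fits = "fits" if total(visible) <= WINDOW else "overflows"
    print(f"{name}: {total(visible)} tokens, {fits}, knows database: {remembers_database(visible)}")
```

In this toy data, both trimmers fit the window and lose message 2, while the summary fits and still answers the final question.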

Step 6

6. Summarization, in depth

Compress the old, keep the recent verbatim, and use a cheap model to do it.

What is this? Summarization asks a model to condense old messages into a short summary that preserves the key facts, decisions, and context, then replaces those old messages with the summary. Recent messages stay verbatim. The summary costs far fewer tokens than the history it replaces, and unlike trimming it does not silently lose the important parts.

Use a cheaper model. Summarizing does not need your most capable model. A faster, cheaper one keeps latency and cost low. The pattern: trigger when the history crosses a threshold, summarize everything older than the last N messages, keep those N untouched.

What summarization does to the window

A 200k window at 85% full, before and after. The full history is replaced by a compact summary; the recent messages survive intact.

[Figure: two context-window bars. Before: 170k / 200k tokens. After: 22k / 200k tokens. Legend: dark = system prompt; faded = full history; rose = the summary that replaces it; rust = recent messages, kept verbatim.]

The code

By hand, then with the built-in middleware that does it for you.

from langchain.chat_models import init_chat_model
from langchain.messages import HumanMessage, SystemMessage

summarizer = init_chat_model("openai:gpt-4o-mini")   # cheap model for summaries

def summarize_old(messages, keep_recent=10):
    system = [m for m in messages if isinstance(m, SystemMessage)]
    other  = [m for m in messages if not isinstance(m, SystemMessage)]
    if len(other) <= keep_recent:
        return messages
    old, recent = other[:-keep_recent], other[-keep_recent:]
    text = "\n".join(f"{type(m).__name__}: {m.content}" for m in old)
    summary = summarizer.invoke([
        SystemMessage(content="Summarize concisely. Preserve key facts and decisions."),
        HumanMessage(content=text),
    ])
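    # Note: some providers accept system messages only at the start of the list;
    # for those, wrap the summary in a HumanMessage instead.
    return system + [SystemMessage(content=f"[Summary]: {summary.content}")] + recent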

# The same policy with the built-in middleware
from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware
from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model="openai:gpt-4o",
    tools=[],
    middleware=[
        SummarizationMiddleware(
            model="openai:gpt-4o-mini",   # cheaper model for summaries
            trigger=("tokens", 4000),      # summarize when history passes 4000 tokens
            keep=("messages", 20),         # always keep the 20 most recent
        )
    ],
    checkpointer=InMemorySaver(),
)

Deep Agents does this automatically at 85% of the window: it generates a structured summary (session intent, artifacts, next steps) that replaces the old messages in context, and writes the complete original history to the filesystem as a canonical record. The 85% trigger leaves the last 15% for the agent's next response.

Summarization is trimming with a memory. It costs one cheap model call to avoid deleting the fact you will need three turns from now.
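The trigger-and-keep pattern can be sketched independently of any framework. In the sketch below, compact_history is a hypothetical helper, the messages are plain strings, and keep_decisions is a fake summarizer that stands in for the cheap model call; only the control flow (do nothing below the threshold, otherwise summarize everything older than the last N) mirrors the description above.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    trigger_tokens: int,
    keep_recent: int,
) -> list[str]:
    """Summarize everything older than the last keep_recent messages once the
    history exceeds trigger_tokens; otherwise return the history unchanged."""
    total = sum(count_tokens(m) for m in messages)
    if total <= trigger_tokens or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    return [f"[Summary]: {summarize(old)}"] + recent


def count_words(text: str) -> int:
    """Toy token counter for the demo."""
    return len(text.split())


def keep_decisions(old: list[str]) -> str:
    """Fake summarizer: a real one would call a cheap model here."""
    return "; ".join(m for m in old if m.startswith("decision:"))


# An 85% trigger on a 200k window, in integer arithmetic.
TRIGGER_AT_85_PERCENT = 200_000 * 85 // 100

history = ["decision: use Postgres"] + [f"step {i} done" for i in range(1, 10)]
compacted = compact_history(history, count_words, keep_decisions,
                            trigger_tokens=20, keep_recent=3)
print(compacted)
```

The demo uses a tiny trigger of 20 so the toy history crosses it; in practice the threshold would be something like the 85% figure computed above.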

Step 7

7. Offloading: externalize instead of compress

Some content does not belong in the window at all. Put it on disk and fetch what you need.

What is this? Summarization compresses; offloading relocates. When a tool returns a huge result (a 148k-token analysis, a giant file), holding it in context is wasteful: the agent rarely needs all of it at once. Deep Agents automatically writes any tool result over 20,000 tokens to a virtual filesystem and replaces it in context with a short reference plus a preview. The agent then reads the slice it needs with read_file, or searches it with grep.

Why it is different from summarizing. A summary is lossy: detail is gone for good. Offloading is lossless: the full content is still there on disk, addressable, just not occupying the window. When the agent needs page 7, it reads page 7.

In context	Without offloading	With offloading
A 148k-token tool result	148,329 tokens sitting in the window	"Saved to /output/analysis_001.md (148,329 tokens). Preview: ..."
Old large tool calls	Kept verbatim until overflow	Truncated past 85% of the window, replaced by a pointer to the file on disk

If the agent might need the detail later but not now, offload it. Summarize what must stay legible; relocate what only needs to be reachable.
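A minimal sketch of the offloading idea, using an ordinary temporary directory instead of a virtual filesystem. The 20,000-token threshold is the figure quoted above; offload_if_large, read_slice, the four-characters-per-token estimate, and the preview length are assumptions made for illustration, not Deep Agents' implementation.

```python
import tempfile
from pathlib import Path

OFFLOAD_THRESHOLD_TOKENS = 20_000  # threshold quoted in the text


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return len(text) // 4


def offload_if_large(result: str, directory: Path, name: str,
                     preview_chars: int = 200) -> str:
    """Return small results unchanged; write large ones to disk and return
    a short reference plus a preview instead."""
    tokens = estimate_tokens(result)
    if tokens <= OFFLOAD_THRESHOLD_TOKENS:
        return result
    path = directory / name
    path.write_text(result, encoding="utf-8")
    return f"Saved to {path} ({tokens:,} tokens). Preview: {result[:preview_chars]}..."


def read_slice(path: Path, start_line: int, num_lines: int) -> str:
    """Read only the requested lines (0-based start) of an offloaded file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start_line:start_line + num_lines])


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    big_result = "\n".join(f"row {i}: value {i * i}" for i in range(10_000))
    in_context = offload_if_large(big_result, workdir, "analysis_001.md")
    page = read_slice(workdir / "analysis_001.md", start_line=7, num_lines=2)
    small = offload_if_large("short result", workdir, "unused.md")

print(in_context[:60])
print(page)
```

The window holds a reference of a few hundred characters, and the full content stays recoverable line by line.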

Step 8

8. Isolation: give heavy work its own window

The most effective technique is to keep the mess out of the main window entirely.

What is this? A subagent runs with its own fresh context window. It does the heavy work (reading fifty files, running the analysis, accumulating eighty thousand tokens) and returns only its final result to the parent. The parent never sees the mess; it sees the conclusion. This is context isolation, and for long tasks it is the single most powerful lever.

The difference, in tokens

Approach	Parent context after the task
No subagent	user request + 50 file reads + analysis = 83,000 tokens
With a subagent	user request + the subagent's summary = 3,000 tokens (the 80,000 tokens of file reads live and die in the subagent's window)

The code

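from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Placeholder search tool; swap in a real search client."""
    return f"(placeholder: no results for {query!r})"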

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    subagents=[
        {
            "name": "researcher",
            "description": "Conducts research on a topic",
            "system_prompt": (
                "You are a research assistant. "
                "IMPORTANT: return only the essential summary (under 500 words). "
                "Do NOT include raw search results or detailed tool outputs."
            ),
            "tools": [web_search],
        }
    ],
)

The instruction to "return only the summary, not the raw data" is doing the real work. A subagent that dumps its raw findings back into the parent defeats the entire purpose. Isolate the work and constrain the return.

The cheapest token to manage is the one that never enters the main window. Delegate heavy work to a subagent and let its context be discarded when it finishes.
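The token table can be reproduced with simple bookkeeping. The totals (83,000 and 3,000) and the 80,000 tokens of file reads come from the text; how those totals split into a 200-token request, 1,600 tokens per file read, and a 2,800-token result is an assumption made so the arithmetic works out.

```python
REQUEST_TOKENS = 200             # assumed size of the user request
FILE_READ_TOKENS = [1_600] * 50  # assumed: 50 reads totalling 80,000 tokens
RESULT_TOKENS = 2_800            # assumed size of the analysis / returned summary

Context = list[tuple[str, int]]  # (label, tokens) entries in a context window


def tokens(context: Context) -> int:
    return sum(t for _, t in context)


def without_subagent() -> Context:
    """The parent does the work itself: every file read lands in its own context."""
    context: Context = [("user request", REQUEST_TOKENS)]
    context += [(f"file read {i}", t) for i, t in enumerate(FILE_READ_TOKENS, 1)]
    context.append(("analysis", RESULT_TOKENS))
    return context


def with_subagent() -> tuple[Context, Context]:
    """The parent delegates: reads fill the subagent's context, and only the
    summary is appended to the parent's."""
    sub_context: Context = [(f"file read {i}", t) for i, t in enumerate(FILE_READ_TOKENS, 1)]
    sub_context.append(("analysis", RESULT_TOKENS))
    parent_context: Context = [
        ("user request", REQUEST_TOKENS),
        ("subagent summary", RESULT_TOKENS),
    ]
    return parent_context, sub_context  # the caller discards sub_context


parent_alone = without_subagent()
parent_delegating, discarded = with_subagent()
print(tokens(parent_alone), tokens(parent_delegating), tokens(discarded))
```

The work costs roughly the same number of tokens either way; what changes is which window pays for it and whether that window outlives the task.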

Step 9

9. Wiring it into the loop

Run your trimming before every model call with a before-model hook.

What is this? If the built-in SummarizationMiddleware from Step 6 is not enough, you can run custom logic before every model call with the @before_model hook. It receives the state, can rewrite the message list, and returns the replacement, so the model never sees a list that is too long.

The replacement trick. To swap the whole history, return a RemoveMessage(id=REMOVE_ALL_MESSAGES) followed by the messages you want to keep. The reducer clears the old list and applies the new one.

The code

from langchain.agents import create_agent, AgentState
from langchain.agents.middleware import before_model
from langchain.messages import RemoveMessage
from langgraph.graph.message import REMOVE_ALL_MESSAGES
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.runtime import Runtime

@before_model
def trim(state: AgentState, runtime: Runtime):
    """Keep the first message and the last few; drop the rest."""
    messages = state["messages"]
    if len(messages) <= 4:
        return None                       # first + last three already covers the whole list
    keep = [messages[0]] + messages[-3:]
    return {"messages": [RemoveMessage(id=REMOVE_ALL_MESSAGES), *keep]}

agent = create_agent(
    "openai:gpt-4o",
    tools=[],
    middleware=[trim],
    checkpointer=InMemorySaver(),
)

Context management is not a one-time setup. It is a hook that runs on every turn of the loop, keeping the window lean before the model ever sees it.
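To see why the replacement trick works, it helps to model the reducer. The sketch below is a deliberately simplified toy, not LangGraph's implementation (the real reducer also matches messages by id); REMOVE_ALL is a local sentinel standing in for RemoveMessage(id=REMOVE_ALL_MESSAGES).

```python
REMOVE_ALL = object()  # toy sentinel standing in for the remove-all marker


def reduce_messages(current: list, update: list) -> list:
    """Toy reducer: append by default; if the update contains the sentinel,
    drop the current list and keep only what follows the sentinel."""
    for i in range(len(update) - 1, -1, -1):
        if update[i] is REMOVE_ALL:
            return list(update[i + 1:])
    return current + update


state = ["m1", "m2", "m3", "m4", "m5", "m6"]

# A normal update appends.
appended = reduce_messages(state, ["m7"])

# A hook that returns the sentinel followed by the survivors replaces the list.
survivors = [state[0], *state[-3:]]
replaced = reduce_messages(state, [REMOVE_ALL, *survivors])
print(appended, replaced)
```

Without the sentinel, returning the survivors would simply append them to the existing history, which is the opposite of trimming.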

Step 10

10. Which technique, when

Five levers. Each fits a different shape of problem.

Technique	Reach for it when	Loses detail?	Cost
Trim by count / tokens	Old turns genuinely do not matter	yes, permanently	free, fast
Summarize	Old turns hold facts you will need	lossy, but keeps key facts	one cheap model call
Offload to disk	A tool result is large but reachable later	no, fully recoverable	a file write
Isolate in a subagent	A subtask generates heavy intermediate context	no, discarded by design	a separate model run
Move to a store	A fact must survive across sessions	no, persisted	a key-value write

In order of preference for most agents: never load it (system prompt discipline), isolate it (subagents), offload it (disk), summarize it (cheap model), and only then trim it (blind deletion) as the last resort.
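That order of preference reads naturally as a chain of checks. The function below is a hypothetical helper whose four yes/no questions paraphrase the "reach for it when" column; real decisions are rarely this binary, and several techniques usually run together.

```python
def choose_technique(*, needed_at_all: bool, heavy_subtask: bool,
                     large_but_reachable: bool, old_turns_matter: bool) -> str:
    """Walk the levers in order of preference and return the first that applies."""
    if not needed_at_all:
        return "never load it"
    if heavy_subtask:
        return "isolate it"
    if large_but_reachable:
        return "offload it"
    if old_turns_matter:
        return "summarize it"
    return "trim it"


# A large tool result the agent may revisit later, produced in the main loop.
choice = choose_technique(needed_at_all=True, heavy_subtask=False,
                          large_but_reachable=True, old_turns_matter=True)
print(choice)
```

Trimming is the fall-through case: it is reached only when none of the gentler levers fits.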

Step 11

11. What we left out

Real machinery, deferred to keep this essay about the core levers.

Persistence backends. PostgresSaver and PostgresStore for production memory that survives restarts and serves many nodes at once.
Selective deletion. Beyond chronological trimming, remove specific noise, for example failed tool calls, by id.
Memory consolidation. Background processes that promote facts from a conversation into long-term memory, and prune stale ones.
Skills and progressive disclosure. Load a capability's full instructions only when the task matches, instead of paying for them in every prompt.
AGENTS.md memory files. Always-loaded preference and fact files; keep them under 2,000 tokens, because they cost on every single call.
Retrieval (RAG). The largest context lever of all: pull in only the documents relevant to the current question, rather than everything you might need.

Those are the levers of context engineering: keep, drop, summarize, offload, isolate, and store. Master them and an agent can run for as long as the task needs, without its memory either overflowing or going blank.

An agent is a loop that calls tools and spends context. Engineer the tools, engineer the loop, engineer the context, and you have engineered the agent.
