OpenAI introduced native compaction in a March 2026 engineering post describing extensions to the Responses API. However, the underlying idea is not unique to OpenAI. Anthropic, Google, and agent frameworks such as LangChain all expose or document related approaches under different names.
The core problem these systems address is familiar to anyone who has built an agentic system: context accumulates faster than it stays relevant, and eventually the model is carrying more history than signal. At that point, task quality degrades and costs climb without a corresponding improvement in output.
OpenAI’s solution was to build compaction directly into the runtime so developers would not need to build custom summarization and state-carrying systems themselves. The company noted that compaction is the mechanism Codex relies on for long-running coding tasks, which positions it as load-bearing infrastructure rather than a convenience feature.
At SentinelLABS, we set out to evaluate how well OpenAI’s compaction would work for automated binary analysis, a domain with its own particular demands on agent memory and state management.
Our evaluation harness gives a model access to a decompiler and asks it to complete the following:
We compare the model’s output against golden reference analysis and written reports across scoring metrics for correctness and completeness. To achieve a high score, the model needs to maintain a working theory for the slice of the binary it is analyzing, track evidence already collected, and hold open questions alongside provisional conclusions.
Malware analysis is an iterative process with a low-reward signal. A human analyst might inspect one function, learn something, pivot to another function, revise their theory, check a data structure, then return to update their original conclusion. Models do well in our evaluation where execution paths have straightforward continuity. They struggle when connections are unclear or require multiple rounds of investigation.
In observing model performance, we noticed that the agent tended to carry an increasing volume of tokens between tasks. The pattern is familiar to anyone who has run a ReAct-style agent on a non-trivial problem. Each turn adds more context until the model is dragging the full history of the run behind it, most of which stopped being useful several steps ago.
A human analyst working the same problem does not keep every raw observation equally active. They compress state between sessions. They remember that a function is probably the command dispatcher, that a particular object looks like transport state, that a given path was a dead end. They also write findings in a notebook, externalizing what they want to persist so they do not have to hold it all in working memory.
That distinction between working memory and durable memory is where compaction becomes architecturally useful.
Our system uses compaction to carry forward the working state: the current goal, what has already been tried, what was learned, which hypotheses remain active, what evidence changed the plan, and what questions are still open.
Specific findings and exact artifacts live outside the model context in durable storage. For malware analysis this includes logs and tool outputs, decompiled functions, intermediate artifacts, and ground-truth comparisons. When the agent needs exact evidence, it retrieves it from storage rather than relying on the compacted context to preserve it verbatim.
In our use case, tool calls and their responses during binary exploration produced increasingly large prompts. As the model used more tools to explore the binary, it added new findings, not all of them necessary. We used compaction to summarize those tool calls and findings into more manageable chunks, which preserved the agent’s working memory while dramatically reducing the operational token overhead.
This split is what makes compaction measurable. A workflow that relies on compaction to preserve exact evidence will eventually produce incorrect answers when summarization or compression flattens crucial details. A workflow where compaction handles working memory and durable storage handles facts can be evaluated cleanly, because the boundary between the two is explicit.
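A minimal sketch of this split, assuming an in-memory store and hypothetical names (`ArtifactStore`, `WorkingState`, `render_context`) that are not part of our harness or of any library: exact tool output goes to durable storage keyed by a content hash, and only a short note plus a reference stays in the state that compaction is allowed to summarize.

```python
import hashlib
from dataclasses import dataclass, field


class ArtifactStore:
    """Stand-in for durable storage: keeps exact text, keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, text: str) -> str:
        ref = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        self._blobs[ref] = text
        return ref

    def get(self, ref: str) -> str:
        return self._blobs[ref]


@dataclass
class WorkingState:
    """State that compaction may summarize: goal, hypotheses, open questions."""

    goal: str
    hypotheses: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    evidence_refs: dict[str, str] = field(default_factory=dict)  # note -> artifact ref

    def record(self, store: ArtifactStore, note: str, raw_output: str) -> str:
        # The exact artifact goes to storage; only the note and reference stay here.
        ref = store.put(raw_output)
        self.evidence_refs[note] = ref
        return ref


def render_context(state: WorkingState) -> str:
    """Build the text the model sees: notes and references, never raw artifacts."""
    lines = [f"Goal: {state.goal}"]
    lines += [f"Hypothesis: {h}" for h in state.hypotheses]
    lines += [f"Open question: {q}" for q in state.open_questions]
    lines += [f"Evidence: {note} [artifact:{ref}]" for note, ref in state.evidence_refs.items()]
    return "\n".join(lines)


# Hypothetical example data, for illustration only.
store = ArtifactStore()
state = WorkingState(goal="Identify the command dispatcher")
decompiled = "int sub_401000(int cmd) { switch (cmd) { case 1: return 0; } return -1; }"
ref = state.record(store, "sub_401000 looks like the dispatcher", decompiled)
context = render_context(state)
```

If the compacted context later drops or paraphrases the note, the exact decompiled text is still retrievable with `store.get(ref)`.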
Across several long-running malware analysis agent evaluations, we compared runs with compaction enabled against runs without it.
| Metric | Change |
| --- | --- |
| Input tokens | -86% |
| Output tokens | -31% |
| Reasoning tokens | -33% |
| Model calls | -1 (one fewer per run) |
| Aggregate evaluation score | Effectively unchanged |
The token reductions were substantial. The aggregate evaluation score holding flat is what matters. We were able to carry forward enough state for the workflow to continue correctly while dramatically reducing the context processed per run.
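The percentages in the table are relative changes against the non-compacted baseline. A small sketch of that arithmetic follows; `percent_change` is a hypothetical helper, and the totals are placeholder numbers chosen only to illustrate the formula, not our measured token counts.

```python
def percent_change(baseline: float, compacted: float) -> float:
    """Relative change of the compacted run against the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compacted - baseline) / baseline * 100.0


# Placeholder totals for illustration, NOT measured values from our evaluation.
baseline_run = {"input_tokens": 1_000_000, "output_tokens": 20_000}
compacted_run = {"input_tokens": 140_000, "output_tokens": 13_800}

changes = {
    name: round(percent_change(baseline_run[name], compacted_run[name]))
    for name in baseline_run
}
```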
One metric did decrease: domain object modeling, meaning the model’s ability to recover the higher-level objects and structures that explain the malware’s behavior. This is not a minor caveat. For malware analysis, object and type recovery is often where the most analytically valuable conclusions are drawn.
Our read is that compaction occasionally flattened structural reasoning that would have been useful later, and it reinforces why exact artifacts must live in durable storage rather than the compacted context.
Nevertheless, our research found that compaction made longer-running analysis practical and preserved the main evaluation outcome while doing it.
Model providers expose compaction capabilities differently. For example, Anthropic and OpenAI both provide server-side compaction; however, OpenAI exposes an additional standalone compaction endpoint. This allows developers to solve the same problem at different points in their workflows as explained below.
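Server-side compaction is the simpler starting point. The Responses API call includes a compaction threshold in context_management. When context length crosses that threshold, the API compacts prior context automatically during the response, with no separate call required from the application.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Server-side compaction: the API compacts prior context once it crosses the threshold.
response = client.responses.create(
    model="gpt-5.5",
    input=conversation,  # input items accumulated so far
    store=False,
    context_management=[
        {"type": "compaction", "compact_threshold": 200000}
    ],
)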
The standalone endpoint gives explicit control over when compaction happens. The application sends a context window to /responses/compact and receives a compacted context window back, which then becomes the input for the next response call.
# Explicitly compact the accumulated context via the standalone endpoint.
compacted = client.responses.compact(
    model="gpt-5.5",
    input=long_input_items,
)

# Use the compacted output as the new context and append the next user turn.
next_input = [
    *compacted.output,
    {
        "type": "message",
        "role": "user",
        "content": next_user_message,
    },
]

response = client.responses.create(
    model="gpt-5.5",
    input=next_input,
    store=False,
)
For our malware analysis workflows, standalone compaction was useful at phase boundaries, for example compacting after initial triage and before entering deeper function analysis. This also lets you inspect metrics before and after compaction, which is useful for identifying where specific evidence is being compressed and whether that compression affects downstream scoring.
The important constraint with the standalone endpoint is to treat the returned compacted window as the next canonical context window. Do not prune it manually unless the workflow has a specific and well-understood reason to do so.
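One way to instrument a phase boundary is sketched below, under the assumption that the compaction call is injected as a plain function. `compact_at_boundary`, `approx_size`, and `fake_compact` are hypothetical names used for illustration; in a real workflow the injected function would wrap the call to the standalone endpoint, and a tokenizer would replace the rough character count used here.

```python
import json
from typing import Any, Callable

Items = list[dict[str, Any]]


def approx_size(items: Items) -> int:
    """Rough size proxy: characters in the JSON-serialized items (not tokens)."""
    return len(json.dumps(items))


def compact_at_boundary(
    phase: str,
    items: Items,
    compact: Callable[[Items], Items],
) -> tuple[Items, dict[str, Any]]:
    """Compact at a phase boundary and report size before and after."""
    before = approx_size(items)
    compacted = compact(items)
    after = approx_size(compacted)
    metrics = {
        "phase": phase,
        "size_before": before,
        "size_after": after,
        "reduction": 1 - after / before if before else 0.0,
    }
    # The compacted window is returned unmodified, to be used as the next input.
    return compacted, metrics


def fake_compact(items: Items) -> Items:
    """Stand-in for the real endpoint: collapses the history into one item."""
    summary = f"summary of {len(items)} items"
    return [{"type": "message", "role": "user", "content": summary}]


# Hypothetical history, for illustration only.
history = [{"type": "message", "role": "user", "content": "x" * 500} for _ in range(20)]
window, metrics = compact_at_boundary("triage", history, fake_compact)
```

Logging `metrics` per phase gives a simple before-and-after record that can be lined up against downstream scoring.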
A practical decision rule for choosing between them:
| Use case | Better fit | Reason |
| --- | --- | --- |
| Long-running coding agent | Server-side | Automatic, minimal architecture change |
| Multi-stage investigation workflow (e.g., SOC triage) | Standalone | Natural phase boundaries make explicit compaction useful |
| Chat assistant with occasional long sessions | Server-side | Low overhead |
| Evaluation harness measuring memory quality | Standalone | Allows direct comparison of pre- and post-compaction behavior |
| Workflow requiring citations or exact evidence | Neither alone | Keep artifacts in durable storage and retrieve when needed |
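The table can be read as a short decision function. The sketch below is one possible encoding; the function name and its three boolean inputs are assumptions made for illustration, and real workflows will likely weigh more factors than these.

```python
def choose_compaction(
    needs_exact_evidence: bool,
    has_phase_boundaries: bool,
    measures_memory_quality: bool,
) -> str:
    """Map workflow traits to the compaction approach suggested by the table."""
    if needs_exact_evidence:
        # Compaction alone will not preserve citations or verbatim artifacts.
        return "neither alone: keep artifacts in durable storage and retrieve them"
    if has_phase_boundaries or measures_memory_quality:
        # Explicit control over when compaction happens.
        return "standalone"
    # Automatic, with minimal architecture change.
    return "server-side"


coding_agent = choose_compaction(False, False, False)
soc_triage = choose_compaction(False, True, False)
cited_report = choose_compaction(True, True, False)
```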
The main takeaway for us was that compaction works best when it is part of a broader context-engineering strategy.
Compaction is part of a broader shift from prompt engineering to context engineering. Prompt engineering concerns what we ask of the model in a single turn, whereas context engineering concerns what the model gets to see across multiple turns: what gets compressed, what gets retrieved, what gets written to durable state, and what gets discarded.
For agents running long-horizon tasks, context engineering may be as important as model selection. A strong model with poor state management will lose the thread on a complex task. A model with better context discipline may make steadier progress and complete more tasks. That tradeoff compounds quickly across the kind of multi-step security workflows we are trying to evaluate.
Without compaction, realistic long-running security agent workflows become too large, noisy, and expensive to measure cleanly. With it, the scope of what is practical to evaluate expands. We view compaction not just as a tool for making agents cheaper, but as part of the infrastructure required to evaluate whether they actually work.