Context Engineering | Compaction & Agent Memory for Automated Malware Analysis

Published: 2026-07-02 13:00:02 | Author: www.sentinelone.com

Executive Summary

Compaction is a context-management pattern used across agent systems to compress prior context into a denser working state for long-running tasks.
SentinelLABS evaluated OpenAI’s native Responses API implementation against our automated malware analysis evaluation harness to measure real-world impact on task quality and cost.
Compaction reduced input tokens by ~86% with no measurable change to the aggregate evaluation score.
Our analysis found that compaction can significantly reduce the cost and noise of long-running security workflows without sacrificing task quality.

OpenAI introduced native compaction in a March 2026 engineering post describing extensions to the Responses API. However, the underlying idea is not unique to OpenAI. Anthropic, Google, and other agent frameworks such as LangChain all expose or document related approaches under different names.

The core problem these systems address is familiar to anyone who has built an agentic system: context accumulates faster than it stays relevant, and eventually the model is carrying more history than signal. At that point, task quality degrades and costs climb without a corresponding improvement in output.

OpenAI’s solution was to build compaction directly into the runtime so developers would not need to build custom summarization and state-carrying systems themselves. The company noted that compaction is the mechanism Codex relies on for long-running coding tasks, which positions it as load-bearing infrastructure rather than a convenience feature.

At SentinelLABS, we set out to evaluate how well OpenAI’s compaction would work for automated binary analysis, a domain with its own particular demands on agent memory and state management.

Why Malware Analysis Is a Hard Problem for Agents

Our evaluation harness gives a model access to a decompiler and asks it to complete the following:

Identify important functions and follow code paths
Interpret strings, APIs, call relationships, and data structures
Rename functions or variables based on observed behavior
Propose types or object models and explain what the malware is doing

We compare the model’s output against golden reference analysis and written reports across scoring metrics for correctness and completeness. To achieve a high score, the model needs to maintain a working theory for the slice of the binary it is analyzing, track evidence already collected, and hold open questions alongside provisional conclusions.

Malware analysis is an iterative process with a low-reward signal. A human analyst might inspect one function, learn something, pivot to another function, revise their theory, check a data structure, then return to update their original conclusion. Models do well in our evaluation where execution paths have straightforward continuity. They struggle when connections are unclear or require multiple rounds of investigation.

In observing model performance, we noticed that the agent tended to carry an increasing volume of tokens between tasks. The pattern is familiar to anyone who has run a ReAct-style agent on a non-trivial problem. Each turn adds more context until the model is dragging the full history of the run behind it, most of which stopped being useful several steps ago.

A human analyst working the same problem does not keep every raw observation equally active. They compress state between sessions. They remember that a function is probably the command dispatcher, that a particular object looks like transport state, that a given path was a dead end. They also write findings in a notebook, externalizing what they want to persist so they do not have to hold it all in working memory.

That distinction between working memory and durable memory is where compaction becomes architecturally useful.

How We Applied Compaction

Our system uses compaction to carry forward the working state: the current goal, what has already been tried, what was learned, which hypotheses remain active, what evidence changed the plan, and what questions are still open.

Specific findings and exact artifacts live outside the model context in durable storage. For malware analysis this includes logs and tool outputs, decompiled functions, intermediate artifacts, and ground-truth comparisons. When the agent needs exact evidence, it retrieves it from storage rather than relying on the compacted context to preserve it verbatim. In our use case, tool calls and their responses during binary exploration produced increasingly large prompts. As the model used more tools to explore the binary, it accumulated new findings, not all of them necessary. We used compaction to summarize those tool calls and findings into more manageable chunks, maintaining the agent’s working memory while dramatically reducing the operational token overhead.

This split is what makes compaction measurable. A workflow that relies on compaction to preserve exact evidence will eventually produce incorrect answers when summarization or compression flattens crucial details. A workflow where compaction handles working memory and durable storage handles facts can be evaluated cleanly, because the boundary between the two is explicit.
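The split between working memory and durable storage can be sketched in a few lines. This is a minimal illustration, not the SentinelLABS harness: `ArtifactStore`, `WorkingState`, the field names, and the sample function text are all hypothetical, and an in-memory dictionary stands in for real durable storage.

```python
import hashlib
from dataclasses import dataclass, field


class ArtifactStore:
    """Stand-in for durable storage: exact evidence, keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, text: str) -> str:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        self._blobs[key] = text
        return key

    def get(self, key: str) -> str:
        return self._blobs[key]


@dataclass
class WorkingState:
    """Compactable working memory: conclusions and pointers, not raw evidence."""

    goal: str
    hypotheses: list[str] = field(default_factory=list)
    dead_ends: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    evidence_refs: dict[str, str] = field(default_factory=dict)  # label -> artifact key


store = ArtifactStore()
state = WorkingState(goal="Explain the command dispatch path")

# Hypothetical decompiler output, invented for this example.
decompiled = "int sub_401000(int cmd) { switch (cmd) { case 1: return beacon(); } return 0; }"
state.evidence_refs["sub_401000"] = store.put(decompiled)
state.hypotheses.append("sub_401000 is probably the command dispatcher")
state.dead_ends.append("sub_402000 looked like a config parser but is never called")

# The working state holds only a short key; the exact text comes back verbatim on demand.
retrieved = store.get(state.evidence_refs["sub_401000"])
```

Because the state carries only references, a lossy summary of `state` cannot corrupt the evidence itself, which is what makes the two layers separately testable.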

Results

Across several long-running malware analysis agent evaluations, we compared runs with compaction enabled against runs without it.

Metric	Change
Input tokens	-86%
Output tokens	-31%
Reasoning tokens	-33%
Model calls	-1 (one fewer per run)
Aggregate evaluation score	Effectively unchanged

The token reductions were substantial. The aggregate evaluation score holding flat is what matters. We were able to carry forward enough state for the workflow to continue correctly while dramatically reducing the context processed per run.
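For readers reproducing this kind of comparison, the percentage changes in the table are plain relative differences between a baseline run and a compacted run. The sketch below shows the arithmetic only. The token totals are hypothetical values chosen to illustrate the calculation and are not measurements from our evaluation.

```python
def percent_change(baseline: float, compacted: float) -> float:
    """Relative change from baseline, in percent (negative means a reduction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compacted - baseline) / baseline * 100.0


# Hypothetical per-run totals, invented only to illustrate the arithmetic.
baseline_run = {"input_tokens": 1_000_000, "output_tokens": 20_000}
compacted_run = {"input_tokens": 140_000, "output_tokens": 13_800}

changes = {
    name: round(percent_change(baseline_run[name], compacted_run[name]))
    for name in baseline_run
}
```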

One metric did decrease: domain object modeling, meaning the model’s ability to recover the higher-level objects and structures that explain the malware’s behavior. This is not a minor caveat. For malware analysis, object and type recovery is often where the most analytically valuable conclusions are drawn.

Our read is that compaction occasionally flattened structural reasoning that would have been useful later, and it reinforces why exact artifacts must live in durable storage rather than the compacted context.

Nevertheless, our research found that compaction made longer-running analysis practical and preserved the main evaluation outcome while doing it.

Implementation

Model providers expose compaction capabilities differently. For example, Anthropic and OpenAI both provide server-side compaction; however, OpenAI exposes an additional standalone compaction endpoint. This allows developers to solve the same problem at different points in their workflows as explained below.

Server-side Compaction

This is the simpler starting point. The Responses API call includes a compaction threshold in context_management. When context length crosses that threshold, the API compacts prior context automatically during the response, with no separate call required from the application.

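from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `conversation` is the list of input items accumulated so far.
response = client.responses.create(
    model="gpt-5.5",
    input=conversation,
    store=False,
    context_management=[
        # Compact automatically once context length crosses this threshold.
        {"type": "compaction", "compact_threshold": 200000}
    ],
)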
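To make the threshold behavior concrete, the following is a local simulation of threshold-triggered compaction. It is not how OpenAI implements the feature, which runs server-side and is opaque to the caller. The four-characters-per-token estimate, the placeholder `summarize` function, and the `keep_recent` parameter are all assumptions made for illustration.

```python
def estimate_tokens(items: list[str]) -> int:
    """Very rough size estimate: about four characters per token (an assumption)."""
    return sum(len(item) for item in items) // 4


def summarize(items: list[str]) -> str:
    """Placeholder summarizer; a real system would call a model here."""
    return f"[summary of {len(items)} earlier items]"


def maybe_compact(items: list[str], threshold: int, keep_recent: int = 2) -> list[str]:
    """Collapse older items into one summary once the estimate crosses the threshold."""
    if estimate_tokens(items) <= threshold or len(items) <= keep_recent:
        return items
    older, recent = items[:-keep_recent], items[-keep_recent:]
    return [summarize(older), *recent]


history: list[str] = []
for turn in range(6):
    # Each simulated tool output is about 100 estimated tokens.
    history.append(f"tool output {turn}: " + "x" * 400)
    history = maybe_compact(history, threshold=300)
```

After six turns the history holds one summary item plus the two most recent tool outputs, so the context stays bounded no matter how long the loop runs.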

Standalone Compaction

This gives explicit control over when compaction happens. The application sends a context window to /responses/compact and receives a compacted context window back, which then becomes the input for the next response call.

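from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Explicitly compact the accumulated context (`long_input_items`).
compacted = client.responses.compact(
    model="gpt-5.5",
    input=long_input_items,
)
# Use the compacted window as-is and append the next user message.
next_input = [
    *compacted.output,
    {
        "type": "message",
        "role": "user",
        "content": next_user_message,
    },
]
# Continue the run from the compacted context.
response = client.responses.create(
    model="gpt-5.5",
    input=next_input,
    store=False,
)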

For our malware analysis workflows, standalone compaction was useful at phase boundaries, for example compacting after initial triage and before entering deeper function analysis. This also lets you inspect metrics before and after compaction, which is useful for identifying where specific evidence is being compressed and whether that compression affects downstream scoring.

The important constraint with the standalone endpoint is to treat the returned compacted window as the next canonical context window. Do not prune it manually unless the workflow has a specific and well-understood reason to do so.
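One way to capture before-and-after metrics at a phase boundary is to wrap the compaction call in a small helper that records sizes on both sides. The sketch below is an assumption-laden illustration: `CompactionRecord`, `compact_at_boundary`, and `fake_compact` are hypothetical names, serialized character counts stand in for token counts, and `fake_compact` replaces the real endpoint so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

Items = list[dict[str, Any]]


@dataclass
class CompactionRecord:
    phase: str
    items_before: int
    items_after: int
    chars_before: int
    chars_after: int

    @property
    def size_ratio(self) -> float:
        """Serialized size after compaction divided by size before."""
        return self.chars_after / self.chars_before if self.chars_before else 1.0


def compact_at_boundary(
    phase: str,
    items: Items,
    compact_fn: Callable[[Items], Items],
    log: list[CompactionRecord],
) -> Items:
    """Run compaction once at a phase boundary and record sizes on both sides."""
    compacted = compact_fn(items)
    log.append(
        CompactionRecord(
            phase=phase,
            items_before=len(items),
            items_after=len(compacted),
            chars_before=len(json.dumps(items)),
            chars_after=len(json.dumps(compacted)),
        )
    )
    # The compacted window is returned unmodified and becomes the next context.
    return compacted


def fake_compact(items: Items) -> Items:
    """Stand-in for the real compaction endpoint (hypothetical output shape)."""
    return [{"type": "message", "role": "user", "content": f"summary of {len(items)} items"}]


log: list[CompactionRecord] = []
triage_items: Items = [
    {"type": "message", "role": "user", "content": f"triage note {i}"} for i in range(10)
]
next_window = compact_at_boundary("triage", triage_items, fake_compact, log)
```

Passing the compaction step in as `compact_fn` keeps the measurement code independent of any one provider, and the helper deliberately returns the compacted window untouched.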

A practical decision rule for choosing between them:

Use case	Better fit	Reason
Long-running coding agent	Server-side	Automatic, minimal architecture change
Multi-stage investigation workflow (e.g., SOC triage)	Standalone	Natural phase boundaries make explicit compaction useful
Chat assistant with occasional long sessions	Server-side	Low overhead
Evaluation harness measuring memory quality	Standalone	Allows direct comparison of pre- and post-compaction behavior
Workflow requiring citations or exact evidence	Neither alone	Keep artifacts in durable storage and retrieve when needed

How to Use Compaction

The main takeaway for us was that compaction works best when it is part of a broader context-engineering strategy.

Separate working memory from source-of-truth artifacts. Compaction is appropriate for the immediate state the model needs to continue working. Exact evidence belongs somewhere else. This boundary matters both for correctness and for being able to evaluate whether the compacted run behaved correctly.
Compact long-running workflows. Compaction has the most impact when a task involves many steps and repeated tool use. Short interactions have little to compress.
Start with server-side compaction. For most agent loops it is the fastest way to learn whether compaction helps. Move to standalone when compaction policy becomes part of the task or evaluation design.
Do not evaluate on cost alone. A run can become significantly cheaper while losing task quality, depending on what was compacted. Resource and outcome metrics need to be tracked together.
Preserve negative information. Long-running agents need to remember what failed, not just what worked. Failed paths carry state that informs subsequent decisions, and compaction can discard them if the workflow does not explicitly mark them as worth preserving.
Treat compaction as lossy until proven otherwise. Use evaluations, traces, and artifact comparisons to verify that the compacted run still behaves correctly. Our domain object modeling result is a reminder that what looks like clean compression can still affect specific downstream capabilities.
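The advice to track outcome metrics alongside cost, and to treat compaction as lossy, can be enforced with a per-metric regression check instead of a single aggregate comparison. The sketch below is illustrative only: the metric names, scores, and the 0.02 tolerance are hypothetical and are not values from our harness.

```python
def regressions(
    baseline: dict[str, float],
    compacted: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each metric whose score dropped by more than `tolerance`."""
    drops = {}
    for name, base_score in baseline.items():
        drop = base_score - compacted.get(name, 0.0)
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


# Hypothetical scores on a 0-1 scale, invented for illustration only.
baseline_scores = {"function_naming": 0.70, "behavior_summary": 0.80, "object_modeling": 0.60}
compacted_scores = {"function_naming": 0.74, "behavior_summary": 0.85, "object_modeling": 0.50}

flagged = regressions(baseline_scores, compacted_scores)
aggregate_delta = (
    sum(compacted_scores.values()) / len(compacted_scores)
    - sum(baseline_scores.values()) / len(baseline_scores)
)
```

In this made-up example the aggregate moves by less than 0.01 while one metric drops by 0.1, which is the situation an aggregate-only comparison would hide.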

Conclusion

Compaction is part of a broader shift from prompt engineering to context engineering. Prompt engineering concerns what we ask of the model in a single turn, whereas context engineering concerns what the model gets to see across multiple turns: what gets compressed, what gets retrieved, what gets written to durable state, and what gets discarded.

For agents running long-horizon tasks, context engineering may be as important as model selection. A strong model with poor state management will lose the thread on a complex task. A model with better context discipline may make steadier progress and complete more tasks. That tradeoff compounds quickly across the kind of multi-step security workflows we are trying to evaluate.

Without compaction, realistic long-running security agent workflows become too large, noisy, and expensive to measure cleanly. With it, the scope of what is practical to evaluate expands. We view compaction not just as a tool for making agents cheaper, but as part of the infrastructure required to evaluate whether they actually work.

References

OpenAI: Responses API + Codex https://openai.com/index/equip-responses-api-computer-environment
Anthropic: Long-Running Agent Harnesses https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
LangChain: Deep Agents + Context Compression https://docs.langchain.com/oss/python/deepagents/context-engineering
Jason Liu: Compaction As Something To Evaluate https://jxnl.co/writing/2025/08/30/context-engineering-compaction
Context Compression For Long-Horizon Agents https://arxiv.org/abs/2601.07190

Source: https://www.sentinelone.com/labs/context-engineering-compaction-agent-memory-for-automated-malware-analysis/