Ask your LLM for receipts: What I learned teaching Claude C++ crash triage

I recently embarked on a small toy project/experiment: How well can I equip Claude Code to automatically analyze and triage crashes in a C++ code base?

For the experimentation, I worked on a small number of crashes from the ffmpeg bug tracker. The initial results were very discouraging: Claude hallucinated all sorts of implausible root causes and tended to write typical "AI slop" -- reports that followed the form of a well-written analysis but had no bearing on reality.

I iterated for a few days and ultimately got things to work reasonably well -- at least to the point where I was happy with the results.

The result of this little diversion is a set of .md files (subagents and skills) that I contributed to https://github.com/gadievron/raptor -- specifically the following:

https://github.com/gadievron/raptor/blob/main/.claude/agents/crash-analyzer-agent.md

The task itself is not necessarily a natural fit for an LLM: I find that LLMs tend to perform better in situations where their results can be immediately verified. This is not the case here - crash triage fundamentally includes a component of "narrative building", and it is not super clear how to validate such a narrative.

There are a few lessons from my experience of using Claude Code for C++ development over the last year that I applied here:

  • Since LLMs only perceive the world through text, but their context is a scarce resource, it makes sense to provide them with effective ways of gathering extra data without wasting too much context.
  • LLMs will hallucinate arbitrary things, but they tend to course-correct if their context includes enough data that obviously contradicts their current trajectory.

In my C++ development, I learnt to provide the LLMs with copious amounts of conditionally-compiled logging and with ways of running granular tests, so that gathering information about what is happening is possible without totally swamping the context window.
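For concreteness, here is a minimal sketch of what I mean by conditionally-compiled logging -- the macro name, the CRASH_TRIAGE_LOG flag, and the example call site are made up for illustration, not taken from the actual setup:

```cpp
// Built with -DCRASH_TRIAGE_LOG, every TRIAGE_LOG line prints to stderr
// tagged with its file/line position; in a normal build the macro
// compiles away entirely.
#include <cstdio>
#include <cstddef>

#ifdef CRASH_TRIAGE_LOG
#define TRIAGE_LOG(fmt, ...) \
    std::fprintf(stderr, "[%s:%d] " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__)
#else
#define TRIAGE_LOG(fmt, ...) ((void)0)
#endif

// Hypothetical call site: the LLM can sprinkle these while investigating,
// then grep the tagged output instead of reading an entire log.
void process_packet(const unsigned char *buf, std::size_t len) {
    TRIAGE_LOG("buf=%p len=%zu", static_cast<const void *>(buf), len);
    // ... actual parsing ...
}
```

The point is that the output is cheap to grep and carries file/line coordinates, so the model can pull exactly the lines it needs into context instead of an entire log.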

Anyhow, what does the crash-analyzer-agent end up doing?

  1. It gathers artifacts that provide text-level data about what is going on in the crashing program: a function-level execution trace, gcov coverage data, an ASAN build, and an rr recording that allows deterministic replay of a particular crashing execution. (A sketch of one way to produce such a trace follows the list.)
  2. It launches a subagent to formulate a hypothesis of what is going on. This subagent is instructed to "provide receipts" for each step in its reasoning: show the precise place where the pointer that ultimately leads to the crashing dereference is allocated, and show every modification to it, both in the source code and in the rr trace, including the pointer values before and after each modification.
  3. This hypothesis document is then validated by a separate subagent that is instructed to carefully vet each step in the first document and to reject the file if any evidence is missing. On rejection, a rebuttal is written and passed back to the first agent; the loop repeats until the validator accepts a report.
  4. The final output is a report that includes specific breakpoints, pointer values, pointer modifications etc. that a human can manually verify by stepping through the provided rr trace.
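As promised above: one common way to get a function-level execution trace out of a C++ binary is GCC/Clang's -finstrument-functions. Here is a minimal sketch -- the file name and output format are illustrative, not necessarily what the agent uses:

```cpp
// trace_hooks.cpp -- link into the target after building everything with:
//   g++ -g -rdynamic -finstrument-functions target.cpp trace_hooks.cpp -ldl
#include <cstdio>
#include <dlfcn.h>

namespace {
// Resolve a code address to a symbol name where -rdynamic exported it.
__attribute__((no_instrument_function))
void log_event(const char *kind, void *fn) {
    Dl_info info;
    if (dladdr(fn, &info) && info.dli_sname)
        std::fprintf(stderr, "%s %s\n", kind, info.dli_sname);
    else
        std::fprintf(stderr, "%s %p\n", kind, fn);
}
}  // namespace

// The compiler inserts calls to these at every function entry/exit;
// no_instrument_function keeps the hooks themselves from recursing.
extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void * /*call_site*/) {
    log_event("enter", fn);
}

extern "C" __attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void * /*call_site*/) {
    log_event("exit ", fn);
}
```

The resulting enter/exit stream is cheap, greppable, text-level evidence -- exactly the kind of thing the hypothesis subagent can cite as a receipt.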

In some sense, this is "LLM as a judge", but it appears to me that the usual problem ("generating LLM is convincing enough that the judge LLM waves everything through") is side-stepped by making the judging LLM focus on the formal correctness of the individual steps.

I didn't think much of this, but when I presented it to an audience last week, some of the feedback I got was that the technique of "ask the LLM for detailed receipts & have a second LLM validate the receipts" was not necessarily widely known.

So here we are. If you have a task whose final output is perhaps not directly verifiable, but which involves verifiable substeps, you can greatly boost performance by providing the LLM with tools/skills to "provide receipts" for those substeps -- the final output might still be wrong, but with a much decreased probability.

