I recently embarked on a small toy project/experiment: How well can I equip Claude Code to automatically analyze and triage crashes in a C++ code base?
For the experiment, I worked on a small number of crashes from the ffmpeg bug tracker. The initial results were very discouraging: Claude hallucinated all sorts of implausible root causes and tended to write typical "AI slop" -- things that follow the form of a well-written report but have no bearing on reality.
I iterated for a few days, but ultimately I got things to work reasonably well, at least to the point where I was happy with the result.
The result of this little diversion is a bunch of .md files (subagents and skills) that I contributed to https://github.com/gadievron/raptor - specifically the following parts:
https://github.com/gadievron/raptor/blob/main/.claude/agents/crash-analyzer-agent.md
The task itself is not necessarily a natural fit for an LLM: I find that LLMs tend to perform better in situations where their results can be immediately verified. This is not the case here - crash triage fundamentally includes a component of "narrative building", and it is not super clear how to validate such a narrative.
There are a few things I took from my experience of using Claude Code for C++ development over the last year, which I applied here:
In my C++ development, I learnt to provide the LLM with copious amounts of conditionally-compiled logging, and with ways of running granular tests, so that it can gather information about what is happening without totally swamping the context window.
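For concreteness, here is a minimal sketch of the kind of conditionally-compiled logging I mean -- the macro and flag names are made up for illustration and are not taken from ffmpeg or raptor:

```cpp
// Compile-time-gated logging: zero cost in normal builds, verbose when
// the build is configured with -DENABLE_TRIAGE_LOG.
#include <cstdio>

#ifdef ENABLE_TRIAGE_LOG
// ##__VA_ARGS__ is a GCC/Clang extension that swallows the trailing comma.
#define TRIAGE_LOG(fmt, ...) \
    std::fprintf(stderr, "[triage] %s:%d " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__)
#else
#define TRIAGE_LOG(fmt, ...) ((void)0)
#endif

// Hypothetical decoder entry point, just to show the macro in context.
int decode_frame(const unsigned char* buf, int len) {
    TRIAGE_LOG("decode_frame: len=%d first_byte=%02x", len,
               (unsigned)(len > 0 ? buf[0] : 0));
    // ... actual decoding work ...
    return 0;
}
```

The point is that the agent can ask for a rebuild with the flag enabled and re-run only the failing input, getting targeted evidence instead of a full-verbosity log that swamps its context.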
Anyhow, what does the crash-analyzer agent end up doing?
In some sense, this is "LLM as a judge", but it appears to me that the usual problem ("the generating LLM is convincing enough that the judge LLM waves everything through") is side-stepped by making the judging LLM focus on the formal correctness of the individual steps.
I didn't think much of this, but when I presented it to an audience last week, some of the feedback I got was that the technique of "ask the LLM for detailed receipts & have a second LLM validate the receipts" is not necessarily widely known.
So here we are. If you have a task whose final output is perhaps not verifiable, but which involves verifiable substeps, you can greatly boost performance by providing the LLM with tools/skills to "provide receipts" for the substeps - the final output might still be wrong, but with much lower probability.
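To make that concrete, here is a minimal sketch of the mechanical end of such a check -- it is not the actual mechanism in the raptor agents, and the receipt shape (a claim, plus the command the analyzing agent says it ran and the output it says it observed) is an assumption for illustration:

```cpp
// Sketch of a receipt checker. Receipt contents and field names are
// assumptions for illustration; popen/pclose are POSIX.
#include <array>
#include <cstdio>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Receipt {
    std::string claim;    // e.g. "the crash dereferences a null frame pointer"
    std::string command;  // e.g. "gdb --batch -ex bt ./ffmpeg_g core"
    std::string reported; // output the analyzing agent claims it observed
};

// Re-run a command and capture its stdout.
std::string run(const std::string& cmd) {
    std::array<char, 4096> buf{};
    std::string out;
    std::unique_ptr<FILE, int (*)(FILE*)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) return out;
    while (fgets(buf.data(), buf.size(), pipe.get())) out += buf.data();
    return out;
}

int main() {
    std::vector<Receipt> receipts = {/* parsed from the agent's report */};
    for (const auto& r : receipts) {
        // Mechanical part: does the output the agent reported actually
        // appear when the command is re-run?
        bool ok = run(r.command).find(r.reported) != std::string::npos;
        std::cout << (ok ? "VERIFIED: " : "MISMATCH: ") << r.claim << "\n";
    }
    return 0;
}
```

A mechanical check like this already filters out fabricated evidence; the judge LLM then only has to reason about whether the verified outputs actually support the narrative.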