I recently embarked on a small toy project/experiment: How well can I equip Claude Code to automatically analyze and triage crashes in a C++ code base?
For the experiment, I worked on a small number of crashes from the ffmpeg bug tracker. The initial results were very discouraging: Claude hallucinated all sorts of implausible root causes and tended to write typical "AI slop" -- things that follow the form of a well-written report but have no bearing on reality.
I iterated for a few days, but ultimately I got things to work reasonably well, at least to the point where I was happy with the result.
The result of this little diversion is a bunch of .md files (subagents and skills) that I contributed to https://github.com/gadievron/raptor - specifically the following parts:
https://github.com/gadievron/raptor/blob/main/.claude/agents/crash-analyzer-agent.md
The task itself is not necessarily a natural fit for an LLM: I find that LLMs tend to perform better in situations where their results can be immediately verified. This is not the case here - crash triage fundamentally includes a component of "narrative building", and it is not super clear how to validate such a narrative.
There are a few things I took from my experience of using Claude Code for C++ development over the last year, which I applied here:
In my C++ development, I learnt to provide the LLM with copious amounts of conditionally-compiled logging, and with ways of running granular tests, so that it can gather information about what is happening without totally swamping the context window.
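For concreteness, here is a minimal sketch of the kind of conditionally-compiled logging I mean -- the macro and flag names are made up for illustration and are not taken from ffmpeg or raptor:

```cpp
// Compile-time-gated logging: zero cost in normal builds, verbose when
// the build is configured with -DENABLE_TRIAGE_LOG.
#include <cstdio>

#ifdef ENABLE_TRIAGE_LOG
// ##__VA_ARGS__ is a GCC/Clang extension that swallows the trailing comma.
#define TRIAGE_LOG(fmt, ...) \
    std::fprintf(stderr, "[triage] %s:%d " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__)
#else
#define TRIAGE_LOG(fmt, ...) ((void)0)
#endif

// Hypothetical decoder entry point, just to show the macro in context.
int decode_frame(const unsigned char* buf, int len) {
    TRIAGE_LOG("decode_frame: len=%d first_byte=%02x", len,
               (unsigned)(len > 0 ? buf[0] : 0));
    // ... actual decoding work ...
    return 0;
}
```

The point is that the agent can ask for a rebuild with the flag enabled and re-run only the failing input, getting targeted evidence instead of a full-verbosity log that swamps its context.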
Anyhow, what does the crash-analyzer agent end up doing?
In some sense, this is "LLM as a judge", but it appears to me that the usual problem ("the generating LLM is convincing enough that the judge LLM waves everything through") is side-stepped by making the judging LLM focus on the formal correctness of the individual steps.
I didn't think much of this, but when I presented it to an audience last week, some of the feedback I got was that the technique of "ask the LLM for detailed receipts & have a second LLM validate the receipts" is not necessarily widely known.
So here we are. If you have a task whose final output is perhaps not verifiable, but which involves verifiable substeps, you can greatly boost performance by providing the LLM with tools/skills to "provide receipts" for the substeps - the final output might still be wrong, but with much lower probability.
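To make that concrete, here is a minimal sketch of the mechanical end of such a check -- it is not the actual mechanism in the raptor agents, and the receipt shape (a claim, plus the command the analyzing agent says it ran and the output it says it observed) is an assumption for illustration:

```cpp
// Sketch of a receipt checker. Receipt contents and field names are
// assumptions for illustration; popen/pclose are POSIX.
#include <array>
#include <cstdio>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Receipt {
    std::string claim;    // e.g. "the crash dereferences a null frame pointer"
    std::string command;  // e.g. "gdb --batch -ex bt ./ffmpeg_g core"
    std::string reported; // output the analyzing agent claims it observed
};

// Re-run a command and capture its stdout.
std::string run(const std::string& cmd) {
    std::array<char, 4096> buf{};
    std::string out;
    std::unique_ptr<FILE, int (*)(FILE*)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) return out;
    while (fgets(buf.data(), buf.size(), pipe.get())) out += buf.data();
    return out;
}

int main() {
    std::vector<Receipt> receipts = {/* parsed from the agent's report */};
    for (const auto& r : receipts) {
        // Mechanical part: does the output the agent reported actually
        // appear when the command is re-run?
        bool ok = run(r.command).find(r.reported) != std::string::npos;
        std::cout << (ok ? "VERIFIED: " : "MISMATCH: ") << r.claim << "\n";
    }
    return 0;
}
```

A mechanical check like this already filters out fabricated evidence; the judge LLM then only has to reason about whether the verified outputs actually support the narrative.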