Why Git Diffs Are Not Enough for AI-Generated Code

A diff shows the destination, not the journey. For agent-written code the journey — what it was asked, which files it actually read, why it chose the approach, what it flagged as uncertain, whether the risky part was tested — is exactly where the risk concentrates. The diff is structurally blind to all of it.

Published 2026-06-03 · Reading time: 11 min · Tags: Code review, Git diff, Audit

Key takeaways

Review AI code by intent, context coverage, and evidence — not line-by-line alone.
Generated volume is not generated risk: a five-line auth change can outweigh a 500-line fixture.
The missing signals (prompt, observed context, test evidence, flagged uncertainty) are recordable in Git and belong in the review packet.

Coding agents have changed the economics of writing code, not of reading it. An agent can produce a clean, internally consistent patch in seconds; a reviewer still has to reconstruct the context the agent had — and then threw away — before they can trust a single line. Review throughput, not generation throughput, is now the bottleneck, and the tool we reach for first was designed for a different problem. The Git diff was built to compare two states of code written by a human who understood the system. It answers "what changed?" exactly. It was never meant to answer "what did the author know, and was it enough?" — a question that barely mattered when a colleague authored the change and could be asked directly, and that becomes the whole game when the author is a stateless model.

This is the part worth being precise about. The diff is not wrong, and it is not going away. It remains the ground truth for code changes: precise, compact, and universal. What is new is not a flaw in diffs but a shift in where risk lives. For human code, the risk is mostly in the lines, because the intent and context behind them were carried in a head you could query. For agent code, the lines are the one part you can already see; the risk has moved upstream into the journey — the prompt, the files the agent opened, the assumption it made about code it never opened, the test it ran or skipped, the uncertainty it voiced and then buried in a confident-sounding patch. A diff shows the destination. The journey is structurally absent from it.

What a diff cannot tell you, in cases you will actually see

The gap is easiest to feel through patches that pass a line-by-line read and still carry the failure with them. Each of these is a clean diff:

The plausible patch on an unverified assumption. The agent changes a serializer to emit timestamps as Unix epoch seconds because that is the common convention. The diff is tidy and the tests pass. What the diff cannot show is that the agent never opened the consumer, which has parsed ISO-8601 for two years. The code is internally correct and wrong against a constraint the agent never read.
Tests that re-prove the old happy path. The change adds a permission check; the diff also adds a test, which is reassuring until you notice the test exercises the already-passing authorized case and never constructs the unauthorized one. Coverage went up. The new behavior — the denial — was never executed. The diff shows a test was added; it cannot show that the test misses the edge the change exists to handle.
The refactor that updates the helper but not every caller. A signature changes from returning a value to returning a Result. The agent updated the three callers it had in context, and the compiler was satisfied only because the fourth caller lives behind a feature flag the agent never built. The diff is a coherent local change; whether it is a complete one depends on what the agent could see, which the diff does not record.
The fix that is correct locally and wrong against a system constraint. An agent shortens a retry backoff to fix a flaky test. Locally reasonable. The constraint it never read is a downstream rate limit that the original backoff was tuned to respect. The diff is three lines of obviously-better-looking code. The reason it is dangerous is not in the diff at all.

None of these are caught by reading the patch more carefully, because in each case the patch is fine in isolation. They are caught by knowing what the agent did and did not observe — which is precisely the information a diff drops.
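The happy-path test scenario is easier to spot once both tests sit side by side. The following is a minimal sketch, not taken from any real codebase: `delete_report`, `PermissionDenied`, and the role names are hypothetical stand-ins for whatever the change under review actually guards.

```python
class PermissionDenied(Exception):
    """Raised when a caller lacks the required role (hypothetical)."""


def delete_report(user_roles: set[str], report_id: int) -> str:
    # The new behavior under review: deny callers without the "admin" role.
    if "admin" not in user_roles:
        raise PermissionDenied(f"cannot delete report {report_id}")
    return f"deleted {report_id}"


def test_authorized_user_can_delete() -> None:
    # Happy path: this case already passed before the check was added,
    # so on its own it proves nothing about the new behavior.
    assert delete_report({"admin"}, 7) == "deleted 7"


def test_unauthorized_user_is_denied() -> None:
    # The edge the change exists to handle: build the denied caller
    # and require the denial to actually happen.
    try:
        delete_report({"viewer"}, 7)
    except PermissionDenied:
        return
    raise AssertionError("expected PermissionDenied for a non-admin caller")
```

A diff that adds only the first test looks the same in review as one that adds both. The difference is visible only in which inputs the tests construct.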

The asymmetry that makes AI patches dangerous to review

Human diffs leak their own uncertainty. A rushed human change looks rushed: inconsistent naming, a half-finished comment, a commit message that says "quick hack, revisit." Those tells are noisy, but they are signal, and experienced reviewers read them. Agent output erases the tells. Speed and surface coherence are uniform whether the agent had the right evidence or none of it. A patch built on a careful reading of the interface, the caller, and the tests looks identical to a patch built on a confident guess. The polish is not correlated with the diligence.

That asymmetry sets a trap: the natural response to an unfamiliar, confident-looking patch is to read every line with equal care. For agent code that is the failure mode, not the discipline. Reading evenly spends the reviewer's scarcest resource — attention — uniformly across a change whose risk is wildly uneven, and it spends it on the one dimension (the lines) that is already visible while ignoring the dimensions (intent, context, evidence) that are not. The volume of generated code makes this worse, but volume is a distraction from the real point: generated volume is not generated risk. A 500-line fixture file the agent generated mechanically may warrant a skim; a five-line change to token expiry or an authorization predicate can outweigh all of it. Line count is a measure of how much there is to read, not of how much there is to get wrong.

Context coverage is a first-class review signal

If the lines do not tell you where risk concentrates, what does? The most useful single question for an agent change is one a diff cannot answer: did the agent read the things its change depends on? Did it open the interface it modified, the tests that exercise it, and at least the callers it affects? An edit made after observing the right files is ordinary engineering. The same edit made blind — correct-looking but produced without reading what it depends on — is the high-signal risk, because it is exactly how the four scenarios above are born.

This is recordable. h5i logs what the agent observed as a first-class part of the work: the context DAG (refs/h5i/context) captures OBSERVE nodes for the files the agent read, THINK nodes for its reasoning, and ACT nodes for its edits. Context coverage is then a direct comparison — the set of files the change touched against the set the agent OBSERVEd. A file edited but never read is a blind edit, and blind edits on sensitive paths are where a reviewer's first pass should land.
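The comparison itself is a set difference. The sketch below assumes the two file lists have already been extracted (for example from the diff and from the OBSERVE entries) and that both use the same path normalization. The function names and the `src/auth/` prefix are illustrative, not part of h5i.

```python
def blind_edits(touched: list[str], observed: list[str]) -> list[str]:
    """Files the change edited that never appear in the agent's read set."""
    return sorted(set(touched) - set(observed))


def first_pass(
    touched: list[str],
    observed: list[str],
    sensitive_prefixes: tuple[str, ...] = ("src/auth/",),  # assumed policy
) -> list[str]:
    """Blind edits on sensitive paths: where the first review pass should land."""
    return [p for p in blind_edits(touched, observed) if p.startswith(sensitive_prefixes)]


# Hypothetical change: three files edited, only one of them read beforehand.
touched = ["src/auth/token.rs", "src/auth/refresh.rs", "src/billing/invoice.rs"]
observed = ["src/auth/token.rs", "tests/auth_token.rs"]

print(blind_edits(touched, observed))  # ['src/auth/refresh.rs', 'src/billing/invoice.rs']
print(first_pass(touched, observed))   # ['src/auth/refresh.rs']
```

Which prefixes count as sensitive is a team decision; the point of the sketch is only that the check is mechanical once the observed set is recorded.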

Uncertainty is signal, not noise to be smoothed over

Agents routinely tell you where they are unsure — "assuming the caller validates this," "this is likely correct but untested," "not sure whether the downstream service expects UTC." That text is usually the most honest risk assessment in the entire change, and it is usually the first thing lost, because the final patch is written to look finished. Treating that uncertainty as a routable signal — pulling the agent's own hedges to the front of the reviewer's attention — is one of the highest-leverage moves in AI review. h5i records flagged uncertainty as NOTE entries in the context DAG, and the uncertainty heatmap turns that into a map of where the agent itself was least confident, so review attention follows the agent's own doubt instead of fighting the patch's confident tone.
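As a rough illustration of routing on the agent's own hedges, the sketch below pulls NOTE lines out of a text trace and extracts any file names they mention. It assumes a line-oriented format of the form "KIND  text", similar to the rendered trace shown later in this guide; the parsing and the file-name regex are assumptions for illustration, not h5i's storage format or API.

```python
import re

# Loose pattern for file-like tokens such as "refresh.rs" or "src/auth/token.rs".
FILE_TOKEN = re.compile(r"[\w./-]+\.[A-Za-z]\w*")


def flagged_uncertainty(trace_lines: list[str]) -> list[tuple[str, list[str]]]:
    """Return (note text, files mentioned) for every NOTE line in the trace."""
    notes = []
    for line in trace_lines:
        kind, _, text = line.strip().partition(" ")
        if kind == "NOTE":
            text = text.strip()
            notes.append((text, FILE_TOKEN.findall(text)))
    return notes


trace = [
    "OBSERVE  src/auth/token.rs",
    "THINK    24h -> 15m balances UX against the flagged risk",
    "ACT      edit src/auth/token.rs",
    "NOTE     assuming refresh flow unaffected; did not read refresh.rs",
]

for text, files in flagged_uncertainty(trace):
    print(files, "<-", text)
```

Here the single NOTE yields `refresh.rs` as a file to open first. A real pipeline would read structured entries rather than parse rendered text, but the routing idea is the same.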

What this is not: a fair look at the alternatives

Provenance is not the only answer on offer, and it is worth being honest about what each existing approach actually catches.

Conventional human diff review works — when a human authored the change with full context, the reviewer shares that context, and the tells of a rushed change are visible. All three assumptions break when an agent authored the change. The method is not bad; its preconditions stop holding.
PR descriptions put intent next to the diff, which helps. But they are out-of-band prose, increasingly auto-generated by the same agent that wrote the code, and unverifiable: a description that says "added tests for the unauthorized path" cannot be checked against what actually ran. A description summarizes the journey; it does not record it.
Code coverage tools answer a real and adjacent question — which lines executed under the test suite. They do not answer whether the changed behavior or its edge cases were exercised. The re-proves-the-happy-path test above raises coverage while testing nothing new. Coverage measures line execution, not behavioral intent.
LLM-as-reviewer is genuinely useful and catches classes of bug a tired human skims past. But by default a review model sees the same diff the human sees, and inherits the same blind spot: it cannot tell whether the author had the right evidence unless you feed it the provenance. Give it the prompt, the observed context, and the test results, and it gets sharper for the same reason a human does — not because it is a model, but because it finally has the journey.

The throughline is that none of these are replaced by recording provenance. They are made to work on agent code by giving them the upstream signals they each, on their own, cannot see.

The minimum review packet

Put concretely, an AI-assisted change is reviewable when the reviewer can see, alongside the diff:

the prompt or task the agent was actually given;
the agent and model identity that produced it;
the files the agent observed before editing — and which edited files it did not read;
the test commands that ran and their summarized results;
the uncertainty the agent flagged, and any recorded decisions or rejected alternatives;
risk signals and the suggested review-focus files.

That packet turns review from archaeology into verification. The reviewer stops reconstructing what the agent might have known and starts checking three answerable questions: does the implementation match the stated intent, did the agent have enough context to make it safely, and do the tests cover the behavior the change introduces?
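One way to keep the packet honest is to treat it as a small data structure and check it for gaps before review starts. This is a sketch under stated assumptions: the field names and the choice of which signals are required are hypothetical and do not reflect h5i's JSON schema.

```python
from dataclasses import dataclass, field

# Signals without which the reviewer is back to reading the bare diff
# (an assumed policy, chosen for illustration).
REQUIRED_SIGNALS = ("prompt", "agent", "model", "observed_files", "test_results")


@dataclass
class ReviewPacket:
    prompt: str = ""
    agent: str = ""
    model: str = ""
    observed_files: list[str] = field(default_factory=list)
    edited_files: list[str] = field(default_factory=list)
    test_results: dict[str, str] = field(default_factory=dict)  # command -> summary
    flagged_uncertainty: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    focus_files: list[str] = field(default_factory=list)

    def missing_signals(self) -> list[str]:
        """Names of required signals that are empty in this packet."""
        return [name for name in REQUIRED_SIGNALS if not getattr(self, name)]


packet = ReviewPacket(
    prompt="make sessions expire faster, security review flagged 24h",
    agent="claude-code",
    model="claude-sonnet-4-6",
    edited_files=["src/auth/token.rs"],
)
print(packet.missing_signals())  # ['observed_files', 'test_results']
```

An empty `flagged_uncertainty` list is deliberately not treated as missing, since an agent may legitimately flag nothing. An empty `observed_files` list on a change that edited files is a different matter.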

How h5i records the journey Git-natively

The missing signals do not require a new system of record; they can be stored as Git objects beside the commit. h5i keeps the diff intact and attaches structured context to it:

h5i capture commit records the prompt, model, agent, token count, tests, and decisions as JSON in Git notes. In Claude Code, the verbatim human prompt is captured automatically by the UserPromptSubmit hook.
h5i recall context renders the goal, milestones, and the OBSERVE/THINK/ACT trace.
h5i recall blame --show-prompt connects each line back to the prompt and test result at its commit.
h5i share pr post assembles all of it into a reviewer-facing pull-request comment.

The contrast is concrete. Plain Git shows the change and a hand-written summary:

$ git log --oneline -1
a1b2c3d  shorten token expiry to 15m

$ git show a1b2c3d        # the diff, and nothing about how it came to be
-    expiry = Duration::hours(24);
+    expiry = Duration::minutes(15);

The same commit through h5i carries the intent, the model and agent that produced it, and whether the change was tested:

$ h5i recall log --limit 1
a1b2c3d  shorten token expiry to 15m
  prompt: "make sessions expire faster, security review flagged 24h"
  model:  claude-sonnet-4-6   agent: claude-code
  tests:  auth::token_expiry  PASS (1 added)

$ h5i recall blame src/auth/token.rs   # same lines, plus who/why/tested
a1b2c3d  (claude-code · claude-sonnet-4-6 · tests PASS)  expiry = Duration::minutes(15);

And where the earlier blocks answer "what and who," the context trace answers the coverage question — what the agent read before it touched the file, so a blind edit shows up as the absence of a matching OBSERVE:

$ h5i recall context
goal: shorten session token expiry; security review flagged 24h
  OBSERVE  src/auth/token.rs        # read the file it changed
  OBSERVE  tests/auth_token.rs      # read the tests
  THINK    24h -> 15m balances UX against the flagged risk
  ACT      edit src/auth/token.rs
  NOTE     assuming refresh flow unaffected; did not read refresh.rs

That last NOTE is the whole argument in one line: the agent told you where it did not look, and a reviewer can route straight to refresh.rs instead of re-reading the tidy two-line diff. None of this is visible to git show.

Where the provenance approach breaks down

It would be dishonest to present this as free. Recording the journey has real costs and failure modes a reviewer should keep in mind.

Discipline cost. Provenance is only as good as the capture. If commits are made with plain git commit, or hooks are disabled, the notes are thin or absent and you are back to the diff. The value depends on the capture being routine, not heroic.
It can be gamed. An OBSERVE entry proves a file was opened, not understood; a recorded prompt can be vague; "tests PASS" can mean a test that asserts nothing. Provenance raises the floor and concentrates attention, but it does not certify correctness — a determined or sloppy author can produce a clean-looking record around a bad change.
Unshared refs. h5i stores its data under refs/h5i/*, which a plain git push does not move. If the team forgets h5i share push, the reviewer on another clone sees the diff and none of the context — the packet has to actually travel with the code.
It is not a replacement for review. Provenance does not review the code. It builds a better packet so a human (or a review model with the packet) reviews the right lines first. The judgment still has to happen.

The honest claim is narrow: provenance does not make any single line safer, but it tells the reviewer where to spend attention first, and misallocated attention is where AI review actually fails today.

Narrow the review, do not lengthen it

The goal is not to make reviews longer by piling on context. It is to make them shorter and better aimed. You still review the diff — that does not change. What changes is the order: provenance and context decide where the first, most careful pass lands. Blind edits on sensitive paths, files the agent flagged as uncertain, behavior changes with no matching test — those rise to the top; the mechanically generated fixture sinks. The diff stays the ground truth for what changed. The journey tells you where to look first. For agent-written code, getting that order right is most of the job — and it is the one thing a diff, on its own, can never do. For the workflow that turns this into a repeatable review loop, see how to review code written by AI agents.
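The ordering described above can be sketched as a sort over per-file signals. The weights below are arbitrary placeholders and the `ChangedFile` fields are hypothetical; the one deliberate design choice is that line count never enters the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangedFile:
    path: str
    lines_changed: int  # recorded, but intentionally ignored by the score
    blind_edit: bool = False
    sensitive: bool = False
    flagged_uncertain: bool = False
    behavior_change_untested: bool = False


def risk_score(f: ChangedFile) -> int:
    """Illustrative weights only; a team would tune these to its own paths."""
    score = 0
    score += 3 if f.blind_edit else 0
    score += 2 if f.sensitive else 0
    score += 2 if f.flagged_uncertain else 0
    score += 2 if f.behavior_change_untested else 0
    score += 2 if (f.blind_edit and f.sensitive) else 0  # the worst combination
    return score


def review_order(files: list[ChangedFile]) -> list[str]:
    """Highest risk first; ties broken by path so the order is stable."""
    return [f.path for f in sorted(files, key=lambda f: (-risk_score(f), f.path))]


files = [
    ChangedFile("tests/fixtures/users.json", 500),
    ChangedFile("src/auth/token.rs", 5, sensitive=True, behavior_change_untested=True),
    ChangedFile("src/auth/refresh.rs", 12, blind_edit=True, sensitive=True, flagged_uncertain=True),
]
print(review_order(files))
# ['src/auth/refresh.rs', 'src/auth/token.rs', 'tests/fixtures/users.json']
```

The 500-line fixture sinks to the bottom with a score of zero, while the five-line auth change and the blind edit rise above it, which is the reordering the paragraph above argues for.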

FAQ

Should reviewers still read the diff?

Yes. The diff remains the ground truth for code changes. Provenance does not replace it — it tells you which parts of the diff to read first and with the most care, and supplies the context needed to interpret them correctly.

Isn't generating more code the bigger problem than reviewing it?

Generation is fast; the bottleneck has moved to review. And generated volume is not generated risk: a 500-line generated fixture can warrant a skim while a five-line change to an auth predicate can outweigh it. Ranking attention by risk matters more than keeping up with line count.

Can commit messages or PR descriptions solve this?

They help, but they are summaries — and increasingly auto-generated by the same agent that wrote the code. They rarely preserve the verbatim prompt, the files actually observed, flagged uncertainty, real test output, or multi-agent handoffs, and they are unverifiable against what actually ran. A description tells you what the author says happened; provenance records what did.

What is context coverage, and why does it matter more than line count?

Context coverage is the comparison between the files a change touched and the files the agent actually read (its OBSERVE trace). A file edited but never read is a blind edit, which is the high-signal risk for agent code: correct-looking, but produced without the evidence it depends on. That makes it a more useful guide to where a change is likely to be wrong than the number of lines it spans.

Does an LLM reviewer make this unnecessary?

No — it makes it more useful. A review model that sees only the diff inherits the same blind spot as a human: it cannot tell whether the author had the right evidence. Feed it the prompt, the observed context, and the test results and it reviews agent code far better, for the same reason a human does.

What is the simplest first improvement?

Capture the prompt and the test result for every AI-assisted commit. That alone makes later review, triage, and rollback much more practical, and it is the foundation the context trace and review packet build on.

Sources and verification

This article avoids vendor-specific claims that were not checked against primary docs or local h5i CLI behavior.
