How to Review Code Written by AI Agents

Reviewing agent-written code is a provenance problem before it is a syntax problem. Start from intent, verify the context the agent actually observed, rank by risk instead of volume, and require evidence for the behavior that changed. The diff is the last step, not the first.

Published 2026-06-03 · Reading time: 9 min · Tags: Review, Checklist, Risk

Key takeaways

Review in order — intent, context coverage, risk, evidence — then the lines; the diff is the last step, not the first.
Generated volume is not generated risk; triage by blast radius rather than line count.
Make provenance a merge requirement, not a nicety.

On many teams, agents now author a growing share of the diffs that arrive for review. That changes who you are reviewing. The author is no longer a colleague who can defend their reasoning in standup; it is a process that produced a coherent patch in seconds and has already forgotten why. The reviewer inherits the gap. The reliable question is no longer only "is this line correct?" It is "did the agent understand the task, read the right code before editing, and test the behavior it actually changed?"

The line-by-line part of review is the known problem. Humans have done it for decades, and most of those instincts still apply: read the change, reason about edge cases, check naming and error handling. What is novel is everything upstream of the diff. A human author carries the context in their head — the interface they read, the caller they checked, the assumption they made and chose not to verify. An agent's context is real but invisible the moment the patch is written. Provenance-first review is the practice of reconstructing that context from a record instead of guessing at it from the diff.

So review in this order: intent → context coverage → risk → evidence → the lines. The diff comes last because by the time you read it, you should already know what the change was supposed to do, what the agent looked at, where the blast radius is, and which behavior has proof. An auditable workspace is what makes this possible: it records every prompt, observation, command, and test result in the repo, so the reviewer reads a verifiable record rather than reverse-engineering one.

1. Start from intent, not the diff

Technique: read the original prompt or task summary before you open a single file. Compare the scope of the request against the scope of the change. The failure it catches: a patch that is internally correct but solves a broader or narrower problem than asked — the agent that was told "fix the off-by-one in pagination" and rewrote the pagination API, or the one told to "harden the parser" that quietly changed a default. A clean diff cannot tell you the request was misread; only the request can. The command: h5i recall log --limit 1 prints the verbatim prompt, the agent, and the model for the most recent commit, so intent is the first thing on screen, not a guess.

2. Verify the context the agent observed

Technique: before reviewing details, ask what the agent actually read. h5i records an OBSERVE entry for every file the agent opened and an ACT entry for every edit, so you can compare the two sets directly. The failure it catches: the blind edit — a change to an interface where the agent never opened the interface, or a "fix" to a function whose callers it never inspected. Agents are fluent enough to produce plausible changes without the repository evidence a human would naturally gather, and the diff looks the same whether the agent read the surrounding code or invented it. The command: h5i recall context shows the goal, milestones, and the OBSERVE/ACT trace. If a file appears under ACT but not under OBSERVE, the agent edited code it never read — the single highest-signal flag in agent review.
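
The OBSERVE/ACT comparison is, at bottom, an order-aware set check. Here is a minimal sketch in Python, assuming the trace has been exported as plain text lines in the `KIND path (note)` shape used in this article's worked example. The `find_blind_edits` helper and that line format are illustrative assumptions, not an h5i API.

```python
def find_blind_edits(trace_lines):
    """Return files that were edited (ACT) before ever being read (OBSERVE).

    Assumes each line looks like 'KIND  path  (free-text note)'.
    That shape is an assumption for illustration, not a documented format.
    """
    observed = set()
    blind = []
    for line in trace_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        kind, path = parts[0], parts[1]
        if kind == "OBSERVE":
            observed.add(path)
        elif kind == "ACT" and path not in observed and path not in blind:
            # Edited with no earlier read: keep first-seen order for the reviewer.
            blind.append(path)
    return blind


trace = [
    "OBSERVE  limiter.py        (read interface + window logic)",
    "OBSERVE  test_limiter.py   (read existing tests)",
    "ACT      limiter.py        (added burst check)",
    "ACT      api/routes.py     (wired ceiling into the handler)",
]
print(find_blind_edits(trace))  # ['api/routes.py']
```

The check is deliberately order-aware: a file read only after it was edited still counts as a blind edit, which matches the question "were the edited files observed before editing?"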

3. Rank files by risk, not by volume

Technique: do not give every generated line equal attention. Triage by blast radius and prioritize authentication, authorization, billing, data migrations, concurrency, persistence, security boundaries, generated configuration, and public APIs. The failure it catches: attention spent in the wrong place. A 600-line generated test fixture can be lower risk than a five-line change to an authorization check, but a volume-driven review spends its energy on the fixture. Ranking by risk keeps scrutiny proportional to consequence. The command: h5i audit review --limit 50 is a triage funnel that runs before merge — it surfaces the risky changes (secrets, blind edits, touched CI, security-sensitive paths) so the reviewer starts at the top of the funnel instead of the top of the diff.
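
To make "risk, not volume" concrete, here is a small Python sketch that sorts changed files by a path-based risk weight before it looks at line counts. The patterns, weights, and file names are hypothetical and would need tuning per repository. This is not how h5i audit review scores changes.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns and weights; tune them to your own repository.
RISK_PATTERNS = [
    ("*auth*", 5),
    ("*billing*", 5),
    ("*migrations/*", 4),
    (".github/workflows/*", 4),
    ("api/*", 3),
    ("tests/fixtures/*", 0),
]


def risk_score(path):
    """Highest weight among matching patterns; 1 if nothing matches."""
    weights = [w for pattern, w in RISK_PATTERNS if fnmatchcase(path, pattern)]
    return max(weights) if weights else 1


def triage(changed):
    """Sort (path, lines_changed) pairs by risk first, then by size."""
    return sorted(changed, key=lambda item: (-risk_score(item[0]), -item[1]))


changed = [
    ("tests/fixtures/users.json", 600),
    ("services/auth/check.py", 5),
    ("README.md", 12),
]
for path, lines in triage(changed):
    print(risk_score(path), lines, path)
```

With this input the five-line authorization change sorts first and the 600-line fixture last. Taking the maximum weight is a conservative choice: a fixture whose path also mentions auth is still treated as high risk.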

4. Read the uncertainty the agent already admitted

Technique: agents leave their own hedges in the reasoning trace — "assuming the cache is warm", "this is likely correct", "untested edge case", "may need to handle null here". Those are not noise; they are the agent telling you where it knew it was guessing. h5i mines them into NOTE entries (flagged risks, TODOs, deferrals) alongside the trace. The failure it catches: the confident-looking patch built on a stated-but-unverified assumption that no one followed up on. The command: h5i recall context surfaces the NOTE entries; presented as an uncertainty heatmap, they point the human directly at the regions the agent itself marked as incomplete evidence, instead of leaving that signal buried in a transcript.
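
For teams that want to see what hedge mining looks like in principle, the following Python sketch scans transcript lines for the phrases quoted above. It is an illustration under simple assumptions (a fixed phrase list, whole-word matching) and does not describe how h5i derives its NOTE entries.

```python
import re

# Hedge phrases taken from the examples above; extend the list for your own agents.
HEDGES = ["assuming", "likely", "untested", "may need"]
HEDGE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(h) for h in HEDGES) + r")\b",
    re.IGNORECASE,
)


def find_hedges(lines):
    """Return (line_number, text) for every line containing a hedge phrase."""
    return [
        (number, text.strip())
        for number, text in enumerate(lines, start=1)
        if HEDGE_RE.search(text)
    ]


transcript = [
    "Read limiter.py and the window logic.",
    "Assuming the cache is warm, the lookup is O(1).",
    "Added the burst check.",
    "May need to handle null here.",
]
for number, text in find_hedges(transcript):
    print(number, text)
```

The word boundaries keep "unlikely" from matching "likely". A phrase list will still miss hedges worded differently, so treat the output as a starting point for follow-up, not a complete inventory.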

5. Verify tests against the behavior that changed

Technique: a green run is necessary, not sufficient. Check whether the tests exercise the new behavior, the failure mode, and the specific edge case the prompt asked about — not just the path that already worked. The failure it catches: tests that prove only the old happy path, or a "test run" that was really just formatting and type checks. The command: h5i capture commit --tests records the test metrics with the commit (tool, pass/fail counts, summary) as structured tool output rather than a wall of logs, and h5i recall blame <file> --show-prompt maps each line to the commit, model, agent, and the prompt and test result behind it — so you can ask of a specific line, "what behavior was claimed here, and was it actually exercised?"
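
One crude way to put intent and coverage side by side is to check whether any test name even mentions the behavior the prompt asked for. The Python sketch below does only that. The keyword list and test names are hypothetical, and name matching is a heuristic, not proof that a behavior is exercised.

```python
def uncovered_behaviors(required, test_names):
    """Return the behavior keywords that no test name mentions.

    Name matching is a crude first pass: a hit still needs a human to read
    the test, but a miss is a concrete question for the review.
    """
    names = [name.lower() for name in test_names]
    return [kw for kw in required if not any(kw.lower() in n for n in names)]


# Hypothetical test names for the rate-limiter change discussed in this guide.
tests = [
    "test_allows_requests_under_limit",
    "test_window_resets",
    "test_per_key_isolation",
]
print(uncovered_behaviors(["burst", "per_key"], tests))  # ['burst']
```

A miss here means the reviewer should ask for a test or find the one that covers the behavior under another name. A hit only tells you where to start reading.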

6. Account for every touched file (scope creep)

Technique: compare the files the change touched against the files the intent justified. Agents routinely tidy nearby code, rename helpers, reformat, or "improve" unrelated branches while solving a task. Some of that is genuinely useful; all of it widens the blast radius and complicates rollback. The failure it catches: the broad refactor smuggled in under a narrow prompt, where an unrelated behavioral change rides along with a one-line fix. A plain diff shows you the edit but not whether it was asked for, which is one more reason a plain diff is not enough to judge agent work. The command: h5i recall log next to git diff --stat lets you set the files the agent edited against the prompt that authorized them; anything edited but not implied by the intent gets justified or split out.
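
The touched-versus-justified comparison can be scripted once the reviewer writes down which paths the prompt implies. This Python sketch parses the tab-separated output of `git diff --numstat` (the machine-readable sibling of `git diff --stat`) and flags anything outside a declared scope. The scope prefixes and file names are hypothetical.

```python
def out_of_scope(numstat_text, in_scope_prefixes):
    """List changed paths that fall outside the reviewer-declared scope.

    numstat_text is the output of `git diff --numstat`: one
    'added<TAB>deleted<TAB>path' row per file ('-' counts for binary files).
    """
    prefixes = tuple(in_scope_prefixes)
    flagged = []
    for row in numstat_text.splitlines():
        fields = row.split("\t")
        if len(fields) != 3:
            continue  # skip anything that is not a plain numstat row
        added, deleted, path = fields
        if not path.startswith(prefixes):
            flagged.append((path, added, deleted))
    return flagged


numstat = (
    "3\t1\tpagination/offset.py\n"
    "40\t38\tutils/strings.py\n"
    "2\t0\ttests/test_offset.py\n"
)
# Scope a reviewer might derive from the prompt "fix the off-by-one in pagination".
print(out_of_scope(numstat, ["pagination/", "tests/"]))
# [('utils/strings.py', '40', '38')]
```

Each flagged file then gets the treatment described above: justify it against the intent or split it into its own change.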

7. Make provenance a merge requirement

Technique: treat the provenance packet as a required artifact, not a nicety. The minimum merge packet is the prompt, the agent identity and model, the changed commits, the test evidence, and the integrity findings. The failure it catches: the slow erosion of accountability — six months later, "why is this here and who decided it was safe?" has no answer because the working state was thrown away at merge. The command: h5i capture commit --audit attaches an IntegrityReport (Valid / Warning / Violation) to the commit, and h5i share pr post renders the whole record into a sticky GitHub PR comment — the review packet — using gh. Styles like review (reviewer-first triage), receipt, detective, and replay shape the presentation; h5i share pr body prints the same markdown to stdout for CI.
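
A merge requirement is only real if something checks it. As a sketch of what a CI gate could look like, the Python below validates a packet against the minimum fields listed above. The dictionary schema and field names are assumptions made for illustration; h5i's actual record format may differ.

```python
# Field names are hypothetical; they mirror the minimum merge packet above.
REQUIRED_FIELDS = ["prompt", "agent", "model", "commits", "tests", "integrity"]


def missing_provenance(packet):
    """Return the required fields that are absent or empty in a packet dict."""
    return [field for field in REQUIRED_FIELDS if not packet.get(field)]


def may_merge(packet):
    """Block the merge on missing fields or an integrity Violation."""
    return not missing_provenance(packet) and packet["integrity"] != "Violation"


packet = {
    "prompt": "reject bursts above the per-key rate ceiling",
    "agent": "claude-code",
    "model": "claude-sonnet-4-6",
    "commits": ["4f9c1a2e"],
    "tests": {"tool": "pytest", "passed": 42},
    "integrity": "Warning",
}
print(missing_provenance(packet), may_merge(packet))  # [] True
```

Whether a Warning should block the merge or only require a human sign-off is a policy decision. This sketch lets it through and stops only on a Violation or an incomplete packet.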

A worked review packet

Here is what a reviewer actually sees for one commit, and how they triage from it. Start with the record:

$ h5i recall log --limit 1
commit 4f9c1a2e8b7d3a05...
Author:    claude-code <agent@example.dev>
Agent:     claude-code (claude-sonnet-4-6)
Prompt:    "reject bursts above the per-key rate ceiling"
Usage:     +18,204 tokens | model: claude-sonnet-4-6
Tests:     ✔ 42 passed [pytest]

$ h5i recall context
goal: enforce per-key burst ceiling in the rate limiter
OBSERVE  limiter.py        (read interface + window logic)
OBSERVE  test_limiter.py   (read existing tests)
ACT      limiter.py        (added burst check)
ACT      api/routes.py     (wired ceiling into the handler)
NOTE     "assuming clock is monotonic across workers — untested"

$ h5i share pr post --style review
# upserts the sticky PR comment: prompt, agent/model, files
# observed vs edited, test result, and integrity flags

The triage takes seconds and is driven entirely by provenance. The intent is narrow and clear. The agent observed limiter.py and its tests before editing them — good — but it also edited api/routes.py, which never appears under OBSERVE: a blind edit on a request path, so that file moves to the top of the read pile. Tests are green, but the prompt was about a burst ceiling and the count (42 passed) does not tell you a burst-rejection case exists; the reviewer goes to h5i recall blame api/routes.py --show-prompt to confirm the new behavior is actually exercised. And the agent's own NOTE flags a cross-worker clock assumption it never verified — exactly the kind of "plausible but unproven" risk that a diff hides. None of this required reading the diff first; the diff is now a confirmation step for three specific suspicions, not a blank search.

How this compares to other review styles

Provenance-first review is not a replacement for the two approaches teams already use; it supplies the ground truth both of them lack.

Conventional human diff review assumes a human author with full context — someone who can explain intent, who you trust to have read the callers, whose uncertainty you can ask about directly. Those assumptions are load-bearing, and they break for agent code. The author cannot be questioned, the context is not in anyone's head, and the reviewer is left inferring intent from the very artifact that intent produced. Diff review still does the irreplaceable work of line-level judgment; it just needs the upstream record to know where to point that judgment.

LLM-as-reviewer — AI PR bots that comment on pull requests — is fast and genuinely useful for surface issues: style, obvious bugs, missing null checks, naming. But most of these tools see only the diff, the same blind spot as the human reviewer. They have no ground truth about what the authoring agent observed, what it assumed, or which tests it actually ran, so they cannot tell a verified change from a confident guess, and they can hallucinate a problem (or an all-clear) that the diff alone cannot confirm or refute. The honest framing is complementary, not competitive: an LLM reviewer fed the provenance record — the prompt, the OBSERVE/ACT trace, the captured test metrics — is stronger than either provenance or an LLM reviewer alone. The record narrows where the bot should look and gives it facts to check against; the bot scales the line-level reading. Provenance-first review is the substrate; LLM review and human review are both better on top of it.

The checklist, annotated

Does the change match the prompt? h5i recall log --limit 1 — intent before diff.
Were the edited files observed before editing? h5i recall context — any ACT without a matching OBSERVE is a blind edit.
Does it touch a security, data, or API boundary? h5i audit review — triage by blast radius, not line count.
Did the agent flag its own uncertainty? h5i recall context NOTE entries — follow up every "assuming"/"untested".
Do tests exercise the new behavior? h5i capture commit --tests + h5i recall blame <file> --show-prompt — green is not the same as covered.
Is every touched file justified by the intent? h5i recall log with git diff --stat — split out smuggled refactors.
Is the provenance durable and on the PR? h5i capture commit --audit + h5i share pr post — provenance is a merge requirement, not a courtesy.

Common failure patterns, and how provenance surfaces each

The highest-signal findings in agent review come from process, not syntax. Four recur often enough to look for by default:

Plausible code on an unverified assumption. The change reads cleanly but rests on something the agent guessed — a cache being warm, a clock being monotonic, an input already validated upstream. Surfaced by: the agent's own NOTE/uncertainty entries in h5i recall context, where the hedge it wrote while reasoning is preserved instead of discarded.
Tests that prove only the old path. The suite is green because it never exercised the behavior that changed. Surfaced by: h5i capture commit --tests metrics read against the prompt — a burst-ceiling change with no burst-rejection test is visible when intent and coverage sit side by side.
A broad refactor on a narrow prompt. A one-line request returns a sweeping rename or reorganization. Surfaced by: the file set in h5i recall log compared to the prompt — edits with no basis in the intent stand out immediately.
A helper changed without checking its callers. A shared function is "fixed" while the call sites that depend on the old contract go unread. Surfaced by: ACT-without-OBSERVE in the trace — the helper appears as an edit, its callers never appear as observations.

The thread through all four is the same: generated volume is not generated risk. A large fixture can be safe; a five-line authorization change can be the whole review. The provenance packet makes that triage explicit — what the agent was asked, what it inspected, what it skipped, and what evidence supports the behavior change — so the reviewer spends judgment where consequence lives.

FAQ

Should AI-generated code need stricter review?

It needs different review, not uniformly stricter. The code may be high quality, but the reviewer needs extra evidence about intent, context, and tests because the agent's working state is otherwise invisible. Provenance lets you apply normal-strength scrutiny in the right places instead of blanket suspicion everywhere.

What is the biggest mistake?

Reviewing only the final diff and ignoring whether the agent had enough repository context to make the change. A blind edit — a file changed but never read — looks identical to a careful one in a diff.

Can an AI PR bot do this instead?

Not on its own. Most AI reviewers see only the diff, so they share the human reviewer's blind spot about what the authoring agent observed and tested. Feed them the provenance record and they get materially better; provenance is the substrate, not a competitor.

How does h5i help?

h5i turns prompts, observation/action traces, uncertainty notes, test evidence, and integrity findings into Git-native records that reviewers — human or machine — can inspect before merging, via h5i recall log, h5i recall context, h5i audit review, and h5i share pr post.

Sources and verification

This article avoids vendor-specific claims that were not checked against primary docs or local h5i CLI behavior.
