Feature · 2026-06-21

Run an AI Agent Ensemble: h5i team

Sealed workspaces. Permissioned reviews. Auditable convergence. Point several coding agents at the same task, let each one work blind in its own sandboxed worktree, then let a neutral, sandboxed verifier — not any agent's own say-so — decide which candidate wins. The human merges with proof, hands-off.

By Koukyosyumei · Reading time: 12 min · Tags: Ensembles, Multi-agent, Verification

Running one agent on a task is now routine. Running several — three Claudes with different system prompts, or a Claude and a Codex racing the same bug — is the obvious next move, because diversity beats a single sample. But the moment you do it, the same gap that auditable workspaces closed for one agent reopens, worse, for many: Git records only the diff that finally lands. It never records who saw whose work, who influenced whom, or why this candidate won and that one lost.

h5i team is the answer: an auditable agent ensemble. It groups N isolated h5i env workspaces against one shared task and drives them through a phased, permissioned evidence-publication protocol: independence first, discussion only when you ask for it, and a neutral verifier at the end that no contestant can fake. Everything it does is recorded as an append-only event log in your Git refs.

The problem: ensembles erase their own provenance

Hand a task to three agents and merge the best result, and you have made an editorial decision with no record of how you made it. Did the agents work independently, or did the second one read the first one's branch and converge on the same wrong idea? When you picked a winner, was it because its tests genuinely passed, or because it said they did? Six months later, an incident reviewer cannot reconstruct the tournament. The diff that shipped looks exactly like a diff from a single agent. The ensemble — the most interesting part — evaporated.

The popular alternative makes it worse. "Multi-agent chat" frameworks wire the agents into a group conversation and let them talk from turn one. That sounds collaborative, but it produces premature convergence — the loudest first draft anchors everyone, diversity collapses, and you get one opinion wearing three hats. And the chat log is not an audit trail: it is unstructured, unverifiable, and disconnected from the code that actually changed.

The idea: sealed workspaces, then permissioned evidence

h5i team = N isolated h5i env workspaces working the same task, driven through a phased, permissioned evidence-publication protocol, with a neutral sandboxed verifier deciding the winner.

The unit of an ensemble is a roster, and a roster member is a persona, not a backend. You can run three Claudes with different system prompts and skills — an architect, an implementer, a skeptic — or mix a Claude and a Codex. Each persona gets its own sealed h5i env: a confined, disposable worktree with its own isolation policy and its own provenance. They cannot see each other's work until you say so.

Because each candidate lives in a real h5i env, you already get everything the sandbox gives a single agent — the worktree, the egress policy, the captured runs — and h5i team layers the tournament on top. The team itself is not a new database; it is a thin protocol over envs you already created.

The phases

A team run is a small state machine. The ordering enforces the one property an ensemble must have to be trustworthy: independence first.

team lifecycle

draft  →  dispatched  →  independent_work  →  sealed_submit  →  review
       →  discuss (opt-in)  →  improve  →  verify  →  compare  →  verdict  →  applied

Until a run is frozen it stays in draft, and in draft every candidate works blind — no agent can read another's submission, and discussion is refused on purpose. A persona freezes its work into an immutable submission with h5i team submit; h5i team freeze then moves the whole run into sealed_submit, sealing every attempt as an independent baseline.

Only after the freeze does collaboration open, and even then it is opt-in. Discussion (h5i team discuss) and cross-agent review (h5i team grant-review, h5i team review submit) are only legal post-freeze, so the first attempt is always a clean, independent sample. The protocol is explicit about this: a discuss in draft would let influence leak before anyone committed, so it is rejected.
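The gating rule can be sketched in a few lines of Python. This is an illustrative model, not h5i's implementation: the phase names are copied from the lifecycle above, while the function names, the error type, and the choice to treat every phase from sealed_submit onward as "frozen" are assumptions.

```python
# Phase names are copied from the lifecycle; everything else is hypothetical.
PHASES = [
    "draft", "dispatched", "independent_work", "sealed_submit", "review",
    "discuss", "improve", "verify", "compare", "verdict", "applied",
]
FREEZE_INDEX = PHASES.index("sealed_submit")

# Actions the text describes as legal only after the freeze.
POST_FREEZE_ONLY = {"discuss", "grant-review", "review submit"}


def is_frozen(phase: str) -> bool:
    """True once the run has reached sealed_submit or any later phase."""
    return PHASES.index(phase) >= FREEZE_INDEX


def check_action(phase: str, action: str) -> None:
    """Raise PermissionError if the action would leak influence before the freeze."""
    if action in POST_FREEZE_ONLY and not is_frozen(phase):
        raise PermissionError(
            f"{action!r} is refused in phase {phase!r}: the run is not frozen"
        )


if __name__ == "__main__":
    check_action("draft", "submit")            # allowed: blind, independent work
    check_action("sealed_submit", "discuss")   # allowed: the run is frozen
    try:
        check_action("draft", "discuss")       # refused on purpose
    except PermissionError as err:
        print(err)
```

Refusing the action outright, rather than allowing it and flagging it later, is what keeps the first attempt a clean sample.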

The independence guarantee: any candidate revised after the freeze — after it could have seen another agent's work — is stamped independent = false, and the run records the exact influence edges (which discussion events and which artifact ids touched it). You can always tell an original idea from a borrowed one.
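As a rough illustration of that stamp, here is a minimal Python sketch. The Candidate record, its field names, and the (kind, id) shape of an influence edge are assumptions made for the example; only the rule itself (a post-freeze revision sets independent to false and records what touched it) comes from the text.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Candidate:
    agent_id: str
    independent: bool = True
    # Hypothetical edge shape: (kind, id), e.g. ("discussion", "ev-7").
    influence_edges: List[Tuple[str, str]] = field(default_factory=list)


def record_revision(
    candidate: Candidate,
    run_frozen: bool,
    discussion_ids: Iterable[str] = (),
    artifact_ids: Iterable[str] = (),
) -> Candidate:
    """Stamp a revision; anything revised after the freeze loses independence."""
    if run_frozen:
        # The candidate could have seen other work, so it is no longer a clean sample.
        candidate.independent = False
        candidate.influence_edges += [("discussion", d) for d in discussion_ids]
        candidate.influence_edges += [("artifact", a) for a in artifact_ids]
    return candidate


if __name__ == "__main__":
    c = Candidate("agent-1")
    record_revision(c, run_frozen=True, discussion_ids=["ev-7"], artifact_ids=["art-3"])
    print(c.independent, c.influence_edges)
```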

The golden path

Here is a complete tournament: fix an auth bug with an architect persona and an implementer persona, then converge hands-off. The commands and flags below are the real ones.

team run

# each env is created with a profile that bakes its persona (PERSONA.md)
$ h5i env create claude-architect --profile architect
$ h5i env create codex-impl --profile implementer
# group the envs as a team
$ h5i team create fix-auth --base HEAD
$ h5i team add-env fix-auth claude-architect --runtime claude
$ h5i team add-env fix-auth env/codex/codex-impl --runtime codex
$ h5i team status fix-auth   # note the auto-generated agent ids

# each agent works in its own env, then freezes an immutable candidate
$ h5i team submit fix-auth --agent <agent-id>   # ids from `team status`
$ h5i team freeze fix-auth                  # seals both independent attempts

# neutral, SANDBOXED verifier re-runs each candidate at the shared base
$ h5i team verify fix-auth --agent <agent-id> -- cargo test   # repeat per candidate
$ h5i team compare fix-auth                # side-by-side + verifier metrics
$ h5i team finalize fix-auth               # rule: tests pass + applies clean + smallest diff
$ h5i team apply fix-auth                  # replays the winning patch (gated on the verdict)

add-env takes the env (as a slug, agent/slug, or env/agent/slug) and optional --runtime / --model attributes. The working-style brief comes from the env's profile (persona = [...] in .h5i/env.toml, baked into PERSONA.md at env create), so add-env just inherits it. Each agent gets an auto-generated id (shown by h5i team status; pin your own with --as if you prefer). Roster changes are only allowed while the run is still in draft — once frozen, the contestant list is part of the record.

The neutral verifier

This is the trust primitive the whole feature turns on. The finalization metrics must not come from each agent's own captures. An agent can omit a failing test, run the wrong command, or simply assert it passed — its self-reported evidence is exactly the thing you cannot trust when you are picking a winner.

h5i team verify ignores all of that. It replays each frozen candidate into a throwaway worktree at the shared base commit and runs the declared command under h5i's sandbox — the fail-closed default build/test profile. You choose the tier with --isolation workspace|process|supervised|container; left unset, it auto-picks the strongest the host can enforce and falls back to workspace. The tier that actually ran is recorded on the verification, so the audit shows not just that tests passed but how strongly the box was confined when they did.
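The auto-pick behaviour can be pictured with a small Python sketch. It assumes the four tiers are ordered weakest to strongest as they appear in the --isolation flag; the source does not state that ordering. The function name and the idea of passing in a set of host-supported tiers are hypothetical.

```python
from typing import Iterable

# ASSUMPTION: weakest to strongest, following the order of the --isolation values.
TIERS = ["workspace", "process", "supervised", "container"]


def auto_pick_tier(host_supported: Iterable[str]) -> str:
    """Return the strongest tier the host can enforce, else fall back to workspace."""
    supported = set(host_supported)
    for tier in reversed(TIERS):  # try the strongest tier first
        if tier in supported:
            return tier
    return "workspace"


if __name__ == "__main__":
    # The chosen tier is what would be recorded on the verification.
    print(auto_pick_tier({"workspace", "process"}))
```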

Signal	Source	Role in the verdict
Tests pass	Neutral verifier	Hard gate.
Applies cleanly	Neutral verifier	Hard gate.
Smallest diff	Frozen candidate	Tie-breaker, gate-passers only.

The hard gates come from the verifier; the output-reducing metric (smallest diff) is a tie-breaker only among candidates that already cleared the gates. That ordering is deliberate: you cannot win by deleting the tests, because deleting tests does not survive the gate. And if candidates were verified with different commands, finalize refuses with a no_verdict — a candidate waved through with a weaker command (say, true) is not comparable to one judged by cargo test, so the protocol declines to crown it.
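The ordering of gates and tie-breaker is easy to see in code. The following Python sketch is a model of the rule as described here, not h5i's source: the record types and field names are invented, diff size is assumed to be a line count, and ties on diff size are broken by agent id purely to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Verification:
    agent_id: str
    command: str           # the command the neutral verifier ran
    tests_pass: bool       # hard gate, from the verifier
    applies_cleanly: bool  # hard gate, from the verifier
    diff_lines: int        # tie-breaker, from the frozen candidate (assumed unit)


@dataclass(frozen=True)
class Verdict:
    selected: Optional[str]
    reason: str
    can_auto_apply: bool


def finalize(verifications: List[Verification]) -> Verdict:
    # Candidates judged by different commands are not comparable.
    if len({v.command for v in verifications}) > 1:
        return Verdict(None, "no_verdict: candidates were verified with different commands", False)
    # Hard gates first: only verifier-passing, cleanly applying candidates survive.
    passers = [v for v in verifications if v.tests_pass and v.applies_cleanly]
    if not passers:
        return Verdict(None, "no_verdict: no candidate cleared both gates", False)
    # Smallest diff is consulted only among gate-passers.
    winner = min(passers, key=lambda v: (v.diff_lines, v.agent_id))
    return Verdict(winner.agent_id, f"smallest diff among candidates passing `{winner.command}`", True)


if __name__ == "__main__":
    print(finalize([
        Verification("a", "cargo test", True, True, 40),
        Verification("b", "cargo test", False, True, 5),   # tiny diff, failing tests
        Verification("c", "cargo test", True, True, 12),
    ]))
```

In this example candidate b has the smallest diff but never reaches the comparison, because it fails a hard gate.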

Hands-off finalization

h5i team finalize evaluates one conservative rule, recorded verbatim as the method string rule:VerifierTestsPass,AppliesCleanly,SmallestDiff. The verdict is explainable, never opaque: it records the method, the command that decided it, and a reason line for the winner (or, when nothing wins, the reason why not).

verdict

selected: codex#1
method:   rule:VerifierTestsPass,AppliesCleanly,SmallestDiff
reasons:  codex#1 applies cleanly
          codex#1 verifier tests passed via `cargo test`
          smallest diff among verifier-passing candidates
can_auto_apply: true

If no candidate clears the gates, the run records a no_verdict and applies nothing — it will never silently fall back to a loser. And h5i team apply is gated on verdict.can_auto_apply (a real verifier verdict plus a clean, conflict-free apply). The escape hatch is explicit: pass --force to override the gate, or --winner <id> to apply a specific submission. The default path is the safe one — a human reads the verdict, sees the proof, and merges.

Automation, conservatively

Because finalization is fully mechanical, you can let a worker drive it. The worker is deliberately timid: h5i team worker --once is a single lease-and-finalize pass (idempotent, with TTL'd leases so concurrent workers don't collide), and it never auto-applies. Applying a winner to a branch always stays a human (or explicitly forced) act.
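To make "idempotent, with TTL'd leases" concrete, here is a minimal in-memory sketch in Python. It assumes nothing about how h5i actually stores leases; LeaseTable, worker_once, and their parameters are hypothetical names, and the clock is injectable only so the behaviour can be exercised without waiting.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


class LeaseTable:
    """Hypothetical in-memory lease store; a lease expires after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._leases: Dict[str, Tuple[str, float]] = {}

    def try_acquire(self, run_id: str, worker_id: str) -> bool:
        now = self._clock()
        held = self._leases.get(run_id)
        if held is not None:
            holder, expires_at = held
            if holder != worker_id and expires_at > now:
                return False  # another worker holds a live lease
        self._leases[run_id] = (worker_id, now + self._ttl)
        return True


def worker_once(
    ready: Iterable[str],
    finalized: Dict[str, str],
    leases: LeaseTable,
    worker_id: str,
    finalize: Callable[[str], str],
) -> List[str]:
    """One finalize-only pass. It records verdicts and never applies anything."""
    done = []
    for run_id in ready:
        if run_id in finalized:
            continue  # idempotent: a finalized run is left alone
        if not leases.try_acquire(run_id, worker_id):
            continue  # a concurrent worker is handling this run
        finalized[run_id] = finalize(run_id)
        done.append(run_id)
    return done
```

Because an expired lease can be taken over, a worker that crashes mid-pass does not block the run forever; the next pass picks it up once the TTL lapses.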

For production, drive --once from an external scheduler. It is crash-resilient and needs no long-lived process:

cron

# finalize verifier-ready teams every 5 minutes
*/5 * * * *  cd /srv/repo && h5i team worker --once

If you would rather not run a scheduler, there is an opt-in in-process loop — h5i team worker --watch --interval 60 — that repeats the same finalize-only pass. It is a convenience, not the recommended default; the cron line above is sturdier.

It is all Git, and all auditable

A team run's state lives in refs/h5i/team/<run-id> as an append-only event log, and that log is the source of truth. Phase, roster, and verdict are not stored fields you could tamper with; they are folded from the events — deduped by id and union-merged — so two clones that each appended events reconcile cleanly. Every submission, freeze, discussion, review grant, verification, and verdict is one immutable event.
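The "folded from events" idea can be sketched as follows. This is a simplified Python model under stated assumptions: the event fields (id, ts, kind, agent, selected) and the event kinds are invented for illustration, and ordering by timestamp then id is one possible deterministic order, not necessarily the one h5i uses.

```python
from typing import Dict, List

Event = Dict[str, object]


def merge_logs(a: List[Event], b: List[Event]) -> List[Event]:
    """Union-merge two event logs, deduplicating by event id."""
    by_id: Dict[object, Event] = {}
    for ev in list(a) + list(b):
        by_id.setdefault(ev["id"], ev)  # events are immutable, so either copy works
    # ASSUMPTION: a deterministic order by (timestamp, id).
    return sorted(by_id.values(), key=lambda e: (e["ts"], e["id"]))


def fold(events: List[Event]) -> Dict[str, object]:
    """Derive phase, roster, and verdict from the log instead of storing them."""
    state: Dict[str, object] = {"phase": "draft", "roster": [], "verdict": None}
    for ev in events:
        if ev["kind"] == "agent_added":
            state["roster"].append(ev["agent"])
        elif ev["kind"] == "freeze":
            state["phase"] = "sealed_submit"
        elif ev["kind"] == "verdict":
            state["phase"] = "verdict"
            state["verdict"] = ev["selected"]
    return state


if __name__ == "__main__":
    clone_a = [
        {"id": "e1", "ts": 1, "kind": "agent_added", "agent": "claude#1"},
        {"id": "e2", "ts": 2, "kind": "freeze"},
    ]
    clone_b = [
        {"id": "e1", "ts": 1, "kind": "agent_added", "agent": "claude#1"},
        {"id": "e3", "ts": 3, "kind": "verdict", "selected": "claude#1"},
    ]
    print(fold(merge_logs(clone_a, clone_b)))
```

Since the merge is a set union keyed by id, merging in either direction yields the same log, which is why two clones can append independently and still agree on the folded state.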

Because it is a Git ref, the whole tournament travels with h5i share push / h5i share pull. That is what powers the cross-clone review loop: one machine runs the agents and freezes, another pulls the sealed run, verifies independently, and finalizes — the verdict is reproducible from the shared evidence, not from trust. And the visual board, timeline, compare, and verdict views all render from the same events under h5i serve.

Why this matters

An ensemble is only worth running if you can defend its result. h5i team makes that defensible across every axis that matters: safety, because no agent grades its own homework and the verifier runs confined; compliance, because the entire decision — independence, influence edges, the deciding command, the explained verdict — is an immutable, shareable Git record; debugging, because when a merged change goes wrong you can replay exactly which candidate won and why; and real hands-off shipping, because a human can let the tournament run, read one verdict, and merge with proof instead of vibes.

Run your next task as an ensemble

Group a few sandboxed envs into a team, let a neutral verifier pick the winner, and merge with proof.
