Design · 2026-06-06

One Schema for Every Tool: Structured Output for AI Agents

Filtering shrinks tool output. It doesn't make it machine-actionable: the agent still has to parse a different free-text shape for pytest, cargo, tsc, eslint, mypy… h5i adds a unified JSON/YAML result schema on top of reduction, one shape across every tool, so the agent can branch on a status, iterate findings, dedupe by fingerprint, and query captures. That schema is what sets h5i apart from text-only reducers.

By Koukyosyumei Reading time 9 min Tags Structured output · Schema · Tooling

Key takeaways

A failing test, a compile error, and a lint diagnostic are the same shape, so they collapse into one typed ToolResult / Finding.
The schema is honest by construction: status comes from the exit code, parser_confidence is explicit, and the raw bytes are always recoverable.
Reduction makes tool output small; the typed, fingerprinted, queryable record is what makes it actionable.

Compressed tool logs are one of the core properties of an auditable workspace, and structured output is what makes those logs machine-actionable, not just smaller. Token reduction is a real win (see the object-store post): keep the raw output out-of-band, hand the agent a small summary. Tools like rtk and headroom (both Apache-2.0, and the prior art h5i's text filters build on) do this well. But a filtered summary is still text, and every tool's text is shaped differently. The agent that wants to know "did it pass, and if not, which test, on what line?" must re-learn pytest's layout, then cargo's, then tsc's. There's nothing to branch on, nothing to dedupe, nothing to query.

Two differently shaped tool outputs (pytest and cargo test) pass through h5i capture run and come out as two ToolResults in the same structured format: the same tool/status/findings fields with different values, ~95% fewer tokens, and the raw output kept out-of-band and recoverable.

The idea: one result, every tool

Here's what an agent normally swallows — a single failing pytest run:

$ pytest -q
============================= test session starts ==============================
platform linux -- Python 3.11.4, pytest-8.1.1, pluggy-1.4.0
rootdir: /home/dev/app
plugins: anyio-4.3.0, cov-4.1.0
collected 121 items

tests/test_auth.py .........................F.............              [ 30%]
tests/test_api.py ....................................                  [ 70%]
tests/test_db.py ..................................                     [100%]

=================================== FAILURES ===================================
______________________________ test_refresh ___________________________________

    def test_refresh():
        token = refresh(expired_session())
>       assert token.ttl == 100
E       assert 0 == 100
E        +  where 0 = Token(ttl=0).ttl

tests/test_auth.py:42: AssertionError
=============================== warnings summary ===============================
tests/test_api.py::test_list
  /home/dev/app/api.py:88: DeprecationWarning: client.get(...) is deprecated
    return client.get(url)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_auth.py::test_refresh - assert 0 == 100
1 failed, 120 passed, 1 warning in 4.31s

Run that through h5i capture run and the same run becomes the compact typed result below — roughly 95% fewer tokens, with the raw output kept out-of-band and one command away. h5i parses each tool into a single typed ToolResult. The key realization is that a failing test, a compile error, and a lint diagnostic are the same shape: a thing, somewhere, with a message and a severity. So they all become a unified Finding, under one envelope:

tool: pytest
kind: test            # test | lint | typecheck | build | vcs | generic
status: failed        # passed | ok | failed | error | unknown
exit_code: 1
counts: { failed: 1, passed: 120 }
parser_confidence: parsed     # parsed | heuristic | generic
raw_oid: sha256:934f…         # the full output, always recoverable
findings:
  - kind: test_failure        # test_failure | diagnostic | build_error | panic | generic
    severity: failure
    id: tests/test_auth.py::test_refresh
    message: assert 0 == 100
    location: tests/test_auth.py:42
    fingerprint: 0bb827e4e61a  # stable across line shifts → dedupe / track
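
To make the shape above concrete, here is a minimal Python mirror of the envelope and its findings. It is a sketch for illustration only: the field names come from the YAML above, but the dataclasses, the `_pick` helper, and `from_dict` are hypothetical and are not h5i's own types or API.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Finding:
    kind: str                      # test_failure | diagnostic | build_error | panic | generic
    severity: str
    message: str
    fingerprint: str
    id: Optional[str] = None
    location: Optional[str] = None


@dataclass
class ToolResult:
    tool: str
    kind: str                      # test | lint | typecheck | build | vcs | generic
    status: str                    # passed | ok | failed | error | unknown
    exit_code: int
    parser_confidence: str         # parsed | heuristic | generic
    raw_oid: str
    counts: dict = field(default_factory=dict)
    findings: list = field(default_factory=list)


def _pick(cls, doc: dict) -> dict:
    """Keep only the keys this sketch models; anything else is ignored."""
    known = {f.name for f in fields(cls)}
    return {k: v for k, v in doc.items() if k in known}


def from_dict(doc: dict) -> ToolResult:
    """Build a ToolResult from an already-decoded JSON/YAML mapping."""
    result = ToolResult(**_pick(ToolResult, doc))
    result.findings = [Finding(**_pick(Finding, f)) for f in doc.get("findings", [])]
    return result


# The pytest capture shown above, as a plain mapping (raw_oid left elided as in the source).
example = from_dict({
    "tool": "pytest",
    "kind": "test",
    "status": "failed",
    "exit_code": 1,
    "counts": {"failed": 1, "passed": 120},
    "parser_confidence": "parsed",
    "raw_oid": "sha256:934f…",
    "findings": [{
        "kind": "test_failure",
        "severity": "failure",
        "id": "tests/test_auth.py::test_refresh",
        "message": "assert 0 == 100",
        "location": "tests/test_auth.py:42",
        "fingerprint": "0bb827e4e61a",
    }],
})
```

Because every tool lands in the same two types, code written against `example.status` or `example.findings[0].location` needs no per-tool branches.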

Swap pytest for cargo test, tsc, eslint, ruff, mypy, or go test and you get the same fields, with optional ones filled in where they apply: a rule for a linter or type checker (e.g. TS2322), expected/actual for an assertion, a build_error kind for a compile failure, suggestions plus a fixable flag for an autofixable lint. An agent learns the schema once.

Not every tool's output is diagnostic-shaped, so the envelope keeps a small extra escape hatch for parser-specific data that doesn't fit a Finding: a coverage percentage, an install tally, a diff's +/- stats, a benchmark number, a VCS file list. The core stays a tight list of findings; the long tail still has somewhere to land without bending the common shape. The schema is versioned (schema_version), so stored manifests stay readable as the shape evolves.

Three renders, one source of truth

The typed result drives every output format, so they never drift:

Format	Shape	For
`compact` (default)	one line per finding	token-minimal agent reading
`structured`	full YAML	inspection
`json`	canonical JSON	programmatic / the `h5i_capture_run` MCP tool

JSON is canonical: it's what's stored in the git-tracked manifest and what the MCP tool returns. The compact text and YAML are renders of the same struct, so a capture is a record, not a one-shot log line.
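
The "renders of one struct" idea can be sketched in a few lines of Python. Everything here is assumed for illustration: `render_json` and `render_compact` are hypothetical names, the sorted-key JSON is just one way to get a stable serialization, and the one-line-per-finding layout is invented rather than h5i's actual compact format.

```python
import json

# One result, held as a plain mapping; both renders below read from it.
result = {
    "tool": "pytest",
    "kind": "test",
    "status": "failed",
    "exit_code": 1,
    "findings": [{
        "severity": "failure",
        "location": "tests/test_auth.py:42",
        "message": "assert 0 == 100",
        "fingerprint": "0bb827e4e61a",
    }],
}


def render_json(result: dict) -> str:
    """Stable JSON: sorted keys and fixed separators, so equal results serialize equally."""
    return json.dumps(result, sort_keys=True, separators=(",", ":"))


def render_compact(result: dict) -> str:
    """A header line, then one line per finding (layout is illustrative only)."""
    lines = [f"{result['tool']} {result['status']} exit={result['exit_code']}"]
    for f in result["findings"]:
        lines.append(
            f"{f['severity']} {f.get('location', '?')} {f['message']} [{f['fingerprint']}]"
        )
    return "\n".join(lines)
```

Neither function holds state of its own, so the two outputs cannot disagree about the status or the findings: they are both derived from `result` at call time.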

What the schema buys an agent

Because every capture is the same typed object, an agent (or a human) can act on it:

Branch on a field. status: failed is a single field comparison, not a regex over prose. And it's honest: status is derived from the exit code, never guessed from text, so a passing-looking log on a nonzero exit never reads as passed.
Iterate findings. Each carries a precise location and message, so the agent can jump straight to tests/test_auth.py:42.
Dedupe / track by fingerprint. Each finding has a stable fingerprint (the first 12 hex characters of a SHA-256 over tool + rule + digit-normalized location + message), so "the same failure" is recognizable across runs even as line numbers shift. h5i recall search --fingerprint <fp> answers "has this exact failure happened before?" across every capture.
Query the store. The result lives in the manifest, so h5i recall objects --status failed --tool pytest lists captures, and h5i recall search goes deeper, matching finding message/rule/path/severity/kind across every captured tool.
Know how much to trust it. parser_confidence says whether the result was parsed by a dedicated adapter, inferred by a heuristic, or is a generic exit-code-only fallback.
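
The fingerprint recipe in the list above can be sketched as follows. This is an assumption-laden illustration, not h5i's implementation: the NUL separator, replacing digit runs with "#", and normalizing only the location (not the message) are all guesses, so the values it produces will not match h5i's fingerprints.

```python
import hashlib
import re


def normalize_location(location: str) -> str:
    """Replace every run of digits with '#', so a shifted line number hashes the same."""
    return re.sub(r"\d+", "#", location)


def fingerprint(tool: str, rule: str, location: str, message: str) -> str:
    """First 12 hex characters of a SHA-256 over the four parts (separator is assumed)."""
    key = "\x00".join([tool, rule, normalize_location(location), message])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Same failure before and after 15 lines were inserted above it.
before = fingerprint("pytest", "", "tests/test_auth.py:42", "assert 0 == 100")
after = fingerprint("pytest", "", "tests/test_auth.py:57", "assert 0 == 100")
```

One trade-off of this particular normalization is that digits in file names are collapsed too, so `test_v1.py` and `test_v2.py` would share a key; a stricter variant could normalize only the part after the last colon.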

$ h5i capture run --format json -- pytest -q | jq '.status, .findings[0].location'
"failed"
"tests/test_auth.py:42"

$ h5i recall objects --status failed --tool mypy   # every mypy failure, ever
$ h5i recall search --fingerprint 0bb827e4e61a     # has this one recurred?
$ h5i recall search --severity error --rule TS2322 # one rule, across every tsc run
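
On the consuming side, an agent loop over the JSON form might look like the sketch below. It assumes only the fields shown earlier; `triage` and the wording of the returned actions are hypothetical, and the JSON is passed in as an already-decoded mapping rather than read from a live h5i call.

```python
def triage(result: dict, seen: set) -> list:
    """Turn one capture into a list of next actions, skipping findings already seen."""
    status = result["status"]
    if status in ("passed", "ok"):
        return []
    if status != "failed":
        # error / unknown: no findings to act on, so fall back to the raw output.
        return [f"inspect raw output {result['raw_oid']}"]
    actions = []
    for f in result.get("findings", []):
        if f["fingerprint"] in seen:
            continue  # same failure as an earlier run, already queued
        seen.add(f["fingerprint"])
        actions.append(f"open {f.get('location', '?')}: {f['message']}")
    return actions


run = {
    "tool": "pytest",
    "status": "failed",
    "exit_code": 1,
    "raw_oid": "sha256:934f…",
    "findings": [{
        "location": "tests/test_auth.py:42",
        "message": "assert 0 == 100",
        "fingerprint": "0bb827e4e61a",
    }],
}

seen = set()
first = triage(run, seen)    # one action for the new failure
second = triage(run, seen)   # nothing new on the repeat run
```

The function never inspects `result["tool"]` to decide how to read the findings, which is the point of a shared schema.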

Honest by construction

A schema that lies is worse than text. Two rules keep it honest.

First, status comes from the exit code, never from scraping words like "passed". The mapping is more careful than exit == 0: a test runner is passed only on a clean zero exit; findings that somehow appear on a zero exit still surface as failed; and a nonzero exit with no parsed findings is error (the tool itself couldn't run) rather than failed (the tool ran and reported problems). That failed-vs-error distinction is exactly what an agent needs to decide between "fix the code" and "fix the invocation".

Second, a parser that can't find its anchors declines to a generic result (status from the exit code, the reduced text in body) rather than inventing structure. Every result labels itself with parser_confidence: parsed when a dedicated adapter matched, heuristic when the shape was inferred, generic when only the exit code is trustworthy. An agent can lower its certainty accordingly instead of trusting every record equally.

The raw bytes are always one h5i recall object away, so the structure is a view, never the only copy.
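
The status rules described in this section can be written down as one small function. This is a sketch under stated assumptions: `derive_status` is a hypothetical name, and two branches are guesses the text does not spell out, namely that a missing exit code maps to unknown and that non-test tools report ok rather than passed on a clean run.

```python
from typing import Optional


def derive_status(kind: str, exit_code: Optional[int], n_findings: int) -> str:
    """Map exit code and finding count to a status; never reads the output text."""
    if exit_code is None:
        return "unknown"   # assumption: no exit code was observed
    if n_findings > 0:
        return "failed"    # the tool ran and reported problems, even on a zero exit
    if exit_code != 0:
        return "error"     # nonzero with nothing parsed: the tool itself couldn't run
    # Clean zero exit, no findings.
    return "passed" if kind == "test" else "ok"   # assumption: ok for non-test kinds
```

Note that the output text is not an argument at all, so a log that says "passed" cannot influence the result.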

Where this sits relative to rtk / headroom

Good: rtk's declarative per-command filters and headroom's log line-folding handle the reduction problem well, and h5i reuses both for its text path (see the NOTICE). A tool with no coded adapter still gets those rtk-derived rules applied, and you can ask for that legacy free-text view directly with --format summary.

Gap: a filtered summary is still text, shaped differently per tool, with nothing to branch on, dedupe, or query. The unified ToolResult / Finding schema (typed, fingerprinted, queryable, honest about confidence, and identical across tools) is the layer h5i adds on top. Reduction makes the output small; the schema makes it actionable.

Coverage

Dedicated parsers (rich findings) ship for pytest, cargo test, go test, tsc, eslint, ruff, and mypy; every other command gets a valid generic result (honest status + reduced body). Each parser carries golden tests so the schema stays faithful, and adding a tool is just another parser feeding the same shape — the agent-facing contract never changes.
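
The "dedicated parser, else generic" dispatch can be illustrated with a toy adapter. Nothing below is h5i code: `parse_pytest`, `PARSERS`, `capture`, and both regular expressions are hypothetical, the anchors are tuned only to the pytest sample shown earlier, and status derivation is left out to keep the sketch short.

```python
import re
from typing import Callable, Dict, List, Optional

# Anchors this toy adapter relies on (taken from the sample pytest output above).
FAILED_LINE = re.compile(r"^FAILED (?P<id>\S+) - (?P<message>.*)$", re.MULTILINE)
TALLY_LINE = re.compile(r"^(?:\d+ \w+, )*\d+ \w+ in [\d.]+s$", re.MULTILINE)


def parse_pytest(text: str) -> Optional[List[dict]]:
    """Return findings, or None to decline when the tally anchor is missing."""
    if not TALLY_LINE.search(text):
        return None
    return [
        {"kind": "test_failure", "severity": "failure",
         "id": m["id"], "message": m["message"]}
        for m in FAILED_LINE.finditer(text)
    ]


PARSERS: Dict[str, Callable[[str], Optional[List[dict]]]] = {"pytest": parse_pytest}


def capture(tool: str, exit_code: int, text: str) -> dict:
    """Use a dedicated parser when one matches; otherwise fall back to a generic result."""
    parser = PARSERS.get(tool)
    findings = parser(text) if parser else None
    if findings is None:
        # No adapter, or the adapter declined: keep the text, claim no structure.
        return {"tool": tool, "kind": "generic", "exit_code": exit_code,
                "parser_confidence": "generic", "findings": [], "body": text.strip()}
    return {"tool": tool, "kind": "test", "exit_code": exit_code,
            "parser_confidence": "parsed", "findings": findings}


SAMPLE = """\
FAILED tests/test_auth.py::test_refresh - assert 0 == 100
1 failed, 120 passed, 1 warning in 4.31s
"""
```

Adding a tool in this sketch means adding one entry to `PARSERS`; the dictionary returned by `capture` keeps the same keys either way.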

FAQ

What is a structured tool output schema for AI agents?
One typed record — h5i's ToolResult, carrying a list of Findings — that every tool is parsed into, so an agent reads one shape instead of re-learning pytest's, then cargo's, then tsc's free-text layout. The envelope carries tool, kind, status, exit_code, counts, parser_confidence and raw_oid; each finding carries a severity, location, message and a stable fingerprint.

How is this different from token-reduction tools like rtk or headroom?
rtk and headroom reduce output to less text, and h5i reuses that text path for tools without a coded adapter. The schema adds a typed, queryable, fingerprinted layer on top: the agent can branch on a status, dedupe by fingerprint, and query across captures. Reduction makes the output small; the schema makes it actionable.

How does h5i avoid inventing structure that isn't in the output?
status is derived from the exit code, not scraped from words like "passed", and a parser that can't find its anchors declines to a generic result — recording parser_confidence: generic — instead of guessing. The full raw bytes are always one h5i recall object away, so the structure is a view, never the only copy.

What does the finding fingerprint enable?
The fingerprint is a hash of tool + rule + digit-normalized location + message, so it stays stable when line numbers shift. That makes "the same failure" recognizable across runs: h5i recall search --fingerprint <fp> answers "has this exact failure happened before?" across every capture, and lets an agent dedupe repeated diagnostics.

Conclusion

Token reduction solves a real problem — it keeps a 40-line pytest dump out of the context window — but it stops at making output smaller. The step that makes reduction actionable is noticing that a failing test, a compile error, and a lint diagnostic are the same shape: a thing, somewhere, with a message and a severity. Collapse them into one typed ToolResult and the agent stops parsing prose and starts branching on a status, jumping to a location, deduping by fingerprint, and querying h5i recall search across every run it has ever done. Build the schema to be honest — status from the exit code, confidence labeled rather than faked, raw bytes always recoverable — and it stays trustworthy even when a tool it has never seen shows up. That typed, fingerprinted, queryable record, identical across tools, is the layer that turns a pile of reduced logs into something an agent can actually act on.

Try h5i on your next AI-assisted branch

Create a sandboxed workspace, capture the run, and post a review-ready PR brief. h5i is open source: the schema, the parsers, and the object store are all in the repo, with no service to subscribe to.
