Deep dive · 2026-06-17

How to measure prompt quality offline — the h5i Prompt Maturity Score

Grade the ask, not the answer. h5i already records the prompt behind every AI commit; this is the logic that turns those prompts into a single, explainable 0–100 prompt-quality score — built from seven classical-NLP signals, hardened with anti-gaming guards, and computed fully offline with no LLM call. Here is exactly how the number is made, and why every design choice is a defence against the obvious ways to fake it.

By Koukyosyumei · Reading time: 11 min · Tags: Prompt Engineering · Classical NLP · Provenance

A pull request from an AI-assisted branch tells you what was asked. Open h5i recall blame or read the PR body and you can see the prompt that triggered each commit. What it doesn't tell you is how well it was asked. Two engineers ship the same feature; one steered the agent with "fix the off-by-one in parse_range() in src/util.rs, add a test for the empty-range case, don't touch the public signature," the other typed "make it work." Same diff, wildly different craft — and nothing in the review surfaces the difference.

The Prompt Maturity Score is how h5i measures prompt quality and makes that craft a visible, trackable signal. It scores the input — the engineer's ask — not the model's output. And it does so with a hard constraint: no LLM, no network, fully deterministic, so you can score prompt quality inside a git hook, in CI, or in a PR render without an API key.

Why not just ask a model?

The obvious implementation is "send the prompt to a judge model and ask it to grade." We rejected that on purpose. Prompt-eval frameworks like PromptBench (arXiv 2312.07910), APE (arXiv 2211.01910), and Promptfoo already do something adjacent — but they score a prompt by running a model and judging its output. They need API access, a task dataset, and they cost money and latency on every evaluation. They answer "did the model do well?"

Prompt maturity answers a different question, "did the engineer ask well?", purely from text features. That buys three things a model judge can't: it's free (runs on every commit with zero marginal cost), deterministic (the same prompt always scores the same, so it's reproducible in CI and won't drift when a model is deprecated), and explainable (every point is traceable to a feature you can name). The two approaches are complementary, not competing.

The load-bearing caveat: readability is not maturity

The first instinct when you hear "score the writing quality of a prompt" is to reach for readability indices — Flesch Reading Ease, Flesch–Kincaid grade level, Gunning Fog. They're well-studied, classical, and cheap. They're also a trap.

A terse, precise technical ask is an excellent prompt that scores badly on raw reading-ease. "Fix the off-by-one in parse_range() in src/util.rs, add a test" is dense with file paths and identifiers — exactly the tokens that wreck a syllable counter and tank a reading-ease score. If readability drove the number, the best prompts would be punished for being concrete. The literature on these indices is explicit about this: they are length-sensitive and penalise terse technical text.

So readability is deliberately one of the smallest-weighted signals (8%, with only length adequacy lower). It is applied as a band (both extremes penalised, not "higher is better"), and it is computed on a code-masked copy of the prompt: every path, func() call, URL, and snake_case identifier is replaced with a neutral token before a single syllable is counted. Treating readability as a bounded sub-signal rather than the score is the single most important guard against the metric punishing good engineering.
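The masking step can be sketched in a few lines. This is a minimal illustration, not h5i's actual implementation: the regular expressions, the `mask_code` name, and the neutral token `thing` are all assumptions chosen to cover the four cases named above.

```python
import re

# Hypothetical patterns: the real h5i masking rules are not shown in this article.
# Order matters: URLs first, so their path segments are not masked piecemeal.
_MASK_PATTERNS = [
    re.compile(r"https?://\S+"),                  # URLs
    re.compile(r"\b[\w.-]+(?:/[\w.-]+)+\b"),      # paths such as src/util.rs
    re.compile(r"\b\w+\(\)"),                     # calls such as parse_range()
    re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b"),  # snake_case identifiers
]


def mask_code(prompt: str, token: str = "thing") -> str:
    """Replace code-like spans with a neutral word before readability is computed."""
    masked = prompt
    for pattern in _MASK_PATTERNS:
        masked = pattern.sub(token, masked)
    return masked


print(mask_code("Fix the off-by-one in parse_range() in src/util.rs, add a test"))
```

A syllable counter run on the masked copy sees ordinary words where the identifiers used to be, so a concrete prompt is no longer penalised for naming real code.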

The composite: seven signals, fixed weights

The score is a weighted sum of seven sub-signals, each normalised to 0.0–1.0. The weights were locked deliberately and sum to exactly 1.0 (the code asserts it in a test). The ranking is the whole philosophy in one table: what a reviewer actually wants to see ranks highest; the signal most likely to mislead, readability, ranks near the bottom.

Signal	Weight	What it captures
Specificity	24%	Concreteness — code refs, identifiers, quoted symbols, numbers — minus a vagueness penalty for weak words.
Control	24%	Did the engineer bound the agent? Constraints, output shape, acceptance criteria, edge cases, scope, safety.
Context	18%	Background, the goal / why, the current state, grounding in real repo entities.
Structure	10%	Decomposition — bullets, numbered steps, headings, code fences, multi-sentence shape.
Diversity	10%	Lexical richness (adaptive MATTR) — non-repetitive, not phrase-farmed.
Clarity	8%	Readability inside a target band (trapezoid — both extremes penalised).
Adequacy	6%	Length in a sweet spot — not one word, not a 1,200-word wall.

Specificity and control dominate at 24% each because they are what a reviewer actually wants to confirm: the engineer was concrete, and they bounded the agent. Clarity is one of the quietest voices at 8% (only length adequacy weighs less) precisely because it's the signal most likely to mislead on technical text.
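As a sketch, the composite before any guards is just the table above applied as a dot product. The weights come from the table; the dictionary keys and the `weighted_sum` name are illustrative assumptions, and the signal values in the example are made up.

```python
# Weights from the table above; the key names are illustrative.
WEIGHTS = {
    "specificity": 0.24,
    "control": 0.24,
    "context": 0.18,
    "structure": 0.10,
    "diversity": 0.10,
    "clarity": 0.08,
    "adequacy": 0.06,
}

# Mirrors the invariant the article says the code asserts in a test.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def weighted_sum(signals: dict) -> float:
    """Combine seven 0-1 sub-signals into a raw 0-100 score (before guards)."""
    return 100.0 * sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Hypothetical prompt: perfect on every signal except specificity.
example = {name: 1.0 for name in WEIGHTS}
example["specificity"] = 0.5
print(round(weighted_sum(example), 1))
```

Halving a 24% signal costs 12 points, while halving clarity would cost only 4. That asymmetry is the ranking in practice.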

How each signal is computed

Every signal is built from counts of features in the prompt text, then squashed to 0–1. A few are worth seeing concretely:

Specificity = concreteness − vagueness. Concreteness is a capped blend of code references (40%), action verbs (22%), quoted symbols (20%), and numbers (18%). Vagueness comes from the NALABS / Femmer "requirements smells" lexicon — words like appropriate, robust, somehow, as needed — whose density per 100 words drags the signal down.
Control rewards breadth across categories: constraints (must, only, without), verification (test, assert, acceptance), output shape (json, signature, schema), edge cases, scope, and safety. Hitting many categories beats hammering one.
Diversity uses an adaptive MATTR (Moving-Average Type-Token Ratio) with a window that scales to prompt length. Short prompts fall back to a confidence-weighted type-token ratio pulled toward a neutral 0.5 — a five-word prompt has no room to demonstrate vocabulary, so it's neither rewarded nor punished for it.
Clarity maps Flesch–Kincaid grade and Flesch reading-ease through a trapezoid: full credit for clear technical English (FK grade ≈ 7–13), tapering for both a childishly-simple ask and a tangled run-on, and a neutral 0.6 for prompts too short to estimate reliably.
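The diversity and clarity signals can be sketched as follows. Only the MATTR idea, the neutral 0.5 fallback, the FK grade 7–13 plateau, and the neutral 0.6 come from the description above. The window bounds, the `min_tokens` and `min_words` thresholds, and the trapezoid shoulders (3 and 18) are assumptions, and this sketch uses FK grade alone rather than blending in reading-ease.

```python
def mattr(tokens, window):
    """Moving-Average Type-Token Ratio: mean of unique/window over every sliding window."""
    spans = range(len(tokens) - window + 1)
    ratios = [len(set(tokens[i:i + window])) / window for i in spans]
    return sum(ratios) / len(ratios)


def diversity(tokens, min_tokens=20):
    """Adaptive MATTR, with a confidence-weighted TTR for short prompts (thresholds assumed)."""
    n = len(tokens)
    if n == 0:
        return 0.5
    if n < min_tokens:
        ttr = len(set(tokens)) / n
        confidence = n / min_tokens  # short prompts get pulled toward the neutral 0.5
        return confidence * ttr + (1.0 - confidence) * 0.5
    window = max(10, min(50, n // 2))  # window scales with prompt length
    return mattr(tokens, window)


def trapezoid(x, lo_zero, lo_full, hi_full, hi_zero):
    """0 outside [lo_zero, hi_zero], 1 on [lo_full, hi_full], linear in between."""
    if x <= lo_zero or x >= hi_zero:
        return 0.0
    if x < lo_full:
        return (x - lo_zero) / (lo_full - lo_zero)
    if x <= hi_full:
        return 1.0
    return (hi_zero - x) / (hi_zero - hi_full)


def clarity(fk_grade, word_count, min_words=12):
    """Full credit for FK grade 7-13; a neutral 0.6 when the prompt is too short to estimate."""
    if word_count < min_words:
        return 0.6
    return trapezoid(fk_grade, 3.0, 7.0, 13.0, 18.0)


print(diversity("fix the parser bug now".split()))
print(clarity(fk_grade=10.0, word_count=30))
```

With these assumed thresholds, five distinct words land at 0.625 rather than 1.0: the fallback keeps a tiny prompt from earning a perfect vocabulary score it had no room to demonstrate.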

The cap is the trick. Every category feeds the score through cap_ratio(count, cap), a linear ramp that saturates at a small cap (often 3 or 4). With a cap of 3, the fourth must in a prompt adds nothing. No single lexicon can farm a whole signal by repetition; you have to actually cover different ground.
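Here is a minimal sketch of how a capped ramp shapes specificity and control. The blend percentages and the example lexicon words come from the descriptions above. Everything else is an assumption for illustration: the cap values, the vagueness penalty slope, averaging control across categories, and the trimmed three-category control lexicon.

```python
def cap_ratio(count, cap):
    """Linear ramp from 0 to 1 that saturates once count reaches cap."""
    return min(count, cap) / cap


def concreteness(code_refs, action_verbs, quoted, numbers, cap=3):
    """Capped blend using the article's percentages; the cap of 3 is assumed."""
    return (0.40 * cap_ratio(code_refs, cap)
            + 0.22 * cap_ratio(action_verbs, cap)
            + 0.20 * cap_ratio(quoted, cap)
            + 0.18 * cap_ratio(numbers, cap))


# Tiny excerpt of a vagueness lexicon (single words only, for brevity).
VAGUE_WORDS = {"appropriate", "robust", "somehow"}


def specificity(concrete, words, penalty_per_point=0.05):
    """Concreteness minus a penalty on vague words per 100 words (slope assumed)."""
    density = 100.0 * sum(w in VAGUE_WORDS for w in words) / len(words) if words else 0.0
    return max(0.0, min(1.0, concrete - penalty_per_point * density))


# Three of the six control categories, using the example words from the text.
CONTROL_LEXICON = {
    "constraints": {"must", "only", "without"},
    "verification": {"test", "assert", "acceptance"},
    "output_shape": {"json", "signature", "schema"},
}


def control(words, cap=2):
    """Mean of capped per-category hits, so breadth beats repetition (aggregation assumed)."""
    per_category = [cap_ratio(sum(w in lexicon for w in words), cap)
                    for lexicon in CONTROL_LEXICON.values()]
    return sum(per_category) / len(per_category)


print(control(["must"] * 4))             # one category, hammered
print(control(["must", "test", "json"])) # three categories, one hit each
```

Under these assumptions, four repetitions of must score lower than three different words spread across three categories, which is the behaviour the cap is meant to enforce.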

Anti-gaming: the guards that sit on top

A naïve weighted sum of keyword counts is trivially gameable — paste the lexicons into your prompt and score 100. The interesting engineering is the layer that makes that not work. Four guards sit on top of the weighted sum, in order:

Repetition penalty. The fraction of bigrams that are exact repeats becomes a multiplier (floored at 0.6). "must test format must test format" gets multiplied down — phrase-farming is detected and punished directly.
Balance gates. You cannot look mature on one axis alone. Control is a hard gate: a prompt that sets no constraints or acceptance criteria is capped below "advanced" (≤ 69) no matter how concrete it is. Low specificity additionally caps at 79 — you can't be "exemplary" while vague.
Hard length caps. Under 8 words caps at 20; under 15 words caps at 45. A 1,200-word unstructured wall caps at 75 — that's a dump, not a mature prompt. Keyword density can't buy its way past a length floor.
The final clamp to 0–100.
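The repetition penalty is the easiest of the four guards to sketch. The 0.6 floor is from the description above; the exact mapping from repeat fraction to multiplier (here simply one minus the fraction) is an assumption, as is the function name.

```python
def repetition_multiplier(words, floor=0.6):
    """Turn the share of repeated bigrams into a score multiplier, floored at 0.6."""
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 1.0
    repeated = len(bigrams) - len(set(bigrams))  # occurrences beyond the first
    return max(floor, 1.0 - repeated / len(bigrams))


print(repetition_multiplier("must test format must test format".split()))
print(repetition_multiplier("fix the parser and add a test".split()))
```

In the phrase-farmed example, two of the five bigrams are repeats, so the multiplier drops to the floor, while the ordinary sentence passes through untouched at 1.0.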

The balance gates have a subtlety worth calling out, because it's where the metric earns its keep. An earlier version hard-capped any prompt with weak context. But a prompt like "run cargo test, fix the clippy warning in src/foo.rs, don't change the signature" is a perfectly legitimate tactical ask — the agent already holds the repo context, so demanding a "why" paragraph would be punishing good practice. So context became a soft gate: a prompt that is already both specific (≥ 0.6) and bounded (control ≥ 0.5) is exempt from the context cap. Weak context still surfaces as a diagnostic flag — it just no longer drags a crisp tactical prompt into mediocrity.
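Put together, the guard layer might look like the sketch below. The caps (69, 79, 20, 45, 75) and the exemption thresholds (specificity ≥ 0.6, control ≥ 0.5) are from the text. The thresholds for "low control", "low specificity", and "weak context", the value of the context cap, and the reading of "1,200-word wall" as 1,200 words or more are all assumptions, exposed as parameters so they are easy to spot.

```python
def length_cap(word_count, structured):
    """Hard length caps: very short prompts and unstructured walls cannot score high."""
    if word_count < 8:
        return 20.0
    if word_count < 15:
        return 45.0
    if word_count >= 1200 and not structured:
        return 75.0
    return 100.0


def balance_cap(specificity, control, context,
                low_control=0.2, low_specificity=0.4,
                weak_context=0.3, context_cap=69.0):
    """Balance gates; every keyword threshold here is a hypothetical value."""
    cap = 100.0
    if control < low_control:          # hard gate: no bounds, no "advanced"
        cap = min(cap, 69.0)
    if specificity < low_specificity:  # vague prompts cannot be "exemplary"
        cap = min(cap, 79.0)
    # Soft gate: a crisp, bounded tactical prompt is exempt from the context cap.
    exempt = specificity >= 0.6 and control >= 0.5
    if context < weak_context and not exempt:
        cap = min(cap, context_cap)
    return cap


def apply_guards(raw, repetition, specificity, control, context, word_count, structured):
    """Apply the four guards in order: repetition, balance, length, clamp."""
    score = raw * repetition
    score = min(score, balance_cap(specificity, control, context))
    score = min(score, length_cap(word_count, structured))
    return max(0.0, min(100.0, score))


# A crisp tactical prompt with almost no context keeps its raw score.
print(apply_guards(85.0, 1.0, specificity=0.8, control=0.6, context=0.1,
                   word_count=20, structured=False))
```

Because every guard is a cap or a multiplier of at most 1, the layer can only lower the weighted sum, never raise it.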

From one prompt to a whole branch

A PR has many commits. The branch score is a length-weighted mean of the per-prompt scores (weight = clamp(words, 20, 250)), and prompts are never concatenated. That last detail matters: if you glued every prompt into one blob, a pile of weak one-liners would pool their vocabulary and structure and read as mature. Scoring each prompt independently and taking a weighted mean means one rambling prompt can't dominate, and a disciplined engineer is rewarded for every crisp ask.

The roll-up also tracks coverage — how many AI commits actually carried a prompt to score. If fewer than 80% did, the result is flagged low-confidence, because a score built from a minority of commits shouldn't be read as a verdict on the branch.
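The roll-up can be sketched as one small function. The weight clamp and the 80% coverage threshold come from the text; the function name, the tuple layout, and the sample scores are assumptions for illustration.

```python
def branch_score(prompts, ai_commit_count):
    """Roll per-prompt scores up to a branch score.

    prompts: list of (score, word_count) pairs, one per AI commit that carried a prompt.
    Returns (length-weighted mean, coverage, low_confidence).
    """
    if not prompts or ai_commit_count <= 0:
        raise ValueError("need at least one scored prompt and one AI commit")
    # weight = clamp(words, 20, 250): one-liners still count, walls cannot dominate.
    weights = [max(20, min(250, words)) for _, words in prompts]
    total = sum(score * weight for (score, _), weight in zip(prompts, weights))
    mean = total / sum(weights)
    coverage = len(prompts) / ai_commit_count
    return mean, coverage, coverage < 0.8


# Hypothetical branch: one solid 100-word prompt, one weak 5-word prompt.
mean, coverage, low_confidence = branch_score([(80.0, 100), (40.0, 5)], ai_commit_count=2)
print(round(mean, 1), coverage, low_confidence)
```

Each prompt is scored on its own before it reaches this function, so the weak one-liner drags the mean down in proportion to its clamped weight instead of borrowing vocabulary from its neighbour.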

What it looks like in a PR

The score renders right under the hero of an h5i share pr post body — a screenshot-clean headline with the per-signal detail one click away:

PR body

> [!NOTE]
> 🌳 Prompt maturity: 81/100 · advanced · 7 prompts scored (100% of AI commits)
> 🔧 Recurring weak spots: weak context.
> Heuristic signal of prompt craft — not a developer rating.

▼ 📊 heuristic breakdown
  Specificity                 ████████░░  0.82
  Control / acceptance        ████████░░  0.79
  Context grounding           ████░░░░░░  0.41
  Structure                   ███████░░░  0.68
  Lexical diversity           ███████░░░  0.71
  Clarity (readability band)  ████████░░  0.80
  Length adequacy             ██████████  1.00

Note the disclaimer baked into the render: "Heuristic signal of prompt craft — not a developer rating." That line is not decoration. The score is an offline proxy over prompt text — it is not a performance metric, not a leaderboard rank, and the diagnostic flags are deliberately descriptive ("weak context", "no acceptance criteria"), never prescriptive. We don't hand engineers a keyword list to stuff, because the moment the metric becomes a target to optimise, it stops measuring anything real.

Honesty about what this is

The weights and gate thresholds are normative — hand-tuned and locked through review, not learned from data. They're an explainable proxy, not a validated model, and the code says so in a standing TODO. The path forward is deliberately conservative: the per-prompt features are already exposed, so a future version can join them against h5i's existing per-commit outcome signals — test pass/fail, review-flag scores, diff churn, later reverts — over a real commit corpus, and learn the weights while keeping every feature explainable. What it must not do is regress from a tiny biased sample, where a senior engineer writes both good prompts and good code and the metric just rediscovers seniority.

Until then, the value isn't precision to the decimal — it's making prompt craft visible. A reviewer who sees "context grounding 0.41" across a branch learns something true and actionable about how the agent was steered, with no model call and no guesswork.

From git blame to AI blame

The score grades the prompt; AI blame attaches that prompt — plus model, agent, and test result — to every line it produced.

