What HDVO Forces You to Notice: A Skill-Creator Built in One Iteration

In a previous post, I described HDVO™ as a framework for addressing pattern retrieval — the failure mode where an AI generates responses from training-distribution shapes rather than from the specific artifact in front of it. The mechanism was abstract: anchoring, construction, crystallization. Live implementations like SMap and Groundwater Management showed the framework at work on systems the author had built. A reasonable question is whether the framework adds value when applied cold to a different domain — one where established tools already exist and the model has plenty of training-distribution patterns to retrieve from.

This post is a concrete demonstration in such a domain.

The Test Case

A skill-creator is a meta-artifact: a document that teaches a downstream AI agent how to author skill documents — purpose, triggers, process, examples, edge cases — given a prompt of the form “create a skill for X.” Three established skill-creators are in active use:

  • Codex $skill-creator — an interactive command that guides skill creation by asking what the skill does, when it should trigger, and whether it should be instruction-only or include scripts; skills can be packaged as plugins for distribution. (Verified against Codex docs.)
  • Antigravity (Google) — provides a skills system within its IDE; specific scaffolding behavior and first-party convention details were not independently verifiable at time of writing.
  • Claude Code’s skill creator — eval-loop-first; subagents run test cases in parallel (one with the skill, one as baseline), a grader agent benchmarks outputs, and run_loop.py iterates the skill’s description on a 60/40 train/held-out split with a blind A/B comparator agent. (Verified against the public anthropics/skills repository.)

Each is the product of a full development cycle by its respective team. Each ships with platform integrations the author of this post does not have.

The test: start from a deliberately impoverished seed for general_skill.md (four sentences, no taxonomy, no quality criteria, no risk model), run one HDVO iteration, and compare the result against the three established tools on shared structural dimensions.

The Seed and Iteration Setup

The seed:

A skill is a self-contained document that helps an AI agent perform a class of tasks consistently. When asked to create a skill for some domain X: write a clear purpose, outline the main steps, include at least one example, keep it concise.

The iteration loop:

  • 10 test prompts spanning archetype diversity — inspection, transformation, generation, diagnostic, synthesis, interactive, and validation skills. Examples: “Create a skill for clinical chart review summarizing into SOAP notes,” “Create a skill for analyzing Kubernetes pod crash loops,” “Create a skill for renaming JavaScript variables consistently.”
  • Sub-agent simulation (analytical, not executed in iteration 0): in a real run, each test prompt would be given to a fresh-session agent whose only context is the prompt and the current general_skill.md, and the agent would produce a skill document for the requested X. For iteration 0, sub-agent outputs were projected analytically — predicted from what a fresh agent would most plausibly produce given only the minimal seed — rather than executed against a live model. This limitation is acknowledged again in the Boundary Conditions section.
  • Rubric scoring: each output is scored on a 0–10 divergence scale where 0 = perfect match against the rubric target and 10 = completely off-target. Lower is better. Axes: completeness, specificity, robustness, internal consistency, prompt alignment. (A sketch of this scoring follows the list.)
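
Here is a minimal sketch of how the loop and scoring could be wired up. It assumes per-prompt divergence is the mean of the five axis scores (the aggregation rule is an assumption, not something the experiment specifies), and RubricScore, run_subagent, and grade are illustrative names rather than the actual harness:

    from dataclasses import dataclass
    from statistics import mean

    # Divergence axes; each is scored 0 (on-target) to 10 (completely off-target).
    AXES = ["completeness", "specificity", "robustness",
            "internal_consistency", "prompt_alignment"]

    @dataclass
    class RubricScore:
        prompt: str
        axis_scores: dict  # axis name -> 0..10 divergence

        @property
        def divergence(self) -> float:
            # One plausible aggregation: average the five axes per prompt.
            return mean(self.axis_scores[a] for a in AXES)

    def run_iteration(test_prompts, general_skill_md, run_subagent, grade):
        """One pass: each prompt goes to a fresh-session sub-agent whose only
        context is the prompt plus the current general_skill.md; the returned
        skill document is scored against the rubric."""
        scores = []
        for prompt in test_prompts:
            skill_doc = run_subagent(prompt, general_skill_md)   # fresh session per prompt
            scores.append(RubricScore(prompt, grade(skill_doc, prompt)))
        cumulative = sum(s.divergence for s in scores)           # 10 prompts -> ceiling of 100
        return scores, cumulative

With ten prompts the cumulative ceiling is 100 divergence points, which is the scale the projected baseline below refers to.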

Projected cumulative baseline with the seed (estimated analytically, not from actual sub-agent runs): ~55.5 out of 100 divergence points. The seed produces generic skeletons (purpose + 3 steps + 1 example) regardless of domain. Projected per-prompt scores: simple domains (rename variables, write commit messages) ~3.0–4.5; complex domains (clinical, Kubernetes diagnostic, compromised email recovery) ~6.5–8.0. Lower is better. The specific numbers are estimates and should not be read as measurements; the substantive claim is the pattern — simple domains stay tractable with a minimal seed, while complex domains require structural support the seed does not provide.

Iteration 0: From Seed to v1

What the Mismatches Forced into View

The mismatches weren’t randomly distributed. Lined up against each other, they clustered on seven structural absences in the seed:

  1. No archetype classification. Every skill got the same skeleton. But a diagnostic skill needs a decision tree over evidence, not a checklist. A transformation skill needs an output spec with idempotency. The seed couldn’t differentiate, so the sub-agent flattened everything to “purpose + steps.”
  2. No risk profile. A clinical skill touching PHI is structurally different from a typo-review skill. The seed treated them identically, so the clinical output had no anti-hallucination requirement and no PHI handling.
  3. No anti-examples. The seed asked for “at least one example.” The sub-agent always returned the cheapest version — one positive case, never a negative one. No anti-trigger was ever surfaced.
  4. No archetype-specific edge case catalog. Each archetype has predictable failure modes — cascade failures in interactive workflows, ambiguous mappings in transformations, transient symptoms in diagnostics. The seed had no mechanism to trigger them.
  5. No output specification. Outputs were free-form, which made rubric scoring noisy and left downstream consumers (the intended end-users of the generated skill) with no way to know what to expect.
  6. No cross-link constraints. Triggers didn’t have to match inputs. Process didn’t have to produce the declared output. The sub-agent could write internally inconsistent skills and pass the seed’s checks.
  7. No concrete prototype example in the seed itself. With nothing to ground against, the sub-agent drifted toward its default skill-skeleton — which was the cheapest pattern in training distribution.

None of these seven were preconceived. They were the residue when patterns of mismatch were lined up against each other. This is what the mismatch mechanism does: it makes structural absences visible. The previous post called the equivalent step “anchoring against L2 data.” In this domain, L2 data is the sub-agent’s behavioral output against the rubric.
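
To make "lined up against each other" concrete, a small illustrative sketch (the mismatch tags below are hypothetical stand-ins, not the actual annotations from the experiment): tags that recur across unrelated domains are what surface as structural absences in the seed, rather than quirks of any single prompt.

    from collections import Counter

    # Hypothetical mismatch tags attached to a few of the graded outputs.
    mismatch_tags = {
        "clinical SOAP notes":  ["no_risk_profile", "no_anti_example", "flat_skeleton"],
        "k8s crash loops":      ["flat_skeleton", "no_edge_cases", "no_output_spec"],
        "rename JS variables":  ["no_anti_example"],
        # ... remaining prompts omitted
    }

    # Counting recurrence across prompts is the "lining up" step.
    for tag, count in Counter(t for tags in mismatch_tags.values() for t in tags).most_common():
        print(f"{tag}: recurs in {count} of the scored prompts")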

A Sidebar on Overfitting

Why didn’t the iteration just memorize the 10 test prompts? In this experiment, abstract error codes did the work.

Abstract error codes are signals the practitioner injects to indicate failure type without revealing ground truth. In the gradient-flow domain these were numerical (-17, -18, -22, etc., each meaning a different kind of generalization failure). In the skill-creator domain they map cleanly to: “your document is overfit to a specific archetype,” “your examples aren’t prototypes,” “your output has hallucination — add a self-check criterion.” The codes don’t tell the model what to write; they tell it what it’s failing at, structurally.
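
As a sketch of what that injection could look like (the assignment of specific numeric codes to specific failure types below is an assumption made for the example; the post only states that each code marks a different kind of generalization failure):

    # Illustrative mapping from abstract error codes to structural failure types.
    # The numbers follow the gradient-flow convention mentioned above; which code
    # maps to which failure is an assumption made for this example.
    ERROR_CODES = {
        -17: "overfit_to_archetype",        # document hard-codes one archetype's shape
        -18: "examples_not_prototypes",     # examples are instances, not generalizable prototypes
        -22: "hallucination_no_selfcheck",  # asserts content without a self-check criterion
    }

    def inject_feedback(code: int) -> str:
        """Tell the model what it is failing at, structurally -- never what to write."""
        return f"Structural failure: {ERROR_CODES[code]}. Revise without fitting any single test prompt."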

If the iteration had responded to a high-mismatch test prompt by hard-coding clauses that fit that prompt, the abstract codes would have caught it. This is what stops one iteration from becoming overfitted to ten test prompts. The seven structural features that emerged are not the union of the ten prompts’ rubrics — they are the minimal taxonomy that closes the gaps the rubrics revealed.

The error codes are one instantiation of a broader principle the framework requires: the practitioner needs some mechanism to push back on overfitting without revealing the target. Held-out validation sets, diversity sampling, and theory-coverage tests can serve the same role in other configurations. The codes are how this experiment did it; they are not the only way the framework can do it.

What Emerged

After one iteration, general_skill.md (v1) contained:

  • 7 archetypes (A1 inspection, A2 transformation, A3 generation, A4 diagnostic, A5 synthesis, A6 interactive, A7 validation), each with archetype-specific structural mandates. A4 must include a decision tree. A6 must include per-step verification and failure-path branching. A3 and A5 must include anti-hallucination self-checks.
  • 3 risk levels (R0 passive, R1 local-write, R2 external/irreversible). R2 requires confirmation, dry-run, or human-in-the-loop in the process.
  • 9 mandatory sections in any generated skill document: purpose, triggers + anti-triggers, inputs, process, output spec, examples (one positive + one anti), edge cases, self-check, risk profile.
  • 5 cross-link constraints the sub-agent must verify before returning a skill: triggers ↔︎ inputs, process ↔︎ output spec, examples ↔︎ edge cases, risk profile ↔︎ self-check, archetype ↔︎ process.
  • A grounded prototype example inside the meta-document itself — a typo-review skill exercising A1 and A2 jointly — so the sub-agent has structural reference without domain leakage.

None of this was prescribed. It was the minimal structure needed to close the seven gaps the mismatches surfaced.
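
A compact way to read the v1 structure is as data plus a pre-return check. The sketch below is a paraphrase of what general_skill.md expresses in prose; field and function names are illustrative, and the skill document is modeled as a plain dict for brevity.

    ARCHETYPES = {"A1": "inspection", "A2": "transformation", "A3": "generation",
                  "A4": "diagnostic", "A5": "synthesis", "A6": "interactive",
                  "A7": "validation"}
    RISK_LEVELS = {"R0": "passive", "R1": "local-write", "R2": "external/irreversible"}
    REQUIRED_SECTIONS = ["purpose", "triggers_and_anti_triggers", "inputs", "process",
                         "output_spec", "examples", "edge_cases", "self_check", "risk_profile"]
    CROSS_LINKS = [("triggers_and_anti_triggers", "inputs"), ("process", "output_spec"),
                   ("examples", "edge_cases"), ("risk_profile", "self_check"),
                   ("archetype", "process")]

    def structural_check(skill: dict) -> list:
        """Return structural violations; an empty list means the sub-agent may return the skill."""
        problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in skill]
        if skill.get("archetype") not in ARCHETYPES:
            problems.append("archetype missing or not one of A1-A7")
        # R2 skills must bake a safeguard into the process (process is free text in this sketch).
        if skill.get("risk_level") == "R2" and not any(
                kw in skill.get("process", "") for kw in ("confirmation", "dry-run", "human-in-the-loop")):
            problems.append("R2 skill lacks confirmation, dry-run, or human-in-the-loop step")
        # Cross-link constraints: both sides must exist before their consistency can be judged
        # (the consistency judgment itself is left to the sub-agent or a grader).
        for a, b in CROSS_LINKS:
            if a not in skill or b not in skill:
                problems.append(f"cross-link {a} <-> {b} cannot be verified")
        return problems

A skill that passes structural_check still says nothing about whether it works in use; that behavioral question is taken up in the Boundary Conditions section.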

Comparing Against the Established Tools

After v1 was produced, I had an independent agent compare it against the three established skill-creators on shared dimensions. The full comparison report is detailed elsewhere; the scorecard:

Dimension                   Codex     Antigravity   Claude Code   general_skill.md
Behavioral testing          —         —             ✓✓            —
Structural validation       partial   ?             partial       ✓✓
Risk/safety modeling        —         —             —             ✓✓
Anti-trigger explicitness   —         —             —             ✓✓
Archetype taxonomy          —         —             —             ✓✓
Distribution/packaging      ✓✓        ?             ✓✓            —

Note on the scorecard: cells marked ? indicate features described in the secondary comparison report but not independently verified against the tool’s first-party documentation at time of writing; — marks a dimension not surfaced as a first-class feature in the tool’s public documentation. The “Workflow capture” row from the original comparison was dropped because its supporting evidence could not be confirmed against first-party docs. The ✓✓, partial, and — entries for Codex and Claude Code are anchored to verifiable artifacts (Codex $skill-creator docs, anthropics/skills/skill-creator/).

general_skill.md is unique on four dimensions: structural validation (cross-link checks), risk/safety modeling, anti-trigger explicitness, and archetype taxonomy. None of these were in the seed. The three established tools either don’t have these as first-class features in their public documentation, or address them implicitly through conventions that the downstream agent may or may not honor.

The other tools are stronger on dimensions HDVO didn’t touch this iteration: distribution packaging (Codex via plugins; Claude Code via the public anthropics/skills repository and bundling support) and behavioral testing (Claude Code’s grader agent, comparator agent, and run_loop.py train/test split). The point of the comparison is not that HDVO produced something globally better — it’s that one iteration of HDVO produced structural features that the established tools either include only partially or do not surface in their docs.

Why HDVO Surfaced These, and the Other Tools Didn’t

The three established skill-creators are the product of teams iterating from a starting position of “what does a skill-creator look like?” That position retrieves shapes — folder conventions, naming rules, scaffold expectations, format-level validators. These shapes are visible because they’re already present in skill-creators across the ecosystem. The retrieval works.

But “what does a skill-creator look like?” doesn’t naturally surface “what fails when we run it on a hard case?” Surface form doesn’t predict structural completeness.

HDVO starts from the opposite end. The seed is deliberately incomplete. The iteration loop’s only job is to identify what makes hard cases fail. The result is not a better-looking skill-creator — it’s one whose structure is forced into completeness by the actual failure modes of the cases tested against.

This matches the previous post’s claim: pattern retrieval is structural, not a model-quality issue, and it doesn’t go away with better prompts. A skill-creator built by retrieving “what a skill-creator looks like” will match the surface form of the existing skill-creators. A skill-creator built by HDVO mismatch analysis will surface structural features the surface form didn’t include — because the mismatch signal is orthogonal to surface form. It tracks what’s missing, not what’s present.

Boundary Conditions

This is iteration 0, and the sub-agent runs in this experiment were simulated. Real iterations need actual fresh-session sub-agents producing actual skill documents that are then scored against the rubric. The seven structural features that emerged here are a hypothesis to be tested in subsequent iterations, not a finished product.

HDVO is also not behavioral testing. Claude Code’s eval loop — grader agents, held-out test sets, blind A/B comparison — is doing something genuinely different and genuinely valuable. general_skill.md doesn’t replace it. The right combination is structural completeness from HDVO (what should the skill document contain?) and behavioral validation from a Claude Code-style loop (does the skill document actually work when used?). One tells you the content is structurally sound; the other tells you it produces the right behavior.
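
One way that combination could be chained, purely as a sketch: neither general_skill.md nor the Claude Code eval loop exposes this API, build_skill and behavioral_eval and the report fields are hypothetical, and structural_check is the sketch from the earlier section.

    def build_skill(prompt, general_skill_md, run_subagent, behavioral_eval):
        skill = run_subagent(prompt, general_skill_md)
        # Structural gate first: cheap, catches missing sections and unverifiable cross-links.
        violations = structural_check(skill)
        if violations:
            return {"status": "structurally incomplete", "violations": violations}
        # Behavioral gate second: expensive, runs graded test cases with and without the skill.
        report = behavioral_eval(skill)
        status = "ok" if report["passes"] else "behaviorally weak"
        return {"status": status, "report": report}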

The reference architecture presented here is the publicly disclosed framework. Production implementations may differ in specific orchestration, tooling choices, and the exact mechanisms used to surface mismatches and prevent overfitting. This case study illustrates one configuration of HDVO, not the canonical or only one.

Reference Materials

The reference architecture and the generated general_skill.md discussed here are available in the public repository at github.com/thienannguyen-cv/SMap/tools/testing/hdvo under the terms in LICENSE — academic and research use is freely permitted, including citation, non-commercial replication of experiments, and discussion in educational contexts.

Takeaway

The demonstration is not “HDVO produced something better.” It’s that HDVO produced something with structural features that the established alternatives lack — in one iteration, from a four-sentence seed.

The cost: a rubric signal across diverse test cases, abstract error codes to prevent the iteration from collapsing into memorization, and the discipline to let mismatches dictate structure rather than retrieve it. The payoff: structural completeness that doesn’t depend on the model already knowing what a “good” skill-creator looks like.

The broader claim is the same one the previous post ended on. Pattern retrieval doesn’t go away with bigger models or better prompts, because the model’s training distribution doesn’t tell it which structural elements are missing from your specific problem. Something else has to. HDVO is one shape that something else can take. Mismatch is its primary signal. Structural completeness is what the signal builds.

You might find it hard to believe in the capabilities of a data scientist/analyst without any domain knowledge, but the true goal of data science is to create expertise where none exists. In fact, the places where data science thrives the most should be those untouched by prior domain knowledge. This is very different from building a chatbot based on an enormous collection of conversations, gathered from previous interactions between human agents and customers. So, does it mean that simply having a massive dataset that covers every aspect of life is the answer? Like ChatGPT-5, 6, 7, or even ∞? The answer is, unfortunately, NO! Absolutely not! An Example Imagine you’re skeptical. Copy the section above and ask ChatGPT to complete it with the prompt, “Complete the following Medium story.” It becomes clear that if you’re under the impression that “in the future, ChatGPT will have enough domain knowledge to replace data scientists/analysts in business analysis,” the story generated by Chat...