
vt-c-skill-eval

Run eval test cases against a skill. Supports structural assertions, LLM-as-judge grading, and with/without benchmarking.

Plugin: core-standards
Category: Other
Command: /vt-c-skill-eval


Skill Eval Runner

Run standardized test cases against toolkit skills to measure output quality, trigger accuracy, and uplift over baseline.

Invocation

/vt-c-skill-eval vt-c-spec-from-requirements    # eval one skill
/vt-c-skill-eval --all                           # eval all skills with test suites

Execution

Step 1: Resolve Test Cases

  1. If the argument is a skill name (e.g., vt-c-spec-from-requirements):
     • Glob TOOLKIT_ROOT/tests/evals/{skill-name}/*.yaml
     • If no test cases are found, display an error and stop:

       No eval test cases found for {skill-name}.
       Create test cases in tests/evals/{skill-name}/ (see tests/evals/README.md for format).

  2. If the argument is --all:
     • Glob TOOLKIT_ROOT/tests/evals/*/ (excluding _graders/)
     • For each directory, glob *.yaml files
     • Display: "Found {N} test cases across {M} skills"

  3. Parse each YAML test case file and validate the required fields:
     • skill (string)
     • query (string)
     • expected (object with at least one assertion)

     If validation fails, warn and skip the file.
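A minimal test case file might look like the following. This is an illustrative sketch: the filename is hypothetical, and the grading_prompt, min_grade, and benchmark fields are the optional fields referenced later in this document, not additional requirements.

```yaml
# tests/evals/vt-c-spec-from-requirements/basic-spec.yaml (illustrative path)
skill: vt-c-spec-from-requirements
query: "Turn these requirements into a spec: users can sign up and reset passwords."
expected:
  activates: true
  output_contains:
    - "User Stories"
  output_not_contains:
    - "I can't help"
grading_prompt: "Score 1-10: is the output a complete, well-structured spec?"
min_grade: 7
benchmark: true
```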

Step 2: Load Target Skill Content

For each unique skill referenced in test cases:

  1. Find the skill's SKILL.md: search plugins/*/skills/*/SKILL.md for a file whose frontmatter name: matches the skill name.
  2. If not found: warn "Skill {name} not found in toolkit" and skip its test cases.
  3. Read the full SKILL.md content; it will be injected into the sub-agent prompt for "WITH skill" runs.

Step 3: Execute Test Cases

For each test case:

3a. WITH skill run

Dispatch a context: fork sub-agent with model: haiku:

  • Prompt: {skill SKILL.md content}\n\n---\n\nUser request:\n{test case query}
  • Collect the output text

3b. Structural assertions

Check the output against expected:

  • activates: If true, check that the output is non-trivial (> 100 chars and not a generic "I can't help" response). If false, check the opposite.
  • output_contains: For each entry, check that it appears in the output: try a literal substring match first, then fall back to treating the entry as a regex.
  • output_not_contains: For each entry, check it does NOT appear in the output.

Record: structural_pass: true/false, structural_details: [list of check results]
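The checks above can be sketched as follows. The 100-character threshold and the literal-then-regex fallback follow the description above; the function name and the exact trivial-response heuristic are assumptions.

```python
import re


def check_structural(output: str, expected: dict) -> tuple[bool, list[str]]:
    """Run the structural assertions from a test case's `expected` block."""
    details = []
    if "activates" in expected:
        # Non-trivial output: longer than 100 chars and not a generic refusal.
        nontrivial = len(output) > 100 and "can't help" not in output.lower()
        ok = nontrivial if expected["activates"] else not nontrivial
        details.append(f"activates: {'pass' if ok else 'FAIL'}")
    for needle in expected.get("output_contains", []):
        found = needle in output  # literal match first
        if not found:
            try:
                found = bool(re.search(needle, output))  # then as regex
            except re.error:
                pass  # entry is not a valid regex; literal result stands
        details.append(f"output_contains {needle!r}: {'pass' if found else 'FAIL'}")
    for needle in expected.get("output_not_contains", []):
        absent = needle not in output
        details.append(f"output_not_contains {needle!r}: {'pass' if absent else 'FAIL'}")
    return all("FAIL" not in d for d in details), details
```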

3c. LLM-as-judge grading (if grading_prompt specified)

Dispatch the quality-grader agent (tests/evals/_graders/quality-grader.md) with:

  • The output from step 3a
  • The grading_prompt from the test case

Parse the returned JSON for score and reasoning. Record: judge_score, judge_reasoning, and judge_pass (true when score >= min_grade).

3d. Benchmark run (if benchmark: true)

Dispatch another context: fork sub-agent with model: haiku:

  • Prompt: {test case query} (no skill content; this is the baseline)
  • Run the same structural assertions
  • If grading_prompt is specified: run LLM-as-judge on the baseline output too

Record: baseline_structural_pass, baseline_judge_score

3e. Determine pass/fail

A test case passes if ALL of the following hold:

  • All structural assertions pass (or none are specified)
  • The LLM judge score >= min_grade (or no grading_prompt is specified)

Step 4: Aggregate and Report

Group results by skill. For each skill:

  1. Count: total, passed, failed
  2. Calculate pass rate
  3. If any benchmark runs: calculate uplift delta (with-skill vs without-skill)
  4. List failed test cases with reasons
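The per-skill aggregation can be sketched as follows. This is a minimal sketch: the function name and the shape of the result records are assumptions, though the record keys mirror the values recorded in step 3 and the output keys mirror the .last-run.json format below.

```python
def aggregate(results: list[dict]) -> dict:
    """Summarize one skill's results.

    Each result dict has a `passed` bool and, for benchmarked cases,
    a `baseline_passed` bool from the without-skill run.
    """
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    summary = {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
    benchmarked = [r for r in results if "baseline_passed" in r]
    if benchmarked:
        with_rate = sum(r["passed"] for r in benchmarked) / len(benchmarked)
        without_rate = sum(r["baseline_passed"] for r in benchmarked) / len(benchmarked)
        summary["benchmark"] = {
            "with_skill": round(with_rate, 2),
            "without_skill": round(without_rate, 2),
            "uplift": round(with_rate - without_rate, 2),
        }
    return summary
```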

Display the report (see Report Format below).

Write machine-readable summary to TOOLKIT_ROOT/tests/evals/.last-run.json:

{
  "timestamp": "ISO-8601",
  "skills": {
    "vt-c-spec-from-requirements": {
      "total": 5, "passed": 4, "failed": 1,
      "pass_rate": 0.8,
      "benchmark": {"with_skill": 0.8, "without_skill": 0.4, "uplift": 0.4},
      "failures": [{"case": "edge-case-minimal", "reason": "output_contains: 'User Stories' not found"}]
    }
  },
  "overall": {"total": 15, "passed": 13, "failed": 2, "pass_rate": 0.87}
}

Exit description:

  • If all passed: "All {N} test cases passed."
  • If failures: "FAILURES: {N} of {M} test cases failed. See report above."

Report Format

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill Eval Report — {skill_name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Test Cases: {total}
  Passed: {pass_count} ({pass_pct}%)
  Failed: {fail_count}

Failed Cases:
  {case_name}: {reason}

LLM Judge Scores:
  Average: {avg_score}
  Min: {min_score} ({worst_case_name})

Benchmark (with/without skill):
  Pass rate:  {with_pct}% vs {without_pct}% (Δ {delta}%)
  Avg score:  {with_score} vs {without_score} (Δ {score_delta})

Exit code: {0 if all passed, 1 if failures}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When running --all, display one report block per skill, followed by an overall summary:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Eval Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skills evaluated: {M}
Total test cases: {N}
Overall pass rate: {pct}%

Per-skill results:
  vt-c-spec-from-requirements: 5/5 (100%)
  vt-c-4-review:               4/5 (80%)
  vt-c-2-plan:                 5/5 (100%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━