vt-c-autoresearch-agent¶
Single-metric optimization loop. Iteratively modifies a target file, measures it against a metric, accepts improvements (git commit) or rejects them (reverting the file with git checkout). Supports numeric benchmarks and LLM judge evaluators. Based on Karpathy's autoresearch pattern.
Plugin: core-standards
Category: Other
Command: /vt-c-autoresearch-agent
Autoresearch Optimization Agent¶
Iteratively optimize a single file against a single metric. Each iteration: modify the file, run the metric, compare to the best result so far, accept (commit) or reject (revert). Stops at convergence or budget exhaustion.
When to Use¶
- Optimizing code performance (execution time, memory, bundle size)
- Tuning prompts for effectiveness or clarity
- Improving content quality (readability, persuasiveness)
- Any measurable single-file optimization task
Invocation¶
# Numeric metric (shell command that outputs a number)
/vt-c-autoresearch-agent --target src/search.py --metric "python bench.py" --direction minimize
# LLM judge (subjective quality evaluation)
/vt-c-autoresearch-agent --target prompts/system.md --evaluator llm_judge_prompt --direction maximize
# With custom budget and plateau threshold
/vt-c-autoresearch-agent --target config.yaml --metric "python eval.py" --direction maximize --budget 20 --plateau 5
Prerequisites¶
- Target file must exist and be under git control
- For numeric metrics: the metric command must output a single number to stdout
- Working directory must be clean (no uncommitted changes)
Execution¶
Step 0: Parse and Validate¶
- Parse arguments:
  - `--target` (required): path to the file to optimize
  - `--metric` (required unless `--evaluator` is an LLM judge): shell command that outputs a number
  - `--evaluator` (optional): built-in evaluator name (overrides `--metric`)
  - `--direction` (required): `minimize` or `maximize`
  - `--budget` (optional, default 30): maximum iterations
  - `--plateau` (optional, default 3): consecutive no-improvement iterations before stopping
- Validate:
  - Target file exists: `ls {target}`
  - Git is clean: `git status --porcelain` returns empty
  - For custom metric: run `{metric_command}` once and verify it outputs a parseable number
  - For LLM judge: verify the evaluator name is recognized
- Display config:

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Autoresearch Optimization
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Target:    {file}
  Evaluator: {metric command or judge type}
  Direction: {minimize|maximize}
  Budget:    {N} iterations
  Plateau:   {N} consecutive no-improvement
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Setup and Baseline¶
- Create optimization branch: `git checkout -b optimize/{name}`
- Measure baseline:
  - Run the metric command (or LLM judge) on the current file
  - Store the result as `baseline_metric` and `best_metric`
  - Display: `Baseline metric: {value}`
- Initialize tracking: `iteration = 0`, `plateau_counter = 0`, `history = []`
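A minimal sketch of Step 1, assuming the branch is named `optimize/{name}` as in the final report; the `measure` and `setup_baseline` helpers are hypothetical.

```python
import subprocess


def measure(metric_command: str) -> float:
    """Run the metric command and parse a single number from its stdout."""
    out = subprocess.run(
        metric_command, shell=True, capture_output=True, text=True, timeout=30
    ).stdout.strip()
    return float(out)


def setup_baseline(target: str, metric_command: str, name: str) -> dict:
    # Create the optimization branch (named optimize/{name}, per the report)
    subprocess.run(["git", "checkout", "-b", f"optimize/{name}"], check=True)
    baseline = measure(metric_command)
    print(f"Baseline metric: {baseline}")
    # Tracking state used by the optimization loop
    return {
        "baseline_metric": baseline,
        "best_metric": baseline,
        "best_iteration": 0,
        "iteration": 0,
        "plateau_counter": 0,
        "history": [],
    }
```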
Step 2: Optimization Loop¶
Repeat until budget exhausted or plateau reached:
iteration += 1
# 2a. Determine strategy phase
phase = get_phase(iteration)
#   1-5:   "Focus on quick wins — parameter tuning, obvious improvements"
#   6-15:  "Try alternative approaches — restructure, different algorithms"
#   16-30: "Consider structural changes — architecture shifts"
#   30+:   "Radical experiments — fundamental rethink"
# 2b. Agent proposes ONE change to target file
Read the current target file.
Apply one focused change guided by the phase strategy.
The change should be a single, reviewable diff.
# 2c. Measure result
Run metric command (with 30-second timeout).
Parse numeric output.
# 2d. Compare to best
IF direction == minimize: improved = (result < best_metric)
IF direction == maximize: improved = (result > best_metric)
# 2e. Accept or reject
IF improved:
best_metric = result
best_iteration = iteration
plateau_counter = 0
git add {target} && git commit -m "optimize: iteration {N} — {metric} improved to {value} ({delta})"
Display: "✓ Iteration {N}: {value} (improved by {delta})"
ELSE:
plateau_counter += 1
git checkout -- {target}
Display: "✗ Iteration {N}: {value} (no improvement, reverted)"
# 2f. Log iteration
history.append({
iteration, phase, change_description,
metric_value, delta, decision, plateau_counter
})
# 2g. Check stopping conditions
IF plateau_counter >= plateau_threshold:
Display: "Plateau reached ({plateau} consecutive non-improvements). Stopping."
BREAK
IF iteration >= budget:
Display: "Budget exhausted ({budget} iterations). Stopping."
BREAK
Every 10 iterations, if strategy escalation is active:
- Summarize patterns observed in the history
- Display: "Strategy checkpoint: {observations}"
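The compare/accept/reject core (steps 2d and 2e) can be sketched as follows. The `evaluate_iteration` and `accept_or_reject` names are illustrative; the commit message mirrors the format shown in the loop pseudocode.

```python
import subprocess


def evaluate_iteration(result: float, best: float, direction: str) -> bool:
    """Return True if result improves on best for the given direction."""
    return result < best if direction == "minimize" else result > best


def accept_or_reject(target: str, result: float, state: dict, direction: str) -> str:
    if evaluate_iteration(result, state["best_metric"], direction):
        delta = result - state["best_metric"]
        state["best_metric"] = result
        state["best_iteration"] = state["iteration"]
        state["plateau_counter"] = 0
        # Accept: commit the improved target file
        subprocess.run(["git", "add", target], check=True)
        msg = (f"optimize: iteration {state['iteration']} — "
               f"metric improved to {result} ({delta:+g})")
        subprocess.run(["git", "commit", "-m", msg], check=True)
        return "accepted"
    state["plateau_counter"] += 1
    # Reject: discard the change to the target file only
    subprocess.run(["git", "checkout", "--", target], check=True)
    return "rejected"
```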
Step 3: Generate Report¶
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Optimization Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Target: {file}
Iterations: {used} of {budget}
Stopped by: {plateau | budget | convergence}
Results:
Baseline: {baseline_metric}
Best achieved: {best_metric} (iteration {best_iteration})
Improvement: {delta} ({percent}%)
Accepted changes: {count} of {total iterations}
Top improvements:
1. Iteration {N}: {description} — {delta} improvement
2. Iteration {N}: {description} — {delta} improvement
3. Iteration {N}: {description} — {delta} improvement
Branch: optimize/{name}
Review: git log --oneline optimize/{name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
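The Improvement line in the report follows directly from the baseline and best values. A sketch, assuming improvement is always reported as a positive delta regardless of direction:

```python
def improvement(baseline: float, best: float, direction: str) -> tuple[float, float]:
    """Return (delta, percent), where positive values mean improvement."""
    delta = baseline - best if direction == "minimize" else best - baseline
    percent = 100.0 * delta / abs(baseline) if baseline != 0 else 0.0
    return delta, percent
```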
Built-in Evaluators¶
Numeric Evaluators (no API cost)¶
| Evaluator | Metric Command | Direction |
|---|---|---|
| `benchmark_speed` | `time -p {run_command} 2>&1 \| grep real \| awk '{print $2}'` | minimize |
| `test_pass_rate` | `{test_command} \| grep -oP '\d+ passed' \| awk '{print $1}'` | maximize |
| `file_size` | `wc -c < {target}` | minimize |
| `custom` | User-provided command | User-specified |
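The table above could be resolved as a simple name-to-template lookup. A sketch; the `resolve_evaluator` helper and its parameter names are assumptions, not part of the skill.

```python
# Illustrative mapping of numeric evaluator names to shell command templates.
# Doubled braces survive str.format() as literal awk braces.
NUMERIC_EVALUATORS = {
    "benchmark_speed": (
        "time -p {run_command} 2>&1 | grep real | awk '{{print $2}}'", "minimize"),
    "test_pass_rate": (
        "{test_command} | grep -oP '\\d+ passed' | awk '{{print $1}}'", "maximize"),
    "file_size": ("wc -c < {target}", "minimize"),
}


def resolve_evaluator(name: str, **params: str) -> tuple[str, str]:
    """Expand a built-in evaluator into (shell command, direction)."""
    template, direction = NUMERIC_EVALUATORS[name]
    return template.format(**params), direction
```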
LLM Judge Evaluators (uses session context)¶
| Evaluator | Rubric Focus | Score |
|---|---|---|
| `llm_judge_content` | Readability, clarity, completeness, structure | 0-10 |
| `llm_judge_prompt` | Specificity, actionability, constraint clarity | 0-10 |
| `llm_judge_copy` | Persuasiveness, tone, engagement, conciseness | 0-10 |
LLM judges work by reading the target file and scoring it against a fixed rubric. The rubric is defined at invocation and cannot be modified by the optimization loop.
LLM judge execution:
1. Read the current content of the target file
2. Apply the rubric prompt: "Score this content 0-10 on: {criteria}. Output ONLY the numeric score."
3. Parse the numeric score
4. Return as the metric value
Evaluator Safety¶
The evaluator configuration is locked during the optimization run:
- The metric command or judge rubric is set at Step 0 and cannot change
- If the agent modifies any file used by the evaluator, the skill reports an error and stops
- This prevents the optimizer from gaming its own metric
Strategy Escalation (Experimental)¶
Phase boundaries guide the agent's approach:
| Phase | Iterations | Strategy |
|---|---|---|
| Quick wins | 1-5 | Parameter tuning, obvious improvements, low-risk changes |
| Exploration | 6-15 | Alternative approaches, restructuring, different algorithms |
| Structural | 16-30 | Architecture changes, fundamental redesign of the approach |
| Radical | 30+ | Human checkpoint recommended before continuing |
These are advisory — the agent uses them as guidance, not hard constraints. Phase boundaries are included in the iteration prompt to steer behavior.
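The `get_phase` helper referenced in the loop pseudocode maps iteration numbers to these bands. A direct sketch of the table, treating iteration 30 as the last structural iteration:

```python
def get_phase(iteration: int) -> str:
    """Map an iteration number to its advisory strategy phase."""
    if iteration <= 5:
        return "quick_wins"    # parameter tuning, obvious improvements
    if iteration <= 15:
        return "exploration"   # alternative approaches, restructuring
    if iteration <= 30:
        return "structural"    # architecture changes, redesign
    return "radical"           # human checkpoint recommended
```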
Edge Cases¶
| Scenario | Handling |
|---|---|
| Target file doesn't exist | Error at Step 0, exit |
| Metric command fails | Log error, treat as rejected iteration |
| Metric outputs non-numeric | Log warning, treat as rejected |
| Metric command times out (>30s) | Kill process, treat as rejected |
| Git not clean at start | Error at Step 0, ask user to commit or stash |
| All iterations rejected | Report "no improvement found", baseline is best |
| LLM judge returns non-numeric | Re-prompt once, then treat as rejected |
Integration Points¶
| Skill | Relationship |
|---|---|
| `/vt-c-ralph-wiggum-loop` | Use after optimization to verify tests still pass |
| `/vt-c-quality-metrics` | Track optimization improvements over time |
| `/vt-c-verification-before-completion` | Verify optimization claims with evidence |
Anti-patterns¶
- Do NOT optimize multiple files simultaneously — keep scope to one file
- Do NOT let the agent modify the evaluator — metric gaming defeats the purpose
- Do NOT run without a budget — unbounded optimization wastes tokens
- Do NOT optimize broken code — fix bugs first, optimize after
- Do NOT use for multi-objective optimization — one metric, one direction