vt-c-autoresearch-agent¶
Single-metric optimization loop. Iteratively modifies a target file, measures it against a metric, accepts improvements (git commit) or rejects them (reverting the file with git checkout). Supports numeric benchmarks and LLM judge evaluators. Based on Karpathy's autoresearch pattern.
Plugin: core-standards
Category: Other
Command: /vt-c-autoresearch-agent
Autoresearch Optimization Agent¶
Iteratively optimize a single file against a single metric. Each iteration: modify the file, run the metric, compare to the best result so far, accept (commit) or reject (revert). Stops at convergence or budget exhaustion.
When to Use¶
- Optimizing code performance (execution time, memory, bundle size)
- Tuning prompts for effectiveness or clarity
- Improving content quality (readability, persuasiveness)
- Any measurable single-file optimization task
Invocation¶
# Numeric metric (shell command that outputs a number)
/vt-c-autoresearch-agent --target src/search.py --metric "python bench.py" --direction minimize
# LLM judge (subjective quality evaluation)
/vt-c-autoresearch-agent --target prompts/system.md --evaluator llm_judge_prompt --direction maximize
# With custom budget and plateau threshold
/vt-c-autoresearch-agent --target config.yaml --metric "python eval.py" --direction maximize --budget 20 --plateau 5
Prerequisites¶
- Target file must exist and be under git control
- For numeric metrics: the metric command must output a single number to stdout
- Working directory must be clean (no uncommitted changes)
Execution¶
Step 0: Parse and Validate¶
- Parse arguments:
  - `--target` (required): path to the file to optimize
  - `--metric` (required unless `--evaluator` is an LLM judge): shell command that outputs a number
  - `--evaluator` (optional): built-in evaluator name (overrides `--metric`)
  - `--direction` (required): `minimize` or `maximize`
  - `--budget` (optional, default 30): maximum iterations
  - `--plateau` (optional, default 3): consecutive no-improvement iterations before stopping
- Validate:
  - Target file exists: `ls {target}`
  - Git is clean: `git status --porcelain` returns empty
  - For custom metric: run `{metric_command}` once and verify it outputs a parseable number
  - For LLM judge: verify the evaluator name is recognized
- Display config:

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Autoresearch Optimization
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Target:    {file}
  Evaluator: {metric command or judge type}
  Direction: {minimize|maximize}
  Budget:    {N} iterations
  Plateau:   {N} consecutive no-improvement
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Setup and Baseline¶
- Create optimization branch: `git checkout -b optimize/{name}`
- Measure baseline:
  - Run the metric command (or LLM judge) on the current file
  - Store the result as `baseline_metric` and `best_metric`
  - Display: `Baseline metric: {value}`
- Initialize tracking: `iteration = 0`, `plateau_counter = 0`, `history = []`
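A minimal sketch of Step 1, assuming the branch is named `optimize/{name}` as in the final report; the `measure` and `setup_baseline` helpers are hypothetical.

```python
import subprocess


def measure(metric_command: str) -> float:
    """Run the metric command and parse a single number from its stdout."""
    out = subprocess.run(
        metric_command, shell=True, capture_output=True, text=True, timeout=30
    ).stdout.strip()
    return float(out)


def setup_baseline(target: str, metric_command: str, name: str) -> dict:
    # Create the optimization branch (named optimize/{name}, per the report)
    subprocess.run(["git", "checkout", "-b", f"optimize/{name}"], check=True)
    baseline = measure(metric_command)
    print(f"Baseline metric: {baseline}")
    # Tracking state used by the optimization loop
    return {
        "baseline_metric": baseline,
        "best_metric": baseline,
        "best_iteration": 0,
        "iteration": 0,
        "plateau_counter": 0,
        "history": [],
    }
```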
Step 2: Optimization Loop¶
Repeat until budget exhausted or plateau reached:
iteration += 1
# 2a. Determine strategy phase
phase = get_phase(iteration)
#   1-5:   "Focus on quick wins — parameter tuning, obvious improvements"
#   6-15:  "Try alternative approaches — restructure, different algorithms"
#   16-30: "Consider structural changes — architecture shifts"
#   30+:   "Radical experiments — fundamental rethink"
# 2b. Agent proposes ONE change to target file
Read the current target file.
Apply one focused change guided by the phase strategy.
The change should be a single, reviewable diff.
# 2c. Measure result
Run metric command (with 30-second timeout).
Parse numeric output.
# 2d. Compare to best
IF direction == minimize: improved = (result < best_metric)
IF direction == maximize: improved = (result > best_metric)
# 2e. Accept or reject
IF improved:
best_metric = result
best_iteration = iteration
plateau_counter = 0
git add {target} && git commit -m "optimize: iteration {N} — {metric} improved to {value} ({delta})"
Display: "✓ Iteration {N}: {value} (improved by {delta})"
ELSE:
plateau_counter += 1
git checkout -- {target}
Display: "✗ Iteration {N}: {value} (no improvement, reverted)"
# 2f. Log iteration
history.append({
iteration, phase, change_description,
metric_value, delta, decision, plateau_counter
})
# 2g. Check stopping conditions
IF plateau_counter >= plateau_threshold:
Display: "Plateau reached ({plateau} consecutive non-improvements). Stopping."
BREAK
IF iteration >= budget:
Display: "Budget exhausted ({budget} iterations). Stopping."
BREAK
Every 10 iterations, if strategy escalation is active:
- Summarize patterns observed in the history
- Display: "Strategy checkpoint: {observations}"
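The compare/accept/reject core (steps 2d and 2e) can be sketched as follows. The `evaluate_iteration` and `accept_or_reject` names are illustrative; the commit message mirrors the format shown in the loop pseudocode.

```python
import subprocess


def evaluate_iteration(result: float, best: float, direction: str) -> bool:
    """Return True if result improves on best for the given direction."""
    return result < best if direction == "minimize" else result > best


def accept_or_reject(target: str, result: float, state: dict, direction: str) -> str:
    if evaluate_iteration(result, state["best_metric"], direction):
        delta = result - state["best_metric"]
        state["best_metric"] = result
        state["best_iteration"] = state["iteration"]
        state["plateau_counter"] = 0
        # Accept: commit the improved target file
        subprocess.run(["git", "add", target], check=True)
        msg = (f"optimize: iteration {state['iteration']} — "
               f"metric improved to {result} ({delta:+g})")
        subprocess.run(["git", "commit", "-m", msg], check=True)
        return "accepted"
    state["plateau_counter"] += 1
    # Reject: discard the change to the target file only
    subprocess.run(["git", "checkout", "--", target], check=True)
    return "rejected"
```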
Step 3: Generate Report¶
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Optimization Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Target: {file}
Iterations: {used} of {budget}
Stopped by: {plateau | budget | convergence}
Results:
Baseline: {baseline_metric}
Best achieved: {best_metric} (iteration {best_iteration})
Improvement: {delta} ({percent}%)
Accepted changes: {count} of {total iterations}
Top improvements:
1. Iteration {N}: {description} — {delta} improvement
2. Iteration {N}: {description} — {delta} improvement
3. Iteration {N}: {description} — {delta} improvement
Branch: optimize/{name}
Review: git log --oneline optimize/{name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
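The Improvement line in the report follows directly from the baseline and best values. A sketch, assuming improvement is always reported as a positive delta regardless of direction:

```python
def improvement(baseline: float, best: float, direction: str) -> tuple[float, float]:
    """Return (delta, percent), where positive values mean improvement."""
    delta = baseline - best if direction == "minimize" else best - baseline
    percent = 100.0 * delta / abs(baseline) if baseline != 0 else 0.0
    return delta, percent
```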
Built-in Evaluators¶
Numeric Evaluators (no API cost)¶
| Evaluator | Metric Command | Direction |
|---|---|---|
| `benchmark_speed` | `time -p {run_command} 2>&1 \| grep real \| awk '{print $2}'` | minimize |
| `test_pass_rate` | `{test_command} \| grep -oP '\d+ passed' \| awk '{print $1}'` | maximize |
| `file_size` | `wc -c < {target}` | minimize |
| `custom` | User-provided command | User-specified |
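The table above could be resolved as a simple name-to-template lookup. A sketch; the `resolve_evaluator` helper and its parameter names are assumptions, not part of the skill.

```python
# Illustrative mapping of numeric evaluator names to shell command templates.
# Doubled braces survive str.format() as literal awk braces.
NUMERIC_EVALUATORS = {
    "benchmark_speed": (
        "time -p {run_command} 2>&1 | grep real | awk '{{print $2}}'", "minimize"),
    "test_pass_rate": (
        "{test_command} | grep -oP '\\d+ passed' | awk '{{print $1}}'", "maximize"),
    "file_size": ("wc -c < {target}", "minimize"),
}


def resolve_evaluator(name: str, **params: str) -> tuple[str, str]:
    """Expand a built-in evaluator into (shell command, direction)."""
    template, direction = NUMERIC_EVALUATORS[name]
    return template.format(**params), direction
```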
LLM Judge Evaluators (uses session context)¶
| Evaluator | Rubric Focus | Score |
|---|---|---|
| `llm_judge_content` | Readability, clarity, completeness, structure | 0-10 |
| `llm_judge_prompt` | Specificity, actionability, constraint clarity | 0-10 |
| `llm_judge_copy` | Persuasiveness, tone, engagement, conciseness | 0-10 |
LLM judges work by reading the target file and scoring it against a fixed rubric. The rubric is defined at invocation and cannot be modified by the optimization loop.
LLM judge execution:
1. Read the current content of the target file
2. Apply the rubric prompt: "Score this content 0-10 on: {criteria}. Output ONLY the numeric score."
3. Parse the numeric score
4. Return as the metric value
Evaluator Safety¶
The evaluator configuration is locked during the optimization run:
- The metric command or judge rubric is set at Step 0 and cannot change
- If the agent modifies any file used by the evaluator, the skill reports an error and stops
- This prevents the optimizer from gaming its own metric
Strategy Escalation (Experimental)¶
Phase boundaries guide the agent's approach:
| Phase | Iterations | Strategy |
|---|---|---|
| Quick wins | 1-5 | Parameter tuning, obvious improvements, low-risk changes |
| Exploration | 6-15 | Alternative approaches, restructuring, different algorithms |
| Structural | 16-30 | Architecture changes, fundamental redesign of the approach |
| Radical | 30+ | Human checkpoint recommended before continuing |
These are advisory — the agent uses them as guidance, not hard constraints. Phase boundaries are included in the iteration prompt to steer behavior.
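The `get_phase` helper referenced in the loop pseudocode maps iteration numbers to these bands. A direct sketch of the table, treating iteration 30 as the last structural iteration:

```python
def get_phase(iteration: int) -> str:
    """Map an iteration number to its advisory strategy phase."""
    if iteration <= 5:
        return "quick_wins"    # parameter tuning, obvious improvements
    if iteration <= 15:
        return "exploration"   # alternative approaches, restructuring
    if iteration <= 30:
        return "structural"    # architecture changes, redesign
    return "radical"           # human checkpoint recommended
```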
Edge Cases¶
| Scenario | Handling |
|---|---|
| Target file doesn't exist | Error at Step 0, exit |
| Metric command fails | Log error, treat as rejected iteration |
| Metric outputs non-numeric | Log warning, treat as rejected |
| Metric command times out (>30s) | Kill process, treat as rejected |
| Git not clean at start | Error at Step 0, ask user to commit or stash |
| All iterations rejected | Report "no improvement found", baseline is best |
| LLM judge returns non-numeric | Re-prompt once, then treat as rejected |
Integration Points¶
| Skill | Relationship |
|---|---|
| `/vt-c-ralph-wiggum-loop` | Use after optimization to verify tests still pass |
| `/vt-c-quality-metrics` | Track optimization improvements over time |
| `/vt-c-verification-before-completion` | Verify optimization claims with evidence |
Anti-patterns¶
- Do NOT optimize multiple files simultaneously — keep scope to one file
- Do NOT let the agent modify the evaluator — metric gaming defeats the purpose
- Do NOT run without a budget — unbounded optimization wastes tokens
- Do NOT optimize broken code — fix bugs first, optimize after
- Do NOT use for multi-objective optimization — one metric, one direction