
vt-c-autoresearch-agent

Single-metric optimization loop. Iteratively modifies a target file, measures against a metric, accepts improvements (git commit) or rejects (git revert). Supports numeric benchmarks and LLM judge evaluators. Based on Karpathy's autoresearch pattern.

Plugin: core-standards
Category: Other
Command: /vt-c-autoresearch-agent


Autoresearch Optimization Agent

Iteratively optimize a single file against a single metric. Each iteration: modify the file, run the metric, compare to the best result so far, accept (commit) or reject (revert). Stops at convergence or budget exhaustion.

When to Use

  • Optimizing code performance (execution time, memory, bundle size)
  • Tuning prompts for effectiveness or clarity
  • Improving content quality (readability, persuasiveness)
  • Any measurable single-file optimization task

Invocation

# Numeric metric (shell command that outputs a number)
/vt-c-autoresearch-agent --target src/search.py --metric "python bench.py" --direction minimize

# LLM judge (subjective quality evaluation)
/vt-c-autoresearch-agent --target prompts/system.md --evaluator llm_judge_prompt --direction maximize

# With custom budget and plateau threshold
/vt-c-autoresearch-agent --target config.yaml --metric "python eval.py" --direction maximize --budget 20 --plateau 5

Prerequisites

  • Target file must exist and be under git control
  • For numeric metrics: the metric command must output a single number to stdout
  • Working directory must be clean (no uncommitted changes)
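As a sketch of the prerequisite, a numeric metric command can be any script that prints exactly one number to stdout and nothing else. A hypothetical bench.py (the workload and file name are illustrative, not part of the skill):

```python
# bench.py — hypothetical metric script: prints one number (seconds) to stdout.
# The optimizer only sees this single value, so print nothing else.
import time

def run_workload():
    # Stand-in for the real code path under optimization.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

start = time.perf_counter()
run_workload()
elapsed = time.perf_counter() - start
print(f"{elapsed:.6f}")  # single parseable number, no extra text
```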

Execution

Step 0: Parse and Validate

  1. Parse arguments:

  • --target (required): path to the file to optimize
  • --metric (required unless --evaluator is given): shell command that outputs a number
  • --evaluator (optional): built-in evaluator name (overrides --metric)
  • --direction (required): minimize or maximize
  • --budget (optional, default 30): maximum iterations
  • --plateau (optional, default 3): consecutive no-improvement iterations before stopping

  2. Validate:

  • Target file exists: ls {target}
  • Git is clean: git status --porcelain returns empty
  • For a custom metric: run {metric_command} once and verify it outputs a parseable number
  • For an LLM judge: verify the evaluator name is recognized

  3. Display config:

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Autoresearch Optimization
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Target:    {file}
    Evaluator: {metric command or judge type}
    Direction: {minimize|maximize}
    Budget:    {N} iterations
    Plateau:   {N} consecutive no-improvement
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    

Step 1: Setup and Baseline

  1. Create optimization branch:

    git checkout -b optimize/{target-filename}-{YYYYMMDD-HHMM}
    

  2. Measure baseline:

  3. Run the metric command (or LLM judge) on the current file
  4. Store as baseline_metric and best_metric
  5. Display: Baseline metric: {value}

  6. Initialize tracking:

    iteration: 0
    baseline: {value}
    best: {value}
    best_iteration: 0
    plateau_counter: 0
    history: []
    
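The tracking state above maps naturally onto a small dataclass. A minimal sketch (field names follow the pseudocode; `OptState` and `init_state` are illustrative, not defined by the skill):

```python
# Sketch of the optimizer's tracking state, mirroring the Step 1 fields.
from dataclasses import dataclass, field

@dataclass
class OptState:
    baseline: float
    best: float
    iteration: int = 0
    best_iteration: int = 0
    plateau_counter: int = 0
    history: list = field(default_factory=list)

def init_state(baseline_metric: float) -> OptState:
    # The baseline doubles as the initial best result.
    return OptState(baseline=baseline_metric, best=baseline_metric)
```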

Step 2: Optimization Loop

Repeat until budget exhausted or plateau reached:

iteration += 1

# 2a. Determine strategy phase
phase = get_phase(iteration)
  1-5:   "Focus on quick wins — parameter tuning, obvious improvements"
  6-15:  "Try alternative approaches — restructure, different algorithms"
  16-30: "Consider structural changes — architecture shifts"
  31+:   "Radical experiments — fundamental rethink"

# 2b. Agent proposes ONE change to target file
Read the current target file.
Apply one focused change guided by the phase strategy.
The change should be a single, reviewable diff.

# 2c. Measure result
Run metric command (with 30-second timeout).
Parse numeric output.

# 2d. Compare to best
IF direction == minimize: improved = (result < best_metric)
IF direction == maximize: improved = (result > best_metric)

# 2e. Accept or reject
IF improved:
  best_metric = result
  best_iteration = iteration
  plateau_counter = 0
  git add {target} && git commit -m "optimize: iteration {N} — {metric} improved to {value} ({delta})"
  Display: "✓ Iteration {N}: {value} (improved by {delta})"
ELSE:
  plateau_counter += 1
  git checkout -- {target}
  Display: "✗ Iteration {N}: {value} (no improvement, reverted)"

# 2f. Log iteration
history.append({
  iteration, phase, change_description,
  metric_value, delta, decision, plateau_counter
})

# 2g. Check stopping conditions
IF plateau_counter >= plateau_threshold:
  Display: "Plateau reached ({plateau} consecutive non-improvements). Stopping."
  BREAK

IF iteration >= budget:
  Display: "Budget exhausted ({budget} iterations). Stopping."
  BREAK

Every 10 iterations, if strategy escalation is active:

  • Summarize patterns observed in the history
  • Display: "Strategy checkpoint: {observations}"
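The accept/reject core of steps 2b-2g can be sketched as follows. Here `mutate` and `measure` are stand-in callables (assumptions for illustration): in the real skill, `mutate` is the agent editing the target file and `measure` shells out to the metric command.

```python
# Minimal accept/reject loop sketch. `mutate` applies one change and returns
# a revert handle (standing in for git revert); `measure` runs the metric.
def optimize(measure, mutate, direction, budget=30, plateau=3):
    best = measure()                      # baseline doubles as initial best
    plateau_counter = 0
    better = (lambda a, b: a < b) if direction == "minimize" else (lambda a, b: a > b)
    for iteration in range(1, budget + 1):
        revert = mutate(iteration)        # one focused change
        result = measure()
        if better(result, best):          # accept: keep the change (commit)
            best = result
            plateau_counter = 0
        else:                             # reject: undo the change (revert)
            revert()
            plateau_counter += 1
        if plateau_counter >= plateau:    # stop on plateau
            break
    return best
```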

Step 3: Generate Report

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Optimization Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Target:     {file}
Iterations: {used} of {budget}
Stopped by: {plateau | budget | convergence}

Results:
  Baseline:      {baseline_metric}
  Best achieved: {best_metric} (iteration {best_iteration})
  Improvement:   {delta} ({percent}%)

Accepted changes: {count} of {total iterations}

Top improvements:
  1. Iteration {N}: {description} — {delta} improvement
  2. Iteration {N}: {description} — {delta} improvement
  3. Iteration {N}: {description} — {delta} improvement

Branch: optimize/{name}
Review: git log --oneline optimize/{name}

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Built-in Evaluators

Numeric Evaluators (no API cost)

| Evaluator | Metric Command | Direction |
|---|---|---|
| benchmark_speed | time -p {run_command} 2>&1 \| grep real \| awk '{print $2}' | minimize |
| test_pass_rate | {test_command} \| grep -oP '\d+ passed' \| awk '{print $1}' | maximize |
| file_size | wc -c < {target} | minimize |
| custom | User-provided command | User-specified |

LLM Judge Evaluators (uses session context)

| Evaluator | Rubric Focus | Score |
|---|---|---|
| llm_judge_content | Readability, clarity, completeness, structure | 0-10 |
| llm_judge_prompt | Specificity, actionability, constraint clarity | 0-10 |
| llm_judge_copy | Persuasiveness, tone, engagement, conciseness | 0-10 |

LLM judges work by reading the target file and scoring it against a fixed rubric. The rubric is defined at invocation and cannot be modified by the optimization loop.

LLM judge execution:

  1. Read the current content of the target file
  2. Apply the rubric prompt: "Score this content 0-10 on: {criteria}. Output ONLY the numeric score."
  3. Parse the numeric score
  4. Return it as the metric value
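The parsing step, including the re-prompt fallback from the edge-case table, can be sketched as below. `ask_judge` is a stand-in for whatever sends the rubric prompt and returns raw model text; it is not part of the skill's API.

```python
# Sketch of LLM-judge score extraction with one re-prompt on failure.
import re

def parse_score(raw: str):
    """Extract a 0-10 numeric score from raw judge output, or None."""
    m = re.search(r"\b(10(?:\.0+)?|\d(?:\.\d+)?)\b", raw)
    return float(m.group(1)) if m else None

def judge_metric(ask_judge, rubric: str, content: str):
    prompt = (f"Score this content 0-10 on: {rubric}. "
              f"Output ONLY the numeric score.\n\n{content}")
    score = parse_score(ask_judge(prompt))
    if score is None:                # re-prompt once, then give up
        score = parse_score(ask_judge(prompt))
    return score                     # None => treat the iteration as rejected
```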

Evaluator Safety

The evaluator configuration is locked during the optimization run:

  • The metric command or judge rubric is set at Step 0 and cannot change
  • If the agent modifies any file used by the evaluator, the skill reports an error and stops
  • This prevents the optimizer from gaming its own metric
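One way to enforce this lock (a sketch, not the skill's actual mechanism): fingerprint every evaluator file at Step 0 and abort if any fingerprint changes between iterations.

```python
# Tamper check: hash evaluator files at Step 0, re-check each iteration.
import hashlib
from pathlib import Path

def fingerprint(paths):
    """Map each path to the SHA-256 of its current contents."""
    return {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}

def check_untampered(baseline: dict) -> bool:
    """True iff every evaluator file still matches its Step 0 hash."""
    return fingerprint(baseline.keys()) == baseline
```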

Strategy Escalation (Experimental)

Phase boundaries guide the agent's approach:

| Phase | Iterations | Strategy |
|---|---|---|
| Quick wins | 1-5 | Parameter tuning, obvious improvements, low-risk changes |
| Exploration | 6-15 | Alternative approaches, restructuring, different algorithms |
| Structural | 16-30 | Architecture changes, fundamental redesign of the approach |
| Radical | 31+ | Human checkpoint recommended before continuing |

These are advisory — the agent uses them as guidance, not hard constraints. Phase boundaries are included in the iteration prompt to steer behavior.
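The phase lookup used in step 2a is a simple threshold function; a sketch matching the table above:

```python
# Advisory phase lookup used to steer each iteration's strategy.
def get_phase(iteration: int) -> str:
    if iteration <= 5:
        return "Quick wins"
    if iteration <= 15:
        return "Exploration"
    if iteration <= 30:
        return "Structural"
    return "Radical"
```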

Edge Cases

| Scenario | Handling |
|---|---|
| Target file doesn't exist | Error at Step 0, exit |
| Metric command fails | Log error, treat as rejected iteration |
| Metric outputs non-numeric | Log warning, treat as rejected |
| Metric command times out (>30s) | Kill process, treat as rejected |
| Git not clean at start | Error at Step 0, ask user to commit or stash |
| All iterations rejected | Report "no improvement found", baseline is best |
| LLM judge returns non-numeric | Re-prompt once, then treat as rejected |
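The three metric-command failure rows above collapse into one guarded runner. A sketch (the function name is illustrative): timeout, non-zero exit, and non-numeric output all yield None, which the loop treats as a rejected iteration.

```python
# Guarded metric run matching the edge-case table.
import subprocess

def run_metric(cmd: str, timeout: float = 30.0):
    try:
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None                      # timed out: process killed, rejected
    if proc.returncode != 0:
        return None                      # command failed: rejected
    try:
        return float(proc.stdout.strip())
    except ValueError:
        return None                      # non-numeric output: rejected
```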

Integration Points

| Skill | Relationship |
|---|---|
| /vt-c-ralph-wiggum-loop | Use after optimization to verify tests still pass |
| /vt-c-quality-metrics | Track optimization improvements over time |
| /vt-c-verification-before-completion | Verify optimization claims with evidence |

Anti-patterns

  • Do NOT optimize multiple files simultaneously — keep scope to one file
  • Do NOT let the agent modify the evaluator — metric gaming defeats the purpose
  • Do NOT run without a budget — unbounded optimization wastes tokens
  • Do NOT optimize broken code — fix bugs first, optimize after
  • Do NOT use for multi-objective optimization — one metric, one direction