# xdpos-autoresearch

A Karpathy-style auto-research harness for closing GP5 ↔ XDC 2.6.8 consensus parity. Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch) — same minimal-harness philosophy: one `program.md` (skill), one fixed metric (parity score), one tight iteration loop.

## What this is

Automated, unattended consensus parity work. You point a Claude/Codex agent at this directory, it reads `program.md`, runs `./measure.sh` to compute the current parity score, picks the lowest-scoring check, edits the patient code to address it, re-measures, keeps the change if score improved, and repeats. Wake up to a log of attempts and (hopefully) a higher score.

```
1. Read program.md          (the lightweight skill)
2. Run ./measure.sh         (parity score, 0-100, fast — pure grep/structural checks)
3. Pick lowest-scoring check
4. Edit patient code (consensus/XDPoS/, eth/, miner/, params/)
5. Re-measure
6. Keep / discard / log
7. Repeat
```

By design, the inner loop is **fast** (each iteration ≤ 1 minute) so you get ~50 iterations per hour. The slow correctness gate (Stage 5 mining shadow + Stage 6 Apothem canary) lives in `validate.sh` and is run only after `measure.sh` is high.

## Files

| File | Edited by | Purpose |
|---|---|---|
| `program.md` | human | Agent instructions. The "lightweight skill". Edit this when you want to redirect the loop. |
| `measure.sh` | human | Runs all checks, prints JSON score. Edit to add/remove/reweight checks. |
| `checks/*.sh` | human | One bash script per finding. Each declares `# WEIGHT=N` and exits 0 (pass) or non-zero (fail). |
| `target.md` | human | Allowlist of files the agent may modify. |
| `progress.md` | both | Running state. Agent appends entries; human can reset. |
| `log/` | agent | Per-iteration history (diff, before/after metric, verdict). |
| `validate.sh` | human | Heavyweight gate — Stage 5/6 from VALIDATION_PLAYBOOK_v2.md. |
| `runner.sh` | human | Optional cron-friendly wrapper around `claude -p` for overnight runs. |

## Quick start

```bash
cd /Users/anilchinchawale/github/XDCNetwork/XDC-Geth/xdpos-autoresearch

# Confirm patient + reference paths are right
PATIENT=/Users/anilchinchawale/github/XDCNetwork/XDC-Geth \
REFERENCE=/Users/anilchinchawale/github/XDCNetwork/XDPoSChain \
./measure.sh | tee log/baseline.json

# Spin up Claude/Codex (manually, like Karpathy's flow)
claude -p --model claude-opus-4-7 < program.md

# Or for unattended overnight runs:
ITERATIONS=30 ./runner.sh
```

The agent will read `program.md`, spot the lowest-scoring check, propose a fix, run `measure.sh` again, and either commit or discard.

## Current baseline

The first `./measure.sh` produces:

```
score=35/115 (30.4%)
pass=4 fail=15
```

What's already passing: genesis hash (10), C1.1 walk-back (10), C12 singleflight (10), Author dispatch (5).
What's failing: every other mining-side dispatch, every BFT message handler, the sealing trigger pathway, and one read-side soundness gap (bulk-sync mismatch fallback at `verifyHeader.go:152-162`).

See `progress.md` for the prioritized backlog.

## Design choices (matched to Karpathy's autoresearch)

- **Single source of truth for instructions.** `program.md` is the only thing the agent reads to know *what to do*. It's intentionally short and lives in this directory, not in some `~/.claude/` skill cache.
- **Fixed metric.** `measure.sh` returns one number. Iterations either improve it or don't. No subjective "looks better" — every change is graded by the same script.
- **Fast iterations.** No chain replay in the inner loop. All checks are `grep` / `awk` / file-existence. ≤ 10 seconds end-to-end so you can iterate liberally.
- **Self-contained.** Everything to run the loop is in this directory. No external deps beyond `bash`, `awk`, `grep`, `jq` (for runner.sh), and the `claude` CLI for unattended runs.
- **Reviewable diffs.** Per `program.md`, one concern per iteration, surgical changes, explicit log entries. Reviewing 30 iterations should be 30 small PRs not one mega-PR.
- **Separation of structural and correctness gates.** `measure.sh` is structural (cheap). `validate.sh` is correctness (expensive). Both are needed before declaring done.

## Differences from karpathy/autoresearch

Karpathy's setup is for ML training: agent edits `train.py`, runs a 5-min training, gets a `val_bpb` number. The structural similarity is intentional, but our loop differs in three ways:

1. **Multi-file target.** The patient codebase is too large to fit in a single `train.py`. `target.md` enumerates the files the agent may touch. The agent can edit any of them, but it must not touch off-limits files (V1 path, RLP layout, the framework itself).
2. **Multiple discrete checks instead of a single regression metric.** ML training has a continuous metric (loss). Consensus parity is a list of binary structural alignments. Each check is one weighted bit; total score is the sum.
3. **Read-only reference repo.** Karpathy's agent has the whole world to optimize; ours has a fixed reference (XDPoSChain v2.6.8) it must align to. Most "what does this function do" questions are answered by reading the reference, not by guessing.

The high-level shape — minimal harness, fast metric, unattended loop, append-only log — is the same.

## Adding a check

When you find a new divergence not yet covered:

```bash
# Pick a number in the right band (01-09 sync ops, 10-19 mining dispatch, 20-29 BFT, 30 trigger, 40+ hygiene)
cat > checks/42_my_new_check.sh <<'SCRIPT'
#!/usr/bin/env bash
# WEIGHT=N
# CHECK: short description + reference to audit finding
# REF: file.go:line in reference
set -e
target="$PATIENT/path/to/file.go"
[[ -f "$target" ]] || exit 1
# Pass condition: grep / awk / file structure check
grep -qE 'expected pattern' "$target"
SCRIPT
chmod +x checks/42_my_new_check.sh
./measure.sh   # confirm the check fires correctly on current state
```

Try not to make checks brittle (matching exact whitespace, etc.). Aim for the structural intent, not the textual form.

## Removing dead-end iterations

If the agent gets stuck retrying the same failing approach, edit `progress.md`'s "Open dead-ends" section to record the failure mode. The agent reads this at the start of each iteration and avoids repeating.

## Stopping conditions

The loop stops when any of:

- `measure.sh` returns score >= 100 (the structural goal).
- `runner.sh` has hit its `ITERATIONS` budget (default 30).
- A check breaks compilation and the agent can't recover (rare; revert and continue manually).

Score 100 unlocks `./validate.sh`, which is where structural parity gets confirmed by actually running the patient against real chain data.

## Companion documents

- `/Users/anilchinchawale/github/XDCNetwork/XDC-Geth/CONSENSUS_PARITY_AUDIT_v3.md` — the audit that defines what each check is testing.
- `/Users/anilchinchawale/github/XDCNetwork/XDC-Geth/VALIDATION_PLAYBOOK_v2.md` — Stage 5/6 commands invoked by `validate.sh`.
- `/Users/anilchinchawale/github/XDCNetwork/XDC-Geth/skills/consensus-parity-{audit,review,fix}/` — Hermes/Claude Code skills for human-driven workflow. Different surface, same target.

## License / status

Internal research scaffold. Don't push the `log/` directory; it's gitignored.