# V6 Sync Fix Spec: Non-Restarting Body Fetch with Peer Demotion

## Problem

When the body download stalls, `fetchBodiesXDC()` in v5b returns `errTimeout`. That restarts the entire `synchronise()` cycle and costs a 3-5s gap per cycle. Result: ~280 bl/s (blocks per second) actual vs ~750+ bl/s potential.

## Solution

Port v268's `fetchParts()` model, which is non-restarting and demotes peers based on QoS, to `fetchBodiesXDC()`.

## Files to Change

- `eth/downloader/downloader_xdc.go`: main changes
- `eth/downloader/peer.go`: add throughput tracking (optional)

## Core Changes

### 1. Replace `return errTimeout` with Peer Demotion

**Before (v5b):**

```go
// fetchBodiesXDC line ~700
if time.Since(lastProgress) > stallTimeout {
	log.Warn("XDC sync: body download stalled", ...)
	return errTimeout // KILLS SYNC CYCLE
}
```

**After (v6):**

```go
// On stall: expire timed-out requests, re-queue work, continue.
// inFlight maps peer ID to *peerRequest (see change 3).
if time.Since(lastProgress) > stallTimeout {
	for pid, req := range inFlight {
		if time.Since(req.startTime) > peerTimeout {
			log.Debug("XDC sync: peer body timeout, demoting", "peer", pid[:8])
			d.queue.CancelBodies(pid) // re-queue their work
			delete(inFlight, pid)
			peerTimeouts[pid]++
			lastProgress = time.Now() // reset stall timer
		}
	}
	// Only return an error if ALL peers are exhausted
	if d.peers.Len() == 0 {
		return errNoPeers
	}
}
```

### 2. Add Ticker-Driven Dispatch (100ms)

**Before:** Dispatch happens in a tight loop with `select` + `default`.

**After:** A 100ms ticker drives dispatch, as in v268:

```go
ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()

for {
	select {
	case <-d.cancelCh:
		return errCanceled
	case <-ticker.C:
		// Dispatch to idle peers
		// Check for timeouts
		// Deliver received bodies
	case body := <-xdcBodyCh:
		// Process delivery
	case cont := <-d.queue.blockWakeCh:
		// Headers done signal
	}
}
```

### 3. Track Per-Peer Request Timing

```go
type peerRequest struct {
	headers   []*types.Header
	startTime time.Time
}

inFlight := make(map[string]*peerRequest)
peerTimeout := 10 * time.Second // per-peer timeout (not global stall)
```

### 4. QoS Throughput Tracking (Optional Enhancement)

```go
type peerThroughput struct {
	blocksDelivered int
	totalTime       time.Duration
	batchSize       int // adjusted dynamically
}

// On successful delivery (min and max are builtins from Go 1.21 onward):
pt.blocksDelivered += delivered
pt.totalTime += time.Since(req.startTime)
pt.batchSize = min(512, pt.blocksDelivered*256/max(1, int(pt.totalTime.Seconds())))

// On timeout:
pt.batchSize = max(64, pt.batchSize/2)
```

### 5. Peer Lifecycle Events

```go
// Subscribe to peer connect/disconnect
peering := make(chan *peeringEvent, 64)
peeringSub := d.peers.SubscribeEvents(peering)
defer peeringSub.Unsubscribe()

// In main loop:
case event := <-peering:
	if event.join {
		// New peer: immediately dispatch pending work
	}
```

## Testing Plan

1. Build canary image: `gx:fast-sync-v6-`
2. Deploy to xdc02 apothem (lowest risk, fastest to validate)
3. Compare bl/s: v5b (current) vs v6 over 10-min windows
4. If ≥500 bl/s sustained with no stalls → promote to mainnet canary
5. If regressions → rollback to v5b image

## Success Criteria

- Sustained sync rate ≥500 bl/s (vs v5b's 280 bl/s)
- No sync cycle restarts in logs
- Peer count stable at ≥5 (vs v5b's 1-6, fluctuating)
- No data corruption (verify block hashes match the v268 reference)

## Rollback

- Keep v5b image tagged and ready
- All nodes can revert with `docker stop && docker run` using the v5b image
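
The batch-size arithmetic from the optional QoS throughput tracking (change 4) can be checked in isolation before it is wired into the downloader. The following is a minimal standalone sketch, not the downloader code. The names `clamp`, `minBatch`, `maxBatch`, `onDelivery`, `onTimeout`, and `demo` are hypothetical. It also makes one assumption beyond the spec: the floor of 64 is applied on delivery as well as on timeout, because the delivery formula on its own can drop below 64 for a slow peer.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minBatch = 64  // floor used on timeout in change 4
	maxBatch = 512 // ceiling used on delivery in change 4
)

// peerThroughput mirrors the struct sketched in change 4.
type peerThroughput struct {
	blocksDelivered int
	totalTime       time.Duration
	batchSize       int
}

// clamp is a hypothetical helper that works on toolchains older
// than Go 1.21, which lack the min and max builtins.
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// onDelivery records a successful batch and recomputes the batch size.
// Assumption: the result is also floored at minBatch, which the spec
// only states for the timeout path.
func (pt *peerThroughput) onDelivery(delivered int, elapsed time.Duration) {
	pt.blocksDelivered += delivered
	pt.totalTime += elapsed
	secs := int(pt.totalTime.Seconds()) // truncates sub-second totals to 0
	if secs < 1 {
		secs = 1
	}
	pt.batchSize = clamp(pt.blocksDelivered*256/secs, minBatch, maxBatch)
}

// onTimeout halves the batch size, never going below minBatch.
func (pt *peerThroughput) onTimeout() {
	pt.batchSize = clamp(pt.batchSize/2, minBatch, maxBatch)
}

// demo runs a fixed scenario and returns the batch size after each step.
func demo() []int {
	var sizes []int

	fast := &peerThroughput{batchSize: 128}
	fast.onDelivery(128, 2*time.Second) // 128*256/2 exceeds the ceiling
	sizes = append(sizes, fast.batchSize)
	for i := 0; i < 4; i++ { // repeated timeouts halve down to the floor
		fast.onTimeout()
		sizes = append(sizes, fast.batchSize)
	}

	slow := &peerThroughput{batchSize: 128}
	slow.onDelivery(1, 10*time.Second) // 1*256/10 = 25, floored to 64
	sizes = append(sizes, slow.batchSize)

	return sizes
}

func main() {
	fmt.Println(demo()) // prints [512 256 128 64 64 64]
}
```

One thing the sketch makes visible: with the formula as written, any peer averaging at least 2 blocks per second already reaches the 512 ceiling, so in practice the batch size would mostly be shaped by the timeout halving. Whether that is the intended behaviour is worth confirming against v268 before implementing.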