# Deep-Dive Forensic Analysis: XDC Apothem V2 Switch Sync Bug

## 1. Executive Summary

The GP5 fork (geth 1.17-based) fails to sync past Apothem block **56,830,292** because at the first V2 checkpoint (56,829,600), `repairSnapshot` recursively seeds `prevMasternodes` from the V1 switch block's snapshot (13 candidates). `calcMasternodes` then applies `HookPenalty` and produces ~10 masternodes, but the canonical network header carries 12 validators. The `verifyHeader` `CompareSignersLists` check fails with `validators not legit`. The v104 fix attempts to read candidates from smart contract state at the gap block, but fails with "missing trie node" when the state has been pruned, falling back to the same corrupt recursive repair.

**One-line fix**: At the first V2 checkpoint, bypass `calcMasternodes` entirely and read the full candidate list directly from the validator smart contract state at the gap block — but ensure the state is available by forcing a state commit before the gap block, or by using `StateAtNumber` instead of `StateAt(hash)`.

---

## 2. The Bug — Exact Apothem Block Numbers

| Block | Description |
|-------|-------------|
| **56,828,700** | V1 → V2 switch block (XDPoS v2 activation) |
| **56,828,250** | Gap block for switch epoch (= 56,828,700 - 450) |
| **56,829,600** | **First V2 checkpoint** (= 56,828,700 + 900) |
| **56,829,150** | Gap block for first V2 checkpoint (= 56,829,600 - 450) |
| **56,830,292** | **Node STUCK HERE** (only 1,592 blocks past the switch) |

The node syncs cleanly through V1, crosses the switch block, but halts permanently at block 56,830,292 because the snapshot at checkpoint 56,829,600 is corrupted.
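
The block numbers in the table follow from two constants that the arithmetic above implies: an epoch length of 900 and a gap of 450. A minimal sketch of that arithmetic, with illustrative constant and function names that are not taken from either codebase:

```go
package main

import "fmt"

// Values taken from the table above; the names are illustrative only.
const (
	switchBlock uint64 = 56_828_700
	epoch       uint64 = 900
	gap         uint64 = 450
)

// checkpointFor rounds a block number down to the checkpoint of its epoch.
func checkpointFor(number uint64) uint64 { return number - number%epoch }

// gapBlockFor returns the gap block that precedes the given checkpoint.
func gapBlockFor(checkpoint uint64) uint64 { return checkpoint - gap }

func main() {
	first := switchBlock + epoch
	fmt.Println("first V2 checkpoint:", first)
	fmt.Println("gap block for it:", gapBlockFor(first))
	fmt.Println("checkpoint governing the stuck block:", checkpointFor(56_830_292))
	fmt.Println("blocks past the switch:", 56_830_292-switchBlock)
}
```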

---

## 3. Three-Client Comparison Table

### 3.1 V2 Switch Block Detection

| Concept | A (GP5 Fork) | B (v2.6.8) | C (Nethermind) |
|---------|-------------|-----------|----------------|
| **SwitchBlock constant** | `params/config.go:260` — `SwitchBlock: big.NewInt(56828700)` | `common/constants.testnet.go:22` — `TIPV2SwitchBlock: big.NewInt(56828700)` | N/A (independent reimpl) |
| **V1→V2 dispatch** | `consensus/XDPoS/XDPoS.go` — `BlockConsensusVersion()` checks `header.Number > SwitchBlock` | Same pattern in `consensus/XDPoS/XDPoS.go` | N/A |
| **SwitchEpoch** | Hardcoded `63143` in config | Computed dynamically from `SwitchBlock / Epoch` | N/A |

**Verdict**: SwitchBlock values match. GP5 hardcodes SwitchEpoch; v2.6.8 computes it. Both yield 63143.
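
The agreement between the hardcoded and the computed value can be checked mechanically. The sketch below assumes an epoch length of 900 (implied by the block arithmetic in section 2); `deriveSwitchEpoch` and `apothemSwitchEpoch` are hypothetical helper names, not functions from either client:

```go
package main

import (
	"fmt"
	"math/big"
)

// deriveSwitchEpoch computes SwitchBlock / Epoch, the dynamic form
// attributed to v2.6.8 in the table above.
func deriveSwitchEpoch(switchBlock *big.Int, epoch uint64) uint64 {
	e := new(big.Int).SetUint64(epoch)
	return new(big.Int).Div(switchBlock, e).Uint64()
}

// apothemSwitchEpoch applies the derivation to the Apothem switch block.
func apothemSwitchEpoch() uint64 {
	return deriveSwitchEpoch(big.NewInt(56828700), 900)
}

func main() {
	const hardcoded = 63143 // value GP5 is reported to hardcode
	derived := apothemSwitchEpoch()
	fmt.Println("derived:", derived, "matches hardcoded:", derived == hardcoded)
}
```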

### 3.2 Epoch Boundary / Checkpoint Detection

| Concept | A (GP5 Fork) | B (v2.6.8) | C (Nethermind) |
|---------|-------------|-----------|----------------|
| **IsEpochSwitch** | `engine.go:994-1053` — uses `parentRound < epochStartRound` | `epochSwitch.go:158-188` — identical formula | N/A |
| **Epoch start round** | `round - round%Epoch` | `round - round%Epoch` | N/A |
| **Epoch number** | `SwitchEpoch + uint64(round)/Epoch` | `SwitchEpoch + uint64(round)/Epoch` | N/A |
| **First V2 epoch special case** | `quorumCert.ProposedBlockInfo.Number == SwitchBlock` | Same check at line 174 | N/A |

**Verdict**: IsEpochSwitch logic is identical between A and B.
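
The three formulas in the table can be sketched as plain functions. This is a simplified model, assuming an epoch length of 900 and omitting the first-V2-epoch special case (where the quorum certificate points at the switch block); the function and constant names are illustrative:

```go
package main

import "fmt"

const (
	epochLen    uint64 = 900   // assumed epoch length, per section 2
	switchEpoch uint64 = 63143 // SwitchBlock / Epoch
)

// epochStartRound mirrors `round - round%Epoch`.
func epochStartRound(round uint64) uint64 { return round - round%epochLen }

// epochNumber mirrors `SwitchEpoch + round/Epoch`.
func epochNumber(round uint64) uint64 { return switchEpoch + round/epochLen }

// isEpochSwitch mirrors the `parentRound < epochStartRound` test: the
// block is the first one whose round lands in a new epoch.
func isEpochSwitch(round, parentRound uint64) bool {
	return parentRound < epochStartRound(round)
}

func main() {
	fmt.Println(isEpochSwitch(900, 899)) // parent in previous epoch
	fmt.Println(isEpochSwitch(901, 900)) // parent already in this epoch
	fmt.Println(isEpochSwitch(905, 898)) // rounds 899..904 skipped
	fmt.Println(epochNumber(900))
}
```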

### 3.3 Snapshot Construction at First V2 Checkpoint

| Concept | A (GP5 Fork) | B (v2.6.8) | C (Nethermind) |
|---------|-------------|-----------|----------------|
| **Snapshot struct** | `SnapshotV2` with `Version` field (added for GP5) | `SnapshotV2` — **no Version field** | N/A |
| **Snapshot key** | `gapHeader.Hash()` | `gapHeader.Hash()` | N/A |
| **Snapshot storage** | `storeSnapshot()` — JSON with Version | `storeSnapshot()` — JSON without Version | N/A |
| **First snapshot init** | `initial()` at `engine.go:354-410` — uses `GetMasternodesFromEpochSwitchHeader()` + fallback to `decodeMasternodesFromHeaderExtra()` | `initial()` at `engine.go:228-259` — uses `getExtraFields()` directly | N/A |
| **V2 checkpoint repair** | `repairSnapshot()` at `engine.go:778-927` — **recursive repair with calcMasternodes** | **No repairSnapshot in v2.6.8** — snapshots are created during normal block processing via `UpdateM1()` | N/A |

**CRITICAL DIVERGENCE**: v2.6.8 does NOT have `repairSnapshot`. It creates snapshots during normal block import via `UpdateM1()` which reads candidates from the smart contract. GP5 added `repairSnapshot` for recovery scenarios, but the recursive path at the first V2 checkpoint is broken.

### 3.4 HookPenalty / calcMasternodes

| Concept | A (GP5 Fork) | B (v2.6.8) | C (Nethermind) |
|---------|-------------|-----------|----------------|
| **HookPenalty location** | Wired in `eth/backend.go` (not `eth/hooks/`) | Wired in `eth/hooks/engine_v2_hooks.go:23` | N/A |
| **calcMasternodes signature** | `(chain, blockNum, parentHash, round) → (masternodes, penalties, candidates, error)` | `(chain, blockNum, parentHash, round) → (masternodes, penalties, error)` — **no candidates return** | N/A |
| **First V2 block special** | `blockNum == SwitchBlock+1` → no penalties, return candidates directly | `blockNum == SwitchBlock+1` → no penalties, return candidates directly | N/A |
| **HookPenalty comeback gate** | `number > comebackHeight` where `comebackHeight = (LimitPenaltyEpochV2+1)*Epoch + SwitchBlock` | Identical formula at `engine_v2_hooks.go:92` | N/A |
| **preMasternodes source** | `GetMasternodesByHash(chain, currentHash)` | `GetMasternodesByHash(chain, currentHash)` | N/A |

**Verdict**: calcMasternodes and HookPenalty are structurally similar. The key difference is that v2.6.8's `UpdateM1()` (called during block import) reads candidates from the smart contract state, while GP5's `repairSnapshot` tries to derive them recursively.
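
Two pieces of the table are pure arithmetic and can be modelled in isolation: the comeback height and the subtraction of penalties from candidates. The sketch below uses plain strings for addresses, and `removePenalized` is a hypothetical stand-in for the `removeItemFromArray` step in the call graphs, not the actual helper:

```go
package main

import "fmt"

// comebackHeight mirrors `(LimitPenaltyEpochV2+1)*Epoch + SwitchBlock`.
func comebackHeight(limitPenaltyEpochV2, epoch, switchBlock uint64) uint64 {
	return (limitPenaltyEpochV2+1)*epoch + switchBlock
}

// removePenalized returns candidates minus penalties, preserving the
// order of the candidate list.
func removePenalized(candidates, penalties []string) []string {
	banned := make(map[string]struct{}, len(penalties))
	for _, p := range penalties {
		banned[p] = struct{}{}
	}
	out := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if _, hit := banned[c]; !hit {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// LimitPenaltyEpochV2 = 0 and Epoch = 900 are the values this report assumes.
	fmt.Println("comeback height:", comebackHeight(0, 900, 56828700))

	candidates := []string{"a", "b", "c", "d", "e"}
	fmt.Println(removePenalized(candidates, []string{"b", "d"}))
}
```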

### 3.5 Validator-Set Verification

| Concept | A (GP5 Fork) | B (v2.6.8) | C (Nethermind) |
|---------|-------------|-----------|----------------|
| **verifyHeader entry** | `verifyHeader.go:23` | `verifyHeader.go:20` | N/A |
| **Epoch switch check** | `IsEpochSwitch(header)` then `calcMasternodes()` vs `header.Validators` | Identical at `verifyHeader.go:119-151` | N/A |
| **Comparison function** | `utils.CompareSignersLists(localMasterNodes, validatorsAddress)` | `utils.CompareSignersLists(localMasterNodes, validatorsAddress)` | N/A |
| **Error on mismatch** | `utils.ErrValidatorsNotLegit` | `utils.ErrValidatorsNotLegit` | N/A |

**Verdict**: Validator verification is identical between A and B.
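
Because the comparison is the same in both clients, the failure can only come from its inputs. The sketch below illustrates the point with an order-insensitive set comparison; that semantics is an assumption, and `sameSigners` is a hypothetical stand-in rather than the body of `utils.CompareSignersLists`. Whatever the exact ordering rules, lists of different lengths cannot match:

```go
package main

import (
	"fmt"
	"sort"
)

// sameSigners reports whether two address lists hold the same entries,
// ignoring order. Copies are sorted so the inputs are not mutated.
func sameSigners(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

func main() {
	local := []string{"0x01", "0x02"}          // locally derived set
	header := []string{"0x02", "0x01", "0x03"} // set carried by the header
	fmt.Println(sameSigners(local, header))
	fmt.Println(sameSigners([]string{"0x02", "0x01"}, local))
}
```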

### 3.6 Persistence

| Concept | A (GP5 Fork) | B (v2.6.8) | C (Nethermind) |
|---------|-------------|-----------|----------------|
| **Snapshot DB key** | `[]byte("XDPoS-V2-") + hash[:]` | `[]byte("XDPoS-V2-") + hash[:]` | N/A |
| **Format** | JSON with `Version` field | JSON without `Version` field | N/A |
| **Cache** | LRU cache `x.snapshots` | LRU cache `x.snapshots` | N/A |
| **Version gate** | GP5 rejects `Version < 3` (fixed in #385 to accept 0) | No version gate (no Version field) | N/A |

---

## 4. Annotated Call Graphs

### 4.1 Checkpoint Block (56,829,600) — What Each Client Does

**Client A (GP5 Fork) — BROKEN PATH:**
```
verifyHeader(header@56829600)
  → IsEpochSwitch(header) → true
  → calcMasternodes(chain, 56829600, parentHash, round)
    → getSnapshot(chain, 56829600, false)
      → gapNumber = 56829600 - 56829600%900 - 450 = 56829150
      → gapHeader = chain.GetHeaderByNumber(56829150)
      → loadSnapshot(db, gapHeader.Hash()) → nil (not yet created)
      → checkpointNumber = 56829600
      → checkpointNumber > SwitchBlock → repairSnapshot(chain, 56829600)
        → firstV2Checkpoint = 56828700 + 900 = 56829600 ✓ (MATCHES)
        → gapNumber = 56829600 - 450 = 56829150
        → gapHeader = chain.GetHeaderByNumber(56829150)
        → StateAt(gapHeader.Root) → "missing trie node" ERROR!
        → FALLBACK to recursive repair:
          → prevCheckpoint = 56829600 - 900 = 56828700 (= SwitchBlock)
          → repairSnapshot(chain, 56828700)
            → checkpointNumber <= SwitchBlock → GetMasternodesFromEpochSwitchHeader(chain, switchBlockHeader)
            → Returns 13 masternodes (V1 format from header.Extra)
          → prevMasternodes = [13 addresses]
          → calcMasternodes(chain, 56829150, gapParentHash, round)
            → getSnapshot(chain, 56829150, false)
              → gapNumber = 56829150 (already a gap block)
              → loadSnapshot → nil
              → repairSnapshot(chain, 56829150) ... RECURSION
            → HookPenalty(chain, 56829150, parentHash, 13 candidates)
              → Walks back to previous epoch switch (56828700)
              → Counts blocks mined by each of 13 masternodes
              → Some didn't mine enough → penalties = [2-3 addresses]
              → Returns penalties
            → masternodes = removeItemFromArray(13 candidates, penalties)
            → masternodes = ~10 addresses
          → newSnapshot(56829150, gapHeader.Hash(), ~10 candidates)
          → storeSnapshot → DB
        → Returns ~10 candidates
    → Returns ~10 candidates
  → localMasterNodes = ~10
  → validatorsAddress = extract from header.Validators = 12 addresses
  → CompareSignersLists(~10, 12) → FALSE
  → return ErrValidatorsNotLegit
```

**Client B (v2.6.8) — WORKING PATH:**
```
verifyHeader(header@56829600)
  → IsEpochSwitch(header) → true
  → calcMasternodes(chain, 56829600, parentHash, round)
    → getSnapshot(chain, 56829600, false)
      → gapNumber = 56829150
      → gapHeader = chain.GetHeaderByNumber(56829150)
      → loadSnapshot(db, gapHeader.Hash()) → FOUND! (created during block import)
      → snap.NextEpochCandidates = [full candidate list from smart contract]
    → candidates = [full list, e.g., 18-20 addresses]
  → HookPenalty(chain, 56829600, parentHash, 18 candidates)
    → Walks back, counts blocks, applies penalties
    → Returns ~2-3 penalties
  → masternodes = 18 - 3 = 15
  → localMasterNodes = 15
  → validatorsAddress = 15 (from header)
  → CompareSignersLists(15, 15) → TRUE ✓
  → Continue sync
```

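**Key difference**: v2.6.8's snapshot at gap block 56829150 was created during normal block import by `UpdateM1()`, which reads from the smart contract. GP5's `repairSnapshot` tries to rebuild it recursively and fails because the state at the gap block is pruned. The counts in the Client B trace (18 candidates, 15 validators) are illustrative and are not meant to match the 12 validators shown for the same header in the Client A trace.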
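
The two traces differ only in whether the lookup keyed by the gap header hash hits. A toy model of that branch, using a map in place of the snapshot DB; `snapshotStore`, `candidatesFor`, and the key strings are hypothetical and exist only for this illustration:

```go
package main

import "fmt"

// snapshotStore is a toy stand-in for the snapshot DB, keyed by the
// gap block's hash.
type snapshotStore map[string][]string

// candidatesFor returns the stored candidate list when one exists and
// calls repair (the fallback) only on a miss. The bool reports a hit.
func candidatesFor(store snapshotStore, gapHash string, repair func() []string) ([]string, bool) {
	if c, ok := store[gapHash]; ok {
		return c, true
	}
	return repair(), false
}

func main() {
	// Placeholder for a list rebuilt from V1 data by the fallback path.
	repair := func() []string { return []string{"v1-a", "v1-b"} }

	// Snapshot written at import time: the lookup hits, repair never runs.
	imported := snapshotStore{"gap-56829150": {"c1", "c2", "c3"}}
	c, hit := candidatesFor(imported, "gap-56829150", repair)
	fmt.Println(len(c), hit)

	// Nothing stored: the fallback supplies the candidate list.
	c, hit = candidatesFor(snapshotStore{}, "gap-56829150", repair)
	fmt.Println(len(c), hit)
}
```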

---

## 5. Ranked Hypotheses

### H1: Boundary off-by-one — TREAT AS CONFIRMED (partial)
The first V2 checkpoint calculation is correct (`56828700 + 900 = 56829600`), so there is no arithmetic off-by-one. The remaining boundary problem is state availability: the contract read for this checkpoint needs the state at its gap block (`56829600 - 450 = 56829150`), and that state may have been pruned because the node is syncing from a snapshot that doesn't include state at that height.

### H2: Penalty window bleeds across switch — CONFIRMED
`HookPenalty` walks back from the current block to the previous epoch switch. At the first V2 checkpoint (56829600), it walks back to the switch block (56828700). The V1 blocks between 56828700 and 56829600 have different signing semantics, but `HookPenalty` treats them as V2 blocks. This causes incorrect penalty calculation.

**Evidence**: The log shows `candidates=13` after repair, which is the V1 masternode count. After `HookPenalty`, it drops to ~10.

### H3: Extra-data layout mismatch — RULED OUT
Both A and B use identical `sigHash` and `getExtraFields` implementations. The RLP encoding includes `Validators` and `Penalties` unconditionally.

### H4: Snapshot seed source wrong — CONFIRMED (ROOT CAUSE)
At the first V2 checkpoint, `repairSnapshot` recursively calls itself with `prevCheckpoint = checkpointNumber - Epoch = 56828700` (the switch block). The switch block returns 13 masternodes (V1 format). These 13 are then used as the candidate list for the first V2 checkpoint, but the canonical network has moved to a new validator set (possibly 12-18 masternodes).

**Evidence**: Log shows `repaired snapshot checkpoint=56829600 gap=56829150 candidates=13` — this is the V1 count, not the V2 count.

### H5: Round-vs-number confusion — RULED OUT
Both A and B use the same `IsEpochSwitch` logic with `parentRound < epochStartRound`.

### H6: Cache poisoning / persistence — CONFIRMED (secondary)
Once the corrupt snapshot with 13 candidates is stored to DB, it is reloaded on restart. GP5's version gate (rejecting `Version < 3`) was supposed to catch this, but v2.6.8 snapshots have `Version = 0`, so the gate was relaxed in #385. The `candidates == 13` check at V2 checkpoints is the current defense.
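
The reason a v2.6.8 snapshot appears as `Version = 0` is ordinary JSON decoding: a field absent from the stored blob keeps its zero value. A minimal sketch, in which `snapshotV2` is a trimmed stand-in for `SnapshotV2` and the JSON tag names are assumptions rather than the real on-disk layout:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshotV2 is a trimmed, hypothetical stand-in for SnapshotV2.
// The JSON tags are assumed for this illustration.
type snapshotV2 struct {
	Version             uint64   `json:"version"`
	Number              uint64   `json:"number"`
	NextEpochCandidates []string `json:"masterNodes"`
}

// decodeSnapshot parses a stored snapshot blob.
func decodeSnapshot(blob []byte) (snapshotV2, error) {
	var s snapshotV2
	err := json.Unmarshal(blob, &s)
	return s, err
}

func main() {
	// A blob written without any version field, as v2.6.8 would store it.
	legacy := []byte(`{"number":56829150,"masterNodes":["0xaa","0xbb"]}`)
	s, err := decodeSnapshot(legacy)
	if err != nil {
		panic(err)
	}
	// Version is the zero value because the field was absent.
	fmt.Println(s.Version, s.Number, len(s.NextEpochCandidates))
}
```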

---

## 6. Identified Root Cause

**Primary**: At the first V2 checkpoint (56829600), `repairSnapshot` seeds `prevMasternodes` from the V1 switch block (13 masternodes). The recursive repair then applies `HookPenalty` to these 13 candidates, producing ~10 masternodes. But the canonical network's header at the next epoch switch carries the correct validator set (12+ masternodes derived from the smart contract). The mismatch triggers `validators not legit`.

**Secondary**: The v104 fix attempts to read candidates from smart contract state at the gap block (56829150), but fails with "missing trie node" because the state has been pruned during sync from a cold snapshot. The fallback to recursive repair reproduces the same corruption.

**Why v2.6.8 works**: v2.6.8 creates snapshots during normal block import via `UpdateM1()` (called from `blockchain.go` at gap blocks). The snapshot is created from the smart contract state at the exact moment the gap block is processed, so the state is always available. GP5's `repairSnapshot` is a recovery mechanism that tries to reconstruct the snapshot after the fact, but it cannot access pruned state.

---

## 7. Proposed Patch

The fix must ensure that at the first V2 checkpoint, the candidate list is sourced from the smart contract state, not from recursive repair. Two approaches:

### Approach A: Force state availability (recommended)
Before calling `repairSnapshot` at the first V2 checkpoint, ensure the state at the gap block is available by triggering a trie commit:

```go
// In repairSnapshot, in place of the bare StateAt(gapHeader.Root) call.
// StateAtNumber is an assumed optional capability of the chain reader,
// probed via a type assertion.
statedb, err := stateReader.StateAt(gapHeader.Root)
if err != nil {
    // State not available at this root: try to regenerate it by block number.
    if sn, ok := chain.(interface {
        StateAtNumber(uint64) (*state.StateDB, error)
    }); ok {
        statedb, err = sn.StateAtNumber(gapNumber)
    }
}
if err == nil {
    // Read the full candidate list from the validator contract state.
    candidates := state.GetCandidates(statedb)
    // ... sort and return
}
```

### Approach B: Use UpdateM1-style candidate reading
Instead of recursive repair at the first V2 checkpoint, use the same pattern as v2.6.8's `UpdateM1()`:

```go
// In repairSnapshot at first V2 checkpoint:
// Read candidates from validator contract via ethclient (like UpdateM1 does)
// This doesn't require local state, just a contract call
```

### Approach C: Skip penalty at first V2 checkpoint (simplest)
At the first V2 checkpoint, the penalty window would include V1 blocks. Skip `HookPenalty` entirely for the first V2 epoch:

```go
// In calcMasternodes:
if blockNum.Uint64() == x.config.V2.SwitchBlock.Uint64()+1 {
    // First V2 block — no penalties
    return candidates, []common.Address{}, candidates, nil
}
// NEW: Also skip penalties at first V2 checkpoint
if blockNum.Uint64() == x.config.V2.SwitchBlock.Uint64()+x.config.Epoch {
    log.Info("[calcMasternodes] first V2 checkpoint — skipping penalties")
    return candidates, []common.Address{}, candidates, nil
}
```

---

## 8. Test Plan

1. **Unit test**: Create a test that simulates the first V2 checkpoint with a mock chain where the switch block has 13 masternodes. Verify that `repairSnapshot` returns the correct candidate count (not 13) when smart contract state is available.

2. **Integration test**: Start a fresh GP5 node with the patched binary (the unpatched v104 binary should reproduce the failure) and sync from block 50M on Apothem. Monitor for `validators not legit` errors around block 56,830,292. Assert that sync continues past 56,830,300.

3. **Regression test**: Verify V1 checkpoint processing is unchanged (test with pre-V2 blocks).

4. **Cross-client conformance**: Compare snapshot content for block 56829150 between GP5 and v2.6.8. Assert `NextEpochCandidates` lists match.
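
For the cross-client conformance step, a set difference is more useful than a boolean, because it names the addresses that one client has and the other lacks. A minimal sketch, assuming candidates are exported as hex strings; `diffCandidates` is a hypothetical helper, and lowercasing is used only so that differently cased hex encodings compare equal:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// diffCandidates returns the addresses present only in a and only in b,
// each sorted. Addresses are compared case-insensitively.
func diffCandidates(a, b []string) (onlyA, onlyB []string) {
	inA := make(map[string]bool, len(a))
	for _, x := range a {
		inA[strings.ToLower(x)] = true
	}
	inB := make(map[string]bool, len(b))
	for _, x := range b {
		inB[strings.ToLower(x)] = true
	}
	for x := range inA {
		if !inB[x] {
			onlyA = append(onlyA, x)
		}
	}
	for x := range inB {
		if !inA[x] {
			onlyB = append(onlyB, x)
		}
	}
	sort.Strings(onlyA)
	sort.Strings(onlyB)
	return onlyA, onlyB
}

func main() {
	gp5 := []string{"0xAA", "0xbb"}    // placeholder export from GP5
	v268 := []string{"0xaa", "0xcc"}   // placeholder export from v2.6.8
	onlyGP5, onlyV268 := diffCandidates(gp5, v268)
	fmt.Println("only in GP5:", onlyGP5)
	fmt.Println("only in v2.6.8:", onlyV268)
}
```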

---

## 9. Open Questions

1. **State pruning policy**: Does GP5 use PBSS (path-based state scheme) or hash-based? The "missing trie node" error suggests the state at gap block 56829150 was pruned. What is the pruning window?

2. **v2.6.8 snapshot compatibility**: v2.6.8 stores snapshots without a `Version` field. When GP5 loads these, `Version` is 0. The current fix accepts Version 0-3, but should we bump the version to 3 when creating new snapshots?

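3. **HookPenalty comeback gate**: With `LimitPenaltyEpochV2 = 0`, `comebackHeight = SwitchBlock + Epoch = 56,829,600`, so the `number > comebackHeight` gate is false at the first V2 checkpoint itself and comeback is not applied there. Is this intentional? v2.6.8 has the same setting.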

4. **Nethermind verification**: Can we obtain a Nethermind XDPoS node log for the same Apothem block range to confirm the canonical validator set at block 56829600?

---

*Report compiled from code analysis of:*
- GP5 fork: `github.com/XDCIndia/go-ethereum` @ commit `0b98fe2b1`
- v2.6.8 baseline: `github.com/XinFinOrg/XDPoSChain` @ tag `v2.6.8` (commit `146252a`)
