## Comprehensive Analysis: Issue #391 — GP5 Panic at V2 Switch

### 1. Root Cause Analysis

**Two-issue model confirmed:**

- **Issue A (root cause)**: Cold snapshots only export chainDB (blocks/state), not the pebble keys holding XDPoS V2 consensus snapshots (`"XDPoS-V2-" + gapBlockHash[:]`). On restart at ~56,827,375, `loadSnapshot()` returns nil for every gap-block lookup near the V2 switch.
- **Issue B (trigger)**: In `initial()`, when `loadSnapshot` at `engine.go:350` returns nil, the code falls through to `chain.GetHeaderByNumber(V2.SwitchBlock)`. During cold sync that lookup also returns nil, because block 56,828,700 has not been inserted yet. The nil header is then passed to `getExtraFields()`, and `GetMasternodesFromEpochSwitchHeader` dereferences it at `engine.go:837`.

**Exact panic sequence:**
1. `verifyHeader()` → `initial()` (engine.go:25)
2. `loadSnapshot(lastGapHeader)` at engine.go:350 returns nil (Issue A)
3. `chain.GetHeaderByNumber(56,828,700)` at engine.go:352 returns nil (block not yet synced)
4. `getExtraFields(chain, nil)` at engine.go:365 (pre-fix)
5. `GetMasternodesFromEpochSwitchHeader` at engine.go:837 dereferences nil → panic matching the reported trace (`engine.go:1219`/`engine.go:357` in the historical line numbering).
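
The failure mode in steps 3 to 5 can be reproduced in isolation. The sketch below is not the engine code: `Header`, `ErrNilHeader`, `extraLenUnsafe`, and `extraLenSafe` are hypothetical names chosen for illustration. It only shows that dereferencing a nil header panics, while a nil guard turns the same condition into an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Header is a hypothetical stand-in for the chain's block header type.
type Header struct {
	Number uint64
	Extra  []byte
}

// ErrNilHeader is a hypothetical sentinel error used only in this sketch.
var ErrNilHeader = errors.New("nil epoch-switch header")

// extraLenUnsafe mirrors the pre-fix shape: it dereferences the header
// without checking it, so a nil header causes a runtime panic.
func extraLenUnsafe(h *Header) int {
	return len(h.Extra)
}

// extraLenSafe checks for nil first and reports an error instead.
func extraLenSafe(h *Header) (int, error) {
	if h == nil {
		return 0, ErrNilHeader
	}
	return len(h.Extra), nil
}

// panics reports whether calling f panics.
func panics(f func()) (did bool) {
	defer func() {
		if recover() != nil {
			did = true
		}
	}()
	f()
	return false
}

func main() {
	var missing *Header // what a header lookup yields for an unsynced block
	fmt.Println("unsafe panics:", panics(func() { extraLenUnsafe(missing) }))
	_, err := extraLenSafe(missing)
	fmt.Println("safe error:", err)
}
```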

### 2. Code Walkthrough

**`initial()` (engine.go:343–396)**
- Line 345–347: `lastGapNum = V2.SwitchBlock - Gap` = 56,828,250
- Line 348: `lastGapHeader` fetched from chainDB (succeeds — blocks are present)
- Line 350: `loadSnapshot(x.db, lastGapHeader.Hash())` → nil (pebble missing)
- Line 352: `checkpointHeader = chain.GetHeaderByNumber(V2.SwitchBlock)` → nil
- Line 353–361 (**commit abddd87fc**): nil check. Falls back to the `header` parameter when that header is itself the switch block; otherwise returns an error instead of panicking
- Line 365: `getExtraFields(chain, checkpointHeader)` to build masternode list
- Line 376–378: `newSnapshot` + `storeSnapshot` to persist
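
The fallback described for lines 353–361 amounts to a three-way choice. The following is a minimal sketch of that decision, assuming a simplified `Header` type and a hypothetical helper name `resolveCheckpoint`; the real code in `initial()` may be structured differently.

```go
package main

import "fmt"

// Header is a simplified, hypothetical stand-in for the block header type.
type Header struct{ Number uint64 }

// resolveCheckpoint picks the header used to seed the first V2 snapshot.
// stored is the result of the chain lookup (nil if the block is not yet
// inserted); incoming is the header currently being verified.
func resolveCheckpoint(stored, incoming *Header, switchBlock uint64) (*Header, error) {
	if stored != nil {
		return stored, nil // normal path: switch block already in the chain
	}
	if incoming != nil && incoming.Number == switchBlock {
		return incoming, nil // the header being verified is the switch block
	}
	return nil, fmt.Errorf("switch block %d not available locally", switchBlock)
}

func main() {
	const switchBlock = 56_828_700

	// Verifying the switch block itself: the incoming header is usable.
	h, err := resolveCheckpoint(nil, &Header{Number: switchBlock}, switchBlock)
	fmt.Println(h != nil, err)

	// Verifying a later block while the switch block is missing: clean error.
	h, err = resolveCheckpoint(nil, &Header{Number: switchBlock + 1}, switchBlock)
	fmt.Println(h != nil, err)
}
```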

**`getExtraFields()` (engine.go:1065–1106)**
- Line 1069: Switch-block special case — decodes masternodes from V1-format `header.Extra[32:len-65]` via `decodeMasternodesFromHeaderExtra`
- Line 1075: For post-switch blocks, delegates to `GetMasternodesFromEpochSwitchHeader`
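
For the switch-block special case, the slice `header.Extra[32:len-65]` can be decoded as a packed address list. The sketch below assumes 20-byte addresses and uses the hypothetical name `decodeMasternodes`; it illustrates the layout and is not a copy of `decodeMasternodesFromHeaderExtra`.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	extraVanity   = 32 // leading bytes skipped, per Extra[32:...]
	extraSeal     = 65 // trailing bytes skipped, per Extra[...:len-65]
	addressLength = 20 // assumption: standard 20-byte addresses
)

// Address is a hypothetical stand-in for the chain's address type.
type Address [addressLength]byte

// decodeMasternodes extracts the packed address list between the vanity
// prefix and the seal suffix of a V1-format Extra field.
func decodeMasternodes(extra []byte) ([]Address, error) {
	if len(extra) < extraVanity+extraSeal {
		return nil, errors.New("extra data too short")
	}
	payload := extra[extraVanity : len(extra)-extraSeal]
	if len(payload)%addressLength != 0 {
		return nil, errors.New("payload is not a whole number of addresses")
	}
	out := make([]Address, len(payload)/addressLength)
	for i := range out {
		copy(out[i][:], payload[i*addressLength:(i+1)*addressLength])
	}
	return out, nil
}

func main() {
	// 32 vanity bytes + 13 addresses + 65 seal bytes = 357 bytes.
	extra := make([]byte, extraVanity+13*addressLength+extraSeal)
	for i := 0; i < 13; i++ {
		extra[extraVanity+i*addressLength] = byte(i + 1) // tag each address
	}
	nodes, err := decodeMasternodes(extra)
	fmt.Println(len(nodes), err)
}
```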

**`GetMasternodesFromEpochSwitchHeader` (engine.go:836–857)** — original panic site; now has `if header == nil { return []common.Address{} }` guard

**`snapshot()` walk-back (xdpos.go:1422)** is V1-only; `snapshotMaxWalkBack = 2_000_000` (xdpos.go:75). V2 instead uses `getSnapshot()` + `repairSnapshot()` (engine.go:654–831), which walk back one epoch (900 blocks) at a time with no hard bound. That walk **fails when no pebble snapshot exists anywhere within reachable history**, which is exactly the cold-snapshot situation.
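
A per-epoch walk-back of this kind can be sketched as follows. This is an illustration only: `findSnapshotGap` and its `maxEpochs` bound are hypothetical (the text above notes the real V2 path has no hard bound), and the `has` callback stands in for a snapshot lookup in the database.

```go
package main

import "fmt"

// findSnapshotGap walks back from startGap one epoch at a time and returns
// the first gap-block number for which has() reports a stored snapshot.
// maxEpochs is a hypothetical bound added for this sketch.
func findSnapshotGap(startGap, epoch uint64, maxEpochs int, has func(uint64) bool) (uint64, bool) {
	gap := startGap
	for i := 0; i < maxEpochs; i++ {
		if has(gap) {
			return gap, true
		}
		if gap < epoch {
			break // reached the start of the chain
		}
		gap -= epoch
	}
	return 0, false
}

func main() {
	const epoch = 900
	start := uint64(56_830_050) // a gap block two epochs after 56,828,250

	// Warm node: the snapshot written by initial() at 56,828,250 exists.
	warm := map[uint64]bool{56_828_250: true}
	n, ok := findSnapshotGap(start, epoch, 10, func(g uint64) bool { return warm[g] })
	fmt.Println(n, ok)

	// Cold-restored node: no consensus snapshots at all, so the walk fails.
	_, ok = findSnapshotGap(start, epoch, 10, func(uint64) bool { return false })
	fmt.Println(ok)
}
```

Whether a bound is desirable in the real code is a design question; the point here is only that the walk has nothing to terminate on when every lookup misses.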

**Snapshot storage model:**
- chainDB (pebble file): blocks, bodies, state trie
- Same pebble instance, separate key namespace: `"XDPoS-V2-" + hash[:]` — only written at gap blocks `(num+Gap)%Epoch == 0`
- Cold-snapshot tarballs commonly omit these consensus keys; this is the infrastructural root cause
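
The key layout and the gap-block condition above can be written out directly. In this sketch `Hash`, `snapshotKey`, and `isGapBlock` are hypothetical names; the prefix, the predicate, and the Epoch/Gap values (900 and 450) are taken from this analysis.

```go
package main

import "fmt"

// Hash is a hypothetical stand-in for a 32-byte block hash.
type Hash [32]byte

// snapshotPrefix is the key namespace for V2 consensus snapshots.
const snapshotPrefix = "XDPoS-V2-"

// snapshotKey builds the database key: prefix followed by the raw hash bytes.
func snapshotKey(h Hash) []byte {
	return append([]byte(snapshotPrefix), h[:]...)
}

// isGapBlock reports whether num is a gap block, i.e. (num+gap)%epoch == 0.
func isGapBlock(num, epoch, gap uint64) bool {
	return (num+gap)%epoch == 0
}

func main() {
	const (
		epoch = 900
		gap   = 450
	)
	fmt.Println(isGapBlock(56_828_250, epoch, gap)) // lastGapNum before the switch
	fmt.Println(isGapBlock(56_828_700, epoch, gap)) // the switch block itself
	fmt.Println(len(snapshotKey(Hash{})))           // 9-byte prefix + 32-byte hash
}
```

An export tool that filters by key prefix would need to include this namespace explicitly, since these keys are not reachable from block or state data.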

### 3. NextEpochCandidates vs Masternodes

`SnapshotV2` (snapshot.go:19) stores `NextEpochCandidates` with JSON tag `masterNodes` for v2.6.8 compatibility. These are the **candidate list** (pre-penalty). The **actual signer set** is computed per-epoch by `calcMasternodes()` applying `HookPenalty` to candidates.

At the V1→V2 switch block 56,828,700:
- `header.Extra[32:len-65]` holds 13 V1-era masternode addresses
- First V2 block bypasses `HookPenalty` (engine.go:955–961)
- So the 13 candidates become the signer set directly
- If the live signer producing 56,828,701 is not among those 13 (for example, a validator registered late in the contract), verification fails. This is why **commit 843c73a77** changed `UpdateMasternodesFromHeader` to read candidates from contract state rather than inheriting them from the previous snapshot. The 13-count mismatch is a symptom of the stale-candidate path.
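
The relationship between candidates and the signer set can be illustrated with a simple filter. This is a deliberately reduced sketch: `applyPenalties` is a hypothetical name, addresses are plain strings, and the real `calcMasternodes` may apply further rules beyond removing penalized candidates.

```go
package main

import "fmt"

// applyPenalties returns the candidates that are not in the penalized set,
// preserving the original candidate order.
func applyPenalties(candidates, penalized []string) []string {
	drop := make(map[string]struct{}, len(penalized))
	for _, p := range penalized {
		drop[p] = struct{}{}
	}
	out := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if _, bad := drop[c]; !bad {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	candidates := []string{"xdcA", "xdcB", "xdcC"}

	// Regular epoch: penalized candidates are removed from the signer set.
	fmt.Println(applyPenalties(candidates, []string{"xdcB"}))

	// First V2 block: the penalty hook is bypassed, so the signer set is
	// the candidate list unchanged.
	fmt.Println(applyPenalties(candidates, nil))
}
```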

### 4. v2.6.8 Delta

Key GP5 additions vs v2.6.8:
- **verifyHeader.go fallback** (26e9d5343): `GetMasternodesWithParents` fallback when `calcMasternodes` fails during bulk sync — v2.6.8 hard-errors here
- **XDC gas limit bound** (verifyHeader.go:66–73): 50% allowance vs Ethereum's 1/1024
- **Snapshot version gate** (engine.go:687–717): originally rejected v2.6.8 snapshots (version 0); later loosened (576871c11) to accept versions 0/1/2/3 and reject only exact-13 corruption
- **Contract-backed candidate update** (843c73a77): aligns with v2.6.8's `UpdateM1()` behavior
- **nil-checkpointHeader fix** (abddd87fc): new in GP5 — not present in v2.6.8 because v2.6.8 reaches this path less often (its snapshot handling is more permissive)
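
To make the gas-limit item concrete, the sketch below compares the two bounds. The Ethereum rule (the change must be strictly less than `parent/1024`) is standard; applying the same shape with a divisor of 2 for the 50% allowance is an assumption, and `gasLimitDeltaOK` is a hypothetical helper rather than the code in verifyHeader.go.

```go
package main

import "fmt"

// gasLimitDeltaOK reports whether the child gas limit differs from the
// parent gas limit by strictly less than parent/divisor.
func gasLimitDeltaOK(parent, child, divisor uint64) bool {
	var diff uint64
	if child > parent {
		diff = child - parent
	} else {
		diff = parent - child
	}
	return diff < parent/divisor
}

func main() {
	parent, child := uint64(30_000_000), uint64(40_000_000)

	// Ethereum-style bound: a jump of 10M gas is far outside parent/1024.
	fmt.Println(gasLimitDeltaOK(parent, child, 1024))

	// Assumed 50% allowance: the same jump is within parent/2.
	fmt.Println(gasLimitDeltaOK(parent, child, 2))
}
```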

### 5. Recommended Fix

**Tier 1 — Panic prevention (done, abddd87fc)** at engine.go:353–361. Pure mitigation.

**Tier 2 — Snapshot recovery (partial, `repairSnapshot` engine.go:750–831)**. Recursive per-epoch rebuild. **Fails when every pebble snapshot in range is absent**, which is the cold-sync case.

**Tier 3 — Real fix (required, not in consensus code)**: cold-snapshot generation must include the XDPoS consensus keys. Two options:
- **Export-side**: modify snapshot tooling to preserve pebble keys with prefix `XDPoS-V2-` (and V1 equivalents)
- **Boot-side recovery**: on detecting missing snapshot at/near V2 switch, replay votes/headers from chain data to reconstruct `NextEpochCandidates` (bounded walk from most recent checkpoint block forward, using `decodeMasternodesFromHeaderExtra` at the switch block + contract state read for candidates)

**Verdict on abddd87fc**: it is a correct stopgap. It converts a panic into a returned error and, when `initial()` is called via `verifyHeader()` with the switch-block header itself, it bootstraps successfully. It does **not** solve Issue A: a node restored from a cold snapshot that lands strictly before the switch block, without the switch-block header already available locally, will still fail to bootstrap, only cleanly instead of by crashing.

**Recommended patch strategy (in-code):**
1. Keep abddd87fc.
2. Extend `initial()`: when `checkpointHeader == nil` and `header` is past the switch, **fetch the switch-block header via peer sync** rather than returning an error, or defer snapshot creation until the switch header is locally available.
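3. Add an **on-boot snapshot-integrity check**: if `V2.SwitchBlock` falls within a window around `currentBlock` and no V2 snapshot exists in pebble, log a warning pointing operators to the cold-snapshot incompatibility and trigger a rebuild attempt from contract state. A window of `[currentBlock - Gap, currentBlock + Gap]` is too narrow for the reported case: 56,827,375 is 1,325 blocks before the switch and Gap is 450, so a wider window, for example two epochs (1,800 blocks) on each side, is needed.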
4. Coordinate with infra to fix the snapshot export.
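
The boot-time check in step 3 could take roughly the following shape. This is a sketch under stated assumptions: `switchNear` and `needsRebuild` are hypothetical names, the snapshot lookup is reduced to a boolean, and the window size is left as a parameter so different choices can be compared against the reported restart height.

```go
package main

import "fmt"

// switchNear reports whether switchBlock lies within ±window of current.
func switchNear(current, switchBlock, window uint64) bool {
	lo := uint64(0)
	if current > window {
		lo = current - window
	}
	return switchBlock >= lo && switchBlock <= current+window
}

// needsRebuild is true when the node sits near the V2 switch and has no
// stored V2 snapshot, i.e. the cold-snapshot situation.
func needsRebuild(current, switchBlock, window uint64, hasSnapshot bool) bool {
	return switchNear(current, switchBlock, window) && !hasSnapshot
}

func main() {
	const (
		switchBlock = 56_828_700
		gap         = 450
		epoch       = 900
	)
	current := uint64(56_827_375) // reported restart height

	// Window of one Gap: the switch is 1,325 blocks ahead, so no match.
	fmt.Println(needsRebuild(current, switchBlock, gap, false))

	// Window of two epochs: the switch is inside the window.
	fmt.Println(needsRebuild(current, switchBlock, 2*epoch, false))
}
```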

### 6. Open Questions

- Does the production Apothem snapshot artifact at 56,827,375 include *any* `XDPoS-V2-` keys? Needs offline pebble inspection with `ldb scan --key_hex` to confirm.
- How far back does `repairSnapshot` actually reach in practice on a GP5 cold-restored node? Worth instrumenting.
- Whether the stack-trace line numbers (1219/357) correspond to `GetMasternodesFromEpochSwitchHeader` in an older commit or to a since-moved `decodeMasternodesFromHeaderExtra` call — mapping is not exact against current HEAD and would benefit from the exact crashing binary's commit SHA.
- Whether the 843c73a77 contract-read path is safe for all historical chain states at the V2 switch (contract storage must be available before block execution finalizes).

**Bottom line**: abddd87fc stops the crash. The real fix is to make cold-snapshot artifacts include the pebble-stored consensus snapshots, with a boot-time rebuild path from chain + contract state added as defense-in-depth.
