# GP5 V2 Switch Boundary Sync Failure — Root Cause Analysis & Fix Recommendations

## Problem Summary

The GP5 node (v150, commit `81beca0`) stalls at Apothem block **56,829,310** (just past the V2 switch boundary at 56,828,700) when syncing from checkpoint `56,700,000` using `--syncfromblock`. The sync loop then repeats indefinitely, logging these errors:

1. `empty masternode list`
2. `failed to get epoch switch info for QC verification`
3. `parent header not in DB (bulk sync?)`
4. `missing trie node`
5. `retrieved hash chain is invalid: empty masternode list`

---

## Root Cause Analysis

### 1. Checkpoint Sync Without State = No Gap Snapshots

When `--syncfromblock 56700000` is used, `InsertHeadersBeforeCutoff` inserts headers into the ancient store **without state** (lines 3468-3475 of `core/blockchain.go`). It logs:

> *"checkpoint state not available, skipping snapshot pre-seed (expected for syncfromblock)"*

This sets `checkpointSyncNoState = true`. The `UpdateMasternodesFromHeader` call (which normally creates the gap-block snapshot by reading smart-contract state) is **skipped** because `StateAt(cpHeader.Root)` fails with a missing trie node.

**Result:** The snapshot DB has **no V2 gap-block snapshots** for any epoch after the checkpoint. When a post-checkpoint epoch switch block arrives (56,829,600 and every switch after it), `getSnapshot` finds nothing and must call `repairSnapshot`.

### 2. `repairSnapshot` Fails When State Is Pruned

`repairSnapshot` (engine.go:840-949) has three strategies:

| Strategy | Condition | Outcome at 56,829,310 |
|---|---|---|
| A. First V2 checkpoint header direct read | `checkpointNumber <= firstV2Checkpoint` | **Not triggered** — the stalled block is past the first checkpoint |
| B. Smart contract state at gap block | `StateAt(gapHeader.Root)` | **Fails with "missing trie node"** because state was never downloaded for the gap block in `--syncfromblock` mode |
| C. Checkpoint header `Validators` field | `chain.GetHeaderByNumber(checkpointNumber)` | **May succeed if header is in DB**, but only if `Validators` is populated and valid |

When **Strategy B** fails and **Strategy C** either fails or returns empty, `repairSnapshot` returns an error. That error propagates to `getEpochSwitchInfo` → `verifyQC` → `verifyHeader` → `errInvalidChain`, terminating the sync batch.
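
The fallback behaviour described above can be modelled with a minimal sketch. It is not the code in `engine.go`: the `strategy` type, the `repair` function, and the two stand-in outcomes are hypothetical names chosen for illustration, and only the error strings come from this report.

```go
package main

import (
	"errors"
	"fmt"
)

var errEmptyMasternodes = errors.New("empty masternode list")

// strategy is a hypothetical stand-in for one recovery path.
type strategy struct {
	name string
	run  func() ([]string, error)
}

// repair tries each strategy in order and returns the first non-empty
// masternode list. If every strategy fails or comes back empty, the last
// failure is returned so the caller can abort the sync batch.
func repair(strategies []strategy) ([]string, error) {
	lastErr := errEmptyMasternodes
	for _, s := range strategies {
		nodes, err := s.run()
		switch {
		case err != nil:
			lastErr = fmt.Errorf("strategy %s: %w", s.name, err)
		case len(nodes) == 0:
			lastErr = fmt.Errorf("strategy %s: %w", s.name, errEmptyMasternodes)
		default:
			return nodes, nil
		}
	}
	return nil, lastErr
}

// Illustrative outcomes matching the table above (assumed, not measured).
func stateAtGapBlock() ([]string, error) { return nil, errors.New("missing trie node") }

// headerValidators models a header that is found but has no validators.
func headerValidators() ([]string, error) { return nil, nil }

func main() {
	_, err := repair([]strategy{
		{"B", stateAtGapBlock},
		{"C", headerValidators},
	})
	fmt.Println(err) // strategy C: empty masternode list
}
```

With Strategy B erroring and Strategy C returning nothing, the caller only ever sees the final `empty masternode list` error, which matches the log pattern in the problem summary.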

### 3. `getEpochSwitchInfoWithParents` (v152/v153) Is Not Enough

The v152/v153 fixes (commit `1891d0735`) added fallback logic in `getEpochSwitchInfoWithParents` to find the epoch switch block when it is **not in the parents slice**:

- v152: round-0 detection on the target header itself  
- v153: `round2epochBlockInfo` cache lookup, then DB walk-back  

**Why it still fails:**

- The epoch switch block **is** found (e.g., via DB walk-back or round2epochBlockInfo cache).
- But `GetMasternodesFromEpochSwitchHeader(chain, esHeader)` returns **empty** masternodes because the epoch switch header’s `Validators` field is empty for V2-era blocks (validators are stored in the gap snapshot, not the header).
- The code then tries `repairSnapshot` → fails → returns `empty masternode list`.

### 4. The V1→V2 Switch Block Special Case Is Handled, But Subsequent Checkpoints Are Not

The code correctly handles the **first** V2 checkpoint (56,829,600) by reading from the switch block header directly (engine.go:845-865). However, the node stalls at **56,829,310**, which is past the V2 switch block (56,828,700) but still within the first V2 epoch, and later checkpoints have no such shortcut. The epoch switch after 56,829,600 is at 56,830,500, and its gap block is 56,830,050. That block is **after** the sync checkpoint (56,700,000), in the range the node imports without state, so the gap block is never processed with state and the snapshot is never created.
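
The block numbers in this section follow a simple pattern that can be checked with a short sketch. The 900-block epoch and 450-block gap are assumptions inferred from the figures quoted here (56,829,600 to 56,830,500, and 56,830,500 to 56,830,050); the real engine may locate epoch switches differently, for example by round number.

```go
package main

import "fmt"

const (
	epochLength uint64 = 900 // assumed: spacing between 56,829,600 and 56,830,500
	gapLength   uint64 = 450 // assumed: 56,830,500 minus 56,830,050
)

// gapBlockFor returns the gap block that precedes an epoch switch block.
func gapBlockFor(epochSwitch uint64) uint64 { return epochSwitch - gapLength }

// epochSwitchesBetween lists epoch switch blocks in (from, to], assuming
// they fall on multiples of epochLength.
func epochSwitchesBetween(from, to uint64) []uint64 {
	var out []uint64
	for n := (from/epochLength + 1) * epochLength; n <= to; n += epochLength {
		out = append(out, n)
	}
	return out
}

func main() {
	for _, sw := range epochSwitchesBetween(56828700, 56830500) {
		fmt.Printf("switch %d needs a snapshot at gap block %d\n", sw, gapBlockFor(sw))
	}
}
```

Under these assumptions the two switches after the V2 switch block need snapshots at 56,829,150 and 56,830,050 respectively.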

---

## Why v2.6.8 Works

v2.6.8 does **not** have `repairSnapshot`. It creates snapshots **during normal block import** via `UpdateM1()` (equivalent to GP5's `UpdateMasternodesFromHeader`). Every gap block is processed with full state, and the snapshot is stored immediately. There is no "catch-up" scenario where a gap snapshot is missing.

GP5 added `repairSnapshot` for recovery scenarios, but it is **fundamentally unreliable** when state is pruned or never downloaded (which is exactly what happens in `--syncfromblock` mode).

---

## Fix Recommendations

### Option 1: Sync From the V2 Switch Block or Later (Operational Fix)

**Recommended immediate workaround.**

Instead of `--syncfromblock 56700000`, use:

```bash
--syncfromblock 56828700   # V2 switch block
# or even later, e.g.:
--syncfromblock 56829600   # first V2 checkpoint
```

**Why this works:**
- Starting from the V2 switch block ensures the node processes **all V2 gap blocks** with state.
- `UpdateMasternodesFromHeader` will be called for every gap block, creating snapshots correctly.
- No need for `repairSnapshot` because snapshots are never missing.

**Trade-off:** Starting later means the node skips the ~128,700 blocks between 56,700,000 and 56,828,700, so that range is not synced locally. This is a small price compared to an infinite sync loop.
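
A launch wrapper could enforce this workaround with a small guard before the node starts. The sketch below is hypothetical: `checkSyncFrom` is not part of GP5, and the only value taken from this report is the V2 switch block number.

```go
package main

import "fmt"

// v2SwitchBlock is the Apothem V2 switch block quoted in this report.
const v2SwitchBlock uint64 = 56828700

// checkSyncFrom rejects a --syncfromblock value that predates the V2 switch
// block, which is the configuration that triggers the stall described here.
func checkSyncFrom(syncFrom uint64) error {
	if syncFrom < v2SwitchBlock {
		return fmt.Errorf("syncfromblock %d is %d blocks before the V2 switch block %d",
			syncFrom, v2SwitchBlock-syncFrom, v2SwitchBlock)
	}
	return nil
}

func main() {
	fmt.Println(checkSyncFrom(56700000))
	fmt.Println(checkSyncFrom(56828700))
}
```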

---

### Option 2: Download State for Checkpoint Gap Blocks (Binary Fix)

Modify the downloader so that when `--syncfromblock` is used, the node **downloads state** for the gap block of the first post-checkpoint epoch **before** validating epoch switch blocks.

**Implementation sketch:**

1. In `downloader.go`, after inserting the checkpoint anchor, detect the next epoch switch block number.
2. Trigger a **state sync** (snap sync) for the gap block preceding that epoch switch.
3. Only resume full header/body sync after the gap-block state is available.

**Trade-off:** Complex, touches the downloader pipeline, and may introduce new race conditions.
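
The three steps above can be sketched as a single gate that runs before header validation resumes. Everything here is an assumption for illustration: the `gapStateSource` interface, `ensureGapState`, and the in-memory fake do not exist in `downloader.go`, and the 900/450 epoch and gap lengths are inferred from the block numbers in this report.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	epochLength uint64 = 900 // assumed epoch length
	gapLength   uint64 = 450 // assumed gap before each epoch switch
)

// gapStateSource is a hypothetical view of the downloader's state sync.
type gapStateSource interface {
	HasState(block uint64) bool
	FetchState(block uint64) error
}

// ensureGapState finds the gap block of the first epoch switch after the
// checkpoint and fetches its state if it is not available locally.
func ensureGapState(checkpoint uint64, src gapStateSource) (uint64, error) {
	nextSwitch := (checkpoint/epochLength + 1) * epochLength
	gap := nextSwitch - gapLength
	if src.HasState(gap) {
		return gap, nil
	}
	if err := src.FetchState(gap); err != nil {
		return gap, fmt.Errorf("state sync for gap block %d: %w", gap, err)
	}
	if !src.HasState(gap) {
		return gap, errors.New("gap-block state still missing after fetch")
	}
	return gap, nil
}

// memSource is an in-memory fake used only to exercise the sketch.
type memSource struct {
	have    map[uint64]bool
	fetched []uint64
}

func (m *memSource) HasState(b uint64) bool { return m.have[b] }

func (m *memSource) FetchState(b uint64) error {
	m.have[b] = true
	m.fetched = append(m.fetched, b)
	return nil
}

func main() {
	src := &memSource{have: map[uint64]bool{}}
	gap, err := ensureGapState(56700000, src)
	fmt.Println(gap, err, src.fetched)
}
```

Keeping the gate behind an interface would let the ordering be tested without a live peer, which matters given the race-condition risk noted in the trade-off.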

---

### Option 3: Harden `repairSnapshot` to Use Remote State or Header-Only Fallback (Binary Fix)

If `repairSnapshot` cannot read local state, add a **remote fallback**:

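1. **Remote state via RPC:** Query a trusted peer or RPC endpoint for contract state at the gap block, for example with `eth_call` or `eth_getStorageAt` pinned to that block number. This is fragile and requires infrastructure, including a remote node that still retains state at that height.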
2. **Header-only epoch validation:** For the first few epochs after checkpoint sync, skip `CompareSignersLists` validation if the snapshot is missing. This is **dangerous** (security risk) and should only be used with a trusted checkpoint.

**Trade-off:** Security vs. convenience. Not recommended for production without strict trusted-checkpoint enforcement.
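
For the remote-state variant, a contract read pinned to the gap block could use the standard `eth_call` JSON-RPC method. The sketch below only builds the request body. The contract address and calldata are placeholders, not real values, and `buildGapStateCall` is a hypothetical helper rather than anything in GP5.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a standard JSON-RPC 2.0 envelope.
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

// blockTag encodes a block number as the hex quantity used by eth_* methods.
func blockTag(n uint64) string { return fmt.Sprintf("0x%x", n) }

// buildGapStateCall builds an eth_call request evaluated at the gap block.
func buildGapStateCall(contract, calldata string, gapBlock uint64) rpcRequest {
	return rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "eth_call",
		Params: []interface{}{
			map[string]string{"to": contract, "data": calldata},
			blockTag(gapBlock),
		},
	}
}

func main() {
	// Placeholder address and calldata; substitute real values before use.
	req := buildGapStateCall("0xCONTRACT_ADDRESS", "0xCALLDATA", 56830050)
	body, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The remote endpoint must still hold state at the requested height, so in practice this needs an archive-style node behind it.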

---

### Option 4: Pre-Seed Snapshots from a Trusted Snapshot File (Operational Fix)

Provide a **snapshot export** from a healthy v2.6.8 or GP5 node that contains the gap-block snapshots for epochs after the checkpoint. The syncing node can import these snapshots before starting sync.

**Implementation:**
- Export `XDPoS-V2-*` keys from LevelDB on a healthy node.
- Import them into the new node's datadir before starting.

**Trade-off:** Requires manual intervention and a healthy reference node.
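
The export and import steps amount to a prefix-filtered copy between two key-value stores. The sketch below uses plain maps in place of LevelDB so it stays self-contained. The key suffixes and helper names are invented for illustration; only the `XDPoS-V2-` prefix comes from this report.

```go
package main

import (
	"fmt"
	"strings"
)

// snapshotPrefix is the key prefix named in this report.
const snapshotPrefix = "XDPoS-V2-"

// exportByPrefix copies every key/value pair whose key starts with prefix.
func exportByPrefix(src map[string][]byte, prefix string) map[string][]byte {
	out := make(map[string][]byte)
	for k, v := range src {
		if strings.HasPrefix(k, prefix) {
			out[k] = append([]byte(nil), v...) // copy so the dump does not alias src
		}
	}
	return out
}

// importMissing writes dump entries into dst without overwriting existing
// keys and reports how many entries were added.
func importMissing(dst, dump map[string][]byte) int {
	added := 0
	for k, v := range dump {
		if _, exists := dst[k]; !exists {
			dst[k] = v
			added++
		}
	}
	return added
}

func main() {
	healthy := map[string][]byte{
		"XDPoS-V2-gap-a": []byte("snapshot-a"), // hypothetical key names
		"XDPoS-V2-gap-b": []byte("snapshot-b"),
		"unrelated":      []byte("x"),
	}
	fresh := map[string][]byte{}
	fmt.Println(importMissing(fresh, exportByPrefix(healthy, snapshotPrefix))) // 2
}
```

Skipping keys that already exist is a deliberate choice in this sketch: it avoids clobbering snapshots the syncing node created itself.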

---

## Recommended Immediate Action

1. **Stop the stuck node.**
2. **Clear the datadir** (or at least the ancient store and snapshot DB) to remove any poisoned caches.
3. **Restart with a later sync-from block:**
   ```bash
   --syncfromblock 56828700
   ```
   or
   ```bash
   --syncfromblock 56829600
   ```
4. **Verify** that `UpdateMasternodesFromHeader` logs appear for gap blocks (e.g., 56,829,150, 56,830,050, etc.) and that no `repairSnapshot` warnings appear.
5. **Monitor** until the node passes 56,830,500 (the next epoch switch after the first checkpoint).
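
While monitoring, it helps to count how often the known failure messages recur. The sketch below matches the five error strings from the problem summary as plain substrings. It assumes nothing about the surrounding log line format, and `countStallSignatures` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// stallSignatures are the error substrings listed in the problem summary.
var stallSignatures = []string{
	"empty masternode list",
	"failed to get epoch switch info for QC verification",
	"parent header not in DB (bulk sync?)",
	"missing trie node",
	"retrieved hash chain is invalid",
}

// countStallSignatures counts, per signature, how many log lines contain it.
// A line that contains several signatures is counted once for each.
func countStallSignatures(logText string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(logText, "\n") {
		for _, sig := range stallSignatures {
			if strings.Contains(line, sig) {
				counts[sig]++
			}
		}
	}
	return counts
}

func main() {
	sample := "empty masternode list\n" +
		"retrieved hash chain is invalid: empty masternode list\n"
	counts := countStallSignatures(sample)
	fmt.Println(counts["empty masternode list"], counts["retrieved hash chain is invalid"]) // 2 1
}
```

A count that keeps growing after the restart suggests the node is looping again rather than progressing past 56,830,500.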

---

## Code Changes Needed (If We Want to Support Checkpoint Sync from Pre-V2)

If the requirement is to support `--syncfromblock 56700000` (or any pre-V2 checkpoint), the binary needs one of the following:

1. **State download for post-checkpoint gap blocks** (Option 2 above).
2. **A new `repairSnapshot` that can reconstruct candidates without state** by walking forward from the V2 switch block header and applying penalty logic incrementally. This is complex and risky.
3. **Skip validator-set comparison during the first post-checkpoint epoch** when in checkpoint-sync-no-state mode, with a strict trusted-checkpoint hash check. This is the simplest binary fix but reduces security for that epoch.
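
The third option depends on a strict guard, which could look like the sketch below. The `checkpointSyncCtx` type and `mayRelaxSignerCheck` function are hypothetical and do not exist in GP5; the point is that the relaxation applies only when every condition holds.

```go
package main

import "fmt"

type hash [32]byte

// checkpointSyncCtx is a hypothetical summary of the sync situation.
type checkpointSyncCtx struct {
	NoState              bool   // checkpoint-sync-no-state mode is active
	TrustedHash          hash   // operator-pinned checkpoint hash
	CheckpointHash       hash   // hash of the checkpoint header actually synced
	EpochsSinceCheckpoint uint64 // 0 while still in the first post-checkpoint epoch
}

// mayRelaxSignerCheck reports whether validator-set comparison may be
// skipped. It requires no-state mode, a non-zero pinned hash that matches the
// synced checkpoint, and that the first post-checkpoint epoch is not over.
func mayRelaxSignerCheck(c checkpointSyncCtx) bool {
	var zero hash
	return c.NoState &&
		c.TrustedHash != zero &&
		c.TrustedHash == c.CheckpointHash &&
		c.EpochsSinceCheckpoint == 0
}

func main() {
	pinned := hash{0x01}
	ctx := checkpointSyncCtx{NoState: true, TrustedHash: pinned, CheckpointHash: pinned}
	fmt.Println(mayRelaxSignerCheck(ctx)) // true
	ctx.EpochsSinceCheckpoint = 1
	fmt.Println(mayRelaxSignerCheck(ctx)) // false
}
```

Rejecting the zero hash keeps an unset trusted-checkpoint value from silently matching an unset header hash.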

---

## Summary

| Question | Answer |
|---|---|
| Why does checkpoint sync fail at V2 boundary? | Because `--syncfromblock` skips state download, so gap-block snapshots are never created. `repairSnapshot` fails when state is missing. |
| What were the v152/v153 fixes supposed to do? | They added fallbacks to find the epoch switch block when it is not in the parents slice, but they do **not** fix the missing snapshot / missing state problem. |
| Is the fix in the binary or sync approach? | **Primarily the sync approach.** Using `--syncfromblock 56828700` (or later) avoids the issue entirely. A binary fix would require state download or relaxed validation. |
| Do we need to sync from an earlier block? | **No — the opposite.** Sync from a **later** block (at or after the V2 switch) so all V2 gap blocks are processed with state. |

---

*Analysis based on GP5 codebase at `/Users/anilchinchawale/github/XDCNetwork/XDC-Geth`, commits `81beca0f6` (v150) and `1891d0735` (v152+v153).*