# ADR-363: PBSS Snapshot Restore Fix

**Status:** Accepted  
**Date:** 2026-04-21  
**Author:** XDC Core Team  
**Deciders:** XDCIndia/go-ethereum maintainers  
**Related PR:** [#363](https://github.com/XDCIndia/go-ethereum/pull/363)

---

## Context

XDC uses a fork of go-ethereum 1.17 (GP5) with XDPoS consensus and custom state management optimizations. The fork introduced Path-Based State Scheme (PBSS) support early, but with a critical bug: PBSS nodes could not reliably restore from cold snapshots.

This ADR documents the decision to remove the `XdcBulkSyncMode.Load()` gate from P1 checkpoint trie commits.

---

## Problem Statement

### Symptoms

- PBSS nodes fail to start from cold snapshots with "snapshot walk-back exceeded safe limit"
- `diskRoot` (the on-disk trie root marker) never advances from genesis during normal operation
- Crash recovery is unreliable: nodes that lose power often require a full resync

### Technical Root Cause

The P1 checkpoint trie commit in `core/blockchain.go` was gated on `XdcBulkSyncMode.Load()`:

```go
if checkpoint > 0 && (checkpoint%c.config.XDPoS.XDPoSConfig.Epoch) == 0 && XdcBulkSyncMode.Load() {
    c.blockchain.TrieDB().Commit(block.Root(), false)
}
```

`XdcBulkSyncMode` is an atomic flag that is:
- `true` during initial bulk sync (downloading historical blocks)
- `false` during normal operation (importing new blocks as they arrive)

Because the flag is `false` during normal operation, the gated condition never held and `triedb.Commit()` was never called, so `diskRoot` remained at genesis. When the node restarted, pathdb's `loadLayers()` could not reconstruct state from the stale `diskRoot`.

---

## Decision

Remove the `XdcBulkSyncMode.Load()` gate, allowing `triedb.Commit()` to fire every 900 blocks (XDPoS epoch) regardless of sync mode.

### Change

```diff
- if checkpoint > 0 && (checkpoint%c.config.XDPoS.XDPoSConfig.Epoch) == 0 && XdcBulkSyncMode.Load() {
+ if checkpoint > 0 && (checkpoint%c.config.XDPoS.XDPoSConfig.Epoch) == 0 {
      tstart := time.Now()
      c.blockchain.TrieDB().Commit(block.Root(), false)
      // ...
  }
```

---

## Consequences

### Positive

1. **Reliable snapshot restore** — PBSS nodes can now use cold backups
2. **Crash recovery works** — power loss no longer requires full resync
3. **Fleet operations simplified** — cross-server migrations are safe
4. **diskRoot advances predictably** — every 900 blocks

### Neutral

1. **Slightly more disk I/O** — trie commit every 900 blocks (acceptable)
2. **XDC-specific divergence** — upstream geth does not have this logic

### Negative

None identified.

---

## Alternatives Considered

### Option 1: Add explicit disk flush in `statedb.Commit()`

**Rejected:** Would require modifying state/trie code paths, which carries a higher risk of regression.

### Option 2: Reduce pathdb buffer threshold

**Rejected:** Would cause excessive I/O and doesn't guarantee `diskRoot` alignment with P1 checkpoints.

### Option 3: Keep the gate, add new persistence mechanism

**Rejected:** Overly complex. The existing P1 checkpoint logic is the correct place for persistence.

---

## Validation

### Test Matrix

| Test | Scheme | Source Block | Result |
|------|--------|--------------|--------|
| Same-server restore | HBSS | 255,423 | ✅ Pass |
| Same-server restore | PBSS | 255,423 | ✅ Pass |
| Cross-server restore | PBSS | 1,845,090 | ✅ Pass |
| Clean shutdown restore | PBSS | 4,184,400 | ✅ Pass |
| Crash-resume | PBSS | 32,214,369 | ✅ Pass (previously failed) |

### Verification Steps

```bash
# 1. Build v40 image with PR #363
docker buildx build --platform linux/amd64 -t anilchinchawale/gp5-xdc:v40 -f Dockerfile.gp5-deploy . --push

# 2. Start node, sync to > 900 blocks
docker run ... anilchinchawale/gp5-xdc:v40

# 3. Verify diskRoot advancement in logs
#    (2>&1 so that log lines written to stderr also reach grep)
docker logs <container> 2>&1 | grep -E "diskRoot|trie commit"

# 4. Stop node gracefully
docker stop <container>

# 5. Create snapshot
tar czf snapshot.tar.gz xdcchain/

# 6. Restore into a clean datadir (removes the existing one first)
rm -rf xdcchain/ && tar xzf snapshot.tar.gz

# 7. Start new container, verify resume block
docker run ... anilchinchawale/gp5-xdc:v40
docker logs <container> 2>&1 | grep "Loaded most recent local block"
```

---

## References

- [Issue #362](https://github.com/XDCIndia/go-ethereum/issues/362): PBSS diskRoot stuck at genesis
- [PR #363](https://github.com/XDCIndia/go-ethereum/pull/363): Remove XdcBulkSyncMode gate
- [go-ethereum pathdb docs](https://github.com/ethereum/go-ethereum/tree/master/triedb/pathdb)
