# XDC GP5 Consensus Cache Poisoning & Security Audit Validation

**Repository:** XDCNetwork/XDC-Geth  
**Branch:** feat/trusted-checkpoint-sync  
**Commit:** 6c0150479 (v145)  
**Auditor:** Senior Blockchain Consensus Engineer  
**Date:** 2025-05-04

---

## Executive Summary

This document validates every claim from the security audit comment on GitHub issue #511 against the v145 codebase (commit 6c0150479). The audit raised **8 claims** (3 P0, 5 P1). 

**Verdicts at a glance:**

| # | Claim | Verdict | Fixed in v145? |
|---|-------|---------|----------------|
| 1 | P0 – Cache Poisoning at Creation Time | **PARTIALLY TRUE** | Partially |
| 2 | P0 – Singleflight Amplification | **TRUE** | No |
| 3 | P0 – getEpochSwitchInfoWithParents Caches Empty Data | **TRUE** | No |
| 4 | P1 – round == 0 Fallback Is Semantically Wrong | **TRUE** | Partially |
| 5 | P1 – Parents-slice cache poisoning from peers | **TRUE** | No |
| 6 | P1 – Missing Penalties / EpochSwitchParentBlockInfo | **TRUE** | No |
| 7 | P1 – getSnapshot V1-count heuristic fragility | **TRUE** | No |
| 8 | P1 – prevRound tracking bug | **TRUE** | No |

**v145 (commit 6c0150479) did NOT fully fix the audit findings.** It added cache-eviction guards for empty masternodes (Claims 1 & 3), but the root causes remain exploitable, and Claims 2, 4-8 are largely unaddressed.

---

## Detailed Claim-by-Claim Validation

### Claim 1 – P0: Cache Poisoning at Creation Time

> `getEpochSwitchInfoInner` caches empty masternodes when `repairSnapshot` fails. The error is logged but not returned, and the code falls through to create `EpochSwitchInfo{Masternodes: []}` and caches it permanently.

**Evidence (engine.go lines 1413-1455):**

```go
masternodes := x.GetMasternodesFromEpochSwitchHeader(chain, h)
// V2 epoch switch blocks (except the V1->V2 switch block) may have empty
// header.Validators. Rebuild the snapshot on-the-fly if needed.
if len(masternodes) == 0 && h.Number.Uint64() > x.config.V2.SwitchBlock.Uint64() {
    repaired, err := x.repairSnapshot(chain, h.Number.Uint64())
    if err == nil {
        masternodes = repaired
    } else {
        log.Error("[getEpochSwitchInfo] snapshot repair failed", "checkpoint", h.Number.Uint64(), "err", err)
    }
}

// ... falls through regardless of err ...

info := &types.EpochSwitchInfo{
    Masternodes:    masternodes,   // may still be [] if repairSnapshot failed
    ...
}
x.epochSwitches.Add(hash, info)   // CACHED even when masternodes == []
```

**Verdict: PARTIALLY TRUE**

- **Root cause is real:** If `repairSnapshot` fails, `masternodes` remains empty and the constructed `EpochSwitchInfo` is cached (line 1454).
- **v145 mitigation:** The fast-path in `getEpochSwitchInfo` (lines 1340-1350) now evicts cache entries with `len(info.Masternodes) == 0` and recomputes. The inner function also has the same eviction check (lines 1366-1373).
- **Why partially fixed:** The eviction only helps on **subsequent** lookups. The **first** call still caches the empty entry, and any concurrent waiters (via singleflight) receive the poisoned result immediately. Also, if the empty entry is consumed by the same goroutine that created it (before the eviction check on the next call), the damage is done.

---

### Claim 2 – P0: Singleflight Amplification

> The `singleflight.Group` means one goroutine's broken result (empty masternodes) is shared with ALL concurrent waiters.

**Evidence (engine.go lines 1352-1361):**

```go
result, err, _ := x.epochSwitchSF.Do(hash.Hex(), func() (interface{}, error) {
    return x.getEpochSwitchInfoInner(chain, header, hash)
})
if err != nil {
    return nil, err
}
return result.(*types.EpochSwitchInfo), nil
```

**Verdict: TRUE**

- `singleflight.Group` deduplicates concurrent calls. If the winning goroutine produces a broken result (e.g., empty masternodes due to `repairSnapshot` failure), **all waiters receive the same broken `*EpochSwitchInfo`**.
- v145 does **not** change this behavior. The singleflight result is returned as-is; there is no post-hoc validation of `Masternodes` length after `singleflight.Do` returns.
- The v145 cache-eviction guards (Claims 1/3) run **before** entering singleflight, but once inside, the broken result is broadcast to all waiters.

---

### Claim 3 – P0: getEpochSwitchInfoWithParents Also Caches Empty Data

> The function constructs `EpochSwitchInfo` and calls `x.epochSwitches.Add(hash, info)` even when `masternodes` is empty.

**Evidence (engine.go lines 1647-1658):**

```go
_, round, _, _ := x.getExtraFields(chain, epochSwitchHeader)
info = &types.EpochSwitchInfo{
    Masternodes:    masternodes,   // may be [] after failed repairSnapshot
    MasternodesLen: len(masternodes),
    EpochSwitchBlockInfo: &types.BlockInfo{...},
}
x.epochSwitches.Add(hash, info)   // ALWAYS cached, even if masternodes == []
return info, nil
```

**Verdict: TRUE**

- The function unconditionally caches the result at line 1657. There is **no** check for `len(masternodes) == 0` before caching.
- The v145 fast-path eviction in `getEpochSwitchInfo` (lines 1340-1350) will catch this on the **next** lookup, but the first call still poisons the cache.
- Additionally, `getEpochSwitchInfoWithParents` is called with a **different hash** than the epoch switch header's hash (it uses the queried block's hash), so the poisoned entry may be keyed under a different hash and never re-evaluated.

---

### Claim 4 – P1: round == 0 Fallback Is Semantically Wrong

> Only the V1→V2 switch block has round 0. Normal V2 epoch switch blocks do NOT have round 0.

**Evidence (engine.go lines 1043-1052 and 1542-1548):**

```go
// In GetMasternodesWithParents deep fallback (v142):
_, round, err := x.getExtraFieldsNoChain(p)
if err == nil && round == 0 {
    // Round 0 indicates a new epoch in V2   // <-- COMMENT IS WRONG
    masternodes := x.GetMasternodesFromEpochSwitchHeader(chain, p)
    ...
}

// In getEpochSwitchInfoWithParents (v145):
if round <= 1 {
    // v145: Round 0 or 1 indicates epoch start
    // Round 0 = switch block itself, Round 1 = first V2 epoch switch
    ...
}
```

**Verdict: TRUE**

- The v142 comment "Round 0 indicates a new epoch in V2" is **semantically incorrect**. Round 0 is exclusively the V1→V2 switch block. Normal V2 epoch switches start at round 1 (or higher after timeouts).
- v145 **partially** fixed this by changing `round == 0` to `round <= 1` in `getEpochSwitchInfoWithParents` (line 1544), but the deep fallback in `GetMasternodesWithParents` (line 1044) still uses `round == 0`.
- The `round <= 1` heuristic is still fragile: it will mis-identify the switch block (round 0) as an epoch switch, which is technically correct because the switch block IS an epoch switch, but it may also mis-identify non-epoch blocks if they happen to have round 1 (which is the first block of a new epoch, but not all round-1 blocks are epoch switches — the epoch switch block itself has round 1 only if its parent was round 0).

---

### Claim 5 – P1: Parents-slice cache poisoning from peers

> `getEpochSwitchInfoWithParents` caches constructed `EpochSwitchInfo` for arbitrary input hashes based on unvalidated `parents` slice from peers.

**Evidence (engine.go lines 1496-1658):**

```go
func (x *XDPoS_v2) getEpochSwitchInfoWithParents(..., parents []*types.Header, ...) {
    // parents come from peer during bulk sync — NOT validated against DB
    ...
    x.epochSwitches.Add(hash, info)  // caches under the queried block's hash
}
```

**Verdict: TRUE**

- The `parents` slice is provided by the downloader/p2p layer and is **not** cryptographically validated before being used to derive the epoch switch header.
- A malicious peer could craft a `parents` slice that points to an incorrect epoch switch header, causing `getEpochSwitchInfoWithParents` to construct and cache a wrong `EpochSwitchInfo` under the queried block's hash.
- The v145 cache-eviction guards only check for empty masternodes, not for **wrong** masternodes. A poisoned entry with non-empty but incorrect masternodes would be accepted.

---

### Claim 6 – P1: Missing Penalties / EpochSwitchParentBlockInfo

> The constructed info in parents path is missing `Penalties` and `EpochSwitchParentBlockInfo`.

**Evidence (engine.go lines 1647-1656):**

```go
info = &types.EpochSwitchInfo{
    Masternodes:    masternodes,
    MasternodesLen: len(masternodes),
    EpochSwitchBlockInfo: &types.BlockInfo{
        Hash:   epochSwitchHeader.Hash(),
        Round:  round,
        Number: epochSwitchHeader.Number,
    },
}
```

**Verdict: TRUE**

- Compare with `getEpochSwitchInfoInner` (lines 1440-1452) which sets:
  - `Penalties: extractAddressesFromBytes(h.Penalties)`
  - `EpochSwitchParentBlockInfo = quorumCert.ProposedBlockInfo`
- `getEpochSwitchInfoWithParents` sets **neither**. The constructed `EpochSwitchInfo` is incomplete.
- This can cause `verifyQC` (which uses `epochInfo.Masternodes` and `epochInfo.MasternodesLen`) to behave differently when the info came from the parents path vs. the DB path. `Penalties` are used by downstream penalty logic; missing them may cause incorrect penalty calculations.

---

### Claim 7 – P1: getSnapshot V1-count heuristic fragility

> Hardcoded check for 13 candidates is fragile.

**Evidence (engine.go lines 750-752 and 770-776):**

```go
} else if checkpointNumber > x.config.V2.SwitchBlock.Uint64() && len(snap.NextEpochCandidates) == 13 {
    log.Warn("[getSnapshot] evicting cached snapshot with V1 candidate count (13) at V2 checkpoint", ...)
    x.snapshots.Remove(gapHeader.Hash())
}
```

**Verdict: TRUE**

- The number `13` is hardcoded as the exact V1 masternode count. This is fragile because:
  1. Different networks (mainnet vs. Apothem) may have different V1 counts.
  2. If the V1 count ever changes (e.g., via a hard fork), this heuristic breaks silently.
  3. A snapshot with 12 or 14 candidates that is still corrupt would pass the check.
- The comment at line 773 acknowledges this: "Apothem may legitimately have <20 masternodes at early V2 checkpoints, so only reject the exact V1 count of 13." This is a heuristic band-aid, not a robust fix.

---

### Claim 8 – P1: prevRound tracking bug

> If `getExtraFieldsNoChain` fails, `prevRound` is NOT updated, causing stale comparisons.

**Evidence (engine.go lines 1526-1576):**

```go
for i := len(parents) - 1; i >= 0; i-- {
    p := parents[i]
    pNum := p.Number.Uint64()
    if pNum > x.config.V2.SwitchBlock.Uint64() {
        _, round, err := x.getExtraFieldsNoChain(p)
        if err == nil {
            if round <= 1 { ... }
            if i == len(parents)-1 {
                prevRound = round
            } else if round < prevRound {
                epochSwitchHeader = p
                break
            }
            prevRound = round   // only updated on err == nil
        }
        // BUG: if err != nil, prevRound is NOT updated.
        // Next iteration may compare against a stale prevRound from a much
        // earlier (or later) parent, causing false-positive or false-negative
        // epoch switch detection.
    }
}
```

**Verdict: TRUE**

- When `getExtraFieldsNoChain` returns an error (e.g., V1-format Extra field in a V2-era header, or corrupted header), the `prevRound` variable is **not** updated for that iteration.
- On the next iteration, `round < prevRound` compares the new round against a stale `prevRound` from a potentially distant parent. This can:
  - **False positive:** A non-epoch block appears to have a round reset because `prevRound` is from an unrelated header.
  - **False negative:** A real epoch switch is missed because `prevRound` was already lower than the current round due to a previous error.
- v145 did not fix this. The loop structure is identical to v144.

---

## Additional Issues Not Mentioned in the Audit

### A1. `getEpochSwitchInfoWithParents` ignores `getExtraFields` error when constructing info

At line 1647:
```go
_, round, _, _ := x.getExtraFields(chain, epochSwitchHeader)
```
The error is silently discarded. If `getExtraFields` fails, `round` is 0 (default), and the constructed `EpochSwitchInfo` has an incorrect `Round` field. This is cached under the queried block's hash.

### A2. `getEpochSwitchInfoWithParents` does not validate `hash` argument against `header`

Unlike `getEpochSwitchInfoInner` (line 1381-1383), `getEpochSwitchInfoWithParents` does not check `h.Hash() != hash`. A caller could pass a mismatched hash/header pair, causing cache poisoning under an arbitrary key.

### A3. `getEpochSwitchInfoInner` recursive path caches parent info under child hash without validation

At line 1472:
```go
x.epochSwitches.Add(hash, parentInfo)
```
The parent `EpochSwitchInfo` is cached under the **child's** hash. If the child is not actually in the same epoch as the parent (e.g., due to a fork or corrupted chain), the cache maps the child's hash to the wrong epoch info. This is a known pattern in the original v2.6.8 code, but the GP5 implementation does not validate the parent-child epoch relationship before caching.

### A4. `repairSnapshot` can return empty masternodes without error

In `repairSnapshot` (lines 840-949), if the smart contract returns zero candidates and the checkpoint header has no validators, the function returns `nil, fmt.Errorf(...)`. However, if the checkpoint header exists but has `len(Validators) == 0`, the function falls through to the error path. The caller (`getEpochSwitchInfoInner`) logs the error but does not prevent caching. This is related to Claim 1 but worth noting as a separate root cause.

### A5. `IsEpochSwitch` has a debug logging block with hardcoded block numbers

Lines 1108-1118 contain hardcoded debug logging for blocks near `56828000-56831000` (Apothem-specific). This is harmless but indicates Apothem-specific debugging code left in production.

---

## Consolidated Fix Plan

### Immediate Fixes (P0)

#### Fix 1.1 – Prevent caching empty masternodes in `getEpochSwitchInfoInner`

**Location:** `engine.go` lines 1440-1455

**Change:** After constructing `info`, validate `len(masternodes) > 0` before caching. If empty, return an error instead of caching.

```go
info := &types.EpochSwitchInfo{...}
if len(masternodes) == 0 {
    return nil, fmt.Errorf("epoch switch block %d has empty masternodes", h.Number.Uint64())
}
x.epochSwitches.Add(hash, info)
return info, nil
```

#### Fix 1.2 – Prevent caching empty masternodes in `getEpochSwitchInfoWithParents`

**Location:** `engine.go` lines 1647-1658

**Change:** Same as Fix 1.1. Also add `Penalties` and `EpochSwitchParentBlockInfo`.

```go
if len(masternodes) == 0 {
    return nil, fmt.Errorf("parents path: epoch switch block %d has empty masternodes", epochSwitchHeader.Number.Uint64())
}
quorumCert, round, masternodesFromHeader, err := x.getExtraFields(chain, epochSwitchHeader)
// ... construct info with Penalties and EpochSwitchParentBlockInfo ...
x.epochSwitches.Add(hash, info)
```

#### Fix 1.3 – Validate singleflight result before returning

**Location:** `engine.go` lines 1354-1361

**Change:** After `singleflight.Do` returns, validate the result's masternodes length. If empty, do NOT return it to waiters; instead, return an error.

```go
result, err, _ := x.epochSwitchSF.Do(hash.Hex(), func() (interface{}, error) {
    return x.getEpochSwitchInfoInner(chain, header, hash)
})
if err != nil {
    return nil, err
}
info := result.(*types.EpochSwitchInfo)
if len(info.Masternodes) == 0 {
    return nil, fmt.Errorf("singleflight produced empty masternodes for %s", hash.Hex())
}
return info, nil
```

### Short-Term Fixes (P1)

#### Fix 2.1 – Remove or generalize the hardcoded `13` candidate check

**Location:** `engine.go` lines 750, 770

**Change:** Replace the hardcoded `13` with a config-driven value or a more robust heuristic (e.g., check `Version` field plus a minimum candidate threshold). At minimum, document the assumption and add a TODO to generalize.

#### Fix 2.2 – Fix `prevRound` tracking in `getEpochSwitchInfoWithParents`

**Location:** `engine.go` lines 1526-1576

**Change:** Reset `prevRound` to a sentinel value (e.g., `types.Round(^uint64(0))`) when `getExtraFieldsNoChain` fails, so the next successful iteration starts fresh. Alternatively, skip the comparison entirely on error.

```go
_, round, err := x.getExtraFieldsNoChain(p)
if err != nil {
    prevRound = types.Round(^uint64(0)) // sentinel: "unknown"
    continue
}
// ... rest of logic ...
```

#### Fix 2.3 – Fix `round == 0` deep fallback in `GetMasternodesWithParents`

**Location:** `engine.go` lines 1043-1052

**Change:** Change `round == 0` to `round <= 1` to match v145's `getEpochSwitchInfoWithParents` logic, or better yet, use `IsEpochSwitch` exclusively and remove the round-based fallback.

#### Fix 2.4 – Add `Penalties` and `EpochSwitchParentBlockInfo` to parents path

**Location:** `engine.go` lines 1647-1656

**Change:** Extract `quorumCert` from `getExtraFields` and populate the missing fields:

```go
quorumCert, round, _, err := x.getExtraFields(chain, epochSwitchHeader)
info = &types.EpochSwitchInfo{
    Masternodes:    masternodes,
    MasternodesLen: len(masternodes),
    Penalties:      extractAddressesFromBytes(epochSwitchHeader.Penalties),
    EpochSwitchBlockInfo: &types.BlockInfo{...},
}
if quorumCert != nil {
    info.EpochSwitchParentBlockInfo = quorumCert.ProposedBlockInfo
}
```

#### Fix 2.5 – Validate `parents` slice integrity before trusting it

**Location:** `engine.go` lines 1496+

**Change:** At the start of `getEpochSwitchInfoWithParents`, verify that the `parents` slice forms a valid chain (each parent's hash matches the next header's parent hash). If not, reject the parents slice and return error.

```go
for i := 1; i < len(parents); i++ {
    if parents[i].Hash() != parents[i-1].ParentHash {
        return nil, fmt.Errorf("parents slice broken at index %d", i)
    }
}
```

#### Fix 2.6 – Do not cache parents-path results under arbitrary hashes

**Location:** `engine.go` lines 1657

**Change:** Consider NOT caching `getEpochSwitchInfoWithParents` results at all, or cache them under the **epoch switch header's hash** rather than the queried block's hash. This prevents a poisoned parents slice from permanently corrupting the cache for unrelated blocks.

---

## Conclusion

The v145 release (commit 6c0150479) made a good-faith effort to mitigate empty-masternode cache poisoning by adding eviction guards, but it **did not address the root causes** identified in the audit:

1. **Singleflight amplification** remains unpatched — one broken result still poisons all concurrent waiters.
2. **Empty masternodes are still cached** on the first call; eviction only helps on subsequent lookups.
3. **`getEpochSwitchInfoWithParents` still caches incomplete/wrong data** without validation.
4. **Semantic errors** (`round == 0`, missing `Penalties`, fragile `13` heuristic, `prevRound` tracking) are largely untouched.
5. **Peer-supplied `parents` slices are trusted without cryptographic validation**, enabling cache poisoning attacks.

**Recommendation:** Apply the consolidated fix plan above before deploying to mainnet. The P0 fixes (prevent caching of empty/invalid data, validate singleflight results) are critical. The P1 fixes (generalize heuristics, fix tracking bugs, validate parents) should follow in the next release.
