Erigon Archive
Status: Design phase. Depends on: SharedDomains flush path, existing gRPC direct adapter pattern (node/direct/).
Erigon Archive decouples block execution from persistent storage. Today, execution and storage are tightly coupled — SharedDomains.Flush() writes directly to a local MDBX transaction and there is no clean way to attach external indexers, snapshot builders, or remote storage backends without modifying the execution path.
The design introduces a StorageSink interface at the SharedDomains.Flush() boundary, where accumulated execution state transitions from in-memory overlay to persistent storage. The executor produces BlockBatch changesets; the sink consumes them. This single interface change enables three deployment modes and an open-ended BatchProcessor framework for post-processing.
Key Capabilities
Three deployment modes: Embedded (current behavior, wrapped in the interface), Attached (executor streams changesets to a separate archive service via gRPC), and Detached (archive reads pre-built snapshots from the BitTorrent network via Sparse Snapshots, no live chain connection required).
BatchProcessor framework: After each BlockBatch is durably stored, pluggable post-processors (receipt indexers, snapshot builders, metrics collectors) run in parallel. Adding a new indexer means registering a new BatchProcessor; the executor is never touched.
Composable sinks: A TeedSink composes multiple sinks, for example writing to local MDBX and a remote archive simultaneously, with the local write as primary and the remote as best-effort.
Design: StorageSink Interface
The split point is SharedDomains.Flush(). The design adds a StorageSink interface and the BlockBatch it consumes; the existing method then becomes a thin wrapper around them:
// db/kv/storage_sink.go
type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

type BlockBatch struct {
	FromBlock    uint64
	ToBlock      uint64
	BlockOverlay *MemoryDiff                  // headers, bodies, canonical hashes, TD
	DomainDiffs  [DomainLen][]DomainEntryDiff // accounts, storage, code, commitment
	IndexUpdates map[InvertedIdx][]IndexEntry
	StateRoot    common.Hash
}
The MemoryDiff struct already exists (kv/membatchwithdb/memory_mutation_diff.go) as a serializable changeset. ExportDiffs() and ExportIndexUpdates() are added to TemporalMemBatch to export accumulated domain state.
The existing Flush(ctx, kv.RwTx) becomes:
func (sd *SharedDomains) Flush(ctx context.Context, tx kv.RwTx) error {
	return sd.FlushToSink(ctx, &LocalSink{tx: tx})
}
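The delegation above can be sketched end to end with simplified stand-ins. This is an illustration under stated assumptions, not the planned implementation: `RwTx`, `mapTx`, the flat `Changes` map, and the reduced `SharedDomains` are all hypothetical placeholders for `kv.RwTx`, MDBX, and the real overlay.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// BlockBatch is a reduced stand-in for the design's BlockBatch:
// only the block range and a flat key/value changeset are kept.
type BlockBatch struct {
	FromBlock, ToBlock uint64
	Changes            map[string][]byte
}

// StorageSink mirrors the interface proposed in this document.
type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// RwTx is a hypothetical, minimal stand-in for kv.RwTx.
type RwTx interface {
	Put(key string, value []byte) error
}

// mapTx is an in-memory RwTx used only for this illustration.
type mapTx map[string][]byte

func (m mapTx) Put(key string, value []byte) error { m[key] = value; return nil }

// LocalSink writes every change of a batch into the caller's transaction.
type LocalSink struct{ tx RwTx }

func (s *LocalSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	if batch == nil {
		return errors.New("nil batch")
	}
	for k, v := range batch.Changes {
		if err := ctx.Err(); err != nil {
			return err // stop early if the caller cancelled
		}
		if err := s.tx.Put(k, v); err != nil {
			return fmt.Errorf("put %q: %w", k, err)
		}
	}
	return nil
}

// Close is a no-op: the caller owns the transaction and commits it.
func (s *LocalSink) Close() error { return nil }

// SharedDomains is a stand-in that holds only an in-memory overlay.
type SharedDomains struct {
	from, to uint64
	overlay  map[string][]byte
}

// FlushToSink exports the overlay as one batch, hands it to the sink,
// and clears the overlay only after the sink has accepted the batch.
func (sd *SharedDomains) FlushToSink(ctx context.Context, sink StorageSink) error {
	batch := &BlockBatch{FromBlock: sd.from, ToBlock: sd.to, Changes: sd.overlay}
	if err := sink.ApplyBlockBatch(ctx, batch); err != nil {
		return err
	}
	sd.overlay = map[string][]byte{}
	return nil
}

// Flush keeps the existing signature and delegates to FlushToSink.
func (sd *SharedDomains) Flush(ctx context.Context, tx RwTx) error {
	return sd.FlushToSink(ctx, &LocalSink{tx: tx})
}

// demo flushes one pending change and reports what landed where.
func demo() (written, pending int, err error) {
	sd := &SharedDomains{from: 1, to: 2, overlay: map[string][]byte{"acc/0x01": {42}}}
	tx := mapTx{}
	err = sd.Flush(context.Background(), tx)
	return len(tx), len(sd.overlay), err
}

func main() {
	fmt.Println(demo())
}
```

Existing callers of Flush are unaffected in this sketch, which is the property Phase 1 relies on: the sink is introduced without any behavior change.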
Deployment Modes
Embedded mode: LocalSink writes directly to the local MDBX transaction. This is the current behavior wrapped in the StorageSink interface, with the same writes on the same path and no added I/O.
archive:
  mode: embedded
Attached mode: RemoteSink sends each BlockBatch to a separate archive service via gRPC. The executor can optionally keep its own local state (tee: true) or discard it (tee: false).
# Executor node:
archive:
  mode: attached
  addr: "archive.internal:9095"
  tee: false
# Archive node:
archive:
  mode: embedded
  processors:
    - otterscan-indexer
    - snapshot-builder
  rpc:
    enabled: true
Detached mode: SnapshotSink writes directly to snapshot-format files. There is no executor and no live chain connection; the archive node reads pre-built snapshots from the BitTorrent network via Sparse Snapshots.
archive:
  mode: detached
  processors:
    - otterscan-indexer
  rpc:
    enabled: true
A TeedSink composes multiple sinks — for example, local MDBX plus a remote archive simultaneously:
type TeedSink struct {
	primary   StorageSink // must succeed (local)
	secondary StorageSink // best-effort (remote archive)
}
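One way the primary/best-effort split could look in code is sketched below. The error handling policy (fail on primary error, log and continue on secondary error) is taken from the field comments above; `stubSink`, `errDown`, `applyOnce`, and the reduced `BlockBatch` are hypothetical helpers for the illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// BlockBatch is reduced to its block range for this sketch.
type BlockBatch struct{ FromBlock, ToBlock uint64 }

type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

type TeedSink struct {
	primary   StorageSink // must succeed (local)
	secondary StorageSink // best-effort (remote archive)
}

// ApplyBlockBatch fails only if the primary fails. A secondary failure
// is logged and swallowed, so it can never stall or abort execution.
func (t *TeedSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	if err := t.primary.ApplyBlockBatch(ctx, batch); err != nil {
		return fmt.Errorf("primary sink: %w", err)
	}
	if err := t.secondary.ApplyBlockBatch(ctx, batch); err != nil {
		log.Printf("secondary sink dropped batch %d-%d: %v",
			batch.FromBlock, batch.ToBlock, err)
	}
	return nil
}

// Close closes both sinks and reports every close error.
func (t *TeedSink) Close() error {
	return errors.Join(t.primary.Close(), t.secondary.Close())
}

// stubSink is a hypothetical test double: it counts applied batches
// or returns a preconfigured error.
type stubSink struct {
	err     error
	applied int
}

func (s *stubSink) ApplyBlockBatch(context.Context, *BlockBatch) error {
	if s.err != nil {
		return s.err
	}
	s.applied++
	return nil
}

func (s *stubSink) Close() error { return nil }

var errDown = errors.New("archive unreachable")

// applyOnce pushes one batch through a TeedSink built from two stubs.
func applyOnce(primaryErr, secondaryErr error) (p, s *stubSink, err error) {
	p, s = &stubSink{err: primaryErr}, &stubSink{err: secondaryErr}
	tee := &TeedSink{primary: p, secondary: s}
	err = tee.ApplyBlockBatch(context.Background(), &BlockBatch{FromBlock: 1, ToBlock: 8})
	return p, s, err
}

func main() {
	p, s, err := applyOnce(nil, errDown)
	fmt.Println(p.applied, s.applied, err)
}
```

The sketch calls the secondary synchronously, so a slow remote would still delay the caller. A real implementation would likely need a bounded queue or a timeout on the secondary path; that choice is left open here.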
gRPC Protocol
The Archive service follows Erigon's existing gRPC direct/remote pattern (same as KV, Sentry, TxPool).
The direct adapter (ArchiveClientDirect) calls archive server methods in-process with zero serialization — used when running the archive embedded. The remote adapter uses a real gRPC connection — used when the archive runs as a separate process.
Key RPC methods:
service Archive {
  rpc ApplyBlockBatch(BlockBatchRequest) returns (BlockBatchReply);
  rpc StreamBlockBatches(StreamRequest) returns (stream BlockBatchRequest);
  rpc GetSyncStatus(google.protobuf.Empty) returns (SyncStatusReply);
}
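The direct adapter idea can be illustrated without any gRPC dependency. In the sketch below, plain structs stand in for the generated protobuf messages, and the `ArchiveClient` and `ArchiveServer` interfaces are simplified assumptions (generated gRPC client methods also take call options). `localArchive` and `send` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// Plain structs stand in for the generated protobuf messages.
type BlockBatchRequest struct{ FromBlock, ToBlock uint64 }
type BlockBatchReply struct{ LastApplied uint64 }

// ArchiveServer is the server-side contract (simplified assumption).
type ArchiveServer interface {
	ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error)
}

// ArchiveClient is what the executor codes against. A remote adapter
// would satisfy it over a network connection; the direct adapter
// below satisfies it with a plain method call.
type ArchiveClient interface {
	ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error)
}

// localArchive is a toy server that remembers the last request it saw.
type localArchive struct {
	lastReq *BlockBatchRequest
}

func (a *localArchive) ApplyBlockBatch(_ context.Context, req *BlockBatchRequest) (*BlockBatchReply, error) {
	a.lastReq = req
	return &BlockBatchReply{LastApplied: req.ToBlock}, nil
}

// ArchiveClientDirect forwards calls to an in-process server.
// The request pointer is passed through as-is: nothing is serialized.
type ArchiveClientDirect struct{ server ArchiveServer }

func NewArchiveClientDirect(server ArchiveServer) *ArchiveClientDirect {
	return &ArchiveClientDirect{server: server}
}

func (c *ArchiveClientDirect) ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error) {
	return c.server.ApplyBlockBatch(ctx, req)
}

// send depends only on the ArchiveClient interface, so the same code
// works with either the direct or a remote adapter.
func send(client ArchiveClient, req *BlockBatchRequest) (uint64, error) {
	reply, err := client.ApplyBlockBatch(context.Background(), req)
	if err != nil {
		return 0, err
	}
	return reply.LastApplied, nil
}

func main() {
	srv := &localArchive{}
	req := &BlockBatchRequest{FromBlock: 1, ToBlock: 10}
	last, err := send(NewArchiveClientDirect(srv), req)
	fmt.Println(last, err, srv.lastReq == req)
}
```

Because the caller sees only the client interface, switching between embedded and separate-process deployment becomes a wiring decision rather than a code change.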
BatchProcessor Framework
After each BlockBatch is durably stored, the archive server runs pluggable post-processors in parallel. Processors cannot block or fail execution — they run after the commit:
type BatchProcessor interface {
	Name() string
	Process(ctx context.Context, batch *BlockBatch) error
}
Built-in processors:
| Processor | Output |
|---|---|
| OtterScanIndexer | Address→tx mapping, internal transfer index |
| LogIndexer | Bloom filters for topic queries |
| TraceIndexer | Address→trace bitmap |
| SnapshotBuilder | .seg files for BitTorrent seeding |
| MetricsCollector | Prometheus metrics (gas stats, tx type distribution) |
This resembles a map-reduce pattern: the executor maps blocks to BlockBatch changesets, and the archive reduces them through multiple processors in parallel. Adding a new indexer means registering a new BatchProcessor, with no changes to the executor.
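A minimal sketch of the parallel fan-out, assuming the archive server calls it after the batch has been committed. `runProcessors`, `blockCounter`, and `alwaysFails` are hypothetical names, and `BlockBatch` is reduced to its block range. Failures are collected per processor and returned for reporting; they are never turned into an error for the commit path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// BlockBatch is reduced to its block range for this sketch.
type BlockBatch struct{ FromBlock, ToBlock uint64 }

type BatchProcessor interface {
	Name() string
	Process(ctx context.Context, batch *BlockBatch) error
}

// runProcessors runs each processor in its own goroutine and waits for
// all of them. It returns the failures keyed by processor name; an
// empty map means every processor succeeded.
func runProcessors(ctx context.Context, batch *BlockBatch, procs []BatchProcessor) map[string]error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures = make(map[string]error)
	)
	for _, p := range procs {
		wg.Add(1)
		go func(p BatchProcessor) {
			defer wg.Done()
			if err := p.Process(ctx, batch); err != nil {
				mu.Lock()
				failures[p.Name()] = err
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return failures
}

// blockCounter is a toy processor that counts the blocks it has seen.
type blockCounter struct{ blocks uint64 }

func (c *blockCounter) Name() string { return "block-counter" }

func (c *blockCounter) Process(_ context.Context, b *BlockBatch) error {
	c.blocks += b.ToBlock - b.FromBlock + 1
	return nil
}

// alwaysFails shows that one broken processor does not affect others.
type alwaysFails struct{}

func (alwaysFails) Name() string { return "always-fails" }

func (alwaysFails) Process(context.Context, *BlockBatch) error {
	return errors.New("index unavailable")
}

// demo processes one ten-block batch with one healthy and one failing
// processor.
func demo() (blocks uint64, failures map[string]error) {
	counter := &blockCounter{}
	failures = runProcessors(context.Background(),
		&BlockBatch{FromBlock: 100, ToBlock: 109},
		[]BatchProcessor{counter, alwaysFails{}})
	return counter.blocks, failures
}

func main() {
	blocks, failures := demo()
	fmt.Println(blocks, len(failures))
}
```

The sketch does not recover from panics inside a processor and waits for the slowest one before returning. Whether the server waits, applies a timeout, or lets processors lag behind is a policy decision this document does not fix.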
Adding New Storage Targets
Once StorageSink exists, adding a new target requires implementing one interface (ApplyBlockBatch and Close) and zero executor changes:
| Sink | Purpose | Estimated effort |
|---|---|---|
| PostgresSink | Relational DB for SQL queries | ~200 LOC |
| KafkaSink | Stream changesets to Kafka | ~150 LOC |
| S3Sink | Cloud storage for segment files | ~200 LOC |
| FilteredSink | Store only specific domains | ~50 LOC |
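As an illustration of how small such a sink can be, here is a sketch of the FilteredSink row as a decorator around another sink. The `Domain` constants, `DomainEntryDiff` fields, `NewFilteredSink`, and `captureSink` are assumptions made for the example and do not reflect the real kv package definitions.

```go
package main

import (
	"context"
	"fmt"
)

// Domain and its constants are stand-ins for the real domain enum.
type Domain int

const (
	AccountsDomain Domain = iota
	StorageDomain
	CodeDomain
	CommitmentDomain
	DomainLen
)

type DomainEntryDiff struct{ Key, Value []byte }

// BlockBatch keeps only the fields the filter needs.
type BlockBatch struct {
	FromBlock, ToBlock uint64
	DomainDiffs        [DomainLen][]DomainEntryDiff
}

type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// FilteredSink forwards only the selected domains to the next sink.
type FilteredSink struct {
	next StorageSink
	keep [DomainLen]bool
}

func NewFilteredSink(next StorageSink, domains ...Domain) *FilteredSink {
	f := &FilteredSink{next: next}
	for _, d := range domains {
		f.keep[d] = true
	}
	return f
}

func (f *FilteredSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	// Copying the struct copies the DomainDiffs array, so clearing
	// entries here leaves the caller's batch untouched.
	filtered := *batch
	for d := range filtered.DomainDiffs {
		if !f.keep[d] {
			filtered.DomainDiffs[d] = nil
		}
	}
	return f.next.ApplyBlockBatch(ctx, &filtered)
}

func (f *FilteredSink) Close() error { return f.next.Close() }

// captureSink is a hypothetical test double that keeps the last batch.
type captureSink struct{ last *BlockBatch }

func (c *captureSink) ApplyBlockBatch(_ context.Context, b *BlockBatch) error {
	c.last = b
	return nil
}

func (c *captureSink) Close() error { return nil }

// demo filters a batch down to the accounts domain and reports what
// the inner sink saw versus what the original batch still holds.
func demo() (seenAccounts, seenStorage, originalStorage int, err error) {
	batch := &BlockBatch{FromBlock: 1, ToBlock: 1}
	batch.DomainDiffs[AccountsDomain] = []DomainEntryDiff{{Key: []byte("a")}}
	batch.DomainDiffs[StorageDomain] = []DomainEntryDiff{{Key: []byte("s")}}

	inner := &captureSink{}
	sink := NewFilteredSink(inner, AccountsDomain)
	if err = sink.ApplyBlockBatch(context.Background(), batch); err != nil {
		return 0, 0, 0, err
	}
	return len(inner.last.DomainDiffs[AccountsDomain]),
		len(inner.last.DomainDiffs[StorageDomain]),
		len(batch.DomainDiffs[StorageDomain]), nil
}

func main() {
	fmt.Println(demo())
}
```

Because FilteredSink is itself a StorageSink, it can be stacked in front of any other sink, including a TeedSink, without either side knowing about the other.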
Implementation Phases
| Phase | Scope |
|---|---|
| 1 | Define StorageSink + BlockBatch; implement LocalSink; add FlushToSink() to SharedDomains alongside existing Flush() — zero behavior change |
| 2 | Add ExportDiffs() and ExportIndexUpdates() to TemporalMemBatch; protobuf definitions; round-trip serialization test |
| 3 | gRPC service + ArchiveClientDirect zero-serialization adapter; ArchiveServer with local DB backend |
| 4 | Attached mode (RemoteSink) + TeedSink; CLI flags (--archive.mode, --archive.addr, --archive.tee) |
| 5 | BatchProcessor framework; port OtterScanIndexer and SnapshotBuilder as processors |
| 6 | Detached mode via SnapshotSink; BitTorrent snapshot discovery and download |