Erigon Archive

note

Status: Design phase. Depends on: SharedDomains flush path, existing gRPC direct adapter pattern (node/direct/).

Erigon Archive decouples block execution from persistent storage. Today the two are tightly coupled: SharedDomains.Flush() writes directly to a local MDBX transaction, and there is no clean way to attach external indexers, snapshot builders, or remote storage backends without modifying the execution path.

The design introduces a StorageSink interface at the SharedDomains.Flush() boundary, where accumulated execution state transitions from in-memory overlay to persistent storage. The executor produces BlockBatch changesets; the sink consumes them. This single interface change enables three deployment modes and an open-ended BatchProcessor framework for post-processing.


Key Capabilities

Three deployment modes: Embedded (current behavior, wrapped in the interface), Attached (executor streams changesets to a separate archive service via gRPC), and Detached (archive reads pre-built snapshots from the BitTorrent network via Sparse Snapshots, no live chain connection required).

BatchProcessor framework: After each BlockBatch is durably stored, pluggable post-processors run in parallel — receipt indexers, snapshot builders, metrics collectors. Adding a new indexer is registering a new BatchProcessor; the executor is never touched.

Composable sinks: A TeedSink composes multiple sinks — for example, writing to local MDBX and a remote archive simultaneously, with the local write as primary and the remote as best-effort.


Design: StorageSink Interface

The split point is SharedDomains.Flush(). The design defines a sink interface and a batch type at that boundary, and the existing method becomes a thin wrapper over them:

// db/kv/storage_sink.go

type StorageSink interface {
    ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
    Close() error
}

type BlockBatch struct {
    FromBlock    uint64
    ToBlock      uint64
    BlockOverlay *MemoryDiff                  // headers, bodies, canonical hashes, TD
    DomainDiffs  [DomainLen][]DomainEntryDiff // accounts, storage, code, commitment
    IndexUpdates map[InvertedIdx][]IndexEntry
    StateRoot    common.Hash
}

The MemoryDiff struct already exists (kv/membatchwithdb/memory_mutation_diff.go) as a serializable changeset. ExportDiffs() and ExportIndexUpdates() are added to TemporalMemBatch to export accumulated domain state.

The existing Flush(ctx, kv.RwTx) becomes:

func (sd *SharedDomains) Flush(ctx context.Context, tx kv.RwTx) error {
    return sd.FlushToSink(ctx, &LocalSink{tx: tx})
}
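The delegation from Flush() to FlushToSink() can be illustrated with a self-contained toy model. This is only a sketch under stated assumptions: memTx stands in for kv.RwTx, BlockBatch is reduced to its block range, and the body of FlushToSink is a guess at its shape (the design does not spell it out).

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// BlockBatch is a reduced stand-in for the struct in the design; only the
// block range is kept so the example stays self-contained.
type BlockBatch struct {
	FromBlock uint64
	ToBlock   uint64
}

// StorageSink mirrors the interface proposed in the design.
type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// memTx is a hypothetical stand-in for kv.RwTx: it only records applied ranges.
type memTx struct {
	applied []string
}

// LocalSink writes each batch into the transaction it wraps.
type LocalSink struct {
	tx *memTx
}

func (s *LocalSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	if batch == nil {
		return errors.New("nil batch")
	}
	s.tx.applied = append(s.tx.applied, fmt.Sprintf("%d-%d", batch.FromBlock, batch.ToBlock))
	return nil
}

func (s *LocalSink) Close() error { return nil }

// SharedDomains is a toy model that only tracks the pending block range.
type SharedDomains struct {
	from, to uint64
}

// FlushToSink packages the accumulated state as one batch and hands it to the sink.
// (Assumed shape; the real method would export domain diffs and index updates.)
func (sd *SharedDomains) FlushToSink(ctx context.Context, sink StorageSink) error {
	batch := &BlockBatch{FromBlock: sd.from, ToBlock: sd.to}
	return sink.ApplyBlockBatch(ctx, batch)
}

// Flush keeps the shape of the existing entry point and delegates to FlushToSink.
func (sd *SharedDomains) Flush(ctx context.Context, tx *memTx) error {
	return sd.FlushToSink(ctx, &LocalSink{tx: tx})
}

// flushDemo runs one flush and returns what the transaction recorded.
func flushDemo() ([]string, error) {
	tx := &memTx{}
	sd := &SharedDomains{from: 100, to: 199}
	err := sd.Flush(context.Background(), tx)
	return tx.applied, err
}

func main() {
	applied, err := flushDemo()
	fmt.Println(applied, err)
}
```

Callers of Flush() see no difference in this model; only code that calls FlushToSink() directly chooses a different sink.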

Deployment Modes


Embedded: LocalSink writes directly to the local MDBX transaction. This is the current behavior, wrapped in the StorageSink interface. No additional overhead is expected, since the writes and the code path are unchanged.

archive:
  mode: embedded

Attached: RemoteSink sends each BlockBatch to a separate archive service via gRPC. The executor can optionally keep its own local state (tee: true) or discard it (tee: false).

# Executor node:
archive:
  mode: attached
  addr: "archive.internal:9095"
  tee: false

# Archive node:
archive:
  mode: embedded
  processors:
    - otterscan-indexer
    - snapshot-builder
  rpc:
    enabled: true

Detached: SnapshotSink writes directly to snapshot-format files. There is no executor and no live chain connection; the archive node reads pre-built snapshots from the BitTorrent network via Sparse Snapshots.

archive:
  mode: detached
  processors:
    - otterscan-indexer
  rpc:
    enabled: true
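The configuration fragments above could be checked at startup along the following lines. This is a hypothetical sketch: the ArchiveConfig struct, its field names, and the rule that attached mode requires an address are assumptions made for illustration, not part of the design.

```go
package main

import (
	"errors"
	"fmt"
)

// ArchiveConfig is a hypothetical in-memory form of the "archive:" section.
type ArchiveConfig struct {
	Mode       string   // "embedded", "attached", or "detached"
	Addr       string   // archive service address, used in attached mode
	Tee        bool     // keep local state in addition to the remote archive
	Processors []string // names of post-processors to enable
}

// Validate rejects unknown modes and an attached mode without an address.
// Both rules are assumptions; the design does not define validation.
func (c ArchiveConfig) Validate() error {
	switch c.Mode {
	case "embedded", "detached":
		return nil
	case "attached":
		if c.Addr == "" {
			return errors.New("archive.addr is required when archive.mode is attached")
		}
		return nil
	default:
		return fmt.Errorf("unknown archive.mode %q", c.Mode)
	}
}

func main() {
	cfg := ArchiveConfig{Mode: "attached", Addr: "archive.internal:9095"}
	fmt.Println(cfg.Validate())
}
```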

A TeedSink composes multiple sinks — for example, local MDBX plus a remote archive simultaneously:

type TeedSink struct {
    primary   StorageSink // must succeed (local)
    secondary StorageSink // best-effort (remote archive)
}
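One way the "must succeed" versus "best-effort" split could behave is sketched below. The onSecondaryErr hook and the fakeSink type are hypothetical, and the two writes are issued one after the other for simplicity, whereas the design describes them as simultaneous.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// BlockBatch is a reduced stand-in for the design's struct.
type BlockBatch struct {
	FromBlock, ToBlock uint64
}

// StorageSink mirrors the interface proposed in the design.
type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// TeedSink forwards each batch to both sinks.
type TeedSink struct {
	primary        StorageSink // must succeed (local)
	secondary      StorageSink // best-effort (remote archive)
	onSecondaryErr func(error) // hypothetical hook for logging or metrics
}

// ApplyBlockBatch fails only if the primary fails. A secondary failure is
// reported through the hook and otherwise ignored.
func (t *TeedSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	if err := t.primary.ApplyBlockBatch(ctx, batch); err != nil {
		return fmt.Errorf("primary sink: %w", err)
	}
	if err := t.secondary.ApplyBlockBatch(ctx, batch); err != nil && t.onSecondaryErr != nil {
		t.onSecondaryErr(err)
	}
	return nil
}

// Close closes both sinks and reports any errors together.
func (t *TeedSink) Close() error {
	return errors.Join(t.primary.Close(), t.secondary.Close())
}

// fakeSink counts applied batches and can be told to fail.
type fakeSink struct {
	fail    bool
	applied int
}

func (f *fakeSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	if f.fail {
		return errors.New("sink unavailable")
	}
	f.applied++
	return nil
}

func (f *fakeSink) Close() error { return nil }

// teeDemo applies one batch and reports how each sink fared.
func teeDemo(primaryFails, secondaryFails bool) (primaryApplied, secondaryApplied, dropped int, err error) {
	primary := &fakeSink{fail: primaryFails}
	secondary := &fakeSink{fail: secondaryFails}
	tee := &TeedSink{
		primary:        primary,
		secondary:      secondary,
		onSecondaryErr: func(error) { dropped++ },
	}
	err = tee.ApplyBlockBatch(context.Background(), &BlockBatch{FromBlock: 1, ToBlock: 10})
	return primary.applied, secondary.applied, dropped, err
}

func main() {
	p, s, d, err := teeDemo(false, true)
	fmt.Println(p, s, d, err)
}
```

In this sketch a failed primary write skips the secondary entirely, so the remote archive never gets ahead of local state; whether the real TeedSink should behave that way is an open design choice.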

gRPC Protocol

The Archive service follows Erigon's existing gRPC direct/remote pattern (same as KV, Sentry, TxPool).

The direct adapter (ArchiveClientDirect) calls archive server methods in-process with zero serialization — used when running the archive embedded. The remote adapter uses a real gRPC connection — used when the archive runs as a separate process.

Key RPC methods:

service Archive {
    rpc ApplyBlockBatch(BlockBatchRequest) returns (BlockBatchReply);
    rpc StreamBlockBatches(StreamRequest) returns (stream BlockBatchRequest);
    rpc GetSyncStatus(google.protobuf.Empty) returns (SyncStatusReply);
}
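The direct adapter idea can be sketched without any gRPC dependency. The message fields, the simplified method signatures (no call options), and the memServer type below are assumptions for illustration; the generated gRPC interfaces would differ.

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the protobuf messages; the fields are hypothetical.
type BlockBatchRequest struct {
	FromBlock, ToBlock uint64
}

type BlockBatchReply struct {
	Accepted bool
}

// ArchiveServer and ArchiveClient model the server and client sides of the
// ApplyBlockBatch RPC.
type ArchiveServer interface {
	ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error)
}

type ArchiveClient interface {
	ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error)
}

// ArchiveClientDirect satisfies the client interface by calling the server
// in-process, so the request is passed by pointer and never serialized.
type ArchiveClientDirect struct {
	server ArchiveServer
}

func NewArchiveClientDirect(server ArchiveServer) *ArchiveClientDirect {
	return &ArchiveClientDirect{server: server}
}

func (c *ArchiveClientDirect) ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error) {
	return c.server.ApplyBlockBatch(ctx, req)
}

// memServer is a toy server that remembers the last request it received.
type memServer struct {
	lastReq *BlockBatchRequest
}

func (s *memServer) ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error) {
	s.lastReq = req
	return &BlockBatchReply{Accepted: true}, nil
}

// directDemo reports whether the call was accepted and whether the server saw
// the very same request object the caller passed in.
func directDemo() (accepted, samePointer bool, err error) {
	srv := &memServer{}
	var client ArchiveClient = NewArchiveClientDirect(srv)
	req := &BlockBatchRequest{FromBlock: 1, ToBlock: 10}
	reply, err := client.ApplyBlockBatch(context.Background(), req)
	if err != nil {
		return false, false, err
	}
	return reply.Accepted, srv.lastReq == req, nil
}

func main() {
	fmt.Println(directDemo())
}
```

Code written against ArchiveClient works unchanged with a remote adapter, which is what lets the same executor run in embedded or attached mode.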

BatchProcessor Framework

After each BlockBatch is durably stored, the archive server runs pluggable post-processors in parallel. Processors cannot block or fail execution — they run after the commit:

type BatchProcessor interface {
    Name() string
    Process(ctx context.Context, batch *BlockBatch) error
}

Built-in processors:

| Processor | Output |
| --- | --- |
| OtterScanIndexer | Address→tx mapping, internal transfer index |
| LogIndexer | Bloom filters for topic queries |
| TraceIndexer | Address→trace bitmap |
| SnapshotBuilder | .seg files for BitTorrent seeding |
| MetricsCollector | Prometheus metrics (gas stats, tx type distribution) |

This loosely follows the map-reduce pattern: the executor maps blocks to BlockBatch changesets, and the archive reduces them through multiple processors in parallel. Adding a new indexer means registering a new BatchProcessor, with no changes to the executor.
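A possible shape for the post-commit fan-out is sketched below. The runProcessors helper and the funcProcessor adapter are hypothetical; the sketch assumes that processor errors and panics are collected per processor name rather than propagated, which is one reading of "cannot fail execution".

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
	"sync"
)

// BlockBatch is a reduced stand-in for the design's struct.
type BlockBatch struct {
	FromBlock, ToBlock uint64
}

// BatchProcessor mirrors the interface proposed in the design.
type BatchProcessor interface {
	Name() string
	Process(ctx context.Context, batch *BlockBatch) error
}

// runProcessors runs every processor in its own goroutine and waits for all
// of them. Failures and panics are recorded by processor name, never returned
// as a single fatal error.
func runProcessors(ctx context.Context, batch *BlockBatch, procs []BatchProcessor) map[string]error {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		errs = make(map[string]error)
	)
	record := func(name string, err error) {
		mu.Lock()
		errs[name] = err
		mu.Unlock()
	}
	for _, p := range procs {
		wg.Add(1)
		go func(p BatchProcessor) {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					record(p.Name(), fmt.Errorf("panic: %v", r))
				}
			}()
			if err := p.Process(ctx, batch); err != nil {
				record(p.Name(), err)
			}
		}(p)
	}
	wg.Wait()
	return errs
}

// funcProcessor adapts a plain function to the BatchProcessor interface.
type funcProcessor struct {
	name string
	fn   func(ctx context.Context, batch *BlockBatch) error
}

func (f funcProcessor) Name() string { return f.name }

func (f funcProcessor) Process(ctx context.Context, batch *BlockBatch) error {
	return f.fn(ctx, batch)
}

// processorDemo runs three toy processors and returns the sorted names of
// those that failed.
func processorDemo() []string {
	procs := []BatchProcessor{
		funcProcessor{name: "log-indexer", fn: func(context.Context, *BlockBatch) error { return nil }},
		funcProcessor{name: "trace-indexer", fn: func(context.Context, *BlockBatch) error {
			return errors.New("index unavailable")
		}},
		funcProcessor{name: "metrics", fn: func(context.Context, *BlockBatch) error { panic("boom") }},
	}
	errs := runProcessors(context.Background(), &BlockBatch{FromBlock: 1, ToBlock: 10}, procs)
	failed := make([]string, 0, len(errs))
	for name := range errs {
		failed = append(failed, name)
	}
	sort.Strings(failed)
	return failed
}

func main() {
	fmt.Println(processorDemo())
}
```

The helper waits for all processors before returning, so a slow processor delays the next round of post-processing on the archive side. It does not delay the commit that already happened.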

Adding New Storage Targets

Once StorageSink exists, adding a new target requires implementing the StorageSink interface (ApplyBlockBatch and Close) and no executor changes:

| Sink | Purpose | Estimated effort |
| --- | --- | --- |
| PostgresSink | Relational DB for SQL queries | ~200 LOC |
| KafkaSink | Stream changesets to Kafka | ~150 LOC |
| S3Sink | Cloud storage for segment files | ~200 LOC |
| FilteredSink | Store only specific domains | ~50 LOC |
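As an illustration of how small such a target can be, here is a sketch of a FilteredSink that forwards only selected domains to a wrapped sink. The Domain constants, the reduced BlockBatch, and the constructor are assumptions made to keep the example self-contained; the real types would come from the kv package.

```go
package main

import (
	"context"
	"fmt"
)

// Domain and its constants are hypothetical stand-ins for the real domain enum.
type Domain int

const (
	AccountsDomain Domain = iota
	StorageDomain
	CodeDomain
	CommitmentDomain
	DomainLen
)

type DomainEntryDiff struct {
	Key, Value []byte
}

// BlockBatch is reduced to the fields this example needs.
type BlockBatch struct {
	FromBlock, ToBlock uint64
	DomainDiffs        [DomainLen][]DomainEntryDiff
}

type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// FilteredSink forwards only the selected domains to the wrapped sink.
type FilteredSink struct {
	next StorageSink
	keep [DomainLen]bool
}

func NewFilteredSink(next StorageSink, domains ...Domain) *FilteredSink {
	f := &FilteredSink{next: next}
	for _, d := range domains {
		if d >= 0 && d < DomainLen {
			f.keep[d] = true
		}
	}
	return f
}

func (f *FilteredSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	filtered := *batch // shallow copy, so the caller's batch is left untouched
	for d := range filtered.DomainDiffs {
		if !f.keep[d] {
			filtered.DomainDiffs[d] = nil
		}
	}
	return f.next.ApplyBlockBatch(ctx, &filtered)
}

func (f *FilteredSink) Close() error { return f.next.Close() }

// captureSink remembers the last batch it received.
type captureSink struct {
	last *BlockBatch
}

func (c *captureSink) ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error {
	c.last = batch
	return nil
}

func (c *captureSink) Close() error { return nil }

// filterDemo keeps only the accounts domain and returns how many account and
// storage entries reached the wrapped sink, plus how many storage entries the
// original batch still holds.
func filterDemo() (accounts, storage, originalStorage int, err error) {
	capture := &captureSink{}
	sink := NewFilteredSink(capture, AccountsDomain)
	batch := &BlockBatch{FromBlock: 1, ToBlock: 1}
	batch.DomainDiffs[AccountsDomain] = []DomainEntryDiff{{Key: []byte("a"), Value: []byte("1")}}
	batch.DomainDiffs[StorageDomain] = []DomainEntryDiff{{Key: []byte("s"), Value: []byte("2")}}
	if err := sink.ApplyBlockBatch(context.Background(), batch); err != nil {
		return 0, 0, 0, err
	}
	return len(capture.last.DomainDiffs[AccountsDomain]),
		len(capture.last.DomainDiffs[StorageDomain]),
		len(batch.DomainDiffs[StorageDomain]), nil
}

func main() {
	fmt.Println(filterDemo())
}
```

Because FilteredSink wraps another StorageSink, it composes with any other target, for example a filter in front of a remote archive.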

Implementation Phases

| Phase | Scope |
| --- | --- |
| 1 | Define StorageSink + BlockBatch; implement LocalSink; add FlushToSink() to SharedDomains alongside existing Flush() (no behavior change) |
| 2 | Add ExportDiffs() and ExportIndexUpdates() to TemporalMemBatch; protobuf definitions; round-trip serialization test |
| 3 | gRPC service + ArchiveClientDirect zero-serialization adapter; ArchiveServer with local DB backend |
| 4 | Attached mode (RemoteSink) + TeedSink; CLI flags (--archive.mode, --archive.addr, --archive.tee) |
| 5 | BatchProcessor framework; port OtterScanIndexer and SnapshotBuilder as processors |
| 6 | Detached mode via SnapshotSink; BitTorrent snapshot discovery and download |