Erigon Archive

Decouple block execution from persistent storage via a StorageSink interface


Status: Design phase. Depends on: SharedDomains flush path, existing gRPC direct adapter pattern (node/direct/).

Erigon Archive decouples block execution from persistent storage. Today, execution and storage are tightly coupled — SharedDomains.Flush() writes directly to a local MDBX transaction and there is no clean way to attach external indexers, snapshot builders, or remote storage backends without modifying the execution path.

The design introduces a StorageSink interface at the SharedDomains.Flush() boundary, where accumulated execution state transitions from in-memory overlay to persistent storage. The executor produces BlockBatch changesets; the sink consumes them. This single interface change enables three deployment modes and an open-ended BatchProcessor framework for post-processing.


Key Capabilities

Three deployment modes: Embedded (current behavior, wrapped in the interface), Attached (executor streams changesets to a separate archive service via gRPC), and Detached (archive reads pre-built snapshots from the BitTorrent network via Sparse Snapshots, no live chain connection required).

BatchProcessor framework: After each BlockBatch is durably stored, pluggable post-processors run in parallel — receipt indexers, snapshot builders, metrics collectors. Adding a new indexer means registering a new BatchProcessor; the executor is never touched.

Composable sinks: A TeedSink composes multiple sinks — for example, writing to local MDBX and a remote archive simultaneously, with the local write as primary and the remote as best-effort.


Design: StorageSink Interface

The split point is SharedDomains.Flush(). The existing method becomes a thin wrapper:

// db/kv/storage_sink.go

type StorageSink interface {
    ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
    Close() error
}

type BlockBatch struct {
    FromBlock    uint64
    ToBlock      uint64
    BlockOverlay *MemoryDiff                      // headers, bodies, canonical hashes, TD
    DomainDiffs  [DomainLen][]DomainEntryDiff     // accounts, storage, code, commitment
    IndexUpdates map[InvertedIdx][]IndexEntry
    StateRoot    common.Hash
}

The MemoryDiff struct already exists (kv/membatchwithdb/memory_mutation_diff.go) as a serializable changeset. ExportDiffs() and ExportIndexUpdates() are added to TemporalMemBatch to export accumulated domain state.

The existing Flush(ctx, kv.RwTx) keeps its signature and becomes a thin wrapper: it wraps the transaction in a LocalSink and delegates to FlushToSink.

Deployment Modes

LocalSink writes directly to the local MDBX transaction. This is the current behavior, wrapped in the StorageSink interface. Zero overhead — same writes, same path.

A TeedSink composes multiple sinks: for example, local MDBX plus a remote archive written simultaneously, with the local write as primary and the remote as best-effort.


gRPC Protocol

The Archive service follows Erigon's existing gRPC direct/remote pattern (same as KV, Sentry, TxPool).

The direct adapter (ArchiveClientDirect) calls archive server methods in-process with zero serialization — used when running the archive embedded. The remote adapter uses a real gRPC connection — used when the archive runs as a separate process.

The key RPC method mirrors the sink boundary: ApplyBlockBatch carries serialized BlockBatch changesets from the executor to the archive.

BatchProcessor Framework

After each BlockBatch is durably stored, the archive server runs pluggable post-processors in parallel. Processors cannot block or fail execution: they run only after the commit.

Built-in processors:

| Processor | Output |
| --- | --- |
| OtterScanIndexer | Address→tx mapping, internal transfer index |
| LogIndexer | Bloom filters for topic queries |
| TraceIndexer | Address→trace bitmap |
| SnapshotBuilder | .seg files for BitTorrent seeding |
| MetricsCollector | Prometheus metrics (gas stats, tx type distribution) |

This is the map-reduce pattern: the executor maps blocks to BlockBatch changesets; the archive reduces them through multiple processors in parallel. Adding a new indexer means registering a new BatchProcessor — no changes to the executor.

Adding New Storage Targets

Once StorageSink exists, adding a new target requires implementing one interface method and zero executor changes:

| Sink | Purpose | Estimated effort |
| --- | --- | --- |
| PostgresSink | Relational DB for SQL queries | ~200 LOC |
| KafkaSink | Stream changesets to Kafka | ~150 LOC |
| S3Sink | Cloud storage for segment files | ~200 LOC |
| FilteredSink | Store only specific domains | ~50 LOC |

Implementation Phases

| Phase | Scope |
| --- | --- |
| 1 | Define StorageSink + BlockBatch; implement LocalSink; add FlushToSink() to SharedDomains alongside existing Flush() — zero behavior change |
| 2 | Add ExportDiffs() and ExportIndexUpdates() to TemporalMemBatch; protobuf definitions; round-trip serialization test |
| 3 | gRPC service + ArchiveClientDirect zero-serialization adapter; ArchiveServer with local DB backend |
| 4 | Remote mode + TeedSink; CLI flags (--archive.mode, --archive.addr, --archive.tee) |
| 5 | BatchProcessor framework; port OtterScanIndexer and SnapshotBuilder as processors |
| 6 | Detached mode via SnapshotSink; BitTorrent snapshot discovery and download |
