Erigon Archive

Note. Status: design phase. Depends on: the SharedDomains flush path and the existing gRPC direct adapter pattern (node/direct/).

Erigon Archive decouples block execution from persistent storage. Today, execution and storage are tightly coupled: SharedDomains.Flush() writes directly to a local MDBX transaction, so external indexers, snapshot builders, or remote storage backends cannot be attached without modifying the execution path.

The design introduces a StorageSink interface at the SharedDomains.Flush() boundary, where accumulated execution state transitions from in-memory overlay to persistent storage. The executor produces BlockBatch changesets; the sink consumes them. This single interface change enables three deployment modes and an open-ended BatchProcessor framework for post-processing.

Key Capabilities

Three deployment modes: Embedded (current behavior, wrapped in the interface), Attached (executor streams changesets to a separate archive service via gRPC), and Detached (archive reads pre-built snapshots from the BitTorrent network via Sparse Snapshots, no live chain connection required).

BatchProcessor framework: After each BlockBatch is durably stored, pluggable post-processors (receipt indexers, snapshot builders, metrics collectors) run in parallel. Adding a new indexer means registering a new BatchProcessor; the executor is not modified.

Composable sinks: A TeedSink composes multiple sinks — for example, writing to local MDBX and a remote archive simultaneously, with the local write as primary and the remote as best-effort.

Design: StorageSink Interface

The split point is SharedDomains.Flush(). The sink interface and the batch it consumes are defined as follows:

// db/kv/storage_sink.go

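// StorageSink consumes the changesets produced by the executor.
type StorageSink interface {
    // ApplyBlockBatch persists one batch of execution changes.
    ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
    // Close releases any resources held by the sink.
    Close() error
}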

type BlockBatch struct {
    FromBlock    uint64
    ToBlock      uint64
    BlockOverlay *MemoryDiff                      // headers, bodies, canonical hashes, TD
    DomainDiffs  [DomainLen][]DomainEntryDiff     // accounts, storage, code, commitment
    IndexUpdates map[InvertedIdx][]IndexEntry
    StateRoot    common.Hash
}

The MemoryDiff struct already exists (kv/membatchwithdb/memory_mutation_diff.go) as a serializable changeset. ExportDiffs() and ExportIndexUpdates() are added to TemporalMemBatch to export accumulated domain state.

The existing Flush(ctx, kv.RwTx) becomes a thin wrapper over a new FlushToSink() method:

func (sd *SharedDomains) Flush(ctx context.Context, tx kv.RwTx) error {
    return sd.FlushToSink(ctx, &LocalSink{tx: tx})
}
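The wrapper above can be sketched end to end with stand-in types. This is a minimal illustration, not the planned implementation: `RwTx`, `memTx`, the trimmed `BlockBatch`, the reduced `SharedDomains`, and the `flushOnce` helper are all hypothetical, and the body of `LocalSink.ApplyBlockBatch` is an assumption about what replaying a batch into a transaction might look like.

```go
package main

import (
	"context"
	"fmt"
)

// BlockBatch is a trimmed stand-in for the struct defined above.
type BlockBatch struct {
	FromBlock, ToBlock uint64
}

// StorageSink mirrors the interface defined above.
type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// RwTx is a hypothetical stand-in for kv.RwTx.
type RwTx interface {
	Put(table string, k, v []byte) error
}

// memTx is an in-memory RwTx that only counts writes (demo only).
type memTx struct{ writes int }

func (m *memTx) Put(table string, k, v []byte) error { m.writes++; return nil }

// LocalSink writes a batch into the transaction it was given.
type LocalSink struct{ tx RwTx }

func (s *LocalSink) ApplyBlockBatch(ctx context.Context, b *BlockBatch) error {
	// Assumption: a real sink would replay every diff in the batch here.
	key := []byte(fmt.Sprintf("%d-%d", b.FromBlock, b.ToBlock))
	return s.tx.Put("batches", key, nil)
}

func (s *LocalSink) Close() error { return nil }

// SharedDomains is a stand-in that holds only the pending block range.
type SharedDomains struct{ from, to uint64 }

// FlushToSink packages the accumulated state into a batch and hands it to the sink.
func (sd *SharedDomains) FlushToSink(ctx context.Context, sink StorageSink) error {
	batch := &BlockBatch{FromBlock: sd.from, ToBlock: sd.to}
	return sink.ApplyBlockBatch(ctx, batch)
}

// Flush keeps the existing signature and delegates to FlushToSink.
func (sd *SharedDomains) Flush(ctx context.Context, tx RwTx) error {
	return sd.FlushToSink(ctx, &LocalSink{tx: tx})
}

// flushOnce runs one flush against an in-memory transaction
// and reports how many writes reached it.
func flushOnce() (int, error) {
	tx := &memTx{}
	sd := &SharedDomains{from: 100, to: 109}
	err := sd.Flush(context.Background(), tx)
	return tx.writes, err
}

func main() {
	fmt.Println(flushOnce())
}
```

Callers of `Flush` keep passing a transaction and see no difference, while new callers can pass any other `StorageSink` to `FlushToSink`.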

Deployment Modes

Embedded mode uses LocalSink, which writes directly to the local MDBX transaction. This is the current behavior wrapped in the StorageSink interface: same writes, same path, so no additional I/O is expected.

archive:
  mode: embedded

Attached mode uses RemoteSink, which sends each BlockBatch to a separate archive service via gRPC. The executor can optionally keep its own local state (tee: true) or discard it (tee: false).

# Executor node:
archive:
  mode: attached
  addr: "archive.internal:9095"
  tee: false

# Archive node:
archive:
  mode: embedded
  processors:
    - otterscan-indexer
    - snapshot-builder
  rpc:
    enabled: true

Detached mode uses SnapshotSink, which writes directly to snapshot-format files. There is no executor and no live chain connection. The archive node reads pre-built snapshots from the BitTorrent network via Sparse Snapshots.

archive:
  mode: detached
  processors:
    - otterscan-indexer
  rpc:
    enabled: true
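The three configurations above could be checked and mapped to a sink roughly as follows. This is a sketch under assumptions: the `ArchiveConfig` struct, `Validate`, and `sinkFor` are hypothetical names, and the rules that `addr` is required in attached mode, that `addr` and `tee` are rejected elsewhere, and that `tee: true` selects a TeedSink are inferred from the examples rather than stated by the design.

```go
package main

import (
	"errors"
	"fmt"
)

// ArchiveConfig mirrors the YAML keys shown above (hypothetical Go shape).
type ArchiveConfig struct {
	Mode       string   // "embedded", "attached", or "detached"
	Addr       string   // archive service address, attached mode only
	Tee        bool     // keep local state as well, attached mode only
	Processors []string // post-processors to run on the archive side
}

// Validate checks the mode and the mode-specific fields.
func (c ArchiveConfig) Validate() error {
	switch c.Mode {
	case "embedded", "detached":
		// Assumption: addr and tee are meaningful only for attached mode.
		if c.Addr != "" || c.Tee {
			return fmt.Errorf("addr and tee apply only to attached mode, got mode %q", c.Mode)
		}
	case "attached":
		if c.Addr == "" {
			return errors.New("attached mode requires addr")
		}
	default:
		return fmt.Errorf("unknown archive mode %q", c.Mode)
	}
	return nil
}

// sinkFor names the sink that a valid configuration maps to.
func sinkFor(c ArchiveConfig) (string, error) {
	if err := c.Validate(); err != nil {
		return "", err
	}
	switch {
	case c.Mode == "embedded":
		return "LocalSink", nil
	case c.Mode == "detached":
		return "SnapshotSink", nil
	case c.Tee:
		return "TeedSink(LocalSink, RemoteSink)", nil
	default:
		return "RemoteSink", nil
	}
}

func main() {
	fmt.Println(sinkFor(ArchiveConfig{Mode: "attached", Addr: "archive.internal:9095"}))
	fmt.Println(sinkFor(ArchiveConfig{Mode: "attached"}))
}
```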

A TeedSink composes multiple sinks — for example, local MDBX plus a remote archive simultaneously:

type TeedSink struct {
    primary   StorageSink  // must succeed (local)
    secondary StorageSink  // best-effort (remote archive)
}
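One way to read the "must succeed" and "best-effort" comments is sketched below. The method bodies are an assumption, not the planned implementation: this version applies the two sinks sequentially for simplicity (the design text says simultaneously), logs secondary failures instead of returning them, and uses a hypothetical `fakeSink` and `teeDemo` helper for the demonstration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// BlockBatch is a trimmed stand-in for the struct defined earlier.
type BlockBatch struct{ FromBlock, ToBlock uint64 }

type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

type TeedSink struct {
	primary   StorageSink // must succeed (local)
	secondary StorageSink // best-effort (remote archive)
}

// ApplyBlockBatch fails only if the primary fails; secondary errors are logged.
func (t *TeedSink) ApplyBlockBatch(ctx context.Context, b *BlockBatch) error {
	if err := t.primary.ApplyBlockBatch(ctx, b); err != nil {
		return fmt.Errorf("primary sink: %w", err)
	}
	if err := t.secondary.ApplyBlockBatch(ctx, b); err != nil {
		log.Printf("secondary sink skipped batch %d-%d: %v", b.FromBlock, b.ToBlock, err)
	}
	return nil
}

// Close closes both sinks and reports every error.
func (t *TeedSink) Close() error {
	return errors.Join(t.primary.Close(), t.secondary.Close())
}

// fakeSink counts applied batches and optionally fails (demo only).
type fakeSink struct {
	applied int
	fail    error
}

func (f *fakeSink) ApplyBlockBatch(ctx context.Context, b *BlockBatch) error {
	if f.fail != nil {
		return f.fail
	}
	f.applied++
	return nil
}

func (f *fakeSink) Close() error { return nil }

// teeDemo applies one batch while the secondary is failing and
// returns how many batches the primary stored plus the overall error.
func teeDemo() (primaryApplied int, err error) {
	local := &fakeSink{}
	remote := &fakeSink{fail: errors.New("archive unreachable")}
	tee := &TeedSink{primary: local, secondary: remote}
	err = tee.ApplyBlockBatch(context.Background(), &BlockBatch{FromBlock: 1, ToBlock: 8})
	return local.applied, err
}

func main() {
	fmt.Println(teeDemo())
}
```

A best-effort secondary that silently drops batches will fall behind the primary, so a real implementation would likely need a catch-up mechanism; the design text does not specify one.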

gRPC Protocol

The Archive service follows Erigon's existing gRPC direct/remote pattern (same as KV, Sentry, TxPool).

The direct adapter (ArchiveClientDirect) calls archive server methods in-process with zero serialization — used when running the archive embedded. The remote adapter uses a real gRPC connection — used when the archive runs as a separate process.
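The direct adapter idea can be illustrated without any gRPC dependency. In this sketch the `ArchiveClient` and `ArchiveServer` interfaces, the message structs, `memServer`, and `directDemo` are hypothetical stand-ins for generated protobuf code; a real adapter would implement the generated client interface, which also takes gRPC call options.

```go
package main

import (
	"context"
	"fmt"
)

// BlockBatchRequest and BlockBatchReply stand in for generated protobuf messages.
type BlockBatchRequest struct{ FromBlock, ToBlock uint64 }
type BlockBatchReply struct{ LastStored uint64 }

// ArchiveServer is the server-side contract (hypothetical shape).
type ArchiveServer interface {
	ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error)
}

// ArchiveClient is what the executor-side sink calls (hypothetical shape).
type ArchiveClient interface {
	ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error)
}

// ArchiveClientDirect satisfies ArchiveClient by calling the server in-process.
type ArchiveClientDirect struct{ server ArchiveServer }

func (c *ArchiveClientDirect) ApplyBlockBatch(ctx context.Context, req *BlockBatchRequest) (*BlockBatchReply, error) {
	// The request pointer is passed through as-is: nothing is marshalled.
	return c.server.ApplyBlockBatch(ctx, req)
}

// memServer records the highest stored block (demo only).
type memServer struct{ last uint64 }

func (s *memServer) ApplyBlockBatch(_ context.Context, req *BlockBatchRequest) (*BlockBatchReply, error) {
	if req.FromBlock > req.ToBlock {
		return nil, fmt.Errorf("invalid range %d-%d", req.FromBlock, req.ToBlock)
	}
	s.last = req.ToBlock
	return &BlockBatchReply{LastStored: s.last}, nil
}

// directDemo sends one batch through the direct adapter
// and returns the block height the server reports as stored.
func directDemo() (uint64, error) {
	var client ArchiveClient = &ArchiveClientDirect{server: &memServer{}}
	reply, err := client.ApplyBlockBatch(context.Background(), &BlockBatchRequest{FromBlock: 1, ToBlock: 8})
	if err != nil {
		return 0, err
	}
	return reply.LastStored, nil
}

func main() {
	fmt.Println(directDemo())
}
```

Because the caller depends only on the client interface, swapping the direct adapter for a remote one would not change the calling code.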

Key RPC methods:

// Required for google.protobuf.Empty.
import "google/protobuf/empty.proto";

service Archive {
    rpc ApplyBlockBatch(BlockBatchRequest) returns (BlockBatchReply);
    rpc StreamBlockBatches(StreamRequest) returns (stream BlockBatchRequest);
    rpc GetSyncStatus(google.protobuf.Empty) returns (SyncStatusReply);
}

BatchProcessor Framework

After each BlockBatch is durably stored, the archive server runs pluggable post-processors in parallel. Because processors run after the commit, they cannot block or fail execution:

type BatchProcessor interface {
    Name() string
    Process(ctx context.Context, batch *BlockBatch) error
}
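A possible runner for this interface is sketched below. It is an assumption about how the archive server might fan a committed batch out to processors: `runProcessors`, `funcProcessor`, and `processorDemo` are hypothetical names, and collecting errors per processor name is one of several reasonable policies.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// BlockBatch is a trimmed stand-in for the struct defined earlier.
type BlockBatch struct{ FromBlock, ToBlock uint64 }

type BatchProcessor interface {
	Name() string
	Process(ctx context.Context, batch *BlockBatch) error
}

// runProcessors runs every processor concurrently on an already-committed
// batch and returns the errors keyed by processor name. A failing processor
// affects neither the other processors nor the stored batch.
func runProcessors(ctx context.Context, procs []BatchProcessor, b *BlockBatch) map[string]error {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		errs = make(map[string]error)
	)
	for _, p := range procs {
		wg.Add(1)
		go func(p BatchProcessor) {
			defer wg.Done()
			if err := p.Process(ctx, b); err != nil {
				mu.Lock()
				errs[p.Name()] = err
				mu.Unlock()
			}
		}(p)
	}
	wg.Wait()
	return errs
}

// funcProcessor adapts a plain function to BatchProcessor (demo only).
type funcProcessor struct {
	name string
	fn   func(*BlockBatch) error
}

func (f funcProcessor) Name() string { return f.name }

func (f funcProcessor) Process(_ context.Context, b *BlockBatch) error { return f.fn(b) }

// processorDemo runs one healthy and one failing processor
// and returns the names of the processors that failed.
func processorDemo() []string {
	procs := []BatchProcessor{
		funcProcessor{"metrics", func(*BlockBatch) error { return nil }},
		funcProcessor{"indexer", func(*BlockBatch) error { return errors.New("index full") }},
	}
	errs := runProcessors(context.Background(), procs, &BlockBatch{FromBlock: 1, ToBlock: 8})
	failed := make([]string, 0, len(errs))
	for name := range errs {
		failed = append(failed, name)
	}
	return failed
}

func main() {
	fmt.Println(processorDemo())
}
```

Processors share a read-only view of the batch here; a processor that mutated the batch would need its own copy.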

Built-in processors:

Processor	Output
`OtterScanIndexer`	Address→tx mapping, internal transfer index
`LogIndexer`	Bloom filters for topic queries
`TraceIndexer`	Address→trace bitmap
`SnapshotBuilder`	`.seg` files for BitTorrent seeding
`MetricsCollector`	Prometheus metrics (gas stats, tx type distribution)

This resembles a map-reduce pattern: the executor maps blocks to BlockBatch changesets, and the archive reduces them through multiple processors in parallel. Adding a new indexer means registering a new BatchProcessor, with no changes to the executor.

Adding New Storage Targets

Once StorageSink exists, adding a new target requires implementing the StorageSink interface (ApplyBlockBatch and Close) and no executor changes:

Sink	Purpose	Estimated effort
`PostgresSink`	Relational DB for SQL queries	~200 LOC
`KafkaSink`	Stream changesets to Kafka	~150 LOC
`S3Sink`	Cloud storage for segment files	~200 LOC
`FilteredSink`	Store only specific domains	~50 LOC
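As an illustration of how small such a sink can be, here is a sketch of a FilteredSink that forwards only selected domains to a wrapped sink. The `Domain` constants, the `NewFilteredSink` constructor, `captureSink`, and `filterDemo` are hypothetical stand-ins; only the `DomainDiffs` field layout follows the BlockBatch definition earlier in the document.

```go
package main

import (
	"context"
	"fmt"
)

// Domain and its constants stand in for the kv package's domain enumeration.
type Domain int

const (
	AccountsDomain Domain = iota
	StorageDomain
	CodeDomain
	CommitmentDomain
	DomainLen
)

type DomainEntryDiff struct{ Key, Value []byte }

// BlockBatch is trimmed to the fields this sketch needs.
type BlockBatch struct {
	FromBlock, ToBlock uint64
	DomainDiffs        [DomainLen][]DomainEntryDiff
}

type StorageSink interface {
	ApplyBlockBatch(ctx context.Context, batch *BlockBatch) error
	Close() error
}

// FilteredSink forwards only the selected domains to the wrapped sink.
type FilteredSink struct {
	next StorageSink
	keep [DomainLen]bool
}

func NewFilteredSink(next StorageSink, domains ...Domain) *FilteredSink {
	f := &FilteredSink{next: next}
	for _, d := range domains {
		f.keep[d] = true
	}
	return f
}

func (f *FilteredSink) ApplyBlockBatch(ctx context.Context, b *BlockBatch) error {
	filtered := *b // copies the diff array, so the caller's batch stays intact
	for d := range filtered.DomainDiffs {
		if !f.keep[d] {
			filtered.DomainDiffs[d] = nil
		}
	}
	return f.next.ApplyBlockBatch(ctx, &filtered)
}

func (f *FilteredSink) Close() error { return f.next.Close() }

// captureSink remembers the last batch it received (demo only).
type captureSink struct{ last *BlockBatch }

func (c *captureSink) ApplyBlockBatch(_ context.Context, b *BlockBatch) error {
	c.last = b
	return nil
}

func (c *captureSink) Close() error { return nil }

// filterDemo keeps only the accounts domain and returns how many
// account and storage diffs reached the wrapped sink.
func filterDemo() (accounts, storage int) {
	out := &captureSink{}
	sink := NewFilteredSink(out, AccountsDomain)
	var b BlockBatch
	b.DomainDiffs[AccountsDomain] = []DomainEntryDiff{{Key: []byte("a")}}
	b.DomainDiffs[StorageDomain] = []DomainEntryDiff{{Key: []byte("s")}}
	if err := sink.ApplyBlockBatch(context.Background(), &b); err != nil {
		return 0, 0
	}
	return len(out.last.DomainDiffs[AccountsDomain]), len(out.last.DomainDiffs[StorageDomain])
}

func main() {
	fmt.Println(filterDemo())
}
```

Because FilteredSink wraps another StorageSink, it composes with any other target, for example in front of a remote sink to reduce the data sent over the network.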

Implementation Phases

Phase	Scope
1	Define `StorageSink` + `BlockBatch`; implement `LocalSink`; add `FlushToSink()` to SharedDomains alongside existing `Flush()` — zero behavior change
2	Add `ExportDiffs()` and `ExportIndexUpdates()` to `TemporalMemBatch`; protobuf definitions; round-trip serialization test
3	gRPC service + `ArchiveClientDirect` zero-serialization adapter; `ArchiveServer` with local DB backend
4	Remote mode + `TeedSink`; CLI flags (`--archive.mode`, `--archive.addr`, `--archive.tee`)
5	`BatchProcessor` framework; port `OtterScanIndexer` and `SnapshotBuilder` as processors
6	Detached mode via `SnapshotSink`; BitTorrent snapshot discovery and download
