Nodes in Distributed Systems

This chapter defines the concept of a node from first principles, surveys the role-differentiation strategies adopted by major distributed systems, and explains why Zentalk consolidates all infrastructure functions into a single unified Full Node type.

What Is a Node?

Definition

In the study of distributed systems, the term node refers to any independently operating participant that can send, receive, process, and store data within a network. The concept is deliberately abstract: a node may be a physical machine in a data center, a virtual server running in a cloud environment, a process executing on a personal computer, or a mobile device connected over a cellular radio link. What distinguishes a node from a passive network element (such as a switch or a cable) is agency -- a node executes application logic, maintains local state, and makes autonomous decisions about how to handle the data it encounters. A router that merely forwards IP packets according to a static table is infrastructure; a Bitcoin full node that independently validates every transaction against a consensus ruleset, decides whether to propagate it, and stores it in a local copy of the blockchain is a node in the distributed systems sense.

Formally, a distributed system can be modeled as a graph G = (V, E) where V is the set of nodes and E is the set of communication links between them. Each node v in V possesses four fundamental capabilities:

Communication

The node can send messages to, and receive messages from, other nodes in V to which it is connected by edges in E. Communication may be synchronous (bounded delivery time) or asynchronous (no delivery time guarantee). Real-world networks, including the internet, provide no hard bound on delivery time and are therefore modeled as asynchronous -- a property with profound implications for consensus, consistency, and fault tolerance, as established by the FLP impossibility result [Fischer, Lynch, and Paterson 1985] and the CAP theorem discussed in Section 1.6.2.

Computation

The node can execute algorithms on the data it receives. This distinguishes a node from a mere conduit. A relay that decrypts a relay encryption layer, inspects a routing header, and forwards the payload to the next hop is performing computation. A storage node that encodes incoming data with Reed-Solomon erasure coding is performing computation. The computational capability of a node determines the complexity of the protocols it can participate in.

Storage

The node can persist data locally across time. This capability ranges from ephemeral in-memory buffering (sufficient for a relay that queues messages for milliseconds) to durable disk-backed storage (required for a mesh node that holds erasure-coded shards for weeks or months). The durability guarantees a node provides -- whether data survives a power failure, a process restart, or a hardware replacement -- are a critical dimension of its role in the system.

Autonomy

The node operates under its own authority. It may follow a protocol, but no external entity can compel it to do so at the hardware level. This autonomy is the source of both the resilience and the complexity of distributed systems: resilience, because the remaining nodes can detect and route around the failure or misbehavior of any individual node without that node's consent or cooperation; complexity, because the system must function correctly even when some nodes deviate from the expected behavior, whether through failure, misconfiguration, or deliberate malice (the Byzantine fault model of Lamport, Shostak, and Pease [1982], discussed in Section 1.6.3).

The power of a distributed system emerges not from any individual node but from the interactions among nodes. A single Bitcoin node is a database; ten thousand Bitcoin nodes running the same consensus protocol constitute a globally replicated, censorship-resistant ledger. A single Zentalk Full Node is a message relay with a storage backend; hundreds of Full Nodes participating in a Kademlia DHT constitute a self-healing, fault-tolerant mesh network with no single point of failure. The transition from individual capability to collective property is the central phenomenon of distributed systems, and the design of node roles is the mechanism by which system architects shape that transition.

Node Design

The choice of how many node types a distributed system defines, what responsibilities each type carries, and how the types interact determines the system's resilience, performance, operational complexity, and incentive structure. A system with a single node type that performs all functions is simple to reason about but may waste resources (every participant must provision for the most demanding function). A system with many specialized node types can optimize resource allocation but introduces coordination overhead, creates differential trust requirements, and complicates incentive design (how should a relay be compensated relative to a storage node?). Every deployed distributed system of consequence has made a deliberate architectural choice along this spectrum, and the following section surveys the most instructive examples.

Node Roles in Established Distributed Systems

Bitcoin

The Bitcoin network, introduced by Nakamoto [2008], defines three principal node types that illustrate the spectrum from full participation to minimal verification.

Full Nodes

Full nodes download, validate, and store the entire blockchain -- every block header, every transaction, and every unspent transaction output (UTXO) from the genesis block in January 2009 to the present. A full node independently verifies every consensus rule: proof-of-work difficulty, transaction signature validity, input-output balance, and script execution. It trusts no other node; its security guarantee is derived entirely from local verification. The cost is substantial: as of 2025, the Bitcoin blockchain exceeds 600 gigabytes, and initial synchronization from the genesis block requires hours to days of CPU-intensive validation. The benefit is equally substantial: a full node operator has a cryptographically verifiable, independently confirmed copy of the entire ledger, immune to lies or omissions by any other participant.

Light Nodes

Light nodes (also called SPV clients, after the Simplified Payment Verification described in Section 8 of the Bitcoin whitepaper) store only block headers -- an 80-byte summary per block, totaling approximately 70 megabytes for the entire chain history. A light node verifies that a transaction is included in a block by requesting a Merkle proof from a full node: the proof demonstrates that the transaction's hash is a leaf of the Merkle tree whose root is committed in the block header, without requiring the light node to possess the full block. Light nodes sacrifice trust-minimization for efficiency: they trust that the block header chain with the most cumulative proof-of-work represents the valid chain, but they do not independently verify every transaction in every block. An adversary who controls the light node's network connections could, in principle, present a fraudulent chain (though the cost of generating sufficient proof-of-work makes this impractical for Bitcoin's current difficulty).
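The Merkle proof check described above can be sketched in a few lines. This is a minimal illustration, not Bitcoin's wire format: the function names are hypothetical, the leaf list is assumed to be non-empty, and Bitcoin's byte-order conventions for displaying hashes are ignored. It does follow Bitcoin's use of double SHA-256 and its rule of duplicating the last hash on a level with an odd number of entries.

```python
import hashlib

def dsha256(data: bytes) -> bytes:
    """Double SHA-256, the hash Bitcoin uses for Merkle trees."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def _next_level(level):
    """Pair up hashes, duplicating the last one if the level is odd."""
    if len(level) % 2 == 1:
        level = level + [level[-1]]
    return [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    level = list(leaves)
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def merkle_proof(leaves, index):
    """Full-node side: collect the sibling hash at each level for one leaf."""
    proof, level = [], list(leaves)
    while len(level) > 1:
        padded = level + [level[-1]] if len(level) % 2 == 1 else level
        sibling = index ^ 1
        proof.append((padded[sibling], sibling < index))  # (hash, sibling_is_left)
        level = _next_level(level)
        index //= 2
    return proof

def verify_proof(leaf, proof, root):
    """Light-node side: recompute the root from the leaf and the sibling path."""
    h = leaf
    for sibling, sibling_is_left in proof:
        h = dsha256(sibling + h) if sibling_is_left else dsha256(h + sibling)
    return h == root
```

The light node needs only the leaf hash, the proof (one hash per tree level), and the root taken from a block header it already holds; it never sees the other transactions in the block.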

Mining Nodes

Mining nodes perform the additional function of creating new blocks. A mining node assembles a candidate block from pending transactions in its mempool, computes the double SHA-256 hash of the 80-byte block header, and iterates through nonce values until the hash, interpreted as a number, is no greater than the current difficulty target. Mining nodes must also be full nodes (or trust a full node for transaction validation), because including an invalid transaction in a block causes the block to be rejected by the network, wasting the energy expended on proof-of-work. Mining is the economic engine of Bitcoin: miners earn block subsidies and transaction fees, creating the incentive structure that sustains the network's infrastructure.
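The nonce search can be illustrated with a toy loop. This is a sketch under stated assumptions: `header_prefix` stands in for the first 76 bytes of a block header (everything except the 4-byte nonce), the target used in the usage note is far easier than any real Bitcoin target, and real miners also vary other header fields once the 32-bit nonce space is exhausted.

```python
import hashlib
import struct

def dsha256(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header_prefix: bytes, target: int, max_nonce: int = 2**32):
    """Try successive nonces until the header hash is at or below the target."""
    for nonce in range(max_nonce):
        header = header_prefix + struct.pack("<I", nonce)  # 76 + 4 = 80 bytes
        digest = dsha256(header)
        # The 32-byte hash is compared to the target as a little-endian integer.
        if int.from_bytes(digest, "little") <= target:
            return nonce, digest
    return None  # nonce space exhausted without a solution
```

With a target of `1 << 244` (roughly one success per 4,096 attempts), `mine(b"\x00" * 76, 1 << 244)` returns quickly; lowering the target makes the expected number of attempts grow in proportion.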

The Bitcoin node taxonomy illustrates a recurring pattern in distributed systems: full participation provides the strongest guarantees but imposes the highest costs, and lighter participation modes accept additional trust assumptions in exchange for accessibility. This trade-off reappears in every system surveyed below.

Tor

The Tor anonymity network [Dingledine, Mathewson, and Syverson 2004] employs a fundamentally different node taxonomy, organized not by data completeness but by position in a communication circuit and the trust properties that follow from that position.

A Tor circuit typically consists of three relays selected by the client from the Tor directory:

Entry Nodes

Entry nodes (also called guard nodes) are the first hop in a Tor circuit. The entry node knows the client's true IP address (because the client connects to it directly) but does not know the ultimate destination of the traffic. Guard nodes are selected from a subset of long-lived, high-bandwidth relays with established reputations, because the entry position is the most sensitive: an adversary who controls the entry node learns the client's identity (IP address), even though it cannot determine the destination or read the content. By restricting entry node selection to a small, persistent set, Tor limits the probability that a client will select an adversary-controlled entry over the course of many circuits.

Relay Nodes

Relay nodes (also called middle nodes) occupy intermediate positions in the circuit. A middle relay knows only the identity of the immediately preceding and following nodes in the circuit; it cannot determine either the original sender or the final destination. The middle relay decrypts one layer of onion encryption, revealing the address of the next hop, and forwards the payload. It sees neither the client's IP address (hidden behind the entry node) nor the destination (hidden behind the exit node). Middle relays have the weakest trust requirement of any position: even a compromised middle relay learns essentially nothing about the communication.

Exit Nodes

Exit nodes are the final hop. The exit node decrypts the last encryption layer and forwards the traffic to the destination server on the open internet. The exit node knows the destination address and, if the underlying connection is not separately encrypted (e.g., HTTP rather than HTTPS), can observe the traffic content. Exit nodes do not know the client's IP address (hidden behind the entry and middle relays). Exit node operation carries legal and reputational risk, because the exit node's IP address appears as the source of the traffic to the destination, leading to abuse complaints and, in some jurisdictions, legal liability for the traffic it forwards.

The Tor taxonomy demonstrates that node roles can be defined by position in a communication path rather than by data storage responsibility, and that different positions carry fundamentally different trust and risk profiles. This insight directly informs Zentalk's multi-hop relay routing design (Chapter 6, Section 6.3), where relay nodes at different circuit positions observe different subsets of metadata.
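The way each circuit position learns only its next hop can be shown with a toy layered-encryption sketch. Everything here is illustrative: the XOR keystream cipher is not secure and is used only because it needs no third-party library, the JSON layer format is invented for this example, and the per-hop keys are assumed to be already shared (Tor negotiates them during circuit construction).

```python
import hashlib
import json

def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode. NOT secure; illustration only."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def toy_xor(key: bytes, data: bytes) -> bytes:
    """Encrypts and decrypts (XOR is its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))

def wrap(payload: bytes, destination: str, hops):
    """Client side: build layers from the exit inward. hops = [(name, key), ...]."""
    next_hop, blob = destination, payload
    for name, key in reversed(hops):
        layer = json.dumps({"next": next_hop, "data": blob.hex()}).encode()
        blob = toy_xor(key, layer)
        next_hop = name
    return blob

def peel(key: bytes, blob: bytes):
    """Relay side: remove one layer, learning only the next hop."""
    layer = json.loads(toy_xor(key, blob))
    return layer["next"], bytes.fromhex(layer["data"])
```

Wrapping a payload for the path entry, middle, exit and then calling `peel` with each hop's key in turn yields `"middle"`, then `"exit"`, then the destination address together with the original payload. Only the exit's layer contains the destination, and no layer contains the client's address.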

BitTorrent

BitTorrent, the peer-to-peer file distribution protocol, defines node roles based on the completeness and direction of data transfer.

Seeders

Seeders are nodes that possess a complete copy of the file being distributed. A seeder uploads data to other participants but does not need to download anything (it already has the full file). The health of a BitTorrent swarm depends critically on the number and bandwidth of seeders: once the last seeder leaves, the file remains obtainable only if the pieces held by the remaining peers collectively cover the whole file. If even one piece is held by no remaining peer, the file becomes unavailable until a seeder returns, which is why piece selection algorithms such as rarest-first work to keep every piece replicated across the swarm.

Leechers

Leechers are nodes that are in the process of downloading a file. A leecher downloads pieces from seeders and other leechers, and simultaneously uploads pieces it has already received to other leechers. The BitTorrent protocol's "tit-for-tat" incentive mechanism encourages leechers to upload: peers that upload more receive faster downloads from their neighbors, while peers that only download without contributing ("free-riders") are deprioritized. A leecher that completes its download and remains in the swarm transitions to a seeder.
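A simplified version of the tit-for-tat choice can be sketched as follows. The slot count and function name are assumptions for illustration; the structure (reward the peers that have sent the most data recently, plus one randomly chosen "optimistic" slot so that newcomers get a chance to prove themselves) follows the commonly described BitTorrent choking behavior.

```python
import random

def choose_unchoked(rates, regular_slots=3, rng=random):
    """Pick which peers to upload to.

    rates: dict mapping peer id -> bytes/s recently received from that peer.
    Returns the top `regular_slots` contributors plus one optimistic pick.
    """
    ranked = sorted(rates, key=rates.get, reverse=True)
    unchoked = ranked[:regular_slots]
    rest = ranked[regular_slots:]
    if rest:
        # Optimistic unchoke: give one lower-ranked peer a chance to reciprocate.
        unchoked.append(rng.choice(rest))
    return unchoked
```

A peer that uploads nothing never reaches the ranked slots and receives data only when it happens to win the optimistic slot, which is how free-riders end up deprioritized.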

Trackers

Trackers (in the original BitTorrent design) are centralized servers that maintain a directory of which peers are participating in each swarm. A client contacts the tracker to obtain a list of peers sharing a particular file, then connects directly to those peers for data transfer. The tracker is a single point of failure and a point of censorship: shutting down the tracker prevents new peers from discovering the swarm, even though the data itself is fully distributed. The Mainline DHT (based on Kademlia [Maymounkov and Mazieres 2002]) was introduced to eliminate this centralization: peers discover each other through the distributed hash table, removing the tracker dependency entirely.

The BitTorrent taxonomy illustrates two principles relevant to Zentalk's design: first, the distinction between nodes that hold complete data and nodes that hold partial data is fundamental to availability and fault tolerance; and second, centralized coordination elements (trackers) represent architectural vulnerabilities that can be eliminated through DHT-based peer discovery -- a lesson Zentalk applies directly in its Kademlia-based mesh.

Zentalk's Unified Full Node Model

Architecture

Figure: Unified Full Node Architecture. Every Full Node combines Relay, Mesh Storage, and Kademlia DHT modules in a single process. Dashed lines indicate cross-module coordination. All modules share a common database backend and are secured by the same 5,000 CHAIN stake.

Zentalk departs from the specialization strategies described above by defining a single infrastructure node type: the Full Node. Every Full Node in the Zentalk network combines three functional modules:

Relay Module

Handles real-time message delivery, offline message queuing with durability guarantees, multi-hop relay routing for sender anonymity (1-5 hops with per-layer encryption), group and channel message fan-out, and end-to-end encryption key bundle distribution. The Relay Module is the real-time backbone of the network: it ensures that messages reach their recipients with sub-second latency when both parties are online, and that messages are durably queued when the recipient is temporarily disconnected. The detailed architecture of relay routing is presented in Chapter 6.

Mesh Storage

Provides fault-tolerant, encrypted data persistence. Incoming data is encrypted client-side with AES-256-GCM (the mesh node never receives plaintext), encoded with Reed-Solomon (10,5) erasure coding into 15 shards, and distributed across the Kademlia DHT. The Mesh Storage Module also performs anti-entropy repair (detecting and regenerating lost shards automatically), capacity management (monitoring local storage utilization and announcing capacity state to the network), and shard rebalancing (migrating data away from nodes approaching storage limits). The detailed architecture of mesh storage is presented in Chapter 5.

DHT Participation

Every Full Node maintains a Kademlia routing table organized into k-buckets spanning the 256-bit address space. The node participates in DHT key lookups (responding to queries from other nodes seeking the closest nodes to a given key), stores DHT records (capacity announcements, peer routing information), and performs periodic routing table maintenance. DHT participation is the connective tissue of the network: it enables nodes to discover each other, locate stored data, and route messages without any centralized directory.
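The k-bucket organization follows directly from Kademlia's XOR metric. The sketch below shows the two core computations for a 256-bit identifier space. Deriving a node identifier by hashing a public key with SHA-256 is an assumption made for illustration; this chapter does not specify how Zentalk derives identifiers.

```python
import hashlib

ID_BITS = 256

def node_id(public_key: bytes) -> int:
    """Hypothetical ID derivation: SHA-256 of the node's public key."""
    return int.from_bytes(hashlib.sha256(public_key).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: bitwise XOR, interpreted as an integer."""
    return a ^ b

def bucket_index(own_id: int, other_id: int) -> int:
    """Bucket i holds peers whose distance d satisfies 2**i <= d < 2**(i+1)."""
    d = xor_distance(own_id, other_id)
    if d == 0:
        raise ValueError("a node does not store itself in its routing table")
    return d.bit_length() - 1
```

Bucket 255 covers the half of the address space that differs from the node in the most significant bit, bucket 254 covers the next quarter, and so on, so a node keeps progressively finer knowledge of the region closest to its own identifier.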

These three modules execute within a single process and are deployed as a single binary. A Full Node operator provisions a single machine with sufficient compute, memory, and storage resources, and stakes 5,000 CHAIN tokens; the software handles relay, storage, and DHT functions automatically.

Unified Architecture

The decision to consolidate all infrastructure functions into a single node type -- rather than defining separate relay nodes, storage nodes, and DHT nodes as distinct roles -- was a deliberate architectural choice motivated by four considerations.

Operational Simplicity

A network with multiple specialized node types requires operators to choose which type to run, provision hardware appropriate to each type's demands, and manage potentially different software packages with different update cycles. This complexity discourages participation, particularly from smaller operators who lack dedicated infrastructure teams. A single node type reduces the barrier to entry: every operator runs the same software, follows the same deployment guide, and meets the same hardware requirements. The Zentalk network benefits from a larger, more diverse operator set as a result.

Lower Latency

When the relay module and the mesh storage module coexist on the same machine, several operations that would otherwise require network round-trips become local procedure calls. When a relay receives a message for an offline recipient, it can store the message in the local mesh storage module without a network hop to a separate storage node. When a user comes online and requests their queued messages, the relay can retrieve them from local storage without querying a remote mesh node. This co-location reduces offline message storage and retrieval latency from the hundreds of milliseconds typical of a network round-trip to the single-digit milliseconds of a local operation.

Fair Rewards

In a system with specialized node types, the economic incentive design must solve the relative pricing problem: how much should a relay earn per message forwarded relative to what a storage node earns per shard stored? If relaying is more profitable than storage, operators will run relay nodes and the network will have insufficient storage capacity. If storage is more profitable than relaying, the inverse occurs. This balancing problem is inherently difficult because the relative demand for relay versus storage changes with usage patterns, time of day, and geographic distribution. A unified node type eliminates this problem entirely: every node performs both functions, earns rewards proportional to total work performed (messages relayed plus data stored plus uptime), and the protocol need not determine the relative economic value of different service types. The incentive structure analyzed in Chapter 14 is correspondingly simpler and more robust.

Uniform Security

When all nodes perform all functions, the security analysis need not consider differential trust between node types. Every node sees the same categories of data (encrypted message payloads, encrypted storage shards, DHT routing records), operates under the same staking and slashing rules, and presents the same attack surface. There is no "weaker" node type that an adversary might preferentially target. The threat model is homogeneous, which simplifies both formal analysis and operational monitoring.

What Each Module Does: A Conceptual Summary

It is useful to describe the three modules at a conceptual level, independent of implementation details (which are covered in Chapters 5 and 6), to convey how they collaborate within a single node.

The Relay Module is the real-time nervous system of the network. When a user sends a message, the encrypted payload arrives at the user's connected relay. The relay checks whether the intended recipient is connected to the same relay (in which case delivery is immediate) or to a different relay (in which case the message is forwarded through the federation protocol). If the recipient is offline, the relay writes the message to a durable offline queue, ensuring that the message survives relay restarts and will be delivered when the recipient reconnects. For users who have enabled enhanced privacy, the relay participates in multi-hop relay circuits, decrypting a single encryption layer to reveal the next hop and forwarding the message without knowledge of the ultimate sender or recipient (depending on the relay's position in the circuit).
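The three delivery outcomes described above (immediate local delivery, forwarding to another relay, or offline queuing) can be sketched as a small decision procedure. All names are hypothetical: the shared `presence` dictionary stands in for the DHT lookup of a user's current relay, direct method calls stand in for the federation protocol, and the in-memory queue stands in for the durable offline queue. The sketch also assumes the queue lives on the relay that received the message, which this chapter does not specify.

```python
from collections import defaultdict, deque

class Relay:
    def __init__(self, name, presence):
        self.name = name
        self.presence = presence            # user -> relay name (stand-in for DHT)
        self.peers = {}                     # relay name -> Relay (stand-in for federation)
        self.sessions = {}                  # connected user -> delivered messages
        self.offline_queue = defaultdict(deque)

    def connect(self, user):
        """User comes online here: publish presence and drain any queued messages."""
        self.presence[user] = self.name
        self.sessions[user] = list(self.offline_queue.pop(user, ()))
        return self.sessions[user]

    def deliver(self, recipient, ciphertext):
        if recipient in self.sessions:      # connected to this relay
            self.sessions[recipient].append(ciphertext)
            return "delivered"
        home = self.presence.get(recipient)
        if home in self.peers:              # connected to a different relay
            self.peers[home].deliver(recipient, ciphertext)
            return "forwarded"
        self.offline_queue[recipient].append(ciphertext)  # offline: hold for later
        return "queued"
```

The relay handles only ciphertext throughout; none of the three branches requires it to read the message.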

The Mesh Storage Module is the long-term memory of the network. It accepts encrypted data shards produced by the Reed-Solomon erasure coding performed client-side, stores them on local disk, and serves them on request. It periodically verifies the health of data it is responsible for -- querying other nodes to confirm that sibling shards (the other 14 shards of each 15-shard set) remain available -- and initiates repair when health degrades. It monitors its own storage capacity, publishes capacity announcements to the DHT so that other nodes can make informed placement decisions, and migrates shards to less-loaded nodes when its own storage approaches capacity limits.

The DHT layer is the addressing and discovery substrate. Without a centralized directory, nodes must be able to find each other and locate stored data. The Kademlia DHT provides this: each node maintains a routing table of peers organized by XOR distance, and a key lookup proceeds by iteratively querying the closest known nodes until the target key is found, converging in O(log n) hops for a network of n nodes. The DHT also serves as the publication medium for ephemeral network metadata: which nodes are online, what their current capacity state is, which relay a given user is currently connected to (enabling cross-relay message delivery).
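The iterative lookup can be simulated in memory. In this sketch the `network` dictionary (node id mapped to the ids that node knows) is a hypothetical stand-in for real FIND_NODE requests, and peers are queried one at a time, whereas Kademlia issues several requests in parallel. The value k=3 is chosen for the example, not taken from Zentalk's configuration.

```python
def closest(ids, target, k):
    """The k ids nearest to target under the XOR metric."""
    return sorted(ids, key=lambda n: n ^ target)[:k]

def iterative_find_node(network, start, target, k=3):
    """Repeatedly ask the closest known peers for peers even closer to target."""
    shortlist = closest(network[start], target, k)
    queried = {start}
    while True:
        pending = [n for n in shortlist if n not in queried]
        if not pending:
            return shortlist  # every candidate has been asked; nothing closer exists
        for peer in pending:
            queried.add(peer)
            reply = closest(network[peer], target, k)  # simulated FIND_NODE response
            shortlist = closest(set(shortlist) | set(reply), target, k)
```

For example, in a 16-node network where node `n` knows `n ^ 1`, `n ^ 2`, `n ^ 4`, and `n ^ 8` (one contact per bucket), a lookup for id 13 starting from node 0 with `k=2` terminates with node 13 at the head of the result. The same procedure with the node's own identifier as the target is what a joining node runs against a bootstrap node.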

The Relationship Between Nodes and the Network

Node Discovery: DHT Bootstrap

A distributed network with no centralized directory faces a bootstrapping problem: a new node, upon starting for the first time, knows no other nodes. It cannot participate in the DHT because it has no routing table entries; it cannot receive messages because no one knows its address; it cannot store data because no one knows it exists. The bootstrap process resolves this circular dependency.

Zentalk maintains a set of well-known bootstrap nodes -- lightweight infrastructure components that participate in the Kademlia DHT but do not perform relay or storage functions. Their sole purpose is to serve as initial contact points. A new Full Node is configured with the network addresses of one or more bootstrap nodes. Upon starting, it connects to a bootstrap node and executes the Kademlia FIND_NODE operation using its own node identifier as the target key. This causes the bootstrap node to return the closest nodes in its routing table to the new node's identifier. The new node then contacts those nodes, which return their own closest-known nodes, and the process cascades. Within seconds, the new node has populated its routing table with a representative sample of the network, spanning all regions of the 256-bit address space.

The bootstrap nodes are not a single point of failure. Once a node has populated its routing table, it no longer needs the bootstrap nodes; it discovers new peers through the normal DHT refresh process (random key lookups every 10 minutes). Even if all bootstrap nodes were to fail simultaneously, the existing mesh would continue to operate -- only the onboarding of entirely new nodes would be affected, and only until an alternative bootstrap mechanism (such as manual peer exchange or a DNS seed list) was provided.

Automatic Mesh Formation

Once a Full Node has discovered a sufficient set of peers through the DHT bootstrap process, it automatically integrates into the mesh network. This integration proceeds in three phases.

Peer Connection

The node establishes persistent connections to a subset of its DHT-discovered peers, preferring those with low latency, high uptime, and geographic diversity. The connection layer supports multiple transport protocols with automatic NAT traversal, ensuring connectivity even for nodes behind firewalls or network address translators.

Capacity Announcement

The node computes its current storage utilization and publishes its capacity state to the DHT. Other nodes in the network cache this announcement and use it when selecting targets for shard placement. This ensures that the new node begins receiving storage shards within minutes of joining, contributing immediately to the network's aggregate storage capacity and earning rewards from the moment it starts serving data.

Relay Registration

The node announces itself as an available relay through the DHT. Clients seeking a relay (either for initial connection or as a fallback when their current relay becomes unreachable) discover the new node through the same DHT lookup mechanism used for all peer discovery. The geographic routing system (Chapter 6, Section 6.4) factors the new node's region and latency into relay selection scores, directing nearby clients toward it.

The mesh formation process is entirely automatic. No human coordination is required to assign the node a role, allocate it a portion of the address space, or direct traffic toward it. The Kademlia DHT's XOR distance metric deterministically assigns responsibility for key ranges based on node identifiers, and the network rebalances organically as nodes join and depart.

Node Join and Leave Behavior

A distributed network in production is never static. Nodes join when new operators deploy infrastructure. Nodes leave when operators shut down, when hardware fails, when network connectivity is lost, or when staked tokens are withdrawn. The network must handle these transitions gracefully, without data loss, service interruption, or human intervention. This property -- self-healing -- is the operational manifestation of the fault tolerance analyzed mathematically in Section 5A.4.

Node Departure

The health monitoring system detects the departure through heartbeat failure and marks the node as unhealthy. The DHT routing table is updated, removing the departed node from all k-buckets and replacing it with the next most recently seen peer in the same distance range. For storage, the anti-entropy service detects the lost shards during its next check cycle and initiates Reed-Solomon repair: the minimum required sibling shards are retrieved from healthy nodes, all 15 shards are reconstructed, and the missing shards are placed on newly selected healthy nodes. For relay, clients connected to the departed node detect the disconnection, query the DHT for an alternative relay, and reconnect -- typically within seconds. Messages queued in the departed node's offline queue are protected by durable storage guarantees; under normal shutdown or process crash, the queue is recoverable.
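The routing table update on departure can be sketched with a single k-bucket. This is a simplified model, not Zentalk's implementation: the class and method names are hypothetical, the `is_alive` callback stands in for a ping, and the default of k=20 is the value suggested in the Kademlia paper rather than a figure stated in this chapter.

```python
from collections import OrderedDict

class KBucket:
    def __init__(self, k=20):
        self.k = k
        self.contacts = OrderedDict()      # least recently seen first
        self.replacements = OrderedDict()  # candidates waiting for a free slot

    def seen(self, node_id, is_alive=lambda peer: True):
        """Record activity from node_id (a lookup, store, or ping)."""
        self.replacements.pop(node_id, None)
        if node_id in self.contacts:
            self.contacts.move_to_end(node_id)
        elif len(self.contacts) < self.k:
            self.contacts[node_id] = True
        else:
            oldest = next(iter(self.contacts))
            if is_alive(oldest):
                # Keep the long-lived contact; remember the newcomer for later.
                self.contacts.move_to_end(oldest)
                self.replacements[node_id] = True
            else:
                del self.contacts[oldest]
                self.contacts[node_id] = True

    def remove(self, node_id):
        """Heartbeat failure: drop the contact, promote the freshest candidate."""
        if node_id in self.contacts:
            del self.contacts[node_id]
            if self.replacements:
                candidate, _ = self.replacements.popitem(last=True)
                self.contacts[candidate] = True
```

Calling `remove` for a departed node frees its slot and fills it with the most recently seen candidate in the same distance range, so the bucket stays full without any global coordination.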

Node Arrival

The DHT routing tables of existing nodes are updated automatically as the new node participates in DHT operations (lookups, stores, pings). Kademlia adds a newly seen peer to the appropriate k-bucket whenever that bucket has room or an existing entry fails to respond, so active new nodes are incorporated into routing tables as they participate. The new node begins receiving storage shards as other nodes' shard placement algorithms include it among the closest peers for new storage keys. Over time, the shard distribution rebalances as new data is placed on the new node and old data on other nodes expires through TTL. No explicit rebalancing migration is required for new nodes; the natural churn of data creation and expiration, combined with Kademlia's distance-based placement, achieves an approximately uniform distribution.

Self-Healing

The Reed-Solomon (10,5) erasure coding tolerates the simultaneous loss of up to 5 of the 15 nodes holding shards of any given data object. The anti-entropy repair cycle restores full redundancy promptly after detecting shard loss. For data to become irrecoverably lost, more than 5 of the 15 shard-holding nodes must fail simultaneously and remain failed for the duration of the repair cycle -- a scenario whose probability is quantified in Section 5A.4.3, yielding durability exceeding 99.999% under realistic failure assumptions for economically staked operators.
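The arithmetic behind these figures can be made concrete with a simple binomial model. This is an illustrative sketch only: it assumes each shard-holding node fails independently with the same probability `p` within one repair window, which is a simplification and not necessarily the model used in Section 5A.4.3. The value of `p` in the usage note is a hypothetical input, not a measured failure rate.

```python
from math import comb

DATA_SHARDS, PARITY_SHARDS = 10, 5
TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS

def storage_overhead() -> float:
    """Bytes stored per byte of original data."""
    return TOTAL_SHARDS / DATA_SHARDS

def loss_probability(p: float, total=TOTAL_SHARDS, tolerated=PARITY_SHARDS) -> float:
    """P(more than `tolerated` of `total` nodes fail), failures i.i.d. with prob p."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(tolerated + 1, total + 1)
    )
```

`storage_overhead()` is 1.5, compared with 3.0 for keeping three full replicas. Under the independence assumption, a per-node failure probability of 1% per repair window gives a loss probability on the order of a few parts per billion for any single object, because at least six simultaneous failures are needed.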

Network Growth

Unlike centralized systems where adding servers provides diminishing marginal returns (because the central coordinator becomes the bottleneck), a well-designed distributed network improves along multiple dimensions as new nodes join.

Security

Each additional node increases the cost of a Sybil attack. An adversary seeking to control a fraction f of the network must stake f * N * 5,000 CHAIN tokens, where N is the total number of nodes. Doubling the network size doubles the capital required to maintain the same proportional influence. Moreover, a larger node set increases the diversity of operators, jurisdictions, hardware configurations, and network providers, making coordinated compromise more difficult.
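As a worked instance of the formula above, the following sketch computes the stake an adversary would need for a given fraction of the network. The function name is hypothetical; the 5,000 CHAIN figure is the per-node stake stated in this chapter.

```python
STAKE_PER_NODE = 5_000  # CHAIN tokens per Full Node, as stated in the text

def sybil_stake(fraction: float, total_nodes: int) -> float:
    """Stake required to control `fraction` of a network of `total_nodes` nodes."""
    return fraction * total_nodes * STAKE_PER_NODE
```

Controlling a quarter of a 200-node network requires `sybil_stake(0.25, 200)`, which is 250,000 CHAIN; at 400 nodes the same fraction requires 500,000 CHAIN.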

Availability

The Reed-Solomon erasure coding distributes each data object across 15 nodes. In a network of 100 nodes, a given data object's shards span 15% of the network; in a network of 1,000 nodes, they span 1.5%. A localized failure (a data center outage, a regional network partition) affects a smaller fraction of any given object's shards in a larger network, improving the probability that at least 10 of 15 shards remain available for reconstruction.
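The effect of network size on a localized failure can be estimated with a hypergeometric calculation. The sketch assumes that an object's 15 shards sit on 15 distinct nodes chosen uniformly at random and that an outage removes a fixed set of `failed` nodes. Real placement follows XOR distance and real outages are correlated, so this is an illustration of the trend rather than a prediction.

```python
from math import comb

def survival_probability(n_nodes: int, failed: int, shards: int = 15, needed: int = 10) -> float:
    """P(at least `needed` of an object's shards avoid the failed nodes)."""
    max_lost = shards - needed
    total = comb(n_nodes, shards)
    # Count placements in which k shards land on failed nodes, for k = 0..max_lost.
    ok = sum(
        comb(failed, k) * comb(n_nodes - failed, shards - k)
        for k in range(max_lost + 1)
    )
    return ok / total
```

Comparing `survival_probability(100, 20)` with `survival_probability(1000, 20)` shows the same 20-node outage being far less likely to take out six or more shards of one object in the larger network.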

Geographic Distribution

A larger node count enables finer-grained geographic coverage. Users in a given region experience lower latency when a relay node is nearby. With 50 nodes, coverage may be limited to major population centers on two or three continents. With 500 nodes, coverage extends to smaller regions, providing sub-100ms relay latency to a larger fraction of the global user population. The geographic routing system (Chapter 6, Section 6.4) exploits this distribution by directing users to the lowest-latency relay in their region.

Throughput

Because Zentalk's architecture does not require global consensus (unlike Bitcoin, where every node must process every transaction), message throughput scales approximately linearly with the number of nodes. Each node handles the relay and storage traffic for a subset of users. Adding a node adds relay capacity (more concurrent connections) and storage capacity (more disk space for shards). The DHT lookup cost grows only logarithmically -- O(log n) hops for n nodes -- ensuring that routing overhead does not negate the throughput benefit of additional nodes.

Censorship Resistance

A larger, more geographically distributed node set is harder for any single authority to suppress. Blocking Zentalk traffic requires identifying and blocking connections to every Full Node, a task that scales linearly with the number of nodes and geometrically with the number of jurisdictions they span. Economic incentives ensure that nodes in permissive jurisdictions continue to operate, providing connectivity for users in restrictive jurisdictions through the mesh's multi-hop routing.