
Comparison of communication backends: Zenoh vs. torch.distributed

> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Last updated: January 26, 2026

Author: Masahiro Aoki

This document details the technical comparison and decision-making background for architectural changes to the communication backend in EvoSpikeNet's distributed brain simulation.

Purpose and use of this document

  • Purpose: Comparison material to quickly share Zenoh migration background and design decisions.
  • Target audience: Distributed infrastructure personnel, robotics collaboration personnel, PMs.
  • First reading order: comparison table in Section 2 → conceptual diagrams in Section 3 → current status in Section 4.
  • Related links: see implementation/PFC_ZENOH_EXECUTIVE.md for distributed implementation details and examples/run_zenoh_distributed_brain.py for the execution script.

1. Background: Why was it necessary to review the communication architecture?

In the early days of the project, the Distributed Brain was built on top of PyTorch's standard distributed computing library, torch.distributed. It is the de facto standard for data-parallel and model-parallel training of machine learning models and is especially optimized for high-throughput tensor exchange between GPUs.

However, as the project progressed from simple simulation to deployment on physical robots and real-time autonomous decision-making, the following limitations of torch.distributed became major obstacles.

  1. Synchronous/blocking communication: Operations such as send/recv and all_reduce are essentially synchronous and require all participating processes to proceed in lockstep. As a result, a delay in a single node propagated to the entire system (see the sketch after this list).
  2. Static process groups: world_size (the number of participating processes) is fixed at startup, making it extremely difficult to add or remove nodes dynamically during a simulation. This cannot accommodate situations where robot modules fail or new sensors are added.
  3. Single point of failure: When one process crashes, the entire process group often hangs or crashes with it, making the system as a whole less resilient.
  4. HPC-centric design: torch.distributed is optimized for training on GPU clusters and is not necessarily suited to edge devices with uneven resources (such as the small computers mounted in various parts of a robot).
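
To make the first two limitations concrete, here is a minimal sketch of the old pattern. The module roles (PFC, Language) and tensor shapes are illustrative, not the project's actual interfaces.

```python
# Minimal sketch of the old torch.distributed pattern (module roles and
# tensor shapes are illustrative, not the project's actual interfaces).
# Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
import torch
import torch.distributed as dist

def run(rank: int, world_size: int):
    # world_size is fixed here: nodes cannot join or leave after startup.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:            # e.g. the PFC module
        control = torch.zeros(16)
        dist.send(control, dst=1)   # blocks until rank 1 posts a matching recv
    elif rank == 1:          # e.g. the Language module
        buf = torch.empty(16)
        dist.recv(buf, src=0)       # blocks until rank 0 sends

    # Collectives require every rank to participate: one slow or crashed
    # rank stalls the whole group.
    dist.barrier()
    dist.destroy_process_group()
```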

To overcome these challenges and achieve a more dynamic, robust, and scalable distributed system, the decision was made to migrate the communications backend to Zenoh.


2. Technical comparison

| Item | torch.distributed (old architecture) | Zenoh (new architecture) | Reason for selection |
|---|---|---|---|
| Communication model | Synchronous (blocking send/recv, synchronization barriers) | Asynchronous Pub/Sub (Publish/Subscribe) | In real-time systems, an asynchronous model in which each module operates independently is overwhelmingly advantageous. |
| Process management | Static (world_size and rank fixed at startup) | Dynamic (automatic node discovery, free join/leave) | Can respond dynamically to failures or additions of robot modules, increasing flexibility and fault tolerance. |
| Topology | Tightly coupled (all nodes are aware of each other's connections) | Loosely coupled (each node communicates only with the Zenoh router) | Significantly reduces system complexity; adding or changing a node has minimal impact on other nodes. |
| Fault tolerance | Low (a failure in one node tends to spread to the entire system) | High (a failure in one node does not directly affect other nodes) | Prevents the failure of one sensor or actuator from stopping the entire brain. |
| Performance | Optimized for high-throughput tensor transfer between GPUs | Optimized for low-latency messaging | In robot control, the latency of individual messages matters more than raw throughput. |
| Scalability | Proven in HPC clusters with hundreds of nodes | Proven in IoT/robotics with tens of thousands to millions of devices | Chosen with an eye toward future coordination of large robot fleets and the integration of many sensors/actuators. |
| Data format | Specialized for direct transfer of torch.Tensor | Supports any serialization format (JSON, Pickle, Protobuf, etc.) | Easily exchanges structured data other than tensors, such as brain states and intentions. |
| Ecosystem | Limited to the PyTorch ecosystem | High affinity with robotics standards such as DDS and ROS 2 | Lowers the barrier to future integration with robot middleware such as ROS 2. |
| Security ⭐ NEW | Basic security features only | PSK/DH key exchange, AES-256-GCM, forward secrecy (SECURE_DISTRIBUTED_BRAIN.md, DISTRIBUTED_BRAIN_SYSTEM.md#36) | Confidentiality and integrity are ensured by encrypted communication between distributed brain nodes (MT25-EV015 patent implementation), session-based key management, and TLS integration. Encryption overhead is below 5%. |
| Implementation complexity | Rank management and synchronization tend to make code complex | The Pub/Sub model keeps each node's implementation simple | Developers of each module can focus on their own logic rather than communication details. |
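
The asynchronous Pub/Sub model in the table above can be illustrated with a short sketch. The key name pfc/control follows the diagrams in Section 3; payload-decoding calls differ between zenoh-python releases, so treat this as a sketch rather than the project's actual node code.

```python
# Minimal Zenoh Pub/Sub sketch (key names follow Section 3; payload decoding
# differs between zenoh-python releases, so adapt as needed).
import json
import time
import zenoh

session = zenoh.open(zenoh.Config())

def on_control(sample):
    # Called asynchronously from a background thread whenever any node
    # publishes to the key; the publisher never waits for this callback.
    print("received on", sample.key_expr, ":", sample.payload)

sub = session.declare_subscriber("pfc/control", on_control)

# Publishing is fire-and-forget: it does not block on subscribers, and it
# succeeds even if no subscriber exists yet.
pub = session.declare_publisher("pfc/control")
pub.put(json.dumps({"intent": "move", "confidence": 0.9}))

time.sleep(1)       # give background delivery a moment before closing
session.close()
```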

3. Architecture conceptual diagram

3.1. Old: torch.distributed architecture

```mermaid
graph TD
    subgraph "Fixed process group (world_size=4)"
        PFC["Rank 0: PFC"]
        Lang["Rank 1: Language"]
        Vision["Rank 2: Vision"]
        Motor["Rank 3: Motor"]
    end

    PFC -- send/recv --> Lang
    PFC -- send/recv --> Vision
    Vision -- send/recv --> Motor
    Lang -- send/recv --> PFC

    linkStyle 0 stroke-width:2px,fill:none,stroke:red;
    linkStyle 1 stroke-width:2px,fill:none,stroke:red;
    linkStyle 2 stroke-width:2px,fill:none,stroke:red;
    linkStyle 3 stroke-width:2px,fill:none,stroke:red;

    note["Note: All nodes are tightly coupled; a failure in one node affects all"]
```

3.2. New: Zenoh Architecture

```mermaid
graph TD
    subgraph "Dynamic distributed system"
        PFC["PFC Node"]
        Lang["Language Node"]
        Vision["Vision Node"]
        Motor["Motor Node"]
        NewSensor["New Sensor (dynamically added)"]
    end

    Router{{"Zenoh Router"}}

    PFC -- "Publish: pfc/control" --> Router
    Router -- "Subscribe: pfc/control" --> Lang
    Router -- "Subscribe: pfc/control" --> Vision

    Lang -- "Publish: lang/features" --> Router
    Vision -- "Publish: vision/objects" --> Router
    Motor -- "Publish: motor/status" --> Router
    NewSensor -- "Publish: sensor/new_data" --> Router

    Router -- "Subscribe: vision/*, lang/*" --> PFC
    Router -- "Subscribe: vision/objects" --> Motor
    Router -- "Subscribe: sensor/new_data" --> PFC

    linkStyle 0 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 1 stroke-width:2px,fill:none,stroke:green;
    linkStyle 2 stroke-width:2px,fill:none,stroke:green;
    linkStyle 3 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 4 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 5 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 6 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 7 stroke-width:2px,fill:none,stroke:green;
    linkStyle 8 stroke-width:2px,fill:none,stroke:green;
    linkStyle 9 stroke-width:2px,fill:none,stroke:green;

    note["Note: All nodes are loosely coupled via the router, making it easy to add and remove nodes"]
```

4. Current situation and future outlook

4.1. Implementation Status

  • The main communication backend has been fully migrated to Zenoh.
  • run_zenoh_distributed_brain.py is the official execution script for the current distributed brain simulation.
  • docker-compose.yml integrates the zenoh-router service, allowing a Zenoh network to be built easily in a container environment (a client-side connection sketch follows this list).
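
As a minimal sketch of how a node might point its session at that router: the endpoint address and the insert_json5 configuration call are assumptions about the zenoh-python config API, and the project's actual scripts may configure this differently.

```python
# Sketch: connect a node's Zenoh session to the zenoh-router container
# (endpoint and config keys are assumptions; adjust to the actual setup).
import json
import zenoh

conf = zenoh.Config()
# Ask the session to connect to the router rather than relying only on
# peer-to-peer discovery; 7447 is Zenoh's default listening port.
conf.insert_json5("connect/endpoints", json.dumps(["tcp/localhost:7447"]))

session = zenoh.open(conf)
print("session opened")
session.close()
```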

4.2. Handling torch.distributed

  • The legacy run_distributed_brain_simulation.py is kept in the repository for backward compatibility and for specific research purposes (e.g. performance evaluation of highly efficient tensor parallelism between tightly coupled modules). For implementation details of PFC/Zenoh/ExecutiveControl, refer to implementation/PFC_ZENOH_EXECUTIVE.md.
  • However, all new feature development and all work toward deployment on physical robots will be done on the Zenoh-based architecture.
  • In the future, the torch.distributed version may be deprecated and archived.

5. Conclusion

The transition from torch.distributed to Zenoh is a strategically critical architectural change as the EvoSpikeNet project moves from the research phase to the production phase. This change significantly increases the system's robustness, flexibility, and scalability, and establishes the technological foundation for a "true distributed brain" that operates autonomously on a physical robotic platform.