Comparison of communication backends: Zenoh vs. torch.distributed
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
Last updated: January 26, 2026
Author: Masahiro Aoki
Copyright: 2026 Moonlight Technologies Inc. All Rights Reserved.
This document details the technical comparison and decision-making background for architectural changes to the communication backend in EvoSpikeNet's distributed brain simulation.
Purpose and use of this document
- Purpose: Comparison material to quickly share Zenoh migration background and design decisions.
- Target audience: Distributed infrastructure personnel, robotics collaboration personnel, PMs.
- Suggested reading order: comparison table in Section 2 → conceptual diagrams in Section 3 → current status in Section 4.
- Related links: see `implementation/PFC_ZENOH_EXECUTIVE.md` for distributed implementation details and `examples/run_zenoh_distributed_brain.py` for the execution script.
1. Background: Why was it necessary to review the communication architecture?
In the early days of the project, the Distributed Brain was built on top of PyTorch's standard distributed computing library, torch.distributed. It is the de facto standard for data-parallel and model-parallel training of machine learning models, and is optimized in particular for high-throughput tensor exchange between GPUs.
However, as the project progressed from simple simulation to implementation on physical robots and autonomous decision-making in real time, the following limitations of torch.distributed became a major challenge.
- Synchronous/blocking communication: operations such as `send`/`recv` and `all_reduce` are inherently synchronous and require all participating processes to reach the same point together, so a delay in one node delayed the entire system.
- Static process groups: `world_size` (the number of participating processes) is fixed at startup, making it extremely difficult to add or remove nodes during a simulation. This cannot handle robot modules failing or new sensors being attached.
- Single point of failure: when one process crashes, the entire process group often hangs or crashes with it, making the system as a whole less resilient.
- HPC-centric design: torch.distributed is optimized for training on GPU clusters and is not well suited to edge environments with heterogeneous resources, such as the small computers mounted in different parts of a robot.
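The blocking semantics described above can be seen in a minimal sketch. A single-process group (`world_size=1`) is used here purely so the snippet runs standalone; with multiple ranks, `all_reduce` blocks until every rank in the group has entered the call, which is exactly the behavior that made one slow node stall the whole brain:

```python
import torch
import torch.distributed as dist

# Single-process group (world_size=1) only so this snippet runs standalone;
# in the real system each brain module would be a separate rank.
dist.init_process_group(
    backend="gloo",                        # CPU-friendly backend
    init_method="tcp://127.0.0.1:29501",
    rank=0,
    world_size=1,
)

t = torch.tensor([1.0, 2.0, 3.0])
# all_reduce is a collective: it BLOCKS until every rank in the group
# has called it. A single slow or crashed rank stalls all the others.
dist.all_reduce(t, op=dist.ReduceOp.SUM)

dist.destroy_process_group()
print(t.tolist())  # with world_size=1, the sum is just the input
```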
To overcome these challenges and achieve a more dynamic, robust, and scalable distributed system, the decision was made to migrate the communications backend to Zenoh.
2. Technical comparison
| Item | torch.distributed (old architecture) | Zenoh (new architecture) | Reason for selection |
|---|---|---|---|
| Communication model | Synchronous (blocking send/recv, synchronous barriers) | Asynchronous Pub/Sub (publish/subscribe) | In real-time systems, an asynchronous model in which each module can operate independently is overwhelmingly advantageous. |
| Process management | Static (world_size and rank are fixed at startup) | Dynamic (automatic node discovery, free join/leave) | Responds dynamically to failure or addition of robot modules, increasing system flexibility and fault tolerance. |
| Topology | Tightly coupled (all nodes are aware of each other's connections) | Loosely coupled (each node only communicates with the Zenoh router) | Significantly reduces system complexity. The impact of adding or changing a node on other nodes can be minimized. |
| Fault Tolerance | Low (Failure in one node tends to spread to the entire system) | High (Failure in one node does not directly affect other nodes) | Prevents failure of one sensor or actuator from causing the entire brain to stop functioning. |
| Performance | Optimized for high-throughput tensor transfer between GPUs | Optimized for low-latency messaging | In robot control, the latency of individual messages matters more than raw throughput. |
| Scalability | Proven track record in HPC clusters with hundreds of nodes | Proven track record in IoT/Robotics with tens of thousands to millions of devices | Selection with an eye on future collaboration of large-scale robot groups and integration of numerous sensors/actuators. |
| Data format | Specialized for direct exchange of torch.Tensor | Supports arbitrary serialization formats (JSON, Pickle, Protobuf, etc.) | Easily exchanges structured data beyond tensors, such as brain states and intentions. |
| Ecosystem | Limited to within the PyTorch ecosystem | High affinity with Robotics standards such as DDS and ROS2 | Low barriers to collaboration with robot middleware such as ROS2 in the future. |
| Security ⭐ NEW | Basic security features only | PSK/DH key exchange, AES-256-GCM, Forward Secrecy (SECURE_DISTRIBUTED_BRAIN.md, DISTRIBUTED_BRAIN_SYSTEM.md#36) | Confidentiality and integrity are guaranteed by encrypted communication between distributed brain nodes (MT25-EV015 patent implementation), session-based key management, and TLS integration. Encryption overhead <5%. |
| Implementation Complexity | Rank management and synchronization processes tend to be complex | Pub/Sub model simplifies the implementation of each node | Developers of each module can focus on their own logic rather than communication details. |
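As a concrete illustration of the "Data format" row: a pub/sub payload need not be a tensor. A minimal sketch of serializing a structured brain-state message with the standard library (the field names here are hypothetical, not the project's actual schema):

```python
import json

# Hypothetical brain-state message; field names are illustrative only.
state = {
    "node": "pfc",
    "tick": 1024,
    "intent": "reach_for_object",
    "salience": {"vision/objects": 0.82, "lang/features": 0.35},
}

# Encode to bytes for publication on a Zenoh key such as "pfc/state" ...
payload = json.dumps(state).encode("utf-8")

# ... and decode on the subscriber side.
received = json.loads(payload.decode("utf-8"))
print(received["intent"])  # reach_for_object
```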
3. Architecture conceptual diagram
3.1. Old: torch.distributed architecture
```mermaid
graph TD
    subgraph "Fixed process group (world_size=4)"
        PFC["Rank 0: PFC"]
        Lang["Rank 1: Language"]
        Vision["Rank 2: Vision"]
        Motor["Rank 3: Motor"]
    end
    PFC -- send/recv --> Lang
    PFC -- send/recv --> Vision
    Vision -- send/recv --> Motor
    Lang -- send/recv --> PFC
    linkStyle 0 stroke-width:2px,fill:none,stroke:red;
    linkStyle 1 stroke-width:2px,fill:none,stroke:red;
    linkStyle 2 stroke-width:2px,fill:none,stroke:red;
    linkStyle 3 stroke-width:2px,fill:none,stroke:red;
    note["Note: All nodes are tightly coupled; a failure in one node affects all"]
```
3.2. New: Zenoh Architecture
```mermaid
graph TD
    subgraph "Dynamic distributed system"
        PFC["PFC Node"]
        Lang["Language Node"]
        Vision["Vision Node"]
        Motor["Motor Node"]
        NewSensor["New Sensor (added dynamically)"]
    end
    Router{{"Zenoh Router"}}
    PFC -- "Publish: pfc/control" --> Router
    Router -- "Subscribe: pfc/control" --> Lang
    Router -- "Subscribe: pfc/control" --> Vision
    Lang -- "Publish: lang/features" --> Router
    Vision -- "Publish: vision/objects" --> Router
    Motor -- "Publish: motor/status" --> Router
    NewSensor -- "Publish: sensor/new_data" --> Router
    Router -- "Subscribe: vision/*, lang/*" --> PFC
    Router -- "Subscribe: vision/objects" --> Motor
    Router -- "Subscribe: sensor/new_data" --> PFC
    linkStyle 0 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 1 stroke-width:2px,fill:none,stroke:green;
    linkStyle 2 stroke-width:2px,fill:none,stroke:green;
    linkStyle 3 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 4 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 5 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 6 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 7 stroke-width:2px,fill:none,stroke:green;
    linkStyle 8 stroke-width:2px,fill:none,stroke:green;
    linkStyle 9 stroke-width:2px,fill:none,stroke:green;
    note["Note: All nodes are loosely coupled via the router, making it easy to add and remove nodes"]
```
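The routed topology above can be sketched with a toy in-process broker. This is pure Python with no Zenoh dependency; `Router`, `subscribe`, and `publish` are illustrative stand-ins for Zenoh's session API, including `vision/*`-style wildcard subscriptions:

```python
import fnmatch
from collections import defaultdict

class Router:
    """Toy stand-in for a Zenoh router: nodes only ever talk to the router."""
    def __init__(self):
        self._subs = defaultdict(list)  # pattern -> list of callbacks

    def subscribe(self, pattern, callback):
        # Nodes can join at any time -- there is no fixed world_size.
        self._subs[pattern].append(callback)

    def publish(self, key, payload):
        # Deliver to every subscription whose pattern matches the key.
        for pattern, callbacks in self._subs.items():
            if fnmatch.fnmatch(key, pattern):
                for cb in callbacks:
                    cb(key, payload)

router = Router()
pfc_inbox = []
router.subscribe("vision/*", lambda key, payload: pfc_inbox.append((key, payload)))

# A "new sensor" node can start publishing without any other node changing.
router.publish("vision/objects", {"id": 7, "label": "cup"})
router.publish("motor/status", {"ok": True})  # no matching subscriber -> dropped

print(pfc_inbox)  # [('vision/objects', {'id': 7, 'label': 'cup'})]
```

Note how the publisher of `motor/status` neither knows nor cares that nobody is listening: a crashed or missing subscriber cannot stall it, which is the loose-coupling property the diagram illustrates.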
4. Current situation and future outlook
4.1. Implementation Status
- The main communication backend has been fully migrated to Zenoh.
- `run_zenoh_distributed_brain.py` is the official execution script for the current distributed brain simulation.
- `docker-compose.yml` integrates the `zenoh-router` service, so a Zenoh network can easily be built in a container environment.
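For reference, the `zenoh-router` service might look like the following minimal sketch in `docker-compose.yml`. The `eclipse/zenoh` image name and the port mappings are assumptions (7447 is Zenoh's default protocol port, 8000 its REST port); check the repository's actual compose file:

```yaml
services:
  zenoh-router:
    image: eclipse/zenoh        # assumed image name; verify against the repo
    ports:
      - "7447:7447"             # Zenoh peer/client protocol
      - "8000:8000"             # REST API, if enabled
```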
4.2. Handling torch.distributed
- The legacy `run_distributed_brain_simulation.py` is kept in the repository for backward compatibility and specific research purposes (e.g. performance evaluation of highly efficient tensor parallelism between tightly coupled modules). For implementation details of PFC/Zenoh/ExecutiveControl, refer to implementation/PFC_ZENOH_EXECUTIVE.md.
- However, all new feature development and implementation on physical robots is done on the Zenoh-based architecture.
- In the future, the `torch.distributed` version may be deprecated and archived.
5. Conclusion
The transition from torch.distributed to Zenoh is a strategically critical architectural change as the EvoSpikeNet project moves from the research phase to the production phase. It significantly increases the system's robustness, flexibility, and scalability, and establishes the technological foundation for a "true distributed brain" that operates autonomously on a physical robot platform.