
Comparison of communication backends: Zenoh vs. torch.distributed

> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Last updated: January 26, 2026

Author: Masahiro Aoki

This document details the technical comparison and decision-making background for architectural changes to the communication backend in EvoSpikeNet's distributed brain simulation.

Purpose and use of this document

  • Purpose: Comparison material to quickly share Zenoh migration background and design decisions.
  • Target audience: Distributed infrastructure personnel, robotics collaboration personnel, PMs.
  • First reading order: comparison table in Section 2 → conceptual diagrams in Section 3 → current status in Section 4.
  • Related links: see implementation/PFC_ZENOH_EXECUTIVE.md for distributed implementation details and examples/run_zenoh_distributed_brain.py for the execution script.

1. Background: Why was it necessary to review the communication architecture?

In the early days of the project, the Distributed Brain was built on top of PyTorch's standard distributed computing library, torch.distributed. It is the de facto standard for data-parallel and model-parallel training of machine learning models and is especially optimized for high-throughput tensor exchange between GPUs.

However, as the project progressed from simple simulation to deployment on physical robots and real-time autonomous decision-making, the following limitations of torch.distributed became major obstacles.

  1. Synchronous/blocking communication: Operations such as send/recv and all_reduce are essentially synchronous and require all participating processes to proceed in lockstep. As a result, a delay in a single node propagated to the entire system (see the sketch after this list).
  2. Static process groups: world_size (the number of participating processes) is fixed at startup, making it extremely difficult to add or remove nodes dynamically during a simulation. This cannot accommodate situations where robot modules fail or new sensors are added.
  3. Single point of failure: When one process crashes, the entire process group often hangs or crashes with it, making the system as a whole less resilient.
  4. HPC-centric design: torch.distributed is optimized for training on GPU clusters and is not necessarily suited to edge devices with uneven resources (such as the small computers mounted in various parts of a robot).
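
To make the first two limitations concrete, here is a minimal sketch of the old pattern. The module roles (PFC, Language) and tensor shapes are illustrative, not the project's actual interfaces.

```python
# Minimal sketch of the old torch.distributed pattern (module roles and
# tensor shapes are illustrative, not the project's actual interfaces).
# Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
import torch
import torch.distributed as dist

def run(rank: int, world_size: int):
    # world_size is fixed here: nodes cannot join or leave after startup.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:            # e.g. the PFC module
        control = torch.zeros(16)
        dist.send(control, dst=1)   # blocks until rank 1 posts a matching recv
    elif rank == 1:          # e.g. the Language module
        buf = torch.empty(16)
        dist.recv(buf, src=0)       # blocks until rank 0 sends

    # Collectives require every rank to participate: one slow or crashed
    # rank stalls the whole group.
    dist.barrier()
    dist.destroy_process_group()
```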

To overcome these challenges and achieve a more dynamic, robust, and scalable distributed system, the decision was made to migrate the communications backend to Zenoh.


2. Technical comparison

| Item | torch.distributed (old architecture) | Zenoh (new architecture) | Reason for selection |
|---|---|---|---|
| Communication model | Synchronous (blocking send/recv, synchronization barriers) | Asynchronous Pub/Sub (Publish/Subscribe) | In real-time systems, an asynchronous model in which each module operates independently is overwhelmingly advantageous. |
| Process management | Static (world_size and rank fixed at startup) | Dynamic (automatic node discovery, free join/leave) | Can respond dynamically to failures or additions of robot modules, increasing flexibility and fault tolerance. |
| Topology | Tightly coupled (all nodes are aware of each other's connections) | Loosely coupled (each node communicates only with the Zenoh router) | Significantly reduces system complexity; adding or changing a node has minimal impact on other nodes. |
| Fault tolerance | Low (a failure in one node tends to spread to the entire system) | High (a failure in one node does not directly affect other nodes) | Prevents the failure of one sensor or actuator from stopping the entire brain. |
| Performance | Optimized for high-throughput tensor transfer between GPUs | Optimized for low-latency messaging | In robot control, the latency of individual messages matters more than raw throughput. |
| Scalability | Proven in HPC clusters with hundreds of nodes | Proven in IoT/robotics with tens of thousands to millions of devices | Chosen with an eye toward future coordination of large robot fleets and the integration of many sensors/actuators. |
| Data format | Specialized for direct transfer of torch.Tensor | Supports any serialization format (JSON, Pickle, Protobuf, etc.) | Easily exchanges structured data other than tensors, such as brain states and intentions. |
| Ecosystem | Limited to the PyTorch ecosystem | High affinity with robotics standards such as DDS and ROS 2 | Lowers the barrier to future integration with robot middleware such as ROS 2. |
| Security ⭐ NEW | Basic security features only | PSK/DH key exchange, AES-256-GCM, forward secrecy (SECURE_DISTRIBUTED_BRAIN.md, DISTRIBUTED_BRAIN_SYSTEM.md#36) | Confidentiality and integrity are ensured by encrypted communication between distributed brain nodes (MT25-EV015 patent implementation), session-based key management, and TLS integration. Encryption overhead is below 5%. |
| Implementation complexity | Rank management and synchronization tend to make code complex | The Pub/Sub model keeps each node's implementation simple | Developers of each module can focus on their own logic rather than communication details. |
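
The asynchronous Pub/Sub model in the table above can be illustrated with a short sketch. The key name pfc/control follows the diagrams in Section 3; payload-decoding calls differ between zenoh-python releases, so treat this as a sketch rather than the project's actual node code.

```python
# Minimal Zenoh Pub/Sub sketch (key names follow Section 3; payload decoding
# differs between zenoh-python releases, so adapt as needed).
import json
import time
import zenoh

session = zenoh.open(zenoh.Config())

def on_control(sample):
    # Called asynchronously from a background thread whenever any node
    # publishes to the key; the publisher never waits for this callback.
    print("received on", sample.key_expr, ":", sample.payload)

sub = session.declare_subscriber("pfc/control", on_control)

# Publishing is fire-and-forget: it does not block on subscribers, and it
# succeeds even if no subscriber exists yet.
pub = session.declare_publisher("pfc/control")
pub.put(json.dumps({"intent": "move", "confidence": 0.9}))

time.sleep(1)       # give background delivery a moment before closing
session.close()
```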

3. Architecture conceptual diagram

3.1. Old: torch.distributed architecture

```mermaid
graph TD
    subgraph "Fixed process group (world_size=4)"
        PFC["Rank 0: PFC"]
        Lang["Rank 1: Language"]
        Vision["Rank 2: Vision"]
        Motor["Rank 3: Motor"]
    end

    PFC -- send/recv --> Lang
    PFC -- send/recv --> Vision
    Vision -- send/recv --> Motor
    Lang -- send/recv --> PFC

    linkStyle 0 stroke-width:2px,fill:none,stroke:red;
    linkStyle 1 stroke-width:2px,fill:none,stroke:red;
    linkStyle 2 stroke-width:2px,fill:none,stroke:red;
    linkStyle 3 stroke-width:2px,fill:none,stroke:red;

    note["Note: All nodes are tightly coupled; a failure in one node affects all"]
```

3.2. New: Zenoh Architecture

```mermaid
graph TD
    subgraph "Dynamic distributed system"
        PFC["PFC Node"]
        Lang["Language Node"]
        Vision["Vision Node"]
        Motor["Motor Node"]
        NewSensor["New Sensor (dynamically added)"]
    end

    Router{{"Zenoh Router"}}

    PFC -- "Publish: pfc/control" --> Router
    Router -- "Subscribe: pfc/control" --> Lang
    Router -- "Subscribe: pfc/control" --> Vision

    Lang -- "Publish: lang/features" --> Router
    Vision -- "Publish: vision/objects" --> Router
    Motor -- "Publish: motor/status" --> Router
    NewSensor -- "Publish: sensor/new_data" --> Router

    Router -- "Subscribe: vision/*, lang/*" --> PFC
    Router -- "Subscribe: vision/objects" --> Motor
    Router -- "Subscribe: sensor/new_data" --> PFC

    linkStyle 0 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 1 stroke-width:2px,fill:none,stroke:green;
    linkStyle 2 stroke-width:2px,fill:none,stroke:green;
    linkStyle 3 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 4 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 5 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 6 stroke-width:2px,fill:none,stroke:blue;
    linkStyle 7 stroke-width:2px,fill:none,stroke:green;
    linkStyle 8 stroke-width:2px,fill:none,stroke:green;
    linkStyle 9 stroke-width:2px,fill:none,stroke:green;

    note["Note: All nodes are loosely coupled via the router, making it easy to add and remove nodes"]
```

4. Current situation and future outlook

4.1. Implementation Status

  • The main communication backend has been fully migrated to Zenoh.
  • run_zenoh_distributed_brain.py is the official execution script for the current distributed brain simulation.
  • docker-compose.yml integrates the zenoh-router service, allowing a Zenoh network to be built easily in a container environment (a client-side connection sketch follows this list).
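
As a minimal sketch of how a node might point its session at that router: the endpoint address and the insert_json5 configuration call are assumptions about the zenoh-python config API, and the project's actual scripts may configure this differently.

```python
# Sketch: connect a node's Zenoh session to the zenoh-router container
# (endpoint and config keys are assumptions; adjust to the actual setup).
import json
import zenoh

conf = zenoh.Config()
# Ask the session to connect to the router rather than relying only on
# peer-to-peer discovery; 7447 is Zenoh's default listening port.
conf.insert_json5("connect/endpoints", json.dumps(["tcp/localhost:7447"]))

session = zenoh.open(conf)
print("session opened")
session.close()
```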

4.2. Handling torch.distributed

  • The legacy run_distributed_brain_simulation.py is kept in the repository for backward compatibility and for specific research purposes (e.g. performance evaluation of highly efficient tensor parallelism between tightly coupled modules). For implementation details of PFC/Zenoh/ExecutiveControl, refer to implementation/PFC_ZENOH_EXECUTIVE.md.
  • However, all new feature development and all work toward deployment on physical robots will be done on the Zenoh-based architecture.
  • In the future, the torch.distributed version may be deprecated and archived.

5. Conclusion

The transition from torch.distributed to Zenoh is a strategically critical architectural change as the EvoSpikeNet project moves from the research phase to the production phase. This change significantly increases the system's robustness, flexibility, and scalability, and establishes the technological foundation for a "true distributed brain" that operates autonomously on a physical robotic platform.