EvoSpikeNet Build & Service Matrix

[!NOTE] For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

This page summarizes the services and volumes started by each build target, along with the main ways to use them. docker compose v2 is assumed.
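
To confirm that the Compose v2 plugin is in use, you can check its version:

# Check that the Compose v2 plugin is available
docker compose version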

Main compose file

| Usage | File | Main target | Typical command example |
| --- | --- | --- | --- |
| Core development/demo | docker-compose.yml | API/Frontend/DB/RAG (optional) | docker compose up -d api frontend |
| Jupyter/development tools | docker-compose.yml | notebook/mkdocs/dev | docker compose up -d notebook |
| RAG minimum configuration | docker-compose.yml (profile rag) | rag-api/milvus/elasticsearch | docker compose --profile rag up -d rag-api |
| Large-scale learning (GPU/CPU) | docker-compose.train.yml | llm-trainer-gpu / llm-trainer-cpu | docker compose -f docker-compose.train.yml up -d llm-trainer-gpu |
| Microservice division | docker-compose.microservices.yml | gateway/training/inference etc. | docker compose -f docker-compose.microservices.yml up -d gateway |
| GPU resource allocation overlay | docker-compose.gpu.yml | GPU allocation to existing services | docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d api |
| GPU single trainer | docker-compose.gpu-only.yml | llm-trainer-gpu (single) | docker compose -f docker-compose.gpu-only.yml up -d llm-trainer-gpu |
| CPU-only trainer | docker-compose.cpu-only.yml | llm-trainer-cpu (single) | docker compose -f docker-compose.cpu-only.yml up -d llm-trainer-cpu |
| Distributed node experiment | docker-compose.distributed.yml | brain-node-1..3 + zenoh-router + model-server (optional) | docker compose -f docker-compose.distributed.yml up -d |

Core development stack (docker-compose.yml)

  • api: FastAPI server (8000). Depends: postgres, zenoh-router. Volume: saved_models, shared_tmp.
  • frontend: Dash UI (8050/8051). Depends: api, milvus-standalone, elasticsearch, postgres. Volume: saved_models, shared_tmp.
  • dev: Dash instance for development (8052→8050, 8080, 8765). Intended for code hot reload.
  • notebook: Jupyter Lab (8888). Connect to API/RAG. Volume: saved_models, shared_tmp.
  • mkdocs: Documentation server (8001). profile full.
  • rag-api: API for RAG (external 8101 / internal 8001). profile rag. Volume: rag-system/data.
  • zenoh-router: Router for distributed nodes (7447/tcp+udp, 7446/udp).
  • milvus-standalone: Vector DB (19530, 9091). Depends: etcd, minio. Volume: milvus_data.
  • elasticsearch: for logs/search (9200, 9300).
  • postgres: Main DB (5432). Volume: postgres_data.
  • etcd/minio: Milvus dependencies.
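
Once the core services are up, their status and logs can be checked with standard Compose commands, for example:

# List the state of the core services
docker compose ps api frontend postgres

# Follow the API logs
docker compose logs -f api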

Main volumes

  • milvus_data, milvus_etcd, milvus_minio: RAG/Milvus persistence.
  • saved_models: Model artifact sharing (api/frontend/notebook).
  • postgres_data: DB persistence.
  • shared_tmp: Temporary space sharing.
  • rag-system/data: RAG data (mounted into the rag-api service).
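
The named volumes can be inspected with the Docker CLI. Note that the actual volume names are usually prefixed with the Compose project name, so the grep below is only a sketch:

# Locate the saved_models volume (name prefix depends on the project name)
docker volume ls | grep saved_models

# Show where a volume is stored on the host
docker volume inspect <volume-name>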

Typical startup example

# API + Frontend (development default)
docker compose up -d api frontend

# RAG set (profile rag)
docker compose --profile rag up -d rag-api milvus-standalone elasticsearch

# notebook only
docker compose up -d notebook

Large-scale training stack (docker-compose.train.yml)

  • llm-trainer-gpu: GPU trainer (8000). Volume: ./data, ./saved_models, ./logs, ./config. NVIDIA runtime required.
  • llm-trainer-cpu: CPU trainer (8001→internal 8000). Volume: same as above.
  • nginx (optional): Reverse proxy in front of the GPU/CPU trainers on 8080.

Startup example

# GPU trainer
docker compose -f docker-compose.train.yml up -d llm-trainer-gpu

# CPU trainer
docker compose -f docker-compose.train.yml up -d llm-trainer-cpu

# Combined with proxy
docker compose -f docker-compose.train.yml up -d nginx

Standalone GPU/CPU trainers (simple configuration)

  • docker-compose.gpu-only.yml: Start llm-trainer-gpu alone (8000). Environment: CUDA_VISIBLE_DEVICES, TORCH_USE_CUDA_DSA, DEVICE_TYPE=gpu.
  • docker-compose.cpu-only.yml: Start llm-trainer-cpu alone (8001 → internal 8000). Environment: OMP_NUM_THREADS/MKL_NUM_THREADS, DEVICE_TYPE=cpu.

Startup example

# GPU alone
docker compose -f docker-compose.gpu-only.yml up -d llm-trainer-gpu

# CPU alone
docker compose -f docker-compose.cpu-only.yml up -d llm-trainer-cpu
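
The environment variables listed above can also be supplied at startup. The example below is only a sketch and assumes the compose files pass these variables through to the containers:

# Restrict the GPU trainer to the first GPU
CUDA_VISIBLE_DEVICES=0 docker compose -f docker-compose.gpu-only.yml up -d llm-trainer-gpu

# Limit CPU thread counts for the CPU trainer
OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 docker compose -f docker-compose.cpu-only.yml up -d llm-trainer-cpu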

GPU overlay (GPU allocation to existing compose)

  • docker-compose.gpu.yml: An overlay that grants GPU resources to existing services such as dev/test/prod/frontend. It is used in combination with the base docker-compose.yml.

Startup example

# Example of assigning GPU to API + Frontend
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d api frontend

Distributed node configuration (docker-compose.distributed.yml)

  • brain-node-1..3: Each starts FastAPI on 8001/8002/8003 and connects as a Zenoh peer.
  • zenoh-router: Distributed communication router (7447/tcp+udp, 7446/udp).
  • model-server (optional): Dedicated service for video/audio analysis (9002→8000). Whisper dependencies are managed in a separate image.

Environment variables for distributed ASR/Whisper

  • VIDEO_ANALYSIS_ASR_BACKEND: asr_fallback (default) or whisper_real
  • VIDEO_ANALYSIS_WHISPER_MODEL: Whisper model size (e.g. tiny, base)
  • VIDEO_ANALYSIS_WHISPER_DEVICE: Execution device (e.g. cpu, cuda)
  • VIDEO_ANALYSIS_ASR_PREPROCESS: Preprocessing ON/OFF (1/0)

Startup example

# Distributed 3 nodes + router
docker compose -f docker-compose.distributed.yml up -d

# Enabling Whisper and launching distributed nodes
VIDEO_ANALYSIS_ASR_BACKEND=whisper_real \
VIDEO_ANALYSIS_WHISPER_MODEL=base \
docker compose -f docker-compose.distributed.yml up -d

# Start including dedicated model-server
ENABLE_WHISPER=true \
VIDEO_ANALYSIS_ASR_BACKEND=whisper_real \
docker compose -f docker-compose.distributed.yml up -d model-server brain-node-1 brain-node-2 brain-node-3 zenoh-router

Microservice configuration (docker-compose.microservices.yml)

  • gateway: API gateway (8000). Routes to the services below.
  • training: Training service (8001). Volume: ./artifacts, ./data.
  • inference: Inference service (8002). Volume: ./artifacts.
  • model-registry: Model management (8003). Volume: ./model_registry.
  • monitoring: metrics aggregation (8004).
  • postgres: Common DB (5432). Volume: postgres_data.
  • zenoh-router: Distributed communication.

Startup example

# Batch startup
docker compose -f docker-compose.microservices.yml up -d

# gateway only
docker compose -f docker-compose.microservices.yml up -d gateway

What each stack enables

  • api + frontend: Runs the core dashboard and API. The SDK can be used and model training can be invoked.
  • rag-api + milvus + elasticsearch: RAG pipeline (embedded search, log search).
  • notebook: Experimental environment (Jupyter) connected to all services.
  • llm-trainer-(gpu|cpu): Standalone execution of large-scale training jobs. Artifacts are saved under saved_models/logs.
  • microservices stack: Operates training/inference/model management/monitoring in a loosely coupled manner behind the gateway.

RAG system startup procedure (linked to the rag-system directory)

  • Service: rag-api (external 8101 / internal 8001), dependencies: milvus-standalone, elasticsearch. Data: Mount ./rag-system/data to /home/appuser/app/rag-system/data.
  • Environment variables: EVOSPIKENET_API_KEY / EVOSPIKENET_API_KEYS, MILVUS_HOST=milvus-standalone, ELASTICSEARCH_HOST=elasticsearch.
  • Execution location: Run docker compose in the repository root (no need to move to rag-system).

Startup example

# RAG dependency set (includes Milvus/Elasticsearch)
docker compose --profile rag up -d rag-api milvus-standalone elasticsearch

# Check RAG API logs
docker compose --profile rag logs -f rag-api

# Stop
docker compose --profile rag down
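
If API authentication is enabled, the key can be provided at startup. This is a sketch and assumes the compose file forwards EVOSPIKENET_API_KEY to the rag-api container:

# Start rag-api with an API key set
EVOSPIKENET_API_KEY=your-key docker compose --profile rag up -d rag-api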

RAG data location

  • Persistent data: rag-system/data (on the host). Holds the vector index data and indexes.
  • Milvus/Elasticsearch persistent volumes: milvus_data, milvus_etcd, milvus_minio (Milvus); Elasticsearch data is container-local.
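
Because rag-system/data lives on the host, it can be backed up with ordinary tools, for example:

# Archive the RAG data directory (run from the repository root)
tar czf rag-data-backup.tar.gz rag-system/data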

LLM training wrapper for distributed_brain (full 30 ranks)

  • Script: scripts/run_distributed_brain_llm.sh
    • Role: (1) data collection (RUN_DATA_COLLECTION=1 by default), (2) serial execution of train_llm_models.py per rank.
    • Main environment variables: CONFIG (default: config/training_config.yaml), CATEGORY (e.g. full_brain_llm / text_generation), RANKS (space-separated rank list), GPU (1 to give --gpu, 0 to run on CPU), RUN_DATA_COLLECTION (0 to skip collection).
  • Execution steps (full 30 ranks, with data collection)
export CONFIG=config/training_config.yaml
export CATEGORY=full_brain_llm
export RANKS="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29"
export GPU=1                 # 0 if running on CPU
export RUN_DATA_COLLECTION=1 # 0 to skip data collection

./scripts/run_distributed_brain_llm.sh

Notes:
  • RANKS must be space-separated (commas are not supported).
  • Running all 30 ranks consumes significant computational resources; check GPU/CPU and storage availability.
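
For a smaller smoke test before committing to all 30 ranks, a reduced run using the same environment variables might look like this:

# A few ranks on CPU, skipping data collection
CATEGORY=full_brain_llm RANKS="0 1 2" GPU=0 RUN_DATA_COLLECTION=0 ./scripts/run_distributed_brain_llm.sh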

Python environment setup before execution (PEP 668 avoidance)

On systems where Python is externally managed (e.g. Homebrew), pip install is blocked by PEP 668, so run it inside a virtual environment. Python 3.10/3.11 is recommended.

# Run in project root
python3 -m venv .venv         # Skip if existing
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
# If you also collect data
python -m pip install -r scripts/requirements-llm-data.txt

# then run the wrapper
./scripts/run_distributed_brain_llm.sh

Notes:
  • Avoid --break-system-packages; always work inside a virtual environment.
  • If an existing venv311/ (or similar) is present, source venv311/bin/activate also works.

Version constraints on dependent packages (e.g. Ray)

  • Some packages, such as Ray 2.31.0, do not support Python 3.13+. Use Python 3.10/3.11 (or 3.12).
  • With python@3.14 from macOS Homebrew, ray==2.31.0 cannot be resolved and pip returns "No matching distribution". Recreate the virtual environment with Python 3.10/3.11.
  • If pip reports Invalid requirement: '#', make sure the requirements file is passed with -r (python -m pip install -r requirements.txt) and update pip (python -m pip install --upgrade pip).
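
If the active interpreter is too new (for example Homebrew python@3.14), recreate the virtual environment against a supported version. The sketch below assumes python3.11 is installed on the host:

# Recreate the venv with Python 3.11
rm -rf .venv
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt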

Points about environment variables

  • API_URL / RAG_API_URL / EEG_WS_URL: Specify the connection destination from the frontend or notebook (see the sketch after this list).
  • EVOSPIKENET_API_KEY / EVOSPIKENET_API_KEYS: Keys for API authentication.
  • ENABLE_GPU: If set to true, GPU-specific packages (bitsandbytes etc.) will be additionally installed. Default false.
  • BASE_IMAGE: Switch base image. The default if not specified is ubuntu:22.04 (no CUDA).
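
For the connection-related variables, the targets can be overridden at startup. This is a sketch; it assumes the compose files substitute these variables, and the hosts/ports should be adjusted to your environment:

# Point the frontend at a specific API and RAG API
API_URL=http://localhost:8000 RAG_API_URL=http://localhost:8101 docker compose up -d frontend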

How to use BASE_IMAGE

| Purpose | BASE_IMAGE | ENABLE_GPU |
| --- | --- | --- |
| CPU build (default) | ubuntu:22.04 | false |
| GPU build (CUDA 12.4) | nvidia/cuda:12.4.1-base-ubuntu22.04 | true |
| GPU build (CUDA 12.1) | nvidia/cuda:12.1.1-base-ubuntu22.04 | true |

# CPU build (default, no CUDA)
docker build .

# GPU build (CUDA image + bitsandbytes, etc.)
docker build . \
  --build-arg BASE_IMAGE=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --build-arg ENABLE_GPU=true

Note: The base/notebook services in docker-compose.yml use nvidia/cuda:12.4.1-base-ubuntu22.04 by default, while the test service is fixed to ubuntu:22.04. When using docker compose up in a CPU-only environment, override these with environment variables as shown below.
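
For example, on a CPU-only host:

# Bring up the stack without CUDA base images
BASE_IMAGE=ubuntu:22.04 ENABLE_GPU=false docker compose up -d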

About BuildKit caches

The Dockerfile supports BuildKit caching via --mount=type=cache, so subsequent builds do not need to re-download pip packages (including PyTorch).

For Docker < 23.0, explicitly enable BuildKit:

export DOCKER_BUILDKIT=1
docker build .

Reference documents