# EvoSpikeNet Build & Service Matrix
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
This page summarizes the services and volumes started by each build target, along with the main ways to use them. `docker compose` commands assume Compose v2.
## Main compose files
| Usage | File | Main target | Typical command example |
|---|---|---|---|
| Core development/demo | docker-compose.yml | API/Frontend/DB/RAG (optional) | docker compose up -d api frontend |
| Jupyter/Development Tools | docker-compose.yml | notebook/mkdocs/dev | docker compose up -d notebook |
| RAG minimum configuration | docker-compose.yml (profile rag) | rag-api/milvus/elasticsearch | docker compose --profile rag up -d rag-api |
| Large-scale training (GPU/CPU) | docker-compose.train.yml | llm-trainer-gpu / llm-trainer-cpu | docker compose -f docker-compose.train.yml up -d llm-trainer-gpu |
| Microservice split | docker-compose.microservices.yml | gateway/training/inference etc. | docker compose -f docker-compose.microservices.yml up -d gateway |
| GPU resource allocation overlay | docker-compose.gpu.yml | GPU allocation to existing services | docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d api |
| GPU single trainer | docker-compose.gpu-only.yml | llm-trainer-gpu (single) | docker compose -f docker-compose.gpu-only.yml up -d llm-trainer-gpu |
| CPU-only trainer | docker-compose.cpu-only.yml | llm-trainer-cpu (single) | docker compose -f docker-compose.cpu-only.yml up -d llm-trainer-cpu |
| Distributed node experiment | docker-compose.distributed.yml | brain-node-1..3 + zenoh-router + model-server (optional) | docker compose -f docker-compose.distributed.yml up -d |
## Core development stack (docker-compose.yml)
- api: FastAPI server (8000). Depends: postgres, zenoh-router. Volume: saved_models, shared_tmp.
- frontend: Dash UI (8050/8051). Depends: api, milvus-standalone, elasticsearch, postgres. Volume: saved_models, shared_tmp.
- dev: Dash app for development (8052→8050, 8080, 8765), intended for code hot reloading.
- notebook: Jupyter Lab (8888). Connects to the API/RAG. Volume: saved_models, shared_tmp.
- mkdocs: Documentation server (8001). Profile: `full`.
- rag-api: API for RAG (external 8101 / internal 8001). Profile: `rag`. Volume: rag-system/data.
- zenoh-router: Router for distributed nodes (7447/tcp+udp, 7446/udp).
- milvus-standalone: Vector DB (19530, 9091). Depends: etcd, minio. Volume: milvus_data.
- elasticsearch: Log/search backend (9200, 9300).
- postgres: Main DB (5432). Volume: postgres_data.
- etcd/minio: Milvus dependencies.
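To confirm the core stack actually came up, a quick check like the following helps (a minimal sketch; the `/docs` path assumes FastAPI's default Swagger UI rather than a project-specific health endpoint):

```bash
# List the running services and their states
docker compose ps

# Smoke-test the API (8000) and the Dash frontend (8050) from the host
curl -sf http://localhost:8000/docs > /dev/null && echo "api: OK"
curl -sf http://localhost:8050/ > /dev/null && echo "frontend: OK"
```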
## Main volumes
- milvus_data, milvus_etcd, milvus_minio: RAG/Milvus persistence.
- saved_models: Model artifact sharing (api/frontend/notebook).
- postgres_data: DB persistence.
- shared_tmp: Temporary space sharing.
- rag-system/data: RAG data (mounted into the rag-api service).
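To see where these volumes live and to back one up, the generic Docker commands below can be used; note that Compose usually prefixes volume names with the project name (e.g. `<project>_saved_models`), so check the exact names with `docker volume ls` first:

```bash
# Named volumes created by the compose project (names may carry a project prefix)
docker volume ls | grep -E 'saved_models|postgres_data|milvus'

# Inspect the host path backing a volume (adjust the name to what docker volume ls reports)
docker volume inspect postgres_data

# Ad-hoc backup of a volume into a tarball (generic pattern, not a project script)
docker run --rm -v saved_models:/src -v "$PWD":/backup alpine \
  tar czf /backup/saved_models.tgz -C /src .
```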
## Typical startup examples

```bash
# API + Frontend (development default)
docker compose up -d api frontend

# RAG set (profile: rag)
docker compose --profile rag up -d rag-api milvus-standalone elasticsearch

# notebook only
docker compose up -d notebook
```
## Large-scale training stack (docker-compose.train.yml)
- llm-trainer-gpu: GPU trainer (8000). Volume: ./data, ./saved_models, ./logs, ./config. NVIDIA runtime required.
- llm-trainer-cpu: CPU trainer (8001→internal 8000). Volume: same as above.
- nginx (optional): Reverse proxy in front of the GPU/CPU trainers on 8080.
### Startup example

```bash
# GPU trainer
docker compose -f docker-compose.train.yml up -d llm-trainer-gpu

# CPU trainer
docker compose -f docker-compose.train.yml up -d llm-trainer-cpu

# Combined with proxy
docker compose -f docker-compose.train.yml up -d nginx
```
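After start-up it is worth verifying that the GPU is actually visible inside the trainer container (assuming `nvidia-smi` is available in the image, as is typical for CUDA base images):

```bash
# GPU visibility check inside the GPU trainer container
docker compose -f docker-compose.train.yml exec llm-trainer-gpu nvidia-smi

# Follow the trainer logs
docker compose -f docker-compose.train.yml logs -f llm-trainer-gpu
```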
## GPU/CPU standalone trainers (simple configuration)
- docker-compose.gpu-only.yml: Starts `llm-trainer-gpu` alone (8000). Environment: CUDA_VISIBLE_DEVICES, TORCH_USE_CUDA_DSA, DEVICE_TYPE=gpu.
- docker-compose.cpu-only.yml: Starts `llm-trainer-cpu` alone (8001 → internal 8000). Environment: OMP_NUM_THREADS/MKL_NUM_THREADS, DEVICE_TYPE=cpu.
### Startup example

```bash
# GPU alone
docker compose -f docker-compose.gpu-only.yml up -d llm-trainer-gpu

# CPU alone
docker compose -f docker-compose.cpu-only.yml up -d llm-trainer-cpu
```
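The environment variables listed above can also be overridden at launch time; a sketch, assuming the compose files pass these variables through to the containers:

```bash
# Pin the GPU trainer to a single device
CUDA_VISIBLE_DEVICES=0 \
docker compose -f docker-compose.gpu-only.yml up -d llm-trainer-gpu

# Cap CPU thread counts for the CPU-only trainer
OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 \
docker compose -f docker-compose.cpu-only.yml up -d llm-trainer-cpu
```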
## GPU overlay (adding GPU resources to the existing compose)
- docker-compose.gpu.yml: Overlay that grants GPU resources to existing services such as `dev`/`test`/`prod`/`frontend`. Use it in combination with the base docker-compose.yml.
### Startup example

```bash
# Example of assigning GPU to API + Frontend
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d api frontend
```
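To confirm the overlay actually handed a GPU to a service, run a check inside the container (a sketch, assuming `nvidia-smi` and PyTorch are present in the image):

```bash
# GPU visible to the api container?
docker compose -f docker-compose.yml -f docker-compose.gpu.yml exec api nvidia-smi

# PyTorch's view of the same thing
docker compose -f docker-compose.yml -f docker-compose.gpu.yml exec api \
  python -c "import torch; print(torch.cuda.is_available())"
```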
## Distributed node configuration (docker-compose.distributed.yml)
- brain-node-1..3: Start FastAPI on 8001/8002/8003 and connect as Zenoh peers.
- zenoh-router: Distributed communication router (7447/tcp+udp, 7446/udp).
- model-server (optional): Dedicated service for video/audio analysis (9002→8000). Whisper dependencies are managed in a separate image.
### Environment variables for distributed ASR/Whisper
- VIDEO_ANALYSIS_ASR_BACKEND: `asr_fallback` (default) or `whisper_real`
- VIDEO_ANALYSIS_WHISPER_MODEL: Whisper model size (e.g. `tiny`, `base`)
- VIDEO_ANALYSIS_WHISPER_DEVICE: Execution device (e.g. `cpu`, `cuda`)
- VIDEO_ANALYSIS_ASR_PREPROCESS: Preprocessing on/off (1/0)
### Startup examples

```bash
# Distributed 3 nodes + router
docker compose -f docker-compose.distributed.yml up -d

# Enable Whisper and launch the distributed nodes
VIDEO_ANALYSIS_ASR_BACKEND=whisper_real \
VIDEO_ANALYSIS_WHISPER_MODEL=base \
docker compose -f docker-compose.distributed.yml up -d

# Start including the dedicated model-server
ENABLE_WHISPER=true \
VIDEO_ANALYSIS_ASR_BACKEND=whisper_real \
docker compose -f docker-compose.distributed.yml up -d model-server brain-node-1 brain-node-2 brain-node-3 zenoh-router
```
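A simple reachability check for the three brain nodes (a sketch; the `/docs` path assumes FastAPI's default Swagger UI, so substitute the project's real health endpoint if one exists):

```bash
# Probe each brain node's FastAPI port from the host
for port in 8001 8002 8003; do
  curl -sf "http://localhost:${port}/docs" > /dev/null \
    && echo "brain-node on ${port}: OK" \
    || echo "brain-node on ${port}: not responding"
done
```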
## Microservice configuration (docker-compose.microservices.yml)
- gateway: API gateway (8000). Routes to the services below.
- training: Training service (8001). Volume: ./artifacts, ./data.
- inference: Inference service (8002). Volume: ./artifacts.
- model-registry: Model management (8003). Volume: ./model_registry.
- monitoring: Metrics aggregation (8004).
- postgres: Common DB (5432). Volume: postgres_data.
- zenoh-router: Distributed communication.
### Startup example

```bash
# Batch startup
docker compose -f docker-compose.microservices.yml up -d

# gateway only
docker compose -f docker-compose.microservices.yml up -d gateway
```
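Once up, the gateway can be smoke-tested from the host (again assuming FastAPI's default `/docs` route; adjust to the gateway's actual routes):

```bash
# Gateway reachable on 8000?
curl -sf http://localhost:8000/docs > /dev/null && echo "gateway: OK"

# Tail an individual service's logs
docker compose -f docker-compose.microservices.yml logs -f training
```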
## What you can do with each stack
- api + frontend: Runs the core dashboard and API. You can use the SDK and trigger model training.
- rag-api + milvus + elasticsearch: RAG pipeline (embedding search, log search).
- notebook: Experimental environment (Jupyter) connected to all services.
- llm-trainer-(gpu|cpu): Standalone execution of large-scale training jobs. Artifacts are saved to saved_models/logs.
- microservices stack: Operates training/inference/model management/monitoring in a loosely coupled manner behind the gateway.
## RAG system startup procedure (rag-system directory linkage)
- Service: `rag-api` (external 8101 / internal 8001). Dependencies: `milvus-standalone`, `elasticsearch`. Data: mounts `./rag-system/data` to `/home/appuser/app/rag-system/data`.
- Environment variables: `EVOSPIKENET_API_KEY`/`EVOSPIKENET_API_KEYS`, `MILVUS_HOST=milvus-standalone`, `ELASTICSEARCH_HOST=elasticsearch`.
- Execution location: run `docker compose` in the repository root (no need to change into `rag-system`).
### Startup example

```bash
# RAG dependency set (includes Milvus/Elasticsearch)
docker compose --profile rag up -d rag-api milvus-standalone elasticsearch

# Check RAG API logs
docker compose --profile rag logs -f rag-api

# Stop
docker compose --profile rag down
```
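A sketch of wiring up the environment variables listed above and checking the backing stores; the API key value is a placeholder, and the Milvus `/healthz` probe assumes the standard standalone metrics port (9091):

```bash
# Provide the API key before starting the RAG profile (placeholder value)
export EVOSPIKENET_API_KEY="<your-key>"
docker compose --profile rag up -d rag-api milvus-standalone elasticsearch

# Backing-store checks: Milvus health probe and Elasticsearch cluster health
curl -sf http://localhost:9091/healthz && echo "milvus: OK"
curl -sf http://localhost:9200/_cluster/health && echo
```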
## RAG data location
- Persistent data: `rag-system/data` (host). Holds vector data and indexes.
- Milvus/Elasticsearch persistence: the `milvus_data`, `milvus_etcd`, `milvus_minio` volumes (Milvus); Elasticsearch data is container-local.
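To check how much data has accumulated and which parts actually persist, generic commands like these can be used:

```bash
# Host-side RAG data (persists across container restarts)
du -sh rag-system/data

# Milvus persistence lives in named volumes; Elasticsearch data is container-local
# and is lost when the elasticsearch container is removed.
docker volume ls | grep milvus
```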
## LLM training wrapper for distributed_brain (full 30 ranks)
- Script: `scripts/run_distributed_brain_llm.sh`
- Role: (1) data collection (`RUN_DATA_COLLECTION=1` by default), (2) serial execution of `train_llm_models.py` rank by rank.
- Main environment variables: `CONFIG` (default: config/training_config.yaml), `CATEGORY` (e.g. full_brain_llm / text_generation), `RANKS` (space-separated rank list), `GPU` (1 passes --gpu, 0 runs on CPU), `RUN_DATA_COLLECTION` (0 skips collection).
### Execution steps (full 30 ranks, with data collection)

```bash
export CONFIG=config/training_config.yaml
export CATEGORY=full_brain_llm
export RANKS="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29"
export GPU=1                   # 0 if running on CPU
export RUN_DATA_COLLECTION=1   # 0 to skip data collection
./scripts/run_distributed_brain_llm.sh
```
Note:
- RANKS must be space-separated (comma separation is not supported).
- Running 30 ranks consumes significant computational resources; check GPU/CPU and storage availability first.
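Given the resource note above, it is often useful to try a reduced run first, e.g. two ranks on CPU without data collection (same variables as above, just with smaller values):

```bash
# Reduced smoke test of the wrapper: 2 ranks, CPU only, no data collection
export CONFIG=config/training_config.yaml
export CATEGORY=full_brain_llm
export RANKS="0 1"
export GPU=0
export RUN_DATA_COLLECTION=0
./scripts/run_distributed_brain_llm.sh
```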
## Python environment setup before execution (avoiding PEP 668)
On Homebrew-style system Pythons, `pip install` into the system environment is blocked (PEP 668), so run it inside a virtual environment. Python 3.10/3.11 is recommended.
```bash
# Run in the project root
python3 -m venv .venv            # skip if it already exists
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

# If you also collect data
python -m pip install -r scripts/requirements-llm-data.txt

# Then run the wrapper
./scripts/run_distributed_brain_llm.sh
```
Note:
- Avoid `--break-system-packages`; always work inside a virtual environment.
- If an existing `venv311/` (or similar) is available, you can also use `source venv311/bin/activate`.
## Version constraints on dependent packages (e.g. Ray)
- Some packages, such as Ray 2.31.0, do not support Python 3.13+. **Use Python 3.10/3.11 (or 3.12).**
- With `python@3.14` on macOS Homebrew, `ray==2.31.0` cannot be resolved and pip returns "No matching distribution". Recreate the virtual environment on 3.10/3.11.
- If pip reports `Invalid requirement: '#'`, make sure you run `pip install -r requirements.txt` and update pip (`python -m pip install --upgrade pip`).
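One way to recreate the environment on a supported interpreter, assuming a `python3.11` binary is already installed (e.g. via Homebrew's python@3.11):

```bash
# Recreate the virtual environment on Python 3.11
python3.11 -m venv .venv
source .venv/bin/activate
python --version                      # should report 3.11.x
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```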
## Notes on environment variables
- `API_URL`/`RAG_API_URL`/`EEG_WS_URL`: Connection targets used by the frontend or notebook.
- `EVOSPIKENET_API_KEY`/`EVOSPIKENET_API_KEYS`: Keys for API authentication.
- `ENABLE_GPU`: If set to `true`, GPU-specific packages (bitsandbytes etc.) are additionally installed. Default: `false`.
- `BASE_IMAGE`: Switches the base image. The default if not specified is `ubuntu:22.04` (no CUDA).
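A sketch of setting these variables from the host before starting the stack; hosts, ports, and the key value are placeholders to adapt (containers reach each other via service names rather than localhost):

```bash
# Point a notebook or external client at the running stack (placeholder values)
export API_URL=http://localhost:8000
export RAG_API_URL=http://localhost:8101
export EVOSPIKENET_API_KEY="<your-key>"
export ENABLE_GPU=false
docker compose up -d api frontend notebook
```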
## How to use BASE_IMAGE
| Purpose | BASE_IMAGE | ENABLE_GPU |
|---|---|---|
| CPU build (default) | ubuntu:22.04 | false |
| GPU build (CUDA 12.4) | nvidia/cuda:12.4.1-base-ubuntu22.04 | true |
| GPU build (CUDA 12.1) | nvidia/cuda:12.1.1-base-ubuntu22.04 | true |
```bash
# CPU build (default, no CUDA)
docker build .

# GPU build (CUDA image + bitsandbytes, etc.)
docker build . \
  --build-arg BASE_IMAGE=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --build-arg ENABLE_GPU=true
```
Note: The `base`/`notebook` services in `docker-compose.yml` use `nvidia/cuda:12.4.1-base-ubuntu22.04` by default, while the `test` service is fixed to `ubuntu:22.04`. When using `docker compose up` in a CPU environment, override via environment variables, e.g. `BASE_IMAGE=ubuntu:22.04 ENABLE_GPU=false docker compose up`.
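Following that note, a full CPU-only flow looks roughly like this:

```bash
# Build without CUDA, then start the core services
BASE_IMAGE=ubuntu:22.04 ENABLE_GPU=false docker compose build api frontend
BASE_IMAGE=ubuntu:22.04 ENABLE_GPU=false docker compose up -d api frontend
```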
## BuildKit caching
The Dockerfile supports BuildKit caching via `--mount=type=cache`, so subsequent builds do not need to re-download pip packages (including PyTorch).
For Docker < 23.0, enable BuildKit explicitly:
```bash
export DOCKER_BUILDKIT=1
docker build .
```
## Reference documents
- docs/GPU_CPU_LAUNCH.md
- README.md
- Makefile