# EvoSpikeNet HPC Environment Setup Guide
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
## Overview
This guide explains how to run EvoSpikeNet in an HPC cluster environment using Slurm and Enroot/Pyxis.
## Environment Specifications
### Container Execution Infrastructure
- Job Scheduler: Slurm 21.08.8+
- Container runtime: Enroot/Pyxis
- Container Registry: NVIDIA NGC Private Registry (Security scan required)
### Storage Configuration
| Storage name | Mount point | Purpose | Special notes |
|---|---|---|---|
| Local storage | `/raid` | Container placement and startup | RAID0 configuration; not suitable for permanent storage |
| Main storage | `/lustre` | Training data, high-speed I/O | Lustre file system |
| Home area | `/home/[username]` | Project code | Allocated per user |
| Data store | `/store` | Backups | Accessible only from the login server |
## Setup Steps
### 1. Create the Storage Structure
Run the following on the login server:
```bash
cd /path/to/EvoSpikeNet
bash slurm/setup_storage.sh
```
Directory structure created:
```
/lustre/${USER}/
├── evospikenet/    # Project code
├── containers/     # Container images (.sqsh)
├── data/           # Training data
├── output/         # Training results / models
├── logs/           # Log files
├── checkpoints/    # Checkpoints
└── artifacts/      # Other deliverables

/home/${USER}/
├── project -> /lustre/${USER}/evospikenet   # Symbolic link
├── data    -> /lustre/${USER}/data
├── output  -> /lustre/${USER}/output
├── scripts/   # Personal scripts
└── configs/   # Personal configuration

/store/${USER}/
├── backups/   # Backups
├── models/    # Archived models
├── datasets/  # Dataset backups
└── archive/   # Other archives
```
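If you need to recreate this layout by hand, or want a rough idea of what `slurm/setup_storage.sh` does, a minimal sketch of the equivalent commands is shown below. The actual script is authoritative and may differ in detail:

```bash
#!/bin/bash
# Sketch only: recreate the storage layout manually (slurm/setup_storage.sh may differ)
set -euo pipefail

# Working directories on Lustre
mkdir -p /lustre/${USER}/{evospikenet,containers,data,output,logs,checkpoints,artifacts}

# Convenience symlinks and personal directories in the home area
ln -sfn /lustre/${USER}/evospikenet /home/${USER}/project
ln -sfn /lustre/${USER}/data        /home/${USER}/data
ln -sfn /lustre/${USER}/output      /home/${USER}/output
mkdir -p /home/${USER}/{scripts,configs}

# Long-term storage on the data store
mkdir -p /store/${USER}/{backups,models,datasets,archive}
```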
### 2. Upload the Project Code
On the login server:
```bash
# Clone the project onto Lustre
cd /lustre/${USER}
git clone https://github.com/your-org/EvoSpikeNet.git evospikenet
cd evospikenet

# Or upload with scp
scp -r /local/path/to/EvoSpikeNet user@login-server:/lustre/${USER}/evospikenet
```
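If you keep editing the code locally and only want to push changes, `rsync` from your workstation also works; the exclude patterns here are just examples and can be adjusted to your working copy:

```bash
# Sync the local working copy to Lustre, skipping VCS metadata and local outputs (example patterns)
rsync -avz --progress \
  --exclude '.git/' --exclude '__pycache__/' --exclude 'output/' \
  /local/path/to/EvoSpikeNet/ user@login-server:/lustre/${USER}/evospikenet/
```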
### 3. Build and Deploy Container Images
**Important:** The HPC cluster does not have Docker installed, so container images must be built on a local machine or on a separate build server where Docker is available.
#### 3.1 Work on a Local Machine / Build Server
Run on a machine with Docker installed:
```bash
# Log in to the NGC Private Registry
docker login nvcr.io

# Build and push the container
cd /path/to/EvoSpikeNet

# Edit the settings first:
# change NGC_REGISTRY="nvcr.io/your-org/your-private-registry" to your actual value
vim slurm/build_and_deploy_container.sh

# Run the build and push
bash slurm/build_and_deploy_container.sh
```
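For orientation, the core of such a build-and-push script typically looks like the sketch below. The repository's actual `slurm/build_and_deploy_container.sh` is authoritative; variable names and extra checks may differ:

```bash
#!/bin/bash
# Sketch of a typical build-and-push flow (the real slurm/build_and_deploy_container.sh may differ)
set -euo pipefail

NGC_REGISTRY="nvcr.io/your-org/your-private-registry"   # change to your registry
IMAGE_NAME="evospikenet"
TAG="latest"

# Build from the NGC-based Dockerfile referenced elsewhere in this guide
docker build -f Dockerfile.ngc -t ${IMAGE_NAME}:${TAG} .

# Tag for the private registry and push
docker tag ${IMAGE_NAME}:${TAG} ${NGC_REGISTRY}/${IMAGE_NAME}:${TAG}
docker push ${NGC_REGISTRY}/${IMAGE_NAME}:${TAG}
```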
**Alternative: Deploy via a Docker tar file**

If you are not using the NGC Private Registry:
```bash
# Build the container on your local machine
docker build -f Dockerfile.ngc -t evospikenet:latest .

# Export to a tar file
docker save evospikenet:latest -o evospikenet-latest.tar

# Upload to the HPC environment
scp evospikenet-latest.tar user@login-server:/lustre/${USER}/containers/
```
#### 3.2 Security Scan in the NGC Console
1. Access the NGC Private Registry console
2. Check the uploaded image
3. Wait for the automatic security scan to run
4. Review the vulnerability report once the scan completes
#### 3.3 Import the Container into the HPC Environment
Run on the HPC login server (Docker is not installed there):
**Method A: Via the NGC Private Registry (recommended)**
```bash
# Set the NGC API key (first time only)
export NGC_API_KEY="your-ngc-api-key"

# Import the container with Enroot (Docker not required)
enroot import docker://nvcr.io/your-org/your-private-registry/evospikenet:latest

# Move the .sqsh file to its final location
mv evospikenet+latest.sqsh /lustre/${USER}/containers/evospikenet-ngc.sqsh

# Confirm
ls -lh /lustre/${USER}/containers/
```
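How the NGC API key is picked up depends on your Enroot configuration; one common approach (an assumption here, not something this guide or the repository enforces) is an Enroot credentials file in netrc syntax, with `$oauthtoken` as the NGC username:

```bash
# Example only: store NGC credentials for Enroot (the config path depends on your Enroot setup)
mkdir -p ~/.config/enroot
cat >> ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password your-ngc-api-key
EOF
```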
**Method B: Via a tar file**

When using the tar file uploaded in the previous step:
```bash
# Import from the tar file in the HPC environment
enroot import dockerd:///lustre/${USER}/containers/evospikenet-latest.tar

# Move the .sqsh file to its final location
mv evospikenet.sqsh /lustre/${USER}/containers/evospikenet-ngc.sqsh

# The tar file can now be deleted
rm /lustre/${USER}/containers/evospikenet-latest.tar

# Confirm
ls -lh /lustre/${USER}/containers/
```
### 4. Configure Environment Modules
Loading the Slurm module:
```bash
# Check available modules
module avail

# Load the Slurm module
module load slurm/Slurm/21.08.8

# Make it persistent (for bash)
echo "module load slurm/Slurm/21.08.8" >> ~/.bashrc
```
### 5. Job Script Settings
#### 5.1 Basic Training Job
Edit `slurm/example_training.slurm`:
```bash
# Required: change the email address
#SBATCH --mail-user=your-email@example.com

# Confirm/change the partition name
#SBATCH -p gpu-partition   # change to the actual partition name
```
Execution:

```bash
cd /lustre/${USER}/evospikenet
sbatch slurm/example_training.slurm
```
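For orientation, the directives and Pyxis container flags such a script contains look roughly like the sketch below. The actual `slurm/example_training.slurm` in the repository is authoritative; the training entry point shown here is hypothetical:

```bash
#!/bin/bash
# Sketch only: structure of a typical single-node training job (see slurm/example_training.slurm)
#SBATCH -J evospikenet-train
#SBATCH -p gpu-partition                  # change to the actual partition name
#SBATCH -N 1
#SBATCH --gpus-per-node=8
#SBATCH -t 1-00:00:00
#SBATCH -o /lustre/%u/logs/evospikenet-train-%j.out
#SBATCH -e /lustre/%u/logs/evospikenet-train-%j.err
#SBATCH --mail-user=your-email@example.com
#SBATCH --mail-type=END,FAIL

# Run the training entry point inside the Enroot/Pyxis container
srun --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
     --container-mounts=/lustre/${USER}:/lustre/${USER} \
     python train.py   # hypothetical entry point; use the project's actual script
```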
#### 5.2 Interactive Jupyter
Using `slurm/example_interactive.slurm`:
```bash
sbatch slurm/example_interactive.slurm

# Check the job status
squeue -u ${USER}

# Find the Jupyter URL in the log
tail -f /lustre/${USER}/logs/jupyter-<job-id>.out
```
Example output:

```
Jupyter URL: http://node-01:8888/?token=evospikenet
```
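Compute nodes are usually not reachable directly from your workstation, so an SSH tunnel through the login server is the typical way to open that URL locally (adjust the node name and port to whatever your log reports):

```bash
# Forward local port 8888 to the compute node running Jupyter, via the login server
ssh -L 8888:node-01:8888 user@login-server
# Then open http://localhost:8888/?token=evospikenet in a local browser
```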
#### 5.3 Multi-Node Distributed Training
Using [example_batch_experiments.slurm](slurm/example_batch_experiments.slurm):
```bash
sbatch slurm/example_batch_experiments.slurm
```
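If you adapt the batch script for your own multi-node runs, the general pattern under Slurm with Pyxis is one launcher task per node started with `srun`. The `torchrun` launcher and entry point below are assumptions for illustration, not something this guide prescribes:

```bash
#!/bin/bash
# Sketch only: a generic 2-node distributed launch under Slurm + Pyxis
#SBATCH -p gpu-partition
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# The first node in the allocation acts as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
     --container-mounts=/lustre/${USER}:/lustre/${USER} \
     torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
              --rdzv_backend=c10d --rdzv_endpoint="${MASTER_ADDR}:29500" \
              train.py   # hypothetical entry point; use the project's actual script
```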
## Job Management
### Submitting Jobs
```bash
# Submit a batch job
sbatch slurm/example_training.slurm

# Interactive job
srun --pty -p gpu-partition -N 1 --gpus-per-node=1 -t 01:00:00 bash
```
### Monitoring Jobs
```bash
# List my jobs
squeue -u ${USER}

# Detailed information
scontrol show job <job-id>

# Real-time log monitoring
tail -f /lustre/${USER}/logs/evospikenet-train-<job-id>.out
tail -f /lustre/${USER}/logs/evospikenet-train-<job-id>.err
```
### Job Control
```bash
# Cancel a job
scancel <job-id>

# Cancel all of your jobs
scancel -u ${USER}

# Suspend a job
scontrol suspend <job-id>

# Resume a job
scontrol resume <job-id>
```
## Data Management
### Uploading Data
```bash
# Upload via the login server
scp -r /local/data user@login-server:/lustre/${USER}/data/

# Or sync with rsync
rsync -avz --progress /local/data/ user@login-server:/lustre/${USER}/data/
```
### Backup Strategy
Perform regular backups on the login server:
```bash
#!/bin/bash
# backup_models.sh
SOURCE_DIR="/lustre/${USER}/output/saved_models"
BACKUP_DIR="/store/${USER}/backups/models/$(date +%Y%m%d)"

echo "Backing up models to the data store..."
mkdir -p "${BACKUP_DIR}"
rsync -avz --progress "${SOURCE_DIR}/" "${BACKUP_DIR}/"
echo "Backup completed: ${BACKUP_DIR}"
```
Automate with cron (login server):
```bash
# Edit the crontab
crontab -e

# Back up every day at 3 a.m.
0 3 * * * /home/${USER}/scripts/backup_models.sh >> /home/${USER}/logs/backup.log 2>&1
```
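If `/store` fills up over time, a simple retention policy can be appended to the same backup script; the 90-day window below is only an example and should match your site's retention rules:

```bash
# Example retention policy: remove dated model backups older than 90 days
find /store/${USER}/backups/models -mindepth 1 -maxdepth 1 -type d -mtime +90 -exec rm -rf {} +
```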
### Checkpoint Management
Save checkpoints periodically during training:
```python
# In the EvoSpikeNet training code
import os
import torch

# Resolve the user name at runtime rather than relying on shell-style ${USER} substitution
checkpoint_dir = f"/lustre/{os.environ['USER']}/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save a checkpoint every 10 epochs
if epoch % 10 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f"{checkpoint_dir}/checkpoint_epoch_{epoch}.pt")
```
## Troubleshooting
### Container Does Not Start
```bash
# Check the container image
ls -lh /lustre/${USER}/containers/

# Manual test
srun --pty -p gpu-partition -N 1 -t 00:10:00 \
    --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
    bash
```
### GPU Not Recognized
```bash
# Check the GPU inside the container
srun -p gpu-partition -N 1 --gpus-per-node=1 \
    --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
    nvidia-smi
```
### Permission Errors
```bash
# Check Lustre storage permissions
ls -la /lustre/${USER}/

# Fix as necessary
chmod 755 /lustre/${USER}/evospikenet
chmod -R u+rwX /lustre/${USER}/evospikenet
```
### Insufficient Storage Space
```bash
# Check capacity
df -h /lustre
df -h /home/${USER}

# Delete unnecessary files
find /lustre/${USER}/logs -name "*.out" -mtime +30 -delete
find /lustre/${USER}/output -name "*.tmp" -delete
```
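To see which directories are consuming the space, per-directory totals are often more useful than `df`. The `lfs quota` call assumes the Lustre client tools are installed and that user quotas are enabled on this system:

```bash
# Per-directory usage under your Lustre area
du -sh /lustre/${USER}/* | sort -h

# User quota on Lustre (requires Lustre client tools; quotas may not be enabled everywhere)
lfs quota -u ${USER} /lustre
```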
## Best Practices
### 1. Optimize Storage Usage
- Local storage (`/raid`): use only for temporary container deployment
- Main storage (`/lustre`): active training data and experiment results
- Data store (`/store`): models and datasets that require long-term storage
### 2. Job Design
- Implement proper checkpointing for long-running jobs
- Verify experiments at small scale before scaling up
- Use the `--exclusive` option to maximize GPU performance on a node
### 3. Resource Efficiency
```bash
# Use the GPUs efficiently
#SBATCH --gpus-per-node=8   # use all GPUs on the node
#SBATCH --exclusive         # exclusive use of the node

# Use the test partition for short experiments
#SBATCH -p test-partition
#SBATCH -t 0-00:30:00
```
### 4. Security
- Always run the NGC Private Registry security scan
- Keep home directory permissions at their defaults
- Store API keys and other credentials in your home directory (`.ssh/`, `.config/`)
## Reference Commands
### Basic Slurm Commands
```bash
# List partitions
sinfo

# Node information
scontrol show node

# Job history
sacct -u ${USER} --format=JobID,JobName,Partition,State,Elapsed,MaxRSS

# Job efficiency report
seff <job-id>
```
### Performance Monitoring
```bash
# Resource monitoring on a GPU node (inside an interactive session)
watch -n 1 nvidia-smi

# Resource usage of a running job
sstat -j <job-id> --format=AveCPU,AveRSS,MaxRSS
```
## Additional Resources
### Support
If you run into problems:
1. Check the log files: `/lustre/${USER}/logs/`
2. Check the job details: `scontrol show job <job-id>`
3. Contact your system administrator (for system-level issues)
Updated: January 27, 2026 | Version: 1.0