EvoSpikeNet HPC environment setup guide

[!NOTE] For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Overview

This guide explains how to run EvoSpikeNet in an HPC cluster environment using Slurm and Enroot/Pyxis.

Environment specifications

Container execution infrastructure

  • Job Scheduler: Slurm 21.08.8+
  • Container runtime: Enroot/Pyxis
  • Container Registry: NVIDIA NGC Private Registry (Security scan required)

Storage configuration

| Storage name  | Mount point      | Purpose                       | Notes                                  |
|---------------|------------------|-------------------------------|----------------------------------------|
| Local storage | /raid            | Container staging/startup     | RAID0; not for persistent storage      |
| Main storage  | /lustre          | Training data, high-speed I/O | Lustre file system                     |
| Home area     | /home/[username] | Project code                  | Allocated per user                     |
| Data store    | /store           | Backups                       | Accessible only from the login server  |

Setup steps

1. Create the storage structure

Run the following on the login server:

cd /path/to/EvoSpikeNet
bash slurm/setup_storage.sh

The following directory structure is created:

/lustre/${USER}/
├── evospikenet/          # Project code
├── containers/           # Container images (.sqsh)
├── data/                 # Training data
├── output/               # Training results/models
├── logs/                 # Log files
├── checkpoints/          # Checkpoints
└── artifacts/            # Other deliverables

/home/${USER}/
├── project -> /lustre/${USER}/evospikenet  # Symbolic link
├── data -> /lustre/${USER}/data
├── output -> /lustre/${USER}/output
├── scripts/             # Personal scripts
└── configs/             # Personal configuration

/store/${USER}/
├── backups/             # Backups
├── models/              # Archived models
├── datasets/            # Dataset backups
└── archive/             # Other archives
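
For reference, a script producing this layout boils down to a series of mkdir -p and ln -s calls. A minimal sketch under that assumption follows; the repository's slurm/setup_storage.sh is authoritative and may differ in details such as permission handling:

#!/bin/bash
# Minimal sketch of the storage layout above
mkdir -p /lustre/${USER}/{evospikenet,containers,data,output,logs,checkpoints,artifacts}
mkdir -p /home/${USER}/{scripts,configs}
mkdir -p /store/${USER}/{backups,models,datasets,archive}

# Convenience symlinks from the home directory into Lustre
ln -sfn /lustre/${USER}/evospikenet /home/${USER}/project
ln -sfn /lustre/${USER}/data /home/${USER}/data
ln -sfn /lustre/${USER}/output /home/${USER}/output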

2. Upload project code

On the login server:

# Clone the project to Lustre
cd /lustre/${USER}
git clone https://github.com/your-org/EvoSpikeNet.git evospikenet
cd evospikenet

# Or upload with scp
scp -r /local/path/to/EvoSpikeNet user@login-server:/lustre/${USER}/evospikenet

3. Build and deploy container images

Important: Docker is not installed on the HPC cluster, so container images must be built on your local machine or on a separate build server where Docker is available.

3.1 Working on a local machine/build server

Run on a machine with Docker installed:

# Login to NGC Private Registry
docker login nvcr.io

# Building and pushing containers
cd /path/to/EvoSpikeNet

# Edit settings in advance
# Change NGC_REGISTRY="nvcr.io/your-org/your-private-registry" to your actual value
vim slurm/build_and_deploy_container.sh

# Run build and push
bash slurm/build_and_deploy_container.sh
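
For orientation, the core of such a deploy script is a standard build/tag/push sequence. A minimal sketch under that assumption (the registry path and tag are placeholders; the actual slurm/build_and_deploy_container.sh may differ):

# Minimal build/tag/push sketch; NGC_REGISTRY is a placeholder
NGC_REGISTRY="nvcr.io/your-org/your-private-registry"
docker build -f Dockerfile.ngc -t evospikenet:latest .
docker tag evospikenet:latest ${NGC_REGISTRY}/evospikenet:latest
docker push ${NGC_REGISTRY}/evospikenet:latest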

Alternative: Deploy via a Docker tar file

If not using NGC Private Registry:

# Build the container on your local machine
docker build -f Dockerfile.ngc -t evospikenet:latest .

# Export to Tar file
docker save evospikenet:latest -o evospikenet-latest.tar

# Upload to HPC environment
scp evospikenet-latest.tar user@login-server:/lustre/${USER}/containers/

3.2 Security scan with NGC console

  1. Access the NGC Private Registry console
  2. Check the uploaded image
  3. Wait for the security scan to run automatically
  4. After the scan is complete, check the vulnerability report

3.3 Container import into HPC environment

Run on the HPC login server (where Docker is not installed):

Method A: Via NGC Private Registry (recommended)

# Set the NGC API key (first time only)
export NGC_API_KEY="your-ngc-api-key"

# Import the container with Enroot (Docker not required)
enroot import docker://nvcr.io/your-org/your-private-registry/evospikenet:latest

# Move .sqsh file to appropriate location
mv evospikenet+latest.sqsh /lustre/${USER}/containers/evospikenet-ngc.sqsh

# Verify
ls -lh /lustre/${USER}/containers/
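
If the import fails with an authentication error, note that Enroot typically reads registry credentials from a netrc-style credentials file rather than from the environment alone. An entry along these lines usually works (NGC's convention is the literal user name $oauthtoken; the exact configuration is site-dependent):

# ~/.config/enroot/.credentials
# "$oauthtoken" is literal; the password is your NGC API key
machine nvcr.io login $oauthtoken password your-ngc-api-key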

Method B: Via tar file

When using a tar file uploaded earlier:

# Import from the tar file in the HPC environment
enroot import dockerd:///lustre/${USER}/containers/evospikenet-latest.tar

# Move the .sqsh file to its final location
mv evospikenet.sqsh /lustre/${USER}/containers/evospikenet-ngc.sqsh

# The tar file can now be deleted
rm /lustre/${USER}/containers/evospikenet-latest.tar

# Verify
ls -lh /lustre/${USER}/containers/

4. Configure the environment module

Load the Slurm module:

# Check available modules
module avail

# Load the Slurm module
module load slurm/Slurm/21.08.8

# Persist across logins (bash)
echo "module load slurm/Slurm/21.08.8" >> ~/.bashrc

5. Job script settings

5.1 Basic training jobs

Edit example_training.slurm:

# Required: change the email address
#SBATCH --mail-user=your-email@example.com

# Change to the actual partition name (note: Slurm does not support
# trailing comments on #SBATCH lines, so keep notes on separate lines)
#SBATCH -p gpu-partition

Run:

cd /lustre/${USER}/evospikenet
sbatch slurm/example_training.slurm
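
The exact contents of example_training.slurm are repository-specific, but a job of this kind generally combines #SBATCH directives with a Pyxis container launch. A minimal sketch (train.py, its arguments, and the mounts are placeholders):

#!/bin/bash
#SBATCH -J evospikenet-train
#SBATCH -p gpu-partition
#SBATCH -N 1
#SBATCH --gpus-per-node=8
#SBATCH -t 1-00:00:00
#SBATCH -o /lustre/%u/logs/evospikenet-train-%j.out
#SBATCH -e /lustre/%u/logs/evospikenet-train-%j.err

# Launch the training entry point inside the container via Pyxis
srun --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
     --container-mounts=/lustre/${USER}:/lustre/${USER} \
     python train.py --data /lustre/${USER}/data --output /lustre/${USER}/output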

5.2 Interactive Jupyter

Using example_interactive.slurm:

sbatch slurm/example_interactive.slurm

# Check job status
squeue -u ${USER}

# Check Jupyter URL from log
tail -f /lustre/${USER}/logs/jupyter-<job-id>.out

Example output:

Jupyter URL: http://node-01:8888/?token=evospikenet
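
The Jupyter server runs on a compute node, which usually cannot be reached directly from outside the cluster. A typical approach (an assumption about this site's network; adjust host names) is SSH port forwarding through the login server, using the node and port from the log:

# On your local machine: forward local port 8888 to the compute node
# via the login server (replace node-01 with the node shown in the log)
ssh -L 8888:node-01:8888 user@login-server

# Then open http://localhost:8888/?token=evospikenet in a local browser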

5.3 Multi-node distributed training

Using example_batch_experiments.slurm:

sbatch slurm/example_batch_experiments.slurm

Job management

Submit job

# Submit job
sbatch slurm/example_training.slurm

# Interactive job
srun --pty -p gpu-partition -N 1 --gpus-per-node=1 -t 01:00:00 bash
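
To get the shell inside the container rather than on the bare node, add the Pyxis container options (image path as placed in step 3; the mount is an assumption):

# Interactive shell inside the container (Pyxis options)
srun --pty -p gpu-partition -N 1 --gpus-per-node=1 -t 01:00:00 \
    --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
    --container-mounts=/lustre/${USER}:/lustre/${USER} \
    bash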

Monitoring jobs

# List my jobs
squeue -u ${USER}

# Detailed information
scontrol show job <job-id>

# Real-time log monitoring
tail -f /lustre/${USER}/logs/evospikenet-train-<job-id>.out
tail -f /lustre/${USER}/logs/evospikenet-train-<job-id>.err

Job control

# Cancel job
scancel <job-id>

# Cancel all jobs
scancel -u ${USER}

# Suspend a job
scontrol suspend <job-id>

# Resume a suspended job
scontrol resume <job-id>

Data management

Upload data

# Upload via login server
scp -r /local/data user@login-server:/lustre/${USER}/data/

# Or sync with rsync
rsync -avz --progress /local/data/ user@login-server:/lustre/${USER}/data/

Backup strategy

Perform regular backups on the login server:

#!/bin/bash
# backup_models.sh

SOURCE_DIR="/lustre/${USER}/output/saved_models"
BACKUP_DIR="/store/${USER}/backups/models/$(date +%Y%m%d)"

echo "Backing up models to data store..."
mkdir -p ${BACKUP_DIR}
rsync -avz --progress ${SOURCE_DIR}/ ${BACKUP_DIR}/

echo "Backup completed: ${BACKUP_DIR}"

Automate with cron (login server):

# Edit the crontab
crontab -e

# Backup every day at 3am
0 3 * * * /home/${USER}/scripts/backup_models.sh >> /home/${USER}/logs/backup.log 2>&1
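
Restoring reverses the direction. Run it on the login server, since only it can reach /store (the date directory is a placeholder):

# Restore a dated backup from the data store back to Lustre
rsync -avz --progress /store/${USER}/backups/models/<YYYYMMDD>/ \
    /lustre/${USER}/output/saved_models/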

Checkpoint management

Save checkpoints periodically during training:

# In the EvoSpikeNet training code
import os

import torch

# Python does not expand shell variables, so build the path from the
# environment instead of using a literal "${USER}"
checkpoint_dir = f"/lustre/{os.environ['USER']}/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save every 10 epochs (epoch, model, optimizer, and loss come from
# the surrounding training loop)
if epoch % 10 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, f"{checkpoint_dir}/checkpoint_epoch_{epoch}.pt")

Troubleshooting

Container doesn't start

# Check the container image
ls -lh /lustre/${USER}/containers/

# Manual test
srun --pty -p gpu-partition -N 1 -t 00:10:00 \
    --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
    bash

GPU not recognized

# Check GPU in container
srun -p gpu-partition -N 1 --gpus-per-node=1 \
    --container-image=/lustre/${USER}/containers/evospikenet-ngc.sqsh \
    nvidia-smi

Permission error

# Check Lustre storage permissions
ls -la /lustre/${USER}/

# Modify as necessary
chmod 755 /lustre/${USER}/evospikenet
chmod -R u+rwX /lustre/${USER}/evospikenet

Insufficient storage space

# Check capacity
df -h /lustre
df -h /home/${USER}

# Delete unnecessary files
find /lustre/${USER}/logs -name "*.out" -mtime +30 -delete
find /lustre/${USER}/output -name "*.tmp" -delete
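
If the file system has user quotas enabled (an assumption about this site's configuration), Lustre's lfs tool reports per-user usage and limits directly:

# Per-user Lustre usage and limits (requires quotas to be enabled)
lfs quota -u ${USER} /lustre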

Best practices

1. Optimize storage usage

  • Local storage (/raid): used only for temporary container staging
  • Main storage (/lustre): Active training data and experiment results
  • Datastore (/store): Models and datasets that require long-term storage

2. Job design

  • Implement appropriate checkpoint functionality for long-running jobs
  • Verify experiments on a small scale before scaling up
  • Use node-exclusive mode with the --exclusive option (maximizes GPU performance)

3. Resource efficiency

# Efficient GPU use: request all GPUs on the node, exclusively
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# Use the test partition for short experiments
#SBATCH -p test-partition
#SBATCH -t 0-00:30:00

4. Security

  • Always run the NGC Private Registry security scan on pushed images
  • Leave home directory permissions at their defaults
  • Keep API keys and other credentials in your home directory (.ssh/, .config/)

Command reference

Basic Slurm commands

# Partition list
sinfo

# Node information
scontrol show node

# Job history
sacct -u ${USER} --format=JobID,JobName,Partition,State,Elapsed,MaxRSS

# Job efficiency report
seff <job-id>

Performance monitoring

# Resource monitoring on GPU servers (in interactive session)
watch -n 1 nvidia-smi

# Job resource usage
sstat -j <job-id> --format=AveCPU,AveRSS,MaxRSS


Support

If you run into problems:

  1. Check the log file: /lustre/${USER}/logs/
  2. Check job details: scontrol show job <job-id>
  3. For system-level issues, contact your system administrator

Updated: January 27, 2026 · Version: 1.0