
Large-Scale Learning Guide

[!NOTE] For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Last updated: January 12, 2026 Color learning integration: January 12, 2026

Overview

EvoSpikeNet's large-scale learning system provides a multimodal AI training environment that leverages a 24-node distributed brain architecture. This guide explains how to launch large-scale training, how the data is structured, where it is stored, and more.

Comprehensive AI learning system ⭐ UPDATED

EvoSpikeNet provides a comprehensive learning system for language comprehension, landmark recognition, Japanese-English speech listening, and multimodal integration. Data for each modality is stored and managed separately to achieve optimized model learning.

Features

  • 🗣️ Language Understanding: 3,080,000 samples of Japanese text data (rinna/japanese-gpt-1b optimization)
  • 🏛️ Landmark recognition: 100,000 samples of world landmark image data
  • 🎤 Japanese and English audio listening: 565,000 samples of high quality ASR data (LibriSpeech, VoxPopuli, ReazonSpeech)
  • 🔗 Multimodal: Image + text integrated learning data
  • 📁 Separate data storage: Efficient data management with category-based directory structure
  • 🚀 Easy start: Start data collection and training for all categories with one command
  • ⚡ Automatic optimization: GPU/CPU automatic detection, memory optimization, batch size adjustment
  • 📈 Scalable: Massively parallel processing with 24-node distributed architecture
  • 🎯 Rank Specialization: Each node (rank 0-23) generates a dedicated LLM optimized for its field of expertise

Rank-specific training

In a 24-node distributed brain architecture, each rank takes on a different role, producing a specialized LLM:

| Rank range | Role | Specialty | Main use |
| --- | --- | --- | --- |
| 0-7 | Language understanding node | Japanese NLP | Semantic understanding, context analysis |
| 8-11 | Visual processing node | Landmark recognition | Image understanding, object detection |
| 12-15 | Speech processing node | Japanese-English ASR | Speech recognition, multilingual processing |
| 16-19 | Movement control node | Action generation | Action planning, output generation |
| 20-21 | Memory node | Episodic memory | Long-term memory, experience integration |
| 22-23 | Decision-making node | High-level reasoning | Strategic judgment, executive function |
# Rank-specific training example
./scripts/train_launcher.sh rank --rank 0 --category langtext   # Language understanding specialized LLM
./scripts/train_launcher.sh rank --rank 8 --category vision     # Visual processing specialized LLM
./scripts/train_launcher.sh rank --rank 12 --category audio     # Audio processing specialized LLM

Shared model training

You can train a generic model that can be shared across multiple ranks:

# Creating a shared model by specifying the rank range
./scripts/train_launcher.sh shared --category langtext --rank-range 0-7   # Language understanding nodes (0-7) shared model
./scripts/train_launcher.sh shared --category vision --rank-range 8-11    # Visual processing nodes (8-11) shared model

# General shared modeling (available for all ranks)
./scripts/train_launcher.sh shared --category multimodal --shared         # Multimodal general purpose model

Shared models make it possible to share resources efficiently among multiple nodes with similar functionality.

Quick Start

# Start comprehensive AI learning with one command
./scripts/start_japanese_training.sh

Please refer to the "Japanese Learning Settings" section below for details.

Prerequisites

System requirements

  • CPU: Intel/AMD x64, ARM64, Apple Silicon
  • GPU: NVIDIA GPU (CUDA 11.8+), AMD GPU (ROCm), Apple Silicon GPU
  • Memory: Minimum 16GB, 64GB or more recommended
  • Storage: At least 100GB SSD, 1TB or more recommended for large-scale learning
  • OS: Linux, macOS, Windows (WSL2)

Software requirements

  • Docker: 20.10+
  • Docker Compose: 2.0+
  • Kubernetes: 1.24+ (for cluster deployments)
  • Python: 3.10+
  • CUDA: 11.8+ (when using GPU)

Network requirements

  • Internet Connection: For data download
  • Internal network: For distributed node communication
  • Port open: 8000-8007 (API), 5432 (PostgreSQL), 9200 (Elasticsearch)

Environment setup

1. Clone the repository

git clone https://github.com/your-org/EvoSpikeNet.git
cd EvoSpikeNet

2. Setting environment variables

# Create .env file
cp .env.example .env

# Example of editing content
EVOSPIKENET_API_KEYS=your_api_key_here
DATABASE_URL=postgresql://user:password@localhost/evospikenet
OPENAI_API_KEY=your_openai_key
CUDA_VISIBLE_DEVICES=0,1,2,3  # When using GPU

3. Setting up the Python environment

# Virtual environment creation
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate    # Windows

# Dependency installation
pip install -r requirements.txt
pip install -e .

Data preparation

Bulk data download 🚀

Quick Start (Download all categories at once)

# Download data for all categories at once
python scripts/collect_llm_training_data.py --config config/data_config.yaml --all

# run in background
nohup python scripts/collect_llm_training_data.py --config config/data_config.yaml --all > download.log 2>&1 &

Bulk download by category

# Bulk download of language data (Japanese) (13M+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category langtext

# Download Vision data in bulk (190K+ samples)
python scripts/download_vision_data.py --quick  # High priority only
python scripts/download_vision_data.py --all    # Full dataset

# Download audio data in bulk (565K+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category audio

# Bulk download MultiModal data (885K+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category multimodal

Check data download status

# Check downloaded data
python scripts/verify_training_data_sufficiency.py

# Check the amount of data by category
find data/llm_training/ -type f -name "*.jsonl" -exec wc -l {} +

# Check the number of samples of Vision data
python -c "
from datasets import load_from_disk
import os
for dataset in ['cifar10', 'cifar100', 'fashion_mnist']:
    for split in ['train', 'test']:
        path = f'data/llm_training/Vision/{dataset}/{split}'
        if os.path.exists(path):
            ds = load_from_disk(path)
            print(f'{dataset}/{split}: {len(ds):,} samples')
"

Data download options

| Option | Description | Example |
| --- | --- | --- |
| --all | Download all categories | --all |
| --category <name> | Specific category only | --category langtext |
| --rebuild | Rebuild data | --rebuild |
| --max-samples <n> | Limit the number of samples | --max-samples 10000 |
| --parallel | Parallel download | --parallel 4 |

Data structure

data/
├── llm_training/           # LLM training data (save by category)
│   ├── LangText/          # Text data for language understanding
│   │   ├── langtext_en_data.jsonl    # English text data
│   │   └── langtext_ja_data.jsonl    # Japanese text data (13M+ samples)
│   ├── Vision/            # image data
│   │   ├── cifar10/       # CIFAR-10 (60K)
│   │   ├── cifar100/      # CIFAR-100 (60K)
│   │   ├── fashion_mnist/ # Fashion-MNIST (70K)
│   │   └── vision_data.jsonl         # Landmark image data
│   ├── Audio/             # Audio listening data (565K+ samples)
│   │   └── audio_data.jsonl          # ASR learning data
│   └── MultiModal/        # Multimodal integrated data (885K+ samples)
│       └── multimodal_data.jsonl     # multimodal data
├── MNIST/                 # MNIST dataset
├── audio_dataset/         # audio dataset
├── multi_modal_dataset/   # Multimodal dataset
└── checkpoints/           # checkpoint

Data separation benefits:

  • LangText: Text data for training language understanding/generation models
  • Vision: World landmark image recognition data
  • Audio: Audio listening (ASR) data in both Japanese and English
  • MultiModal: Image + text integrated learning data

Data for each category is stored in a separate JSONL file and used for training depending on the model type.
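
The per-category JSONL files can be inspected programmatically. The following is a minimal sketch; the file paths follow the directory tree above and may differ in your checkout.

# Count samples in each category's JSONL file (sketch; paths follow the tree above)
from pathlib import Path

CATEGORY_FILES = {
    "LangText": "data/llm_training/LangText/langtext_ja_data.jsonl",
    "Vision": "data/llm_training/Vision/vision_data.jsonl",
    "Audio": "data/llm_training/Audio/audio_data.jsonl",
    "MultiModal": "data/llm_training/MultiModal/multimodal_data.jsonl",
}

for category, path in CATEGORY_FILES.items():
    p = Path(path)
    if p.exists():
        # One JSON record per line, so the line count equals the sample count
        with p.open("r", encoding="utf-8") as f:
            n = sum(1 for _ in f)
        print(f"{category}: {n:,} samples")
    else:
        print(f"{category}: not downloaded yet ({path})")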

Data collection script

LLM training data collection

# Confirm configuration file
cat config/data_config.yaml

# Data collection execution
python scripts/collect_llm_training_data.py --config config/data_config.yaml

Rank specific data collection

In a 24-node distributed brain architecture, each rank collects data optimized for its area of expertise:

# Data collection for language understanding nodes (ranks 0-7)
./scripts/train_launcher.sh collect --rank 0    # Aozora Bunko, Japanese Wikipedia
./scripts/train_launcher.sh collect --rank 1    # Japanese classical literature, dialogue data

# Data collection for visual processing nodes (ranks 8-11)
./scripts/train_launcher.sh collect --rank 8    # ImageNet, COCO dataset
./scripts/train_launcher.sh collect --rank 9    # CIFAR-100, landmark recognition

# Data collection for voice processing nodes (ranks 12-15)
./scripts/train_launcher.sh collect --rank 12   # Common Voice Japanese, LibriSpeech
./scripts/train_launcher.sh collect --rank 13   # TEDlium, voice translation data

# Data collection for motion control nodes (ranks 16-19)
./scripts/train_launcher.sh collect --rank 16   # Roboturk, behavior generation data
./scripts/train_launcher.sh collect --rank 17   # Trajectory planning, sequence data

# Data collection for storage nodes (rank 20-21)
./scripts/train_launcher.sh collect --rank 20   # episodic memory data
./scripts/train_launcher.sh collect --rank 21   # Time series data, long-term dependence

# Data collection for decision nodes (ranks 22-23)
./scripts/train_launcher.sh collect --rank 22   # Strategy games, decision-making tasks
./scripts/train_launcher.sh collect --rank 23   # Reinforcement learning data, optimization problem

Data collection for each rank automatically selects the best data source and downloads high-quality, specialized data.

Collection from individual data sources

# Wikipedia data
python -c "
from evospikenet.dataloaders import WikipediaLoader
loader = WikipediaLoader(lang='en')
text = loader.load('Python (programming language)')
print(f'Downloaded {len(text)} characters')
"

# Hugging Face dataset
python -c "
from datasets import load_dataset
dataset = load_dataset('imdb', split='train[:10%]')
print(f'Loaded {len(dataset)} samples')
"

Data format

Text data (JSONL format)

{"text": "This is a sample text for LLM training.", "source": "wikipedia", "language": "en"}
{"text": "これはLLMトレーニング用のサンプルテキストです。", "source": "aozora", "language": "ja"}
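
To consume records in this format, each line is parsed as an independent JSON object. Below is a minimal reader sketch; the helper name iter_text_records is hypothetical and the field names follow the examples above.

# Read JSONL text records (sketch; expects the "text"/"source"/"language" fields shown above)
import json

def iter_text_records(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield record["text"], record.get("source"), record.get("language")

# Example usage:
# for text, source, lang in iter_text_records("data/llm_training/LangText/langtext_ja_data.jsonl"):
#     print(lang, text[:40])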

Image data (ImageFolder format)
data/
├── train/
│   ├── class1/
│   │   ├── image001.jpg
│   │   └── image002.jpg
│   └── class2/
│       ├── image003.jpg
│       └── image004.jpg
└── test/
    ├── class1/
    └── class2/

Audio data (folder by class)

data/audio_dataset/
├── speech_commands/
│   ├── yes/
│   ├── no/
│   ├── up/
│   └── down/
└── custom_audio/
    ├── music/
    └── speech/

How to start training

Batch learning start 🎯

Quick Start (Learn all categories at once)

# Start training for all 24 nodes at once
./scripts/train_all_nodes.sh

# or individual launch
for rank in {0..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --rank $rank \
    --gpu &
done

Bulk learning by category

# Language understanding nodes (Rank 0-7) batch learning
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --gpu &
done

# Vision node (Rank 8-11) batch learning
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank $rank \
    --gpu &
done

# Audio node (Rank 12-15) bulk learning
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank $rank \
    --gpu &
done

# MultiModal node (Rank 16-23) batch learning
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank $rank \
    --gpu &
done

Bulk color learning

# All ranks color learning training (minimum level)
for rank in {0..23}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level minimum \
    --rank $rank \
    --gpu &
done

# Vision-specific color learning (standard level)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level standard \
    --rank $rank \
    --gpu &
done

# High precision color learning (maximum level)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level maximum \
    --rank $rank \
    --gpu &
done

Bulk launch by learning level

| Level | Number of colors | Recommended rank | Execution example |
| --- | --- | --- | --- |
| minimum | 8-16 colors | All nodes | --color-level minimum |
| standard | 32-64 colors | Vision specialized | --color-level standard --rank 8-11 |
| maximum | 128-256 colors | Vision specialized | --color-level maximum --rank 8-11 |

1. Launch using Docker Compose

GPU training

# Start LLM training in GPU environment
docker-compose -f docker-compose.train.yml up llm-trainer-gpu

# Background execution
docker-compose -f docker-compose.train.yml up -d llm-trainer-gpu

CPU training

# Start LLM training in CPU environment
docker-compose -f docker-compose.train.yml up llm-trainer-cpu

# Background execution
docker-compose -f docker-compose.train.yml up -d llm-trainer-cpu

2. Distributed training using Kubernetes

# Deploy to Kubernetes cluster
kubectl apply -f k8s/deployment.yaml

# Start training job
kubectl apply -f k8s/training-job.yaml

# Check status
kubectl get pods -n evospikenet
kubectl logs -f deployment/evospikenet-trainer -n evospikenet

3. Direct script execution

Start API server

# Start training server in API mode
python scripts/train_llm_models.py --config config/training_config.yaml --mode api --gpu

# CPU mode
python scripts/train_llm_models.py --config config/training_config.yaml --mode api --cpu

Direct training execution

# LangText model training
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --gpu \
    --epochs 10 \
    --batch-size 16

# Vision model training
python examples/train_vision_encoder.py \
    --dataset mnist \
    --epochs 50 \
    --batch-size 128 \
    --gpu

# Audio model training
python examples/train_audio_encoder.py \
    --epochs 30 \
    --batch-size 32 \
    --gpu

4. Distributed training

Distributed learning on multiple nodes

# master node
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --mode distributed \
    --rank 0 \
    --world-size 4 \
    --master-addr localhost \
    --master-port 12345

# worker node 1
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --mode distributed \
    --rank 1 \
    --world-size 4 \
    --master-addr master-node-ip \
    --master-port 12345

5. Rank specific training

In a 24-node distributed brain architecture, each rank (0-23) generates a specialized LLM. Rank-specific training automatically selects the best model architecture, training parameters, and dataset.

Perform rank-specific training

# Training language understanding nodes (rank 0-7)
./scripts/train_launcher.sh rank --rank 0 --category langtext --gpu
./scripts/train_launcher.sh rank --rank 1 --category langtext --gpu

# Training visual processing nodes (ranks 8-11)
./scripts/train_launcher.sh rank --rank 8 --category vision --gpu
./scripts/train_launcher.sh rank --rank 9 --category vision --gpu

# Training audio processing nodes (ranks 12-15)
./scripts/train_launcher.sh rank --rank 12 --category audio --gpu
./scripts/train_launcher.sh rank --rank 13 --category audio --gpu

# Training of motor control nodes (ranks 16-19)
./scripts/train_launcher.sh rank --rank 16 --category motor --gpu
./scripts/train_launcher.sh rank --rank 17 --category motor --gpu

# Training memory nodes (rank 20-21)
./scripts/train_launcher.sh rank --rank 20 --category memory --gpu
./scripts/train_launcher.sh rank --rank 21 --category memory --gpu

# Training decision nodes (rank 22-23)
./scripts/train_launcher.sh rank --rank 22 --category decision --gpu
./scripts/train_launcher.sh rank --rank 23 --category decision --gpu

Automatic setting of rank-specific parameters

Each rank automatically applies the following optimization parameters:

  • Language Understanding Nodes (0-7):
  • Model: rinna/japanese-gpt-1b
  • Optimization: Specialized in Japanese NLP tasks
  • Dataset: Aozora Bunko, Japanese Wikipedia
  • Learning rate: 2e-5

  • Visual processing nodes (8-11):

  • Model: google/vit-base-patch16-224
  • Optimization: image classification, object detection
  • Dataset: ImageNet, COCO
  • Learning rate: 1e-4

  • Speech processing nodes (12-15):

  • Model: openai/whisper-small
  • Optimization: speech recognition, multilingual support
  • Dataset: Common Voice, LibriSpeech
  • Learning rate: 1e-5

  • Motor control nodes (16-19):

  • Model: Custom Transformer
  • Optimization: sequence generation, behavior prediction
  • Dataset: robotics data, motion trajectory
  • Learning rate: 3e-5

  • Memory nodes (20-21):

  • Model: Memory Expansion Transformer
  • Optimization: long-term dependencies, experience integration
  • Dataset: Episodic data, time series
  • Learning rate: 1e-5

  • Decision Node (22-23):

  • Model: High-Level Inference Transformer
  • Optimization: Strategic judgment, execution function
  • Datasets: decision-making tasks, strategy data
  • Learning rate: 2e-5
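
The parameters above can be summarized as a simple lookup table. The sketch below is illustrative only: the dictionary name is hypothetical, the last three model strings are placeholders for the custom architectures named above, and the real mapping lives in the training scripts and configs.

# Illustrative rank-group -> training parameters mapping (values taken from the list above).
# RANK_GROUP_PARAMS is a hypothetical name, not the project's actual data structure.
RANK_GROUP_PARAMS = {
    range(0, 8):   {"model": "rinna/japanese-gpt-1b",            "learning_rate": 2e-5},  # language understanding
    range(8, 12):  {"model": "google/vit-base-patch16-224",      "learning_rate": 1e-4},  # visual processing
    range(12, 16): {"model": "openai/whisper-small",             "learning_rate": 1e-5},  # speech processing
    range(16, 20): {"model": "custom-transformer",               "learning_rate": 3e-5},  # motor control (placeholder)
    range(20, 22): {"model": "memory-augmented-transformer",     "learning_rate": 1e-5},  # memory (placeholder)
    range(22, 24): {"model": "high-level-reasoning-transformer", "learning_rate": 2e-5},  # decision making (placeholder)
}

def params_for_rank(rank: int) -> dict:
    for ranks, params in RANK_GROUP_PARAMS.items():
        if rank in ranks:
            return params
    raise ValueError(f"rank {rank} is outside the 24-node architecture")

print(params_for_rank(9))   # -> ViT parameters for a vision node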

Rank specific training via API

# API server start
python scripts/train_llm_models.py --mode api --gpu

# Creating a single-rank-only model
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": 0,
    "epochs": 10,
    "batch_size": 16
  }'

# Creating a shared model (can be used with multiple ranks)
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": "shared",
    "shared": true,
    "epochs": 10,
    "batch_size": 16
  }'

# Creating a base model (for fine tuning)
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": "base",
    "epochs": 5,
    "batch_size": 32
  }'

Example of model names generated via API:

  • Single rank: evospike-langtext-r00-v001
  • Rank range: evospike-langtext-r00-r07-v001
  • Shared model: evospike-langtext-shared-v001
  • Base model: evospike-langtext-base-v001
  • General model: evospike-langtext-general-v001


Details of each learning method 📚

1. Language understanding learning (LangText)

Target rank

  • Rank 0-7: Language understanding node (Japanese NLP specialized)

Dataset

  • Japanese Wikipedia: 935,640 samples
  • Common Crawl Japanese: 2,342,820 samples
  • OSCAR Japanese: 1,399,920 samples
  • Aozora Bunko, papers, dialogues, codes, web, novels, legal documents, etc.
  • Total: 14,411,625 samples (OPTIMAL)

Data download

# Bulk download of language data
python scripts/collect_llm_training_data.py \
    --config config/data_config.yaml \
    --category langtext

# Check download status
wc -l data/llm_training/LangText/langtext_ja_data.jsonl

Training method

# Single rank learning (Rank 0)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --gpu

# All language nodes batch learning (Rank 0-7)
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --gpu &
done
# config/training_config.yaml
model:
  name: "rinna/japanese-gpt-1b"
  max_length: 2048
  tokenizer: "rinna/japanese-gpt-1b"

training:
  epochs: 10
  batch_size: 4
  learning_rate: 2e-5
  gradient_accumulation_steps: 8
  warmup_steps: 1000
  fp16: true

Estimated training time

  • GPU (RTX 3090): 20-30 hours
  • GPU (A100): 10-15 hours
  • CPU: 100-150 hours (not recommended)

2. Vision learning (image recognition)

Target rank

  • Rank 8-11: Visual processing node (image classification/object detection)

Dataset

  • CIFAR-10: 60,000 samples (basic image classification)
  • CIFAR-100: 60,000 samples (detailed image classification)
  • Fashion-MNIST: 70,000 samples (fashion images)
  • Google Landmarks: Landmark recognition
  • Total: 195,000+ samples (OPTIMAL)

Data download

# Quick download (high priority dataset: CIFAR-10/100, Fashion-MNIST)
python scripts/download_vision_data.py --quick

# Download all datasets (including Food-101, Oxford Pets, Flowers)
python scripts/download_vision_data.py --all

# Individual download
python scripts/download_vision_data.py --dataset cifar10
python scripts/download_vision_data.py --dataset cifar100
python scripts/download_vision_data.py --dataset fashion_mnist

# List of available datasets
python scripts/download_vision_data.py --list

# Check download status
python -c "
from datasets import load_from_disk
import os
total = 0
for ds in ['cifar10', 'cifar100', 'fashion_mnist']:
    for split in ['train', 'test']:
        path = f'data/llm_training/Vision/{ds}/{split}'
        if os.path.exists(path):
            data = load_from_disk(path)
            samples = len(data)
            total += samples
            print(f'{ds}/{split}: {samples:,} samples')
print(f'\\nTotal: {total:,} samples')
"

Training method

# Single rank learning (Rank 8)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank 8 \
    --gpu

# Batch learning of all Vision nodes (Rank 8-11)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank $rank \
    --gpu &
done

# Dataset specific training
python scripts/train_llm_models.py \
    --category vision \
    --rank 8 \
    --dataset cifar10 \
    --gpu
# config/training_config.yaml
model:
  name: "google/vit-base-patch16-224"
  image_size: 224
  patch_size: 16

training:
  epochs: 30
  batch_size: 32
  learning_rate: 1e-4
  optimizer: "adamw"
  weight_decay: 0.01
  fp16: true

Estimated training time

  • GPU (RTX 3090): 5-8 hours
  • GPU (A100): 3-5 hours

3. Audio learning (speech recognition)

Target rank

  • Rank 12-15: Audio processing node (Japanese/English ASR)

Dataset

  • LibriSpeech: 460,000 samples
  • Common Voice: 50,000 samples
  • VoxPopuli: 30,000 samples
  • ReazonSpeech: 25,000 samples
  • Total: 565,000+ samples (OPTIMAL)

Data download

# Audio data bulk download
python scripts/collect_llm_training_data.py \
    --config config/data_config.yaml \
    --category audio

# Check download status
wc -l data/llm_training/Audio/audio_data.jsonl

Training method

# Single rank learning (Rank 12)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank 12 \
    --gpu

# Batch learning of all audio nodes (Rank 12-15)
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank $rank \
    --gpu &
done
# config/training_config.yaml
model:
  name: "openai/whisper-small"
  sampling_rate: 16000
  language: "ja"

training:
  epochs: 20
  batch_size: 16
  learning_rate: 1e-5
  gradient_accumulation_steps: 4
  fp16: true

Estimated training time

  • GPU (RTX 3090): 10-15 hours
  • GPU (A100): 6-10 hours

4. MultiModal learning (multimodal integration)

Target rank

  • Rank 16-19: Motion control node
  • Rank 20-21: Memory node
  • Rank 22-23: Decision node

Dataset

  • COCO Captions: 414,000 samples
  • Flickr30k: 145,000 samples
  • Conceptual Captions: 300,000 samples
  • Visual Genome: 26,000 samples
  • Total: 885,000+ samples (OPTIMAL)

Data download

# MultiModal data bulk download
python scripts/collect_llm_training_data.py \
    --config config/data_config.yaml \
    --category multimodal

# Check download status
wc -l data/llm_training/MultiModal/multimodal_data.jsonl

Training method

# Single rank learning (Rank 16)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank 16 \
    --gpu

# Batch learning of all MultiModal nodes (Rank 16-23)
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank $rank \
    --gpu &
done
# config/training_config.yaml
model:
  name: "openai/clip-vit-base-patch32"
  text_encoder: "bert-base-uncased"
  vision_encoder: "vit-base-patch32"

training:
  epochs: 15
  batch_size: 24
  learning_rate: 5e-5
  warmup_steps: 500
  fp16: true

Estimated training time

  • GPU (RTX 3090): 15-20 hours
  • GPU (A100): 8-12 hours

5. Comparison table of learning methods

| Category | Target ranks | Data amount | Training time (GPU) | Recommended model | Main uses |
| --- | --- | --- | --- | --- | --- |
| LangText | 0-7 | 14.4M | 20-30 hours | rinna/japanese-gpt-1b | Japanese understanding/generation |
| Vision | 8-11 | 195K+ | 5-8 hours | vit-base-patch16-224 | Image classification/recognition |
| Audio | 12-15 | 565K+ | 10-15 hours | whisper-small | Speech recognition/ASR |
| MultiModal | 16-23 | 885K+ | 15-20 hours | clip-vit-base-patch32 | Image + text integration |

6. Batch training script for all categories

#!/bin/bash
# scripts/train_all_categories.sh

echo "=== Starting batch training for all categories ==="

# LangText(Rank 0-7)
echo "Starting LangText training..."
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --category langtext --rank $rank --gpu &
done

# Vision(Rank 8-11)
echo "Starting Vision training..."
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category vision --rank $rank --gpu &
done

# Audio(Rank 12-15)
echo "Starting Audio training..."
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --category audio --rank $rank --gpu &
done

# MultiModal(Rank 16-23)
echo "Starting MultiModal training..."
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --category multimodal --rank $rank --gpu &
done

echo "=== Training launched for all 24 nodes ==="
echo "Check progress: tail -f logs/training.log"

7. Color learning training ⭐ NEW

Color Learning in distributed brain systems is specialized training for each node to acquire the ability to understand, process, and generate color information. It offers three learning levels (minimum, standard, and maximum) and is optimized for each node type.

Characteristics of color learning

  • 3 learning levels: Minimum (8-16 colors), Standard (32-64 colors), Maximum (128-256 colors)
  • Node-specific optimization: Specialized for each node type such as PFC, Vision, Language, etc.
  • Automatic data generation: Integrating synthetic data and Hugging Face datasets
  • Transfer learning supported: Efficient learning from pre-trained models
  • Knowledge Distillation Support: Knowledge transfer from large-scale models to small-scale models

Color learning level details

| Level | Number of colors | Dataset | Training time | GPU VRAM | Application |
| --- | --- | --- | --- | --- | --- |
| Minimum | 8-16 colors | MNIST, Basic Colors (150MB) | 1-2 hours | 2-4GB | Prototyping, basic color classification |
| Standard | 32-64 colors | CIFAR-10/100, ImageNet subset (2-5GB) | 4-8 hours | 8-12GB | Practical applications, general color recognition |
| Maximum | 128-256 colors | ImageNet, COCO (20-50GB) | 12-24 hours | 16-24GB | Professional color processing, research use |

# 1. Check the configuration information
python scripts/train_llm_models.py --config config/training_config.yaml --show-color-config

# 2. Data download + training (Vision node Rank 9)
python scripts/train_llm_models.py --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank 9 \
  --download-data \
  --gpu

# 3. Language Node (Rank 20) - Standard level
python scripts/train_llm_models.py --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 20 \
  --gpu

# 4. Parallel training of multiple nodes
# GPU 0: Vision node
CUDA_VISIBLE_DEVICES=0 python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 9 \
  --gpu &

# GPU 1: Language node
CUDA_VISIBLE_DEVICES=1 python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 20 \
  --gpu &

wait

Rank → automatic node type mapping

The --rank option automatically determines the appropriate node type:

| Rank range | Node type | Importance of color learning | Recommended level |
| --- | --- | --- | --- |
| 0-7 | PFC (prefrontal cortex) | High | Standard |
| 8-11 | Vision | | Maximum |
| 12-15 | Audio | Low | Minimum |
| 16-19 | Motor (movement) | Medium | Minimum-Standard |
| 20-21 | Memory | Medium | Standard |
| 22-23 | PFC (Decision Making) | High | Standard |
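
A sketch of the automatic rank-to-node-type resolution implied by this table. The function name is hypothetical; the real logic lives inside the training scripts.

# Map a rank (0-23) to its node type, following the table above (illustrative helper).
def node_type_for_rank(rank: int) -> str:
    if 0 <= rank <= 7:
        return "pfc"        # prefrontal cortex / language understanding
    if 8 <= rank <= 11:
        return "vision"
    if 12 <= rank <= 15:
        return "audio"
    if 16 <= rank <= 19:
        return "motor"
    if 20 <= rank <= 21:
        return "memory"
    if 22 <= rank <= 23:
        return "pfc"        # decision-making PFC nodes
    raise ValueError(f"rank {rank} is outside the 24-node architecture")

print(node_type_for_rank(9))   # -> "vision"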

Execution from dedicated script (detailed control)

# 1. Download the dataset
python scripts/download_color_datasets.py \
    --level minimum \
    --modality all \
    --output-dir data/color_learning

# Show available dataset information
python scripts/download_color_datasets.py --show-info

# 2. Training the model
# PFC node (multimodal) - lowest level
python scripts/train_color_learning_models.py \
    --node-type pfc \
    --level minimum \
    --dataset-path data/color_learning/multimodal/mnist_captions \
    --epochs 5 \
    --gpu

# Vision Node - Standard Level
python scripts/train_color_learning_models.py \
    --node-type vision \
    --level standard \
    --dataset cifar10 \
    --epochs 20 \
    --gpu

# Language Node - Standard Level
python scripts/train_color_learning_models.py \
    --node-type language \
    --level standard \
    --dataset-path data/color_learning/language \
    --epochs 25 \
    --gpu

# Check configuration information (dry-run)
python scripts/train_color_learning_models.py \
    --node-type vision-object \
    --level maximum \
    --show-config

Color learning training via API

# API server start
python scripts/train_llm_models.py --config config/training_config.yaml \
  --mode api --host 0.0.0.0 --port 8000

# Submit color learning job
curl -X POST "http://localhost:8000/train" \
  -H "Content-Type: application/json" \
  -d '{
    "category": "color_learning",
    "model_name": "evospike-color-vision-r09",
    "dataset_path": "data/color_learning/minimum/vision",
    "output_dir": "saved_models/color_vision_minimum",
    "gpu": true,
    "epochs": 10,
    "batch_size": 16,
    "learning_rate": 0.0001,
    "rank": 9
  }'

Color learning model naming convention

The generated color learning model follows the following naming convention:

  • Single rank: evospike-color_learning_vision-r09-v001
  • Rank Range: evospike-color_learning_vision-r09-r11-v001
  • Shared model: evospike-color_learning_vision-shared-v001

Color learning settings by node type

Each node type has different color learning requirements:

PFC (Prefrontal Cortex) node:
  • Minimum: 8 colors, basic color understanding
  • Standard: 64 colors, practical color recognition
  • Maximum: 256 colors, professional color processing

Vision node:
  • Minimum: 16 colors, basic image classification
  • Standard: 64 colors, detailed color recognition
  • Maximum: 256 colors, professional color processing

Language node:
  • Minimum: 8 colors, basic understanding of color names
  • Standard: 32 colors, detailed representation of colors
  • Maximum: 128 colors, nuanced color rendering

Motor node:
  • Minimum: 8 colors, basic color feedback
  • Standard: 16 colors, visual guidance
  • Maximum: 32 colors, detailed visual control

Audio node:
  • Minimum: 4 colors, minimal visual integration
  • Standard: 8 colors, basic multimodal support
  • Maximum: 16 colors, audio-visual integration

Memory node:
  • Minimum: 8 colors, basic episode recording
  • Standard: 32 colors, detailed memory encoding
  • Maximum: 64 colors, high-resolution memory retention
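
For reference, the color counts above can be expressed in code as a lookup table. This is illustrative only; the constant name is hypothetical and not part of the codebase.

# Color counts per node type and learning level, as listed above (illustrative constant).
COLOR_COUNTS = {
    "pfc":      {"minimum": 8,  "standard": 64, "maximum": 256},
    "vision":   {"minimum": 16, "standard": 64, "maximum": 256},
    "language": {"minimum": 8,  "standard": 32, "maximum": 128},
    "motor":    {"minimum": 8,  "standard": 16, "maximum": 32},
    "audio":    {"minimum": 4,  "standard": 8,  "maximum": 16},
    "memory":   {"minimum": 8,  "standard": 32, "maximum": 64},
}

print(COLOR_COUNTS["vision"]["standard"])  # -> 64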

Progressive training

Efficient learning is possible by increasing the learning level step by step:

#!/bin/bash
# progressive_color_training.sh

RANK=9  # Vision node

# Stage 1: Minimum (basic learning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank $RANK \
  --download-data \
  --gpu

# Stage 2: Standard (transfer learning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank $RANK \
  --gpu

# Stage 3: Maximum (final adjustment)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level maximum \
  --rank $RANK \
  --gpu

echo "✅ Progressive training completed for Rank $RANK"

Efficiency through knowledge distillation

Knowledge transfer from large-scale models to small-scale models:

# 1. Train the teacher model (maximum level)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level maximum \
  --rank 9 \
  --gpu

# 2. Knowledge distillation to student model (standard level)
# TODO: Planned implementation of knowledge distillation script

Color learning assessment

# Evaluate model color recognition accuracy
python scripts/evaluate_color_learning.py \
  --model-path saved_models/evospike-color_learning_vision-r09-v001 \
  --test-dataset data/color_learning/standard/vision/test \
  --metrics accuracy,f1,confusion_matrix

# Visualization of results
python scripts/visualize_color_results.py \
  --results results/color_learning_evaluation.json \
  --output visualizations/color_learning

Dataset information

Main datasets used in color learning:

Minimum level:
  • MNIST (60,000 images, grayscale)
  • Basic Colors (10,000 synthetic images, 8 colors)
  • Total: ~150MB

Standard level:
  • CIFAR-10 (60,000 images, 10 classes)
  • CIFAR-100 subset (20,000 images, 64 colors)
  • Color Text (50,000 texts, color descriptions)
  • Total: ~2-5GB

Maximum level:
  • ImageNet subset (100,000 images)
  • COCO (118,287 images)
  • Flickr30k (31,000 images)
  • Wikipedia Color Corpus (500,000 texts)
  • Total: ~20-50GB

Troubleshooting

Out of memory error:

# Reduce batch size
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 9 \
  --gpu \
  --batch-size 4  # reduced from default 16

Data download error:

# download separately
python scripts/download_color_datasets.py \
  --level minimum \
  --modality vision \
  --output-dir data/color_learning

# Then run training (without --download-data)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank 9 \
  --gpu

Data structure and storage location

LLM model naming convention

EvoSpikeNet applies a uniform naming convention to generated LLM models, so the model type, category, rank, and version can be identified at a glance. The convention covers not only models dedicated to a single rank but also models shared by multiple ranks.

Naming convention structure

Basic format:

{Type}-{Category}-{RankSpec}-v{Version:03d}

**Description of each element:**
- **Type**: Model architecture type
  - `evospike`: EvoSpikeNet unique architecture
  - `brain`: Brain architecture model
  - `node`: Node-specific model
- **Category**: Specialty category (based on NODE_TYPE_TO_CATEGORIES)
  - `text_generation`: Text generation (executive node)
  - `image_classification`: Image classification (vision node)
  - `object_detection`: Object detection (vision node)
  - `speech_recognition`: Speech recognition (auditory node)
  - `motion_control`: Motion control (motor node)
  - `decision_making`: Decision making (executive node)
  - `planning`: Planning (executive node)
  - `reasoning`: Reasoning (executive node)
  - `rag`: RAG (search extension generation) (executive node)
  - `multimodal`: Multimodal (general node)
  - `embedding`: Embedding (general node)
- **RankSpec**: Rank specification (multiple patterns supported)
  - Single rank: `r{Rank:02d}` (e.g. `r00`, `r08`)
  - Rank range: `r{Start:02d}-r{End:02d}` (Example: `r00-r07`, `r08-r11`)
  - Shared model: `shared` (can be shared among all ranks)
  - General model: `general` (rank independent)
  - Base model: `base` (basis for other models)
- **Version**: Version number (001, 002, ...)

#### Naming Convention Examples

**Single rank only model:**

evospike-text_generation-r00-v001     # Rank 0 exclusive text generation model
evospike-image_classification-r08-v001 # Rank 8 dedicated image classification model
evospike-speech_recognition-r12-v001   # Rank 12 dedicated voice recognition model
evospike-decision_making-r22-v001     # Rank 22 exclusive decision-making model

**Multiple rank sharing model:**

evospike-text_generation-r00-r07-v001      # Shared text generation model for ranks 0-7
evospike-image_classification-r08-r11-v001 # Shared image classification model for ranks 8-11
evospike-speech_recognition-r12-r15-v001   # Shared speech recognition model for ranks 12-15
evospike-motion_control-r16-r19-v001       # Motion control model shared by ranks 16-19

**Shared/General/Base model:**

evospike-text_generation-shared-v001   # Text generation model that can be shared across all ranks
evospike-image_classification-shared-v001 # Image classification model that can be shared across all ranks
evospike-text_generation-general-v001  # Rank-independent general text generation model
evospike-text_generation-base-v001     # Base of text generation model (for fine tuning)
evospike-multimodal-base-v001          # Base of multimodal model
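
A small helper that produces names following this convention could look like the sketch below. The function is hypothetical and not part of the codebase; it simply reproduces the {Type}-{Category}-{RankSpec}-v{Version:03d} format.

# Build a model name following {Type}-{Category}-{RankSpec}-v{Version:03d} (illustrative helper).
def build_model_name(category: str, rank=None, rank_end=None,
                     model_type: str = "evospike", version: int = 1) -> str:
    if rank is None:
        rank_spec = "general"
    elif isinstance(rank, str):          # "shared", "general", or "base"
        rank_spec = rank
    elif rank_end is not None:           # rank range, e.g. r00-r07
        rank_spec = f"r{rank:02d}-r{rank_end:02d}"
    else:                                # single rank, e.g. r08
        rank_spec = f"r{rank:02d}"
    return f"{model_type}-{category}-{rank_spec}-v{version:03d}"

print(build_model_name("text_generation", 0))            # evospike-text_generation-r00-v001
print(build_model_name("image_classification", 8, 11))   # evospike-image_classification-r08-r11-v001
print(build_model_name("text_generation", "shared"))     # evospike-text_generation-shared-v001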

Model sharing use cases

  • Shared model: When using the same model on multiple nodes with the same functionality
  • Base model: Basic model when creating a dedicated model for each rank
  • General model: General model without rank-specific optimization
  • Scope model: A model that can be shared by nodes in the same field of expertise

Advantages of naming conventions

  • Identification: Type/Category/RankSpec/Version can be seen at a glance
  • Flexibility: Supports single/multiple/shared models
  • Extensibility: Easy to add new sharing patterns
  • Consistency: Uniform naming across all training methods

Naming Convention Examples

# Initial model of language understanding node (rank 0)
evospike-langtext-r00-v001

# Improved model of visual processing node (rank 8)
evospike-vision-r08-v002

# Specialized model of speech processing node (rank 12)
evospike-audio-r12-v001

# Trained model for motion control node (rank 16)
evospike-motor-r16-v003

# Optimization model for storage nodes (rank 20)
evospike-memory-r20-v002

# High performance model of decision node (rank 22)
evospike-decision-r22-v001

Advantages of naming conventions

  • Identification: Type, Category, Rank, Version can be seen at a glance
  • Sortability: Easy to sort by rank or version
  • Extensibility: Easy to add new categories and types
  • Automation: Can be automatically generated using a script

Model storage structure

saved_models/
├── evospike-langtext-r00-v001/     # Language understanding node (rank 0) model
│   ├── config.json                # model settings
│   ├── pytorch_model.bin          # model weights
│   ├── tokenizer.json             # Tokenizer settings
│   ├── vocab.json                 # vocabulary file
│   ├── merges.txt                 # BPE merge file
│   └── training_args.bin          # training arguments
├── evospike-vision-r08-v001/      # Visual processing node (rank 8) model
│   ├── model.pth
│   ├── optimizer.pth
│   └── logs/
├── evospike-audio-r12-v001/       # Audio processing node (rank 12) model
│   ├── model.pt
│   ├── feature_extractor.json
│   └── logs/
├── evospike-decision-r22-v001/    # Decision node (rank 22) model
│   ├── model.bin
│   ├── processor_config.json
│   └── logs/
└── checkpoints/                   # training checkpoint
    ├── evospike-langtext-r00-v001-checkpoint-500/
    ├── evospike-langtext-r00-v001-checkpoint-1000/
    └── ...

Log storage structure

logs/
├── training.log                  # main training log
├── tensorboard/                  # TensorBoard log
│   ├── events.out.tfevents.1234567890.hostname
│   └── ...
├── wandb/                        # Weights & Biases Log
│   ├── run-20231231_123456-abc123/
│   └── ...
└── metrics.json                  # Metrics JSON

Configuration file structure

config/
├── training_config.yaml          # training settings
├── data_config.yaml              # Data collection settings
├── settings.yaml                 # Application settings
├── settings.production.yaml      # Production environment settings
├── settings.staging.yaml         # Staging environment settings
├── settings.development.yaml     # Development environment settings
├── settings.schema.json          # configuration schema
├── node_allocation.yaml          # Node allocation settings
└── progress_settings.yaml        # Progress settings

Database structure

PostgreSQL schema

-- Training jobs table
CREATE TABLE training_jobs (
    id SERIAL PRIMARY KEY,
    job_id VARCHAR(255) UNIQUE NOT NULL,
    model_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL,
    config JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Model metrics table
CREATE TABLE model_metrics (
    id SERIAL PRIMARY KEY,
    job_id VARCHAR(255) REFERENCES training_jobs(job_id),
    epoch INTEGER,
    step INTEGER,
    loss FLOAT,
    accuracy FLOAT,
    perplexity FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Distributed nodes table
CREATE TABLE distributed_nodes (
    id SERIAL PRIMARY KEY,
    node_id VARCHAR(255) UNIQUE NOT NULL,
    ip_address INET,
    gpu_count INTEGER,
    memory_gb INTEGER,
    status VARCHAR(50),
    last_heartbeat TIMESTAMP
);
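
As an illustration, a training job and one metric row can be recorded against this schema roughly as shown below. This is a minimal sketch assuming psycopg2 and the DATABASE_URL from the .env example; the job values are made up, and the project's own persistence layer may work differently.

# Sketch: record a training job and a metric row in the schema above (assumes psycopg2).
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("postgresql://user:password@localhost/evospikenet")
with conn, conn.cursor() as cur:
    # Register a new training job (illustrative values)
    cur.execute(
        "INSERT INTO training_jobs (job_id, model_type, status, config) "
        "VALUES (%s, %s, %s, %s)",
        ("job-langtext-r00-001", "langtext", "running",
         Json({"rank": 0, "epochs": 10, "batch_size": 16})),
    )
    # Record one metrics sample for that job
    cur.execute(
        "INSERT INTO model_metrics (job_id, epoch, step, loss, accuracy, perplexity) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        ("job-langtext-r00-001", 1, 500, 2.31, 0.54, 10.1),
    )
conn.close()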

Elasticsearch index

{
  "mappings": {
    "properties": {
      "job_id": {"type": "keyword"},
      "timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "message": {"type": "text"},
      "metrics": {"type": "object"},
      "node_id": {"type": "keyword"}
    }
  }
}

Monitoring and management

Monitoring your training progress

Monitoring via API

# Check training status
curl http://localhost:8000/training/status

# Get metrics
curl http://localhost:8000/metrics

# Log acquisition
curl http://localhost:8000/logs?lines=100

Visualization on TensorBoard

# Start TensorBoard
tensorboard --logdir logs/tensorboard --port 6006

# Access with browser
open http://localhost:6006

Tracking with Weights & Biases

# W&B Dashboard
wandb login
# Automatic tracking during training

Resource monitoring

GPU usage monitoring

# NVIDIA GPU
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.free --format=csv

# AMD GPU
rocm-smi --showuse

System resource monitoring

# CPU/memory usage
top -p $(pgrep -f train_llm_models)

# Disk usage
df -h /path/to/data /path/to/models

# Network usage
iftop -i eth0

Training management

Pause/Resume Training

# Restart from checkpoint
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --resume-from-checkpoint saved_models/checkpoints/checkpoint-1000 \
    --gpu

Stop training

# Normal stop (checkpoint save)
curl -X POST http://localhost:8000/training/stop

# Forced stop
pkill -f train_llm_models

8. Incremental and federated learning 🆕

Incremental Learning

Incremental learning continues training an existing model on new data, leveraging previous training results to improve the model efficiently.

How to use

# Restart from checkpoint
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --resume-from saved_models/evospike-langtext-r00-v001/checkpoint-1000 \
    --gpu

# Additional learning to existing model (Incremental Learning)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --incremental \
    --gpu

Features

  • ✅ Checkpoint Resume: Resume training interrupted with --resume-from
  • ✅ Knowledge retention: Learn with new data while preserving existing learning content
  • ✅ Efficient: Completed in less time than learning from scratch
  • ✅ Countermeasures against catastrophic forgetting: Minimize knowledge loss with gradual learning

Use case

  1. Add new data: Add new data set to existing model for learning
  2. Continuous Improvement: Continuous improvement of the model through regular data updates
  3. Domain adaptation: Adapting a general model to a specific domain
  4. Version control: History management of incremental model improvements

Example: Additional learning of LangText model

# Step 1: Initial training
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --gpu

# Step 2: Additional learning with new data
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --incremental \
    --gpu

# Step 3: Further learning with specific domain data
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --resume-from saved_models/evospike-langtext-r00-v001 \
    --gpu

Federated Learning

Federated learning runs training on multiple nodes and aggregates their model parameters into an integrated model, learning from distributed data while preserving privacy.

How to use

# Federated learning mode (FedAvg)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --federated \
    --aggregation-method fedavg \
    --federated-rounds 10 \
    --gpu

# Federated Learning (FedProx)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank 8 \
    --federated \
    --aggregation-method fedprox \
    --federated-rounds 20 \
    --gpu

# Federated Learning (FedOpt)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank 12 \
    --federated \
    --aggregation-method fedopt \
    --federated-rounds 15 \
    --gpu

Aggregation Methods

| Method | Description | When to use |
| --- | --- | --- |
| FedAvg | Simple averaging | General federated learning |
| FedProx | Learning with regularization | When there is data imbalance |
| FedOpt | Adaptive optimizer | When fast convergence is required |
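
Conceptually, FedAvg simply averages the parameters of the per-node models after each round. Below is a minimal PyTorch sketch of that aggregation step, for illustration only; the project's --federated mode handles this internally.

# Sketch of FedAvg parameter aggregation across node models (PyTorch).
import torch

def fedavg(state_dicts):
    """Average a list of model state_dicts, weighting each node equally."""
    avg = {}
    for key in state_dicts[0]:
        # Note: integer buffers (e.g. BatchNorm's num_batches_tracked) would need separate handling.
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        avg[key] = stacked.mean(dim=0)
    return avg

# Example: aggregate three node models into a global model
# global_model.load_state_dict(fedavg([m.state_dict() for m in (node1, node2, node3)]))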

Features

  • 🔒 Privacy Protection: Learn while keeping your data local
  • 🌐 Distributed learning: Parallel learning on multiple nodes
  • 🔄 Model aggregation: Integrate models of each node
  • 📊 Supports non-IID data: Learning is possible even in environments with different data distributions
  • ⚡ Improved communication efficiency: Send only model parameters

Federated learning workflow

# Launch federated learning with multiple ranks
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --federated \
    --aggregation-method fedavg \
    --federated-rounds 10 \
    --gpu &
done

echo "Federated learning started on all 8 nodes"

Use case

  1. Privacy-preserving AI: Learning with confidential data such as medical and financial data
  2. Edge device learning: Distributed learning from IoT devices
  3. Cross-silo learning: Collaborative learning of models across multiple organizations
  4. Distributed brain system: Building collaborative intelligence with a 24-node distributed brain

Parameter description

| Parameter | Default | Description |
| --- | --- | --- |
| --federated | False | Enable federated learning mode |
| --aggregation-method | fedavg | Aggregation method: fedavg/fedprox/fedopt |
| --federated-rounds | 10 | Number of federated learning rounds |
| --resume-from | None | Checkpoint path |
| --incremental | False | Enable incremental learning mode |

Combining incremental and federated learning

# Improved with federated learning based on existing model
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --resume-from saved_models/evospike-langtext-r00-v001 \
    --federated \
    --aggregation-method fedprox \
    --federated-rounds 10 \
    --gpu

Troubleshooting

Common issues

Out of memory

# Reduce batch size
--batch-size 4

# Use gradient accumulation
--gradient-accumulation-steps 4

# Increase CPU memory
--cpu-memory-fraction 0.8

Out of GPU memory

# Reduce GPU memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Use mixed precision
--fp16

# make the model smaller
--model-size small

Data loading error

# Data integrity check
python scripts/verify_data_integrity.py --data-dir data/

# data reconstruction
python scripts/collect_llm_training_data.py --config config/data_config.yaml --rebuild

Distributed training issues

Inter-node communication error

# Check firewall settings
sudo ufw allow 12345/tcp

# NCCL debug mode
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1

Node synchronization error

# Time synchronization confirmation
chronyc tracking

# NTP synchronization
sudo systemctl restart chrony

Performance issues

Training speed is slow

# DataLoader optimization
--num-workers 4
--pin-memory
--persistent-workers

# model optimization
--torch-compile
--flash-attention

Does not converge

# Learning rate adjustment
--learning-rate 1e-5

# Add warm-up
--warmup-steps 1000

# Change scheduler
--lr-scheduler cosine

Performance optimization

GPU optimization

Mixed precision training

# training_config.yaml
training:
  fp16: true
  bf16: false  # bf16 recommended on Ampere and newer GPUs
  gradient_checkpointing: true
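
The fp16 setting corresponds to automatic mixed precision in PyTorch. A minimal training-step sketch using torch.cuda.amp is shown below; this is generic PyTorch, not project-specific code, and the function name is illustrative.

# Mixed precision training step with torch.cuda.amp (generic PyTorch sketch).
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_train_step(model, optimizer, batch, targets, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # run the forward pass in mixed precision
        loss = loss_fn(model(batch), targets)
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                 # unscale gradients, then step the optimizer
    scaler.update()                        # adjust the loss scale for the next step
    return loss.item()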

Distributed data parallelism

# Use torchrun
torchrun --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=master_node \
    scripts/train_llm_models.py --config config/training_config.yaml

Data optimization

DataLoader optimization

# High speed data loading settings
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=2
)

Data preprocessing

# Dataset pre-tokenization
python scripts/pretokenize_dataset.py \
    --input data/llm_training/raw \
    --output data/llm_training/tokenized \
    --tokenizer microsoft/DialoGPT-medium

Memory optimization

Gradient accumulation

training:
  batch_size: 2
  gradient_accumulation_steps: 8  # Effective batch size = 16
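
With batch_size: 2 and gradient_accumulation_steps: 8, the optimizer only steps every 8 micro-batches, giving an effective batch size of 16. A generic PyTorch sketch of that loop (the function name and arguments are illustrative):

# Gradient accumulation: the optimizer steps once every `accum_steps` micro-batches.
def train_with_accumulation(model, optimizer, dataloader, loss_fn, accum_steps=8):
    """Effective batch size = micro-batch size x accum_steps (2 x 8 = 16 in the config above)."""
    model.train()
    optimizer.zero_grad()
    for step, (batch, targets) in enumerate(dataloader, start=1):
        loss = loss_fn(model(batch), targets) / accum_steps  # average the loss over the window
        loss.backward()                                       # gradients accumulate across micro-batches
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()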

Memory efficient configuration

# PyTorch memory optimization
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

# CPU memory optimization
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

Storage optimization

Model compression

# Quantization
python scripts/quantize_model.py \
    --model saved_models/lang_evospike_lm_v1 \
    --quantization 8bit

# distillation
python scripts/distill_model.py \
    --teacher saved_models/large_model \
    --student saved_models/small_model

Checkpoint management

training:
  save_steps: 1000
  save_total_limit: 3  # Keep only the latest 3 checkpoints
  save_strategy: steps

Appendix

Configuration file example

training_config.yaml

data_dir: "data/llm_training"
output_dir: "saved_models"

langtext:
  model_name: "microsoft/DialoGPT-medium"
  max_length: 512
  lora_config:
    r: 16
    lora_alpha: 32
    target_modules: ["q_proj", "v_proj"]
    lora_dropout: 0.05

training:
  epochs: 10
  batch_size: 8
  learning_rate: 2e-5
  save_steps: 500
  logging_steps: 100
  fp16: true

gpu:
  use_gpu: true
  gpu_memory_fraction: 0.9

Japanese learning settings

Overview

Large-scale Japanese learning consists of the following components:

  1. Data configuration (config/data_config.yaml)
  2. Training configuration (config/training_config.yaml)
  3. Quick start script (scripts/start_japanese_training.sh)

Data settings

Large-scale learning configuration in config/data_config.yaml:

# Text data for language understanding
langtext_datasets_ja:
  output_dir: "data/llm_training/LangText"
  output_file: "langtext_ja_data.jsonl"
  # 21 types of Japanese datasets (approximately 3,080,000 samples)

# Landmark image data
vision_datasets:
  output_dir: "data/llm_training/Vision"
  output_file: "vision_data.jsonl"
  # Google Landmarks, Europe Landmarks (approximately 100,000 samples)

# Japanese and English audio listening data
audio_datasets:
  output_dir: "data/llm_training/Audio"
  output_file: "audio_data.jsonl"
  # LibriSpeech, VoxPopuli, ReazonSpeech, Common Voice (approximately 565,000 samples)

# Multimodal integrated data
multimodal_datasets:
  output_dir: "data/llm_training/MultiModal"
  output_file: "multimodal_data.jsonl"
  # Image + text integrated data

Training settings

Japanese model settings in config/training_config.yaml:

model:
  name: "rinna/japanese-gpt-1b"  # Japanese specialized model
  language: "ja"
  type: "causal-lm"

training:
  epochs: 10
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2e-5
  warmup_steps: 1000
  max_seq_length: 2048

gpu:
  use_gpu: true
  gpu_memory_fraction: 0.95  # 95% GPU memory used

Quick start script

Features of scripts/start_japanese_training.sh:

  1. Environment check: Python3, automatic detection of GPU/CPU
  2. Dependency installation: Automatic installation of requirements.txt
  3. Data directory creation: Automatic creation of necessary directories
  4. Data collection: Download all configured datasets
  5. Start training: Start appropriate container depending on GPU/CPU

How to run

Basic execution

# Interactive mode (recommended)
./scripts/start_japanese_training.sh

# or non-interactive mode
echo "y" | ./scripts/start_japanese_training.sh

Custom execution

# Data collection only
python scripts/collect_llm_training_data.py --config config/data_config.yaml

# Training only (if data already exists)
docker-compose -f docker-compose.train.yml up -d llm-trainer-gpu

Monitoring

Once you have started training, check your progress in the following ways:

# training status
curl http://localhost:8000/training/status

# Log confirmation
docker-compose -f docker-compose.train.yml logs -f llm-trainer-gpu

# Web UI
open http://localhost:8000/docs

Dataset details

Text dataset (3,080,000 samples)

| Dataset | Number of samples | Contents |
| --- | --- | --- |
| izumi-lab/llm-japanese-dataset | 1,000,000 | General Japanese text |
| llm-book/japanese-wikipedia | 500,000 | Wikipedia articles |
| llm-book/japanese-news | 300,000 | News articles |
| llm-book/japanese-books | 200,000 | Book text |
| llm-book/japanese-papers | 150,000 | Paper abstracts |
| llm-book/japanese-code | 100,000 | Programming code |
| llm-book/japanese-qa | 80,000 | QA data |
| and 13 others | 750,000 | Dialogue, reviews, etc. |

Image dataset (landmark)

| Dataset | Number of samples | Contents |
| --- | --- | --- |
| visheratin/google_landmarks_photos | 50,000 | Google Landmarks photo dataset |
| Qdrant/google-landmark-geo | 30,000 | Google Landmarks + geographic coordinates |
| SablikJan/europe-landmarks-classification | 20,000 | European landmarks classification |

Audio dataset (Japanese-English listening learning)

| Dataset | Number of samples | Language | Contents | ASR suitability |
| --- | --- | --- | --- | --- |
| mozilla-foundation/common_voice_11_0 (ja) | 60,000 | Japanese | General-purpose speech recognition | ⭐⭐⭐ |
| mozilla-foundation/common_voice_11_0 (en) | 120,000 | English | General-purpose speech recognition | ⭐⭐⭐ |
| librispeech_asr (clean) | 150,000 | English | High-quality read speech | ⭐⭐⭐⭐⭐ |
| facebook/voxpopuli (en) | 95,000 | English | Parliamentary audio data | ⭐⭐⭐⭐ |
| reazon-research/reazonspeech | 60,000 | Japanese | High-quality Japanese audio | ⭐⭐⭐⭐ |
| speech_commands (v0.02) | 30,000 | Multilingual | Voice commands | ⭐⭐ |

ASR suitability legend:

  • ⭐⭐⭐⭐⭐ Very high: LibriSpeech (clean, high quality)
  • ⭐⭐⭐⭐ High: VoxPopuli, ReazonSpeech (specialized data)
  • ⭐⭐⭐ Standard: Common Voice (varied pronunciations)
  • ⭐⭐ Auxiliary: Speech Commands (command recognition)

Multimodal dataset

  • llm-book/japanese-image-text: 100,000 samples
  • Content: Pair of image and Japanese caption

Performance optimization

GPU optimization

  • Memory usage: 95% (maximum utilization)
  • Mixed Precision: FP16 enabled
  • Gradient Accumulation: 8 steps
  • Batch Size: 4 (adjusted according to GPU memory)

Distributed training

With 24 node distributed architecture:

  • Parallel processing: Data parallelism + Model parallelism
  • Communication optimization: Use Zenoh protocol
  • Fault Tolerance: Automatic recovery in case of node failure

troubleshooting

Frequently asked questions

  1. Out of memory

     # Reduce batch size
     sed -i 's/batch_size: 4/batch_size: 2/' config/training_config.yaml

  2. Data download failed

     # Skip individual datasets
     python scripts/collect_llm_training_data.py --skip-failed

  3. GPU not available

     # Switch to CPU mode
     docker-compose -f docker-compose.train.yml up -d llm-trainer-cpu

Log confirmation

# All logs
docker-compose -f docker-compose.train.yml logs

# real time log
docker-compose -f docker-compose.train.yml logs -f llm-trainer-gpu

# error log only
docker-compose -f docker-compose.train.yml logs 2>&1 | grep ERROR

Advanced settings

Add custom dataset

Add new dataset to config/data_config.yaml:

datasets:
  - name: "your-custom-dataset"
    type: "text"
    samples: 100000
    custom_config:
      path: "path/to/your/data"
      format: "jsonl"
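
As a rough illustration of the jsonl format this configuration points at (a minimal sketch; the text field name and file name are assumptions, not the actual behavior of collect_llm_training_data.py):

import json

# One JSON object per line, e.g. {"text": "..."}
samples = []
with open("path/to/your/data/train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        samples.append(record["text"])

print(f"Loaded {len(samples)} custom samples")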

Hyperparameter adjustment

Edit config/training_config.yaml:

training:
  learning_rate: 5e-5  # Learning rate adjustment
  epochs: 20           # Increased number of epochs
  max_seq_length: 4096 # Sequence length extension

See the comments in each configuration file for detailed configuration options.

List of environment variables

Variable name Description Default value
CUDA_VISIBLE_DEVICES GPU device to use Automatic detection
OMP_NUM_THREADS Number of OpenMP threads Number of CPU cores
PYTORCH_CUDA_ALLOC_CONF CUDA memory settings -
NCCL_DEBUG NCCL debug level -
WANDB_API_KEY Weights & Biases API key -
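
These variables are normally exported in the shell or set in the compose files before the trainer starts. For completeness, a small Python illustration (values are examples only, and they must be set before the training process initializes CUDA):

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"            # GPUs to expose
os.environ["OMP_NUM_THREADS"] = str(os.cpu_count())   # default: number of CPU cores
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
os.environ["NCCL_DEBUG"] = "INFO"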

API endpoint

Endpoint Method Description
/training/start POST Start training
/training/stop POST Stop training
/training/status GET Get training status
/metrics GET Get metrics
/logs GET Get logs
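
A small sketch of driving these endpoints from Python, assuming the service at http://localhost:8000 shown earlier and the requests package (the exact response format depends on the service):

import requests

BASE_URL = "http://localhost:8000"

# Kick off a training run
requests.post(f"{BASE_URL}/training/start")

# Poll the training status
response = requests.get(f"{BASE_URL}/training/status")
print(response.status_code, response.text)

# Stop training when finished
requests.post(f"{BASE_URL}/training/stop")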

This guide provides comprehensive instructions on how to use EvoSpikeNet's large-scale learning system. For detailed settings and customization, please refer to the comments in each configuration file.


Advanced Distributed Training System

EvoSpikeNet now includes a comprehensive distributed training system that supports large-scale training across 100+ nodes with advanced fault tolerance, scalability testing, and resource management capabilities.

Core Components

1. DistributedTrainingCoordinator

Coordinates distributed training across multiple nodes with advanced synchronization and communication protocols.

from evospikenet.distributed_training import DistributedTrainingCoordinator

# Initialize coordinator for multi-node training
coordinator = DistributedTrainingCoordinator(
    world_size=24,  # Total number of nodes
    rank=0,  # Current node rank
    master_addr='192.168.1.100',
    master_port=12345,
    backend='nccl'  # or 'gloo' for CPU-only
)

# Setup distributed training
coordinator.setup_distributed_training(
    model=model,
    optimizer=optimizer,
    scheduler=scheduler
)

# Coordinate training loop
for epoch in range(num_epochs):
    coordinator.start_epoch(epoch)

    for batch in dataloader:
        # Synchronize gradients across nodes
        loss = coordinator.train_step(batch)

        # Adaptive batch size adjustment
        coordinator.adapt_batch_size_if_needed(loss.item())

    coordinator.end_epoch(epoch)

Key Features:

  • Multi-node coordination
  • Gradient synchronization
  • Adaptive batch sizing
  • Training state management

2. FaultToleranceManager

Provides comprehensive fault tolerance for distributed training with automatic recovery and checkpoint management.

from evospikenet.distributed_training import FaultToleranceManager

fault_manager = FaultToleranceManager(
    checkpoint_interval=100,
    max_retries=3,
    recovery_strategy='checkpoint_resume'
)

# Setup fault tolerance
fault_manager.setup_fault_tolerance(
    model=model,
    optimizer=optimizer,
    training_state=training_state
)

# Training loop with fault tolerance
try:
    for step in range(max_steps):
        # Train step
        loss = train_step(batch)

        # Periodic checkpoint
        if step % 100 == 0:
            fault_manager.save_checkpoint(step, loss.item())

        # Check for node failures
        if fault_manager.detect_node_failure():
            fault_manager.initiate_recovery()

except Exception as e:
    # Automatic recovery on failure
    recovered_state = fault_manager.recover_from_failure(e)
    resume_training_from_state(recovered_state)

Key Features:

  • Automatic failure detection
  • Checkpoint-based recovery
  • Node failure handling
  • Training state preservation

3. ScalabilityTester

Tests and validates scalability of distributed training across different cluster configurations.

from evospikenet.distributed_training import ScalabilityTester

tester = ScalabilityTester(min_nodes=1, max_nodes=128, test_duration_minutes=30)

# Run scalability tests
results = tester.run_scalability_tests(
    model=model,
    dataset=dataset,
    test_configs=[
        {'nodes': 8, 'batch_size': 32},
        {'nodes': 16, 'batch_size': 64},
        {'nodes': 32, 'batch_size': 128},
        {'nodes': 64, 'batch_size': 256}
    ]
)

# Analyze scalability results
analysis = tester.analyze_scalability(
    results=results,
    metrics=['throughput', 'efficiency', 'communication_overhead']
)

print(f"Optimal configuration: {analysis['optimal_config']}")
print(f"Scalability efficiency: {analysis['efficiency']:.2%}")

Key Features:

  • Automated scalability testing
  • Performance benchmarking
  • Bottleneck identification
  • Optimal configuration recommendations

4. ResourceManager

Manages computational resources across distributed nodes with intelligent allocation and monitoring.

from evospikenet.distributed_training import ResourceManager

resource_manager = ResourceManager(
    resource_policies={
        'cpu_allocation': 'dynamic',
        'memory_management': 'aggressive',
        'gpu_scheduling': 'fair_share'
    }
)

# Initialize resource allocation
resource_manager.initialize_resources(
    total_nodes=24,
    node_specs=[{'cpu': 32, 'memory': 128e9, 'gpu': 8e9} for _ in range(24)]
)

# Allocate resources for training job
allocation = resource_manager.allocate_resources(
    job_requirements={
        'model_size': 'large',
        'batch_size': 64,
        'expected_duration': '24h'
    }
)

# Monitor resource usage
usage_stats = resource_manager.monitor_resources()
for node_id, stats in usage_stats.items():
    print(f"Node {node_id}: CPU {stats['cpu']:.1%}, Memory {stats['memory']:.1%}, GPU {stats['gpu']:.1%}")

Key Features:

  • Dynamic resource allocation
  • Real-time monitoring
  • Load balancing
  • Resource optimization

5. TrainingStateManager

Manages training state across distributed nodes with synchronization and persistence.

from evospikenet.distributed_training import TrainingStateManager

state_manager = TrainingStateManager(
    sync_interval=10,  # seconds
    consistency_level='strong'
)

# Initialize training state
state_manager.initialize_training_state(
    initial_epoch=0,
    initial_step=0,
    model_config=model_config,
    optimizer_config=optimizer_config
)

# Synchronize state across nodes
state_manager.sync_training_state(
    current_state={
        'epoch': current_epoch,
        'step': current_step,
        'loss': current_loss,
        'metrics': current_metrics
    }
)

# Retrieve synchronized state
global_state = state_manager.get_global_training_state()
print(f"Global epoch: {global_state['epoch']}")
print(f"Global best loss: {global_state['best_loss']}")

Key Features:

  • Distributed state synchronization
  • Persistent state storage
  • Consistency guarantees
  • State recovery

6. GradientSynchronizer

Advanced gradient synchronization with communication optimization and compression.

from evospikenet.distributed_training import GradientSynchronizer

gradient_sync = GradientSynchronizer(
    world_size=24,
    compression_type='quantization',
    backend='nccl',
    overlap_computation=True
)

# Setup gradient synchronization
gradient_sync.setup_synchronization(
    model=model,
    optimizer=optimizer
)

# Training step with optimized gradient sync
for batch in dataloader:
    # Forward pass
    outputs = model(batch['inputs'])
    loss = criterion(outputs, batch['targets'])

    # Backward pass
    loss.backward()

    # Synchronize gradients with optimization
    gradient_sync.synchronize_gradients(
        compression_ratio=0.1,  # 10% of original size
        overlap_with_computation=True
    )

    # Optimizer step
    optimizer.step()

Key Features:

  • Gradient compression
  • Communication overlap
  • Bandwidth optimization
  • Synchronization efficiency

7. NodeHealthMonitor

Monitors health and performance of distributed nodes with proactive issue detection.

from evospikenet.distributed_training import NodeHealthMonitor

health_monitor = NodeHealthMonitor(
    monitoring_interval=30,  # seconds
    alert_thresholds={
        'cpu_usage': 0.95,
        'memory_usage': 0.90,
        'gpu_memory': 0.95,
        'network_latency': 1000  # ms
    }
)

# Start health monitoring
health_monitor.start_monitoring(
    node_ids=range(24),
    monitoring_metrics=['cpu', 'memory', 'gpu', 'network', 'disk']
)

# Get health status
health_status = health_monitor.get_cluster_health()
for node_id, status in health_status.items():
    if status['overall'] != 'healthy':
        print(f"Node {node_id} issues: {status['issues']}")

# Proactive issue detection
issues = health_monitor.detect_potential_issues()
for issue in issues:
    print(f"Potential issue: {issue['type']} on node {issue['node_id']}")

Key Features:

  • Real-time health monitoring
  • Proactive issue detection
  • Alert system
  • Performance tracking

8. DistributedTrainingManager

Integrated manager that coordinates all distributed training components.

from evospikenet.distributed_training import DistributedTrainingManager

# Initialize distributed training manager
training_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 24
    },
    fault_tolerance_enabled=True,
    scalability_testing_enabled=True
)

# Setup complete distributed training
training_manager.setup_distributed_training(
    model=model,
    optimizer=optimizer,
    dataset=dataset,
    training_config={
        'batch_size': 64,
        'max_epochs': 100,
        'checkpoint_interval': 500,
        'scalability_test_interval': 1000
    }
)

# Run distributed training with all features
results = training_manager.run_distributed_training()

# Get comprehensive training report
report = training_manager.generate_training_report()
print(f"Training completed in {report['total_time']}")
print(f"Final loss: {report['final_loss']}")
print(f"Scalability achieved: {report['scalability_efficiency']:.2%}")

Key Features:

  • Unified distributed training interface
  • Automatic component coordination
  • Comprehensive monitoring and reporting
  • Production-ready deployment

Integration Examples

Large-Scale Training Setup

# Configure for 100+ node training
training_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 128,
        'backend': 'nccl',
        'fault_tolerance_level': 'high'
    }
)

# Setup distributed training with advanced features
training_manager.setup_distributed_training(
    model=large_model,
    dataset=huge_dataset,
    training_config={
        'initial_batch_size': 32,
        'adaptive_batching': True,
        'gradient_compression': 'quantization',
        'checkpoint_strategy': 'incremental'
    }
)

# Monitor training progress
while training_manager.is_training_active():
    status = training_manager.get_training_status()
    print(f"Epoch {status['epoch']}, Loss: {status['loss']:.4f}")
    print(f"Nodes active: {status['active_nodes']}/{status['total_nodes']}")

    time.sleep(60)  # Check every minute

Fault-Tolerant Training

# Configure for high-reliability training
fault_tolerant_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 64,
        'fault_tolerance_level': 'high',
        'auto_recovery': True,
        'checkpoint_frequency': 'high'
    }
)

# Training with automatic fault recovery
try:
    results = fault_tolerant_manager.run_distributed_training()
except Exception as e:
    print(f"Training interrupted: {e}")
    # Manager automatically handles recovery
    recovery_status = fault_tolerant_manager.get_recovery_status()
    print(f"Recovery progress: {recovery_status['progress']:.1%}")

Configuration Options

distributed_training:
  coordinator:
    world_size: 24
    backend: nccl
    master_addr: "192.168.1.100"
    master_port: 12345
    timeout: 600

  fault_tolerance:
    enabled: true
    checkpoint_interval: 100
    max_retries: 3
    recovery_strategy: checkpoint_resume
    auto_recovery: true

  scalability_testing:
    enabled: true
    test_interval: 1000
    min_nodes: 8
    max_nodes: 128
    test_duration_minutes: 30

  resource_management:
    dynamic_allocation: true
    load_balancing: true
    memory_optimization: true
    gpu_scheduling: fair_share

  gradient_synchronization:
    compression_type: quantization
    compression_ratio: 0.1
    overlap_computation: true
    bandwidth_optimization: true

  health_monitoring:
    enabled: true
    monitoring_interval: 30
    alert_thresholds:
      cpu_usage: 0.95
      memory_usage: 0.90
      gpu_memory: 0.95
      network_latency: 1000

  state_management:
    persistence_backend: redis
    sync_interval: 10
    consistency_level: strong
    state_compression: true
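
One way this block could be consumed from Python, sketched under the assumptions that it lives in config/training_config.yaml and that DistributedTrainingManager accepts the coordinator sub-block as its cluster_config (as in the example above):

import yaml

from evospikenet.distributed_training import DistributedTrainingManager

with open("config/training_config.yaml") as f:
    dist_cfg = yaml.safe_load(f)["distributed_training"]

training_manager = DistributedTrainingManager(
    cluster_config=dist_cfg["coordinator"],
    fault_tolerance_enabled=dist_cfg["fault_tolerance"]["enabled"],
    scalability_testing_enabled=dist_cfg["scalability_testing"]["enabled"]
)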

Best Practices

  1. Cluster Setup: Ensure proper network configuration and firewall settings
  2. Resource Allocation: Monitor resource usage and adjust allocation policies
  3. Fault Tolerance: Enable comprehensive fault tolerance for production training
  4. Scalability Testing: Regularly test scalability with different configurations
  5. Monitoring: Implement comprehensive monitoring and alerting
  6. Checkpointing: Use frequent checkpointing for long-running training jobs
  7. Network Optimization: Optimize network settings for gradient synchronization

Troubleshooting

Common Issues:

  • Communication timeouts: Increase timeout values or check network connectivity
  • Memory issues: Enable gradient compression or reduce batch sizes
  • Node failures: Ensure fault tolerance is properly configured
  • Performance degradation: Run scalability tests to identify bottlenecks

Debug Mode:

training_manager.enable_debug_mode()
training_manager.log_detailed_metrics()
training_manager.enable_performance_profiling()

Performance Optimization

  • Gradient Compression: Use quantization or sparsification to reduce communication overhead (a sparsification sketch follows this list)
  • Communication Overlap: Enable computation-communication overlap for better utilization
  • Adaptive Batching: Allow dynamic batch size adjustment based on performance
  • Resource Balancing: Regularly rebalance resources across nodes
  • Network Tuning: Optimize network settings for your cluster topology
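
As a conceptual illustration of gradient sparsification (not the GradientSynchronizer implementation), the sketch below keeps only the largest 10% of gradient entries before they would be communicated, matching the compression_ratio of 0.1 used earlier; model is assumed to be a torch.nn.Module with populated gradients:

import torch

def sparsify_gradient(grad: torch.Tensor, compression_ratio: float = 0.1) -> torch.Tensor:
    """Keep only the largest-magnitude entries of a gradient; zero out the rest."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * compression_ratio))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)

# Compress each parameter's gradient before it would be all-reduced
# (production systems usually add error feedback to keep the dropped residual)
for param in model.parameters():
    if param.grad is not None:
        param.grad.copy_(sparsify_gradient(param.grad, compression_ratio=0.1))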