# Large-Scale Training Guide

> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Last updated: January 12, 2026 (color learning integration: January 12, 2026)

## Overview
EvoSpikeNet's large-scale training system provides a multimodal AI training environment built on a 24-node distributed brain architecture. This guide covers how to launch large-scale training, the data structures and storage locations involved, and more.
## Comprehensive AI learning system ⭐ UPDATED
EvoSpikeNet provides a comprehensive learning system for language comprehension, landmark recognition, Japanese-English speech listening, and multimodal integration. Data for each modality is stored and managed separately to achieve optimized model learning.
### Features
- 🗣️ Language Understanding: 3,080,000 samples of Japanese text data (rinna/japanese-gpt-1b optimization)
- 🏛️ Landmark recognition: 100,000 samples of world landmark image data
- 🎤 Japanese and English audio listening: 565,000 samples of high quality ASR data (LibriSpeech, VoxPopuli, ReazonSpeech)
- 🔗 Multimodal: Image + text integrated learning data
- 📁 Separate data storage: Efficient data management with category-based directory structure
- 🚀 Easy start: Start data collection and training for all categories with one command
- ⚡ Automatic optimization: GPU/CPU automatic detection, memory optimization, batch size adjustment
- 📈 Scalable: Massively parallel processing with 24-node distributed architecture
- 🎯 Rank Specialization: Each node (rank 0-23) generates a dedicated LLM optimized for its field of expertise
### Rank-specific training
In a 24-node distributed brain architecture, each rank takes on a different role, producing a specialized LLM:
| Rank range | Role | Specialty | Main use |
|---|---|---|---|
| 0-7 | Language understanding node | Japanese NLP | Semantic understanding, context analysis |
| 8-11 | Visual processing node | Landmark recognition | Image understanding, object detection |
| 12-15 | Speech processing node | Japanese-English ASR | Speech recognition, multilingual processing |
| 16-19 | Movement control node | Action generation | Action planning, output generation |
| 20-21 | Memory nodes | Episodic memory | Long-term memory, experience integration |
| 22-23 | Decision-making node | High-level reasoning | Strategic judgment, executive function |
```bash
# Rank-specific training examples
./scripts/train_launcher.sh rank --rank 0 --category langtext   # Language understanding specialized LLM
./scripts/train_launcher.sh rank --rank 8 --category vision     # Visual processing specialized LLM
./scripts/train_launcher.sh rank --rank 12 --category audio     # Audio processing specialized LLM
```
### Shared model training
You can train a generic model that can be shared across multiple ranks:
```bash
# Create a shared model by specifying the rank range
./scripts/train_launcher.sh shared --category langtext --rank-range 0-7   # Shared model for language understanding nodes (0-7)
./scripts/train_launcher.sh shared --category vision --rank-range 8-11    # Shared model for visual processing nodes (8-11)

# Create a general shared model (available to all ranks)
./scripts/train_launcher.sh shared --category multimodal --shared         # Multimodal general-purpose model
```
Shared models allow multiple nodes with similar functionality to use resources efficiently.
### Quick Start
```bash
# Start comprehensive AI training with one command
./scripts/start_japanese_training.sh
```
Please refer to the "Japanese Learning Settings" section below for details.
## Prerequisites
### System requirements
- CPU: Intel/AMD x64, ARM64, Apple Silicon
- GPU: NVIDIA GPU (CUDA 11.8+), AMD GPU (ROCm), Apple Silicon GPU
- Memory: Minimum 16GB, 64GB or more recommended
- Storage: At least 100GB SSD, 1TB or more recommended for large-scale learning
- OS: Linux, macOS, Windows (WSL2)
### Software requirements
- Docker: 20.10+
- Docker Compose: 2.0+
- Kubernetes: 1.24+ (for cluster deployments)
- Python: 3.10+
- CUDA: 11.8+ (when using GPU)
### Network requirements
- Internet Connection: For data download
- Internal network: For distributed node communication
- Open ports: 8000-8007 (API), 5432 (PostgreSQL), 9200 (Elasticsearch)
## Environment setup
### 1. Clone the repository

```bash
git clone https://github.com/your-org/EvoSpikeNet.git
cd EvoSpikeNet
```
### 2. Set environment variables

```bash
# Create the .env file
cp .env.example .env

# Example .env contents:
EVOSPIKENET_API_KEYS=your_api_key_here
DATABASE_URL=postgresql://user:password@localhost/evospikenet
OPENAI_API_KEY=your_openai_key
CUDA_VISIBLE_DEVICES=0,1,2,3   # when using GPUs
```
### 3. Set up the Python environment

```bash
# Create a virtual environment
python -m venv venv
source venv/bin/activate    # Linux/macOS
# venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt
pip install -e .
```
## Data preparation
### Bulk data download 🚀
#### Quick Start (download all categories at once)
```bash
# Download data for all categories at once
python scripts/collect_llm_training_data.py --config config/data_config.yaml --all

# Run in the background
nohup python scripts/collect_llm_training_data.py --config config/data_config.yaml --all > download.log 2>&1 &
```
#### Bulk download by category
```bash
# Bulk download of Japanese language data (13M+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category langtext

# Bulk download of Vision data (190K+ samples)
python scripts/download_vision_data.py --quick   # High priority only
python scripts/download_vision_data.py --all     # Full dataset

# Bulk download of audio data (565K+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category audio

# Bulk download of MultiModal data (885K+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category multimodal
```
#### Check data download status
```bash
# Verify downloaded data
python scripts/verify_training_data_sufficiency.py

# Check the amount of data per category
find data/llm_training/ -type f -name "*.jsonl" -exec wc -l {} +

# Check the number of samples in the Vision data
python -c "
from datasets import load_from_disk
import os
for dataset in ['cifar10', 'cifar100', 'fashion_mnist']:
    for split in ['train', 'test']:
        path = f'data/llm_training/Vision/{dataset}/{split}'
        if os.path.exists(path):
            ds = load_from_disk(path)
            print(f'{dataset}/{split}: {len(ds):,} samples')
"
```
#### Data download options
| Option | Description | Example |
|---|---|---|
| `--all` | Download all categories | `--all` |
| `--category <name>` | A specific category only | `--category langtext` |
| `--rebuild` | Rebuild the data | `--rebuild` |
| `--max-samples <n>` | Limit the number of samples | `--max-samples 10000` |
| `--parallel` | Parallel download | `--parallel 4` |
### Data structure
```
data/
├── llm_training/              # LLM training data (stored by category)
│   ├── LangText/              # Text data for language understanding
│   │   ├── langtext_en_data.jsonl   # English text data
│   │   └── langtext_ja_data.jsonl   # Japanese text data (13M+ samples)
│   ├── Vision/                # Image data
│   │   ├── cifar10/           # CIFAR-10 (60K)
│   │   ├── cifar100/          # CIFAR-100 (60K)
│   │   ├── fashion_mnist/     # Fashion-MNIST (70K)
│   │   └── vision_data.jsonl  # Landmark image data
│   ├── Audio/                 # Audio listening data (565K+ samples)
│   │   └── audio_data.jsonl   # ASR training data
│   └── MultiModal/            # Multimodal integrated data (885K+ samples)
│       └── multimodal_data.jsonl    # Multimodal data
├── MNIST/                     # MNIST dataset
├── audio_dataset/             # Audio dataset
├── multi_modal_dataset/       # Multimodal dataset
└── checkpoints/               # Checkpoints
```
Benefits of separate data storage:

- **LangText**: Text data for training language understanding/generation models
- **Vision**: World landmark image recognition data
- **Audio**: Audio listening (ASR) data in both Japanese and English
- **MultiModal**: Image + text integrated training data
Data for each category is stored in a separate JSONL file and used for training depending on the model type.
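As a minimal illustration of this layout, the helper below streams records from one of the per-category JSONL files. The file paths are the documented defaults from the tree above; `iter_jsonl` itself is a hypothetical helper for this sketch, not part of the EvoSpikeNet API:

```python
import json
from pathlib import Path

# Documented default locations of the per-category JSONL files
# (see the data tree above); adjust if your data root differs.
CATEGORY_FILES = {
    "langtext": "data/llm_training/LangText/langtext_ja_data.jsonl",
    "audio": "data/llm_training/Audio/audio_data.jsonl",
    "multimodal": "data/llm_training/MultiModal/multimodal_data.jsonl",
}

def iter_jsonl(path, limit=None):
    """Yield parsed records from a JSONL file (one JSON object per line)."""
    with Path(path).open(encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            if line.strip():
                yield json.loads(line)
```

For example, `next(iter_jsonl(CATEGORY_FILES["langtext"]))` returns the first training record as a `dict`, once the data has been downloaded.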
### Data collection scripts
#### LLM training data collection
```bash
# Check the configuration file
cat config/data_config.yaml

# Run data collection
python scripts/collect_llm_training_data.py --config config/data_config.yaml
```
#### Rank-specific data collection
In a 24-node distributed brain architecture, each rank collects data optimized for its area of expertise:
```bash
# Data collection for language understanding nodes (ranks 0-7)
./scripts/train_launcher.sh collect --rank 0    # Aozora Bunko, Japanese Wikipedia
./scripts/train_launcher.sh collect --rank 1    # Japanese classical literature, dialogue data

# Data collection for visual processing nodes (ranks 8-11)
./scripts/train_launcher.sh collect --rank 8    # ImageNet, COCO dataset
./scripts/train_launcher.sh collect --rank 9    # CIFAR-100, landmark recognition

# Data collection for speech processing nodes (ranks 12-15)
./scripts/train_launcher.sh collect --rank 12   # Common Voice Japanese, LibriSpeech
./scripts/train_launcher.sh collect --rank 13   # TEDlium, speech translation data

# Data collection for motor control nodes (ranks 16-19)
./scripts/train_launcher.sh collect --rank 16   # Roboturk, action generation data
./scripts/train_launcher.sh collect --rank 17   # Trajectory planning, sequence data

# Data collection for memory nodes (ranks 20-21)
./scripts/train_launcher.sh collect --rank 20   # Episodic memory data
./scripts/train_launcher.sh collect --rank 21   # Time-series data, long-term dependencies

# Data collection for decision-making nodes (ranks 22-23)
./scripts/train_launcher.sh collect --rank 22   # Strategy games, decision-making tasks
./scripts/train_launcher.sh collect --rank 23   # Reinforcement learning data, optimization problems
```
Data collection for each rank automatically selects the best data source and downloads high-quality, specialized data.
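The rank-to-source assignment above can be sketched as a lookup table. This is an illustrative reconstruction of the comments in the commands above, not the actual selection logic inside `train_launcher.sh`:

```python
# Illustrative rank → data-source mapping, reconstructed from the
# documented examples above; the real logic lives in the launcher script.
RANK_DATA_SOURCES = {
    range(0, 8):   ["Aozora Bunko", "Japanese Wikipedia", "dialogue data"],
    range(8, 12):  ["ImageNet", "COCO", "CIFAR-100", "landmark recognition"],
    range(12, 16): ["Common Voice Japanese", "LibriSpeech", "TEDlium"],
    range(16, 20): ["Roboturk", "trajectory planning", "sequence data"],
    range(20, 22): ["episodic memory data", "time-series data"],
    range(22, 24): ["strategy games", "reinforcement learning data"],
}

def sources_for_rank(rank: int) -> list:
    """Return the specialized data sources for a given node rank (0-23)."""
    for ranks, sources in RANK_DATA_SOURCES.items():
        if rank in ranks:
            return sources
    raise ValueError(f"rank must be 0-23, got {rank}")
```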
#### Collection from individual data sources
```bash
# Wikipedia data
python -c "
from evospikenet.dataloaders import WikipediaLoader
loader = WikipediaLoader(lang='en')
text = loader.load('Python (programming language)')
print(f'Downloaded {len(text)} characters')
"

# Hugging Face dataset
python -c "
from datasets import load_dataset
dataset = load_dataset('imdb', split='train[:10%]')
print(f'Loaded {len(dataset)} samples')
"
```
### Data formats
#### Text data (JSONL format)

```json
{"text": "This is a sample text for LLM training.", "source": "wikipedia", "language": "en"}
{"text": "これはLLMトレーニング用のサンプルテキストです。", "source": "aozora", "language": "ja"}
```
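Each line must be one complete JSON object. A small validator for this format, assuming the three fields shown (`text`, `source`, `language`) are required (`parse_jsonl` is a hypothetical helper, not part of the EvoSpikeNet API):

```python
import json

# Fields assumed required, based on the sample records above.
REQUIRED_KEYS = {"text", "source", "language"}

def parse_jsonl(raw: str) -> list:
    """Parse JSONL text, validating that every record has the required keys."""
    records = []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        records.append(rec)
    return records
```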
#### Image data (ImageFolder format)

```
data/
├── train/
│   ├── class1/
│   │   ├── image001.jpg
│   │   └── image002.jpg
│   └── class2/
│       ├── image003.jpg
│       └── image004.jpg
└── test/
    ├── class1/
    └── class2/
```
#### Audio data (folder per class)

```
data/audio_dataset/
├── speech_commands/
│   ├── yes/
│   ├── no/
│   ├── up/
│   └── down/
└── custom_audio/
    ├── music/
    └── speech/
```
## How to start training
### Batch learning start 🎯
#### Quick Start (Learn all categories at once)
```bash
# Start training for all 24 nodes at once
./scripts/train_all_nodes.sh

# Or launch each node individually
for rank in {0..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --rank $rank \
    --gpu &
done
```
#### Bulk training by category

```bash
# Language understanding nodes (ranks 0-7)
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --gpu &
done

# Vision nodes (ranks 8-11)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank $rank \
    --gpu &
done

# Audio nodes (ranks 12-15)
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank $rank \
    --gpu &
done

# MultiModal nodes (ranks 16-23)
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank $rank \
    --gpu &
done
```
#### Color learning bulk training

```bash
# Color learning training for all ranks (minimum level)
for rank in {0..23}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level minimum \
    --rank $rank \
    --gpu &
done

# Vision-specific color learning (standard level)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level standard \
    --rank $rank \
    --gpu &
done

# High-precision color learning (maximum level)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level maximum \
    --rank $rank \
    --gpu &
done
```
#### Bulk launch by learning level
| Level | Number of colors | Recommended rank | Execution example |
|---|---|---|---|
| minimum | 8-16 colors | All nodes | --color-level minimum |
| standard | 32-64 colors | Vision specialized | --color-level standard --rank 8-11 |
| maximum | 128-256 colors | Vision specialized | --color-level maximum --rank 8-11 |
### 1. Launch using Docker Compose
#### GPU training
```bash
# Start LLM training in a GPU environment
docker-compose -f docker-compose.train.yml up llm-trainer-gpu

# Run in the background
docker-compose -f docker-compose.train.yml up -d llm-trainer-gpu
```
#### CPU training
```bash
# Start LLM training in a CPU environment
docker-compose -f docker-compose.train.yml up llm-trainer-cpu

# Run in the background
docker-compose -f docker-compose.train.yml up -d llm-trainer-cpu
```
### 2. Distributed training using Kubernetes
```bash
# Deploy to the Kubernetes cluster
kubectl apply -f k8s/deployment.yaml

# Start the training job
kubectl apply -f k8s/training-job.yaml

# Check status
kubectl get pods -n evospikenet
kubectl logs -f deployment/evospikenet-trainer -n evospikenet
```
### 3. Direct script execution
#### Start the API server
```bash
# Start the training server in API mode
python scripts/train_llm_models.py --config config/training_config.yaml --mode api --gpu

# CPU mode
python scripts/train_llm_models.py --config config/training_config.yaml --mode api --cpu
```
#### Run training directly
```bash
# LangText model training
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category langtext \
  --gpu \
  --epochs 10 \
  --batch-size 16

# Vision model training
python examples/train_vision_encoder.py \
  --dataset mnist \
  --epochs 50 \
  --batch-size 128 \
  --gpu

# Audio model training
python examples/train_audio_encoder.py \
  --epochs 30 \
  --batch-size 32 \
  --gpu
```
### 4. Distributed training
#### Distributed training across multiple nodes
```bash
# Master node
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --mode distributed \
  --rank 0 \
  --world-size 4 \
  --master-addr localhost \
  --master-port 12345

# Worker node 1
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --mode distributed \
  --rank 1 \
  --world-size 4 \
  --master-addr master-node-ip \
  --master-port 12345
```
### 5. Rank-specific training
In a 24-node distributed brain architecture, each rank (0-23) generates a specialized LLM. Rank-specific training automatically selects the best model architecture, training parameters, and dataset.
#### Running rank-specific training
```bash
# Language understanding nodes (ranks 0-7)
./scripts/train_launcher.sh rank --rank 0 --category langtext --gpu
./scripts/train_launcher.sh rank --rank 1 --category langtext --gpu

# Visual processing nodes (ranks 8-11)
./scripts/train_launcher.sh rank --rank 8 --category vision --gpu
./scripts/train_launcher.sh rank --rank 9 --category vision --gpu

# Audio processing nodes (ranks 12-15)
./scripts/train_launcher.sh rank --rank 12 --category audio --gpu
./scripts/train_launcher.sh rank --rank 13 --category audio --gpu

# Motor control nodes (ranks 16-19)
./scripts/train_launcher.sh rank --rank 16 --category motor --gpu
./scripts/train_launcher.sh rank --rank 17 --category motor --gpu

# Memory nodes (ranks 20-21)
./scripts/train_launcher.sh rank --rank 20 --category memory --gpu
./scripts/train_launcher.sh rank --rank 21 --category memory --gpu

# Decision-making nodes (ranks 22-23)
./scripts/train_launcher.sh rank --rank 22 --category decision --gpu
./scripts/train_launcher.sh rank --rank 23 --category decision --gpu
```
#### Automatic setting of rank-specific parameters
Each rank automatically applies the following optimization parameters:
- **Language understanding nodes (0-7)**:
  - Model: `rinna/japanese-gpt-1b`
  - Optimization: specialized for Japanese NLP tasks
  - Dataset: Aozora Bunko, Japanese Wikipedia
  - Learning rate: 2e-5
- **Visual processing nodes (8-11)**:
  - Model: `google/vit-base-patch16-224`
  - Optimization: image classification, object detection
  - Dataset: ImageNet, COCO
  - Learning rate: 1e-4
- **Speech processing nodes (12-15)**:
  - Model: `openai/whisper-small`
  - Optimization: speech recognition, multilingual support
  - Dataset: Common Voice, LibriSpeech
  - Learning rate: 1e-5
- **Motor control nodes (16-19)**:
  - Model: custom Transformer
  - Optimization: sequence generation, action prediction
  - Dataset: robotics data, motion trajectories
  - Learning rate: 3e-5
- **Memory nodes (20-21)**:
  - Model: memory-augmented Transformer
  - Optimization: long-term dependencies, experience integration
  - Dataset: episodic data, time series
  - Learning rate: 1e-5
- **Decision-making nodes (22-23)**:
  - Model: high-level reasoning Transformer
  - Optimization: strategic judgment, executive function
  - Dataset: decision-making tasks, strategy data
  - Learning rate: 2e-5
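These per-rank defaults can be summarized as a lookup table. This is an illustrative sketch; the real selection happens inside `scripts/train_llm_models.py`, and the identifiers for ranks 16-23 are descriptive placeholders, since those architectures are custom:

```python
# Illustrative rank → (model, learning rate) table, transcribed from the
# parameter list above. Non-HuggingFace names are descriptive placeholders.
RANK_PARAMS = [
    (range(0, 8),   "rinna/japanese-gpt-1b",         2e-5),  # language understanding
    (range(8, 12),  "google/vit-base-patch16-224",   1e-4),  # visual processing
    (range(12, 16), "openai/whisper-small",          1e-5),  # speech processing
    (range(16, 20), "custom-transformer",            3e-5),  # motor control
    (range(20, 22), "memory-augmented-transformer",  1e-5),  # memory
    (range(22, 24), "reasoning-transformer",         2e-5),  # decision-making
]

def params_for_rank(rank: int):
    """Return the (model name, learning rate) defaults for a rank (0-23)."""
    for ranks, model, lr in RANK_PARAMS:
        if rank in ranks:
            return model, lr
    raise ValueError(f"rank must be 0-23, got {rank}")
```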
#### Rank-specific training via the API
```bash
# Start the API server
python scripts/train_llm_models.py --mode api --gpu

# Create a single-rank model
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": 0,
    "epochs": 10,
    "batch_size": 16
  }'

# Create a shared model (usable by multiple ranks)
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": "shared",
    "shared": true,
    "epochs": 10,
    "batch_size": 16
  }'

# Create a base model (for fine-tuning)
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": "base",
    "epochs": 5,
    "batch_size": 32
  }'
```
Examples of model names generated via the API:

- Single rank: `evospike-langtext-r00-v001`
- Rank range: `evospike-langtext-r00-r07-v001`
- Shared model: `evospike-langtext-shared-v001`
- Base model: `evospike-langtext-base-v001`
- General model: `evospike-langtext-general-v001`
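The naming scheme above can be expressed as a small formatting function. This is a hypothetical sketch of the convention, not the server's actual implementation:

```python
def model_name(category, rank=None, rank_end=None, version=1):
    """Build a model name following the documented convention.

    rank may be an int (single rank), an int with rank_end (rank range),
    or a string such as "shared", "base", or "general".
    """
    v = f"v{version:03d}"
    if isinstance(rank, int) and isinstance(rank_end, int):
        return f"evospike-{category}-r{rank:02d}-r{rank_end:02d}-{v}"
    if isinstance(rank, int):
        return f"evospike-{category}-r{rank:02d}-{v}"
    return f"evospike-{category}-{rank}-{v}"
```

For example, `model_name("langtext", 0, 7)` yields `evospike-langtext-r00-r07-v001`.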
## Details of each learning method 📚
### 1. Language understanding learning (LangText)
#### Target ranks
- Rank 0-7: Language understanding node (Japanese NLP specialized)
#### Dataset
- Japanese Wikipedia: 935,640 samples
- Common Crawl Japanese: 2,342,820 samples
- OSCAR Japanese: 1,399,920 samples
- Aozora Bunko, papers, dialogues, codes, web, novels, legal documents, etc.
- Total: 14,411,625 samples (OPTIMAL)
#### Data download
```bash
# Bulk download of language data
python scripts/collect_llm_training_data.py \
  --config config/data_config.yaml \
  --category langtext

# Check download status
wc -l data/llm_training/LangText/langtext_ja_data.jsonl
```
#### Training method
```bash
# Single-rank training (rank 0)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category langtext \
  --rank 0 \
  --gpu

# Batch training of all language nodes (ranks 0-7)
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --gpu &
done
```
#### Recommended settings
```yaml
# config/training_config.yaml
model:
  name: "rinna/japanese-gpt-1b"
  max_length: 2048
  tokenizer: "rinna/japanese-gpt-1b"

training:
  epochs: 10
  batch_size: 4
  learning_rate: 2e-5
  gradient_accumulation_steps: 8
  warmup_steps: 1000
  fp16: true
```
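With these settings, the effective batch size is `batch_size × gradient_accumulation_steps`, which is why a per-device batch of only 4 still trains stably:

```python
# Effective batch size under the recommended LangText settings.
batch_size = 4
gradient_accumulation_steps = 8
effective_batch = batch_size * gradient_accumulation_steps   # 4 * 8 = 32

# Full optimizer steps per epoch over the full LangText corpus
# (14,411,625 samples, per the dataset list above).
steps_per_epoch = 14_411_625 // effective_batch
print(effective_batch, steps_per_epoch)  # 32 450363
```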
#### Estimated training time
- GPU (RTX 3090): 20-30 hours
- GPU (A100): 10-15 hours
- CPU: 100-150 hours (not recommended)
### 2. Vision learning (image recognition)
#### Target ranks
- Rank 8-11: Visual processing node (image classification/object detection)
#### Dataset
- CIFAR-10: 60,000 samples (basic image classification)
- CIFAR-100: 60,000 samples (detailed image classification)
- Fashion-MNIST: 70,000 samples (fashion images)
- Google Landmarks: Landmark recognition
- Total: 195,000+ samples (OPTIMAL)
#### Data download
```bash
# Quick download (high-priority datasets: CIFAR-10/100, Fashion-MNIST)
python scripts/download_vision_data.py --quick

# Download all datasets (including Food-101, Oxford Pets, Flowers)
python scripts/download_vision_data.py --all

# Individual downloads
python scripts/download_vision_data.py --dataset cifar10
python scripts/download_vision_data.py --dataset cifar100
python scripts/download_vision_data.py --dataset fashion_mnist

# List available datasets
python scripts/download_vision_data.py --list

# Check download status
python -c "
from datasets import load_from_disk
import os
total = 0
for ds in ['cifar10', 'cifar100', 'fashion_mnist']:
    for split in ['train', 'test']:
        path = f'data/llm_training/Vision/{ds}/{split}'
        if os.path.exists(path):
            data = load_from_disk(path)
            samples = len(data)
            total += samples
            print(f'{ds}/{split}: {samples:,} samples')
print(f'\\nTotal: {total:,} samples')
"
```
#### Training method
```bash
# Single-rank training (rank 8)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category vision \
  --rank 8 \
  --gpu

# Batch training of all Vision nodes (ranks 8-11)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank $rank \
    --gpu &
done

# Dataset-specific training
python scripts/train_llm_models.py \
  --category vision \
  --rank 8 \
  --dataset cifar10 \
  --gpu
```
#### Recommended settings
```yaml
# config/training_config.yaml
model:
  name: "google/vit-base-patch16-224"
  image_size: 224
  patch_size: 16

training:
  epochs: 30
  batch_size: 32
  learning_rate: 1e-4
  optimizer: "adamw"
  weight_decay: 0.01
  fp16: true
```
#### Estimated training time
- GPU (RTX 3090): 5-8 hours
- GPU (A100): 3-5 hours
### 3. Audio learning (speech recognition)
#### Target ranks
- Rank 12-15: Audio processing node (Japanese/English ASR)
#### Dataset
- LibriSpeech: 460,000 samples
- Common Voice: 50,000 samples
- VoxPopuli: 30,000 samples
- ReazonSpeech: 25,000 samples
- Total: 575,000+ samples (OPTIMAL)
#### Data download
```bash
# Bulk download of audio data
python scripts/collect_llm_training_data.py \
  --config config/data_config.yaml \
  --category audio

# Check download status
wc -l data/llm_training/Audio/audio_data.jsonl
```
#### Training method
```bash
# Single-rank training (rank 12)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category audio \
  --rank 12 \
  --gpu

# Batch training of all audio nodes (ranks 12-15)
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank $rank \
    --gpu &
done
```
#### Recommended settings
```yaml
# config/training_config.yaml
model:
  name: "openai/whisper-small"
  sampling_rate: 16000
  language: "ja"

training:
  epochs: 20
  batch_size: 16
  learning_rate: 1e-5
  gradient_accumulation_steps: 4
  fp16: true
```
#### Estimated training time
- GPU (RTX 3090): 10-15 hours
- GPU (A100): 6-10 hours
### 4. MultiModal learning (multimodal integration)
#### Target ranks
- Ranks 16-19: Motor control nodes
- Ranks 20-21: Memory nodes
- Ranks 22-23: Decision-making nodes
#### Dataset
- COCO Captions: 414,000 samples
- Flickr30k: 145,000 samples
- Conceptual Captions: 300,000 samples
- Visual Genome: 26,000 samples
- Total: 885,000+ samples (OPTIMAL)
#### Data download
```bash
# Bulk download of MultiModal data
python scripts/collect_llm_training_data.py \
  --config config/data_config.yaml \
  --category multimodal

# Check download status
wc -l data/llm_training/MultiModal/multimodal_data.jsonl
```
#### Training method
```bash
# Single-rank training (rank 16)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category multimodal \
  --rank 16 \
  --gpu

# Batch training of all MultiModal nodes (ranks 16-23)
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank $rank \
    --gpu &
done
```
#### Recommended settings
```yaml
# config/training_config.yaml
model:
  name: "openai/clip-vit-base-patch32"
  text_encoder: "bert-base-uncased"
  vision_encoder: "vit-base-patch32"

training:
  epochs: 15
  batch_size: 24
  learning_rate: 5e-5
  warmup_steps: 500
  fp16: true
```
#### Estimated training time
- GPU (RTX 3090): 15-20 hours
- GPU (A100): 8-12 hours
### 5. Comparison of learning methods
| Category | Target rank | Data amount | Learning time (GPU) | Recommended model | Main uses |
|---|---|---|---|---|---|
| LangText | 0-7 | 14.4M | 20-30 hours | rinna/japanese-gpt-1b | Japanese understanding/generation |
| Vision | 8-11 | 195K+ | 5-8 hours | vit-base-patch16-224 | Image classification/recognition |
| Audio | 12-15 | 575K+ | 10-15 hours | whisper-small | Speech Recognition/ASR |
| MultiModal | 16-23 | 885K+ | 15-20 hours | clip-vit-base-patch32 | Image+Text Integration |
### 6. Script to train all categories at once
```bash
#!/bin/bash
# scripts/train_all_categories.sh

echo "=== Starting training for all categories ==="

# LangText (ranks 0-7)
echo "Starting LangText training..."
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --category langtext --rank $rank --gpu &
done

# Vision (ranks 8-11)
echo "Starting Vision training..."
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category vision --rank $rank --gpu &
done

# Audio (ranks 12-15)
echo "Starting Audio training..."
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --category audio --rank $rank --gpu &
done

# MultiModal (ranks 16-23)
echo "Starting MultiModal training..."
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --category multimodal --rank $rank --gpu &
done

echo "=== Training launched on all 24 nodes ==="
echo "Check progress: tail -f logs/training.log"
```
### 7. Color learning training ⭐ NEW
Color Learning in distributed brain systems is specialized training for each node to acquire the ability to understand, process, and generate color information. It offers three learning levels (minimum, standard, and maximum) and is optimized for each node type.
#### Characteristics of color learning
- 3 learning levels: Minimum (8-16 colors), Standard (32-64 colors), Maximum (128-256 colors)
- Node-specific optimization: Specialized for each node type such as PFC, Vision, Language, etc.
- Automatic data generation: Integrating synthetic data and Hugging Face datasets
- Transfer learning supported: Efficient learning from pre-trained models
- Knowledge Distillation Support: Knowledge transfer from large-scale models to small-scale models
#### Color learning level details
| Level | Number of colors | Dataset | Training time | GPU VRAM | Application |
|---|---|---|---|---|---|
| Minimum | 8-16 colors | MNIST, Basic Colors (150MB) | 1-2 hours | 2-4GB | Prototyping, basic color classification |
| Standard | 32-64 colors | CIFAR-10/100, subset ImageNet (2-5GB) | 4-8 hours | 8-12GB | Practical applications, general color recognition |
| Maximum | 128-256 colors | ImageNet, COCO (20-50GB) | 12-24 hours | 16-24GB | Professional color processing, research use |
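A hypothetical helper that picks the highest learning level whose stated minimum VRAM requirement fits the available GPU, using the lower bounds from the table above:

```python
# Lower VRAM bounds (GB) per level, from the table above; highest first.
LEVEL_VRAM_GB = [("maximum", 16), ("standard", 8), ("minimum", 2)]

def pick_color_level(vram_gb: float) -> str:
    """Return the most ambitious color-learning level that fits in VRAM."""
    for level, required in LEVEL_VRAM_GB:
        if vram_gb >= required:
            return level
    raise ValueError("at least 2 GB of GPU VRAM is required")
```

For example, a 12 GB card would select the standard level; a 24 GB card the maximum level.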
#### Running from the integrated training script (recommended)
```bash
# 1. Check the configuration information
python scripts/train_llm_models.py --config config/training_config.yaml --show-color-config

# 2. Data download + training (Vision node, rank 9)
python scripts/train_llm_models.py --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank 9 \
  --download-data \
  --gpu

# 3. Language node (rank 20) - standard level
python scripts/train_llm_models.py --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 20 \
  --gpu

# 4. Parallel training of multiple nodes
# GPU 0: Vision node
CUDA_VISIBLE_DEVICES=0 python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 9 \
  --gpu &

# GPU 1: Language node
CUDA_VISIBLE_DEVICES=1 python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 20 \
  --gpu &

wait
```
#### Automatic rank → node type mapping
The `--rank` option automatically determines the appropriate node type:
| Rank range | Node type | Importance of color learning | Recommended level |
|---|---|---|---|
| 0-7 | PFC (prefrontal cortex) | High | Standard |
| 8-11 | Vision | Highest | Standard-Maximum |
| 12-15 | Audio | Low | Minimum |
| 16-19 | Motor (movement) | Medium | Minimum-Standard |
| 20-21 | Memory | Medium | Standard |
| 22-23 | PFC (Decision Making) | High | Standard |
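This mapping can be sketched as a small function (illustrative only; the actual mapping is applied inside the training scripts):

```python
# Illustrative rank → node-type mapping, transcribed from the table above.
NODE_TYPE_BY_RANK = {
    range(0, 8):   "pfc",      # prefrontal cortex
    range(8, 12):  "vision",
    range(12, 16): "audio",
    range(16, 20): "motor",
    range(20, 22): "memory",
    range(22, 24): "pfc",      # decision-making PFC
}

def node_type(rank: int) -> str:
    """Return the node type that --rank resolves to for a rank (0-23)."""
    for ranks, ntype in NODE_TYPE_BY_RANK.items():
        if rank in ranks:
            return ntype
    raise ValueError(f"rank must be 0-23, got {rank}")
```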
#### Running the dedicated scripts (detailed control)
```bash
# 1. Download the dataset
python scripts/download_color_datasets.py \
  --level minimum \
  --modality all \
  --output-dir data/color_learning

# Show available dataset information
python scripts/download_color_datasets.py --show-info

# 2. Train the models
# PFC node (multimodal) - minimum level
python scripts/train_color_learning_models.py \
  --node-type pfc \
  --level minimum \
  --dataset-path data/color_learning/multimodal/mnist_captions \
  --epochs 5 \
  --gpu

# Vision node - standard level
python scripts/train_color_learning_models.py \
  --node-type vision \
  --level standard \
  --dataset cifar10 \
  --epochs 20 \
  --gpu

# Language node - standard level
python scripts/train_color_learning_models.py \
  --node-type language \
  --level standard \
  --dataset-path data/color_learning/language \
  --epochs 25 \
  --gpu

# Check the configuration (dry run)
python scripts/train_color_learning_models.py \
  --node-type vision-object \
  --level maximum \
  --show-config
```
#### Color learning training via the API
```bash
# Start the API server
python scripts/train_llm_models.py --config config/training_config.yaml \
  --mode api --host 0.0.0.0 --port 8000

# Submit a color learning job
curl -X POST "http://localhost:8000/train" \
  -H "Content-Type: application/json" \
  -d '{
    "category": "color_learning",
    "model_name": "evospike-color-vision-r09",
    "dataset_path": "data/color_learning/minimum/vision",
    "output_dir": "saved_models/color_vision_minimum",
    "gpu": true,
    "epochs": 10,
    "batch_size": 16,
    "learning_rate": 0.0001,
    "rank": 9
  }'
```
#### Color learning model naming convention
Generated color learning models follow this naming convention:

- Single rank: `evospike-color_learning_vision-r09-v001`
- Rank range: `evospike-color_learning_vision-r09-r11-v001`
- Shared model: `evospike-color_learning_vision-shared-v001`
#### Color learning settings by node type
Each node type has different color learning requirements:
**PFC (prefrontal cortex) nodes:**
- Minimum: 8 colors, basic color understanding
- Standard: 64 colors, practical color recognition
- Maximum: 256 colors, professional color processing

**Vision nodes:**
- Minimum: 16 colors, basic image classification
- Standard: 64 colors, detailed color recognition
- Maximum: 256 colors, professional color processing

**Language nodes:**
- Minimum: 8 colors, basic understanding of color names
- Standard: 32 colors, detailed color descriptions
- Maximum: 128 colors, nuanced color expression

**Motor nodes:**
- Minimum: 8 colors, basic color feedback
- Standard: 16 colors, visual guidance
- Maximum: 32 colors, detailed visual control

**Audio nodes:**
- Minimum: 4 colors, minimal visual integration
- Standard: 8 colors, basic multimodal support
- Maximum: 16 colors, audio-visual integration

**Memory nodes:**
- Minimum: 8 colors, basic episode recording
- Standard: 32 colors, detailed memory encoding
- Maximum: 64 colors, high-resolution memory retention
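The per-node-type color counts above, collected into one lookup table (an illustrative sketch; the actual values come from the training configuration):

```python
# Color counts per (node type, level), transcribed from the list above.
COLOR_COUNTS = {
    "pfc":      {"minimum": 8,  "standard": 64, "maximum": 256},
    "vision":   {"minimum": 16, "standard": 64, "maximum": 256},
    "language": {"minimum": 8,  "standard": 32, "maximum": 128},
    "motor":    {"minimum": 8,  "standard": 16, "maximum": 32},
    "audio":    {"minimum": 4,  "standard": 8,  "maximum": 16},
    "memory":   {"minimum": 8,  "standard": 32, "maximum": 64},
}

def colors_for(node_type: str, level: str) -> int:
    """Return the number of colors a node type trains on at a given level."""
    return COLOR_COUNTS[node_type][level]
```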
#### Progressive training
Efficient learning is possible by increasing the learning level step by step:
```bash
#!/bin/bash
# progressive_color_training.sh

RANK=9   # Vision node

# Stage 1: minimum (basic learning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank $RANK \
  --download-data \
  --gpu

# Stage 2: standard (transfer learning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank $RANK \
  --gpu

# Stage 3: maximum (final tuning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level maximum \
  --rank $RANK \
  --gpu

echo "✅ Progressive training completed for rank $RANK"
```
Efficiency through knowledge distillation
Knowledge transfer from large-scale models to small-scale models:
```bash
# 1. Train the teacher model (maximum level)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level maximum \
  --rank 9 \
  --gpu

# 2. Knowledge distillation to the student model (standard level)
# TODO: knowledge distillation script is planned but not yet implemented
```
Color learning assessment
```bash
# Evaluate model color recognition accuracy
python scripts/evaluate_color_learning.py \
  --model-path saved_models/evospike-color_learning_vision-r09-v001 \
  --test-dataset data/color_learning/standard/vision/test \
  --metrics accuracy,f1,confusion_matrix

# Visualize the results
python scripts/visualize_color_results.py \
  --results results/color_learning_evaluation.json \
  --output visualizations/color_learning
```
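For reference, the accuracy and (macro-averaged) F1 metrics requested above can be computed as follows. This is a generic sketch of the metric definitions, not the internals of `evaluate_color_learning.py`:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```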
Dataset information
Main datasets used in color learning:
**Minimum level:**
- MNIST (60,000 images, grayscale)
- Basic Colors (10,000 synthetic images, 8 colors)
- Total: ~150MB

**Standard level:**
- CIFAR-10 (60,000 images, 10 classes)
- CIFAR-100 subset (20,000 images, 64 colors)
- Color Text (50,000 texts, color descriptions)
- Total: ~2-5GB

**Maximum level:**
- ImageNet subset (100,000 images)
- COCO (118,287 images)
- Flickr30k (31,000 images)
- Wikipedia Color Corpus (500,000 texts)
- Total: ~20-50GB
Troubleshooting
**Out of memory error:**
```bash
# Reduce the batch size
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 9 \
  --gpu \
  --batch-size 4  # reduced from the default of 16
```
**Data download error:**
```bash
# Download the data separately
python scripts/download_color_datasets.py \
  --level minimum \
  --modality vision \
  --output-dir data/color_learning

# Then run training (without --download-data)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank 9 \
  --gpu
```
Data structure and storage location
LLM model naming convention
EvoSpikeNet applies a uniform naming convention to generated LLM models, so the model type, category, rank, and version can be identified at a glance. The convention covers not only models dedicated to a single rank but also models shared across multiple ranks.
Naming convention structure
**Basic format:**
```
{Type}-{Category}-{RankSpec}-v{Version:03d}
```
**Description of each element:**
- **Type**: Model architecture type
- `evospike`: EvoSpikeNet unique architecture
- `brain`: Brain architecture model
- `node`: Node-specific model
- **Category**: Specialty category (based on NODE_TYPE_TO_CATEGORIES)
- `text_generation`: Text generation (executive node)
- `image_classification`: Image classification (vision node)
- `object_detection`: Object detection (vision node)
- `speech_recognition`: Speech recognition (auditory node)
- `motion_control`: Motion control (motor node)
- `decision_making`: Decision making (executive node)
- `planning`: Planning (executive node)
- `reasoning`: Reasoning (executive node)
- `rag`: RAG (search extension generation) (executive node)
- `multimodal`: Multimodal (general node)
- `embedding`: Embedding (general node)
- **RankSpec**: Rank specification (multiple patterns supported)
- Single rank: `r{Rank:02d}` (e.g. `r00`, `r08`)
- Rank range: `r{Start:02d}-r{End:02d}` (Example: `r00-r07`, `r08-r11`)
- Shared model: `shared` (can be shared among all ranks)
- General model: `general` (rank independent)
- Base model: `base` (basis for other models)
- **Version**: Version number (001, 002, ...)
#### Naming Convention Examples
**Single-rank model:**
```bash
evospike-text_generation-r00-v001       # Rank 0 dedicated text generation model
evospike-image_classification-r08-v001  # Rank 8 dedicated image classification model
evospike-speech_recognition-r12-v001    # Rank 12 dedicated speech recognition model
evospike-decision_making-r22-v001       # Rank 22 dedicated decision-making model
```

**Multi-rank shared model:**
```bash
evospike-text_generation-r00-r07-v001       # Text generation model shared by ranks 0-7
evospike-image_classification-r08-r11-v001  # Image classification model shared by ranks 8-11
evospike-speech_recognition-r12-r15-v001    # Speech recognition model shared by ranks 12-15
evospike-motion_control-r16-r19-v001        # Motion control model shared by ranks 16-19
```
**Shared/General/Base model:**
```bash
evospike-text_generation-shared-v001       # Text generation model shared across all ranks
evospike-image_classification-shared-v001  # Image classification model shared across all ranks
evospike-text_generation-general-v001      # Rank-independent general text generation model
evospike-text_generation-base-v001         # Text generation base model (for fine-tuning)
evospike-multimodal-base-v001              # Multimodal base model
```
Model sharing use cases
- Shared model: When using the same model on multiple nodes with the same functionality
- Base model: Basic model when creating a dedicated model for each rank
- General model: General model without rank-specific optimization
- Scope model: A model that can be shared by nodes in the same field of expertise
Advantages of naming conventions
- Identification: Type/Category/RankSpec/Version can be seen at a glance
- Flexibility: Supports single/multiple/shared models
- Extensibility: Easy to add new sharing patterns
- Consistency: Uniform naming across all training methods
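Because the `{Type}-{Category}-{RankSpec}-v{Version:03d}` format is fully mechanical, model names can be generated by a small helper. The sketch below is illustrative, not a function shipped with EvoSpikeNet:

```python
def rank_spec(start=None, end=None, kind="rank"):
    """Build the RankSpec component: a single rank, a rank range, or a keyword."""
    if kind in ("shared", "general", "base"):
        return kind
    if end is None:
        return f"r{start:02d}"
    return f"r{start:02d}-r{end:02d}"

def model_name(model_type, category, spec, version):
    """Assemble {Type}-{Category}-{RankSpec}-v{Version:03d}."""
    return f"{model_type}-{category}-{spec}-v{version:03d}"

# model_name("evospike", "text_generation", rank_spec(0), 1)
#   -> "evospike-text_generation-r00-v001"
# model_name("evospike", "image_classification", rank_spec(8, 11), 1)
#   -> "evospike-image_classification-r08-r11-v001"
```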
Naming Convention Examples
```bash
# Initial model of the language understanding node (rank 0)
evospike-langtext-r00-v001
# Improved model of the visual processing node (rank 8)
evospike-vision-r08-v002
# Specialized model of the speech processing node (rank 12)
evospike-audio-r12-v001
# Trained model of the motion control node (rank 16)
evospike-motor-r16-v003
# Optimized model of the memory node (rank 20)
evospike-memory-r20-v002
# High-performance model of the decision node (rank 22)
evospike-decision-r22-v001
```
Advantages of naming conventions
- Identification: Type, Category, Rank, Version can be seen at a glance
- Sortability: Easy to sort by rank or version
- Extensibility: Easy to add new categories and types
- Automation: Can be automatically generated using a script
Model storage structure
```
saved_models/
├── evospike-langtext-r00-v001/   # Language understanding node (rank 0) model
│   ├── config.json               # Model settings
│   ├── pytorch_model.bin         # Model weights
│   ├── tokenizer.json            # Tokenizer settings
│   ├── vocab.json                # Vocabulary file
│   ├── merges.txt                # BPE merges file
│   └── training_args.bin         # Training arguments
├── evospike-vision-r08-v001/     # Visual processing node (rank 8) model
│   ├── model.pth
│   ├── optimizer.pth
│   └── logs/
├── evospike-audio-r12-v001/      # Audio processing node (rank 12) model
│   ├── model.pt
│   ├── feature_extractor.json
│   └── logs/
├── evospike-decision-r22-v001/   # Decision node (rank 22) model
│   ├── model.bin
│   ├── processor_config.json
│   └── logs/
└── checkpoints/                  # Training checkpoints
    ├── evospike-langtext-r00-v001-checkpoint-500/
    ├── evospike-langtext-r00-v001-checkpoint-1000/
    └── ...
```
Log storage structure
```
logs/
├── training.log        # Main training log
├── tensorboard/        # TensorBoard logs
│   ├── events.out.tfevents.1234567890.hostname
│   └── ...
├── wandb/              # Weights & Biases logs
│   ├── run-20231231_123456-abc123/
│   └── ...
└── metrics.json        # Metrics JSON
```
Configuration file structure
```
config/
├── training_config.yaml         # Training settings
├── data_config.yaml             # Data collection settings
├── settings.yaml                # Application settings
├── settings.production.yaml     # Production environment settings
├── settings.staging.yaml        # Staging environment settings
├── settings.development.yaml    # Development environment settings
├── settings.schema.json         # Configuration schema
├── node_allocation.yaml         # Node allocation settings
└── progress_settings.yaml       # Progress settings
```
Database structure
PostgreSQL schema
```sql
-- Training jobs table
CREATE TABLE training_jobs (
    id SERIAL PRIMARY KEY,
    job_id VARCHAR(255) UNIQUE NOT NULL,
    model_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL,
    config JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Model metrics table
CREATE TABLE model_metrics (
    id SERIAL PRIMARY KEY,
    job_id VARCHAR(255) REFERENCES training_jobs(job_id),
    epoch INTEGER,
    step INTEGER,
    loss FLOAT,
    accuracy FLOAT,
    perplexity FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Distributed nodes table
CREATE TABLE distributed_nodes (
    id SERIAL PRIMARY KEY,
    node_id VARCHAR(255) UNIQUE NOT NULL,
    ip_address INET,
    gpu_count INTEGER,
    memory_gb INTEGER,
    status VARCHAR(50),
    last_heartbeat TIMESTAMP
);
```
Elasticsearch index
```json
{
  "mappings": {
    "properties": {
      "job_id": {"type": "keyword"},
      "timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "message": {"type": "text"},
      "metrics": {"type": "object"},
      "node_id": {"type": "keyword"}
    }
  }
}
```
Monitoring and management
Monitoring your training progress
Monitoring via API
```bash
# Check training status
curl http://localhost:8000/training/status

# Get metrics
curl http://localhost:8000/metrics

# Get logs
curl "http://localhost:8000/logs?lines=100"
```
Visualization with TensorBoard
```bash
# Start TensorBoard
tensorboard --logdir logs/tensorboard --port 6006

# Open in a browser
open http://localhost:6006
```
Tracking with Weights & Biases
```bash
# Log in to the W&B dashboard
wandb login
# Tracking is automatic during training
```
Resource monitoring
GPU usage monitoring
```bash
# NVIDIA GPU
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.free --format=csv

# AMD GPU
rocm-smi --showuse
```
System resource monitoring
```bash
# CPU/memory usage
top -p $(pgrep -f train_llm_models)

# Disk usage
df -h /path/to/data /path/to/models

# Network usage
iftop -i eth0
```
Training management
Pause/Resume Training
```bash
# Resume from a checkpoint
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --resume-from-checkpoint saved_models/checkpoints/checkpoint-1000 \
  --gpu
```
Stop training
```bash
# Graceful stop (saves a checkpoint)
curl -X POST http://localhost:8000/training/stop

# Forced stop
pkill -f train_llm_models
```
7. Incremental and federated learning 🆕
Incremental Learning
Incremental learning continues training an existing model on new data, leveraging previous training results to improve the model efficiently.
How to use
```bash
# Resume from a checkpoint
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category langtext \
  --rank 0 \
  --resume-from saved_models/evospike-langtext-r00-v001/checkpoint-1000 \
  --gpu

# Incremental learning on top of an existing model
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category langtext \
  --rank 0 \
  --incremental \
  --gpu
```
Features
- ✅ Checkpoint resume: resume interrupted training with `--resume-from`
- ✅ Knowledge retention: learn from new data while preserving what was already learned
- ✅ Efficient: completes in less time than training from scratch
- ✅ Mitigates catastrophic forgetting: gradual learning minimizes knowledge loss
Use case
- Add new data: Add new data set to existing model for learning
- Continuous Improvement: Continuous improvement of the model through regular data updates
- Domain adaptation: Adapting a general model to a specific domain
- Version control: History management of incremental model improvements
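One common way to mitigate catastrophic forgetting during incremental learning is rehearsal: mixing a small sample of previously seen data into each new training batch. The sketch below illustrates the idea only; it is not how `--incremental` is implemented in EvoSpikeNet:

```python
import random

class RehearsalBuffer:
    """Keep a bounded random sample of past examples (reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example  # replace with decreasing probability

    def mixed_batch(self, new_batch, old_fraction=0.25):
        """Replace a fraction of the new batch with replayed old examples."""
        k = min(int(len(new_batch) * old_fraction), len(self.buffer))
        return new_batch[:len(new_batch) - k] + self.rng.sample(self.buffer, k)
```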
Example: Additional learning of LangText model
# Step 1: Initial training
python scripts/train_llm_models.py \
--config config/training_config.yaml \
--category langtext \
--rank 0 \
--gpu
# Step 2: Additional learning with new data
python scripts/train_llm_models.py \
--config config/training_config.yaml \
--category langtext \
--rank 0 \
--incremental \
--gpu
# Step 3: Further learning with specific domain data
python scripts/train_llm_models.py \
--config config/training_config.yaml \
--category langtext \
--rank 0 \
--resume-from saved_models/evospike-langtext-r00-v001 \
--gpu
Federated Learning
Federated learning trains on multiple nodes and aggregates their model parameters into a single integrated model, learning from distributed data while preserving privacy.
How to use
```bash
# Federated learning mode (FedAvg)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category langtext \
  --rank 0 \
  --federated \
  --aggregation-method fedavg \
  --federated-rounds 10 \
  --gpu

# Federated learning (FedProx)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category vision \
  --rank 8 \
  --federated \
  --aggregation-method fedprox \
  --federated-rounds 20 \
  --gpu

# Federated learning (FedOpt)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category audio \
  --rank 12 \
  --federated \
  --aggregation-method fedopt \
  --federated-rounds 15 \
  --gpu
```
Aggregation Methods
| Method | Description | Application Situation |
|---|---|---|
| FedAvg | Simple averaging | General federated learning |
| FedProx | Learning with regularization | When there is data imbalance |
| FedOpt | Adaptive Optimizer | When fast convergence is required |
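For reference, FedAvg aggregation is a sample-weighted average of each client's parameters. A minimal framework-free sketch (parameter vectors as plain lists; real implementations operate on model state dicts):

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors (FedAvg aggregation)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    avg = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        w = n / total  # weight clients by their local sample count
        for i, p in enumerate(params):
            avg[i] += w * p
    return avg

# fedavg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3]) -> [2.5, 3.5]
```

FedProx and FedOpt modify the client objective and the server-side update respectively, but use the same weighted-aggregation backbone.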
Features
- 🔒 Privacy Protection: Learn while keeping your data local
- 🌐 Distributed learning: Parallel learning on multiple nodes
- 🔄 Model aggregation: Integrate models of each node
- 📊 Supports non-IID data: Learning is possible even in environments with different data distributions
- ⚡ Improved communication efficiency: Send only model parameters
Federated learning workflow
```bash
# Launch federated learning on multiple ranks
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --federated \
    --aggregation-method fedavg \
    --federated-rounds 10 \
    --gpu &
done
echo "Started federated learning on all 8 nodes"
```
Use case
- Privacy-preserving AI: Learning with confidential data such as medical and financial data
- Edge device learning: Distributed learning from IoT devices
- Cross-silo learning: Collaborative learning of models across multiple organizations
- Distributed brain system: Building collaborative intelligence with a 24-node distributed brain
Parameter description
| Parameter | Default | Description |
|---|---|---|
| `--federated` | False | Enable federated learning mode |
| `--aggregation-method` | fedavg | Aggregation method: fedavg/fedprox/fedopt |
| `--federated-rounds` | 10 | Number of federated learning rounds |
| `--resume-from` | None | Checkpoint path to resume from |
| `--incremental` | False | Enable incremental learning mode |
Combining additive and federated learning
```bash
# Improve an existing model with federated learning
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category langtext \
  --rank 0 \
  --resume-from saved_models/evospike-langtext-r00-v001 \
  --federated \
  --aggregation-method fedprox \
  --federated-rounds 10 \
  --gpu
```
Troubleshooting
Common issues
Out of memory
```bash
# Reduce the batch size
--batch-size 4
# Use gradient accumulation
--gradient-accumulation-steps 4
# Cap CPU memory usage
--cpu-memory-fraction 0.8
```
Out of GPU memory
```bash
# Tune the CUDA caching allocator
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Use mixed precision
--fp16
# Use a smaller model
--model-size small
```
Data loading error
```bash
# Verify data integrity
python scripts/verify_data_integrity.py --data-dir data/
# Rebuild the data
python scripts/collect_llm_training_data.py --config config/data_config.yaml --rebuild
```
Distributed training issues
Inter-node communication error
```bash
# Check firewall settings
sudo ufw allow 12345/tcp
# NCCL debug mode
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
```
Node synchronization error
```bash
# Check time synchronization
chronyc tracking
# Restart NTP synchronization
sudo systemctl restart chrony
```
Performance issues
Training speed is slow
```bash
# DataLoader optimization
--num-workers 4
--pin-memory
--persistent-workers
# Model optimization
--torch-compile
--flash-attention
```
Does not converge
```bash
# Adjust the learning rate
--learning-rate 1e-5
# Add warm-up
--warmup-steps 1000
# Change the scheduler
--lr-scheduler cosine
```
Performance optimization
GPU optimization
Mixed precision training
```yaml
# training_config.yaml
training:
  fp16: true
  bf16: false   # prefer bf16 on Ampere or newer GPUs (and set fp16: false)
  gradient_checkpointing: true
```
Distributed data parallelism
```bash
# Use torchrun
torchrun --nproc_per_node=4 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=master_node \
  scripts/train_llm_models.py --config config/training_config.yaml
```
Data optimization
DataLoader optimization
```python
from torch.utils.data import DataLoader

# Fast data-loading settings
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=2,
)
```
Data preprocessing
```bash
# Pre-tokenize the dataset
python scripts/pretokenize_dataset.py \
  --input data/llm_training/raw \
  --output data/llm_training/tokenized \
  --tokenizer microsoft/DialoGPT-medium
```
Memory optimization
Gradient accumulation
```yaml
training:
  batch_size: 2
  gradient_accumulation_steps: 8  # effective batch size = 16
```
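The effective batch size is the per-step batch size times the accumulation steps (and, under data parallelism, times the number of workers). A small helper to sanity-check configs (illustrative, not part of the training scripts):

```python
def effective_batch_size(batch_size, grad_accum_steps, world_size=1):
    """Number of samples contributing to each optimizer step."""
    return batch_size * grad_accum_steps * world_size

# 2 x 8 on a single node -> 16, matching the comment in the config above
# 2 x 8 across 24 data-parallel nodes -> 384
```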
Memory efficient configuration
```bash
# PyTorch memory optimization
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

# CPU memory optimization
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
```
Storage optimization
Model compression
```bash
# Quantization
python scripts/quantize_model.py \
  --model saved_models/lang_evospike_lm_v1 \
  --quantization 8bit

# Distillation
python scripts/distill_model.py \
  --teacher saved_models/large_model \
  --student saved_models/small_model
```
Checkpoint management
```yaml
training:
  save_steps: 1000
  save_total_limit: 3   # keep only the latest 3 checkpoints
  save_strategy: steps
```
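`save_total_limit`-style pruning simply deletes the oldest checkpoints once the limit is exceeded. A sketch of the selection logic, assuming directory names end in the step number (as in `checkpoint-1000`):

```python
def checkpoints_to_delete(names, keep=3):
    """Given checkpoint directory names, return the ones to delete, oldest first."""
    by_step = sorted(names, key=lambda n: int(n.rsplit("-", 1)[1]))
    return by_step[:-keep] if len(by_step) > keep else []

# checkpoints_to_delete(["checkpoint-500", "checkpoint-1500", "checkpoint-1000",
#                        "checkpoint-2000"], keep=3) -> ["checkpoint-500"]
```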
Appendix
Configuration file example
training_config.yaml
```yaml
data_dir: "data/llm_training"
output_dir: "saved_models"

langtext:
  model_name: "microsoft/DialoGPT-medium"
  max_length: 512
  lora_config:
    r: 16
    lora_alpha: 32
    target_modules: ["q_proj", "v_proj"]
    lora_dropout: 0.05

training:
  epochs: 10
  batch_size: 8
  learning_rate: 2e-5
  save_steps: 500
  logging_steps: 100
  fp16: true

gpu:
  use_gpu: true
  gpu_memory_fraction: 0.9
```
Japanese learning settings
Overview
Large-scale Japanese learning consists of the following components:
- Data configuration (`config/data_config.yaml`)
- Training configuration (`config/training_config.yaml`)
- Quick start script (`scripts/start_japanese_training.sh`)
Data settings
Large-scale learning configuration in config/data_config.yaml:
```yaml
# Text data for language understanding
langtext_datasets_ja:
  output_dir: "data/llm_training/LangText"
  output_file: "langtext_ja_data.jsonl"
  # 21 Japanese datasets (approximately 3,080,000 samples)

# Landmark image data
vision_datasets:
  output_dir: "data/llm_training/Vision"
  output_file: "vision_data.jsonl"
  # Google Landmarks, Europe Landmarks (approximately 100,000 samples)

# Japanese and English audio listening data
audio_datasets:
  output_dir: "data/llm_training/Audio"
  output_file: "audio_data.jsonl"
  # LibriSpeech, VoxPopuli, ReazonSpeech, Common Voice (approximately 565,000 samples)

# Multimodal integrated data
multimodal_datasets:
  output_dir: "data/llm_training/MultiModal"
  output_file: "multimodal_data.jsonl"
  # Image + text integrated data
```
Training settings
Japanese model settings in config/training_config.yaml:
```yaml
model:
  name: "rinna/japanese-gpt-1b"  # Japanese-specialized model
  language: "ja"
  type: "causal-lm"

training:
  epochs: 10
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2e-5
  warmup_steps: 1000
  max_seq_length: 2048

gpu:
  use_gpu: true
  gpu_memory_fraction: 0.95  # use 95% of GPU memory
```
Quick start script
Features of scripts/start_japanese_training.sh:
- Environment check: Python3, automatic detection of GPU/CPU
- Dependency installation: Automatic installation of requirements.txt
- Data directory creation: Automatic creation of necessary directories
- Data collection: Download all configured datasets
- Start training: Start appropriate container depending on GPU/CPU
How to run
Basic execution
```bash
# Interactive mode (recommended)
./scripts/start_japanese_training.sh

# Or non-interactive mode
echo "y" | ./scripts/start_japanese_training.sh
```
Custom execution
```bash
# Data collection only
python scripts/collect_llm_training_data.py --config config/data_config.yaml

# Training only (if the data already exists)
docker-compose -f docker-compose.train.yml up -d llm-trainer-gpu
```
Monitoring
Once training has started, check progress in the following ways:
```bash
# Training status
curl http://localhost:8000/training/status

# Check logs
docker-compose -f docker-compose.train.yml logs -f llm-trainer-gpu

# Web UI
open http://localhost:8000/docs
```
Dataset details
Text dataset (3,080,000 samples)
| Dataset | Number of samples | Contents |
|---|---|---|
| izumi-lab/llm-japanese-dataset | 1,000,000 | General Japanese text |
| llm-book/japanese-wikipedia | 500,000 | Wikipedia article |
| llm-book/japanese-news | 300,000 | News article |
| llm-book/japanese-books | 200,000 | Book text |
| llm-book/japanese-papers | 150,000 | Paper Abstract |
| llm-book/japanese-code | 100,000 | Programming code |
| llm-book/japanese-qa | 80,000 | QA data |
| and 13 others | 750,000 | Dialogue, reviews, etc. |
Image dataset (landmark)
| Dataset | Number of samples | Contents |
|---|---|---|
| visheratin/google_landmarks_photos | 50,000 | Google Landmarks photo dataset |
| Qdrant/google-landmark-geo | 30,000 | Google Landmarks + Geographic Coordinates |
| SablikJan/europe-landmarks-classification | 20,000 | European Landmarks Classification |
Audio dataset (Japanese-English listening learning)
| Dataset | Sample size | Language | Content | ASR suitability |
|---|---|---|---|---|
| mozilla-foundation/common_voice_11_0 (ja) | 60,000 | Japanese | General purpose speech recognition | ⭐⭐⭐ |
| mozilla-foundation/common_voice_11_0 (en) | 120,000 | English | General purpose speech recognition | ⭐⭐⭐ |
| librispeech_asr (clean) | 150,000 | English | High quality speech reading | ⭐⭐⭐⭐⭐ |
| facebook/voxpopuli (en) | 95,000 | English | Parliament audio data | ⭐⭐⭐⭐ |
| reazon-research/reazonspeech | 60,000 | Japanese | High quality Japanese audio | ⭐⭐⭐⭐ |
| speech_commands (v0.02) | 30,000 | Multilingual | Voice commands | ⭐⭐ |
**ASR suitability:**
- ⭐⭐⭐⭐⭐ Very high: LibriSpeech (clean, high quality)
- ⭐⭐⭐⭐ High: VoxPopuli, ReazonSpeech (specialized data)
- ⭐⭐⭐ Standard: Common Voice (diverse pronunciations)
- ⭐⭐ Auxiliary: Speech Commands (command recognition)
Multimodal dataset
- llm-book/japanese-image-text: 100,000 samples
- Content: Pair of image and Japanese caption
Performance optimization
GPU optimization
- Memory usage: 95% (maximum utilization)
- Mixed Precision: FP16 enabled
- Gradient Accumulation: 8 steps
- Batch Size: 4 (adjusted according to GPU memory)
Distributed training
With 24 node distributed architecture:
- Parallel processing: Data parallelism + Model parallelism
- Communication optimization: Use Zenoh protocol
- Fault Tolerance: Automatic recovery in case of node failure
Troubleshooting
Frequently asked questions
1. **Out of memory**
   ```bash
   # Reduce the batch size
   sed -i 's/batch_size: 4/batch_size: 2/' config/training_config.yaml
   ```
2. **Data download failed**
   ```bash
   # Skip failed datasets
   python scripts/collect_llm_training_data.py --skip-failed
   ```
3. **GPU not available**
   ```bash
   # Switch to CPU mode
   docker-compose -f docker-compose.train.yml up -d llm-trainer-cpu
   ```
Log confirmation
```bash
# All logs
docker-compose -f docker-compose.train.yml logs

# Real-time logs
docker-compose -f docker-compose.train.yml logs -f llm-trainer-gpu

# Error logs only
docker-compose -f docker-compose.train.yml logs 2>&1 | grep ERROR
```
Advanced settings
Add custom dataset
Add new dataset to config/data_config.yaml:
```yaml
datasets:
  - name: "your-custom-dataset"
    type: "text"
    samples: 100000
    custom_config:
      path: "path/to/your/data"
      format: "jsonl"
```
Hyperparameter adjustment
Edit config/training_config.yaml:
```yaml
training:
  learning_rate: 5e-5   # adjusted learning rate
  epochs: 20            # increased number of epochs
  max_seq_length: 4096  # extended sequence length
```
See the comments in each configuration file for detailed configuration options.
List of environment variables
| Variable name | Description | Default value |
|---|---|---|
| `CUDA_VISIBLE_DEVICES` | GPU devices to use | Automatic detection |
| `OMP_NUM_THREADS` | Number of OpenMP threads | Number of CPU cores |
| `PYTORCH_CUDA_ALLOC_CONF` | CUDA memory settings | - |
| `NCCL_DEBUG` | NCCL debug level | - |
| `WANDB_API_KEY` | Weights & Biases API key | - |
API endpoint
| Endpoint | Method | Description |
|---|---|---|
| `/training/start` | POST | Start training |
| `/training/stop` | POST | Stop training |
| `/training/status` | GET | Get training status |
| `/metrics` | GET | Get metrics |
| `/logs` | GET | Get logs |
This guide provides comprehensive instructions on how to use EvoSpikeNet's large-scale learning system. For detailed settings and customization, please refer to the comments in each configuration file.
Advanced Distributed Training System
EvoSpikeNet now includes a comprehensive distributed training system that supports large-scale training across 100+ nodes with advanced fault tolerance, scalability testing, and resource management capabilities.
Core Components
1. DistributedTrainingCoordinator
Coordinates distributed training across multiple nodes with advanced synchronization and communication protocols.
```python
from evospikenet.distributed_training import DistributedTrainingCoordinator

# Initialize coordinator for multi-node training
coordinator = DistributedTrainingCoordinator(
    world_size=24,                # total number of nodes
    rank=0,                       # current node rank
    master_addr='192.168.1.100',
    master_port=12345,
    backend='nccl'                # or 'gloo' for CPU-only
)

# Setup distributed training
coordinator.setup_distributed_training(
    model=model,
    optimizer=optimizer,
    scheduler=scheduler
)

# Coordinate training loop
for epoch in range(num_epochs):
    coordinator.start_epoch(epoch)
    for batch in dataloader:
        # Synchronize gradients across nodes
        loss = coordinator.train_step(batch)
        # Adaptive batch size adjustment
        coordinator.adapt_batch_size_if_needed(loss.item())
    coordinator.end_epoch(epoch)
```
**Key Features:**
- Multi-node coordination
- Gradient synchronization
- Adaptive batch sizing
- Training state management
2. FaultToleranceManager
Provides comprehensive fault tolerance for distributed training with automatic recovery and checkpoint management.
Note: `evospikenet.distributed_training` depends on the `GPUtil` package (`pip install GPUtil`).

```python
from evospikenet.distributed_training import FaultToleranceManager

fault_manager = FaultToleranceManager(
    checkpoint_interval=100,
    max_retries=3,
    recovery_strategy='checkpoint_resume'
)

# Setup fault tolerance
fault_manager.setup_fault_tolerance(
    model=model,
    optimizer=optimizer,
    training_state=training_state
)

# Training loop with fault tolerance
try:
    for step in range(max_steps):
        # Train step
        loss = train_step(batch)
        # Periodic checkpoint
        if step % 100 == 0:
            fault_manager.save_checkpoint(step, loss.item())
        # Check for node failures
        if fault_manager.detect_node_failure():
            fault_manager.initiate_recovery()
except Exception as e:
    # Automatic recovery on failure
    recovered_state = fault_manager.recover_from_failure(e)
    resume_training_from_state(recovered_state)
```
**Key Features:**
- Automatic failure detection
- Checkpoint-based recovery
- Node failure handling
- Training state preservation
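The checkpoint-resume strategy boils down to: save state periodically, and on failure roll back to the last good state and retry a bounded number of times. A framework-free sketch of that control flow (illustrative only, not EvoSpikeNet's implementation):

```python
def run_with_recovery(train_step, steps, checkpoint_interval=100, max_retries=3):
    """Run train_step(step) for each step, resuming from the last checkpoint on failure."""
    last_checkpoint = 0
    retries = 0
    step = 0
    while step < steps:
        try:
            train_step(step)
            if (step + 1) % checkpoint_interval == 0:
                last_checkpoint = step + 1  # stand-in for saving real model state
            step += 1
            retries = 0  # reset the retry budget after progress
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise  # give up after repeated failures
            step = last_checkpoint  # roll back to the last checkpoint
    return step
```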
3. ScalabilityTester
Tests and validates scalability of distributed training across different cluster configurations.
```python
from evospikenet.distributed_training import ScalabilityTester

# (constructor arguments were partially lost in the original; min_nodes is assumed)
tester = ScalabilityTester(
    min_nodes=1,
    max_nodes=128,
    test_duration_minutes=30
)

# Run scalability tests
results = tester.run_scalability_tests(
    model=model,
    dataset=dataset,
    test_configs=[
        {'nodes': 8,  'batch_size': 32},
        {'nodes': 16, 'batch_size': 64},
        {'nodes': 32, 'batch_size': 128},
        {'nodes': 64, 'batch_size': 256}
    ]
)

# Analyze scalability results
analysis = tester.analyze_scalability(
    results=results,
    metrics=['throughput', 'efficiency', 'communication_overhead']
)

print(f"Optimal configuration: {analysis['optimal_config']}")
print(f"Scalability efficiency: {analysis['efficiency']:.2%}")
```
**Key Features:**
- Automated scalability testing
- Performance benchmarking
- Bottleneck identification
- Optimal configuration recommendations
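Scaling efficiency in such tests is conventionally the measured speedup divided by the ideal linear speedup. A small helper to compute it from measured throughputs (illustrative; not the `ScalabilityTester` API):

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Efficiency = actual speedup / ideal (linear) speedup."""
    speedup = throughput / base_throughput
    ideal = nodes / base_nodes
    return speedup / ideal

# Scaling from 8 nodes at 1000 samples/s to 64 nodes at 6400 samples/s:
# speedup 6.4x vs ideal 8x -> 0.8 (80% efficiency)
```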
#### 4. ResourceManager
Manages computational resources across distributed nodes with intelligent allocation and monitoring.
```python
from evospikenet.distributed_training import ResourceManager

resource_manager = ResourceManager(
    resource_policies={
        'cpu_allocation': 'dynamic',
        'memory_management': 'aggressive',
        'gpu_scheduling': 'fair_share'
    }
)

# Initialize resource allocation
resource_manager.initialize_resources(
    total_nodes=24,
    node_specs=[{'cpu': 32, 'memory': 128e9, 'gpu': 8e9} for _ in range(24)]
)

# Allocate resources for a training job
allocation = resource_manager.allocate_resources(
    job_requirements={
        'model_size': 'large',
        'batch_size': 64,
        'expected_duration': '24h'
    }
)

# Monitor resource usage
usage_stats = resource_manager.monitor_resources()
for node_id, stats in usage_stats.items():
    print(f"Node {node_id}: CPU {stats['cpu']:.1%}, Memory {stats['memory']:.1%}, GPU {stats['gpu']:.1%}")
```
**Key Features:**
- Dynamic resource allocation
- Real-time monitoring
- Load balancing
- Resource optimization
5. TrainingStateManager
Manages training state across distributed nodes with synchronization and persistence.
```python
from evospikenet.distributed_training import TrainingStateManager

# (constructor arguments were partially lost in the original; sync_interval is assumed)
state_manager = TrainingStateManager(
    sync_interval=10,  # seconds
    consistency_level='strong'
)

# Initialize training state
state_manager.initialize_training_state(
    initial_epoch=0,
    initial_step=0,
    model_config=model_config,
    optimizer_config=optimizer_config
)

# Synchronize state across nodes
state_manager.sync_training_state(
    current_state={
        'epoch': current_epoch,
        'step': current_step,
        'loss': current_loss,
        'metrics': current_metrics
    }
)

# Retrieve synchronized state
global_state = state_manager.get_global_training_state()
print(f"Global epoch: {global_state['epoch']}")
print(f"Global best loss: {global_state['best_loss']}")
```
**Key Features:**
- Distributed state synchronization
- Persistent state storage
- Consistency guarantees
- State recovery
#### 6. GradientSynchronizer
Advanced gradient synchronization with communication optimization and compression.
```python
from evospikenet.distributed_training import GradientSynchronizer

# (some constructor arguments were lost in the original; compression settings omitted)
gradient_sync = GradientSynchronizer(
    world_size=24,
    backend='nccl',
    overlap_computation=True
)

# Setup gradient synchronization
gradient_sync.setup_synchronization(
    model=model,
    optimizer=optimizer
)

# Training step with optimized gradient sync
for batch in dataloader:
    # Forward pass
    outputs = model(batch['inputs'])
    loss = criterion(outputs, batch['targets'])
    # Backward pass
    loss.backward()
    # Synchronize gradients with optimization
    gradient_sync.synchronize_gradients(
        compression_ratio=0.1,  # 10% of original size
        overlap_with_computation=True
    )
    # Optimizer step
    optimizer.step()
```
**Key Features:**
- Gradient compression
- Communication overlap
- Bandwidth optimization
- Synchronization efficiency
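Compressing gradients to ~10% of their original size typically means keeping only the largest-magnitude entries (top-k sparsification) and transmitting their indices and values. A framework-free sketch of the idea, using plain lists rather than tensors (not the `GradientSynchronizer` internals):

```python
def topk_compress(grad, ratio=0.1):
    """Keep the given fraction of entries with the largest magnitudes; return (indices, values)."""
    k = max(1, int(len(grad) * ratio))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = sorted(order[:k])  # transmit indices in ascending order
    return kept, [grad[i] for i in kept]

def topk_decompress(indices, values, size):
    """Reconstruct a dense gradient with zeros outside the kept entries."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense
```

Real systems also accumulate the dropped residual locally so small gradients are not lost permanently.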
7. NodeHealthMonitor
Monitors health and performance of distributed nodes with proactive issue detection.
```python
from evospikenet.distributed_training import NodeHealthMonitor

# (some alert thresholds were lost in the original; only the surviving ones are shown)
health_monitor = NodeHealthMonitor(
    monitoring_interval=30,  # seconds
    alert_thresholds={
        'network_latency': 1000  # ms
    }
)

# Start health monitoring
health_monitor.start_monitoring(
    node_ids=range(24),
    monitoring_metrics=['cpu', 'memory', 'gpu', 'network', 'disk']
)

# Get health status
health_status = health_monitor.get_cluster_health()
for node_id, status in health_status.items():
    if status['overall'] != 'healthy':
        print(f"Node {node_id} issues: {status['issues']}")

# Proactive issue detection
issues = health_monitor.detect_potential_issues()
for issue in issues:
    print(f"Potential issue: {issue['type']} on node {issue['node_id']}")
```
**Key Features:**
- Real-time health monitoring
- Proactive issue detection
- Alert system
- Performance tracking
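Threshold-based alerting of this kind is straightforward to express; the sketch below (`check_node_health` is an illustrative name, not the EvoSpikeNet API) flags any metric that exceeds its alert threshold, using the same threshold values as the configuration section.

```python
# Alert thresholds matching the values used in this guide
THRESHOLDS = {
    'cpu_usage': 0.95,
    'memory_usage': 0.90,
    'gpu_memory': 0.95,
    'network_latency': 1000,  # ms
}

def check_node_health(metrics, thresholds=THRESHOLDS):
    """Return a status dict listing every metric above its threshold."""
    issues = [name for name, limit in thresholds.items()
              if metrics.get(name, 0) > limit]
    return {'overall': 'healthy' if not issues else 'degraded',
            'issues': issues}

status = check_node_health({'cpu_usage': 0.97, 'memory_usage': 0.50,
                            'gpu_memory': 0.40, 'network_latency': 120})
# cpu_usage exceeds 0.95, so this node is reported as degraded
```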
#### 8. DistributedTrainingManager
Integrated manager that coordinates all distributed training components.
```python
from evospikenet.distributed_training import DistributedTrainingManager

# Initialize distributed training manager
training_manager = DistributedTrainingManager(
    cluster_config={'world_size': 24},
    fault_tolerance_enabled=True,
    scalability_testing_enabled=True
)

# Setup complete distributed training
training_manager.setup_distributed_training(
    model=model,
    optimizer=optimizer,
    dataset=dataset,
    training_config={
        'batch_size': 64,
        'max_epochs': 100,
        'checkpoint_interval': 500,
        'scalability_test_interval': 1000
    }
)

# Run distributed training with all features
results = training_manager.run_distributed_training()

# Get comprehensive training report
report = training_manager.generate_training_report()
print(f"Training completed in {report['total_time']}")
print(f"Final loss: {report['final_loss']}")
print(f"Scalability achieved: {report['scalability_efficiency']:.2%}")
```
**Key Features:**
- Unified distributed training interface
- Automatic component coordination
- Comprehensive monitoring and reporting
- Production-ready deployment
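How a single manager can coordinate independent components is sketched below. All class names here are illustrative, not the EvoSpikeNet API: the manager is a facade that fans one setup call and one report call out to every registered component.

```python
class Component:
    """Base class for a pluggable training subsystem."""
    name = 'component'

    def setup(self, config):
        self.config = config

    def report(self):
        return {self.name: 'ok'}

class CheckpointerSketch(Component):
    name = 'checkpointer'

class HealthMonitorSketch(Component):
    name = 'health_monitor'

class TrainingManagerSketch:
    """Facade: one call drives the shared lifecycle of every component."""

    def __init__(self, components):
        self.components = components

    def setup_distributed_training(self, config):
        for c in self.components:
            c.setup(config)

    def generate_training_report(self):
        # Merge each component's report into one summary
        report = {}
        for c in self.components:
            report.update(c.report())
        return report

manager = TrainingManagerSketch([CheckpointerSketch(), HealthMonitorSketch()])
manager.setup_distributed_training({'world_size': 24})
report = manager.generate_training_report()
# report == {'checkpointer': 'ok', 'health_monitor': 'ok'}
```

The benefit of this shape is that fault tolerance, health monitoring, and scalability testing can evolve independently behind one stable entry point.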
### Integration Examples
#### Large-Scale Training Setup
```python
import time

# Configure for 100+ node training
training_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 128,
        'backend': 'nccl',
        'fault_tolerance_level': 'high'  # reconstructed; the original value was garbled
    }
)

# Setup training with advanced features
training_manager.setup_distributed_training(
    model=large_model,
    dataset=huge_dataset,
    training_config={
        'initial_batch_size': 32,
        'adaptive_batching': True,
        'gradient_compression': 'quantization',
        'checkpoint_strategy': 'incremental'
    }
)

# Monitor training progress
while training_manager.is_training_active():
    status = training_manager.get_training_status()
    print(f"Epoch {status['epoch']}, Loss: {status['loss']:.4f}")
    print(f"Nodes active: {status['active_nodes']}/{status['total_nodes']}")
    time.sleep(60)  # Check every minute
```
#### Fault-Tolerant Training
```python
# Configure for high-reliability training
fault_tolerant_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 64,
        'fault_tolerance_level': 'high',
        'auto_recovery': True,
        'checkpoint_frequency': 'high'
    }
)

# Training with automatic fault recovery
try:
    results = fault_tolerant_manager.run_distributed_training()
except Exception as e:
    print(f"Training interrupted: {e}")
    # Manager automatically handles recovery
    recovery_status = fault_tolerant_manager.get_recovery_status()
    print(f"Recovery progress: {recovery_status['progress']:.1%}")
```
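The `checkpoint_resume` recovery strategy used above boils down to: periodically save progress, and on failure roll back to the last save instead of step zero. The in-memory sketch below is illustrative (a real system would persist checkpoints to disk or a store like Redis); `train_with_recovery` and its failure injection are not part of the EvoSpikeNet API.

```python
def train_with_recovery(total_steps, checkpoint_interval, fail_at=None):
    """Run `total_steps` steps, checkpointing every `checkpoint_interval`.

    If a failure is injected at step `fail_at`, roll back to the last
    checkpoint and continue -- the essence of checkpoint-resume recovery.
    Returns (last_checkpointed_step, list_of_steps_actually_executed).
    """
    checkpoint = {'step': 0}
    executed = []          # every step actually run, including replays
    step = 0
    failed_once = False
    while step < total_steps:
        if fail_at is not None and step == fail_at and not failed_once:
            failed_once = True
            step = checkpoint['step']  # resume from the last checkpoint
            continue
        executed.append(step)
        step += 1
        if step % checkpoint_interval == 0:
            checkpoint['step'] = step  # persist progress

    return checkpoint['step'], executed

# Failure at step 6 rolls back to the checkpoint taken at step 4,
# so only steps 4 and 5 are replayed rather than the whole run
final, executed = train_with_recovery(total_steps=10,
                                      checkpoint_interval=4, fail_at=6)
```

A smaller `checkpoint_interval` means less replayed work after a failure but more checkpointing overhead, which is the trade-off behind the `checkpoint_frequency: 'high'` setting above.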
### Configuration Options
```yaml
distributed_training:
  coordinator:
    world_size: 24
    backend: nccl
    master_addr: "192.168.1.100"
    master_port: 12345
    timeout: 600

  fault_tolerance:
    enabled: true
    checkpoint_interval: 100
    max_retries: 3
    recovery_strategy: checkpoint_resume
    auto_recovery: true

  scalability_testing:
    enabled: true
    test_interval: 1000
    min_nodes: 8
    max_nodes: 128
    test_duration_minutes: 30

  resource_management:
    dynamic_allocation: true
    load_balancing: true
    memory_optimization: true
    gpu_scheduling: fair_share

  gradient_synchronization:
    compression_type: quantization
    compression_ratio: 0.1
    overlap_computation: true
    bandwidth_optimization: true

  health_monitoring:
    enabled: true
    monitoring_interval: 30
    alert_thresholds:
      cpu_usage: 0.95
      memory_usage: 0.90
      gpu_memory: 0.95
      network_latency: 1000

  state_management:
    persistence_backend: redis
    sync_interval: 10
    consistency_level: strong
    state_compression: true
```
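Misconfigured values (an invalid backend, a world size of zero) tend to surface late as cryptic runtime errors, so it can pay to validate the configuration before launching. The stdlib-only sketch below is illustrative, not part of EvoSpikeNet; it checks a few of the fields above, represented as a plain dict, and the accepted backends (`nccl`, `gloo`, `mpi`) are the usual PyTorch distributed backends, an assumption you should adjust to your cluster.

```python
def validate_distributed_config(cfg):
    """Return a list of problems found in a distributed_training config dict."""
    problems = []

    coord = cfg.get('coordinator', {})
    if coord.get('world_size', 0) < 1:
        problems.append('coordinator.world_size must be >= 1')
    if coord.get('backend') not in ('nccl', 'gloo', 'mpi'):
        problems.append('coordinator.backend must be one of nccl, gloo, mpi')

    sync = cfg.get('gradient_synchronization', {})
    ratio = sync.get('compression_ratio', 1.0)
    if not 0.0 < ratio <= 1.0:
        problems.append('gradient_synchronization.compression_ratio must be in (0, 1]')

    return problems

config = {
    'coordinator': {'world_size': 24, 'backend': 'nccl'},
    'gradient_synchronization': {'compression_ratio': 0.1},
}
assert validate_distributed_config(config) == []  # valid config: no problems

bad = {'coordinator': {'world_size': 0, 'backend': 'tcp'}}
# Two problems: invalid world_size and unsupported backend
```

With PyYAML available, the same check applies to the YAML above via `yaml.safe_load` before handing the dict to the trainer.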
### Best Practices
- Cluster Setup: Ensure proper network configuration and firewall settings
- Resource Allocation: Monitor resource usage and adjust allocation policies
- Fault Tolerance: Enable comprehensive fault tolerance for production training
- Scalability Testing: Regularly test scalability with different configurations
- Monitoring: Implement comprehensive monitoring and alerting
- Checkpointing: Use frequent checkpointing for long-running training jobs
- Network Optimization: Optimize network settings for gradient synchronization
### Troubleshooting
**Common Issues:**
- Communication timeouts: Increase timeout values or check network connectivity
- Memory issues: Enable gradient compression or reduce batch sizes
- Node failures: Ensure fault tolerance is properly configured
- Performance degradation: Run scalability tests to identify bottlenecks
**Debug Mode:**
```python
training_manager.enable_debug_mode()
training_manager.log_detailed_metrics()
training_manager.enable_performance_profiling()
```
### Performance Optimization
- Gradient Compression: Use quantization or sparsification to reduce communication overhead
- Communication Overlap: Enable computation-communication overlap for better utilization
- Adaptive Batching: Allow dynamic batch size adjustment based on performance
- Resource Balancing: Regularly rebalance resources across nodes
- Network Tuning: Optimize network settings for your cluster topology
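Adaptive batching, mentioned above, can be reduced to a simple control rule: grow the batch while measured throughput improves, back off when it regresses. The sketch below is illustrative (`adapt_batch_size` is not an EvoSpikeNet function) and uses a multiplicative-increase / multiplicative-decrease policy.

```python
def adapt_batch_size(batch_size, throughput, prev_throughput,
                     min_size=8, max_size=512):
    """One step of multiplicative batch-size control.

    Double the batch while samples/sec keeps improving; halve it on a
    regression. Bounds keep the size within a safe range for memory.
    """
    if throughput >= prev_throughput:
        return min(batch_size * 2, max_size)
    return max(batch_size // 2, min_size)

bs = 32
bs = adapt_batch_size(bs, throughput=1200, prev_throughput=1000)  # improved: 64
bs = adapt_batch_size(bs, throughput=900, prev_throughput=1200)   # regressed: 32
```

A real controller would smooth the throughput signal over several steps before reacting, to avoid oscillating on noisy measurements.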