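Large-Scale Training Guide

[!NOTE] For the latest implementation status, see Feature Implementation Status (Remaining Functionality).

Last updated: January 12, 2026
Color learning integrated: January 12, 2026

Overview

EvoSpikeNet's large-scale training system provides a multimodal AI training environment built on a 24-node distributed brain architecture. This guide explains how to launch large-scale training, how the data is structured, and where data and models are stored.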

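Comprehensive AI Training System ⭐ UPDATED

EvoSpikeNet provides a comprehensive training system covering language understanding, landmark recognition, Japanese-English speech recognition, and multimodal integration. Data for each modality is stored and managed separately, so each model is trained on the dataset that matches it.

Features

🗣️ Language understanding: 3,080,000 Japanese text samples (optimized for rinna/japanese-gpt-1b)
🏛️ Landmark recognition: 100,000 image samples of world landmarks
🎤 Japanese-English speech recognition: 565,000 high-quality ASR samples (LibriSpeech, Common Voice, VoxPopuli, ReazonSpeech)
🔗 Multimodal: combined image + text training data
📁 Separated data storage: per-category directory structure for efficient data management
🚀 Easy launch: one command starts data collection and training for all categories
⚡ Automatic optimization: GPU/CPU auto-detection, memory optimization, batch-size adjustment
📈 Scalable: large-scale parallel processing on the 24-node distributed architecture
🎯 Rank specialization: each node (ranks 0-23) produces a dedicated LLM optimized for its specialty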

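Rank-Specific Training

In the 24-node distributed brain architecture, each rank plays a different role and produces an LLM optimized for its specialty:

Rank range	Role	Specialty	Main uses
0-7	Language understanding nodes	Japanese NLP	Semantic understanding, context analysis
8-11	Visual processing nodes	Landmark recognition	Image understanding, object detection
12-15	Audio processing nodes	Japanese-English ASR	Speech recognition, multilingual processing
16-19	Motor control nodes	Action generation	Action planning, output generation
20-21	Memory nodes	Episodic memory	Long-term memory, experience integration
22-23	Decision-making nodes	High-level reasoning	Strategic judgment, executive function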
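The rank-to-role table can be expressed as a small lookup. The following is a minimal sketch, not the project's implementation: the names `RANK_ROLES` and `role_for_rank` and the short role labels are assumptions made for illustration; only the rank ranges come from the table.

```python
# Rank ranges copied from the table above.
# RANK_ROLES and role_for_rank are hypothetical names, not part of EvoSpikeNet.
RANK_ROLES = [
    (range(0, 8), "language"),
    (range(8, 12), "vision"),
    (range(12, 16), "audio"),
    (range(16, 20), "motor"),
    (range(20, 22), "memory"),
    (range(22, 24), "decision"),
]


def role_for_rank(rank: int) -> str:
    """Return the role label for a rank in 0-23, or raise ValueError."""
    for ranks, role in RANK_ROLES:
        if rank in ranks:
            return role
    raise ValueError(f"rank must be in 0-23, got {rank}")
```

For example, `role_for_rank(9)` yields `"vision"`, which matches the `--rank 9 --category vision` style of invocation used throughout this guide.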

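# Rank-specific training examples
./scripts/train_launcher.sh rank --rank 0 --category langtext   # LLM specialized for language understanding
./scripts/train_launcher.sh rank --rank 8 --category vision     # LLM specialized for visual processing
./scripts/train_launcher.sh rank --rank 12 --category audio     # LLM specialized for audio processing

Shared Model Training

You can also train general-purpose models that are shared across multiple ranks:

# Create a shared model for a rank range
./scripts/train_launcher.sh shared --category langtext --rank-range 0-7   # shared by the language understanding nodes (0-7)
./scripts/train_launcher.sh shared --category vision --rank-range 8-11    # shared by the visual processing nodes (8-11)

# Create a general shared model (usable by all ranks)
./scripts/train_launcher.sh shared --category multimodal --shared         # general-purpose multimodal model

Shared models let multiple nodes with similar functions share resources efficiently.

Quick Start

# Start comprehensive AI training with a single command
./scripts/start_japanese_training.sh

For details, see the "Japanese Training Settings" section later in this guide.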

Prerequisites

System Requirements

CPU: Intel/AMD x64, ARM64, Apple Silicon
GPU: NVIDIA GPU (CUDA 11.8+), AMD GPU (ROCm), Apple Silicon GPU
Memory: 16 GB minimum, 64 GB or more recommended
Storage: 100 GB SSD minimum, 1 TB or more recommended for large-scale training
OS: Linux, macOS, Windows (WSL2)

Software Requirements

Docker: 20.10+
Docker Compose: 2.0+
Kubernetes: 1.24+ (for cluster deployments)
Python: 3.10+
CUDA: 11.8+ (when using a GPU)

Network Requirements

Internet connection: for downloading data
Internal network: for communication between distributed nodes
Open ports: 8000-8007 (API), 5432 (PostgreSQL), 9200 (Elasticsearch)

Environment Setup

1. Clone the repository

git clone https://github.com/your-org/EvoSpikeNet.git
cd EvoSpikeNet

2. Set environment variables

# Create the .env file
cp .env.example .env

# Example contents to edit
EVOSPIKENET_API_KEYS=your_api_key_here
DATABASE_URL=postgresql://user:password@localhost/evospikenet
OPENAI_API_KEY=your_openai_key
CUDA_VISIBLE_DEVICES=0,1,2,3  # when using GPUs

3. Set up the Python environment

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate    # Windows

# Install dependencies
pip install -r requirements.txt
pip install -e .

Data Preparation

Bulk Data Download 🚀

Quick start (download all categories at once)

# Download data for all categories at once
python scripts/collect_llm_training_data.py --config config/data_config.yaml --all

# Run in the background
nohup python scripts/collect_llm_training_data.py --config config/data_config.yaml --all > download.log 2>&1 &

Bulk download by category

# Download the language data (Japanese) in bulk (13M+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category langtext

# Download the Vision data in bulk (190K+ samples)
python scripts/download_vision_data.py --quick  # high-priority datasets only
python scripts/download_vision_data.py --all    # all datasets

# Download the Audio data in bulk (565K+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category audio

# Download the MultiModal data in bulk (885K+ samples)
python scripts/collect_llm_training_data.py --config config/data_config.yaml --category multimodal

Checking download status

# Check the downloaded data
python scripts/verify_training_data_sufficiency.py

# Check the amount of data per category
find data/llm_training/ -type f -name "*.jsonl" -exec wc -l {} +

# Check the number of samples in the Vision datasets
python -c "
from datasets import load_from_disk
import os
for dataset in ['cifar10', 'cifar100', 'fashion_mnist']:
    for split in ['train', 'test']:
        path = f'data/llm_training/Vision/{dataset}/{split}'
        if os.path.exists(path):
            ds = load_from_disk(path)
            print(f'{dataset}/{split}: {len(ds):,} samples')
"

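Data download options

Option	Description	Example
`--all`	Download all categories	`--all`
`--category <name>`	Download a single category	`--category langtext`
`--rebuild`	Rebuild the data	`--rebuild`
`--max-samples <n>`	Limit the number of samples	`--max-samples 10000`
`--parallel <n>`	Download in parallel (`<n>` is the degree of parallelism)	`--parallel 4`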

Data Structure

data/
├── llm_training/           # LLM training data (stored by category)
│   ├── LangText/          # text data for language understanding
│   │   ├── langtext_en_data.jsonl    # English text data
│   │   └── langtext_ja_data.jsonl    # Japanese text data (13M+ samples)
│   ├── Vision/            # image data
│   │   ├── cifar10/       # CIFAR-10 (60K)
│   │   ├── cifar100/      # CIFAR-100 (60K)
│   │   ├── fashion_mnist/ # Fashion-MNIST (70K)
│   │   └── vision_data.jsonl         # landmark image data
│   ├── Audio/             # speech recognition data (565K+ samples)
│   │   └── audio_data.jsonl          # ASR training data
│   └── MultiModal/        # multimodal integration data (885K+ samples)
│       └── multimodal_data.jsonl     # multimodal data
├── MNIST/                 # MNIST dataset
├── audio_dataset/         # audio datasets
├── multi_modal_dataset/   # multimodal datasets
└── checkpoints/           # checkpoints

Data separation by category:
LangText: text data for training language understanding and generation models
Vision: image recognition data for world landmarks
Audio: speech recognition (ASR) data in both Japanese and English
MultiModal: combined image + text training data

Each category's data is stored in its own JSONL file and used to train the corresponding type of model.
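Because each category keeps its samples in JSONL files under its own directory, per-category sample counts can be obtained by counting lines. The sketch below is illustrative only: `count_jsonl_samples` is a hypothetical helper (the project's own check is `scripts/verify_training_data_sufficiency.py`), and it assumes one JSON record per non-empty line.

```python
from pathlib import Path


def count_jsonl_samples(root: str = "data/llm_training") -> dict[str, int]:
    """Count non-empty lines of every <category>/*.jsonl file under root.

    Returns a mapping from category directory name to sample count.
    Hypothetical helper for illustration; not part of EvoSpikeNet.
    """
    counts: dict[str, int] = {}
    for path in sorted(Path(root).glob("*/*.jsonl")):
        with path.open(encoding="utf-8") as f:
            n = sum(1 for line in f if line.strip())
        category = path.parent.name
        counts[category] = counts.get(category, 0) + n
    return counts
```

Calling `count_jsonl_samples()` from the repository root returns one entry per category directory that contains JSONL files. The Vision datasets stored as Hugging Face dataset directories are not counted by this sketch.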

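Data Collection Scripts

Collecting LLM training data

# Check the configuration file
cat config/data_config.yaml

# Run data collection
python scripts/collect_llm_training_data.py --config config/data_config.yaml

Rank-specific data collection

In the 24-node distributed brain architecture, each rank collects data optimized for its specialty:

# Data collection for the language understanding nodes (ranks 0-7)
./scripts/train_launcher.sh collect --rank 0    # Aozora Bunko, Japanese Wikipedia
./scripts/train_launcher.sh collect --rank 1    # classical Japanese literature, dialogue data

# Data collection for the visual processing nodes (ranks 8-11)
./scripts/train_launcher.sh collect --rank 8    # ImageNet, COCO datasets
./scripts/train_launcher.sh collect --rank 9    # CIFAR-100, landmark recognition

# Data collection for the audio processing nodes (ranks 12-15)
./scripts/train_launcher.sh collect --rank 12   # Common Voice Japanese, LibriSpeech
./scripts/train_launcher.sh collect --rank 13   # TEDlium, speech translation data

# Data collection for the motor control nodes (ranks 16-19)
./scripts/train_launcher.sh collect --rank 16   # Roboturk, action generation data
./scripts/train_launcher.sh collect --rank 17   # trajectory planning, sequence data

# Data collection for the memory nodes (ranks 20-21)
./scripts/train_launcher.sh collect --rank 20   # episodic memory data
./scripts/train_launcher.sh collect --rank 21   # time-series data, long-term dependencies

# Data collection for the decision-making nodes (ranks 22-23)
./scripts/train_launcher.sh collect --rank 22   # strategy games, decision-making tasks
./scripts/train_launcher.sh collect --rank 23   # reinforcement learning data, optimization problems

Data collection for each rank automatically selects the most suitable data sources and downloads high-quality data for that specialty.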

Collecting from individual data sources

# Wikipedia data
python -c "
from evospikenet.dataloaders import WikipediaLoader
loader = WikipediaLoader(lang='en')
text = loader.load('Python (programming language)')
print(f'Downloaded {len(text)} characters')
"

# Hugging Face datasets
python -c "
from datasets import load_dataset
dataset = load_dataset('imdb', split='train[:10%]')
print(f'Loaded {len(dataset)} samples')
"

Data Formats

Text data (JSONL format)

{"text": "This is a sample text for LLM training.", "source": "wikipedia", "language": "en"}
{"text": "これはLLMトレーニング用のサンプルテキストです。", "source": "aozora", "language": "ja"}
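Each line of the text data is a JSON object with `text`, `source`, and `language` fields, as in the two sample records above. A minimal sketch of a per-line check follows; `parse_record` and `REQUIRED_KEYS` are hypothetical names, and treating all three fields as required non-empty strings is an assumption drawn from the samples, not a documented schema.

```python
import json

# Field names taken from the sample records; treating them all as required
# is an assumption made for this sketch.
REQUIRED_KEYS = ("text", "source", "language")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify that it carries the expected fields."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each JSONL line must be a JSON object")
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value:
            raise ValueError(f"missing or empty field: {key}")
    return record
```

Running such a check over a file before training surfaces malformed lines early instead of partway through an epoch.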

Image data (ImageFolder format)

data/
├── train/
│   ├── class1/
│   │   ├── image001.jpg
│   │   └── image002.jpg
│   └── class2/
│       ├── image003.jpg
│       └── image004.jpg
└── test/
    ├── class1/
    └── class2/

Audio data (one folder per class)

data/audio_dataset/
├── speech_commands/
│   ├── yes/
│   ├── no/
│   ├── up/
│   └── down/
└── custom_audio/
    ├── music/
    └── speech/

Launching Training

Bulk Training Launch 🎯

Quick start (train all categories at once)

# Launch training on all 24 nodes at once
./scripts/train_all_nodes.sh

# Or launch each rank individually
for rank in {0..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --rank $rank \
    --gpu &
done

Bulk training by category

# Train all language understanding nodes (ranks 0-7)
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --gpu &
done

# Train all Vision nodes (ranks 8-11)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank $rank \
    --gpu &
done

# Train all Audio nodes (ranks 12-15)
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank $rank \
    --gpu &
done

# Train all MultiModal nodes (ranks 16-23)
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank $rank \
    --gpu &
done
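The loops above start every rank at once with `--gpu`. On a machine with fewer GPUs than ranks, one option is to pin each process to a device through `CUDA_VISIBLE_DEVICES`, the same variable this guide uses in its `.env` example. The sketch below assumes a simple round-robin assignment; `env_for_rank` is a hypothetical helper, and whether `train_llm_models.py` performs its own device selection is not stated in this guide.

```python
import os


def env_for_rank(rank: int, num_gpus: int) -> dict[str, str]:
    """Return a copy of the environment with one GPU assigned round-robin.

    Hypothetical helper: rank r is pinned to GPU (r mod num_gpus).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(rank % num_gpus)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.Popen` when launching each rank. With 4 GPUs, ranks 0, 4, 8, ... share GPU 0, so memory per process still has to fit alongside its neighbors.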

Bulk Color Learning training

# Color learning training on all ranks (minimum level)
for rank in {0..23}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level minimum \
    --rank $rank \
    --gpu &
done

# Vision-specialized color learning (standard level)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level standard \
    --rank $rank \
    --gpu &
done

# High-precision color learning (maximum level)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category color_learning \
    --color-level maximum \
    --rank $rank \
    --gpu &
done

Bulk launch by training level

Level	Colors	Recommended ranks	Example
minimum	8-16	All nodes	`--color-level minimum`
standard	32-64	Vision nodes	`--color-level standard --rank 8-11`
maximum	128-256	Vision nodes	`--color-level maximum --rank 8-11`

1. Launching with Docker Compose

GPU training

# Start LLM training in a GPU environment
docker-compose -f docker-compose.train.yml up llm-trainer-gpu

# Run in the background
docker-compose -f docker-compose.train.yml up -d llm-trainer-gpu

CPU training

# Start LLM training in a CPU environment
docker-compose -f docker-compose.train.yml up llm-trainer-cpu

# Run in the background
docker-compose -f docker-compose.train.yml up -d llm-trainer-cpu

2. Distributed training with Kubernetes

# Deploy to the Kubernetes cluster
kubectl apply -f k8s/deployment.yaml

# Start the training job
kubectl apply -f k8s/training-job.yaml

# Check status
kubectl get pods -n evospikenet
kubectl logs -f deployment/evospikenet-trainer -n evospikenet

3. Running the scripts directly

Starting the API server

# Start the training server in API mode
python scripts/train_llm_models.py --config config/training_config.yaml --mode api --gpu

# CPU mode
python scripts/train_llm_models.py --config config/training_config.yaml --mode api --cpu

Running training directly

# Train the LangText model
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --gpu \
    --epochs 10 \
    --batch-size 16

# Train the Vision model
python examples/train_vision_encoder.py \
    --dataset mnist \
    --epochs 50 \
    --batch-size 128 \
    --gpu

# Train the Audio model
python examples/train_audio_encoder.py \
    --epochs 30 \
    --batch-size 32 \
    --gpu

4. Distributed training

Distributed training across multiple nodes

# Master node
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --mode distributed \
    --rank 0 \
    --world-size 4 \
    --master-addr localhost \
    --master-port 12345

# Worker node 1
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --mode distributed \
    --rank 1 \
    --world-size 4 \
    --master-addr master-node-ip \
    --master-port 12345
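The example sets `--world-size 4` but shows commands for ranks 0 and 1 only; ranks 2 and 3 follow the same pattern. As a sketch, the full set of commands can be generated as below. `distributed_commands` is a hypothetical helper that only assembles the flags shown above; it assumes every rank, including the master, is given the master's reachable address.

```python
def distributed_commands(master_addr: str, world_size: int,
                         master_port: int = 12345) -> list[list[str]]:
    """Build one argument list per rank, mirroring the flags shown above."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return [
        [
            "python", "scripts/train_llm_models.py",
            "--config", "config/training_config.yaml",
            "--mode", "distributed",
            "--rank", str(rank),
            "--world-size", str(world_size),
            "--master-addr", master_addr,
            "--master-port", str(master_port),
        ]
        for rank in range(world_size)
    ]
```

Each inner list is meant to be run on the machine that hosts that rank, for example with `subprocess.run(cmd, check=True)`.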

5. Rank-specific training

In the 24-node distributed brain architecture, each rank (0-23) produces an LLM optimized for its specialty. Rank-specific training automatically selects the model architecture, training parameters, and datasets suited to the rank.

Running rank-specific training

# Training for the language understanding nodes (ranks 0-7)
./scripts/train_launcher.sh rank --rank 0 --category langtext --gpu
./scripts/train_launcher.sh rank --rank 1 --category langtext --gpu

# Training for the visual processing nodes (ranks 8-11)
./scripts/train_launcher.sh rank --rank 8 --category vision --gpu
./scripts/train_launcher.sh rank --rank 9 --category vision --gpu

# Training for the audio processing nodes (ranks 12-15)
./scripts/train_launcher.sh rank --rank 12 --category audio --gpu
./scripts/train_launcher.sh rank --rank 13 --category audio --gpu

# Training for the motor control nodes (ranks 16-19)
./scripts/train_launcher.sh rank --rank 16 --category motor --gpu
./scripts/train_launcher.sh rank --rank 17 --category motor --gpu

# Training for the memory nodes (ranks 20-21)
./scripts/train_launcher.sh rank --rank 20 --category memory --gpu
./scripts/train_launcher.sh rank --rank 21 --category memory --gpu

# Training for the decision-making nodes (ranks 22-23)
./scripts/train_launcher.sh rank --rank 22 --category decision --gpu
./scripts/train_launcher.sh rank --rank 23 --category decision --gpu

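Automatic rank-specific parameters

Each rank automatically applies the following optimized parameters:

Language understanding nodes (0-7):
  Model: rinna/japanese-gpt-1b
  Optimization: specialized for Japanese NLP tasks
  Datasets: Aozora Bunko, Japanese Wikipedia
  Learning rate: 2e-5
Visual processing nodes (8-11):
  Model: google/vit-base-patch16-224
  Optimization: image classification, object detection
  Datasets: ImageNet, COCO
  Learning rate: 1e-4
Audio processing nodes (12-15):
  Model: openai/whisper-small
  Optimization: speech recognition, multilingual support
  Datasets: Common Voice, LibriSpeech
  Learning rate: 1e-5
Motor control nodes (16-19):
  Model: custom Transformer
  Optimization: sequence generation, action prediction
  Datasets: robotics data, motion trajectories
  Learning rate: 3e-5
Memory nodes (20-21):
  Model: memory-augmented Transformer
  Optimization: long-term dependencies, experience integration
  Datasets: episodic data, time series
  Learning rate: 1e-5
Decision-making nodes (22-23):
  Model: high-level reasoning Transformer
  Optimization: strategic judgment, executive function
  Datasets: decision-making tasks, strategy data
  Learning rate: 2e-5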

Rank-specific training via the API

# Start the API server
python scripts/train_llm_models.py --mode api --gpu

# Create a model dedicated to a single rank
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": 0,
    "epochs": 10,
    "batch_size": 16
  }'

# Create a shared model (usable by multiple ranks)
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": "shared",
    "shared": true,
    "epochs": 10,
    "batch_size": 16
  }'

# Create a base model (for fine-tuning)
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "category": "text_generation",
    "model_name": "rinna/japanese-gpt-1b",
    "dataset_path": "data/llm_training/text_generation",
    "output_dir": "saved_models",
    "rank": "base",
    "epochs": 5,
    "batch_size": 32
  }'

Examples of model names generated via the API:
Single rank: evospike-langtext-r00-v001
Rank range: evospike-langtext-r00-r07-v001
Shared model: evospike-langtext-shared-v001
Base model: evospike-langtext-base-v001
General model: evospike-langtext-general-v001

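Details of Each Training Method 📚

1. Language understanding training (LangText)

Target ranks

Rank 0-7: language understanding nodes (specialized for Japanese NLP)

Datasets

Japanese Wikipedia: 935,640 samples
Common Crawl (Japanese): 2,342,820 samples
OSCAR (Japanese): 1,399,920 samples
Aozora Bunko, papers, dialogue, code, web, novels, legal documents, and others
Total: 14,411,625 samples (OPTIMAL)

Data download

# Download all language data
python scripts/collect_llm_training_data.py \
    --config config/data_config.yaml \
    --category langtext

# Check download status
wc -l data/llm_training/LangText/langtext_ja_data.jsonl

Training

# Single-rank training (Rank 0)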
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --gpu

# Train all language nodes at once (Rank 0-7)
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --gpu &
done

Recommended settings

# config/training_config.yaml
model:
  name: "rinna/japanese-gpt-1b"
  max_length: 2048
  tokenizer: "rinna/japanese-gpt-1b"

training:
  epochs: 10
  batch_size: 4
  learning_rate: 2e-5
  gradient_accumulation_steps: 8
  warmup_steps: 1000
  fp16: true
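With `batch_size: 4` and `gradient_accumulation_steps: 8`, each optimizer update sees 4 × 8 = 32 samples per device. The arithmetic below is a sketch under two assumptions that this guide does not confirm: a single device, and one epoch covering all 14,411,625 LangText samples. The helper names are hypothetical.

```python
import math


def effective_batch_size(batch_size: int, grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Samples consumed per optimizer update."""
    return batch_size * grad_accum_steps * num_devices


def optimizer_steps_per_epoch(num_samples: int, eff_batch: int) -> int:
    """Optimizer updates needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / eff_batch)


eff = effective_batch_size(batch_size=4, grad_accum_steps=8)   # 32
steps = optimizer_steps_per_epoch(14_411_625, eff)              # 450,364
print(f"effective batch: {eff}, optimizer steps per epoch: {steps:,}")
```

Under these assumptions, `warmup_steps: 1000` covers roughly 0.2% of the first epoch. Raising the per-device batch size while lowering the accumulation steps keeps the effective batch the same but needs more GPU memory.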

Estimated training time

GPU (RTX 3090): 20-30 hours
GPU (A100): 10-15 hours
CPU: 100-150 hours (not recommended)

2. Vision training (image recognition)

Target ranks

Rank 8-11: visual processing nodes (image classification, object detection)

Datasets

CIFAR-10: 60,000 samples (basic image classification)
CIFAR-100: 60,000 samples (fine-grained image classification)
Fashion-MNIST: 70,000 samples (fashion images)
Google Landmarks: landmark recognition
Total: 190,000+ samples (OPTIMAL)

Data download

# Quick download (high-priority datasets: CIFAR-10/100, Fashion-MNIST)
python scripts/download_vision_data.py --quick

# Download all datasets (including Food-101, Oxford Pets, Flowers)
python scripts/download_vision_data.py --all

# Download individual datasets
python scripts/download_vision_data.py --dataset cifar10
python scripts/download_vision_data.py --dataset cifar100
python scripts/download_vision_data.py --dataset fashion_mnist

# List the available datasets
python scripts/download_vision_data.py --list

# Check download status
python -c "
from datasets import load_from_disk
import os
total = 0
for ds in ['cifar10', 'cifar100', 'fashion_mnist']:
    for split in ['train', 'test']:
        path = f'data/llm_training/Vision/{ds}/{split}'
        if os.path.exists(path):
            data = load_from_disk(path)
            samples = len(data)
            total += samples
            print(f'{ds}/{split}: {samples:,} samples')
print(f'\\nTotal: {total:,} samples')
"

Training

# Single-rank training (Rank 8)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank 8 \
    --gpu

# Train all Vision nodes at once (Rank 8-11)
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank $rank \
    --gpu &
done

# Train on a specific dataset
python scripts/train_llm_models.py \
    --category vision \
    --rank 8 \
    --dataset cifar10 \
    --gpu

Recommended settings

# config/training_config.yaml
model:
  name: "google/vit-base-patch16-224"
  image_size: 224
  patch_size: 16

training:
  epochs: 30
  batch_size: 32
  learning_rate: 1e-4
  optimizer: "adamw"
  weight_decay: 0.01
  fp16: true

Estimated training time

GPU (RTX 3090): 5-8 hours
GPU (A100): 3-5 hours

3. Audio training (speech recognition)

Target ranks

Rank 12-15: audio processing nodes (Japanese-English ASR)

Datasets

LibriSpeech: 460,000 samples
Common Voice: 50,000 samples
VoxPopuli: 30,000 samples
ReazonSpeech: 25,000 samples
Total: 565,000+ samples (OPTIMAL)

Data download

# Download all Audio data
python scripts/collect_llm_training_data.py \
    --config config/data_config.yaml \
    --category audio

# Check download status
wc -l data/llm_training/Audio/audio_data.jsonl

Training

# Single-rank training (Rank 12)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank 12 \
    --gpu

# Train all Audio nodes at once (Rank 12-15)
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank $rank \
    --gpu &
done

Recommended settings

# config/training_config.yaml
model:
  name: "openai/whisper-small"
  sampling_rate: 16000
  language: "ja"

training:
  epochs: 20
  batch_size: 16
  learning_rate: 1e-5
  gradient_accumulation_steps: 4
  fp16: true

Estimated training time

GPU (RTX 3090): 10-15 hours
GPU (A100): 6-10 hours

4. MultiModal training (multimodal integration)

Target ranks

Rank 16-19: motor control nodes
Rank 20-21: memory nodes
Rank 22-23: decision-making nodes

Datasets

COCO Captions: 414,000 samples
Flickr30k: 145,000 samples
Conceptual Captions: 300,000 samples
Visual Genome: 26,000 samples
Total: 885,000+ samples (OPTIMAL)

Data download

# Download all MultiModal data
python scripts/collect_llm_training_data.py \
    --config config/data_config.yaml \
    --category multimodal

# Check download status
wc -l data/llm_training/MultiModal/multimodal_data.jsonl

Training

# Single-rank training (Rank 16)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank 16 \
    --gpu

# Train all MultiModal nodes at once (Rank 16-23)
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category multimodal \
    --rank $rank \
    --gpu &
done

Recommended settings

# config/training_config.yaml
model:
  name: "openai/clip-vit-base-patch32"
  text_encoder: "bert-base-uncased"
  vision_encoder: "vit-base-patch32"

training:
  epochs: 15
  batch_size: 24
  learning_rate: 5e-5
  warmup_steps: 500
  fp16: true

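Estimated training time

GPU (RTX 3090): 15-20 hours
GPU (A100): 8-12 hours

5. Comparison of training methods

Category	Target ranks	Data size	Training time (GPU, RTX 3090)	Recommended model	Main use
LangText	0-7	14.4M	20-30 hours	rinna/japanese-gpt-1b	Japanese understanding and generation
Vision	8-11	190K+	5-8 hours	vit-base-patch16-224	Image classification and recognition
Audio	12-15	565K+	10-15 hours	whisper-small	Speech recognition (ASR)
MultiModal	16-23	885K+	15-20 hours	clip-vit-base-patch32	Image + text integration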

6. Script for training all categories at once

#!/bin/bash
# scripts/train_all_categories.sh

echo "=== Starting training for all categories ==="

# LangText (Rank 0-7)
echo "Starting LangText training..."
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --category langtext --rank $rank --gpu &
done

# Vision (Rank 8-11)
echo "Starting Vision training..."
for rank in {8..11}; do
  python scripts/train_llm_models.py \
    --category vision --rank $rank --gpu &
done

# Audio (Rank 12-15)
echo "Starting Audio training..."
for rank in {12..15}; do
  python scripts/train_llm_models.py \
    --category audio --rank $rank --gpu &
done

# MultiModal (Rank 16-23)
echo "Starting MultiModal training..."
for rank in {16..23}; do
  python scripts/train_llm_models.py \
    --category multimodal --rank $rank --gpu &
done

echo "=== Training launched on all 24 nodes ==="
echo "Check progress: tail -f logs/training.log"

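6. Color Learning Training ⭐ NEW

Color learning in the distributed brain system is specialized training that gives each node the ability to understand, process, and generate color information. It offers three training levels (minimum, standard, maximum), each tuned to the node type.

Color learning features

Three training levels: minimum (8-16 colors), standard (32-64 colors), maximum (128-256 colors)
Per-node optimization: specialized for each node type such as PFC, Vision, and Language
Automatic data generation: combines synthetic data with Hugging Face datasets
Transfer learning support: efficient training starting from pretrained models
Knowledge distillation support: knowledge transfer from large models to small models

Color learning level details

Level	Colors	Datasets	Training time	GPU VRAM	Use
Minimum	8-16	MNIST, Basic Colors (150MB)	1-2 hours	2-4GB	Prototyping, basic color classification
Standard	32-64	CIFAR-10/100, ImageNet subset (2-5GB)	4-8 hours	8-12GB	Practical applications, general color recognition
Maximum	128-256	ImageNet, COCO (20-50GB)	12-24 hours	16-24GB	Specialized color processing, research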

Running from the integrated training script (recommended)

# 1. Show the configuration
python scripts/train_llm_models.py --config config/training_config.yaml --show-color-config

# 2. Download data + train (Vision node, Rank 9)
python scripts/train_llm_models.py --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank 9 \
  --download-data \
  --gpu

# 3. Memory node (Rank 20), standard level
python scripts/train_llm_models.py --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 20 \
  --gpu

# 4. Train multiple nodes in parallel
# GPU 0: Vision node
CUDA_VISIBLE_DEVICES=0 python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 9 \
  --gpu &

# GPU 1: Memory node (Rank 20)
CUDA_VISIBLE_DEVICES=1 python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 20 \
  --gpu &

wait

Automatic rank-to-node-type mapping

When the --rank option is given, the appropriate node type is determined automatically:

Rank range	Node type	Importance of color learning	Recommended level
0-7	PFC (prefrontal cortex)	High	Standard
8-11	Vision	Highest	Maximum
12-15	Audio	Low	Minimum
16-19	Motor	Medium	Minimum-Standard
20-21	Memory	Medium	Standard
22-23	PFC (decision-making)	High	Standard

Running from the dedicated scripts (fine-grained control)

# 1. Download the datasets
python scripts/download_color_datasets.py \
    --level minimum \
    --modality all \
    --output-dir data/color_learning

# Show information about the available datasets
python scripts/download_color_datasets.py --show-info

# 2. Train the models
# PFC node (multimodal), minimum level
python scripts/train_color_learning_models.py \
    --node-type pfc \
    --level minimum \
    --dataset-path data/color_learning/multimodal/mnist_captions \
    --epochs 5 \
    --gpu

# Vision node, standard level
python scripts/train_color_learning_models.py \
    --node-type vision \
    --level standard \
    --dataset cifar10 \
    --epochs 20 \
    --gpu

# Language node, standard level
python scripts/train_color_learning_models.py \
    --node-type language \
    --level standard \
    --dataset-path data/color_learning/language \
    --epochs 25 \
    --gpu

# Show the configuration (dry run)
python scripts/train_color_learning_models.py \
    --node-type vision-object \
    --level maximum \
    --show-config

Color learning training via the API

# Start the API server
python scripts/train_llm_models.py --config config/training_config.yaml \
  --mode api --host 0.0.0.0 --port 8000

# Submit a color learning job
curl -X POST "http://localhost:8000/train" \
  -H "Content-Type: application/json" \
  -d '{
    "category": "color_learning",
    "model_name": "evospike-color-vision-r09",
    "dataset_path": "data/color_learning/minimum/vision",
    "output_dir": "saved_models/color_vision_minimum",
    "gpu": true,
    "epochs": 10,
    "batch_size": 16,
    "learning_rate": 0.0001,
    "rank": 9
  }'

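Naming convention for color learning models

Generated color learning models follow this naming convention:

Single rank: evospike-color_learning_vision-r09-v001
Rank range: evospike-color_learning_vision-r09-r11-v001
Shared model: evospike-color_learning_vision-shared-v001

Color learning settings by node type

Each node type has different color learning requirements:

PFC (prefrontal cortex) nodes:
  Minimum: 8 colors, basic color understanding
  Standard: 64 colors, practical color recognition
  Maximum: 256 colors, specialized color processing

Vision nodes:
  Minimum: 16 colors, basic image classification
  Standard: 64 colors, detailed color recognition
  Maximum: 256 colors, professional color processing

Language nodes:
  Minimum: 8 colors, basic understanding of color names
  Standard: 32 colors, detailed color expressions
  Maximum: 128 colors, nuanced color description

Motor nodes:
  Minimum: 8 colors, basic color feedback
  Standard: 16 colors, visual guidance
  Maximum: 32 colors, detailed visual control

Audio nodes:
  Minimum: 4 colors, minimal visual integration
  Standard: 8 colors, basic multimodal support
  Maximum: 16 colors, audio-visual integration

Memory nodes:
  Minimum: 8 colors, basic episode recording
  Standard: 32 colors, detailed memory encoding
  Maximum: 64 colors, high-resolution memory retention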
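The per-node color counts above form a two-level lookup keyed by node type and level. A minimal sketch of that structure follows; the values are copied from the list, while `COLOR_COUNTS`, `num_colors`, and the lowercase keys are assumptions made for illustration.

```python
# Color counts per node type and level, copied from the list above.
# The dictionary and function names are hypothetical.
COLOR_COUNTS = {
    "pfc":      {"minimum": 8,  "standard": 64, "maximum": 256},
    "vision":   {"minimum": 16, "standard": 64, "maximum": 256},
    "language": {"minimum": 8,  "standard": 32, "maximum": 128},
    "motor":    {"minimum": 8,  "standard": 16, "maximum": 32},
    "audio":    {"minimum": 4,  "standard": 8,  "maximum": 16},
    "memory":   {"minimum": 8,  "standard": 32, "maximum": 64},
}


def num_colors(node_type: str, level: str) -> int:
    """Return the number of color classes for a node type at a given level."""
    try:
        return COLOR_COUNTS[node_type.lower()][level.lower()]
    except KeyError:
        raise ValueError(f"unknown node type or level: {node_type!r}, {level!r}") from None
```

A classifier head for a Vision node at the standard level would then have `num_colors("vision", "standard")` outputs, that is, 64.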

Progressive training

Raising the training level in stages makes training more efficient:

#!/bin/bash
# progressive_color_training.sh

RANK=9  # Vision node

# Stage 1: Minimum (foundation)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank $RANK \
  --download-data \
  --gpu

# Stage 2: Standard (transfer learning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank $RANK \
  --gpu

# Stage 3: Maximum (final tuning)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level maximum \
  --rank $RANK \
  --gpu

echo "✅ Progressive training completed for Rank $RANK"

Efficiency through knowledge distillation

Knowledge transfer from a large model to a small model:

# 1. Train the teacher model (maximum level)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level maximum \
  --rank 9 \
  --gpu

# 2. Distill knowledge into the student model (standard level)
# TODO: the knowledge distillation script is planned but not yet implemented

Evaluating color learning

# Evaluate the model's color recognition accuracy
python scripts/evaluate_color_learning.py \
  --model-path saved_models/evospike-color_learning_vision-r09-v001 \
  --test-dataset data/color_learning/standard/vision/test \
  --metrics accuracy,f1,confusion_matrix

# Visualize the results
python scripts/visualize_color_results.py \
  --results results/color_learning_evaluation.json \
  --output visualizations/color_learning

Dataset information

Main datasets used for color learning:

Minimum Level:
  MNIST (60,000 images, grayscale)
  Basic Colors (10,000 synthetic images, 8 colors)
  Total: ~150MB

Standard Level:
  CIFAR-10 (60,000 images, 10 classes)
  CIFAR-100 subset (20,000 images, 64 colors)
  Color Text (50,000 texts, color descriptions)
  Total: ~2-5GB

Maximum Level:
  ImageNet subset (100,000 images)
  COCO (118,287 images)
  Flickr30k (31,000 images)
  Wikipedia Color Corpus (500,000 texts)
  Total: ~20-50GB

Troubleshooting

Out-of-memory errors:

# Reduce the batch size
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level standard \
  --rank 9 \
  --gpu \
  --batch-size 4  # reduced from the default of 16

Data download errors:

# Download the data separately
python scripts/download_color_datasets.py \
  --level minimum \
  --modality vision \
  --output-dir data/color_learning

# Then run training (without --download-data)
python scripts/train_llm_models.py \
  --config config/training_config.yaml \
  --category color_learning \
  --color-level minimum \
  --rank 9 \
  --gpu

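Data Structure and Storage Locations

LLM model naming convention

EvoSpikeNet applies a uniform naming convention to generated LLM models so that the model type, category, rank, and version can be identified at a glance. The convention covers models shared across multiple ranks as well as models dedicated to a single rank.

Structure of the naming convention

Basic format:

{Type}-{Category}-{RankSpec}-v{Version:03d}

Elements:
Type: kind of model architecture
  evospike: EvoSpikeNet's own architecture
  brain: brain-inspired architecture model
  node: node-specialized model
Category: specialty category (based on NODE_TYPE_TO_CATEGORIES)
  text_generation: text generation (executive node)
  image_classification: image classification (vision node)
  object_detection: object detection (vision node)
  speech_recognition: speech recognition (auditory node)
  motion_control: motion control (motor node)
  decision_making: decision making (executive node)
  planning: planning (executive node)
  reasoning: reasoning (executive node)
  rag: RAG, retrieval-augmented generation (executive node)
  multimodal: multimodal (general node)
  embedding: embedding (general node)
RankSpec: rank specification (several patterns)
  Single rank: r{Rank:02d} (e.g. r00, r08)
  Rank range: r{Start:02d}-r{End:02d} (e.g. r00-r07, r08-r11)
  Shared model: shared (can be shared by all ranks)
  General model: general (rank-independent)
  Base model: base (foundation for other models)
Version: version number (001, 002, ...)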
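The format `{Type}-{Category}-{RankSpec}-v{Version:03d}` maps directly onto string formatting, and the same pattern can be parsed back with a regular expression. The following is a sketch with hypothetical helper names (`single_rank`, `rank_range`, `model_name`, `parse_model_name`); it is not the project's naming code, and the regular expression assumes categories contain only lowercase letters and underscores.

```python
import re


def single_rank(rank: int) -> str:
    """RankSpec for one rank, e.g. 0 -> 'r00'."""
    return f"r{rank:02d}"


def rank_range(start: int, end: int) -> str:
    """RankSpec for an inclusive rank range, e.g. (8, 11) -> 'r08-r11'."""
    return f"r{start:02d}-r{end:02d}"


def model_name(model_type: str, category: str, rank_spec: str, version: int) -> str:
    """Assemble a name following {Type}-{Category}-{RankSpec}-v{Version:03d}."""
    return f"{model_type}-{category}-{rank_spec}-v{version:03d}"


_NAME_RE = re.compile(
    r"(?P<type>[a-z]+)-(?P<category>[a-z_]+)-"
    r"(?P<rank_spec>r\d{2}(?:-r\d{2})?|shared|general|base)-v(?P<version>\d{3,})"
)


def parse_model_name(name: str) -> dict:
    """Split a model name into its four elements, or raise ValueError."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid model name: {name}")
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts
```

For instance, `model_name("evospike", "text_generation", single_rank(0), 1)` produces `evospike-text_generation-r00-v001`, and `parse_model_name` recovers the four elements from that string.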

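Naming examples

Models dedicated to a single rank:

evospike-text_generation-r00-v001     # text generation model dedicated to rank 0
evospike-image_classification-r08-v001 # image classification model dedicated to rank 8
evospike-speech_recognition-r12-v001   # speech recognition model dedicated to rank 12
evospike-decision_making-r22-v001     # decision-making model dedicated to rank 22

Models shared across multiple ranks:

evospike-text_generation-r00-r07-v001  # text generation model shared by ranks 0-7
evospike-image_classification-r08-r11-v001 # image classification model shared by ranks 8-11
evospike-speech_recognition-r12-r15-v001   # speech recognition model shared by ranks 12-15
evospike-motion_control-r16-r19-v001     # motion control model shared by ranks 16-19

Shared, general, and base models:

evospike-text_generation-shared-v001   # text generation model that all ranks can share
evospike-image_classification-shared-v001 # image classification model that all ranks can share
evospike-text_generation-general-v001  # rank-independent general text generation model
evospike-text_generation-base-v001     # base text generation model (for fine-tuning)
evospike-multimodal-base-v001          # base multimodal model

Use cases for model sharing

Shared model: multiple nodes with the same function use the same model
Base model: the foundation for building each rank's dedicated model
General model: a general-purpose model without rank-specific optimization
Range model: a model shared by a group of nodes in the same specialty

Benefits of the naming convention

Identifiability: Type/Category/RankSpec/Version are visible at a glance
Flexibility: supports single-rank, multi-rank, and shared models
Extensibility: new sharing patterns are easy to add
Consistency: the same naming across all training methods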

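Naming examples (by node)

# First model for the language understanding node (rank 0)
evospike-langtext-r00-v001

# Improved model for the visual processing node (rank 8)
evospike-vision-r08-v002

# Specialized model for the audio processing node (rank 12)
evospike-audio-r12-v001

# Trained model for the motor control node (rank 16)
evospike-motor-r16-v003

# Optimized model for the memory node (rank 20)
evospike-memory-r20-v002

# High-performance model for the decision-making node (rank 22)
evospike-decision-r22-v001

Benefits of the naming convention

Identifiability: Type, Category, Rank, and Version are visible at a glance
Sortability: easy to order by rank and by version
Extensibility: new categories and types are easy to add
Automation: names can be generated automatically by scripts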

Model storage layout

saved_models/
├── evospike-langtext-r00-v001/     # language understanding node (rank 0) model
│   ├── config.json                # model configuration
│   ├── pytorch_model.bin          # model weights
│   ├── tokenizer.json             # tokenizer configuration
│   ├── vocab.json                 # vocabulary file
│   ├── merges.txt                 # BPE merges file
│   └── training_args.bin          # training arguments
├── evospike-vision-r08-v001/      # visual processing node (rank 8) model
│   ├── model.pth
│   ├── optimizer.pth
│   └── logs/
├── evospike-audio-r12-v001/       # audio processing node (rank 12) model
│   ├── model.pt
│   ├── feature_extractor.json
│   └── logs/
├── evospike-decision-r22-v001/    # decision-making node (rank 22) model
│   ├── model.bin
│   ├── processor_config.json
│   └── logs/
└── checkpoints/                   # training checkpoints
    ├── evospike-langtext-r00-v001-checkpoint-500/
    ├── evospike-langtext-r00-v001-checkpoint-1000/
    └── ...

Log storage layout

logs/
├── training.log                  # main training log
├── tensorboard/                  # TensorBoard logs
│   ├── events.out.tfevents.1234567890.hostname
│   └── ...
├── wandb/                        # Weights & Biases logs
│   ├── run-20231231_123456-abc123/
│   └── ...
└── metrics.json                  # metrics in JSON

Configuration file layout

config/
├── training_config.yaml          # training settings
├── data_config.yaml              # data collection settings
├── settings.yaml                 # application settings
├── settings.production.yaml      # production settings
├── settings.staging.yaml         # staging settings
├── settings.development.yaml     # development settings
├── settings.schema.json          # settings schema
├── node_allocation.yaml          # node allocation settings
└── progress_settings.yaml        # progress settings

Database Structure

PostgreSQL schema

-- Training jobs table
CREATE TABLE training_jobs (
    id SERIAL PRIMARY KEY,
    job_id VARCHAR(255) UNIQUE NOT NULL,
    model_type VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL,
    config JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Model metrics table
CREATE TABLE model_metrics (
    id SERIAL PRIMARY KEY,
    job_id VARCHAR(255) REFERENCES training_jobs(job_id),
    epoch INTEGER,
    step INTEGER,
    loss FLOAT,
    accuracy FLOAT,
    perplexity FLOAT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Distributed nodes table
CREATE TABLE distributed_nodes (
    id SERIAL PRIMARY KEY,
    node_id VARCHAR(255) UNIQUE NOT NULL,
    ip_address INET,
    gpu_count INTEGER,
    memory_gb INTEGER,
    status VARCHAR(50),
    last_heartbeat TIMESTAMP
);

Elasticsearch index

{
  "mappings": {
    "properties": {
      "job_id": {"type": "keyword"},
      "timestamp": {"type": "date"},
      "level": {"type": "keyword"},
      "message": {"type": "text"},
      "metrics": {"type": "object"},
      "node_id": {"type": "keyword"}
    }
  }
}
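As a rough illustration of the mapping above, the following sketch builds a log document whose fields match it. The helper name `build_log_document` and the level vocabulary are assumptions made for this example; they are not part of EvoSpikeNet.

```python
import json
from datetime import datetime, timezone

# Assumed set of log levels; the mapping only says "level" is a keyword.
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}

def build_log_document(job_id, level, message, node_id, metrics=None, timestamp=None):
    """Return a dict with the six fields declared in the index mapping."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown level: {level}")
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "job_id": job_id,              # keyword: exact-match filtering
        "timestamp": ts.isoformat(),   # ISO 8601 string for the date field
        "level": level,                # keyword
        "message": message,            # text: full-text searchable
        "metrics": metrics or {},      # object: free-form numeric metrics
        "node_id": node_id,            # keyword
    }

doc = build_log_document(
    "job-001", "INFO", "step finished", "node-00",
    metrics={"loss": 1.23},
    timestamp=datetime(2026, 1, 12, tzinfo=timezone.utc),
)
print(json.dumps(doc))
```

Keeping `job_id`, `level`, and `node_id` as keywords means they can be used for exact filters and aggregations, while `message` stays searchable as free text.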

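Monitoring and management

Monitoring training progress

Monitoring via the API

# Check training status
curl http://localhost:8000/training/status

# Fetch metrics
curl http://localhost:8000/metrics

# Fetch 100 log lines (URL quoted so the shell does not expand "?")
curl "http://localhost:8000/logs?lines=100"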
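The same endpoints can be called from a script. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes the API returns JSON, which the guide does not state.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port taken from the curl examples

def endpoint_url(path, **params):
    """Build the URL for an endpoint, appending query parameters when given."""
    query = urllib.parse.urlencode(params)
    return BASE_URL + path + ("?" + query if query else "")

def fetch_json(path, **params):
    """GET an endpoint and decode the body as JSON (the JSON format is an assumption)."""
    with urllib.request.urlopen(endpoint_url(path, **params), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

# Building URLs needs no running server; fetch_json() does.
print(endpoint_url("/training/status"))
print(endpoint_url("/logs", lines=100))
```

With the API server running, `fetch_json("/training/status")` could be called in a loop with a sleep between polls.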

Visualizing with TensorBoard

# Start TensorBoard
tensorboard --logdir logs/tensorboard --port 6006

# Open in a browser (macOS; on Linux use xdg-open)
open http://localhost:6006

Tracking with Weights & Biases

# Log in to use the W&B dashboard
wandb login
# Runs are tracked automatically during training

Resource monitoring

Monitoring GPU utilization

# NVIDIA GPU
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.free --format=csv

# AMD GPU
rocm-smi --showuse
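The CSV produced by the nvidia-smi query above can be turned into numbers for alerting or logging. This is a minimal parsing sketch; the sample text is illustrative rather than measured output, and fields reported as `[N/A]` are not handled.

```python
import csv
import io

def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Header cells look like "memory.used [MiB]"; drop the unit suffix.
    header = [cell.split(" [")[0] for cell in next(reader)]
    rows = []
    for fields in reader:
        if not fields:          # skip blank lines
            continue
        # Values look like "45 %" or "10240 MiB"; keep the leading integer.
        rows.append({name: int(value.split()[0]) for name, value in zip(header, fields)})
    return rows

SAMPLE = """utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.free [MiB]
45 %, 20 %, 10240 MiB, 6144 MiB
"""

gpus = parse_gpu_csv(SAMPLE)
print(gpus[0]["utilization.gpu"], gpus[0]["memory.free"])
```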

Monitoring system resources

# CPU and memory usage (pgrep -d, joins multiple PIDs with commas for top -p)
top -p "$(pgrep -d, -f train_llm_models)"

# Disk usage
df -h /path/to/data /path/to/models

# Network usage
iftop -i eth0

Training management

Pausing and resuming training

# Resume from a checkpoint
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --resume-from saved_models/checkpoints/checkpoint-1000 \
    --gpu

Stopping training

# Graceful stop (saves a checkpoint)
curl -X POST http://localhost:8000/training/stop

# Forced stop
pkill -f train_llm_models

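7. Incremental Learning and Federated Learning 🆕

Incremental Learning

Continues training an existing model on new data. Reusing the results of earlier training lets you improve the model efficiently instead of starting over.

Usage

# Resume from a checkpoint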
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --resume-from saved_models/evospike-langtext-r00-v001/checkpoint-1000 \
    --gpu

# Incremental learning on top of an existing model
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --incremental \
    --gpu

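Features

✅ Checkpoint resume: --resume-from restarts interrupted training
✅ Knowledge retention: learns from new data while preserving what the model already learned
✅ Efficient: finishes faster than training from scratch
✅ Catastrophic forgetting mitigation: staged learning minimizes knowledge loss

Use cases

Adding new data: train an existing model on an additional dataset
Continuous improvement: keep improving the model through periodic data updates
Domain adaptation: adapt a general model to a specific domain
Version management: keep a history of step-by-step model improvements

Example: incremental learning for the LangText model

# Step 1: initial training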
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --gpu

# Step 2: incremental learning on new data
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --incremental \
    --gpu

# Step 3: further incremental learning on domain-specific data
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --resume-from saved_models/evospike-langtext-r00-v001 \
    --gpu

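Federated Learning

Trains on multiple nodes and aggregates their model parameters into one combined model. Data stays on each node, so the model can learn from distributed data while preserving privacy.

Usage

# Federated learning mode (FedAvg)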
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --federated \
    --aggregation-method fedavg \
    --federated-rounds 10 \
    --gpu

# Federated learning (FedProx)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category vision \
    --rank 8 \
    --federated \
    --aggregation-method fedprox \
    --federated-rounds 20 \
    --gpu

# Federated learning (FedOpt)
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category audio \
    --rank 12 \
    --federated \
    --aggregation-method fedopt \
    --federated-rounds 15 \
    --gpu

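Aggregation methods

Method	Description	When to use
FedAvg	Simple averaging	General federated learning
FedProx	Training with a regularization term	Imbalanced data across nodes
FedOpt	Adaptive optimizer	When fast convergence is needed

Features

🔒 Privacy protection: training data stays local
🌐 Distributed training: multiple nodes train in parallel
🔄 Model aggregation: the models from each node are merged
📊 Non-IID data support: works when data distributions differ across nodes
⚡ Communication efficiency: only model parameters are transmitted

Federated learning workflow

# Launch federated learning on multiple ranks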
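To make the first two rows of the table concrete, here is a small pure-Python sketch of the standard FedAvg aggregation rule (a sample-count-weighted mean of client parameters) and of the proximal penalty that FedProx adds to each client's local loss. It illustrates the published algorithms under simplifying assumptions (flat lists of floats as parameters) and is not the code behind `--aggregation-method`.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

def fedprox_penalty(local, global_, mu):
    """Proximal term (mu / 2) * ||local - global||^2 added to the local loss in FedProx."""
    return 0.5 * mu * sum((l - g) ** 2 for l, g in zip(local, global_))

# Two hypothetical clients with 100 and 300 samples.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_weights = fedavg(clients, sizes)
print(global_weights)  # [2.5, 3.5]
print(fedprox_penalty(clients[0], global_weights, mu=0.1))
```

The larger client pulls the average toward its parameters, and the FedProx penalty grows as a client's local model drifts away from the global one, which is what limits divergence on imbalanced data.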
for rank in {0..7}; do
  python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank $rank \
    --federated \
    --aggregation-method fedavg \
    --federated-rounds 10 \
    --gpu &
done

echo "Started federated learning on all 8 nodes"

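Use cases

Privacy-preserving AI: training on sensitive data such as medical or financial records
Edge device learning: distributed learning from IoT devices
Cross-silo learning: collaborative model training across organizations
Distributed brain system: building cooperative intelligence on the 24-node distributed brain

Parameters

Parameter	Default	Description
`--federated`	False	Enables federated learning mode
`--aggregation-method`	fedavg	Aggregation method: fedavg/fedprox/fedopt
`--federated-rounds`	10	Number of federated learning rounds
`--resume-from`	None	Checkpoint path
`--incremental`	False	Enables incremental learning mode

Combining incremental and federated learning

# Improve an existing model with federated learning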
python scripts/train_llm_models.py \
    --config config/training_config.yaml \
    --category langtext \
    --rank 0 \
    --resume-from saved_models/evospike-langtext-r00-v001 \
    --federated \
    --aggregation-method fedprox \
    --federated-rounds 10 \
    --gpu

Troubleshooting

Common problems

Out of memory

# Reduce the batch size
--batch-size 4

# Use gradient accumulation
--gradient-accumulation-steps 4

# Raise the CPU memory fraction
--cpu-memory-fraction 0.8

Out of GPU memory

# Reduce allocator fragmentation (this does not add GPU memory)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Use mixed precision
--fp16

# Use a smaller model
--model-size small

Data loading errors

# Check data integrity
python scripts/verify_data_integrity.py --data-dir data/

# Rebuild the data
python scripts/collect_llm_training_data.py --config config/data_config.yaml --rebuild

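Distributed training problems

Inter-node communication errors

# Open the training port in the firewall
sudo ufw allow 12345/tcp

# NCCL debug output; NCCL_IB_DISABLE=1 turns off InfiniBand so NCCL falls back to sockets
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1

Node synchronization errors

# Check time synchronization
chronyc tracking

# Restart chrony to resynchronize via NTP
sudo systemctl restart chrony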

Performance problems

Slow training

# DataLoader tuning
--num-workers 4
--pin-memory
--persistent-workers

# Model optimization
--torch-compile
--flash-attention

Training does not converge

# Adjust the learning rate
--learning-rate 1e-5

# Add warmup
--warmup-steps 1000

# Change the scheduler
--lr-scheduler cosine
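The three flags above usually work together: the learning rate ramps up linearly during warmup and then follows a cosine curve down. The function below sketches that common schedule. `total_steps` is a hypothetical value chosen for the example, and the exact schedule used by `train_llm_models.py` may differ.

```python
import math

def lr_at_step(step, base_lr=1e-5, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

for step in (0, 500, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

Starting from a near-zero learning rate avoids large, noisy updates in the first steps, which is often what makes an otherwise divergent run converge.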

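Performance optimization

GPU optimization

Mixed-precision training

# training_config.yaml
training:
  fp16: true
  bf16: false  # bf16 is recommended on Ampere or newer GPUs
  gradient_checkpointing: true

Distributed data parallel

# Use torchrun (run on every node, changing --node_rank per node)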
torchrun --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=master_node \
    scripts/train_llm_models.py --config config/training_config.yaml

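Data optimization

DataLoader tuning

# Fast data loading settings (dataset is assumed to be an existing torch Dataset)
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # worker processes that load batches in parallel
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2         # batches prefetched per worker
)

Data preprocessing

# Pre-tokenize the dataset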
python scripts/pretokenize_dataset.py \
    --input data/llm_training/raw \
    --output data/llm_training/tokenized \
    --tokenizer microsoft/DialoGPT-medium

Memory optimization

Gradient accumulation

training:
  batch_size: 2
  gradient_accumulation_steps: 8  # effective batch size = 2 x 8 = 16
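The effective batch size follows from the fact that summing scaled micro-batch gradients reproduces the gradient of one large batch. The toy example below checks this for a one-parameter linear model in plain Python; the model and data are made up for illustration.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, batch_size, accumulation_steps):
    """Sum micro-batch gradients, each scaled by 1/accumulation_steps, before one update."""
    grad = 0.0
    for k in range(accumulation_steps):
        micro = data[k * batch_size:(k + 1) * batch_size]
        grad += mse_grad(w, micro) / accumulation_steps
    return grad

data = [(float(x), 3.0 * x) for x in range(1, 17)]  # 16 samples of y = 3x
batch_size, accumulation_steps = 2, 8               # values from the config above
effective_batch = batch_size * accumulation_steps

full = mse_grad(0.0, data)                          # one batch of 16
accum = accumulated_grad(0.0, data, batch_size, accumulation_steps)
print(effective_batch, abs(full - accum) < 1e-9)
```

Only one micro-batch of 2 samples is held in memory at a time, yet the optimizer step sees the same gradient as a batch of 16.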

Memory-efficient settings

# PyTorch memory optimization
export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

# CPU memory optimization (limits OpenMP and MKL threads)
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4

Storage optimization

Model compression

# Quantization
python scripts/quantize_model.py \
    --model saved_models/lang_evospike_lm_v1 \
    --quantization 8bit

# Distillation
python scripts/distill_model.py \
    --teacher saved_models/large_model \
    --student saved_models/small_model

Checkpoint management

training:
  save_steps: 1000
  save_total_limit: 3  # keep only the 3 most recent checkpoints
  save_strategy: steps
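The retention rule behind `save_total_limit` can be sketched as follows: order checkpoints by step number and delete everything except the newest few. This is an illustration of the rule, not the trainer's own implementation, and the directory names are borrowed from the checkpoint layout shown earlier in the guide.

```python
import re

_STEP = re.compile(r"checkpoint-(\d+)$")

def checkpoints_to_delete(names, save_total_limit=3):
    """Return the checkpoint names to remove so only the newest save_total_limit remain."""
    numbered = []
    for name in names:
        match = _STEP.search(name.rstrip("/"))
        if match:                       # ignore entries that are not checkpoints
            numbered.append((int(match.group(1)), name))
    numbered.sort()                     # oldest step first (numeric, not lexicographic)
    excess = max(0, len(numbered) - save_total_limit)
    return [name for _, name in numbered[:excess]]

names = [
    "evospike-langtext-r00-v001-checkpoint-500/",
    "evospike-langtext-r00-v001-checkpoint-1000/",
    "evospike-langtext-r00-v001-checkpoint-1500/",
    "evospike-langtext-r00-v001-checkpoint-2000/",
]
print(checkpoints_to_delete(names))
```

Sorting by the parsed integer matters: a plain string sort would place `checkpoint-1000` before `checkpoint-500` and delete the wrong directory.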

Appendix

Example configuration files

training_config.yaml

data_dir: "data/llm_training"
output_dir: "saved_models"

langtext:
  model_name: "microsoft/DialoGPT-medium"
  max_length: 512
  lora_config:
    r: 16
    lora_alpha: 32
    target_modules: ["q_proj", "v_proj"]  # must match the base model's module names (GPT-2-style models such as DialoGPT use "c_attn")
    lora_dropout: 0.05

training:
  epochs: 10
  batch_size: 8
  learning_rate: 2e-5
  save_steps: 500
  logging_steps: 100
  fp16: true

gpu:
  use_gpu: true
  gpu_memory_fraction: 0.9

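Japanese training setup

Overview

Large-scale Japanese training consists of the following components:

Data settings (config/data_config.yaml)
Training settings (config/training_config.yaml)
Quick-start script (scripts/start_japanese_training.sh)

Data settings

Large-scale training settings in config/data_config.yaml: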

# Text data for language understanding
langtext_datasets_ja:
  output_dir: "data/llm_training/LangText"
  output_file: "langtext_ja_data.jsonl"
  # 21 Japanese datasets (about 3,080,000 samples)

# Landmark image data
vision_datasets:
  output_dir: "data/llm_training/Vision"
  output_file: "vision_data.jsonl"
  # Google Landmarks, Europe Landmarks (about 100,000 samples)

# Japanese and English speech recognition data
audio_datasets:
  output_dir: "data/llm_training/Audio"
  output_file: "audio_data.jsonl"
  # LibriSpeech, VoxPopuli, ReazonSpeech, Common Voice (about 565,000 samples)

# Integrated multimodal data
multimodal_datasets:
  output_dir: "data/llm_training/MultiModal"
  output_file: "multimodal_data.jsonl"
  # Combined image + text data

Training settings

Japanese model settings in config/training_config.yaml:

model:
  name: "rinna/japanese-gpt-1b"  # Japanese-specific model
  language: "ja"
  type: "causal-lm"

training:
  epochs: 10
  batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 2e-5
  warmup_steps: 1000
  max_seq_length: 2048

gpu:
  use_gpu: true
  gpu_memory_fraction: 0.95  # use 95% of GPU memory

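Quick-start script

What scripts/start_japanese_training.sh does:

Environment check: detects Python 3 and GPU/CPU automatically
Dependency installation: installs requirements.txt automatically
Data directory creation: creates the required directories automatically
Data collection: downloads all configured datasets
Training start: launches the container appropriate for GPU or CPU

How to run

Basic run

# Interactive mode (recommended)
./scripts/start_japanese_training.sh

# Or non-interactive mode
echo "y" | ./scripts/start_japanese_training.sh

Custom run

# Data collection only
python scripts/collect_llm_training_data.py --config config/data_config.yaml

# Training only (when the data already exists)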
docker-compose -f docker-compose.train.yml up -d llm-trainer-gpu

Monitoring

After training starts, check progress as follows:

# Training status
curl http://localhost:8000/training/status

# View logs
docker-compose -f docker-compose.train.yml logs -f llm-trainer-gpu

# Web UI
open http://localhost:8000/docs

Dataset details

Text datasets (3,080,000 samples)

Dataset	Samples	Content
izumi-lab/llm-japanese-dataset	1,000,000	General Japanese text
llm-book/japanese-wikipedia	500,000	Wikipedia articles
llm-book/japanese-news	300,000	News articles
llm-book/japanese-books	200,000	Book text
llm-book/japanese-papers	150,000	Paper abstracts
llm-book/japanese-code	100,000	Programming code
llm-book/japanese-qa	80,000	QA data
13 other datasets	750,000	Dialogue, reviews, and more

Image datasets (landmarks)

Dataset	Samples	Content
visheratin/google_landmarks_photos	50,000	Google Landmarks photo dataset
Qdrant/google-landmark-geo	30,000	Google Landmarks + geographic coordinates
SablikJan/europe-landmarks-classification	20,000	European landmark classification

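Audio datasets (Japanese and English speech recognition)

Dataset	Samples	Language	Content	ASR suitability
mozilla-foundation/common_voice_11_0 (ja)	60,000	Japanese	General speech recognition	⭐⭐⭐
mozilla-foundation/common_voice_11_0 (en)	120,000	English	General speech recognition	⭐⭐⭐
librispeech_asr (clean)	150,000	English	High-quality read speech	⭐⭐⭐⭐⭐
facebook/voxpopuli (en)	95,000	English	Parliamentary speech	⭐⭐⭐⭐
reazon-research/reazonspeech	60,000	Japanese	High-quality Japanese speech	⭐⭐⭐⭐
speech_commands (v0.02)	30,000	Multilingual	Voice commands	⭐⭐

ASR suitability ratings:

⭐⭐⭐⭐⭐ Very high: LibriSpeech (clean, high quality)
⭐⭐⭐⭐ High: VoxPopuli, ReazonSpeech (specialized data)
⭐⭐⭐ Standard: Common Voice (diverse pronunciation)
⭐⭐ Supplementary: Speech Commands (command recognition)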

Multimodal datasets

llm-book/japanese-image-text: 100,000 samples
Content: pairs of images and Japanese captions

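Performance optimization

GPU optimization

Memory usage: 95% (maximum utilization)
Mixed precision: FP16 enabled
Gradient accumulation: 8 steps
Batch size: 4 (adjust to the available GPU memory)

Distributed training

The 24-node distributed architecture provides:

Parallelism: data parallel + model parallel
Communication optimization: uses the Zenoh protocol
Fault tolerance: automatic recovery from node failures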

Troubleshooting

Common problems

Out of memory

# Reduce the batch size
sed -i 's/batch_size: 4/batch_size: 2/' config/training_config.yaml

Data download failures

# Skip individual datasets that fail
python scripts/collect_llm_training_data.py --skip-failed

GPU unavailable

# Switch to CPU mode
docker-compose -f docker-compose.train.yml up -d llm-trainer-cpu

Viewing logs

# All logs
docker-compose -f docker-compose.train.yml logs

# Real-time logs
docker-compose -f docker-compose.train.yml logs -f llm-trainer-gpu

# Error logs only
docker-compose -f docker-compose.train.yml logs 2>&1 | grep ERROR

Extended configuration

Adding a custom dataset

Add the new dataset to config/data_config.yaml:

datasets:
  - name: "your-custom-dataset"
    type: "text"
    samples: 100000
    custom_config:
      path: "path/to/your/data"
      format: "jsonl"
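Before adding an entry like the one above, it can help to check the entry and the JSONL file it points to. The sketch below is a hypothetical pre-flight check: the required keys are simply the ones used in the example entry, and the `{"text": ...}` record shape is an assumption, since the guide does not define the JSONL schema.

```python
import json

REQUIRED_KEYS = {"name", "type", "samples", "custom_config"}  # keys from the example entry

def validate_dataset_entry(entry):
    """Return a list of problems found in one `datasets` entry (empty list means OK)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(entry))]
    samples = entry.get("samples")
    if not isinstance(samples, int) or samples <= 0:
        problems.append("samples must be a positive integer")
    return problems

def count_jsonl_records(text):
    """Count records in JSONL text; raise ValueError on a line that is not a JSON object."""
    count = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)  # json.JSONDecodeError is a ValueError subclass
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        count += 1
    return count

entry = {
    "name": "your-custom-dataset",
    "type": "text",
    "samples": 100000,
    "custom_config": {"path": "path/to/your/data", "format": "jsonl"},
}
print(validate_dataset_entry(entry))
print(count_jsonl_records('{"text": "a"}\n\n{"text": "b"}\n'))
```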

Hyperparameter tuning

Edit config/training_config.yaml:

training:
  learning_rate: 5e-5  # adjusted learning rate
  epochs: 20           # more epochs
  max_seq_length: 4096 # longer sequence length

See the comments in each configuration file for the full set of options.

Environment variables

Variable	Description	Default
`CUDA_VISIBLE_DEVICES`	GPU devices to use	Auto-detected
`OMP_NUM_THREADS`	Number of OpenMP threads	Number of CPU cores
`PYTORCH_CUDA_ALLOC_CONF`	CUDA memory allocator settings	-
`NCCL_DEBUG`	NCCL debug level	-
`WANDB_API_KEY`	Weights & Biases API key	-

API endpoints

Endpoint	Method	Description
`/training/start`	POST	Start training
`/training/stop`	POST	Stop training
`/training/status`	GET	Get training status
`/metrics`	GET	Get metrics
`/logs`	GET	Get logs

This guide covers how to use the EvoSpikeNet large-scale training system. For detailed settings and customization, see the comments in each configuration file.

Advanced Distributed Training System

EvoSpikeNet now includes a comprehensive distributed training system that supports large-scale training across 100+ nodes with advanced fault tolerance, scalability testing, and resource management capabilities.

Core Components

1. DistributedTrainingCoordinator

Coordinates distributed training across multiple nodes with advanced synchronization and communication protocols.

<!-- TODO: update or remove - import failed: from evospikenet.distributed_training import DistributedTrainingCoordinator -->

# Initialize coordinator for multi-node training
coordinator = DistributedTrainingCoordinator(
    world_size=24,  # Total number of nodes
    rank=0,  # Current node rank
    master_addr='192.168.1.100',
    master_port=12345,
    backend='nccl'  # or 'gloo' for CPU-only
)

# Setup distributed training
coordinator.setup_distributed_training(
    model=model,
    optimizer=optimizer,
    scheduler=scheduler
)

# Coordinate training loop
for epoch in range(num_epochs):
    coordinator.start_epoch(epoch)

    for batch in dataloader:
        # Synchronize gradients across nodes
        loss = coordinator.train_step(batch)

        # Adaptive batch size adjustment
        coordinator.adapt_batch_size_if_needed(loss.item())

    coordinator.end_epoch(epoch)

Key Features:

Multi-node coordination
Gradient synchronization
Adaptive batch sizing
Training state management

2. FaultToleranceManager

Provides comprehensive fault tolerance for distributed training with automatic recovery and checkpoint management.

<!-- Required dependency: module 'GPUtil' not found. Consider running 'pip install GPUtil' in the execution environment. -->

fault_manager = FaultToleranceManager(
    checkpoint_interval=100,
    max_retries=3,
    recovery_strategy='checkpoint_resume'
)

# Setup fault tolerance
fault_manager.setup_fault_tolerance(
    model=model,
    optimizer=optimizer,
    training_state=training_state
)

# Training loop with fault tolerance
try:
    for step in range(max_steps):
        # Train step
        loss = train_step(batch)

        # Periodic checkpoint
        if step % 100 == 0:
            fault_manager.save_checkpoint(step, loss.item())

        # Check for node failures
        if fault_manager.detect_node_failure():
            fault_manager.initiate_recovery()

except Exception as e:
    # Automatic recovery on failure
    recovered_state = fault_manager.recover_from_failure(e)
    resume_training_from_state(recovered_state)

Key Features:

Automatic failure detection
Checkpoint-based recovery
Node failure handling
Training state preservation

3. ScalabilityTester

Tests and validates scalability of distributed training across different cluster configurations.

5. TrainingStateManager

state_manager = TrainingStateManager(
    # earlier arguments are truncated in the source; sync_interval is restored
    # from state_management.sync_interval under Configuration Options
    sync_interval=10,  # seconds
    consistency_level='strong'
)

# Initialize training state
state_manager.initialize_training_state(
    initial_epoch=0,
    initial_step=0,
    model_config=model_config,
    optimizer_config=optimizer_config
)

# Synchronize state across nodes
state_manager.sync_training_state(
    current_state={
        'epoch': current_epoch,
        'step': current_step,
        'loss': current_loss,
        'metrics': current_metrics
    }
)

# Retrieve synchronized state
global_state = state_manager.get_global_training_state()
print(f"Global epoch: {global_state['epoch']}")
print(f"Global best loss: {global_state['best_loss']}")

Key Features:

Distributed state synchronization
Persistent state storage
Consistency guarantees
State recovery

6. GradientSynchronizer

Advanced gradient synchronization with communication optimization and compression.

<!-- from evospikenet.distributed_training import GradientSynchronizer -->

gradient_sync = GradientSynchronizer(
    world_size=24,
    # a compression argument is truncated in the source here
    backend='nccl',
    overlap_computation=True
)

# Setup gradient synchronization
gradient_sync.setup_synchronization(
    model=model,
    optimizer=optimizer
)

# Training step with optimized gradient sync
for batch in dataloader:
    # Forward pass
    outputs = model(batch['inputs'])
    loss = criterion(outputs, batch['targets'])

    # Backward pass
    loss.backward()

    # Synchronize gradients with optimization
    gradient_sync.synchronize_gradients(
        compression_ratio=0.1,  # 10% of original size
        overlap_with_computation=True
    )

    # Optimizer step
    optimizer.step()

Key Features:

Gradient compression
Communication overlap
Bandwidth optimization
Synchronization efficiency
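To give a feel for what a `compression_ratio` of 0.1 means, the sketch below applies top-k sparsification, one common gradient compression technique: only the largest-magnitude 10% of entries are sent, together with their indices. This is a generic illustration and an assumption about the idea, not a description of how `GradientSynchronizer` compresses gradients.

```python
def topk_sparsify(grad, compression_ratio=0.1):
    """Keep the largest-magnitude entries; return (indices, values) to transmit."""
    k = max(1, int(len(grad) * compression_ratio))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = sorted(order[:k])
    return kept, [grad[i] for i in kept]

def densify(indices, values, size):
    """Rebuild a dense gradient on the receiver, filling untransmitted positions with zeros."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

grad = [0.01, -0.9, 0.02, 0.0, 0.5, -0.03, 0.04, 0.0, -0.05, 0.06]
idx, vals = topk_sparsify(grad, compression_ratio=0.1)
print(idx, vals)                      # one of ten entries is kept
print(densify(idx, vals, len(grad)))
```

In practice the dropped entries are usually accumulated locally and added to the next step's gradient so that small updates are delayed rather than lost; that residual bookkeeping is omitted here.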

7. NodeHealthMonitor

Monitors health and performance of distributed nodes with proactive issue detection.

health_monitor = NodeHealthMonitor(
    monitoring_interval=30,  # seconds
    alert_thresholds={
        # entries from 'cpu_...' up to 'network_latency' are truncated in the source
        'network_latency': 1000  # ms
    }
)

# Start health monitoring
health_monitor.start_monitoring(
    node_ids=range(24),
    monitoring_metrics=['cpu', 'memory', 'gpu', 'network', 'disk']
)

# Get health status
health_status = health_monitor.get_cluster_health()
for node_id, status in health_status.items():
    if status['overall'] != 'healthy':
        print(f"Node {node_id} issues: {status['issues']}")

# Proactive issue detection
issues = health_monitor.detect_potential_issues()
for issue in issues:
    print(f"Potential issue: {issue['type']} on node {issue['node_id']}")

Key Features:

Real-time health monitoring
Proactive issue detection
Alert system
Performance tracking

8. DistributedTrainingManager

Integrated manager that coordinates all distributed training components.

<!-- from evospikenet.distributed_training import DistributedTrainingManager -->

# Initialize distributed training manager
training_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 24,
        # remaining cluster_config entries are truncated in the source
    },
    fault_tolerance_enabled=True,
    scalability_testing_enabled=True
)

# Setup complete distributed training
training_manager.setup_distributed_training(
    model=model,
    optimizer=optimizer,
    dataset=dataset,
    training_config={
        'batch_size': 64,
        'max_epochs': 100,
        'checkpoint_interval': 500,
        'scalability_test_interval': 1000
    }
)

# Run distributed training with all features
results = training_manager.run_distributed_training()

# Get comprehensive training report
report = training_manager.generate_training_report()
print(f"Training completed in {report['total_time']}")
print(f"Final loss: {report['final_loss']}")
print(f"Scalability achieved: {report['scalability_efficiency']:.2%}")

Key Features:

Unified distributed training interface
Automatic component coordination
Comprehensive monitoring and reporting
Production-ready deployment

Integration Examples

Large-Scale Training Setup

import time

# Configure for 100+ node training
training_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 128,
        'backend': 'nccl',
        # entries from 'fault_to...' onward are truncated in the source
    }
)

# Training with advanced features
training_manager.setup_distributed_training(
    model=large_model,
    dataset=huge_dataset,
    training_config={
        'initial_batch_size': 32,
        'adaptive_batching': True,
        'gradient_compression': 'quantization',
        'checkpoint_strategy': 'incremental'
    }
)

# Monitor training progress
while training_manager.is_training_active():
    status = training_manager.get_training_status()
    print(f"Epoch {status['epoch']}, Loss: {status['loss']:.4f}")
    print(f"Nodes active: {status['active_nodes']}/{status['total_nodes']}")

    time.sleep(60)  # Check every minute

Fault-Tolerant Training

# Configure for high-reliability training
fault_tolerant_manager = DistributedTrainingManager(
    cluster_config={
        'world_size': 64,
        'fault_tolerance_level': 'high',
        'auto_recovery': True,
        'checkpoint_frequency': 'high'
    }
)

# Training with automatic fault recovery
try:
    results = fault_tolerant_manager.run_distributed_training()
except Exception as e:
    print(f"Training interrupted: {e}")
    # Manager automatically handles recovery
    recovery_status = fault_tolerant_manager.get_recovery_status()
    print(f"Recovery progress: {recovery_status['progress']:.1%}")

Configuration Options

distributed_training:
  coordinator:
    world_size: 24
    backend: nccl
    master_addr: "192.168.1.100"
    master_port: 12345
    timeout: 600

  fault_tolerance:
    enabled: true
    checkpoint_interval: 100
    max_retries: 3
    recovery_strategy: checkpoint_resume
    auto_recovery: true

  scalability_testing:
    enabled: true
    test_interval: 1000
    min_nodes: 8
    max_nodes: 128
    test_duration_minutes: 30

  resource_management:
    dynamic_allocation: true
    load_balancing: true
    memory_optimization: true
    gpu_scheduling: fair_share

  gradient_synchronization:
    compression_type: quantization
    compression_ratio: 0.1
    overlap_computation: true
    bandwidth_optimization: true

  health_monitoring:
    enabled: true
    monitoring_interval: 30
    alert_thresholds:
      cpu_usage: 0.95
      memory_usage: 0.90
      gpu_memory: 0.95
      network_latency: 1000

  state_management:
    persistence_backend: redis
    sync_interval: 10
    consistency_level: strong
    state_compression: true
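As an illustration of how the `health_monitoring.alert_thresholds` values might be applied, the sketch below compares one sampled set of node metrics against them. The function name, the sample values, and the strictly-greater-than comparison are assumptions for this example; the guide does not specify how the monitor evaluates thresholds.

```python
# Threshold values copied from the alert_thresholds block above.
ALERT_THRESHOLDS = {
    "cpu_usage": 0.95,
    "memory_usage": 0.90,
    "gpu_memory": 0.95,
    "network_latency": 1000,  # ms
}

def find_alerts(sample, thresholds=ALERT_THRESHOLDS):
    """Return the names of metrics whose sampled value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0) > limit   # assumed rule: alert only when strictly above
    )

# Hypothetical reading from one node.
sample = {"cpu_usage": 0.97, "memory_usage": 0.50, "gpu_memory": 0.95, "network_latency": 1500}
print(find_alerts(sample))
```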

Best Practices

Cluster Setup: Ensure proper network configuration and firewall settings
Resource Allocation: Monitor resource usage and adjust allocation policies
Fault Tolerance: Enable comprehensive fault tolerance for production training
Scalability Testing: Regularly test scalability with different configurations
Monitoring: Implement comprehensive monitoring and alerting
Checkpointing: Use frequent checkpointing for long-running training jobs
Network Optimization: Optimize network settings for gradient synchronization

Troubleshooting

Common Issues:

Communication timeouts: Increase timeout values or check network connectivity
Memory issues: Enable gradient compression or reduce batch sizes
Node failures: Ensure fault tolerance is properly configured
Performance degradation: Run scalability tests to identify bottlenecks

Debug Mode:

training_manager.enable_debug_mode()
training_manager.log_detailed_metrics()
training_manager.enable_performance_profiling()

Performance Optimization

Gradient Compression: Use quantization or sparsification to reduce communication overhead
Communication Overlap: Enable computation-communication overlap for better utilization
Adaptive Batching: Allow dynamic batch size adjustment based on performance
Resource Balancing: Regularly rebalance resources across nodes
Network Tuning: Optimize network settings for your cluster topology