LLM Training System

[!NOTE] For the latest implementation status, see the feature implementation status page (Remaining Functionality).

This system provides automated training for several types of Large Language Models (LLMs), with a REST API for job submission and monitoring and Docker-based deployment.

Features

  • Multi-Modal Training: Support for LangText, Vision, Audio, and MultiModal models
  • API-Driven: RESTful API for job submission and monitoring
  • GPU/CPU Support: Optimized for both GPU and CPU environments
  • Docker Integration: Easy deployment with containerization
  • Load Balancing: Nginx-based load balancer for multiple instances

Quick Start

Prerequisites

  • Docker and Docker Compose
  • NVIDIA Docker (for GPU training)
  • At least 16GB RAM recommended

1. Data Collection

First, collect training data using the data collection script:

python scripts/collect_llm_training_data.py --config config/data_config.yaml

This will create the following directory structure:

data/llm_training/
├── LangText/
├── Vision/
├── Audio/
└── MultiModal/
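
The exact record schema is defined by the collection script and may differ from what is shown here; as a purely hypothetical illustration, a LangText record could be one JSON object per line, written and read like this (field names are assumptions):

import json

# Hypothetical JSONL record layout; the real schema comes from
# scripts/collect_llm_training_data.py and may differ.
record = {"prompt": "What is a spiking neuron?", "response": "A neuron that communicates via discrete firing events."}

path = "data/llm_training/LangText/langtext_data.jsonl"
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as f:
    for line in f:
        print(json.loads(line))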

2. Start Training Servers

GPU Training

./scripts/train_launcher.sh gpu

CPU Training

./scripts/train_launcher.sh cpu

Both GPU and CPU with Load Balancer

./scripts/train_launcher.sh all

3. Access API

  • GPU Server: http://localhost:8000
  • CPU Server: http://localhost:8001
  • Load Balancer: http://localhost:8080
  • API Documentation: http://localhost:8000/docs (FastAPI)

API Usage

Submit Training Job

curl -X POST "http://localhost:8000/train" \
  -H "Content-Type: application/json" \
  -d '{
    "category": "LangText",
    "model_name": "microsoft/DialoGPT-medium",
    "dataset_path": "data/llm_training/LangText/langtext_data.jsonl",
    "output_dir": "saved_models/LangText",
    "gpu": true,
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 0.00002
  }'
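
The same job can be submitted from Python; a minimal sketch using the requests library, mirroring the curl payload above (the shape of the response body is an assumption):

import requests

payload = {
    "category": "LangText",
    "model_name": "microsoft/DialoGPT-medium",
    "dataset_path": "data/llm_training/LangText/langtext_data.jsonl",
    "output_dir": "saved_models/LangText",
    "gpu": True,
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 2e-5,
}

resp = requests.post("http://localhost:8000/train", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # assumed to include a job id for later status checks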

Check Job Status

curl http://localhost:8000/jobs
curl http://localhost:8000/jobs/{job_id}
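
A simple polling loop can watch a job until it finishes. This sketch assumes the job record exposes a status field with values like "queued" and "running"; the actual field names and values depend on the API's response schema:

import time
import requests

def wait_for_job(job_id, base_url="http://localhost:8000", interval=10):
    # Poll /jobs/{job_id} until the (assumed) status field settles.
    while True:
        job = requests.get(f"{base_url}/jobs/{job_id}", timeout=30).json()
        status = job.get("status")  # field name is an assumption
        print(f"job {job_id}: {status}")
        if status not in ("queued", "running"):
            return job
        time.sleep(interval)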

Supported Categories

LangText

  • Models: GPT, BERT, DialoGPT
  • Features: Text generation, language understanding
  • Use Cases: Chatbots, text completion

Vision

  • Models: ViT, ResNet, CLIP
  • Features: Image classification, object detection
  • Use Cases: Image recognition, visual QA

Audio

  • Models: Whisper, Wav2Vec2
  • Features: Speech recognition, audio classification
  • Use Cases: Transcription, voice commands

MultiModal

  • Models: CLIP, LLaVA
  • Features: Text-image understanding
  • Use Cases: Visual question answering, image captioning

Configuration

Training Configuration (config/training_config.yaml)

# Model configurations for each category
langtext:
  model_name: "microsoft/DialoGPT-medium"
  max_length: 512

vision:
  model_name: "google/vit-base-patch16-224"
  num_labels: 1000

audio:
  model_name: "openai/whisper-small"
  language: "en"

multimodal:
  model_name: "openai/clip-vit-base-patch32"

# Training parameters
training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-5
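
Reading this file from Python is straightforward with PyYAML; a minimal sketch, with one caveat worth knowing about the learning rate literal:

import yaml  # pip install pyyaml

with open("config/training_config.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["langtext"]["model_name"])  # microsoft/DialoGPT-medium
# Caveat: PyYAML follows YAML 1.1, which parses the bare literal 2e-5 as a
# string rather than a float, so cast defensively before training.
lr = float(cfg["training"]["learning_rate"])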

Docker Commands

Build and Run Manually

# GPU (requires the NVIDIA Container Toolkit; pass --gpus so the container sees the GPUs)
docker build -f Dockerfile.train.gpu -t llm-trainer-gpu .
docker run --gpus all -p 8000:8000 -v $(pwd)/data:/app/data llm-trainer-gpu

# CPU
docker build -f Dockerfile.train.cpu -t llm-trainer-cpu .
docker run -p 8001:8000 -v $(pwd)/data:/app/data llm-trainer-cpu

Using Docker Compose

# Start all services
docker-compose -f docker-compose.train.yml up -d

# View logs
docker-compose -f docker-compose.train.yml logs -f

# Stop services
docker-compose -f docker-compose.train.yml down

Monitoring and Troubleshooting

Check Server Status

./scripts/train_launcher.sh status

View Logs

./scripts/train_launcher.sh logs gpu
./scripts/train_launcher.sh logs cpu

Test API

./scripts/train_launcher.sh test

Common Issues

  1. CUDA out of memory: Reduce batch size in training config
  2. Port conflicts: Change the published ports in docker-compose.train.yml
  3. Model download fails: Check internet connection and Hugging Face access
  4. NVIDIA Docker not available: Use CPU mode or install NVIDIA Docker

Advanced Usage

Custom Model Training

Modify config/training_config.yaml to use different models:

langtext:
  model_name: "meta-llama/Llama-2-7b-hf"  # Requires access token

Scaling

For production deployment, consider:

  • Kubernetes for orchestration
  • Model versioning with MLflow
  • Distributed training with Accelerate
  • GPU resource management

Web UI Integration

The LLM Training System is fully integrated with the EvoSpikeNet Web UI, providing user-friendly interfaces for training different types of models.

Vision Encoder Training UI

Location: frontend/pages/vision_encoder.py

Features:

  • API Training Tab: Dedicated tab for LLM training via REST API
  • Category Selection: Dropdown to select model category (LangText, Vision, Audio, MultiModal)
  • Dynamic Type/Category Display: Shows selected model type and category in real-time
  • Training Parameters: Configurable epochs, batch size, learning rate, run name
  • GPU Support: Checkbox to enable/disable GPU training
  • Training Status: Real-time status updates and output display

UI Components:

- vision-category-dropdown: Category selection (LangText/Vision/Audio/MultiModal)
- vision-selected-type-category: Dynamic display of model type and category
- vision-new-epochs-input: Number of training epochs
- vision-new-batch-size-input: Training batch size
- vision-new-learning-rate-input: Learning rate
- vision-new-run-name-input: Custom run name
- vision-new-gpu-checkbox: GPU training toggle
- vision-new-training-output: Training output display
- vision-new-training-status: Status messages

Audio Encoder Training UI

Location: frontend/pages/audio_encoder.py

Features:

  • API Training Tab: Dedicated tab for LLM training via REST API
  • Category Selection: Dropdown to select model category (LangText, Vision, Audio, MultiModal)
  • Dynamic Type/Category Display: Shows selected model type and category in real-time
  • Training Parameters: Configurable epochs, batch size, learning rate, run name
  • GPU Support: Checkbox to enable/disable GPU training
  • Training Status: Real-time status updates and output display

UI Components:

- audio-category-dropdown: Category selection (LangText/Vision/Audio/MultiModal)
- audio-selected-type-category: Dynamic display of model type and category
- audio-new-epochs-input: Number of training epochs
- audio-new-batch-size-input: Training batch size
- audio-new-learning-rate-input: Learning rate
- audio-new-run-name-input: Custom run name
- audio-new-gpu-checkbox: GPU training toggle
- audio-new-training-output: Training output display
- audio-new-training-status: Status messages

Integration with Distributed Brain

The UI components automatically map categories to the appropriate model types:

  • LangText → text model type
  • Vision → vision model type
  • Audio → audio model type
  • MultiModal → multimodal model type

This ensures that training jobs are submitted with the correct parameters for the distributed brain node configuration.
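
A minimal sketch of that mapping as plain Python (the dict and helper names here are illustrative, not the actual frontend implementation):

# Hypothetical helper; the real mapping lives in the frontend pages.
CATEGORY_TO_MODEL_TYPE = {
    "LangText": "text",
    "Vision": "vision",
    "Audio": "audio",
    "MultiModal": "multimodal",
}

def model_type_for(category: str) -> str:
    return CATEGORY_TO_MODEL_TYPE[category]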

API Endpoints Used

The UI integrates with the following training API endpoints:

  • POST /train: Submit training job
  • Training jobs are submitted with category-specific model configurations
  • Real-time status updates via callback functions
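
As a sketch of how a UI callback might call POST /train (the helper name and the run_name field are assumptions; compare the API example earlier):

import requests

def submit_training_job(category, run_name, epochs, batch_size, lr, gpu):
    # Hypothetical helper a UI callback could invoke on button click.
    payload = {
        "category": category,
        "run_name": run_name,  # assumed to be accepted by the API
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": lr,
        "gpu": gpu,
    }
    resp = requests.post("http://localhost:8000/train", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()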

License

This project is part of EvoSpikeNet and follows the same licensing terms.