
Model candidates and installation steps for video/audio analysis

> [!NOTE]
> For the latest implementation status, refer to Functional Implementation Status (Remaining Functionality).

Purpose: summarize candidate models for production use, conversion/optimization paths, rollout procedures, and operational cautions, focusing on MoveNet-based keypoint extraction and Whisper-based ASR.

1) Keypoint extraction (pose) candidates

  • MoveNet (TF / TFLite / ONNX)
  • Variants: Lightning, Thunder
  • Pros: fast, lightweight, runs on GPU/CPU, good accuracy/latency trade-off
  • Cons: inferior to HRNet etc. in complex poses and crowded scenes
  • Recommended backend: ONNX Runtime (GPU/CPU); TensorRT for GPU acceleration
  • Conversion: TF -> ONNX (tf2onnx) -> ORT/TensorRT

  • MediaPipe Pose
  • Pros: fast, ships with a ready-made pipeline (preprocessing + inference)
  • Cons: limited customizability; license/binary dependencies
  • Recommended: edge/mobile deployment (TFLite)

  • HRNet / higher-accuracy models
  • Pros: high accuracy (for research/evaluation)
  • Cons: heavy; requires a GPU for real-time use
  • Recommended: offline/batch processing, accuracy benchmarks

Shortest introduction procedure (pose)

  1. Obtain/convert MoveNet Lightning to ONNX (using tf2onnx).
  2. Implement a simple get_pose_backend(name) wrapper on ONNX Runtime (a minimal sketch follows this list).
  3. Fix the inference batch/streaming input/output specification behind a pose interface.
  4. Add an accuracy/latency bench to CI (sample: 1000 consecutive 1080p frames).
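A minimal sketch of step 2, assuming a MoveNet Lightning export at movenet.onnx with the usual [1, 192, 192, 3] int32 input and [1, 1, 17, 3] output; verify the shapes of your actual export before relying on this:

```python
# Minimal get_pose_backend() sketch on ONNX Runtime. The model path,
# input size (192x192 int32), and output layout [1, 1, 17, 3] are
# assumptions based on the usual MoveNet Lightning export; verify them
# against your converted model.
import cv2  # pip install opencv-python
import numpy as np
import onnxruntime as ort

class MoveNetOnnxBackend:
    def __init__(self, model_path: str, providers=None):
        # Falls back to CPU automatically when CUDA is unavailable.
        providers = providers or ["CUDAExecutionProvider", "CPUExecutionProvider"]
        self.session = ort.InferenceSession(model_path, providers=providers)
        self.input_name = self.session.get_inputs()[0].name

    def infer(self, frame: np.ndarray) -> np.ndarray:
        # frame: HxWx3 uint8 image; returns 17 keypoints as (y, x, score).
        resized = cv2.resize(frame, (192, 192)).astype(np.int32)
        (keypoints,) = self.session.run(None, {self.input_name: resized[None, ...]})
        return keypoints[0, 0]

def get_pose_backend(name: str = "movenet_onnx"):
    if name == "movenet_onnx":
        return MoveNetOnnxBackend("movenet.onnx")
    raise ValueError(f"unknown pose backend: {name}")
```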

2) ASR candidates

  • OpenAI Whisper series
  • Variants: tiny, base, small, medium, large
  • Pros: multilingual support, reproducibility, consistent accuracy
  • Cons: medium/large are heavy (GPU required); PyTorch dependency
  • Lightweight options: whisper.cpp (int8); ONNX conversion + quantization available
  • Recommended backends: ONNX Runtime (CPU/GPU), torch (GPU), whisper.cpp (embedded CPU use)

  • Wav2Vec2 (HuggingFace)
  • Pros: high accuracy; easy to fine-tune
  • Cons: requires preprocessing/decoder design

  • VOSK / Kaldi
  • Pros: suited to on-premises and low-latency embedded use
  • Cons: requires configuration and dictionary management

Fastest deployment steps (ASR)

  1. Choose a variant based on requirements (small <-> higher accuracy).
  2. Integrate Whisper small with torch, or consider whisper.cpp for lightweight use (a minimal sketch follows this list).
  3. Maintain the ONNX conversion path: PyTorch -> ONNX -> ORT. Apply INT8/FP16 quantization if necessary.
  4. To add speaker diarization, plug in additional modules such as pyannote.
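As a reference for step 2, a minimal sketch using the openai-whisper PyTorch package; the audio file name is illustrative:

```python
# Minimal Whisper small usage via the openai-whisper package
# (pip install openai-whisper). Weights download on first use;
# "sample.wav" is an illustrative file name.
import whisper

model = whisper.load_model("small")
result = model.transcribe("sample.wav")  # language is auto-detected
print(result["text"])
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f}s - {seg["end"]:7.2f}s  {seg["text"]}')
```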

3) Model conversion and optimization (common procedure)

  • Recommended tools: torch.onnx, tf2onnx, onnxruntime-tools, onnxruntime-transformers, optimum (HuggingFace), tensorrt/trtexec.
  • Basic pipeline:

  1. Obtain trained weights (confirm the license)
  2. Framework conversion (PyTorch/TF -> ONNX)
  3. Quick ONNX verification (shapes, dtypes); a sketch follows this list
  4. Quantization (FP16/INT8): FP16 first, then INT8 (check accuracy degradation with calibration data)
  5. Build a TensorRT engine (GPU; dynamic shape settings if necessary)
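For step 3, a quick sanity check with the onnx package; the file name is illustrative:

```python
# Quick ONNX sanity check (pipeline step 3): validate the graph and dump
# I/O names, shapes, and dtypes before quantizing. File name illustrative.
import onnx

model = onnx.load("movenet.onnx")
onnx.checker.check_model(model)  # raises on structural problems

for kind, tensors in (("input", model.graph.input), ("output", model.graph.output)):
    for t in tensors:
        dims = [d.dim_value or d.dim_param for d in t.type.tensor_type.shape.dim]
        # elem_type is the onnx.TensorProto dtype enum (e.g. 6 == INT32)
        print(kind, t.name, dims, t.type.tensor_type.elem_type)
```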

4) Inference service design (operations)

  • Model repository: keep models/video/pose/ and models/audio/asr/ separate and under version control. Record the SHA/date in a manifest.
  • Containers: python:3.10-slim as the base; an NVIDIA CUDA-compatible image for GPU. Prepare an image with ONNX Runtime / TensorRT installed.
  • Model switching: enable runtime switching via the MODEL_SELECTION environment variable or an API (a sketch follows this list).
  • Monitoring: export per-model latency, success_rate, and memory_usage to Prometheus.
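A minimal sketch of the manifest recording and MODEL_SELECTION switching described above; the registry keys, paths, and manifest layout are illustrative, not the repository's actual conventions:

```python
# Manifest entry (SHA/date) plus runtime model switching via the
# MODEL_SELECTION environment variable. Keys and paths are illustrative.
import hashlib
import json
import os
from datetime import date

def write_manifest_entry(model_path: str, manifest_path: str = "manifest.json"):
    # Records the artifact hash and date; overwrites the manifest file.
    sha = hashlib.sha256(open(model_path, "rb").read()).hexdigest()
    entry = {"path": model_path, "sha256": sha, "date": date.today().isoformat()}
    with open(manifest_path, "w") as f:
        json.dump(entry, f, indent=2)

MODEL_REGISTRY = {
    "pose.movenet_lightning": "models/video/pose/movenet_lightning.onnx",
    "asr.whisper_small": "models/audio/asr/whisper_small.onnx",
}

def resolve_model(default: str = "pose.movenet_lightning") -> str:
    # MODEL_SELECTION overrides the default at runtime.
    key = os.environ.get("MODEL_SELECTION", default)
    if key not in MODEL_REGISTRY:
        raise ValueError(f"MODEL_SELECTION={key!r} is not registered")
    return MODEL_REGISTRY[key]
```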

5) Implementation checklist (practical)

  • License confirmation (Whisper: MIT? / underlying models carry their own licenses)
  • Accuracy/latency bench with sample data
  • Performance and accuracy-difference report after quantization
  • Separation of CI (CPU) and nightly GPU benches
  • Security: signing/verification of model artifacts

6) Specific command examples

  • TF -> ONNX (MoveNet example):

```bash
pip install tf2onnx onnx onnxruntime
python -m tf2onnx.convert --saved-model movenet_savedmodel --output movenet.onnx --opset 14
```

  • PyTorch Whisper -> ONNX (simple):

```bash
pip install torch onnx onnxruntime
python export_whisper_to_onnx.py --model small --out whisper_small.onnx
```

  • Convert ONNX to FP16 (example):

```bash
python -m onnxruntime.tools.convert_to_ort --model whisper_small.onnx --output whisper_small_fp16.onnx --target_fp16
```
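The commands above cover FP16; for INT8, a minimal sketch using onnxruntime's dynamic quantization (weights-only, so no calibration dataset is needed; file names are illustrative):

```python
# Dynamic INT8 quantization with onnxruntime (weights only; no
# calibration data required). File names are illustrative.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="whisper_small.onnx",
    model_output="whisper_small_int8.onnx",
    weight_type=QuantType.QInt8,
)
```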

7) Recommended initial configuration (production launch)

  • pose: MoveNet Lightning -> ONNX Runtime (GPU preferred, FP16)
  • asr: Whisper small -> ONNX Runtime GPU (switch to medium when higher accuracy is requested)
  • Quantization: adopt INT8 in production only after careful evaluation with calibration data

8) CI / tests

  • Always run small CPU-based smoke tests (whisper.cpp / ORT CPU); a pytest sketch follows this list
  • Nightly GPU bench (accuracy + latency)
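A sketch of such a CPU smoke test in pytest, assuming the MoveNet artifact and shapes from section 6; it skips when the model file is absent:

```python
# CPU-only smoke test for CI. Assumes the MoveNet ONNX export from
# section 6 (int32 input, [1, 1, 17, 3] output); skipped if missing.
import os

import numpy as np
import onnxruntime as ort
import pytest

MODEL = "movenet.onnx"

@pytest.mark.skipif(not os.path.exists(MODEL), reason="model artifact missing")
def test_movenet_cpu_smoke():
    sess = ort.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    # Replace symbolic/dynamic dims with 1 to build the dummy tensor.
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dummy = np.zeros(shape, dtype=np.int32)
    (keypoints,) = sess.run(None, {inp.name: dummy})
    assert keypoints.shape[-2:] == (17, 3)  # 17 keypoints, (y, x, score)
```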

9) Additional materials and references

  • ONNX Runtime quantization docs
  • TensorRT best practices
  • whisper.cpp repository (CPU-optimized)

Summary: a realistic path is to introduce MoveNet (ONNX, FP16) and Whisper small (ONNX, FP16) first, and then gradually add medium/large once the CI bench and quantization pipeline are in place.


Implemented fallback modules (added 2026-04-24)

The following seven modules have been implemented in evospikenet/video_analysis/ as fallbacks until the production models are introduced. Switching to a production model is done via the backend argument or a configuration file.

| Module | Implemented backend | Production model hook | Switch key |
| --- | --- | --- | --- |
| depth_estimation.py | Brightness-gradient pseudo-depth | torch.hub MiDaS-small | backend="midas_real" |
| vad.py | Energy-threshold VAD | pyannote.audio Pipeline.from_pretrained() | SpeakerDiarizer.use_pyannote=True |
| shot_boundary_detector.py | Histogram / pixel difference | PySceneDetect / own CNN | method="combined" |
| narrative_generator.py | Template sentences | EvoSpikeNet LM (NeuralLanguageAdapter) | use_lm=True |
| temporal_action_localizer.py | Position-difference sliding window | ST-GCN / SlowFast | backend="stgcn_real" |
| spatial_relations.py | BBox coordinates + depth map | Same as the implemented backend (additional algorithms planned) | |
| event_schema.py | Pure dataclass (no dependencies) | | |

Production migration checklist

  1. depth_estimation.py: pip install timm torch, then test with DepthEstimator(backend="midas_real")
  2. vad.py: pip install pyannote.audio, then SpeakerDiarizer(model_id="pyannote/speaker-diarization-3.1")
  3. narrative_generator.py: after starting the EvoSpikeNet LM service, verify operation with NarrativeGenerator(use_lm=True)
  4. temporal_action_localizer.py: register the ST-GCN ONNX model with STGCNRealActionBackend in backends.py
  5. E2E: measure latency with pipeline.run(frames, audio, fps=25, enable_extended=True) (a timing sketch follows this list)
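For checklist item 5, a minimal timing harness; only the pipeline.run() call is taken from the checklist, and how pipeline, frames, and audio are constructed is left to the project documentation:

```python
# Wall-clock latency around pipeline.run() (checklist item 5). Only the
# run() signature comes from the checklist; `pipeline`, `frames`, and
# `audio` must be constructed per the project documentation.
import time

def measure_latency(pipeline, frames, audio, runs: int = 5) -> float:
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pipeline.run(frames, audio, fps=25, enable_extended=True)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)  # average seconds per E2E run
```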

Update date: 2026-04-24