# Model candidates and installation steps for video/audio analysis
> [!NOTE]
> For the latest implementation status, see Functional Implementation Status (Remaining Functionality).
Purpose: to summarize candidate models for production operation, their conversion/optimization paths, introduction procedures, and operational precautions, focusing on MoveNet-based keypoint extraction and Whisper-based ASR.
## 1) Keypoint extraction (pose) candidates

- MoveNet (TF / TFLite / ONNX)
  - Variants: Lightning, Thunder
  - Pros: fast, lightweight, runs on GPU or CPU, good accuracy/latency trade-off
  - Cons: inferior to HRNet and similar models in complex poses and crowded scenes
  - Recommended backend: ONNX Runtime (GPU/CPU); TensorRT for GPU acceleration
  - Conversion: TF -> ONNX (tf2onnx) -> ORT/TensorRT
- MediaPipe Pose
  - Pros: fast, ships with a pre-built pipeline (preprocessing + inference)
  - Cons: limited customizability; license/binary dependencies
  - Recommended for: edge/mobile deployment (TFLite)
- HRNet / higher-accuracy models
  - Pros: high accuracy (for research/evaluation)
  - Cons: heavy; requires a GPU for real-time requirements
  - Recommended for: offline/batch processing, accuracy benchmarks
### Shortest introduction procedure (pose)

1. Obtain/convert MoveNet Lightning to ONNX (using tf2onnx).
2. Implement a simple `get_pose_backend(name)` wrapper on ONNX Runtime.
3. Fix the inference batch/streaming input/output specification behind a pose interface.
4. Add an accuracy/latency benchmark to CI (sample: 1000 consecutive 1080p frames).
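The `get_pose_backend(name)` wrapper from step 2 can be sketched as a small registry. This is a minimal sketch, not the actual implementation: the class and function names are illustrative assumptions, and the ONNX Runtime import is deferred so the registry itself carries no extra dependencies.

```python
from typing import Callable, Dict, List

# Registry mapping backend names to factory functions.
_BACKENDS: Dict[str, Callable[[], "PoseBackend"]] = {}

class PoseBackend:
    """Fixed pose interface (step 3): frames in, keypoint lists out."""
    def infer(self, frames: List[bytes]) -> List[list]:
        raise NotImplementedError

def register_backend(name: str, factory: Callable[[], "PoseBackend"]) -> None:
    """Register a backend factory under a switchable name."""
    _BACKENDS[name] = factory

def get_pose_backend(name: str) -> PoseBackend:
    """Resolve a backend by name; raises KeyError for unknown names."""
    return _BACKENDS[name]()

class MoveNetOnnxBackend(PoseBackend):
    """MoveNet Lightning via ONNX Runtime (hypothetical class; the lazy
    import keeps the registry usable without onnxruntime installed)."""
    def __init__(self, model_path: str = "movenet.onnx"):
        import onnxruntime as ort  # assumed installed for real use
        self.session = ort.InferenceSession(model_path)

    def infer(self, frames):
        # A real implementation would preprocess frames and call
        # self.session.run(...); omitted in this sketch.
        raise NotImplementedError

# Factory is lazy, so registering does not touch onnxruntime yet.
register_backend("movenet_onnx", lambda: MoveNetOnnxBackend())
```

Swapping in a production model then amounts to registering another factory under a new name.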
## 2) ASR candidates

- OpenAI Whisper series
  - Variants: tiny, base, small, medium, large
  - Pros: multilingual support, reproducibility, consistent accuracy
  - Cons: medium/large are heavy (require a GPU); PyTorch dependency
  - Lightweight options: whisper.cpp (int8); ONNX conversion + quantization available
  - Recommended backends: ONNX Runtime (CPU/GPU), torch (GPU), whisper.cpp (embedded CPU)
- Wav2Vec2 (HuggingFace)
  - Pros: high accuracy, easy to fine-tune
  - Cons: requires preprocessing/decoder design
- VOSK / Kaldi
  - Pros: engine for on-premises and low-latency embedded applications
  - Cons: requires configuration and dictionary management
### Fastest deployment steps (ASR)

1. Choose a variant based on requirements (small <-> high accuracy).
2. Integrate Whisper small with torch, or consider whisper.cpp for lightweight use.
3. Maintain the ONNX conversion path: PyTorch -> ONNX -> ORT. Apply INT8/FP16 quantization if necessary.
4. For speaker diarization, plug in additional modules such as pyannote.
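Step 1 (variant selection) can be expressed as a simple decision rule. The thresholds and rules below are illustrative assumptions drawn from this document's recommendations, not official sizing guidance:

```python
def choose_whisper_variant(realtime: bool, gpu_available: bool,
                           high_accuracy: bool) -> str:
    """Illustrative decision rule for picking a Whisper variant from
    coarse deployment requirements (rules are assumptions)."""
    if high_accuracy and gpu_available:
        return "medium"  # large is also possible if latency allows
    if realtime and not gpu_available:
        return "tiny"    # or whisper.cpp int8 for embedded CPU
    return "small"       # the balanced default recommended here
```

In production such a rule would typically live behind the same switching mechanism described in section 4 (environment variable or API).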
## 3) Model conversion and optimization (common procedure)

- Recommended tools: torch.onnx, tf2onnx, onnxruntime-tools, onnxruntime-transformers, optimum (HuggingFace), tensorrt/trtexec.
- Basic pipeline:
  1. Obtain trained weights (confirm the license)
  2. Framework conversion (PyTorch/TF -> ONNX)
  3. Simple ONNX verification (shape, dtype)
  4. Quantization (FP16/INT8): FP16 first, then INT8 (confirm accuracy degradation with calibration data)
  5. Build a TensorRT engine (GPU; configure dynamic shapes if necessary)
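The accuracy-degradation check in step 4 can be sketched as a comparison of FP32 and quantized outputs on calibration samples. This is a minimal sketch: the tolerance value and function names are assumptions, and real checks would use task-level metrics (e.g. WER for ASR) rather than raw output drift alone.

```python
def max_abs_diff(baseline, quantized):
    """Worst-case element-wise difference between two output vectors."""
    return max(abs(a - b) for a, b in zip(baseline, quantized))

def quantization_acceptable(fp32_outputs, int8_outputs, tol=0.05):
    """Accept a quantized model only if, on every calibration sample,
    its outputs stay within `tol` of the FP32 reference (tol=0.05 is
    an assumed, task-specific value)."""
    return all(max_abs_diff(f, q) <= tol
               for f, q in zip(fp32_outputs, int8_outputs))
```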
## 4) Inference service design (operations)

- Model repository: separate models/video/pose/ and models/audio/asr/ and put them under version control. Record the SHA and date in a manifest.
- Container: python:3.10-slim as the base; an NVIDIA CUDA-compatible image for GPU. Prepare an image with ONNX Runtime / TensorRT installed.
- Model switching: enable runtime switching via the MODEL_SELECTION environment variable or the API.
- Monitoring: export per-model latency, success_rate, and memory_usage to Prometheus.
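Manifest recording (SHA/date) for the model repository can be done with the standard library alone. A minimal sketch; the manifest field names are illustrative assumptions:

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def write_manifest(model_path: str, manifest_path: str) -> dict:
    """Record the SHA-256 digest and date of a model artifact in a
    JSON manifest next to the model (field names are illustrative)."""
    digest = hashlib.sha256(Path(model_path).read_bytes()).hexdigest()
    entry = {
        "model": Path(model_path).name,
        "sha256": digest,
        "date": date.today().isoformat(),
    }
    Path(manifest_path).write_text(json.dumps(entry, indent=2))
    return entry
```

Verifying an artifact before loading is then a matter of recomputing the digest and comparing it against the manifest (see the signing/verification item in section 5).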
## 5) Implementation checklist (practical)

- License confirmation (Whisper: MIT? / underlying model is license dependent)
- Accuracy/latency benchmark with sample data
- Performance and accuracy difference report after quantization
- Separation of CI (CPU) and nightly GPU benchmarks
- Security: signing/verification of model artifacts
## 6) Specific command examples

- TF -> ONNX (MoveNet example):

  ```shell
  pip install tf2onnx onnx onnxruntime
  python -m tf2onnx.convert --saved-model movenet_savedmodel --output movenet.onnx --opset 14
  ```

- PyTorch Whisper -> ONNX (simple):

  ```shell
  pip install torch onnx onnxruntime
  python export_whisper_to_onnx.py --model small --out whisper_small.onnx
  ```

- Convert ONNX to FP16 (example):

  ```shell
  python -m onnxruntime.tools.convert_to_ort --model whisper_small.onnx --output whisper_small_fp16.onnx --target_fp16
  ```
## 7) Recommended initial configuration (production startup)

- pose: MoveNet Lightning -> ONNX Runtime (GPU preferred, FP16)
- asr: Whisper small -> ONNX Runtime GPU (switch to medium for high-accuracy requests)
- Quantization: adopt INT8 in production only after careful evaluation with calibration data
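This initial configuration could be expressed as a configuration file along the following lines; all keys and values are illustrative assumptions, not an existing schema:

```json
{
  "pose": {
    "backend": "movenet_onnx",
    "variant": "lightning",
    "precision": "fp16",
    "device": "gpu"
  },
  "asr": {
    "backend": "whisper_onnx",
    "variant": "small",
    "precision": "fp16",
    "high_accuracy_variant": "medium"
  }
}
```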
## 8) CI / testing

- Always run small CPU-based smoke tests (whisper.cpp / ORT CPU)
- Nightly GPU benchmark (accuracy + latency)
## 9) Additional materials and references

- ONNX Runtime quantization docs
- TensorRT best practices
- whisper.cpp repository (CPU-optimized)
Summary: a realistic path is to introduce MoveNet (ONNX, FP16) and Whisper small (ONNX, FP16) as the initial candidates first, then gradually add medium/large once the CI benchmark and quantization pipeline are in place.
## Implemented fallback modules (added 2026-04-24)

The following seven modules have been implemented in evospikenet/video_analysis/ as fallbacks until the production models are introduced.
Switching to a production model is done with the backend argument or a configuration file.
| Module | Implemented backend | Production model hook | Switch key |
|---|---|---|---|
| `depth_estimation.py` | Brightness-gradient pseudo-depth | torch.hub MiDaS-small | `backend="midas_real"` |
| `vad.py` | Energy-threshold VAD | pyannote.audio `Pipeline.from_pretrained()` | `SpeakerDiarizer.use_pyannote=True` |
| `shot_boundary_detector.py` | Histogram / pixel difference | PySceneDetect / custom CNN | `method="combined"` |
| `narrative_generator.py` | Template sentences | EvoSpikeNet LM (NeuralLanguageAdapter) | `use_lm=True` |
| `temporal_action_localizer.py` | Position-difference sliding window | ST-GCN / SlowFast | `backend="stgcn_real"` |
| `spatial_relations.py` | BBox coordinates + depth map | Same as left (additional algorithms planned) | — |
| `event_schema.py` | Pure dataclass (no dependencies) | — | — |
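The energy-threshold VAD fallback listed for `vad.py` can be sketched as follows. This is a minimal sketch of the technique, not the module's actual API: the function names and threshold value are illustrative assumptions.

```python
def frame_energy(samples):
    """Mean squared amplitude of one audio frame (samples in [-1, 1])."""
    return sum(s * s for s in samples) / len(samples)

def energy_vad(frames, threshold=0.01):
    """Energy-threshold VAD: flag a frame as speech when its energy
    exceeds a fixed threshold (0.01 is an assumed value; real use
    would calibrate it against the noise floor)."""
    return [frame_energy(f) > threshold for f in frames]
```

Switching to pyannote.audio replaces this per-frame rule with a learned pipeline while keeping the same frames-in, flags-out shape.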
### Production migration checklist

- `depth_estimation.py`: `pip install timm torch` → test with `DepthEstimator(backend="midas_real")`
- `vad.py`: `pip install pyannote.audio` → `SpeakerDiarizer(model_id="pyannote/speaker-diarization-3.1")`
- `narrative_generator.py`: after starting the EvoSpikeNet LM service, verify operation with `NarrativeGenerator(use_lm=True)`
- `temporal_action_localizer.py`: register the ST-GCN ONNX model with `STGCNRealActionBackend` in `backends.py`
- E2E: measure latency with `pipeline.run(frames, audio, fps=25, enable_extended=True)`
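The E2E latency measurement in the last checklist item can be sketched as a thin timing wrapper around `pipeline.run` (the `run` signature is taken from the checklist above; the helper itself and its name are assumptions):

```python
import time

def measure_latency(pipeline, frames, audio, fps=25, runs=5):
    """Call pipeline.run(...) repeatedly and return per-run wall-clock
    latencies in milliseconds; multiple runs smooth out warm-up cost."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        pipeline.run(frames, audio, fps=fps, enable_extended=True)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

The resulting per-run numbers map directly onto the per-model latency metric exported to Prometheus in section 4.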
Update date: 2026-04-24