
Model candidates and installation steps for video/audio analysis

> [!NOTE]
> For the latest implementation status, refer to Functional Implementation Status (Remaining Functionality).

Purpose: summarize candidate models for production use, conversion/optimization paths, rollout procedures, and operational cautions, focusing on MoveNet-based keypoint extraction and Whisper-based ASR.

1) Keypoint extraction (pose) candidates

  • MoveNet (TF / TFLite / ONNX)
  • Variants: Lightning, Thunder
  • Pros: fast, lightweight, runs on GPU/CPU, good accuracy/latency trade-off
  • Cons: inferior to HRNet etc. in complex poses and crowded scenes
  • Recommended backend: ONNX Runtime (GPU/CPU); TensorRT for GPU acceleration
  • Conversion: TF -> ONNX (tf2onnx) -> ORT/TensorRT

  • MediaPipe Pose
  • Pros: fast, ships with a ready-made pipeline (preprocessing + inference)
  • Cons: limited customizability; license/binary dependencies
  • Recommended: edge/mobile deployment (TFLite)

  • HRNet / higher-accuracy models
  • Pros: high accuracy (for research/evaluation)
  • Cons: heavy; requires a GPU for real-time use
  • Recommended: offline/batch processing, accuracy benchmarks

Shortest introduction procedure (pose)

  1. Obtain/convert MoveNet Lightning to ONNX (using tf2onnx).
  2. Implement a simple get_pose_backend(name) wrapper on ONNX Runtime (a minimal sketch follows this list).
  3. Fix the inference batch/streaming input/output specification behind a pose interface.
  4. Add an accuracy/latency bench to CI (sample: 1000 consecutive 1080p frames).
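A minimal sketch of step 2, assuming a MoveNet Lightning export at movenet.onnx with the usual [1, 192, 192, 3] int32 input and [1, 1, 17, 3] output; verify the shapes of your actual export before relying on this:

```python
# Minimal get_pose_backend() sketch on ONNX Runtime. The model path,
# input size (192x192 int32), and output layout [1, 1, 17, 3] are
# assumptions based on the usual MoveNet Lightning export; verify them
# against your converted model.
import cv2  # pip install opencv-python
import numpy as np
import onnxruntime as ort

class MoveNetOnnxBackend:
    def __init__(self, model_path: str, providers=None):
        # Falls back to CPU automatically when CUDA is unavailable.
        providers = providers or ["CUDAExecutionProvider", "CPUExecutionProvider"]
        self.session = ort.InferenceSession(model_path, providers=providers)
        self.input_name = self.session.get_inputs()[0].name

    def infer(self, frame: np.ndarray) -> np.ndarray:
        # frame: HxWx3 uint8 image; returns 17 keypoints as (y, x, score).
        resized = cv2.resize(frame, (192, 192)).astype(np.int32)
        (keypoints,) = self.session.run(None, {self.input_name: resized[None, ...]})
        return keypoints[0, 0]

def get_pose_backend(name: str = "movenet_onnx"):
    if name == "movenet_onnx":
        return MoveNetOnnxBackend("movenet.onnx")
    raise ValueError(f"unknown pose backend: {name}")
```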

2) ASR candidates

  • OpenAI Whisper series
  • Variants: tiny, base, small, medium, large
  • Pros: multilingual support, reproducibility, consistent accuracy
  • Cons: medium/large are heavy (GPU required); PyTorch dependency
  • Lightweight options: whisper.cpp (int8); ONNX conversion + quantization available
  • Recommended backends: ONNX Runtime (CPU/GPU), torch (GPU), whisper.cpp (embedded CPU use)

  • Wav2Vec2 (HuggingFace)
  • Pros: high accuracy; easy to fine-tune
  • Cons: requires preprocessing/decoder design

  • VOSK / Kaldi
  • Pros: suited to on-premises and low-latency embedded use
  • Cons: requires configuration and dictionary management

Fastest deployment steps (ASR)

  1. Choose a variant based on requirements (small <-> higher accuracy).
  2. Integrate Whisper small with torch, or consider whisper.cpp for lightweight use (a minimal sketch follows this list).
  3. Maintain the ONNX conversion path: PyTorch -> ONNX -> ORT. Apply INT8/FP16 quantization if necessary.
  4. To add speaker diarization, plug in additional modules such as pyannote.
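As a reference for step 2, a minimal sketch using the openai-whisper PyTorch package; the audio file name is illustrative:

```python
# Minimal Whisper small usage via the openai-whisper package
# (pip install openai-whisper). Weights download on first use;
# "sample.wav" is an illustrative file name.
import whisper

model = whisper.load_model("small")
result = model.transcribe("sample.wav")  # language is auto-detected
print(result["text"])
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f}s - {seg["end"]:7.2f}s  {seg["text"]}')
```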

3) Model conversion and optimization (common procedure)

  • Recommended tools: torch.onnx, tf2onnx, onnxruntime-tools, onnxruntime-transformers, optimum (HuggingFace), tensorrt/trtexec.
  • Basic pipeline:

  1. Obtain trained weights (confirm the license)
  2. Framework conversion (PyTorch/TF -> ONNX)
  3. Quick ONNX verification (shapes, dtypes); a sketch follows this list
  4. Quantization (FP16/INT8): FP16 first, then INT8 (check accuracy degradation with calibration data)
  5. Build a TensorRT engine (GPU; dynamic shape settings if necessary)
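For step 3, a quick sanity check with the onnx package; the file name is illustrative:

```python
# Quick ONNX sanity check (pipeline step 3): validate the graph and dump
# I/O names, shapes, and dtypes before quantizing. File name illustrative.
import onnx

model = onnx.load("movenet.onnx")
onnx.checker.check_model(model)  # raises on structural problems

for kind, tensors in (("input", model.graph.input), ("output", model.graph.output)):
    for t in tensors:
        dims = [d.dim_value or d.dim_param for d in t.type.tensor_type.shape.dim]
        # elem_type is the onnx.TensorProto dtype enum (e.g. 6 == INT32)
        print(kind, t.name, dims, t.type.tensor_type.elem_type)
```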

4) Inference service design (operations)

  • Model repository: keep models/video/pose/ and models/audio/asr/ separate and under version control. Record the SHA/date in a manifest.
  • Containers: python:3.10-slim as the base; an NVIDIA CUDA-compatible image for GPU. Prepare an image with ONNX Runtime / TensorRT installed.
  • Model switching: enable runtime switching via the MODEL_SELECTION environment variable or an API (a sketch follows this list).
  • Monitoring: export per-model latency, success_rate, and memory_usage to Prometheus.
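A minimal sketch of the manifest recording and MODEL_SELECTION switching described above; the registry keys, paths, and manifest layout are illustrative, not the repository's actual conventions:

```python
# Manifest entry (SHA/date) plus runtime model switching via the
# MODEL_SELECTION environment variable. Keys and paths are illustrative.
import hashlib
import json
import os
from datetime import date

def write_manifest_entry(model_path: str, manifest_path: str = "manifest.json"):
    # Records the artifact hash and date; overwrites the manifest file.
    sha = hashlib.sha256(open(model_path, "rb").read()).hexdigest()
    entry = {"path": model_path, "sha256": sha, "date": date.today().isoformat()}
    with open(manifest_path, "w") as f:
        json.dump(entry, f, indent=2)

MODEL_REGISTRY = {
    "pose.movenet_lightning": "models/video/pose/movenet_lightning.onnx",
    "asr.whisper_small": "models/audio/asr/whisper_small.onnx",
}

def resolve_model(default: str = "pose.movenet_lightning") -> str:
    # MODEL_SELECTION overrides the default at runtime.
    key = os.environ.get("MODEL_SELECTION", default)
    if key not in MODEL_REGISTRY:
        raise ValueError(f"MODEL_SELECTION={key!r} is not registered")
    return MODEL_REGISTRY[key]
```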

5) Implementation checklist (practical)

  • License confirmation (Whisper: MIT? / underlying models carry their own licenses)
  • Accuracy/latency bench with sample data
  • Performance and accuracy-difference report after quantization
  • Separation of CI (CPU) and nightly GPU benches
  • Security: signing/verification of model artifacts

6) Specific command examples

  • TF -> ONNX (MoveNet example):

```bash
pip install tf2onnx onnx onnxruntime
python -m tf2onnx.convert --saved-model movenet_savedmodel --output movenet.onnx --opset 14
```

  • PyTorch Whisper -> ONNX (simple):

```bash
pip install torch onnx onnxruntime
python export_whisper_to_onnx.py --model small --out whisper_small.onnx
```

  • Convert ONNX to FP16 (example):

```bash
python -m onnxruntime.tools.convert_to_ort --model whisper_small.onnx --output whisper_small_fp16.onnx --target_fp16
```
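The commands above cover FP16; for INT8, a minimal sketch using onnxruntime's dynamic quantization (weights-only, so no calibration dataset is needed; file names are illustrative):

```python
# Dynamic INT8 quantization with onnxruntime (weights only; no
# calibration data required). File names are illustrative.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="whisper_small.onnx",
    model_output="whisper_small_int8.onnx",
    weight_type=QuantType.QInt8,
)
```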

7) Recommended initial configuration (production launch)

  • pose: MoveNet Lightning -> ONNX Runtime (GPU preferred, FP16)
  • asr: Whisper small -> ONNX Runtime GPU (switch to medium when higher accuracy is requested)
  • Quantization: adopt INT8 in production only after careful evaluation with calibration data

8) CI / tests

  • Always run small CPU-based smoke tests (whisper.cpp / ORT CPU); a pytest sketch follows this list
  • Nightly GPU bench (accuracy + latency)
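A sketch of such a CPU smoke test in pytest, assuming the MoveNet artifact and shapes from section 6; it skips when the model file is absent:

```python
# CPU-only smoke test for CI. Assumes the MoveNet ONNX export from
# section 6 (int32 input, [1, 1, 17, 3] output); skipped if missing.
import os

import numpy as np
import onnxruntime as ort
import pytest

MODEL = "movenet.onnx"

@pytest.mark.skipif(not os.path.exists(MODEL), reason="model artifact missing")
def test_movenet_cpu_smoke():
    sess = ort.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    # Replace symbolic/dynamic dims with 1 to build the dummy tensor.
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dummy = np.zeros(shape, dtype=np.int32)
    (keypoints,) = sess.run(None, {inp.name: dummy})
    assert keypoints.shape[-2:] == (17, 3)  # 17 keypoints, (y, x, score)
```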

9) Additional materials and references

  • ONNX Runtime quantization docs
  • TensorRT best practices
  • whisper.cpp repository (CPU-optimized)

Summary: a realistic path is to introduce MoveNet (ONNX, FP16) and Whisper small (ONNX, FP16) first, and then gradually add medium/large once the CI bench and quantization pipeline are in place.


Implemented fallback modules (added 2026-04-24)

The following seven modules have been implemented in evospikenet/video_analysis/ as fallbacks until the production models are introduced. Switching to a production model is done via the backend argument or a configuration file.

| Module | Implemented backend | Production model hook | Switch key |
| --- | --- | --- | --- |
| depth_estimation.py | Brightness-gradient pseudo-depth | torch.hub MiDaS-small | backend="midas_real" |
| vad.py | Energy-threshold VAD | pyannote.audio Pipeline.from_pretrained() | SpeakerDiarizer.use_pyannote=True |
| shot_boundary_detector.py | Histogram / pixel difference | PySceneDetect / own CNN | method="combined" |
| narrative_generator.py | Template sentences | EvoSpikeNet LM (NeuralLanguageAdapter) | use_lm=True |
| temporal_action_localizer.py | Position-difference sliding window | ST-GCN / SlowFast | backend="stgcn_real" |
| spatial_relations.py | BBox coordinates + depth map | Same as the implemented backend (additional algorithms planned) | |
| event_schema.py | Pure dataclass (no dependencies) | | |

Production migration checklist

  1. depth_estimation.py: pip install timm torch, then test with DepthEstimator(backend="midas_real")
  2. vad.py: pip install pyannote.audio, then SpeakerDiarizer(model_id="pyannote/speaker-diarization-3.1")
  3. narrative_generator.py: after starting the EvoSpikeNet LM service, verify operation with NarrativeGenerator(use_lm=True)
  4. temporal_action_localizer.py: register the ST-GCN ONNX model with STGCNRealActionBackend in backends.py
  5. E2E: measure latency with pipeline.run(frames, audio, fps=25, enable_extended=True) (a timing sketch follows this list)
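For checklist item 5, a minimal timing harness; only the pipeline.run() call is taken from the checklist, and how pipeline, frames, and audio are constructed is left to the project documentation:

```python
# Wall-clock latency around pipeline.run() (checklist item 5). Only the
# run() signature comes from the checklist; `pipeline`, `frames`, and
# `audio` must be constructed per the project documentation.
import time

def measure_latency(pipeline, frames, audio, runs: int = 5) -> float:
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pipeline.run(frames, audio, fps=25, enable_extended=True)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)  # average seconds per E2E run
```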

Update date: 2026-04-24