
Video/audio analysis platform complete implementation specifications

[!NOTE] For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).

Purpose: This document is the specification for the complete implementation of the multimodal (video + audio) time-series analysis infrastructure in evospikenet. It covers requirements, architecture, module decomposition, performance goals, test criteria, deployment requirements, privacy/security requirements, and a phased implementation plan aimed at production use rather than a PoC.

Target Audience: Developers, ML Engineers, SREs, Product Owners

Outline requirements

- Processing target: input video (frame sequences) of one or more people plus the corresponding audio track (WAV). Both real-time (streaming) and batch processing are supported.
- Output: time-series events (track ID, timestamp, action label, confidence), ASR subtitles (with timestamps), a multimodal integrated timeline, and a natural-language summary (short text)
- Availability: 99.5% (service SLA); scalability: horizontal scaling to N nodes
- Latency goals for the synchronous API (small inputs): P95 < 2 s with lightweight backends/fallback; P95 < 300 ms with GPU acceleration (small batches)

Non-functional requirements

- Modularization: each stage (pose/tracking/action/asr/fusion/summary) is a plugin whose backend can be swapped
- Resource management: GPU/CPU/memory allocation control, batching, back-pressure support
- Fault tolerance: job retries, in-flight requeue, persistent store (Redis/SQLite/JSON fallback)
- Observability: metrics (processing latency, throughput, success rate), traces, structured logs
- Privacy: optional anonymization of faces/personally identifiable information, data retention policy, consent management

Architecture overview

- Input layer: upload/stream reception (HTTP/gRPC), metadata attached on receipt
- Worker layer: job queue (Redis/Celery or VideoAnalysisJobStore) + worker pool (synchronous/asynchronous)
- Backend layer: model runtime plugins (Torch/TensorRT/ONNX Runtime/Python fallback)
- Aggregation layer: EventFusionEngine + SummaryGenerator (LLM wrapper or template)
- API layer: synchronous analysis endpoint (/api/video-analysis/analyze), asynchronous job API (/api/video-analysis/submit/status/result)
- Storage: input artifacts (optional), result cache, job metadata (Redis/SQLite)

Main components (functional items)

Each subsection below targets "complete implementation." Priorities are marked P0/P1/P2.

1) Data acceptance & preprocessing (P0)

- HTTP/gRPC/multipart upload endpoints
- Streaming ingestion (chunked frames/audio)
- Input validation (resolution, sample rate, length limits)
- Preprocessing: frame normalization, audio normalization, sampling, cropping

2) Keypoint extraction module: pose (P0)

- Backends: MoveNet / MediaPipe / ONNX MoveNet / TensorRT export
- API: estimate(frame) -> List[Pose] (timestamp, keypoints as dict{name: [x, y, conf]}); see the sketch below
- Requirements: batch/streaming compatible, optimized with or without GPU, deterministic fallback
- Performance: sustained 30 fps at 1080p (on GPU)
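
As a hedged illustration of the pose plugin contract above, the following is a minimal sketch of a deterministic fallback backend. Only estimate(frame) -> List[Pose] comes from this spec; the class names and the placeholder keypoint are assumptions, not the actual evospikenet implementation.

# Hypothetical sketch of the pose-backend plugin contract; not the real pose.py.
from dataclasses import dataclass, field
from typing import Dict, List
import numpy as np

@dataclass
class Pose:
    timestamp: float
    keypoints: Dict[str, List[float]] = field(default_factory=dict)  # name -> [x, y, conf]

class DeterministicFallbackPoseBackend:
    """CPU-only fallback: returns a single low-confidence pose at the frame centre."""

    def estimate(self, frame: np.ndarray, timestamp: float = 0.0) -> List[Pose]:
        h, w = frame.shape[:2]
        # Placeholder keypoint so downstream modules always receive a valid structure.
        keypoints = {"nose": [w / 2.0, h / 2.0, 0.1]}
        return [Pose(timestamp=timestamp, keypoints=keypoints)]

# Example: poses = DeterministicFallbackPoseBackend().estimate(np.zeros((720, 1280, 3), np.uint8))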

3) Tracking module: tracking (P0)

- Algorithm: DeepSORT / IoU-based tracker (selectable; a minimal IoU sketch follows below)
- API: update(detections, timestamp) -> List[Tracked] (track_id, bbox, keypoints)
- Re-identification (ReID) plugin support
- Robustness: recovery from occlusion, short-term ID persistence
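
A minimal sketch of the IoU-based option is shown below, assuming axis-aligned (x1, y1, x2, y2) boxes and greedy matching; the real tracking.py likely handles keypoint propagation, track expiry, and one-to-one assignment, which are omitted here.

# Simplified IoU tracker sketch for update(detections, timestamp) -> List[Tracked].
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: BBox, b: BBox) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

@dataclass
class Tracked:
    track_id: int
    bbox: BBox
    timestamp: float

class SimpleIoUTracker:
    def __init__(self, iou_threshold: float = 0.3):
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, BBox] = {}
        self._next_id = 0

    def update(self, detections: List[BBox], timestamp: float) -> List[Tracked]:
        results = []
        for det in detections:
            # Greedy match: reuse the existing track with the best IoU above threshold.
            best_id, best_score = None, self.iou_threshold
            for tid, prev in self.tracks.items():
                score = iou(det, prev)
                if score > best_score:
                    best_id, best_score = tid, score
            if best_id is None:  # no overlap with any known track -> new ID
                best_id, self._next_id = self._next_id, self._next_id + 1
            self.tracks[best_id] = det
            results.append(Tracked(track_id=best_id, bbox=det, timestamp=timestamp))
        return results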

4) Action recognition module: action_recognition (P0)

- Models: ST-GCN / TCN / lightweight 3D-CNN
- API: classify(tracked_sequence) -> List[ActionEvent] (track_id, start, end, label, score)
- Sliding-window inference with online updates (see the sketch below)
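
To make the sliding-window idea concrete, here is a hedged sketch that scores windows of tracked keypoints by mean displacement; the feature, labels, thresholds, and the simplified signature are illustrative and not the actual action_recognition.py implementation of classify(tracked_sequence).

# Sliding-window action classification sketch (illustrative features and labels).
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ActionEvent:
    track_id: int
    start: float
    end: float
    label: str
    score: float

def classify_track(track_id: int, keypoints: np.ndarray, timestamps: List[float],
                   window: int = 16, stride: int = 8) -> List[ActionEvent]:
    """keypoints: (T, K, 2) array of per-frame keypoint coordinates for one track."""
    events: List[ActionEvent] = []
    for start in range(0, max(1, len(keypoints) - window + 1), stride):
        seg = keypoints[start:start + window]
        if len(seg) < 2:
            continue
        # Mean per-frame keypoint displacement as a crude motion feature.
        motion = float(np.mean(np.linalg.norm(np.diff(seg, axis=0), axis=-1)))
        label = "walking" if motion > 2.0 else "standing"  # placeholder rule
        end_idx = min(start + window, len(timestamps)) - 1
        events.append(ActionEvent(track_id, timestamps[start], timestamps[end_idx],
                                  label, min(1.0, motion / 10.0)))
    return events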

5) ASR module: asr (P0)

- Backends: Whisper (small/medium), Kaldi/others, ONNX, etc.
- API: transcribe(audio_chunk) -> Transcript (segments with timestamps, confidence); see the sketch below
- Speaker diarization / overlapping-speech handling (plugin)
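
The sketch below shows one way the transcribe() contract could be backed by openai-whisper; the wrapper class, the confidence proxy derived from avg_logprob, and the dataclasses are assumptions rather than the actual asr.py code.

# Hedged whisper-backed sketch of transcribe(audio) -> Transcript.
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    start: float
    end: float
    text: str
    confidence: float

@dataclass
class Transcript:
    segments: List[Segment] = field(default_factory=list)

class WhisperASR:
    def __init__(self, model_name: str = "small"):
        import whisper  # pip install openai-whisper; ImportError triggers the fallback path
        self._model = whisper.load_model(model_name)

    def transcribe(self, audio_path: str) -> Transcript:
        result = self._model.transcribe(audio_path)
        segments = []
        for seg in result["segments"]:
            # exp(avg_logprob) is only a rough confidence proxy, not a calibrated score.
            conf = math.exp(seg.get("avg_logprob", 0.0))
            segments.append(Segment(seg["start"], seg["end"], seg["text"].strip(), conf))
        return Transcript(segments=segments)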

6) Event fusion engine: fusion (P0)

- Timeline synchronization: integrates video events and ASR segments (see the sketch below)
- Trigger extraction: important-event extraction (thresholds, rules, ML-based)
- Output format: timeline, summary_candidates, metrics
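
A minimal sketch of the timeline synchronization and rule-based trigger extraction is given below; the dict field names are assumptions chosen for illustration, not the EventFusionEngine schema.

# Fusion sketch: merge action events and ASR segments into one time-ordered list.
from typing import Dict, List

def fuse_timeline(action_events: List[Dict], asr_segments: List[Dict]) -> List[Dict]:
    timeline: List[Dict] = []
    for ev in action_events:
        timeline.append({"t": ev["start"], "type": "action", "track_id": ev["track_id"],
                         "label": ev["label"], "confidence": ev["score"]})
    for seg in asr_segments:
        timeline.append({"t": seg["start"], "type": "speech",
                         "text": seg["text"], "confidence": seg["confidence"]})
    return sorted(timeline, key=lambda e: e["t"])

def extract_triggers(timeline: List[Dict], min_conf: float = 0.7) -> List[Dict]:
    # Simple threshold rule; ML-based trigger extraction would replace this filter.
    return [e for e in timeline if e["confidence"] >= min_conf]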

7) Summary generation: summary (P1)

- Lightweight template summarization + LLM-based summarization plugin
- API: generate_summary(timeline, transcript) -> str (multilingual support)

8) Persistent job store & workers (P0)

- Highly reliable job store: Redis primary, SQLite persistent fallback, JSON fallback
- Worker pool: in-flight recovery on restart, job priority, retry policy
- Monitoring: queue length, worker health, processing_latency

9) Backend management & plugin framework (P0)

- Backend registration API: register_backend(name, capabilities); see the registry sketch below
- Runtime switching: GPU/CPU/FP16/INT8 selection, model cache
- Health checks / automatic failover
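
The following is a hedged sketch of what a registration/failover registry could look like around register_backend(name, capabilities); only that function name comes from the spec, everything else (factory and health-check callables, the select() helper) is illustrative.

# Illustrative backend registry with capability filtering and health-check failover.
from typing import Any, Callable, Dict, Optional

class BackendRegistry:
    def __init__(self) -> None:
        self._backends: Dict[str, Dict[str, Any]] = {}

    def register_backend(self, name: str, capabilities: Dict[str, Any],
                         factory: Callable[[], Any],
                         health_check: Callable[[], bool]) -> None:
        self._backends[name] = {"capabilities": capabilities,
                                "factory": factory,
                                "health_check": health_check}

    def select(self, require_gpu: bool = False) -> Optional[Any]:
        # Return the first healthy backend satisfying the capability requirement.
        for meta in self._backends.values():
            if require_gpu and not meta["capabilities"].get("gpu", False):
                continue
            if meta["health_check"]():
                return meta["factory"]()
        return None  # caller falls back to the deterministic CPU path

registry = BackendRegistry()
registry.register_backend("fallback", {"gpu": False},
                          factory=lambda: object(), health_check=lambda: True)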

10) API layer (P0)

- Synchronous: /api/video-analysis/analyze (short batches)
- Asynchronous: /api/video-analysis/submit, /api/video-analysis/status/{id}, /api/video-analysis/result/{id} (client sketch below)
- Authentication: API key / OAuth2 / RBAC
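
As a usage illustration of the asynchronous flow (submit, poll status, fetch result), here is a hedged client sketch; the base URL, the X-API-Key header, the multipart field name, and the state strings are assumptions, since only the endpoint paths are defined above.

# Client-side sketch of the asynchronous job flow: submit -> poll -> result.
import time
import requests

BASE = "http://localhost:8000/api/video-analysis"  # assumed host/port

def analyze_async(video_path: str, api_key: str, poll_s: float = 2.0) -> dict:
    headers = {"X-API-Key": api_key}  # assumed authentication header
    with open(video_path, "rb") as f:
        job = requests.post(f"{BASE}/submit", files={"video": f}, headers=headers).json()
    job_id = job["job_id"]  # assumed response field
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=headers).json()
        if status.get("state") in ("completed", "failed"):  # assumed state values
            break
        time.sleep(poll_s)
    return requests.get(f"{BASE}/result/{job_id}", headers=headers).json()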

11) Testing & evaluation (P0)

- Unit tests: deterministic edge cases per module
- Integration tests: end-to-end run on a small sample video (mAP/IDF1/WER)
- Performance tests: throughput, latency, scaling
- Data quality tests: ASR WER report, IDF1

12) Privacy & security (P0)

- Face/voice anonymization filter (optional)
- Encryption at rest (DataAtRestEncryption), TLS in transit
- Consent/deletion API, log minimization

13) Deployment & orchestration (P1)

- Docker + Kubernetes manifests, GPU node configuration
- Metrics exported for the Horizontal Pod Autoscaler (custom metrics)
- Canary / blue-green deployment strategy, model hot swap

14) Visualization / frontend (P1)

- Timeline visualization, track visualization, summary view
- Real-time updates via WebSocket or SSE

Acceptance criteria (example)

- Unit test coverage: 90% or higher per module (critical logic)
- Integration E2E: pipeline completes on the sample video with IDF1 >= 0.60 (simple data) and WER <= 0.30 (low noise)
- Latency: P95 < 2 s with the lightweight fallback, P95 < 300 ms in a GPU environment
- Security: AES-256-GCM for data at rest, deletion API works on request

Phased implementation plan (short-term → medium-term)

- Phase 1 (2 weeks): basic API, job queue, pose fallback, job store, unit tests
- Phase 2 (3 weeks): tracking, simple action_recognition, ASR fallback, fusion, simple summary
- Phase 3 (4 weeks): full model integration (MoveNet/Whisper/ST-GCN), GPU optimization, performance tuning
- Phase 4 (ongoing): privacy hardening, K8s deployment, monitoring and SLO attainment

Development/CI requirements

- Add optional extras to requirements.txt: video[torch], onnxruntime-gpu, whisper, opencv-python-headless, redis, fastapi[all], etc.
- CI: label tests that require a GPU; always run tests with the CPU fallback

Document placement

- Specification: Docs/VIDEO_AUDIO_ANALYSIS_SPEC.md (this file)
- Implementation checklist: docs/implementation/video_analysis_checklist.md (links implementation progress and the file trail)


Implementation status (updated on 2026-04-24)

List of implemented modules

All of the following modules have been implemented in evospikenet/video_analysis/.

| Module | File | Phase | Status |
| --- | --- | --- | --- |
| Pose estimation (fallback) | pose.py | Phase 1 | ✅ Completed |
| Tracking (IoU) | tracking.py | Phase 1 | ✅ Completed |
| Action recognition (position difference) | action_recognition.py | Phase 1 | ✅ Completed |
| ASR wrapper | asr.py + asr_policy.py | Phase 1 | ✅ Completed (fallback) |
| Event fusion | fusion.py | Phase 1 | ✅ Completed |
| Job queue | job_queue.py | Phase 1 | ✅ Completed |
| Backend management | backends.py + backends_real.py | Phase 2 | ✅ Completed |
| Celery worker | worker.py | Phase 2 | ✅ Completed |
| Privacy processing | privacy.py | Phase 2 | ✅ Completed |
| Evaluation metrics | metrics.py + runtime_metrics.py | Phase 2 | ✅ Completed |
| Unified event schema | event_schema.py | Phase 3 | New (2026-04-24) |
| Shot boundary detection | shot_boundary_detector.py | Phase 3 | New (2026-04-24) |
| VAD / speaker separation | vad.py | Phase 3 | New (2026-04-24) |
| Monocular depth estimation | depth_estimation.py | Phase 3 | New (2026-04-24) |
| Spatial relations extraction | spatial_relations.py | Phase 3 | New (2026-04-24) |
| Narrative generation | narrative_generator.py | Phase 3 | New (2026-04-24) |
| Temporal action localization | temporal_action_localizer.py | Phase 3 | New (2026-04-24) |
| Pipeline integration | pipeline.py (enable_extended) | Phase 3 | Updated 2026-04-24 |

New module details

event_schema.py — Unified event schema

  • Actor(id, type, attributes, bbox, keypoints, track_id)
  • SpatialFrame(type, coords, depth_m, relation_desc)
  • EventEvidence(detection_ids, transcript_snippet, frame_indices, pose_refs)
  • Event — dict-based factory methods: from_vision_detection(det, timestamp), from_asr_segment(seg), from_scene_change(frame_idx, timestamp)
  • AnalysisTimeline — sorted_events(), filter_by_type(), filter_by_confidence(), to_dict() (usage sketch below)
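
A short usage sketch of the schema follows; the factory and filter names come from the list above, but the shapes of the det/seg dicts and the way events are attached to an AnalysisTimeline are assumptions and may differ from event_schema.py.

# Usage sketch of the unified event schema (input shapes and add() are assumptions).
from evospikenet.video_analysis.event_schema import AnalysisTimeline, Event

det = {"bbox": [10, 20, 110, 220], "label": "person", "confidence": 0.92, "track_id": 3}
seg = {"start": 1.2, "end": 2.8, "text": "hello", "confidence": 0.85}

timeline = AnalysisTimeline()
timeline.add(Event.from_vision_detection(det, timestamp=0.4))  # attachment API assumed
timeline.add(Event.from_asr_segment(seg))

for event in timeline.filter_by_confidence(0.5):   # high-confidence events only
    print(event)
print(timeline.to_dict())                          # serializable timeline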

shot_boundary_detector.py — Shot boundary detection

  • Method: Histogram difference / Pixel difference / Combined (selectable)
  • ShotBoundaryDetector(threshold, method, min_scene_len).detect(frames, fps) → [(frame_idx, timestamp), ...]
  • Scored output is also available via detect_scores() (a histogram-difference sketch follows below)
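
The histogram-difference method can be sketched as below; the 64-bin grayscale histogram, the total-variation distance, and the default threshold are illustrative choices, not the ShotBoundaryDetector internals.

# Histogram-difference shot-boundary sketch: detect(frames, fps) -> [(frame_idx, ts), ...]
from typing import List, Tuple
import numpy as np

def detect_shots(frames: List[np.ndarray], fps: float, threshold: float = 0.5,
                 min_scene_len: int = 5) -> List[Tuple[int, float]]:
    boundaries: List[Tuple[int, float]] = []
    prev_hist, last_cut = None, -min_scene_len
    for i, frame in enumerate(frames):
        gray = frame.mean(axis=-1) if frame.ndim == 3 else frame
        hist, _ = np.histogram(gray, bins=64, range=(0, 256))
        hist = hist.astype(np.float64) / max(hist.sum(), 1.0)
        if prev_hist is not None:
            diff = 0.5 * np.abs(hist - prev_hist).sum()  # total-variation distance in [0, 1]
            if diff > threshold and i - last_cut >= min_scene_len:
                boundaries.append((i, i / fps))
                last_cut = i
        prev_hist = hist
    return boundaries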

vad.py — VAD/Speaker separation

  • VoiceActivityDetector(energy_threshold, frame_duration_ms, merge_gap_s, min_speech_s)
  • detect(audio_wave, sample_rate) → [{"start", "end", "energy"}, ...] (energy-threshold sketch below)
  • SpeakerDiarizer(max_speakers) — pyannote.audio replacement hook
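
A hedged sketch of the energy-threshold detect() path is shown below, assuming float audio normalized to [-1, 1]; the framing and merge logic are illustrative rather than the actual vad.py implementation.

# Energy-threshold VAD sketch: detect(audio_wave, sample_rate) -> speech segments.
from typing import Dict, List
import numpy as np

def detect(audio_wave: np.ndarray, sample_rate: int, energy_threshold: float = 0.01,
           frame_duration_ms: int = 30, merge_gap_s: float = 0.2) -> List[Dict]:
    frame_len = int(sample_rate * frame_duration_ms / 1000)
    segments: List[Dict] = []
    for start in range(0, len(audio_wave) - frame_len + 1, frame_len):
        frame = audio_wave[start:start + frame_len].astype(np.float64)
        energy = float(np.mean(frame ** 2))  # mean-square energy of the frame
        if energy < energy_threshold:
            continue
        t0, t1 = start / sample_rate, (start + frame_len) / sample_rate
        if segments and t0 - segments[-1]["end"] <= merge_gap_s:
            segments[-1]["end"] = t1            # extend the previous speech run
            segments[-1]["energy"] = max(segments[-1]["energy"], energy)
        else:
            segments.append({"start": t0, "end": t1, "energy": energy})
    return segments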

depth_estimation.py — Monocular depth estimation

  • Fallback: Intensity gradient-based pseudo-depth map
  • Production hook: torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
  • DepthEstimator.estimate(frame) → {"depth_map": ndarray(H,W), "depth_at_center": float, "backend": str}
  • estimate_depth_at_bbox(frame, bbox) → float

spatial_relations.py — Spatial relations extraction

  • get_spatial_relation(obj_a, obj_b, depth_a, depth_b) → str
  • Return values: left_of / right_of / above / below / in_front_of / behind / near / touching / unknown (see the sketch below)
  • extract_pairwise_relations(detections, depth_maps) → [{"subject_id", "object_id", "relation", "confidence"}, ...]
  • MotionVectorEstimator — adds velocity vectors and direction from track history
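
A rule-based sketch of get_spatial_relation() follows; the pixel and depth thresholds are placeholders, only a subset of the listed relations is covered, and the real spatial_relations.py may order its checks differently.

# Rule-based spatial-relation sketch over bbox centres and optional depth values.
from typing import Optional, Sequence

def get_spatial_relation(bbox_a: Sequence[float], bbox_b: Sequence[float],
                         depth_a: Optional[float] = None, depth_b: Optional[float] = None,
                         depth_margin: float = 0.5, near_px: float = 40.0) -> str:
    ax, ay = (bbox_a[0] + bbox_a[2]) / 2.0, (bbox_a[1] + bbox_a[3]) / 2.0
    bx, by = (bbox_b[0] + bbox_b[2]) / 2.0, (bbox_b[1] + bbox_b[3]) / 2.0
    dx, dy = ax - bx, ay - by
    if depth_a is not None and depth_b is not None:
        if depth_b - depth_a > depth_margin:   # a is closer to the camera than b
            return "in_front_of"
        if depth_a - depth_b > depth_margin:
            return "behind"
    if abs(dx) < near_px and abs(dy) < near_px:
        return "near"
    if abs(dx) >= abs(dy):                     # horizontal separation dominates
        return "left_of" if dx < 0 else "right_of"
    return "above" if dy < 0 else "below"      # image y grows downward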

narrative_generator.py — Narrative generation

  • NarrativeGenerator(use_lm, max_events, min_confidence).generate(timeline) → str
  • Template mode (default): "At {ts}s, {actor} is detected {action} (confidence {conf}%)"
  • LM mode: Call EvoSpikeNet LM via NeuralLanguageAdapter, fallback to template on failure
  • generate_shot_summary(shot_start, shot_end, events, relations) → str

temporal_action_localizer.py — Temporal action localizer

  • TemporalActionLocalizer(window_size, stride, min_confidence, nms_overlap_threshold)
  • localize(frames, fps) → [{"start", "end", "action", "confidence"}, ...]
  • localize_with_poses(frames, pose_results, fps) → [...] — Keypoint change based
  • Deduplication via temporal-IoU NMS (see the sketch below)
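
The deduplication step can be sketched as a greedy temporal-IoU NMS over the segment dicts returned by localize(); the per-action grouping and the default threshold below are assumptions about the TemporalActionLocalizer behaviour.

# Greedy temporal-IoU NMS sketch over {"start", "end", "action", "confidence"} segments.
from typing import Dict, List

def temporal_iou(a: Dict, b: Dict) -> float:
    inter = max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments: List[Dict], overlap_threshold: float = 0.5) -> List[Dict]:
    kept: List[Dict] = []
    # Keep the highest-confidence segment first; drop overlapping segments of the same action.
    for seg in sorted(segments, key=lambda s: s["confidence"], reverse=True):
        if all(seg["action"] != k["action"] or temporal_iou(seg, k) < overlap_threshold
               for k in kept):
            kept.append(seg)
    return sorted(kept, key=lambda s: s["start"])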

VideoAnalysisPipeline updates

pipeline = VideoAnalysisPipeline(
    pose_backend="fallback",
    action_backend="fallback",
    asr_backend="fallback",
    depth_backend="fallback",
)
result = pipeline.run(frames, audio_wave, fps=10.0, enable_extended=True)
# The following is added to result:
#   shot_boundaries, vad_segments, temporal_segments,
#   depth_sample, spatial_relations, narrative

Test status

| Test file | Tests | Status |
| --- | --- | --- |
| tests/unit/test_video_analysis_new_modules.py | 62 | ✅ All passed |
| tests/unit/test_video_analysis_components.py | 15 | ✅ All passed |
| tests/unit/test_video_analysis_metrics.py | 1 | ✅ Passed |
| tests/unit/test_video_analysis_backends.py | 1 | ✅ Passed |
| Total | 82 | |

Guide to switching to EvoSpikeNet LM backend (2026-04-24)

The pipeline has two independent LM paths.

Path A: lm_summary (summary of the entire pipeline)

The lm_summary key is returned by VideoAnalysisPipeline.run().
It uses EvoLMBackend → AutoModelSelector.get_model("text") → SpikingEvoTextLM.

Control method

| Method | Setting example | Effect |
| --- | --- | --- |
| Constructor argument | VideoAnalysisPipeline(lm_backend="evospikenet_lm") | Default; uses SpikingEvoTextLM |
| Disable via environment variable | VIDEO_ANALYSIS_ENABLE_LM=0 | lm_summary="", lm_backend=None |
| Enable via environment variable | VIDEO_ANALYSIS_ENABLE_LM=1 (default) | Starts the LM and generates summaries |

import os
os.environ["VIDEO_ANALYSIS_ENABLE_LM"] = "1"  # Or "0" to disable

from evospikenet.video_analysis.pipeline import VideoAnalysisPipeline

pipeline = VideoAnalysisPipeline(lm_backend="evospikenet_lm")
result = pipeline.run(frames, audio_wave, fps=25.0)
print(result["lm_summary"])   # e.g., "A person is walking."
print(result["lm_backend"])   # "evospikenet_lm"

Backend stack (path A)

pipeline.run()
  └─ EvoLMBackend.generate(prompt, max_new_tokens=40, temperature=0.6)
       └─ AutoModelSelector.get_model("text")
            └─ SpikingEvoTextLM.generate(input_ids, max_new_tokens)

Path B: narrative (chronological narrative generation)

The narrative key is returned when pipeline.run(enable_extended=True) is called.
It uses NarrativeGenerator → _LMBridge → EvoLMBackend → SpikingEvoTextLM.

Operation flow

NarrativeGenerator.generate(timeline)
  ├─ Assemble template sentences (Precomputed)
  ├─ If use_lm=True: _LMBridge.generate(prompt, max_new_tokens=200)
  │    └─ EvoLMBackend.generate(prompt, max_new_tokens=200)
  │         └─ SpikingEvoTextLM.generate(...)
  └─ If LM output < 20 chars or None -> Template fallback

Control method

| Method | Setting example | Effect |
| --- | --- | --- |
| Default (recommended) | NarrativeGenerator(use_lm=True) | LM first; falls back to template on failure |
| Disable LM | NarrativeGenerator(use_lm=False) | Always uses the template |
| Via pipeline | VideoAnalysisPipeline() | use_lm=True is set by default |

from evospikenet.video_analysis.narrative_generator import NarrativeGenerator

# Generate narrative using LM
gen = NarrativeGenerator(use_lm=True, max_events=15, min_confidence=0.3)
text = gen.generate(timeline)
print(text)

# Shot-based summary
shot_text = gen.generate_shot_summary(
    shot_start=0.0, shot_end=5.0,
    events=timeline.filter_by_confidence(0.5),
    relations=[{"subject_id": 0, "object_id": 1, "relation": "left_of", "confidence": 0.8}],
)

Checking backend availability

from evospikenet.video_analysis.backends import get_backend_status

status = get_backend_status()
print(status["lm"])
# {'evospikenet_lm': {'available': True, 'tier': 'real',
#                     'note': 'local SpikingEvo models via AutoModelSelector'}}

Common problems and solutions

| Symptom | Cause | Remedy |
| --- | --- | --- |
| lm_summary is an empty string | VIDEO_ANALYSIS_ENABLE_LM=0 is set | Set the environment variable to 1 |
| narrative falls back to the template text | _LMBridge failed to load | Confirm torch is installed (pip install torch) |
| lm_backend is None | EvoLMBackend initialization failed (missing dependency) | pip install torch transformers |
| LM output is too short or incoherent | SpikingEvoTextLM is untrained | Replace the AutoModelSelector weights with a trained model |

Weight replacement and model customization

SpikingEvoTextLM is loaded with AutoModelSelector.get_model("text").
If you want to use trained weights, please pre-train with tools/train_spiking_lm.py.

# Training
python tools/train_spiking_lm.py --node-type text --epochs 10

# Inference check
python -c "
from evospikenet.llm_backend import EvoLMBackend
lm = EvoLMBackend(task_type='text')
print(lm.generate('Events: walking:3. Please summarize in one sentence.', max_new_tokens=40))
"

Remaining tasks (Phase 4)

| Function | Priority | Responsible module |
| --- | --- | --- |
| ByteTrack/DeepSORT production MOT | High | tracking.py / backends.py |
| MediaPipe/OpenPose production pose | High | MoveNetRealPoseBackend |
| TimeSformer/SlowFast real model | High | STGCNRealActionBackend |
| pyannote.audio speaker separation | Medium | vad.py (SpeakerDiarizer) |
| SAM/Mask R-CNN segmentation | Medium | New module planned |
| TensorRT/ONNX GPU optimization | Medium | backends.py |
| WebSocket real-time API | Low | video_analysis_api.py |
| Annotation UI | Low | Separate tools |

Creation date: 2026-04-19 Update date: 2026-04-24 Author: Engineering Team