# Video/Audio Analysis Platform: Complete Implementation Specification
> [!NOTE]
> For the latest implementation status, refer to "Functional Implementation Status (Remaining Functionality)".
Purpose: This document specifies the complete implementation of the multimodal (video + audio) time-series analysis infrastructure in evospikenet. It covers requirements, architecture, module decomposition, performance goals, test criteria, deployment requirements, privacy/security requirements, and a phased implementation plan aimed at production use rather than a PoC.
Target Audience: Developers, ML Engineers, SREs, Product Owners
## Requirements overview
- Processing target: input video (frame sequences) of one or more people plus the audio track recorded with it (WAV). Both real-time (streaming) and batch processing are supported.
- Output: time-series events (track ID, timestamp, action label, confidence), ASR subtitles (with timestamps), a fused multimodal timeline, and a natural-language summary (short text).
- Availability: 99.5% (service SLA); scalability: horizontal scaling to N nodes.
- Latency goals (synchronous API, small inputs):
  - P95 < 2 s with lightweight/fallback backends
  - P95 < 300 ms with GPU acceleration (small batches)
## Non-functional requirements
- Modularity: each stage (pose/tracking/action/asr/fusion/summary) is a plug-in whose backend can be swapped out.
- Resource management: GPU/CPU/memory allocation control, batching, back-pressure support.
- Fault tolerance: job retries, requeue of in-flight jobs, persistent store (Redis/SQLite/JSON fallback).
- Observability: metrics (processing latency, throughput, success rate), traces, structured logs.
- Privacy: optional anonymization of faces and personally identifiable information, data retention policy, consent management.
## Architecture overview
- Input layer: upload/stream ingestion (HTTP/gRPC); metadata is attached at ingest time
- Worker layer: job queue (Redis/Celery or VideoAnalysisJobStore) + worker group (synchronous/asynchronous)
- Backend layer: Model runtime plugin (Torch/TensorRT/ONNX Runtime/Python fallback)
- Aggregation layer: EventFusionEngine + SummaryGenerator (LLM wrapper or template)
- API layer: synchronous analysis endpoint (/api/video-analysis/analyze), asynchronous job API (/api/video-analysis/submit/status/result)
- Storage: input artifacts (optional), result cache, job meta (Redis/SQLite)
## Main components (functional items)
Each item below targets complete implementation. Priority is indicated as P0/P1/P2.
### 1) Data acceptance & preprocessing (P0)
- HTTP/gRPC/multipart upload endpoint
- Streaming ingestion (chunked frames/audio)
- Input validation (resolution, sample rate, length limits)
- Preprocessing: frame normalization, audio normalization, sampling, cropping
### 2) Keypoint extraction module (`pose`) (P0)
- Backends: MoveNet / MediaPipe / ONNX MoveNet / TensorRT export
- API: `estimate(frame) -> List[Pose]` (timestamp, keypoints as dict `{name: [x, y, conf]}`); see the sketch below
- Requirements: batch and streaming support, optimized paths with and without GPU, deterministic fallback
- Performance: sustained throughput of 30 fps at 1080p (on GPU)
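For illustration, a minimal sketch of the plug-in contract implied by this API, assuming hypothetical names `Pose`, `PoseBackend`, and `CenterFallbackPoseBackend` (the shipped `pose.py` may differ):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol

import numpy as np


@dataclass
class Pose:
    """One detected person: timestamp plus named keypoints [x, y, conf]."""
    timestamp: float
    keypoints: Dict[str, List[float]] = field(default_factory=dict)


class PoseBackend(Protocol):
    """Plug-in contract; MoveNet/MediaPipe/ONNX/TensorRT backends would implement this."""

    def estimate(self, frame: np.ndarray) -> List[Pose]:
        ...


class CenterFallbackPoseBackend:
    """Deterministic CPU fallback: emits a single low-confidence pose at the frame centre."""

    def estimate(self, frame: np.ndarray) -> List[Pose]:
        h, w = frame.shape[:2]
        return [Pose(timestamp=0.0, keypoints={"nose": [w / 2.0, h / 2.0, 0.1]})]
```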
### 3) Tracking module (`tracking`) (P0)
- Algorithm: DeepSORT / IoU-based tracker (selectable); a minimal IoU-matching sketch follows below
- API: `update(detections, timestamp) -> List[Tracked]` (track_id, bbox, keypoints)
- Re-identification (ReID) plugin support
- Robustness: recovery from occlusion, short-term ID persistence
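A minimal sketch of the greedy IoU matching such a tracker can use; `GreedyIoUTracker` and its helpers are hypothetical and only illustrate the `update(detections, timestamp)` contract, not the shipped `tracking.py`:

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Assigns each detection to the best-overlapping existing track, else opens a new one."""

    def __init__(self, iou_threshold: float = 0.3):
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, BBox] = {}
        self._next_id = 0

    def update(self, detections: List[BBox], timestamp: float) -> List[dict]:
        results = []
        for det in detections:
            best_id, best_iou = None, self.iou_threshold
            for track_id, prev in self.tracks.items():
                score = iou(det, prev)
                if score > best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                best_id = self._next_id
                self._next_id += 1
            self.tracks[best_id] = det
            results.append({"track_id": best_id, "bbox": det, "timestamp": timestamp})
        return results
```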
### 4) Action recognition module (`action_recognition`) (P0)
- Models: ST-GCN / TCN / lightweight 3D-CNN
- API: `classify(tracked_sequence) -> List[ActionEvent]` (track_id, start, end, label, score)
- Sliding-window inference and online updates (see the sketch below)
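A sketch of sliding-window inference, assuming the per-window model call is abstracted as a `classify_window` callable (a hypothetical stand-in for the real ST-GCN/TCN inference):

```python
from typing import Callable, List, Sequence


def sliding_window_classify(
    tracked_sequence: Sequence[dict],
    classify_window: Callable[[Sequence[dict]], dict],
    window_size: int = 30,
    stride: int = 15,
) -> List[dict]:
    """Run an action classifier over overlapping windows of a per-track pose sequence."""
    if not tracked_sequence:
        return []
    events = []
    for start in range(0, max(len(tracked_sequence) - window_size + 1, 1), stride):
        window = tracked_sequence[start:start + window_size]
        pred = classify_window(window)  # e.g. {"label": "walking", "score": 0.8}
        events.append({
            "start": window[0]["timestamp"],
            "end": window[-1]["timestamp"],
            "label": pred["label"],
            "score": pred["score"],
        })
    return events
```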
### 5) ASR module (`asr`) (P0)
- Backends: Whisper (small/medium), Kaldi and others, ONNX, etc. (a Whisper sketch follows below)
- API: `transcribe(audio_chunk) -> Transcript` (segments with timestamps and confidence)
- Speaker diarization and overlapping-speech handling (plug-in)
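A sketch of the Whisper-backed path using the openai-whisper package; the `transcribe` wrapper and its confidence mapping are illustrative assumptions, while `whisper.load_model` and the per-segment `start`/`end`/`text`/`avg_logprob` fields are part of Whisper's actual output:

```python
import whisper  # openai-whisper; ONNX/Kaldi backends would plug in behind the same wrapper


def transcribe(audio_path: str, model_size: str = "small") -> dict:
    """Return a Transcript-like dict: segments with timestamps and a rough confidence."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    segments = [
        {
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
            # Whisper exposes avg_logprob per segment; map it to a rough [0, 1] confidence.
            "confidence": float(min(1.0, max(0.0, 1.0 + seg["avg_logprob"]))),
        }
        for seg in result["segments"]
    ]
    return {"segments": segments, "language": result.get("language")}
```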
### 6) Event fusion engine (`fusion`) (P0)
- Timeline synchronization: merge video events and ASR segments onto a common timeline (see the sketch below)
- Trigger extraction: extraction of important events (thresholds, rules, ML-based)
- Output format: timeline, summary_candidates, metrics
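A minimal sketch of timeline synchronization and threshold-based trigger extraction, assuming video events and ASR segments are plain dicts with `start`/`timestamp` and `score`/`confidence` fields; the shipped EventFusionEngine is richer:

```python
from typing import Dict, List


def fuse_timeline(video_events: List[Dict], asr_segments: List[Dict]) -> List[Dict]:
    """Merge video events and ASR segments into one timeline sorted by start time."""
    timeline = []
    for ev in video_events:
        timeline.append({**ev, "type": "vision",
                         "start": ev.get("start", ev.get("timestamp", 0.0))})
    for seg in asr_segments:
        timeline.append({**seg, "type": "asr", "start": seg["start"]})
    return sorted(timeline, key=lambda item: item["start"])


def extract_triggers(timeline: List[Dict], min_score: float = 0.7) -> List[Dict]:
    """Simple threshold-based trigger extraction over the fused timeline."""
    return [item for item in timeline
            if item.get("score", item.get("confidence", 0.0)) >= min_score]
```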
### 7) Summary generation (`summary`) (P1)
- Lightweight template summarization plus an LLM-based summarization plugin
- API: `generate_summary(timeline, transcript) -> str` (multilingual support)
### 8) Persistent job store & workers (P0)
- Highly reliable job store: Redis primary, SQLite persistent fallback, JSON fallback
- Worker pool: recovery of in-flight jobs on restart, job priorities, retry policy
- Monitoring: queue length, worker health, processing_latency
### 9) Backend management & plugin framework (P0)
- Backend registration API: `register_backend(name, capabilities)` (see the registry sketch below)
- Runtime switching: GPU/CPU/FP16/INT8 selection, model cache
- Health checks and automatic failover
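A sketch of what such a registry could look like; the extra `factory` argument and the `select_backend` helper are illustrative additions, not the shipped backends.py API:

```python
from typing import Any, Callable, Dict

_BACKENDS: Dict[str, Dict[str, Any]] = {}


def register_backend(name: str, capabilities: Dict[str, Any],
                     factory: Callable[[], Any]) -> None:
    """Register a runtime backend with its declared capabilities (gpu, fp16, int8, ...)."""
    _BACKENDS[name] = {"capabilities": capabilities, "factory": factory}


def select_backend(require_gpu: bool = False) -> Any:
    """Pick the first registered backend matching the requested capabilities."""
    for name, entry in _BACKENDS.items():
        if require_gpu and not entry["capabilities"].get("gpu", False):
            continue
        return entry["factory"]()
    raise RuntimeError("no suitable backend registered")


# Example: register a CPU fallback backend
register_backend("cpu_fallback", {"gpu": False, "fp16": False}, factory=lambda: object())
```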
### 10) API layer (P0)
- Synchronous: `/api/video-analysis/analyze` (short batches)
- Asynchronous: `/api/video-analysis/submit`, `/api/video-analysis/status/{id}`, `/api/video-analysis/result/{id}` (see the FastAPI sketch below)
- Authentication: API key / OAuth2 / RBAC
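A minimal FastAPI sketch of the asynchronous job API shape listed above, using an in-memory dict in place of the persistent job store and omitting authentication; the endpoint paths match the spec, everything else is illustrative:

```python
import uuid
from typing import Dict

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()
_jobs: Dict[str, dict] = {}  # illustration only; production uses the persistent job store


@app.post("/api/video-analysis/submit")
async def submit(file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"status": "queued", "result": None}
    # In production the payload is pushed to the job queue and picked up by a worker.
    return {"job_id": job_id}


@app.get("/api/video-analysis/status/{job_id}")
async def status(job_id: str):
    job = _jobs.get(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="unknown job")
    return {"job_id": job_id, "status": job["status"]}


@app.get("/api/video-analysis/result/{job_id}")
async def result(job_id: str):
    job = _jobs.get(job_id)
    if job is None or job["status"] != "done":  # a worker would set "done" and fill result
        raise HTTPException(status_code=404, detail="result not ready")
    return job["result"]
```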
### 11) Tests & evaluation (P0)
- Unit tests: deterministic edge cases for each module
- Integration tests: end-to-end runs on small sample videos (mAP / IDF1 / WER)
- Performance tests: throughput, latency, scaling
- Data quality tests: ASR WER report, IDF1
### 12) Privacy & security (P0)
- Face/voice anonymization filter (optional)
- Encryption at rest (DataAtRestEncryption), TLS in transit
- Consent/deletion API, log minimization
### 13) Deployment & orchestration (P1)
- Docker + Kubernetes manifests, GPU node configuration
- Metrics exported for the Horizontal Pod Autoscaler (custom metrics)
- Canary/blue-green deployment strategy, model hot swap
### 14) Visualization / frontend (P1)
- Timeline visualization, track visualization, summary view
- Real-time updates via WebSocket or SSE
## Acceptance criteria (examples)
- Unit test coverage: at least 90% per module (critical logic)
- Integration E2E: pipeline succeeds on the sample video, IDF1 >= 0.60 (simple data), WER <= 0.30 (low noise)
- Latency: P95 < 2 s with the lightweight fallback, P95 < 300 ms in a GPU environment
- Security: data at rest encrypted with AES-256-GCM, deletion API works on request
## Phased implementation plan (short term to medium term)
- Phase 1 (2 weeks): basic API, job queue, pose fallback, job store, unit tests
- Phase 2 (3 weeks): tracking, simple action_recognition, ASR fallback, fusion, simple summary
- Phase 3 (4 weeks): full model integration (MoveNet/Whisper/ST-GCN), GPU optimization, performance tuning
- Phase 4 (ongoing): privacy hardening, K8s deployment, monitoring/SLO attainment
## Development/CI requirements
- Add optional extras to requirements.txt: `video[torch]`, `onnxruntime-gpu`, `whisper`, `opencv-python-headless`, `redis`, `fastapi[all]`, etc.
- CI: label tests that require a GPU; always run the tests with the CPU fallback
## Document placement
- Specification: Docs/VIDEO_AUDIO_ANALYSIS_SPEC.md (this file)
- Implementation checklist: docs/implementation/video_analysis_checklist.md (tracks implementation progress and the corresponding files)
## Implementation status (updated 2026-04-24)
### List of implemented modules
All of the following modules are implemented under `evospikenet/video_analysis/`.
| Module | File | Phase | Status |
|---|---|---|---|
| Pose estimation (fallback) | pose.py | Phase 1 | ✅ Completed |
| Tracking (IoU) | tracking.py | Phase 1 | ✅ Completed |
| Action recognition (position difference) | action_recognition.py | Phase 1 | ✅ Completed |
| ASR wrapper | asr.py + asr_policy.py | Phase 1 | ✅ Completed (fallback) |
| Event fusion | fusion.py | Phase 1 | ✅ Completed |
| Job queue | job_queue.py | Phase 1 | ✅ Completed |
| Backend management | backends.py + backends_real.py | Phase 2 | ✅ Completed |
| Celery worker | worker.py | Phase 2 | ✅ Completed |
| Privacy processing | privacy.py | Phase 2 | ✅ Completed |
| Evaluation metrics | metrics.py + runtime_metrics.py | Phase 2 | ✅ Completed |
| Unified event schema | event_schema.py | Phase 3 | ✅ New 2026-04-24 |
| Shot boundary detection | shot_boundary_detector.py | Phase 3 | ✅ New 2026-04-24 |
| VAD / speaker separation | vad.py | Phase 3 | ✅ New 2026-04-24 |
| Monocular depth estimation | depth_estimation.py | Phase 3 | ✅ New 2026-04-24 |
| Spatial relation extraction | spatial_relations.py | Phase 3 | ✅ New 2026-04-24 |
| Narrative generation | narrative_generator.py | Phase 3 | ✅ New 2026-04-24 |
| Temporal action localization | temporal_action_localizer.py | Phase 3 | ✅ New 2026-04-24 |
| Pipeline integration | pipeline.py (enable_extended) | Phase 3 | ✅ Updated 2026-04-24 |
### New module details
#### event_schema.py — Unified event schema
- `Actor(id, type, attributes, bbox, keypoints, track_id)`
- `SpatialFrame(type, coords, depth_m, relation_desc)`
- `EventEvidence(detection_ids, transcript_snippet, frame_indices, pose_refs)`
- `Event` — dict-based factory: `from_vision_detection(det, timestamp)`, `from_asr_segment(seg)`, `from_scene_change(frame_idx, timestamp)`
- `AnalysisTimeline` — `sorted_events()`, `filter_by_type()`, `filter_by_confidence()`, `to_dict()`
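Illustrative usage of the schema; the detection/segment dict shapes and the `add()` call on `AnalysisTimeline` are assumptions, only the factory and filter method names come from the list above:

```python
from evospikenet.video_analysis.event_schema import AnalysisTimeline, Event

# Build events from a vision detection and an ASR segment (dict shapes are illustrative)
det = {"track_id": 1, "label": "walking", "confidence": 0.82, "bbox": [10, 20, 110, 220]}
seg = {"start": 1.2, "end": 3.4, "text": "hello", "confidence": 0.9}

timeline = AnalysisTimeline()
timeline.add(Event.from_vision_detection(det, timestamp=1.0))  # add() is assumed; the real insertion API may differ
timeline.add(Event.from_asr_segment(seg))

high_conf = timeline.filter_by_confidence(0.5)
print(timeline.sorted_events())
print(timeline.to_dict())
```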
#### shot_boundary_detector.py — Shot boundary detection
- Method: histogram difference / pixel difference / combined (selectable)
- `ShotBoundaryDetector(threshold, method, min_scene_len).detect(frames, fps) → [(frame_idx, timestamp), ...]`
- `detect_scores()` additionally returns the per-boundary scores
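Illustrative usage on synthetic frames; the `method="histogram"` string and the exact boundary output are assumptions based on the signature above:

```python
import numpy as np
from evospikenet.video_analysis.shot_boundary_detector import ShotBoundaryDetector

# Two synthetic "shots": 20 dark frames followed by 20 bright frames
dark = [np.zeros((120, 160, 3), dtype=np.uint8) for _ in range(20)]
bright = [np.full((120, 160, 3), 220, dtype=np.uint8) for _ in range(20)]
frames = dark + bright

detector = ShotBoundaryDetector(threshold=0.5, method="histogram", min_scene_len=5)
boundaries = detector.detect(frames, fps=10.0)
print(boundaries)  # expected: one boundary near frame 20, e.g. [(20, 2.0)]
```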
#### vad.py — VAD / speaker separation
- `VoiceActivityDetector(energy_threshold, frame_duration_ms, merge_gap_s, min_speech_s)`
- `detect(audio_wave, sample_rate) → [{"start", "end", "energy"}, ...]`
- `SpeakerDiarizer(max_speakers)` — hook for a pyannote.audio replacement
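Illustrative usage on synthetic audio; the parameter values are arbitrary and the exact segment boundaries depend on the implementation:

```python
import numpy as np
from evospikenet.video_analysis.vad import VoiceActivityDetector

sample_rate = 16000
silence = np.zeros(sample_rate, dtype=np.float32)                     # 1 s of silence
t = np.arange(sample_rate * 2) / sample_rate
speech_like = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)  # 2 s of tone as a stand-in for speech
audio_wave = np.concatenate([silence, speech_like, silence])

vad = VoiceActivityDetector(energy_threshold=0.05, frame_duration_ms=30,
                            merge_gap_s=0.2, min_speech_s=0.3)
segments = vad.detect(audio_wave, sample_rate)
print(segments)  # expected: roughly [{"start": ~1.0, "end": ~3.0, "energy": ...}]
```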
#### depth_estimation.py — Monocular depth estimation
- Fallback: intensity-gradient-based pseudo-depth map
- Production hook: `torch.hub.load("intel-isl/MiDaS", "MiDaS_small")`
- `DepthEstimator.estimate(frame) → {"depth_map": ndarray(H, W), "depth_at_center": float, "backend": str}`
- `estimate_depth_at_bbox(frame, bbox) → float`
#### spatial_relations.py — Spatial relation extraction
- `get_spatial_relation(obj_a, obj_b, depth_a, depth_b) → str`
- Return values: `left_of / right_of / above / below / in_front_of / behind / near / touching / unknown`
- `extract_pairwise_relations(detections, depth_maps) → [{"subject_id", "object_id", "relation", "confidence"}, ...]`
- `MotionVectorEstimator` — adds a velocity vector and direction from the track history
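Illustrative usage; the detection dict shape (`id`, `bbox`) passed to `get_spatial_relation` is an assumption based on the signature above:

```python
from evospikenet.video_analysis.spatial_relations import get_spatial_relation

# Two detections: a person on the left, a chair on the right (bbox = x1, y1, x2, y2; shape assumed)
person = {"id": 0, "bbox": [40, 80, 120, 300]}
chair = {"id": 1, "bbox": [200, 150, 320, 300]}

relation = get_spatial_relation(person, chair, depth_a=2.1, depth_b=2.0)
print(relation)  # expected: "left_of" (or "near", depending on thresholds)
```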
#### narrative_generator.py — Narrative generation
- `NarrativeGenerator(use_lm, max_events, min_confidence).generate(timeline) → str`
- Template mode (default): `"At {ts}s, {actor} is detected {action} (confidence {conf}%)"`
- LM mode: calls the EvoSpikeNet LM via `NeuralLanguageAdapter`; falls back to the template on failure
- `generate_shot_summary(shot_start, shot_end, events, relations) → str`
#### temporal_action_localizer.py — Temporal action localization
- `TemporalActionLocalizer(window_size, stride, min_confidence, nms_overlap_threshold)`
- `localize(frames, fps) → [{"start", "end", "action", "confidence"}, ...]`
- `localize_with_poses(frames, pose_results, fps) → [...]` — based on keypoint changes
- Deduplication via temporal IoU NMS
### VideoAnalysisPipeline updates
```python
from evospikenet.video_analysis.pipeline import VideoAnalysisPipeline

pipeline = VideoAnalysisPipeline(
    pose_backend="fallback",
    action_backend="fallback",
    asr_backend="fallback",
    depth_backend="fallback",
)
result = pipeline.run(frames, audio_wave, fps=10.0, enable_extended=True)
# The following keys are added to result:
#   shot_boundaries, vad_segments, temporal_segments,
#   depth_sample, spatial_relations, narrative
```
### Test status
| Test file | Number of tests | Status |
|---|---|---|
| tests/unit/test_video_analysis_new_modules.py | 62 | ✅ All passed |
| tests/unit/test_video_analysis_components.py | 15 | ✅ All passed |
| tests/unit/test_video_analysis_metrics.py | 1 | ✅ Passed |
| tests/unit/test_video_analysis_backends.py | 1 | ✅ Passed |
| Total | 82 | ✅ |
## Guide to switching to the EvoSpikeNet LM backend (2026-04-24)
The pipeline has two independent LM paths.
### Path A: lm_summary (summary of the entire pipeline)
Produces the `lm_summary` key returned by `VideoAnalysisPipeline.run()`.
Stack: `EvoLMBackend` → `AutoModelSelector.get_model("text")` → `SpikingEvoTextLM`.
#### Control methods
| Method | Setting example | Effect |
|---|---|---|
| Constructor argument | `VideoAnalysisPipeline(lm_backend="evospikenet_lm")` | Default; uses SpikingEvoTextLM |
| Disable via environment variable | `VIDEO_ANALYSIS_ENABLE_LM=0` | `lm_summary=""`, `lm_backend=None` |
| Enable via environment variable | `VIDEO_ANALYSIS_ENABLE_LM=1` (default) | Loads the LM and generates summaries |
```python
import os

os.environ["VIDEO_ANALYSIS_ENABLE_LM"] = "1"  # or "0" to disable

from evospikenet.video_analysis.pipeline import VideoAnalysisPipeline

pipeline = VideoAnalysisPipeline(lm_backend="evospikenet_lm")
result = pipeline.run(frames, audio_wave, fps=25.0)
print(result["lm_summary"])  # e.g., "A person is walking."
print(result["lm_backend"])  # "evospikenet_lm"
```
#### Backend stack (Path A)
```text
pipeline.run()
└─ EvoLMBackend.generate(prompt, max_new_tokens=40, temperature=0.6)
   └─ AutoModelSelector.get_model("text")
      └─ SpikingEvoTextLM.generate(input_ids, max_new_tokens)
```
### Path B: narrative (chronological narrative generation)
Produces the `narrative` key when `pipeline.run(enable_extended=True)` is called.
Stack: `NarrativeGenerator` → `_LMBridge` → `EvoLMBackend` → `SpikingEvoTextLM`.
#### Operation flow
```text
NarrativeGenerator.generate(timeline)
├─ Assemble template sentences (precomputed)
├─ If use_lm=True: _LMBridge.generate(prompt, max_new_tokens=200)
│  └─ EvoLMBackend.generate(prompt, max_new_tokens=200)
│     └─ SpikingEvoTextLM.generate(...)
└─ If the LM output is shorter than 20 chars or None -> template fallback
```
#### Control methods
| Method | Setting example | Effect |
|---|---|---|
| Default (recommended) | `NarrativeGenerator(use_lm=True)` | LM preferred, template on failure |
| Disable LM | `NarrativeGenerator(use_lm=False)` | Always template |
| Via pipeline | `VideoAnalysisPipeline()` | `use_lm=True` is set by default |
```python
from evospikenet.video_analysis.narrative_generator import NarrativeGenerator

# Generate a narrative using the LM
gen = NarrativeGenerator(use_lm=True, max_events=15, min_confidence=0.3)
text = gen.generate(timeline)
print(text)

# Shot-based summary
shot_text = gen.generate_shot_summary(
    shot_start=0.0, shot_end=5.0,
    events=timeline.filter_by_confidence(0.5),
    relations=[{"subject_id": 0, "object_id": 1, "relation": "left_of", "confidence": 0.8}],
)
```
### Checking backend availability
```python
from evospikenet.video_analysis.backends import get_backend_status

status = get_backend_status()
print(status["lm"])
# {'evospikenet_lm': {'available': True, 'tier': 'real',
#   'note': 'local SpikingEvo models via AutoModelSelector'}}
```
### Common problems and solutions
| Symptom | Cause | Remedy |
|---|---|---|
| `lm_summary` is an empty string | `VIDEO_ANALYSIS_ENABLE_LM=0` is set | Set the environment variable to `1` |
| `narrative` falls back to the template text | `_LMBridge` failed to load | Verify the torch installation (`pip install torch`) |
| `lm_backend` is `None` | `EvoLMBackend` initialization failed (missing dependency) | `pip install torch transformers` |
| LM output is too short or incoherent | `SpikingEvoTextLM` is untrained | Replace the `AutoModelSelector` weights with a trained model |
### Weight replacement and model customization
`SpikingEvoTextLM` is loaded via `AutoModelSelector.get_model("text")`.
To use trained weights, pre-train with `tools/train_spiking_lm.py`:
```bash
# Training
python tools/train_spiking_lm.py --node-type text --epochs 10

# Inference check
python -c "
from evospikenet.llm_backend import EvoLMBackend
lm = EvoLMBackend(task_type='text')
print(lm.generate('Events: walking:3. Please summarize in one sentence.', max_new_tokens=40))
"
```
## Remaining tasks (Phase 4)
| Function | Priority | Responsible module |
|---|---|---|
| ByteTrack/DeepSORT production MOT | High | tracking.py / backends.py |
| MediaPipe/OpenPose production pose estimation | High | MoveNetRealPoseBackend |
| TimeSformer/SlowFast real models | High | STGCNRealActionBackend |
| pyannote.audio speaker diarization | Medium | vad.py (SpeakerDiarizer) |
| SAM/Mask R-CNN segmentation | Medium | New module planned |
| TensorRT/ONNX GPU optimization | Medium | backends.py |
| WebSocket real-time API | Low | video_analysis_api.py |
| Annotation UI | Low | Separate tool |
Created: 2026-04-19 / Updated: 2026-04-24 / Author: Engineering Team