# Video/Audio Analysis Platform: Complete Implementation Specification
> [!NOTE]
> For the latest implementation status, refer to "Functional Implementation Status (Remaining Functionality)".
Purpose: This document specifies the complete implementation of the multimodal (video + audio) time-series analysis infrastructure in evospikenet. It covers requirements, architecture, module decomposition, performance goals, test criteria, deployment requirements, privacy/security requirements, and a phased implementation plan aimed at production use rather than a PoC.
Target Audience: Developers, ML Engineers, SREs, Product Owners
## Requirements overview
- Processing target: input video (frame sequences) of one or more people plus the audio track recorded with it (WAV). Both real-time (streaming) and batch processing are supported.
- Output: time-series events (track ID, timestamp, action label, confidence), ASR subtitles (with timestamps), a fused multimodal timeline, and a natural-language summary (short text).
- Availability: 99.5% (service SLA); scalability: horizontal scaling to N nodes.
- Latency goals (synchronous API, small inputs):
  - P95 < 2 s with lightweight/fallback backends
  - P95 < 300 ms with GPU acceleration (small batches)
## Non-functional requirements
- Modularity: each stage (pose/tracking/action/asr/fusion/summary) is a plug-in whose backend can be swapped out.
- Resource management: GPU/CPU/memory allocation control, batching, back-pressure support.
- Fault tolerance: job retries, requeue of in-flight jobs, persistent store (Redis/SQLite/JSON fallback).
- Observability: metrics (processing latency, throughput, success rate), traces, structured logs.
- Privacy: optional anonymization of faces and personally identifiable information, data retention policy, consent management.
## Architecture overview
- Input layer: upload/stream ingestion (HTTP/gRPC); metadata is attached at ingest time
- Worker layer: job queue (Redis/Celery or VideoAnalysisJobStore) + worker group (synchronous/asynchronous)
- Backend layer: Model runtime plugin (Torch/TensorRT/ONNX Runtime/Python fallback)
- Aggregation layer: EventFusionEngine + SummaryGenerator (LLM wrapper or template)
- API layer: synchronous analysis endpoint (/api/video-analysis/analyze), asynchronous job API (/api/video-analysis/submit/status/result)
- Storage: input artifacts (optional), result cache, job meta (Redis/SQLite)
## Main components (functional items)
Each item below targets complete implementation. Priority is indicated as P0/P1/P2.
### 1) Data acceptance & preprocessing (P0)
- HTTP/gRPC/multipart upload endpoint
- Streaming ingestion (chunked frames/audio)
- Input validation (resolution, sample rate, length limits)
- Preprocessing: frame normalization, audio normalization, sampling, cropping
### 2) Keypoint extraction module (`pose`) (P0)
- Backends: MoveNet / MediaPipe / ONNX MoveNet / TensorRT export
- API: `estimate(frame) -> List[Pose]` (timestamp, keypoints as dict `{name: [x, y, conf]}`); see the sketch below
- Requirements: batch and streaming support, optimized paths with and without GPU, deterministic fallback
- Performance: sustained throughput of 30 fps at 1080p (on GPU)
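For illustration, a minimal sketch of the plug-in contract implied by this API, assuming hypothetical names `Pose`, `PoseBackend`, and `CenterFallbackPoseBackend` (the shipped `pose.py` may differ):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol

import numpy as np


@dataclass
class Pose:
    """One detected person: timestamp plus named keypoints [x, y, conf]."""
    timestamp: float
    keypoints: Dict[str, List[float]] = field(default_factory=dict)


class PoseBackend(Protocol):
    """Plug-in contract; MoveNet/MediaPipe/ONNX/TensorRT backends would implement this."""

    def estimate(self, frame: np.ndarray) -> List[Pose]:
        ...


class CenterFallbackPoseBackend:
    """Deterministic CPU fallback: emits a single low-confidence pose at the frame centre."""

    def estimate(self, frame: np.ndarray) -> List[Pose]:
        h, w = frame.shape[:2]
        return [Pose(timestamp=0.0, keypoints={"nose": [w / 2.0, h / 2.0, 0.1]})]
```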
### 3) Tracking module (`tracking`) (P0)
- Algorithm: DeepSORT / IoU-based tracker (selectable); a minimal IoU-matching sketch follows below
- API: `update(detections, timestamp) -> List[Tracked]` (track_id, bbox, keypoints)
- Re-identification (ReID) plugin support
- Robustness: recovery from occlusion, short-term ID persistence
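A minimal sketch of the greedy IoU matching such a tracker can use; `GreedyIoUTracker` and its helpers are hypothetical and only illustrate the `update(detections, timestamp)` contract, not the shipped `tracking.py`:

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Assigns each detection to the best-overlapping existing track, else opens a new one."""

    def __init__(self, iou_threshold: float = 0.3):
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, BBox] = {}
        self._next_id = 0

    def update(self, detections: List[BBox], timestamp: float) -> List[dict]:
        results = []
        for det in detections:
            best_id, best_iou = None, self.iou_threshold
            for track_id, prev in self.tracks.items():
                score = iou(det, prev)
                if score > best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                best_id = self._next_id
                self._next_id += 1
            self.tracks[best_id] = det
            results.append({"track_id": best_id, "bbox": det, "timestamp": timestamp})
        return results
```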
### 4) Action recognition module (`action_recognition`) (P0)
- Models: ST-GCN / TCN / lightweight 3D-CNN
- API: `classify(tracked_sequence) -> List[ActionEvent]` (track_id, start, end, label, score)
- Sliding-window inference and online updates (see the sketch below)
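A sketch of sliding-window inference, assuming the per-window model call is abstracted as a `classify_window` callable (a hypothetical stand-in for the real ST-GCN/TCN inference):

```python
from typing import Callable, List, Sequence


def sliding_window_classify(
    tracked_sequence: Sequence[dict],
    classify_window: Callable[[Sequence[dict]], dict],
    window_size: int = 30,
    stride: int = 15,
) -> List[dict]:
    """Run an action classifier over overlapping windows of a per-track pose sequence."""
    if not tracked_sequence:
        return []
    events = []
    for start in range(0, max(len(tracked_sequence) - window_size + 1, 1), stride):
        window = tracked_sequence[start:start + window_size]
        pred = classify_window(window)  # e.g. {"label": "walking", "score": 0.8}
        events.append({
            "start": window[0]["timestamp"],
            "end": window[-1]["timestamp"],
            "label": pred["label"],
            "score": pred["score"],
        })
    return events
```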
### 5) ASR module (`asr`) (P0)
- Backends: Whisper (small/medium), Kaldi and others, ONNX, etc. (a Whisper sketch follows below)
- API: `transcribe(audio_chunk) -> Transcript` (segments with timestamps and confidence)
- Speaker diarization and overlapping-speech handling (plug-in)
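A sketch of the Whisper-backed path using the openai-whisper package; the `transcribe` wrapper and its confidence mapping are illustrative assumptions, while `whisper.load_model` and the per-segment `start`/`end`/`text`/`avg_logprob` fields are part of Whisper's actual output:

```python
import whisper  # openai-whisper; ONNX/Kaldi backends would plug in behind the same wrapper


def transcribe(audio_path: str, model_size: str = "small") -> dict:
    """Return a Transcript-like dict: segments with timestamps and a rough confidence."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    segments = [
        {
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
            # Whisper exposes avg_logprob per segment; map it to a rough [0, 1] confidence.
            "confidence": float(min(1.0, max(0.0, 1.0 + seg["avg_logprob"]))),
        }
        for seg in result["segments"]
    ]
    return {"segments": segments, "language": result.get("language")}
```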
### 6) Event fusion engine (`fusion`) (P0)
- Timeline synchronization: merge video events and ASR segments onto a common timeline (see the sketch below)
- Trigger extraction: extraction of important events (thresholds, rules, ML-based)
- Output format: timeline, summary_candidates, metrics
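A minimal sketch of timeline synchronization and threshold-based trigger extraction, assuming video events and ASR segments are plain dicts with `start`/`timestamp` and `score`/`confidence` fields; the shipped EventFusionEngine is richer:

```python
from typing import Dict, List


def fuse_timeline(video_events: List[Dict], asr_segments: List[Dict]) -> List[Dict]:
    """Merge video events and ASR segments into one timeline sorted by start time."""
    timeline = []
    for ev in video_events:
        timeline.append({**ev, "type": "vision",
                         "start": ev.get("start", ev.get("timestamp", 0.0))})
    for seg in asr_segments:
        timeline.append({**seg, "type": "asr", "start": seg["start"]})
    return sorted(timeline, key=lambda item: item["start"])


def extract_triggers(timeline: List[Dict], min_score: float = 0.7) -> List[Dict]:
    """Simple threshold-based trigger extraction over the fused timeline."""
    return [item for item in timeline
            if item.get("score", item.get("confidence", 0.0)) >= min_score]
```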
### 7) Summary generation (`summary`) (P1)
- Lightweight template summarization plus an LLM-based summarization plugin
- API: `generate_summary(timeline, transcript) -> str` (multilingual support)
### 8) Persistent job store & workers (P0)
- Highly reliable job store: Redis primary, SQLite persistent fallback, JSON fallback
- Worker pool: recovery of in-flight jobs on restart, job priorities, retry policy
- Monitoring: queue length, worker health, processing_latency
### 9) Backend management & plugin framework (P0)
- Backend registration API: `register_backend(name, capabilities)` (see the registry sketch below)
- Runtime switching: GPU/CPU/FP16/INT8 selection, model cache
- Health checks and automatic failover
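A sketch of what such a registry could look like; the extra `factory` argument and the `select_backend` helper are illustrative additions, not the shipped backends.py API:

```python
from typing import Any, Callable, Dict

_BACKENDS: Dict[str, Dict[str, Any]] = {}


def register_backend(name: str, capabilities: Dict[str, Any],
                     factory: Callable[[], Any]) -> None:
    """Register a runtime backend with its declared capabilities (gpu, fp16, int8, ...)."""
    _BACKENDS[name] = {"capabilities": capabilities, "factory": factory}


def select_backend(require_gpu: bool = False) -> Any:
    """Pick the first registered backend matching the requested capabilities."""
    for name, entry in _BACKENDS.items():
        if require_gpu and not entry["capabilities"].get("gpu", False):
            continue
        return entry["factory"]()
    raise RuntimeError("no suitable backend registered")


# Example: register a CPU fallback backend
register_backend("cpu_fallback", {"gpu": False, "fp16": False}, factory=lambda: object())
```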
### 10) API layer (P0)
- Synchronous: `/api/video-analysis/analyze` (short batches)
- Asynchronous: `/api/video-analysis/submit`, `/api/video-analysis/status/{id}`, `/api/video-analysis/result/{id}` (see the FastAPI sketch below)
- Authentication: API key / OAuth2 / RBAC
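A minimal FastAPI sketch of the asynchronous job API shape listed above, using an in-memory dict in place of the persistent job store and omitting authentication; the endpoint paths match the spec, everything else is illustrative:

```python
import uuid
from typing import Dict

from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()
_jobs: Dict[str, dict] = {}  # illustration only; production uses the persistent job store


@app.post("/api/video-analysis/submit")
async def submit(file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"status": "queued", "result": None}
    # In production the payload is pushed to the job queue and picked up by a worker.
    return {"job_id": job_id}


@app.get("/api/video-analysis/status/{job_id}")
async def status(job_id: str):
    job = _jobs.get(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="unknown job")
    return {"job_id": job_id, "status": job["status"]}


@app.get("/api/video-analysis/result/{job_id}")
async def result(job_id: str):
    job = _jobs.get(job_id)
    if job is None or job["status"] != "done":  # a worker would set "done" and fill result
        raise HTTPException(status_code=404, detail="result not ready")
    return job["result"]
```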
### 11) Tests & evaluation (P0)
- Unit tests: deterministic edge cases for each module
- Integration tests: end-to-end runs on small sample videos (mAP / IDF1 / WER)
- Performance tests: throughput, latency, scaling
- Data quality tests: ASR WER report, IDF1
### 12) Privacy & security (P0)
- Face/voice anonymization filter (optional)
- Encryption at rest (DataAtRestEncryption), TLS in transit
- Consent/deletion API, log minimization
### 13) Deployment & orchestration (P1)
- Docker + Kubernetes manifests, GPU node configuration
- Metrics exported for the Horizontal Pod Autoscaler (custom metrics)
- Canary/blue-green deployment strategy, model hot swap
### 14) Visualization / frontend (P1)
- Timeline visualization, track visualization, summary view
- Real-time updates via WebSocket or SSE
## Acceptance criteria (examples)
- Unit test coverage: at least 90% per module (critical logic)
- Integration E2E: pipeline succeeds on the sample video, IDF1 >= 0.60 (simple data), WER <= 0.30 (low noise)
- Latency: P95 < 2 s with the lightweight fallback, P95 < 300 ms in a GPU environment
- Security: data at rest encrypted with AES-256-GCM, deletion API works on request
## Phased implementation plan (short term to medium term)
- Phase 1 (2 weeks): basic API, job queue, pose fallback, job store, unit tests
- Phase 2 (3 weeks): tracking, simple action_recognition, ASR fallback, fusion, simple summary
- Phase 3 (4 weeks): full model integration (MoveNet/Whisper/ST-GCN), GPU optimization, performance tuning
- Phase 4 (ongoing): privacy hardening, K8s deployment, monitoring/SLO attainment
## Development/CI requirements
- Add optional extras to requirements.txt: `video[torch]`, `onnxruntime-gpu`, `whisper`, `opencv-python-headless`, `redis`, `fastapi[all]`, etc.
- CI: label tests that require a GPU; always run the tests with the CPU fallback
## Document placement
- Specification: Docs/VIDEO_AUDIO_ANALYSIS_SPEC.md (this file)
- Implementation checklist: docs/implementation/video_analysis_checklist.md (tracks implementation progress and the corresponding files)
## Implementation status (updated 2026-04-24)
### List of implemented modules
All of the following modules are implemented under `evospikenet/video_analysis/`.
| Module | File | Phase | Status |
|---|---|---|---|
| Pose estimation (fallback) | pose.py | Phase 1 | ✅ Completed |
| Tracking (IoU) | tracking.py | Phase 1 | ✅ Completed |
| Action recognition (position difference) | action_recognition.py | Phase 1 | ✅ Completed |
| ASR wrapper | asr.py + asr_policy.py | Phase 1 | ✅ Completed (fallback) |
| Event fusion | fusion.py | Phase 1 | ✅ Completed |
| Job queue | job_queue.py | Phase 1 | ✅ Completed |
| Backend management | backends.py + backends_real.py | Phase 2 | ✅ Completed |
| Celery worker | worker.py | Phase 2 | ✅ Completed |
| Privacy processing | privacy.py | Phase 2 | ✅ Completed |
| Evaluation metrics | metrics.py + runtime_metrics.py | Phase 2 | ✅ Completed |
| Unified event schema | event_schema.py | Phase 3 | ✅ New 2026-04-24 |
| Shot boundary detection | shot_boundary_detector.py | Phase 3 | ✅ New 2026-04-24 |
| VAD / speaker separation | vad.py | Phase 3 | ✅ New 2026-04-24 |
| Monocular depth estimation | depth_estimation.py | Phase 3 | ✅ New 2026-04-24 |
| Spatial relation extraction | spatial_relations.py | Phase 3 | ✅ New 2026-04-24 |
| Narrative generation | narrative_generator.py | Phase 3 | ✅ New 2026-04-24 |
| Temporal action localization | temporal_action_localizer.py | Phase 3 | ✅ New 2026-04-24 |
| Pipeline integration | pipeline.py (enable_extended) | Phase 3 | ✅ Updated 2026-04-24 |
### New module details
#### event_schema.py — Unified event schema
- `Actor(id, type, attributes, bbox, keypoints, track_id)`
- `SpatialFrame(type, coords, depth_m, relation_desc)`
- `EventEvidence(detection_ids, transcript_snippet, frame_indices, pose_refs)`
- `Event` — dict-based factory: `from_vision_detection(det, timestamp)`, `from_asr_segment(seg)`, `from_scene_change(frame_idx, timestamp)`
- `AnalysisTimeline` — `sorted_events()`, `filter_by_type()`, `filter_by_confidence()`, `to_dict()`
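Illustrative usage of the schema; the detection/segment dict shapes and the `add()` call on `AnalysisTimeline` are assumptions, only the factory and filter method names come from the list above:

```python
from evospikenet.video_analysis.event_schema import AnalysisTimeline, Event

# Build events from a vision detection and an ASR segment (dict shapes are illustrative)
det = {"track_id": 1, "label": "walking", "confidence": 0.82, "bbox": [10, 20, 110, 220]}
seg = {"start": 1.2, "end": 3.4, "text": "hello", "confidence": 0.9}

timeline = AnalysisTimeline()
timeline.add(Event.from_vision_detection(det, timestamp=1.0))  # add() is assumed; the real insertion API may differ
timeline.add(Event.from_asr_segment(seg))

high_conf = timeline.filter_by_confidence(0.5)
print(timeline.sorted_events())
print(timeline.to_dict())
```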
#### shot_boundary_detector.py — Shot boundary detection
- Method: histogram difference / pixel difference / combined (selectable)
- `ShotBoundaryDetector(threshold, method, min_scene_len).detect(frames, fps) → [(frame_idx, timestamp), ...]`
- `detect_scores()` additionally returns the per-boundary scores
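Illustrative usage on synthetic frames; the `method="histogram"` string and the exact boundary output are assumptions based on the signature above:

```python
import numpy as np
from evospikenet.video_analysis.shot_boundary_detector import ShotBoundaryDetector

# Two synthetic "shots": 20 dark frames followed by 20 bright frames
dark = [np.zeros((120, 160, 3), dtype=np.uint8) for _ in range(20)]
bright = [np.full((120, 160, 3), 220, dtype=np.uint8) for _ in range(20)]
frames = dark + bright

detector = ShotBoundaryDetector(threshold=0.5, method="histogram", min_scene_len=5)
boundaries = detector.detect(frames, fps=10.0)
print(boundaries)  # expected: one boundary near frame 20, e.g. [(20, 2.0)]
```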
#### vad.py — VAD / speaker separation
- `VoiceActivityDetector(energy_threshold, frame_duration_ms, merge_gap_s, min_speech_s)`
- `detect(audio_wave, sample_rate) → [{"start", "end", "energy"}, ...]`
- `SpeakerDiarizer(max_speakers)` — hook for a pyannote.audio replacement
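Illustrative usage on synthetic audio; the parameter values are arbitrary and the exact segment boundaries depend on the implementation:

```python
import numpy as np
from evospikenet.video_analysis.vad import VoiceActivityDetector

sample_rate = 16000
silence = np.zeros(sample_rate, dtype=np.float32)                     # 1 s of silence
t = np.arange(sample_rate * 2) / sample_rate
speech_like = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)  # 2 s of tone as a stand-in for speech
audio_wave = np.concatenate([silence, speech_like, silence])

vad = VoiceActivityDetector(energy_threshold=0.05, frame_duration_ms=30,
                            merge_gap_s=0.2, min_speech_s=0.3)
segments = vad.detect(audio_wave, sample_rate)
print(segments)  # expected: roughly [{"start": ~1.0, "end": ~3.0, "energy": ...}]
```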
#### depth_estimation.py — Monocular depth estimation
- Fallback: intensity-gradient-based pseudo-depth map
- Production hook: `torch.hub.load("intel-isl/MiDaS", "MiDaS_small")`
- `DepthEstimator.estimate(frame) → {"depth_map": ndarray(H, W), "depth_at_center": float, "backend": str}`
- `estimate_depth_at_bbox(frame, bbox) → float`
#### spatial_relations.py — Spatial relation extraction
- `get_spatial_relation(obj_a, obj_b, depth_a, depth_b) → str`
- Return values: `left_of / right_of / above / below / in_front_of / behind / near / touching / unknown`
- `extract_pairwise_relations(detections, depth_maps) → [{"subject_id", "object_id", "relation", "confidence"}, ...]`
- `MotionVectorEstimator` — adds a velocity vector and direction from the track history
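Illustrative usage; the detection dict shape (`id`, `bbox`) passed to `get_spatial_relation` is an assumption based on the signature above:

```python
from evospikenet.video_analysis.spatial_relations import get_spatial_relation

# Two detections: a person on the left, a chair on the right (bbox = x1, y1, x2, y2; shape assumed)
person = {"id": 0, "bbox": [40, 80, 120, 300]}
chair = {"id": 1, "bbox": [200, 150, 320, 300]}

relation = get_spatial_relation(person, chair, depth_a=2.1, depth_b=2.0)
print(relation)  # expected: "left_of" (or "near", depending on thresholds)
```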
#### narrative_generator.py — Narrative generation
- `NarrativeGenerator(use_lm, max_events, min_confidence).generate(timeline) → str`
- Template mode (default): `"At {ts}s, {actor} is detected {action} (confidence {conf}%)"`
- LM mode: calls the EvoSpikeNet LM via `NeuralLanguageAdapter`; falls back to the template on failure
- `generate_shot_summary(shot_start, shot_end, events, relations) → str`
#### temporal_action_localizer.py — Temporal action localization
- `TemporalActionLocalizer(window_size, stride, min_confidence, nms_overlap_threshold)`
- `localize(frames, fps) → [{"start", "end", "action", "confidence"}, ...]`
- `localize_with_poses(frames, pose_results, fps) → [...]` — based on keypoint changes
- Deduplication via temporal IoU NMS
### VideoAnalysisPipeline updates
```python
from evospikenet.video_analysis.pipeline import VideoAnalysisPipeline

pipeline = VideoAnalysisPipeline(
    pose_backend="fallback",
    action_backend="fallback",
    asr_backend="fallback",
    depth_backend="fallback",
)
result = pipeline.run(frames, audio_wave, fps=10.0, enable_extended=True)
# The following keys are added to result:
#   shot_boundaries, vad_segments, temporal_segments,
#   depth_sample, spatial_relations, narrative
```
### Test status
| Test file | Number of tests | Status |
|---|---|---|
| tests/unit/test_video_analysis_new_modules.py | 62 | ✅ All passed |
| tests/unit/test_video_analysis_components.py | 15 | ✅ All passed |
| tests/unit/test_video_analysis_metrics.py | 1 | ✅ Passed |
| tests/unit/test_video_analysis_backends.py | 1 | ✅ Passed |
| Total | 82 | ✅ |
## Guide to switching to the EvoSpikeNet LM backend (2026-04-24)
The pipeline has two independent LM paths.
### Path A: lm_summary (summary of the entire pipeline)
Produces the `lm_summary` key returned by `VideoAnalysisPipeline.run()`.
Stack: `EvoLMBackend` → `AutoModelSelector.get_model("text")` → `SpikingEvoTextLM`.
#### Control methods
| Method | Setting example | Effect |
|---|---|---|
| Constructor argument | `VideoAnalysisPipeline(lm_backend="evospikenet_lm")` | Default; uses SpikingEvoTextLM |
| Disable via environment variable | `VIDEO_ANALYSIS_ENABLE_LM=0` | `lm_summary=""`, `lm_backend=None` |
| Enable via environment variable | `VIDEO_ANALYSIS_ENABLE_LM=1` (default) | Loads the LM and generates summaries |
```python
import os

os.environ["VIDEO_ANALYSIS_ENABLE_LM"] = "1"  # or "0" to disable

from evospikenet.video_analysis.pipeline import VideoAnalysisPipeline

pipeline = VideoAnalysisPipeline(lm_backend="evospikenet_lm")
result = pipeline.run(frames, audio_wave, fps=25.0)
print(result["lm_summary"])  # e.g., "A person is walking."
print(result["lm_backend"])  # "evospikenet_lm"
```
#### Backend stack (Path A)
```text
pipeline.run()
└─ EvoLMBackend.generate(prompt, max_new_tokens=40, temperature=0.6)
   └─ AutoModelSelector.get_model("text")
      └─ SpikingEvoTextLM.generate(input_ids, max_new_tokens)
```
### Path B: narrative (chronological narrative generation)
Produces the `narrative` key when `pipeline.run(enable_extended=True)` is called.
Stack: `NarrativeGenerator` → `_LMBridge` → `EvoLMBackend` → `SpikingEvoTextLM`.
#### Operation flow
```text
NarrativeGenerator.generate(timeline)
├─ Assemble template sentences (precomputed)
├─ If use_lm=True: _LMBridge.generate(prompt, max_new_tokens=200)
│  └─ EvoLMBackend.generate(prompt, max_new_tokens=200)
│     └─ SpikingEvoTextLM.generate(...)
└─ If the LM output is shorter than 20 chars or None -> template fallback
```
#### Control methods
| Method | Setting example | Effect |
|---|---|---|
| Default (recommended) | `NarrativeGenerator(use_lm=True)` | LM preferred, template on failure |
| Disable LM | `NarrativeGenerator(use_lm=False)` | Always template |
| Via pipeline | `VideoAnalysisPipeline()` | `use_lm=True` is set by default |
```python
from evospikenet.video_analysis.narrative_generator import NarrativeGenerator

# Generate a narrative using the LM
gen = NarrativeGenerator(use_lm=True, max_events=15, min_confidence=0.3)
text = gen.generate(timeline)
print(text)

# Shot-based summary
shot_text = gen.generate_shot_summary(
    shot_start=0.0, shot_end=5.0,
    events=timeline.filter_by_confidence(0.5),
    relations=[{"subject_id": 0, "object_id": 1, "relation": "left_of", "confidence": 0.8}],
)
```
### Checking backend availability
```python
from evospikenet.video_analysis.backends import get_backend_status

status = get_backend_status()
print(status["lm"])
# {'evospikenet_lm': {'available': True, 'tier': 'real',
#   'note': 'local SpikingEvo models via AutoModelSelector'}}
```
### Common problems and solutions
| Symptom | Cause | Remedy |
|---|---|---|
| `lm_summary` is an empty string | `VIDEO_ANALYSIS_ENABLE_LM=0` is set | Set the environment variable to `1` |
| `narrative` falls back to the template text | `_LMBridge` failed to load | Verify the torch installation (`pip install torch`) |
| `lm_backend` is `None` | `EvoLMBackend` initialization failed (missing dependency) | `pip install torch transformers` |
| LM output is too short or incoherent | `SpikingEvoTextLM` is untrained | Replace the `AutoModelSelector` weights with a trained model |
### Weight replacement and model customization
`SpikingEvoTextLM` is loaded via `AutoModelSelector.get_model("text")`.
To use trained weights, pre-train with `tools/train_spiking_lm.py`:
```bash
# Training
python tools/train_spiking_lm.py --node-type text --epochs 10

# Inference check
python -c "
from evospikenet.llm_backend import EvoLMBackend
lm = EvoLMBackend(task_type='text')
print(lm.generate('Events: walking:3. Please summarize in one sentence.', max_new_tokens=40))
"
```
## Remaining tasks (Phase 4)
| Function | Priority | Responsible module |
|---|---|---|
| ByteTrack/DeepSORT production MOT | High | tracking.py / backends.py |
| MediaPipe/OpenPose production pose estimation | High | MoveNetRealPoseBackend |
| TimeSformer/SlowFast real models | High | STGCNRealActionBackend |
| pyannote.audio speaker diarization | Medium | vad.py (SpeakerDiarizer) |
| SAM/Mask R-CNN segmentation | Medium | New module planned |
| TensorRT/ONNX GPU optimization | Medium | backends.py |
| WebSocket real-time API | Low | video_analysis_api.py |
| Annotation UI | Low | Separate tool |
Created: 2026-04-19 / Updated: 2026-04-24 / Author: Engineering Team