Tracking & action recognition model candidates and evaluation dataset
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
Purpose: This document summarizes production-candidate models, evaluation metrics, recommended datasets, and evaluation procedures for tracking (ID continuity) and action recognition.
## 1) Detection + tracking candidates

- Detection: YOLOv5/YOLOv8, YOLOX (high throughput)
- Tracker: DeepSORT, StrongSORT, ByteTrack, SORT (lightweight)
- Recommended combination: YOLOX + ByteTrack (high speed + ID continuity); see the pipeline sketch after this list
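To make the recommended YOLOX + ByteTrack combination concrete, here is a minimal per-frame loop. The Detector/Tracker wrappers are hypothetical interfaces, not the actual YOLOX or ByteTrack APIs; they only illustrate how detection feeds the tracker each frame.

```python
import numpy as np

class Detector:
    """Hypothetical detector wrapper (e.g. around YOLOX); not the real API."""
    def detect(self, frame: np.ndarray) -> list[dict]:
        # Returns [{"bbox": [x1, y1, x2, y2], "score": float, "class_id": int}, ...]
        raise NotImplementedError

class Tracker:
    """Hypothetical tracker wrapper (e.g. around ByteTrack); not the real API."""
    def update(self, detections: list[dict]) -> list[dict]:
        # Assigns a persistent track_id to each detection and returns the tracks
        raise NotImplementedError

def run_pipeline(frames, detector: Detector, tracker: Tracker):
    # Detect first, then let the tracker maintain ID continuity across frames
    for frame_idx, frame in enumerate(frames):
        tracks = tracker.update(detector.detect(frame))
        yield frame_idx, tracks
```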
Evaluation metrics:

- IDF1, MOTA, MOTP, Precision, Recall
- Implementation-level evaluation: occlusion recovery rate, ID persistence after short-term occlusion (see the sketch after this list)
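For the metrics above, py-motmetrics is one common implementation choice (an assumption here, not mandated by this doc). A minimal sketch accumulating one frame and computing IDF1/MOTA/MOTP:

```python
import motmetrics as mm
import numpy as np

acc = mm.MOTAccumulator(auto_id=True)

# One frame: ground-truth IDs, tracker IDs, and a GT-by-hypothesis distance matrix
gt_ids = [1, 2]
hyp_ids = [1, 2, 3]
distances = mm.distances.norm2squared_matrix(
    np.array([[10.0, 20.0], [30.0, 40.0]]),                # GT positions
    np.array([[10.5, 20.2], [29.0, 41.0], [80.0, 80.0]]),  # tracker positions
    max_d2=50.0,  # pairs farther apart than this are treated as non-matches
)
acc.update(gt_ids, hyp_ids, distances)  # call once per frame in real use

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["idf1", "mota", "motp", "precision", "recall"], name="demo")
print(summary)
```

Occlusion robustness can be probed with the same accumulator by checking whether a GT object regains its original hypothesis ID after frames in which it had no match.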
## 2) Action recognition candidates

- ST-GCN (skeleton-based)
- TCN (temporal convolution)
- I3D / SlowFast (RGB-based)
- TSN (lightweight batch evaluation)
Evaluation metrics:

- mAP, F1, Accuracy, Segmental F1 (a computation sketch for Segmental F1 follows this list)
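Segmental F1 scores predicted action segments against ground truth by temporal IoU rather than per-frame labels. A minimal sketch following the common definition (greedy one-to-one matching at a fixed IoU threshold); the (start, end, label) tuple format is an assumption:

```python
def segmental_f1(pred, gt, iou_threshold=0.5):
    """Segmental F1 over (start, end, label) tuples.

    A prediction counts as a true positive if it shares its label with a
    not-yet-matched ground-truth segment and their temporal IoU meets the
    threshold.
    """
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    matched, tp = set(), 0
    for p in pred:
        for i, g in enumerate(gt):
            if i not in matched and p[2] == g[2] and iou(p, g) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```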
## 3) Recommended datasets

- PoseTrack (pose + tracking) — tracking evaluation
- MOTChallenge (MOT17, MOT20) — ID/tracking evaluation
- AVA — spatio-temporal action detection
- Kinetics, UCF101, HMDB51 — action classification benchmarks
- JHMDB, THUMOS — action segmentation and detection
## 4) Evaluation procedure (integration)

1. Detector accuracy/latency evaluation (device: CPU/GPU)
2. Tracker integration evaluation (compute IDF1 etc. against fixed detection results)
3. Action recognition evaluated on segment-level accuracy, taking tracking results as input
4. E2E integration: measure latency and success rate across the detection → tracking → action recognition → ASR → fusion pipeline (a timing sketch follows this list)
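A sketch of the per-stage timing scaffold for step 4, reusing the hypothetical detector/tracker wrappers from section 1; the remaining stages would be wrapped the same way:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_latency = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_latency[stage].append(time.perf_counter() - start)

for frame in frames:  # frames, detector, tracker as in the section 1 sketch
    with timed("detection"):
        detections = detector.detect(frame)
    with timed("tracking"):
        tracks = tracker.update(detections)
# ... wrap action recognition, ASR, and fusion stages identically

for stage, samples in stage_latency.items():
    print(f"{stage}: mean {1000 * sum(samples) / len(samples):.1f} ms over {len(samples)} frames")
```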
## 5) Data preparation & annotation

Prepare a sample set (short clips, 10-30 s) and annotate it with the following (an example record is sketched after this list):

- Per-frame bounding boxes + track_id
- Pose keypoints (COCO 17-keypoint format)
- Action labels (start/end, frame-level)
- Audio transcript (with timestamps)
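One possible shape for an annotation record covering all four items; every field name here is illustrative, not a fixed schema:

```python
# Hypothetical per-clip annotation record (field names are assumptions)
annotation = {
    "clip_id": "sample_0001",
    "fps": 30.0,
    "frames": [
        {
            "frame_idx": 0,
            "boxes": [{"track_id": 1, "bbox": [100, 120, 180, 260]}],  # x1, y1, x2, y2
            "keypoints": {1: [[132.0, 140.5, 0.9]] * 17},  # track_id -> 17 COCO keypoints (x, y, visibility)
        },
    ],
    "actions": [{"track_id": 1, "label": "walking", "start_frame": 0, "end_frame": 90}],
    "transcript": [{"start": 0.4, "end": 1.2, "text": "hello"}],
}
```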
## 6) Bench automation

- Bench script: create tools/bench_video_pipeline.py (future implementation); a possible CLI skeleton is sketched below
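Since the bench script does not exist yet, here is one possible CLI skeleton for it; every flag is an assumption:

```python
# tools/bench_video_pipeline.py (sketch; not yet implemented)
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Benchmark the video pipeline end to end")
    parser.add_argument("video", help="input clip (10-30 s sample)")
    parser.add_argument("--stages", default="detect,track,action",
                        help="comma-separated pipeline stages to run")
    parser.add_argument("--device", default="cpu", choices=["cpu", "gpu"])
    parser.add_argument("--report", default="bench_report.json",
                        help="where to write per-stage latency and accuracy metrics")
    args = parser.parse_args()
    # TODO: decode frames, run the selected stages, and dump per-stage latency
    # plus accuracy metrics (IDF1, Segmental F1) to args.report
    print(f"benchmarking {args.video} on {args.device}: stages={args.stages}")

if __name__ == "__main__":
    main()
```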
## Implemented modules (2026-04-24)
### Shot boundary detection — shot_boundary_detector.py
| Item | Details |
|---|---|
| Class | ShotBoundaryDetector(threshold=0.3, method="histogram", min_scene_len=5) |
| Method | detect(frames, fps) → [(frame_idx, timestamp), ...] |
| Algorithms | Histogram difference / pixel difference / combined (selected via method) |
| NMS | Removes duplicate boundaries closer than min_scene_len frames |
| Dependencies | NumPy only (OpenCV-free fallback) |
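A usage sketch against the interface documented above; the synthetic frames stand in for real decoded video:

```python
import numpy as np
from shot_boundary_detector import ShotBoundaryDetector

# Stand-in input: 10 s of 320x240 frames as HxWx3 uint8 arrays (assumed frame format)
frames = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(300)]
fps = 30.0

detector = ShotBoundaryDetector(threshold=0.3, method="histogram", min_scene_len=5)
for frame_idx, timestamp in detector.detect(frames, fps):
    print(f"shot boundary at frame {frame_idx} ({timestamp:.2f} s)")
```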
### Temporal action localization — temporal_action_localizer.py
| Item | Details |
|---|---|
| Class | TemporalActionLocalizer(window_size=16, stride=8, min_confidence=0.3, nms_overlap_threshold=0.5) |
| Method | localize(frames, fps) / localize_with_poses(frames, pose_results, fps) |
| Output | [{"start": float, "end": float, "action": str, "confidence": float}, ...] |
| Features | Sliding window + temporal IoU NMS; integrates pose keypoint change signals |
| Production hook | backend="stgcn_real" (ST-GCN) / backend="slowfast_real" |
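A usage sketch for the documented localize method, reusing frames and fps from the shot-boundary example:

```python
from temporal_action_localizer import TemporalActionLocalizer

localizer = TemporalActionLocalizer(
    window_size=16, stride=8, min_confidence=0.3, nms_overlap_threshold=0.5
)
for seg in localizer.localize(frames, fps):
    # Each entry: {"start": float, "end": float, "action": str, "confidence": float}
    print(f'{seg["action"]}: {seg["start"]:.1f}-{seg["end"]:.1f} s ({seg["confidence"]:.2f})')
```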
### Spatial relation extraction — spatial_relations.py
| Item | Details |
|---|---|
| Function | get_spatial_relation(obj_a, obj_b, depth_a, depth_b) → str |
| Return value | left_of / right_of / above / below / in_front_of / behind / near / touching / unknown |
| Batch | extract_pairwise_relations(detections, depth_maps) → [{"subject_id", "object_id", "relation", "confidence"}, ...] |
| Addition | MotionVectorEstimator — estimates velocity vector and direction from track history |
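A usage sketch for the documented functions; the detection dict layout and the depth arguments are assumptions, since only the function names and return shapes are fixed above:

```python
from spatial_relations import get_spatial_relation, extract_pairwise_relations

# Assumed detection format: dicts with an id and an [x1, y1, x2, y2] bbox
obj_a = {"id": 1, "bbox": [100, 120, 180, 260]}
obj_b = {"id": 2, "bbox": [200, 130, 280, 270]}

relation = get_spatial_relation(obj_a, obj_b, depth_a=2.1, depth_b=2.0)
print(relation)  # one of the return values listed above, e.g. "left_of"

# Batch form; passing depth_maps=None (no depth available) is an assumption
pairs = extract_pairwise_relations(detections=[obj_a, obj_b], depth_maps=None)
print(pairs)  # [{"subject_id", "object_id", "relation", "confidence"}, ...]
```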
## Future model integration priorities
| Feature | Recommended model | Priority |
|---|---|---|
| Production Tracking | ByteTrack + YOLOX | High |
| Production action recognition | ST-GCN (skeleton) or SlowFast (RGB) | High |
| Shot Boundary (High Accuracy) | PySceneDetect + CNN Classifier | Medium |
| Motion estimation | Optical flow (RAFT / Farneback) | Medium |
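For the medium-priority motion estimation row, a minimal classical baseline with OpenCV's Farneback dense optical flow (RAFT would require a learned model; this runs out of the box, assuming OpenCV is installed):

```python
import cv2
import numpy as np

def estimate_motion(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Dense optical flow (HxWx2 per-pixel dx/dy) via Farneback's algorithm."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Common Farneback settings: pyramid scale 0.5, 3 levels, window 15,
    # 3 iterations, poly_n 5, poly_sigma 1.2, no extra flags
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

flow = estimate_motion(frames[0], frames[1])  # frames from the earlier examples
magnitude, _angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(f"mean motion magnitude: {magnitude.mean():.3f} px/frame")
```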
Last updated: 2026-04-24