
# Tracking & action recognition model candidates and evaluation dataset

> [!NOTE]
> For the latest implementation status, refer to Functional Implementation Status (Remaining Functionality).

Purpose: This document summarizes production model candidates, evaluation metrics, recommended datasets, and evaluation procedures for tracking (ID continuity) and action recognition.

## 1) Detection + tracking candidates

- Detection: YOLOv5/YOLOv8, YOLOX (high throughput)
- Tracker: DeepSORT, StrongSORT, ByteTrack, SORT (lightweight)
- Recommended combination: YOLOX + ByteTrack (high speed + ID continuity)

Evaluation metrics

- IDF1, MOTA, MOTP, Precision, Recall
- Practical evaluation: occlusion recovery rate, ID persistence after short-term occlusion
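For reference, the headline tracking metrics reduce to simple count ratios once per-frame (MOTA) or per-sequence (IDF1) matching has been done. A minimal sketch; the function names are ours, not from any project module:

```python
def mota(num_gt, fn, fp, idsw):
    """Multiple Object Tracking Accuracy:
    MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number of
    ground-truth boxes over all frames. Note MOTA can be negative."""
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """ID F1 score: harmonic mean of ID precision and ID recall, computed
    after a global identity matching over the whole sequence.
    IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

In practice the per-frame/per-sequence matching is the hard part; the `py-motmetrics` library implements it and reports these metrics directly.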

## 2) Action recognition candidates

- ST-GCN (skeleton-based)
- TCN (temporal convolution)
- I3D / SlowFast (RGB-based)
- TSN (lightweight batch evaluation)

Evaluation metrics

- mAP, F1, Accuracy, Segmental-F1

## 3) Recommended datasets

- PoseTrack (pose + tracking) — tracking evaluation
- MOTChallenge (MOT17, MOT20) — ID/tracking evaluation
- AVA — spatio-temporal action detection
- Kinetics, UCF101, HMDB51 — action classification benchmarks
- JHMDB, THUMOS — action segmentation and detection

## 4) Evaluation procedure (integration)

1. Evaluate detector accuracy/latency per device (CPU/GPU)
2. Evaluate the tracker in integration (compute IDF1 etc. against fixed detection results)
3. Evaluate action recognition by segment accuracy on tracking outputs
4. E2E integration: measure latency/success across the detection → tracking → action recognition → ASR → fusion pipeline
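For the E2E latency measurement in step 4, a simple per-stage timer is enough; this sketch treats each pipeline stage as an opaque callable (stage names and functions here are placeholders, not project APIs):

```python
import time


def run_pipeline(frame, stages):
    """Run one input through an ordered list of (name, fn) stages,
    recording per-stage latency in milliseconds. Each stage receives the
    previous stage's output, mirroring detection -> tracking -> action
    recognition -> ASR -> fusion."""
    latencies = {}
    out = frame
    for name, fn in stages:
        t0 = time.perf_counter()
        out = fn(out)
        latencies[name] = (time.perf_counter() - t0) * 1000.0
    return out, latencies
```

Summing the per-stage values gives the E2E latency; logging them separately shows which stage dominates the budget.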

## 5) Data preparation & annotation

- Prepare a sample set (short clips, 10–30 s) and annotate each clip with:
  - Per-frame bounding box + track_id
  - Pose keypoints (COCO 17-keypoint format)
  - Action labels (start/end or frame-level)
  - Audio transcript (with timestamps)
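The annotation items above fit naturally into one JSON record per clip. The field names below are an illustrative, hypothetical schema (not a fixed project format); keypoints are truncated to a single joint for brevity (COCO format has 17 per person):

```python
import json

# One annotated clip; field names are illustrative, not a fixed schema.
clip_annotation = {
    "clip_id": "sample_001",
    "duration_sec": 12.4,
    "frames": [
        {
            "frame_idx": 0,
            "boxes": [
                {
                    "track_id": 1,
                    "bbox": [120, 80, 220, 340],         # x1, y1, x2, y2
                    "keypoints": [[130.0, 95.0, 0.98]],  # (x, y, score) per joint
                }
            ],
        }
    ],
    "actions": [
        {"label": "walking", "start_sec": 0.0, "end_sec": 5.2, "track_id": 1}
    ],
    "transcript": [
        {"text": "hello", "start_sec": 1.1, "end_sec": 1.6}
    ],
}

# The record round-trips through JSON, so it can be stored one file per clip.
assert json.loads(json.dumps(clip_annotation)) == clip_annotation
```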

## 6) Bench automation

- Bench script: create tools/bench_video_pipeline.py (future implementation)
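Since tools/bench_video_pipeline.py is not implemented yet, the following is only a minimal throughput-measuring skeleton of what it could contain (all names hypothetical):

```python
import time


def bench_pipeline(frames, process_frame):
    """Measure end-to-end throughput of a per-frame pipeline callable.
    `process_frame` is a stand-in for the full detection -> tracking ->
    action-recognition pipeline. Returns (fps, total_seconds)."""
    t0 = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - t0
    fps = len(frames) / elapsed if elapsed > 0 else float("inf")
    return fps, elapsed
```

A real bench script would additionally load clips from disk, warm up the models before timing, and report per-stage latency alongside FPS.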


## Implemented modules (2026-04-24)

### Shot boundary detection — shot_boundary_detector.py

| Item | Details |
| --- | --- |
| Class | `ShotBoundaryDetector(threshold=0.3, method="histogram", min_scene_len=5)` |
| Method | `detect(frames, fps)` → `[(frame_idx, timestamp), ...]` |
| Algorithms | Histogram difference / pixel difference / combined (selected via `method`) |
| NMS | Removes duplicate boundaries closer than `min_scene_len` frames |
| Dependencies | NumPy only (fallback path without OpenCV) |
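To make the histogram-difference method concrete, here is a pure-NumPy sketch of the approach described above; the real `ShotBoundaryDetector` API may differ, and the threshold semantics here are our assumption:

```python
import numpy as np


def detect_shot_boundaries(frames, fps, threshold=0.3, min_scene_len=5):
    """Flag a shot boundary when the normalized-histogram difference
    between consecutive frames exceeds `threshold`, suppressing boundaries
    closer than `min_scene_len` frames (the NMS step in the table above)."""
    boundaries = []
    prev_hist = None
    last_boundary = -min_scene_len
    for idx, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=32, range=(0, 256))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None:
            # L1 distance between normalized histograms, scaled to [0, 1]
            diff = np.abs(hist - prev_hist).sum() / 2.0
            if diff > threshold and idx - last_boundary >= min_scene_len:
                boundaries.append((idx, idx / fps))
                last_boundary = idx
        prev_hist = hist
    return boundaries
```

On a synthetic sequence of 10 dark frames followed by 10 bright frames, this reports a single boundary at frame 10.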

### Temporal action localization — temporal_action_localizer.py

| Item | Details |
| --- | --- |
| Class | `TemporalActionLocalizer(window_size=16, stride=8, min_confidence=0.3, nms_overlap_threshold=0.5)` |
| Methods | `localize(frames, fps)` / `localize_with_poses(frames, pose_results, fps)` |
| Output | `[{"start": float, "end": float, "action": str, "confidence": float}, ...]` |
| Features | Sliding window + temporal-IoU NMS; integrates pose-keypoint changes |
| Production hooks | `backend="stgcn_real"` (ST-GCN) / `backend="slowfast_real"` (SlowFast) |
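The temporal-IoU NMS step mentioned in the table can be sketched as follows, operating directly on segments in the output format above (a sketch of the technique, not the module's actual code):

```python
def temporal_iou(a, b):
    """Intersection-over-union of two time segments with
    {"start": float, "end": float} fields."""
    inter = max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    union = (a["end"] - a["start"]) + (b["end"] - b["start"]) - inter
    return inter / union if union > 0 else 0.0


def nms_segments(segments, overlap_threshold=0.5):
    """Greedy NMS: keep the highest-confidence segment, drop any segment
    whose temporal IoU with an already-kept one reaches the threshold."""
    kept = []
    for seg in sorted(segments, key=lambda s: s["confidence"], reverse=True):
        if all(temporal_iou(seg, k) < overlap_threshold for k in kept):
            kept.append(seg)
    return kept
```

With a 16-frame window and stride 8, adjacent windows overlap by half, so this suppression is what keeps one detection per action instance.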

### Spatial relations extraction — spatial_relations.py

| Item | Details |
| --- | --- |
| Function | `get_spatial_relation(obj_a, obj_b, depth_a, depth_b)` → `str` |
| Return values | `left_of` / `right_of` / `above` / `below` / `in_front_of` / `behind` / `near` / `touching` / `unknown` |
| Batch | `extract_pairwise_relations(detections, depth_maps)` → `[{"subject_id", "object_id", "relation", "confidence"}, ...]` |
| Addition | `MotionVectorEstimator` — estimates velocity vector and direction from track history |
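A center-plus-depth classification like the one in the table can be sketched as below. This is our illustration, covering only a subset of the documented labels; the thresholds, the image-coordinate convention (y grows downward, smaller depth is closer), and the dominant-axis rule are all assumptions that may differ from the real `spatial_relations.py`:

```python
def classify_relation(center_a, center_b, depth_a, depth_b,
                      xy_margin=0.1, depth_margin=0.1):
    """Classify the relation of object A relative to object B from
    normalized (x, y) centers and scalar depths. Picks the dominant
    separation axis; falls back to "near" when nothing dominates."""
    dx = center_a[0] - center_b[0]
    dy = center_a[1] - center_b[1]
    dz = depth_a - depth_b
    # Depth wins when it is the largest separation (smaller depth = closer).
    if abs(dz) >= max(abs(dx), abs(dy)) and abs(dz) > depth_margin:
        return "in_front_of" if dz < 0 else "behind"
    # Horizontal separation (image x-axis).
    if abs(dx) >= abs(dy) and abs(dx) > xy_margin:
        return "left_of" if dx < 0 else "right_of"
    # Vertical separation (image y grows downward).
    if abs(dy) > xy_margin:
        return "above" if dy < 0 else "below"
    return "near"
```

For example, an object centered at x = 0.2 is classified as `left_of` one centered at x = 0.8 when their depths match.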

## Future model integration priorities

| Feature | Recommended model | Priority |
| --- | --- | --- |
| Production tracking | ByteTrack + YOLOX | High |
| Production action recognition | ST-GCN (skeleton) or SlowFast (RGB) | High |
| Shot boundary (high accuracy) | PySceneDetect + CNN classifier | Medium |
| Motion estimation | Optical flow (RAFT / Farneback) | Medium |

Last updated: 2026-04-24