Tracking & action recognition model candidates and evaluation dataset
> [!NOTE]
> For the latest implementation status, please refer to Functional Implementation Status (Remaining Functionality).
Purpose: This document summarizes production-candidate models, evaluation metrics, recommended datasets, and evaluation procedures for tracking (ID continuity) and action recognition.
## 1) Detection + tracking candidates

- Detection: YOLOv5/YOLOv8, YOLOX (high throughput)
- Tracker: DeepSORT, StrongSORT, ByteTrack, SORT (lightweight)
- Recommended combination: YOLOX + ByteTrack (high speed + ID continuity); see the pipeline sketch after this list
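To make the recommended YOLOX + ByteTrack combination concrete, here is a minimal per-frame loop. The Detector/Tracker wrappers are hypothetical interfaces, not the actual YOLOX or ByteTrack APIs; they only illustrate how detection feeds the tracker each frame.

```python
import numpy as np

class Detector:
    """Hypothetical detector wrapper (e.g. around YOLOX); not the real API."""
    def detect(self, frame: np.ndarray) -> list[dict]:
        # Returns [{"bbox": [x1, y1, x2, y2], "score": float, "class_id": int}, ...]
        raise NotImplementedError

class Tracker:
    """Hypothetical tracker wrapper (e.g. around ByteTrack); not the real API."""
    def update(self, detections: list[dict]) -> list[dict]:
        # Assigns a persistent track_id to each detection and returns the tracks
        raise NotImplementedError

def run_pipeline(frames, detector: Detector, tracker: Tracker):
    # Detect first, then let the tracker maintain ID continuity across frames
    for frame_idx, frame in enumerate(frames):
        tracks = tracker.update(detector.detect(frame))
        yield frame_idx, tracks
```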
Evaluation metrics:

- IDF1, MOTA, MOTP, Precision, Recall
- Implementation-level evaluation: occlusion recovery rate, ID persistence after short-term occlusion (see the sketch after this list)
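For the metrics above, py-motmetrics is one common implementation choice (an assumption here, not mandated by this doc). A minimal sketch accumulating one frame and computing IDF1/MOTA/MOTP:

```python
import motmetrics as mm
import numpy as np

acc = mm.MOTAccumulator(auto_id=True)

# One frame: ground-truth IDs, tracker IDs, and a GT-by-hypothesis distance matrix
gt_ids = [1, 2]
hyp_ids = [1, 2, 3]
distances = mm.distances.norm2squared_matrix(
    np.array([[10.0, 20.0], [30.0, 40.0]]),                # GT positions
    np.array([[10.5, 20.2], [29.0, 41.0], [80.0, 80.0]]),  # tracker positions
    max_d2=50.0,  # pairs farther apart than this are treated as non-matches
)
acc.update(gt_ids, hyp_ids, distances)  # call once per frame in real use

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["idf1", "mota", "motp", "precision", "recall"], name="demo")
print(summary)
```

Occlusion robustness can be probed with the same accumulator by checking whether a GT object regains its original hypothesis ID after frames in which it had no match.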
## 2) Action recognition candidates

- ST-GCN (skeleton-based)
- TCN (temporal convolution)
- I3D / SlowFast (RGB-based)
- TSN (lightweight batch evaluation)
Evaluation metrics:

- mAP, F1, Accuracy, Segmental F1 (a computation sketch for Segmental F1 follows this list)
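Segmental F1 scores predicted action segments against ground truth by temporal IoU rather than per-frame labels. A minimal sketch following the common definition (greedy one-to-one matching at a fixed IoU threshold); the (start, end, label) tuple format is an assumption:

```python
def segmental_f1(pred, gt, iou_threshold=0.5):
    """Segmental F1 over (start, end, label) tuples.

    A prediction counts as a true positive if it shares its label with a
    not-yet-matched ground-truth segment and their temporal IoU meets the
    threshold.
    """
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    matched, tp = set(), 0
    for p in pred:
        for i, g in enumerate(gt):
            if i not in matched and p[2] == g[2] and iou(p, g) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```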
## 3) Recommended datasets

- PoseTrack (pose + tracking) — tracking evaluation
- MOTChallenge (MOT17, MOT20) — ID/tracking evaluation
- AVA — spatio-temporal action detection
- Kinetics, UCF101, HMDB51 — action classification benchmarks
- JHMDB, THUMOS — action segmentation and detection
## 4) Evaluation procedure (integration)

1. Detector accuracy/latency evaluation (device: CPU/GPU)
2. Tracker integration evaluation (compute IDF1 etc. against fixed detection results)
3. Action recognition evaluated on segment-level accuracy, taking tracking results as input
4. E2E integration: measure latency and success rate across the detection → tracking → action recognition → ASR → fusion pipeline (a timing sketch follows this list)
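A sketch of the per-stage timing scaffold for step 4, reusing the hypothetical detector/tracker wrappers from section 1; the remaining stages would be wrapped the same way:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_latency = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_latency[stage].append(time.perf_counter() - start)

for frame in frames:  # frames, detector, tracker as in the section 1 sketch
    with timed("detection"):
        detections = detector.detect(frame)
    with timed("tracking"):
        tracks = tracker.update(detections)
# ... wrap action recognition, ASR, and fusion stages identically

for stage, samples in stage_latency.items():
    print(f"{stage}: mean {1000 * sum(samples) / len(samples):.1f} ms over {len(samples)} frames")
```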
## 5) Data preparation & annotation

Prepare a sample set (short clips, 10-30 s) and annotate it with the following (an example record is sketched after this list):

- Per-frame bounding boxes + track_id
- Pose keypoints (COCO 17-keypoint format)
- Action labels (start/end, frame-level)
- Audio transcript (with timestamps)
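One possible shape for an annotation record covering all four items; every field name here is illustrative, not a fixed schema:

```python
# Hypothetical per-clip annotation record (field names are assumptions)
annotation = {
    "clip_id": "sample_0001",
    "fps": 30.0,
    "frames": [
        {
            "frame_idx": 0,
            "boxes": [{"track_id": 1, "bbox": [100, 120, 180, 260]}],  # x1, y1, x2, y2
            "keypoints": {1: [[132.0, 140.5, 0.9]] * 17},  # track_id -> 17 COCO keypoints (x, y, visibility)
        },
    ],
    "actions": [{"track_id": 1, "label": "walking", "start_frame": 0, "end_frame": 90}],
    "transcript": [{"start": 0.4, "end": 1.2, "text": "hello"}],
}
```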
## 6) Bench automation

- Bench script: create tools/bench_video_pipeline.py (future implementation); a possible CLI skeleton is sketched below
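Since the bench script does not exist yet, here is one possible CLI skeleton for it; every flag is an assumption:

```python
# tools/bench_video_pipeline.py (sketch; not yet implemented)
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Benchmark the video pipeline end to end")
    parser.add_argument("video", help="input clip (10-30 s sample)")
    parser.add_argument("--stages", default="detect,track,action",
                        help="comma-separated pipeline stages to run")
    parser.add_argument("--device", default="cpu", choices=["cpu", "gpu"])
    parser.add_argument("--report", default="bench_report.json",
                        help="where to write per-stage latency and accuracy metrics")
    args = parser.parse_args()
    # TODO: decode frames, run the selected stages, and dump per-stage latency
    # plus accuracy metrics (IDF1, Segmental F1) to args.report
    print(f"benchmarking {args.video} on {args.device}: stages={args.stages}")

if __name__ == "__main__":
    main()
```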
## Implemented modules (2026-04-24)
### Shot boundary detection — shot_boundary_detector.py
| Item | Details |
|---|---|
| Class | ShotBoundaryDetector(threshold=0.3, method="histogram", min_scene_len=5) |
| Method | detect(frames, fps) → [(frame_idx, timestamp), ...] |
| Algorithms | Histogram difference / pixel difference / combined (selected via method) |
| NMS | Removes duplicate boundaries closer than min_scene_len frames |
| Dependencies | NumPy only (OpenCV-free fallback) |
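A usage sketch against the interface documented above; the synthetic frames stand in for real decoded video:

```python
import numpy as np
from shot_boundary_detector import ShotBoundaryDetector

# Stand-in input: 10 s of 320x240 frames as HxWx3 uint8 arrays (assumed frame format)
frames = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(300)]
fps = 30.0

detector = ShotBoundaryDetector(threshold=0.3, method="histogram", min_scene_len=5)
for frame_idx, timestamp in detector.detect(frames, fps):
    print(f"shot boundary at frame {frame_idx} ({timestamp:.2f} s)")
```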
### Temporal action localization — temporal_action_localizer.py
| Item | Details |
|---|---|
| Class | TemporalActionLocalizer(window_size=16, stride=8, min_confidence=0.3, nms_overlap_threshold=0.5) |
| Method | localize(frames, fps) / localize_with_poses(frames, pose_results, fps) |
| Output | [{"start": float, "end": float, "action": str, "confidence": float}, ...] |
| Features | Sliding window + temporal IoU NMS; integrates pose keypoint change signals |
| Production hook | backend="stgcn_real" (ST-GCN) / backend="slowfast_real" |
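A usage sketch for the documented localize method, reusing frames and fps from the shot-boundary example:

```python
from temporal_action_localizer import TemporalActionLocalizer

localizer = TemporalActionLocalizer(
    window_size=16, stride=8, min_confidence=0.3, nms_overlap_threshold=0.5
)
for seg in localizer.localize(frames, fps):
    # Each entry: {"start": float, "end": float, "action": str, "confidence": float}
    print(f'{seg["action"]}: {seg["start"]:.1f}-{seg["end"]:.1f} s ({seg["confidence"]:.2f})')
```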
### Spatial relation extraction — spatial_relations.py
| Item | Details |
|---|---|
| Function | get_spatial_relation(obj_a, obj_b, depth_a, depth_b) → str |
| Return value | left_of / right_of / above / below / in_front_of / behind / near / touching / unknown |
| Batch | extract_pairwise_relations(detections, depth_maps) → [{"subject_id", "object_id", "relation", "confidence"}, ...] |
| Addition | MotionVectorEstimator — estimates velocity vector and direction from track history |
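A usage sketch for the documented functions; the detection dict layout and the depth arguments are assumptions, since only the function names and return shapes are fixed above:

```python
from spatial_relations import get_spatial_relation, extract_pairwise_relations

# Assumed detection format: dicts with an id and an [x1, y1, x2, y2] bbox
obj_a = {"id": 1, "bbox": [100, 120, 180, 260]}
obj_b = {"id": 2, "bbox": [200, 130, 280, 270]}

relation = get_spatial_relation(obj_a, obj_b, depth_a=2.1, depth_b=2.0)
print(relation)  # one of the return values listed above, e.g. "left_of"

# Batch form; passing depth_maps=None (no depth available) is an assumption
pairs = extract_pairwise_relations(detections=[obj_a, obj_b], depth_maps=None)
print(pairs)  # [{"subject_id", "object_id", "relation", "confidence"}, ...]
```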
## Future model integration priorities
| Feature | Recommended model | Priority |
|---|---|---|
| Production Tracking | ByteTrack + YOLOX | High |
| Production action recognition | ST-GCN (skeleton) or SlowFast (RGB) | High |
| Shot Boundary (High Accuracy) | PySceneDetect + CNN Classifier | Medium |
| Motion estimation | Optical flow (RAFT / Farneback) | Medium |
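For the medium-priority motion estimation row, a minimal classical baseline with OpenCV's Farneback dense optical flow (RAFT would require a learned model; this runs out of the box, assuming OpenCV is installed):

```python
import cv2
import numpy as np

def estimate_motion(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Dense optical flow (HxWx2 per-pixel dx/dy) via Farneback's algorithm."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Common Farneback settings: pyramid scale 0.5, 3 levels, window 15,
    # 3 iterations, poly_n 5, poly_sigma 1.2, no extra flags
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

flow = estimate_motion(frames[0], frames[1])  # frames from the earlier examples
magnitude, _angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(f"mean motion magnitude: {magnitude.mean():.3f} px/frame")
```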
Last updated: 2026-04-24