MotionAtlas: Detailed Region Captioning for Motion-Centric Videos

Weisong Liu¹*, Haochen Wang¹*, Kuan Gao³, Yuhao Wang⁴, YiKang Zhou⁵, Zhongwei Ren⁶, Jacky Mai², Anna Wang², Yanwei Li³, Jason Li²‡, Zhaoxiang Zhang¹†

¹CASIA    ²SJTU    ³NTU    ⁴PKU    ⁵WHU    ⁶BJTU

* Equal contribution. Project lead. Corresponding author.

TL;DR: MotionAtlas shifts motion captioning from global video descriptions to region-aware motion captions, enabling precise evaluation through MotionAtlas-Bench and scalable training through MotionAtlas-Data.

Consistent Gains Across Benchmarks

Quantitative comparison between MotionAtlas and the Qwen3-VL baselines across eight benchmarks. MotionAtlas consistently improves over the baselines.

2,073 Fine-Grained MCQs

MotionAtlas-Bench uses dense checklist-style questions to judge detailed motion captions over referred objects.

159K Training Samples

MotionAtlas-Data provides scalable region-level motion captions refined to suppress fine-grained hallucinations.

23 Verbs / Sample

MotionAtlas-Data emphasizes dense action verbs and detailed temporal motion descriptions.

MotionAtlas-Bench

Each video is decomposed into events, and each event is checked through multiple-choice questions over temporal cues, kinematics, references, and local regions.

Illustration of MotionAtlas-Bench, showing region-aware motion captioning and checklist-style evaluation. Each video is first decomposed into events; for each event, the judge model answers checklist MCQs based on the candidate captions, enabling reliable diagnostic evaluation.
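The checklist protocol described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `MCQ` schema, the category names, and the `keyword_judge` function are hypothetical stand-ins (the benchmark uses a judge model, not keyword matching), and the example questions are invented for the demo.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MCQ:
    """One checklist question about a referred object (hypothetical schema)."""
    question: str
    options: List[str]
    answer: int      # index of the correct option
    category: str    # assumed labels, e.g. "temporal", "kinematics", "region"


# A judge maps (candidate caption, question) to the index of the chosen option.
Judge = Callable[[str, MCQ], int]


def score_caption(caption: str, events: List[List[MCQ]], judge: Judge) -> Dict[str, float]:
    """Return overall and per-category accuracy over every event checklist."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for checklist in events:          # one checklist per event
        for q in checklist:
            hit = int(judge(caption, q) == q.answer)
            for key in ("overall", q.category):
                correct[key] = correct.get(key, 0) + hit
                total[key] = total.get(key, 0) + 1
    return {key: correct[key] / total[key] for key in total}


def keyword_judge(caption: str, q: MCQ) -> int:
    """Toy stand-in for the judge model: pick the first option named in the caption."""
    text = caption.lower()
    for i, option in enumerate(q.options):
        if option.lower() in text:
            return i
    return len(q.options) - 1         # fall back to the last option


# Invented example: two events, three questions.
events = [
    [
        MCQ("Which hand moves first?", ["left hand", "right hand", "not mentioned"], 0, "temporal"),
        MCQ("How does the arm rise?", ["slowly", "quickly", "not mentioned"], 1, "kinematics"),
    ],
    [
        MCQ("Which way does the dancer step?", ["forward", "backward", "not mentioned"], 1, "region"),
    ],
]
caption = "The dancer raises her left hand, then the arm rises quickly before she steps forward."
scores = score_caption(caption, events, keyword_judge)
print(scores)
```

Because the judge only sees the candidate caption, a detail the caption gets wrong or omits shows up as a missed question in the corresponding category, which is what makes the score diagnostic.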

Results

Training on MotionAtlas-Data improves both region-level motion captioning and broader motion-related video understanding.

Table 3

Main Results on MotionAtlas-Bench

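| Model | SF Overall | SF Parts | SF Kin. | FS Overall | FS Parts | FS Kin. |
|---|---|---|---|---|---|---|
| Gemini 3 Pro | 36.4 | 34.7 | 32.0 | 36.5 | 33.5 | 38.1 |
| GPT-5.2 | 36.9 | 34.0 | 34.2 | 37.6 | 38.8 | 36.6 |
| Qwen3-VL-235B | 30.5 | 27.8 | 28.9 | 33.7 | 33.2 | 31.1 |
| Qwen3-VL-4B | 19.3 | 20.0 | 14.1 | 21.7 | 22.4 | 16.5 |
| + MotionAtlas-Data | 27.7 ↑ 8.4 | 27.9 | 26.9 | 30.1 ↑ 8.4 | 30.3 | 29.3 |
| Qwen3-VL-8B | 24.3 | 23.9 | 20.3 | 26.7 | 24.6 | 26.7 |
| + MotionAtlas-Data | 31.6 ↑ 7.3 | 31.2 | 30.6 | 34.1 ↑ 7.4 | 33.6 | 33.0 |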

SF = Single-Frame Grounding, FS = Full-Sequence Grounding. Values are accuracy.

Table 4

Motion-Related Video Understanding

| Model | MotionBench | DREAM-1K | TOMATO | NExT-QA | TempCompass | FAVOR | TVBench |
|---|---|---|---|---|---|---|---|
| GPT-5.2 | 65.4 | 42.2 | 53.0 | 79.9 | 73.0 | 56.8 | 53.8 |
| Gemini 2.5 Pro | 62.0 | 42.7 | 48.6 | 79.8 | 73.7 | 58.8 | 59.9 |
| Qwen3-VL-4B | 55.9 | 35.6 | 27.4 | 71.6 | 69.6 | 47.0 | 47.2 |
| + MotionAtlas-Data | 61.9 ↑ 6.0 | 38.9 ↑ 3.3 | 35.2 ↑ 7.8 | 74.0 ↑ 2.4 | 74.2 ↑ 4.6 | 55.0 ↑ 8.1 | 51.2 ↑ 4.0 |
| Qwen3-VL-8B | 59.0 | 38.7 | 34.0 | 76.9 | 71.8 | 54.1 | 51.4 |
| + MotionAtlas-Data | 62.6 ↑ 3.6 | 39.6 ↑ 0.9 | 36.5 ↑ 2.5 | 77.2 ↑ 0.2 | 75.1 ↑ 3.3 | 57.7 ↑ 3.6 | 52.9 ↑ 1.5 |

Table 5

Training Data Scale

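| Method | MotionAtlas | MotionBench | DREAM-1K | TOMATO | NExT-QA | TempCompass | FAVOR | TVBench |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B | 19.2 | 55.9 | 35.9 | 27.4 | 71.6 | 69.6 | 47.0 | 47.2 |
| w/ MotionAtlas-Data | | | | | | | | |
| 20% (32K) | 22.9 | 58.9 | 36.9 | 28.4 | 72.2 | 71.2 | 48.1 | 47.0 |
| 60% (95K) | 24.6 | 59.5 | 37.0 | 30.1 | 73.0 | 72.3 | 50.9 | 49.0 |
| 100% (159K) | 28.3 | 61.9 | 38.9 | 35.2 | 74.0 | 74.2 | 55.0 | 51.2 |
| w/o MotionAtlas-Data | | | | | | | | |
| 20% (32K) | 12.9 | 57.4 | 37.3 | 29.0 | 70.9 | 70.8 | 47.4 | 46.7 |
| 60% (95K) | 12.9 | 58.8 | 36.9 | 30.7 | 71.3 | 72.3 | 50.6 | 47.4 |
| 100% (159K) | 12.2 | 60.5 | 38.3 | 28.4 | 71.9 | 73.3 | 52.2 | 48.5 |

Fig. 3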

Data Scaling Curve

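Data scaling curve: average benchmark score versus training data fraction (0%, 20%, 60%, 100%), with and without MotionAtlas-Data. The baseline scores 50.7; at 100% of the data, training w/ MotionAtlas-Data reaches 55.8 versus 53.3 w/o MotionAtlas-Data. The average rises more sharply when training includes MotionAtlas-Data.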

The gain from adding MotionAtlas-Data widens as the amount of training data grows.
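The averages annotated on the curve (50.7 for the baseline, 55.8 and 53.3 at full scale) can be reproduced from Table 5. The sketch below assumes the plotted score is the unweighted mean of the seven external benchmarks, leaving out the MotionAtlas column; that reading is inferred from the numbers and is not stated on the page.

```python
# Columns: MotionBench, DREAM-1K, TOMATO, NExT-QA, TempCompass, FAVOR, TVBench.
# Values are copied from Table 5; the MotionAtlas column is excluded (assumption).
BASELINE = [55.9, 35.9, 27.4, 71.6, 69.6, 47.0, 47.2]

WITH_DATA = {
    "20%": [58.9, 36.9, 28.4, 72.2, 71.2, 48.1, 47.0],
    "60%": [59.5, 37.0, 30.1, 73.0, 72.3, 50.9, 49.0],
    "100%": [61.9, 38.9, 35.2, 74.0, 74.2, 55.0, 51.2],
}

WITHOUT_DATA = {
    "20%": [57.4, 37.3, 29.0, 70.9, 70.8, 47.4, 46.7],
    "60%": [58.8, 36.9, 30.7, 71.3, 72.3, 50.6, 47.4],
    "100%": [60.5, 38.3, 28.4, 71.9, 73.3, 52.2, 48.5],
}


def mean(values):
    """Unweighted mean of a list of benchmark scores."""
    return sum(values) / len(values)


baseline_avg = mean(BASELINE)
avg_with = {scale: mean(row) for scale, row in WITH_DATA.items()}
avg_without = {scale: mean(row) for scale, row in WITHOUT_DATA.items()}

print(f"baseline: {baseline_avg:.1f}")                      # baseline: 50.7
for scale in WITH_DATA:
    gap = avg_with[scale] - avg_without[scale]
    print(f"{scale:>4}: w/ {avg_with[scale]:.1f}  w/o {avg_without[scale]:.1f}  gap {gap:.1f}")
```

Under this assumption the gap between the two settings is about half a point at 20% and 60% of the data and about 2.5 points at 100%.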

Table 6

Pipeline Ablation

| Method | Acc | Recall | Precision |
|---|---|---|---|
| MA Pipeline (full) | 39.9 | 68.2 | 58.5 |
| w/o Self-Bootstrap | 36.4 | 64.1 | 56.8 |
| w/o Full-Video Caption | 33.2 | 58.9 | 56.4 |
| w/o Spatial Crop | 32.7 | 60.9 | 53.6 |

Removing any single component lowers accuracy, recall, and precision, so each component contributes to the final caption quality.
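To make the size of each contribution explicit, the drop relative to the full pipeline can be computed directly from Table 6. This is plain arithmetic on the reported numbers; the variable names are illustrative only.

```python
# (Acc, Recall, Precision) copied from Table 6.
FULL = (39.9, 68.2, 58.5)
ABLATIONS = {
    "w/o Self-Bootstrap": (36.4, 64.1, 56.8),
    "w/o Full-Video Caption": (33.2, 58.9, 56.4),
    "w/o Spatial Crop": (32.7, 60.9, 53.6),
}
METRICS = ("Acc", "Recall", "Precision")

# Drop of each metric when one component is removed from the full pipeline.
drops = {
    name: {m: round(full - ablated, 1) for m, full, ablated in zip(METRICS, FULL, row)}
    for name, row in ABLATIONS.items()
}

for name, d in drops.items():
    print(name, d)

# Component whose removal hurts each metric the most.
largest = {m: max(drops, key=lambda name: drops[name][m]) for m in METRICS}
print(largest)
```

On these numbers, removing the spatial crop costs the most accuracy and precision, while removing the full-video caption costs the most recall.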

Data Release

Qualitative examples from MotionAtlas-Data. The captions emphasize temporal sequence, posture changes, spatial transitions, and region-specific motion details.

Demos

Each example will pair an input video or frame strip with the referred region and its detailed motion caption.

Data Pipeline

MotionAtlas uses event segmentation, localized captioning, self-bootstrap verification, and multi-source narrative synthesis to build high-quality motion captions.

Event segmentation and description guidelines used in MotionAtlas

BibTeX

@article{liu2026motionatlas,
  title={MotionAtlas: Detailed Region Captioning for Motion-Centric Videos},
  author={Liu, Weisong and Wang, Haochen and Gao, Kuan and Wang, Yuhao and Zhou, YiKang and Ren, Zhongwei and Mai, Jacky and Wang, Anna and Li, Yanwei and Li, Jason and Zhang, Zhaoxiang},
  journal={arXiv preprint},
  year={2026}
}