MotionAtlas

Consistent Gains Across Benchmarks

2,073

Fine-Grained MCQs

MotionAtlas-Bench uses dense checklist-style questions to judge detailed motion captions of referred objects.

159K

Training Samples

MotionAtlas-Data provides region-level motion captions at scale, refined to suppress fine-grained hallucinations.

Verbs / Sample

MotionAtlas-Data emphasizes dense action verbs and detailed temporal motion descriptions.

MotionAtlas-Bench

Each video is decomposed into events, and each event is checked through multiple-choice questions over temporal cues, kinematics, references, and local regions.

Figure: Overview of MotionAtlas-Bench, covering region-aware motion captioning and checklist-style evaluation. Each video is first decomposed into events; for each event, the judge model answers checklist MCQs from candidate captions, enabling reliable diagnostic evaluation.
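The checklist idea can be illustrated with a minimal sketch. It assumes a caption's score is simply the fraction of MCQs answered correctly across all events; the class names, the `score_caption` function, and the toy `keyword_judge` are hypothetical stand-ins for illustration, not the benchmark's actual code or judge model.

```python
from dataclasses import dataclass, field


@dataclass
class MCQ:
    question: str
    options: list
    answer: int  # index of the correct option


@dataclass
class Event:
    description: str
    mcqs: list = field(default_factory=list)


def score_caption(events, caption, judge):
    """Fraction of checklist MCQs answered correctly from the caption alone."""
    total = correct = 0
    for event in events:
        for mcq in event.mcqs:
            predicted = judge(caption, mcq)  # hypothetical judge: returns an option index
            correct += int(predicted == mcq.answer)
            total += 1
    return correct / total if total else 0.0


def keyword_judge(caption, mcq):
    """Toy stand-in for a judge model: pick the first option the caption mentions."""
    text = caption.lower()
    for i, option in enumerate(mcq.options):
        if option.lower() in text:
            return i
    return -1  # the caption supports none of the options


# One event with two checklist questions (made-up example content).
events = [
    Event("person raises arm", [
        MCQ("Which arm is raised?", ["left arm", "right arm"], 1),
        MCQ("How fast is the motion?", ["slowly", "quickly"], 0),
    ]),
]

detailed = score_caption(events, "The man slowly raises his right arm.", keyword_judge)
wrong = score_caption(events, "The man raises his left arm.", keyword_judge)
print(detailed, wrong)
```

Because every question targets one detail, a caption that omits or misstates a detail loses exactly the corresponding questions, which is what makes the score diagnostic.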

Results

Training on MotionAtlas-Data improves both region-level motion captioning and broader motion-related video understanding.

Table 3

Main Results on MotionAtlas-Bench

Model	SF Overall	SF Parts	SF Kin.	FS Overall	FS Parts	FS Kin.
Gemini 3 Pro	36.4	34.7	32.0	36.5	33.5	38.1
GPT-5.2	36.9	34.0	34.2	37.6	38.8	36.6
Qwen3-VL-235B	30.5	27.8	28.9	33.7	33.2	31.1
Qwen3-VL-4B	19.3	20.0	14.1	21.7	22.4	16.5
+ MotionAtlas-Data	27.7 ↑ 8.4	27.9	26.9	30.1 ↑ 8.4	30.3	29.3
Qwen3-VL-8B	24.3	23.9	20.3	26.7	24.6	26.7
+ MotionAtlas-Data	31.6 ↑ 7.3	31.2	30.6	34.1 ↑ 7.4	33.6	33.0

SF = Single-Frame Grounding, FS = Full-Sequence Grounding, Kin. = Kinematics. Values are accuracy.

Table 4

Motion-Related Video Understanding

Model	MotionBench	DREAM-1K	TOMATO	NExT-QA	TempCompass	FAVOR	TVBench
GPT-5.2	65.4	42.2	53.0	79.9	73.0	56.8	53.8
Gemini 2.5 Pro	62.0	42.7	48.6	79.8	73.7	58.8	59.9
Qwen3-VL-4B	55.9	35.6	27.4	71.6	69.6	47.0	47.2
+ MotionAtlas-Data	61.9 ↑ 6.0	38.9 ↑ 3.3	35.2 ↑ 7.8	74.0 ↑ 2.4	74.2 ↑ 4.6	55.0 ↑ 8.1	51.2 ↑ 4.0
Qwen3-VL-8B	59.0	38.7	34.0	76.9	71.8	54.1	51.4
+ MotionAtlas-Data	62.6 ↑ 3.6	39.6 ↑ 0.9	36.5 ↑ 2.5	77.2 ↑ 0.2	75.1 ↑ 3.3	57.7 ↑ 3.6	52.9 ↑ 1.5
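The gains in Table 4 can be recomputed from the printed scores. The sketch below does this for the Qwen3-VL-4B rows; the variable names are illustrative. Note that a recomputed delta can differ from the printed one by 0.1, presumably because the printed deltas were derived from unrounded scores (an assumption, not something the page states).

```python
# Scores copied from Table 4: Qwen3-VL-4B and its "+ MotionAtlas-Data" row.
BENCHMARKS = ["MotionBench", "DREAM-1K", "TOMATO", "NExT-QA",
              "TempCompass", "FAVOR", "TVBench"]
BASE_4B = [55.9, 35.6, 27.4, 71.6, 69.6, 47.0, 47.2]
TUNED_4B = [61.9, 38.9, 35.2, 74.0, 74.2, 55.0, 51.2]
PRINTED_4B = [6.0, 3.3, 7.8, 2.4, 4.6, 8.1, 4.0]  # deltas as printed in the table


def gains(base, tuned):
    """Per-benchmark improvement, rounded to one decimal like the table."""
    return [round(t - b, 1) for b, t in zip(base, tuned)]


recomputed = gains(BASE_4B, TUNED_4B)
for name, got, printed in zip(BENCHMARKS, recomputed, PRINTED_4B):
    note = "" if abs(got - printed) < 0.05 else "  (printed delta differs)"
    print(f"{name:12s} +{got:.1f}{note}")

mean_gain = sum(recomputed) / len(recomputed)
print(f"mean gain over {len(recomputed)} benchmarks: {mean_gain:.2f}")
```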

Table 5

Training Data Scale

Method	MotionAtlas	MotionBench	DREAM-1K	TOMATO	NExT-QA	TempCompass	FAVOR	TVBench
Qwen3-VL-4B	19.2	55.9	35.9	27.4	71.6	69.6	47.0	47.2
w/ MotionAtlas-Data
20% (32K)	22.9	58.9	36.9	28.4	72.2	71.2	48.1	47.0
60% (95K)	24.6	59.5	37.0	30.1	73.0	72.3	50.9	49.0
100% (159K)	28.3	61.9	38.9	35.2	74.0	74.2	55.0	51.2
w/o MotionAtlas-Data
20% (32K)	12.9	57.4	37.3	29.0	70.9	70.8	47.4	46.7
60% (95K)	12.9	58.8	36.9	30.7	71.3	72.3	50.6	47.4
100% (159K)	12.2	60.5	38.3	28.4	71.9	73.3	52.2	48.5

Fig. 3

Data Scaling Curve

The improvement from adding MotionAtlas-Data grows as the training data scales.
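The scaling trend can be checked against the MotionAtlas column of Table 5. The sketch below computes the gap between the "w/" and "w/o MotionAtlas-Data" rows at each scale; it covers only that one column, and the function and variable names are illustrative.

```python
# MotionAtlas column of Table 5, keyed by training-set size.
WITH_DATA = {"20% (32K)": 22.9, "60% (95K)": 24.6, "100% (159K)": 28.3}
WITHOUT_DATA = {"20% (32K)": 12.9, "60% (95K)": 12.9, "100% (159K)": 12.2}


def gap_by_scale(with_data, without_data):
    """Score difference (w/ minus w/o) at each training-set size."""
    return {k: round(with_data[k] - without_data[k], 1) for k in with_data}


gaps = gap_by_scale(WITH_DATA, WITHOUT_DATA)
values = list(gaps.values())
# True when each larger scale shows a strictly larger gap than the previous one.
is_widening = all(a < b for a, b in zip(values, values[1:]))

for scale, gap in gaps.items():
    print(f"{scale:12s} gap = {gap:.1f}")
print("gap widens with scale:", is_widening)
```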

Table 6

Pipeline Ablation

Method	Acc	Recall	Precision
MA Pipeline (full)	39.9	68.2	58.5
w/o Self-Bootstrap	36.4	64.1	56.8
w/o Full-Video Caption	33.2	58.9	56.4
w/o Spatial Crop	32.7	60.9	53.6

Removing any single component lowers accuracy, recall, and precision, so each one contributes to more accurate and more complete motion captions.
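To see which component matters most for each metric, the drops relative to the full pipeline can be computed directly from Table 6. This is plain arithmetic on the printed values; the names `ABLATION` and `drops` are illustrative.

```python
# (Acc, Recall, Precision) copied from Table 6.
ABLATION = {
    "MA Pipeline (full)": (39.9, 68.2, 58.5),
    "w/o Self-Bootstrap": (36.4, 64.1, 56.8),
    "w/o Full-Video Caption": (33.2, 58.9, 56.4),
    "w/o Spatial Crop": (32.7, 60.9, 53.6),
}
METRICS = ("Acc", "Recall", "Precision")


def drops(table, full_key="MA Pipeline (full)"):
    """Per-metric loss of each ablated variant relative to the full pipeline."""
    full = table[full_key]
    return {
        name: tuple(round(f - v, 1) for f, v in zip(full, vals))
        for name, vals in table.items()
        if name != full_key
    }


losses = drops(ABLATION)
for i, metric in enumerate(METRICS):
    worst = max(losses, key=lambda name: losses[name][i])
    print(f"largest {metric} drop: {worst} (-{losses[worst][i]:.1f})")
```

On these numbers, removing Spatial Crop costs the most accuracy and precision, while removing the Full-Video Caption costs the most recall.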

Data Release

Benchmark: MotionAtlas-Bench. Human-annotated evaluation samples and checklist-style MCQs.

Training Data: MotionAtlas-Data. Region-level motion caption data for training Video-MLLMs.

Figure: Qualitative examples from MotionAtlas-Data. The captions emphasize temporal sequence, posture changes, spatial transition, and region-specific motion details.

Demos

Each example will pair an input video or frame strip with the referred region and its detailed motion caption.

Data Pipeline

MotionAtlas uses event segmentation, localized captioning, self-bootstrap verification, and multi-source narrative synthesis to build high-quality motion captions.

Figure: Event segmentation and description guidelines used in MotionAtlas.
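The four stages named above could be wired together roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: every function and parameter name is hypothetical, "self-bootstrap verification" is assumed to mean repeatedly re-checking each localized caption until it stops changing, and "multi-source narrative synthesis" is assumed to merge the localized captions with a full-video caption.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds


def build_caption(video, region, segment_events, caption_region,
                  caption_full_video, verify, synthesize, max_rounds=2):
    """Hypothetical orchestration of the four stages; all stages are injected callables."""
    # 1. Event segmentation: split the video into temporal segments.
    segments = segment_events(video)
    # 2. Localized captioning: describe the referred region within each segment.
    local = [caption_region(video, region, seg) for seg in segments]
    # 3. Self-bootstrap verification (assumed): re-check captions until stable.
    for _ in range(max_rounds):
        checked = [verify(video, region, c) for c in local]
        if checked == local:
            break
        local = checked
    # 4. Multi-source synthesis (assumed): merge local captions with a global one.
    return synthesize(local, caption_full_video(video))


# Toy stubs so the sketch runs end to end; real stages would call models.
caption = build_caption(
    video="clip.mp4",
    region="person",
    segment_events=lambda v: [Segment(0.0, 1.5), Segment(1.5, 3.0)],
    caption_region=lambda v, r, s: f"{r} moves during {s.start}-{s.end}s",
    caption_full_video=lambda v: "A person crosses the room.",
    verify=lambda v, r, c: c,  # stub verifier accepts every caption unchanged
    synthesize=lambda local, full: full + " " + " ".join(c + "." for c in local),
)
print(caption)
```

Passing the stages in as callables keeps the orchestration independent of any particular model, which also makes ablations such as those in Table 6 easy to express by swapping a stage for a no-op.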

BibTeX

@article{liu2026motionatlas,
  title={MotionAtlas: Detailed Region Captioning for Motion-Centric Videos},
  author={Liu, Weisong and Wang, Haochen and Gao, Kuan and Wang, Yuhao and Zhou, YiKang and Ren, Zhongwei and Mai, Jacky and Wang, Anna and Li, Yanwei and Li, Jason and Zhang, Zhaoxiang},
  journal={arXiv preprint},
  year={2026}
}