CVPR 2026

Token Warping Helps MLLMs Look from Nearby Viewpoints

KAIST
Figure 1. Viewpoint Change via Token Warping. Given an input image (View A), a question is posed from a different viewpoint (View B). Token warping rearranges the image tokens to simulate the view change, enabling the MLLM to correctly answer: "the cup appears on the right side of the book."
TL;DR
By warping image tokens instead of pixels, we enable MLLMs to reason about scenes from nearby viewpoints — no training, no fine-tuning, no pixel synthesis — outperforming specialist models and generative approaches with minimal inference cost.
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, and explicit pixel-wise warping offers little help: it is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.

Why Tokens, Not Pixels?

Image tokens are inherently robust to positional perturbations, making them ideal for geometric viewpoint transformations.

Tokens Are Robust Perceptual Atoms

Modern ViT-based MLLMs represent images as a sequence of image tokens — localized, semantically meaningful units that function as perceptual atoms. Inspired by theories of mental imagery, we hypothesize that tokens provide the right part-level granularity for viewpoint transformation.

Object-level representations are too coarse, sacrificing important spatial and appearance details. Pixel-level representations are too fine-grained and sensitive to even small depth or geometric noise. Image tokens lie between these extremes, retaining rich visual detail while remaining robust to local perturbations.

Figure 5. Fetching Position Noise Sensitivity. Token representations in MLLMs are highly robust to noise in the image positions from which tokens are fetched: accuracy remains stable under perturbations of up to 19-20 pixels, with only mild degradation in this large-perturbation regime.
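The probe behind this figure can be sketched in a few lines: jitter the image position each token is fetched from by up to a given number of pixels, then snap back to the nearest token on the grid. This is a minimal NumPy sketch under our own assumptions (uniform noise, a square token grid, a hypothetical `perturb_fetch_positions` name); it is not the paper's exact protocol.

```python
import numpy as np

def perturb_fetch_positions(token_grid, noise_px, patch=14, seed=0):
    """Jitter each token's fetching position by uniform noise of up to
    `noise_px` pixels, then snap back to the nearest grid token."""
    H, W, _ = token_grid.shape
    rng = np.random.default_rng(seed)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Nominal token-center pixel coordinates, plus positional noise.
    cx = (xs + 0.5) * patch + rng.uniform(-noise_px, noise_px, (H, W))
    cy = (ys + 0.5) * patch + rng.uniform(-noise_px, noise_px, (H, W))
    # Snap each perturbed position to the closest token on the grid.
    j = np.clip(np.round(cx / patch - 0.5), 0, W - 1).astype(int)
    i = np.clip(np.round(cy / patch - 0.5), 0, H - 1).astype(int)
    return token_grid[i, j]
```

Noise below half a patch width (7 pixels for 14-pixel patches) snaps back to the original token, which is consistent with the flat part of the accuracy curve.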

Token Warping for Viewpoint Changes

A lightweight, training-free approach that warps image tokens using depth maps to simulate novel viewpoints.

Pipeline: Source Image (I_S) → Depth Map (monocular depth) → Token Warping (backward fetching) → Target Tokens (regular grid) → MLLM (spatial reasoning)
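Assuming known intrinsics K, a relative pose (R, t) from the target to the source camera, and a per-token depth map on the target view (e.g., obtained by warping the monocular source depth forward), the backward step can be sketched in NumPy. Names and conventions here are ours for illustration, not the released implementation.

```python
import numpy as np

def backward_token_warp(src_tokens, depth_tgt, K, R, t, patch=14):
    """Backward token warping with nearest fetching: for every token
    center on the target view's regular grid, find the source-view
    pixel it maps to and copy the nearest source token."""
    H, W, _ = src_tokens.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    px, py = (xs + 0.5) * patch, (ys + 0.5) * patch
    # Unproject target grid points to 3D using the target-view depth.
    rays = np.stack([px, py, np.ones_like(px)], axis=-1) @ np.linalg.inv(K).T
    pts = rays * depth_tgt[..., None]
    # Transform into the source camera and project to source pixels.
    pts_src = pts @ R.T + t
    uv = pts_src @ K.T
    u, v = uv[..., 0] / uv[..., 2], uv[..., 1] / uv[..., 2]
    # Nearest fetching: snap each mapped coordinate to the closest
    # source token and copy its embedding.
    j = np.clip(np.round(u / patch - 0.5), 0, W - 1).astype(int)
    i = np.clip(np.round(v / patch - 0.5), 0, H - 1).astype(int)
    return src_tokens[i, j]
```

Because every target cell fetches exactly one token, the output stays the dense regular grid layout the MLLM was trained on.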

Baseline

Pixel-Wise Warping

Retrieves individual pixels at each target coordinate. Even small depth errors cause severe local distortions and semantic degradation, impairing the MLLM's understanding of the warped view.

Ours

Token Warping

Retrieves intact tokens (patches) from the source view, preserving local semantics as coherent units. Robust to positional noise, enabling reliable viewpoint-aware perception.

Figure 4. Pixel-Wise vs. Token Warping. (A) Pixel-wise warping retrieves pixels for each target coordinate, but patchifying the warped image introduces local distortions. (B) Token warping directly retrieves intact tokens from the source view, preserving semantics and improving viewpoint-aware perception.
Figure 3. Limitations of Pixel-Wise Warping. Pixel-wise warping to a target viewpoint often introduces local distortions and semantic degradation. In both forward (top) and backward (bottom) warping, the book from the source view appears significantly distorted after transformation.

Warping Direction & Fetching Strategy

We explore two axes of design: the warping direction (forward vs. backward) and the token fetching strategy (nearest vs. adaptive).

Forward

Forward Warping

Projects source tokens to target viewpoint positions. Results in irregular, sparse token placement with holes — out-of-distribution for MLLMs trained on dense grids.
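The hole problem is easy to quantify in a toy setting: scatter each source token into the target cell its projection lands in and count the cells that receive nothing. A sketch of our own construction (not the paper's code):

```python
import numpy as np

def count_holes(H, W, u, v, patch=14):
    """Forward-warp bookkeeping: mark the target cell each source token
    lands in; any unmarked cell is a hole the MLLM never saw in training."""
    hit = np.zeros((H, W), dtype=bool)
    j = np.clip(np.round(u / patch - 0.5).astype(int), 0, W - 1)
    i = np.clip(np.round(v / patch - 0.5).astype(int), 0, H - 1)
    hit[i, j] = True  # overlapping tokens silently collapse onto one cell
    return int((~hit).sum())
```

Any mapping that compresses the grid leaves holes, and any that expands it makes tokens collide; backward warping avoids both by construction, since every target cell fetches exactly one token.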

Backward › Fetching
Nearest Fetching

Finds the closest existing token in the source grid for each mapped coordinate. Simple and efficient — performs comparably to adaptive fetching.

Backward › Fetching
Adaptive Fetching

Dynamically crops a new patch centered at the mapped coordinate for re-encoding. More flexible but requires additional computation.

Figure 7. Token Fetching Strategies. (A) Nearest fetching selects the closest existing token from the source image grid. (B) Adaptive fetching dynamically crops a patch centered at the mapped coordinate to derive a token precisely centered at the target location.
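The two strategies differ only in the final step. A minimal sketch, where `encode` stands in for the ViT patch embedder (a hypothetical argument; any patch encoder fits):

```python
import numpy as np

def nearest_fetch(tokens, u, v, patch=14):
    """Nearest fetching: reuse the precomputed source token whose
    center lies closest to the mapped coordinate (u, v)."""
    H, W, _ = tokens.shape
    j = int(np.clip(round(u / patch - 0.5), 0, W - 1))
    i = int(np.clip(round(v / patch - 0.5), 0, H - 1))
    return tokens[i, j]

def adaptive_fetch(image, u, v, encode, patch=14):
    """Adaptive fetching: crop a fresh patch centered exactly at the
    mapped coordinate and re-encode it (extra compute per token)."""
    H, W = image.shape[:2]
    x0 = int(np.clip(round(u - patch / 2), 0, W - patch))
    y0 = int(np.clip(round(v - patch / 2), 0, H - patch))
    return encode(image[y0:y0 + patch, x0:x0 + patch])
```

Nearest fetching is a pure lookup into already-computed tokens; adaptive fetching pays for one extra encoder call per target token in exchange for a patch centered precisely at the mapped location.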

ViewBench

A benchmark for evaluating MLLMs on spatial reasoning tasks that require imagining a scene from alternative viewpoints.

ViewBench-Text

Evaluate left-right spatial reasoning between two text-labeled points after a viewpoint change. Tests whether MLLMs can infer how relationships reverse.

Text Labels

ViewBench-Shape

Same spatial reasoning task but using simple geometric shapes (star, triangle) as reference points instead of text labels.

Shape Labels

ViewBench-Object

Assess whether the MLLM can accurately describe objects as they would appear from the target viewpoint, preserving fine-grained visual details.

Description (1-10)

Figure 6. ViewBench. Example source-target image pairs with corresponding questions and answers. Tasks evaluate MLLM's ability to infer spatial relationships from nearby viewpoints (Text, Shape) and to describe object properties visible in the warped target view (Object).
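As a toy illustration of the ViewBench-Text premise, two labeled 3D points can swap their left-right order once the camera moves. The function below is our own construction (a pinhole world-to-camera convention is assumed), not the benchmark's code:

```python
import numpy as np

def left_or_right(p_a, p_b, R, t, K=np.eye(3)):
    """Return whether point A appears left or right of point B after a
    camera move: project both under the world-to-camera pose (R, t)."""
    def proj_x(p):
        q = K @ (R @ p + t)  # camera coordinates, then image plane
        return q[0] / q[2]
    return "left" if proj_x(p_a) < proj_x(p_b) else "right"
```

Moving the camera to the other side of the two points flips the answer, which is exactly the relationship reversal ViewBench-Text asks MLLMs to infer.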

Quantitative Evaluation

Backward token warping consistently outperforms all baselines across every task and difficulty level.

+14.1%p vs. specialist MLLMs on ViewBench-Text (5-15 overlap)
+10.3%p vs. pixel-wise warping on ViewBench-Shape (5-15 overlap)
+0.65 vs. GenWarp (novel view synthesis) on ViewBench-Object (5-15 score)
ViewBench-Text and ViewBench-Shape are accuracies (%); ViewBench-Object is a description score (1-10). Columns within each benchmark are view-overlap ranges.

| Method | Text 5-15 | Text 15-25 | Text 25-35 | Shape 5-15 | Shape 15-25 | Shape 25-35 | Object 5-15 | Object 15-25 | Object 25-35 |
|---|---|---|---|---|---|---|---|---|---|
| **Specialist MLLMs** | | | | | | | | | |
| SpatialReasoner | 46.73 | 53.30 | 53.71 | 33.72 | 38.27 | 48.15 | – | – | – |
| VLM-3R | 63.82 | 70.56 | 60.57 | 49.22 | 49.79 | 50.21 | – | – | – |
| ViLaSR | 44.22 | 52.28 | 48.00 | 22.87 | 23.05 | 34.57 | – | – | – |
| Qwen2.5-VL | 46.23 | 59.39 | 52.00 | 24.42 | 25.10 | 37.86 | – | – | – |
| **Novel View Synthesis** | | | | | | | | | |
| GenWarp | 69.35 | 71.07 | 66.29 | 53.10 | 47.33 | 55.14 | 4.32 | 4.81 | 4.34 |
| **Pixel-Wise Warping** | | | | | | | | | |
| Forward | 70.85 | 73.60 | 62.86 | 56.20 | 56.79 | 60.49 | 3.22 | 4.04 | 4.78 |
| Backward | 71.86 | 75.63 | 68.57 | *62.40* | 58.02 | 66.67 | 4.53 | *5.52* | 5.94 |
| **Token Warping** | | | | | | | | | |
| Forward | 60.30 | 64.47 | 54.86 | 55.04 | 55.14 | 53.09 | 4.09 | 4.27 | 4.07 |
| Backward-Nearest (Ours) | *74.87* | **80.71** | *74.86* | **67.44** | *62.96* | *73.25* | *4.80* | 5.39 | **6.19** |
| Backward-Adaptive (Ours) | **77.89** | *79.70* | **78.86** | **67.44** | **66.26** | **75.72** | **4.97** | **5.76** | *6.11* |

Bold = best, italic = second best. View overlap ranges indicate task difficulty (lower overlap = harder). All results use ground-truth depth. Evaluated with Qwen2.5-VL 14B.

Qualitative Comparisons

Visual comparisons of warped results across methods. Token warping (our backward variants) preserves scene structure and enables correct spatial reasoning, while pixel-wise warping introduces severe artifacts and generative approaches may hallucinate content. The pixelated images in the token warping columns are displayed solely for visualization; the framework operates entirely on token embeddings. Below each image we show the response from Qwen2.5-VL when given the corresponding warped result.

Legend: Our method · Ground truth · green = correct · red = wrong

We thank Daehyeon Choi and Sangwoo Youn for their valuable discussions. This work was supported by the National Research Foundation of Korea (NRF), the Institute of Information & Communications Technology Planning & Evaluation (IITP), the Industrial Technology Innovation Program, and the National Supercomputing Center, funded by the Korean government (MSIT/MOTIE).

BibTeX

@inproceedings{lee2026tokenwarping,
  title={Token Warping Helps MLLMs Look from Nearby Viewpoints},
  author={Lee, Phillip Y. and Park, Chanho and Park, Mingue and Yoo, Seungwoo and Koo, Juil and Sung, Minhyuk},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}