Image tokens are inherently robust to positional perturbations, making them ideal for geometric viewpoint transformations.
Modern ViT-based MLLMs represent images as a sequence of image tokens — localized, semantically meaningful units that function as perceptual atoms. Inspired by theories of mental imagery, we hypothesize that tokens provide the right part-level granularity for viewpoint transformation.
Object-level representations are too coarse, sacrificing important spatial and appearance details. Pixel-level representations are too fine-grained and sensitive to even small depth or geometric noise. Image tokens lie between these extremes, retaining rich visual detail while remaining robust to local perturbations.
Even in the large-perturbation regime (19–20 pixels of positional noise), the model exhibits only mild degradation: token representations in MLLMs are highly robust to noise in the image positions from which tokens are fetched.
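As a concrete illustration of this kind of positional-robustness probe, the sketch below fetches each image patch from a jittered pixel location before encoding; the jitter range and the helper `fetch_tokens_with_jitter` are our own illustrative choices, not the paper's exact experimental protocol.

```python
import torch

def fetch_tokens_with_jitter(image, patch=14, max_shift=20, seed=0):
    """Toy positional-robustness probe (our construction): fetch every patch
    from a position jittered by up to `max_shift` pixels, then feed the
    perturbed patches to the vision encoder in place of the regular grid.

    image: (3, H, W) tensor with H and W divisible by `patch`.
    Returns patches of shape (num_patches, 3, patch, patch).
    """
    g = torch.Generator().manual_seed(seed)
    _, H, W = image.shape
    gh, gw = H // patch, W // patch

    # Nominal top-left corner of every patch on the regular grid.
    ys, xs = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    tops, lefts = ys.flatten() * patch, xs.flatten() * patch

    # Perturb each fetch position by an integer offset in [-max_shift, max_shift].
    noise = torch.randint(-max_shift, max_shift + 1, (2, tops.numel()), generator=g)
    tops = (tops + noise[0]).clamp(0, H - patch)
    lefts = (lefts + noise[1]).clamp(0, W - patch)

    return torch.stack(
        [image[:, t:t + patch, l:l + patch] for t, l in zip(tops.tolist(), lefts.tolist())])
```

Comparing answers obtained from these perturbed patches against the unperturbed run gives a simple measure of how much positional noise the token representation tolerates.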
A lightweight, training-free approach that warps image tokens using depth maps to simulate novel viewpoints.
Pixel-wise warping retrieves individual pixels at each target coordinate. Even small depth errors cause severe local distortions and break local semantics, degrading MLLM understanding.
Token warping retrieves intact tokens (patches) from the source view, preserving local semantics as coherent units. It is robust to positional noise, enabling reliable viewpoint-aware perception.
We explore two axes of design: the warping direction (forward vs. backward) and the token fetching strategy (nearest vs. adaptive).
Forward warping projects source tokens to target-viewpoint positions. This results in irregular, sparse token placement with holes, which is out-of-distribution for MLLMs trained on dense grids.
Backward warping defines a regular grid at the target view and maps each position back to the source. This produces the dense, regularly spaced tokens that MLLMs expect.
Nearest fetching finds the closest existing token in the source grid for each mapped coordinate. It is simple and efficient, and performs comparably to adaptive fetching.
Adaptive fetching dynamically crops a new patch centered at the mapped coordinate and re-encodes it. It is more flexible but requires additional computation.
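To make the backward + nearest combination concrete, here is a minimal PyTorch sketch under simplifying assumptions: shared pinhole intrinsics, a depth value already available at each target-view patch center (e.g., splatted from the source depth), and no occlusion handling. Names such as `backward_token_warp`, `T_tgt_to_src`, and `grid_hw` are ours for illustration, not the released implementation.

```python
import torch

def backward_token_warp(src_tokens, tgt_depth, K, T_tgt_to_src, patch=14, grid_hw=(24, 24)):
    """Backward token warping with nearest fetching (illustrative sketch).

    src_tokens:   (H*W, C) image tokens of the source view, row-major grid
    tgt_depth:    (H, W) depth at the target-view patch centers
    K:            (3, 3) pinhole intrinsics shared by both views (pixels)
    T_tgt_to_src: (4, 4) rigid transform from target camera to source camera
    """
    H, W = grid_hw
    device = src_tokens.device

    # Regular grid of patch centers (in pixels) at the target view.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device), torch.arange(W, device=device), indexing="ij")
    centers = torch.stack([(xs + 0.5) * patch, (ys + 0.5) * patch], dim=-1)   # (H, W, 2)
    ones = torch.ones(H, W, 1, device=device)

    # Unproject each target patch center to 3D using the target-view depth.
    rays = torch.cat([centers, ones], dim=-1) @ torch.linalg.inv(K).T         # (H, W, 3)
    pts_tgt = rays * tgt_depth.unsqueeze(-1)                                  # (H, W, 3)

    # Map the 3D points into the source camera and project them to pixels.
    pts_src = (torch.cat([pts_tgt, ones], dim=-1) @ T_tgt_to_src.T)[..., :3]  # (H, W, 3)
    proj = pts_src @ K.T
    uv_src = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)                   # (H, W, 2)

    # Nearest fetching: snap each mapped coordinate to the closest source token.
    col = (uv_src[..., 0] / patch - 0.5).round().clamp(0, W - 1).long()
    row = (uv_src[..., 1] / patch - 0.5).round().clamp(0, H - 1).long()
    return src_tokens[(row * W + col).reshape(-1)]                            # (H*W, C)
```

The output keeps the dense, regularly spaced token layout that MLLMs expect; replacing the final gather with re-encoding a patch cropped around each `uv_src` coordinate would give the adaptive variant.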
ViewBench: a benchmark for evaluating MLLMs on spatial reasoning tasks that require imagining a scene from alternative viewpoints.
ViewBench-Text (text labels): evaluates left-right spatial reasoning between two text-labeled points after a viewpoint change, testing whether MLLMs can infer how the spatial relationship reverses.
ViewBench-Shape (shape labels): the same spatial reasoning task, but using simple geometric shapes (star, triangle) as reference points instead of text labels.
ViewBench-Object (description, scored 1-10): assesses whether the MLLM can accurately describe objects as they would appear from the target viewpoint, preserving fine-grained visual details.
ViewBench thus evaluates MLLMs both on reasoning about left-right spatial relationships after a viewpoint change (Text, Shape) and on describing object properties visible in the warped target view (Object).
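For the left-right tasks, scoring can be as simple as checking which side the model's answer commits to; the parser below is our own toy illustration rather than the benchmark's official evaluation code.

```python
import re

def score_left_right(response: str, answer: str) -> bool:
    """Credit a Text/Shape response only if it mentions the ground-truth side
    ("left" or "right") and not the opposite one (illustrative scorer)."""
    resp = response.lower()
    hits = {side: bool(re.search(rf"\b{side}\b", resp)) for side in ("left", "right")}
    return hits[answer] and not hits["left" if answer == "right" else "right"]

# A response that correctly infers the relation flips after the viewpoint change.
print(score_left_right("From this viewpoint, point A is to the left of point B.", "left"))  # True
```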
Backward token warping consistently outperforms all baselines across every task and difficulty level.
| Method | ViewBench-Text (%) | | | ViewBench-Shape (%) | | | ViewBench-Object (1-10) | | |
|---|---|---|---|---|---|---|---|---|---|
| View overlap | 5-15 | 15-25 | 25-35 | 5-15 | 15-25 | 25-35 | 5-15 | 15-25 | 25-35 |
| *Specialist MLLMs* | | | | | | | | | |
| SpatialReasoner | 46.73 | 53.30 | 53.71 | 33.72 | 38.27 | 48.15 | — | — | — |
| VLM-3R | 63.82 | 70.56 | 60.57 | 49.22 | 49.79 | 50.21 | — | — | — |
| ViLaSR | 44.22 | 52.28 | 48.00 | 22.87 | 23.05 | 34.57 | — | — | — |
| Qwen2.5-VL | 46.23 | 59.39 | 52.00 | 24.42 | 25.10 | 37.86 | — | — | — |
| *Novel View Synthesis* | | | | | | | | | |
| GenWarp | 69.35 | 71.07 | 66.29 | 53.10 | 47.33 | 55.14 | 4.32 | 4.81 | 4.34 |
| *Pixel-Wise Warping* | | | | | | | | | |
| Forward | 70.85 | 73.60 | 62.86 | 56.20 | 56.79 | 60.49 | 3.22 | 4.04 | 4.78 |
| Backward | 71.86 | 75.63 | 68.57 | 62.40 | 58.02 | 66.67 | 4.53 | <u>5.52</u> | 5.94 |
| *Token Warping* | | | | | | | | | |
| Forward | 60.30 | 64.47 | 54.86 | 55.04 | 55.14 | 53.09 | 4.09 | 4.27 | 4.07 |
| Backward-Nearest (Ours) | <u>74.87</u> | **80.71** | <u>74.86</u> | **67.44** | <u>62.96</u> | <u>73.25</u> | <u>4.80</u> | 5.39 | **6.19** |
| Backward-Adaptive (Ours) | **77.89** | <u>79.70</u> | **78.86** | **67.44** | **66.26** | **75.72** | **4.97** | **5.76** | <u>6.11</u> |
Bold = best, underline = second best. View overlap ranges indicate difficulty (lower = harder). All results use GT depth. Evaluated with Qwen2.5-VL 14B.
Visual comparisons of warped results across methods. Token warping (our backward variants) preserves scene structure and enables correct spatial reasoning, while pixel-wise warping introduces severe artifacts and generative approaches may hallucinate content. The pixelated images in the token warping columns are displayed solely for visualization; the framework operates entirely on token embeddings. Below each image we show the response from Qwen2.5-VL when given the corresponding warped result.
Example ViewBench-Text query: "Is the A point on the right or left of the B point?" (ground-truth answer: "left"). In the figure, our method is shown alongside the ground-truth target view; green marks correct responses and red marks wrong ones.
We thank Daehyeon Choi and Sangwoo Youn for their valuable discussions. This work was supported by the National Research Foundation of Korea (NRF), the Institute of Information & Communications Technology Planning & Evaluation (IITP), the Industrial Technology Innovation Program, and the National Supercomputing Center, funded by the Korean government (MSIT/MOTIE).
@inproceedings{lee2026tokenwarping,
title={Token Warping Helps MLLMs Look from Nearby Viewpoints},
author={Lee, Phillip Y. and Park, Chanho and Park, Mingue and Yoo, Seungwoo and Koo, Juil and Sung, Minhyuk},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}