Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Abstract
Vision–Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that demand dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms for capturing dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens: compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts that capture complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, a VLM equipped with CoVT autoregressively predicts these visual tokens and reconstructs dense supervision signals from them (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual-token space, preserving efficiency while optionally decoding dense predictions for interpretability. Across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
Method
CoVT first generates interleaved discrete (textual) and continuous (visual) thinking tokens, then conditions next-token prediction on these thinking tokens to reason toward the final answer.
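To make the interleaving concrete, below is a minimal sketch of how such a decoding loop could work. It assumes a hypothetical interface (a `backbone` that returns per-position hidden states, an `lm_head` that maps hidden states to vocabulary logits, an `embed_fn` for discrete token ids, and reserved placeholder ids for the visual-token slots); none of these names come from the released code.

```python
import torch

# Sketch of interleaved discrete/continuous decoding. When the model emits a
# reserved visual-token id, its last hidden state is fed back as a continuous
# embedding instead of a discrete word embedding. All names are hypothetical.
@torch.no_grad()
def covt_generate(backbone, lm_head, embed_fn, prompt_embeds, visual_ids,
                  eos_id, max_new_tokens=256):
    embeds = prompt_embeds                               # (1, T, D) prompt embeddings
    out_ids = []
    for _ in range(max_new_tokens):
        hidden = backbone(inputs_embeds=embeds)          # (1, T, D) hidden states
        next_id = int(lm_head(hidden[:, -1]).argmax(-1)) # greedy next token
        out_ids.append(next_id)
        if next_id == eos_id:
            break
        if next_id in visual_ids:
            # Continuous visual thought: reuse the hidden state itself as the
            # next input embedding rather than looking up a word embedding.
            next_emb = hidden[:, -1:, :]
        else:
            next_emb = embed_fn(torch.tensor([[next_id]]))
        embeds = torch.cat([embeds, next_emb], dim=1)
    return out_ids
```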
The training pipeline of CoVT. CoVT first generates the thinking process, which contains visual thinking tokens, and then conditions next-token prediction on these visual thoughts to reason toward the final answer. To endow the tokens with perceptual meaning, we align them with lightweight vision experts (SAM, DepthAnything, PiDiNet, DINO) on their respective tasks during training. Specifically: SAM uses 8 visual tokens as mask prompts; DepthAnything uses 4 tokens to reconstruct depth; PiDiNet uses 4 tokens to reconstruct edges; and DINO uses 4 tokens to match patch-level features. The VLM is finetuned with LoRA, and all projection layers are trainable. Note: during inference, dense predictions are decoded only when interpretability is desired; otherwise, reasoning occurs entirely in the latent visual space.
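As a rough illustration of the expert alignment described in this caption, the sketch below routes the roughly 20 reserved hidden states to per-expert projection heads. The hidden size and the expert dimensions are assumptions made for the sketch, not values reported in the paper.

```python
import torch
import torch.nn as nn

# Illustrative routing of the ~20 visual-thinking tokens to the four experts.
# D_VLM and the per-expert dimensions below are assumed, not from the paper.
D_VLM = 4096
N_SEG, N_DEPTH, N_EDGE, N_DINO = 8, 4, 4, 4   # token budget from the caption

class CoVTProjections(nn.Module):
    """Trainable projections from VLM hidden states into each expert's space."""
    def __init__(self, sam_dim=256, depth_dim=1024, edge_dim=512, dino_dim=768):
        super().__init__()
        self.to_sam = nn.Linear(D_VLM, sam_dim)      # 8 tokens -> SAM mask prompts
        self.to_depth = nn.Linear(D_VLM, depth_dim)  # 4 tokens -> depth reconstruction
        self.to_edge = nn.Linear(D_VLM, edge_dim)    # 4 tokens -> edge reconstruction
        self.to_dino = nn.Linear(D_VLM, dino_dim)    # 4 tokens -> DINO patch features

    def forward(self, hidden):
        # hidden: (B, 20, D_VLM) hidden states at the reserved visual-token slots
        seg, depth, edge, dino = hidden.split([N_SEG, N_DEPTH, N_EDGE, N_DINO], dim=1)
        return (self.to_sam(seg), self.to_depth(depth),
                self.to_edge(edge), self.to_dino(dino))
```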
01 · 2D Perception
Segmentation Tokens · SAM
Segmentation tokens capture instance-level masks, equipping VLMs with 2D perception capability.
- 8 segmentation tokens are aligned to the SAM decoder space at the prompt level, serving as mask prompts.
- During training, the decoded masks are supervised via the Hungarian matching algorithm, with a cost function that combines dice loss and focal loss, as sketched below.
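A hedged sketch of this supervision step, assuming predicted mask logits of shape (N, H·W) decoded by SAM from the segmentation tokens and binary ground-truth masks of shape (M, H·W); the loss weights and the exact focal formulation are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def pairwise_dice_cost(pred_logits, gt_masks, eps=1e-6):
    # pred_logits: (N, H*W), gt_masks: (M, H*W) binary -> (N, M) dice cost.
    prob = pred_logits.sigmoid()
    inter = prob @ gt_masks.T
    union = prob.sum(-1)[:, None] + gt_masks.sum(-1)[None, :]
    return 1 - (2 * inter + eps) / (union + eps)

def pairwise_focal_cost(pred_logits, gt_masks, alpha=0.25, gamma=2.0):
    # Sigmoid focal cost between every predicted and every ground-truth mask.
    prob = pred_logits.sigmoid()
    pos = alpha * (1 - prob) ** gamma * F.softplus(-pred_logits)  # -log p
    neg = (1 - alpha) * prob ** gamma * F.softplus(pred_logits)   # -log (1 - p)
    return (pos @ gt_masks.T + neg @ (1 - gt_masks).T) / gt_masks.shape[-1]

def mask_matching_loss(pred_logits, gt_masks, w_dice=1.0, w_focal=1.0):
    # Hungarian matching under the combined dice + focal cost, then a loss
    # over the matched prediction/ground-truth pairs only.
    cost = (w_dice * pairwise_dice_cost(pred_logits, gt_masks)
            + w_focal * pairwise_focal_cost(pred_logits, gt_masks))
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return cost[rows, cols].mean()
```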
Multiple CoVT Tokens Visualization
Different CoVT tokens contribute complementary cues that enable the model to solve complex perceptual reasoning tasks.
Describe the scene in the picture in detail, and find out how many clouds there are in the sky.
<Think>
Because the segmentation of the image is <SEGMENT Token>, the depth map of the image is <DEPTH Token>, and the edge map of the image is <EDGE Token>.
</Think>
<Answer>
The image depicts a charming, handcrafted scene featuring a small doll-like figure standing on a grassy patch. The figure is dressed in an orange outfit with matching boots, and it has a cheerful expression. The doll is holding a stick with a smiling sun at the top, which appears to be made from felt or similar material.
In the background, there are two green felt trees, each with a simple triangular shape and a brown trunk. Above the trees, there are four fluffy white clouds, each adorned with a brown button-like detail. The sky is painted in a soft blue color, providing a serene backdrop for the scene.
The overall composition is whimsical and playful, likely intended to evoke a sense of joy and innocence. The use of bright colors and simple shapes contributes to the overall aesthetic appeal of the image.
To answer the question about the number of clouds: There are four clouds visible in the sky above the trees.
</Answer>
Two points are circled on the image, labeled by A and B. Which point is closer to the camera?
<Think>
Because the segmentation of the image is <SEGMENT Token>, the depth map of the image is <DEPTH Token>.
</Think>
<Answer>
The answer is point B.
</Answer>
Which object is closer to the camera: the books (red box) or the chair (blue box)?
<Think>
The depth map of the image is <DEPTH Token>, and the edge map of the image is <EDGE Token>.
</Think>
<Answer>
The books are closer to the camera than the chair.
</Answer>
How many white vertical lines are drawn on the ground?
<Think>
Because the segmentation of the image is <SEGMENT Token>, the edge map of the image is <EDGE Token>.
</Think>
<Answer>
There are five white vertical lines.
</Answer>
Token Playground
Single CoVT Token Visualization
Inspect how a single type of CoVT token guides reasoning.
Gallery
Experiment Results
BibTeX
Coming soon...