Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Abstract
Vision–Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that demand dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms for capturing dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens: compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts that capture complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, a VLM equipped with CoVT autoregressively predicts these visual tokens and reconstructs dense supervision signals from them (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual-token space, preserving efficiency while optionally decoding dense predictions for interpretability. Across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.
Method
CoVT first generates interleaved discrete (textual) and continuous (visual) thinking tokens, then conditions next-token prediction on these thinking tokens to reason toward the final answer.
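To make the interleaving concrete, below is a minimal sketch of how such a decoding loop could work. It assumes a hypothetical interface (a `backbone` that returns per-position hidden states, an `lm_head` that maps hidden states to vocabulary logits, an `embed_fn` for discrete token ids, and reserved placeholder ids for the visual-token slots); none of these names come from the released code.

```python
import torch

# Sketch of interleaved discrete/continuous decoding. When the model emits a
# reserved visual-token id, its last hidden state is fed back as a continuous
# embedding instead of a discrete word embedding. All names are hypothetical.
@torch.no_grad()
def covt_generate(backbone, lm_head, embed_fn, prompt_embeds, visual_ids,
                  eos_id, max_new_tokens=256):
    embeds = prompt_embeds                               # (1, T, D) prompt embeddings
    out_ids = []
    for _ in range(max_new_tokens):
        hidden = backbone(inputs_embeds=embeds)          # (1, T, D) hidden states
        next_id = int(lm_head(hidden[:, -1]).argmax(-1)) # greedy next token
        out_ids.append(next_id)
        if next_id == eos_id:
            break
        if next_id in visual_ids:
            # Continuous visual thought: reuse the hidden state itself as the
            # next input embedding rather than looking up a word embedding.
            next_emb = hidden[:, -1:, :]
        else:
            next_emb = embed_fn(torch.tensor([[next_id]]))
        embeds = torch.cat([embeds, next_emb], dim=1)
    return out_ids
```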
The training pipeline of CoVT. CoVT first generates the thinking process, which contains visual thinking tokens, and then conditions next-token prediction on these visual thoughts to reason toward the final answer. To endow the tokens with perceptual meaning, we align them with lightweight vision experts (SAM, DepthAnything, PiDiNet, DINO) on their respective tasks during training. Specifically: SAM uses 8 visual tokens as mask prompts; DepthAnything uses 4 tokens to reconstruct depth; PiDiNet uses 4 tokens to reconstruct edges; and DINO uses 4 tokens to match patch-level features. The VLM is finetuned with LoRA, and all projection layers are trainable. Note: during inference, dense predictions are decoded only when interpretability is desired; otherwise, reasoning occurs entirely in the latent visual space.
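As a rough illustration of the expert alignment described in this caption, the sketch below routes the roughly 20 reserved hidden states to per-expert projection heads. The hidden size and the expert dimensions are assumptions made for the sketch, not values reported in the paper.

```python
import torch
import torch.nn as nn

# Illustrative routing of the ~20 visual-thinking tokens to the four experts.
# D_VLM and the per-expert dimensions below are assumed, not from the paper.
D_VLM = 4096
N_SEG, N_DEPTH, N_EDGE, N_DINO = 8, 4, 4, 4   # token budget from the caption

class CoVTProjections(nn.Module):
    """Trainable projections from VLM hidden states into each expert's space."""
    def __init__(self, sam_dim=256, depth_dim=1024, edge_dim=512, dino_dim=768):
        super().__init__()
        self.to_sam = nn.Linear(D_VLM, sam_dim)      # 8 tokens -> SAM mask prompts
        self.to_depth = nn.Linear(D_VLM, depth_dim)  # 4 tokens -> depth reconstruction
        self.to_edge = nn.Linear(D_VLM, edge_dim)    # 4 tokens -> edge reconstruction
        self.to_dino = nn.Linear(D_VLM, dino_dim)    # 4 tokens -> DINO patch features

    def forward(self, hidden):
        # hidden: (B, 20, D_VLM) hidden states at the reserved visual-token slots
        seg, depth, edge, dino = hidden.split([N_SEG, N_DEPTH, N_EDGE, N_DINO], dim=1)
        return (self.to_sam(seg), self.to_depth(depth),
                self.to_edge(edge), self.to_dino(dino))
```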
01 · 2D Perception
Segmentation Tokens · SAM
Segmentation tokens capture instance-level masks, equipping VLMs with 2D perception capability.
- 8 segmentation tokens are aligned to the SAM decoder space at the prompt level, serving as mask prompts.
- During training, the decoded masks are supervised via the Hungarian matching algorithm, with a cost function that combines dice loss and focal loss, as sketched below.
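A hedged sketch of this supervision step, assuming predicted mask logits of shape (N, H·W) decoded by SAM from the segmentation tokens and binary ground-truth masks of shape (M, H·W); the loss weights and the exact focal formulation are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def pairwise_dice_cost(pred_logits, gt_masks, eps=1e-6):
    # pred_logits: (N, H*W), gt_masks: (M, H*W) binary -> (N, M) dice cost.
    prob = pred_logits.sigmoid()
    inter = prob @ gt_masks.T
    union = prob.sum(-1)[:, None] + gt_masks.sum(-1)[None, :]
    return 1 - (2 * inter + eps) / (union + eps)

def pairwise_focal_cost(pred_logits, gt_masks, alpha=0.25, gamma=2.0):
    # Sigmoid focal cost between every predicted and every ground-truth mask.
    prob = pred_logits.sigmoid()
    pos = alpha * (1 - prob) ** gamma * F.softplus(-pred_logits)  # -log p
    neg = (1 - alpha) * prob ** gamma * F.softplus(pred_logits)   # -log (1 - p)
    return (pos @ gt_masks.T + neg @ (1 - gt_masks).T) / gt_masks.shape[-1]

def mask_matching_loss(pred_logits, gt_masks, w_dice=1.0, w_focal=1.0):
    # Hungarian matching under the combined dice + focal cost, then a loss
    # over the matched prediction/ground-truth pairs only.
    cost = (w_dice * pairwise_dice_cost(pred_logits, gt_masks)
            + w_focal * pairwise_focal_cost(pred_logits, gt_masks))
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return cost[rows, cols].mean()
```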
Multiple CoVT Tokens Visualization
Different CoVT tokens contribute complementary cues that enable the model to solve complex perceptual reasoning tasks.
Describe the scene in the picture in detail, and find out how many clouds there are in the sky.
<Think>
Because the segmentation of the image is <SEGMENT Token>, the depth map of the image is <DEPTH Token>, and the edge map of the image is <EDGE Token>.
</Think>
<Answer>
The image depicts a charming, handcrafted scene featuring a small doll-like figure standing on a grassy patch. The figure is dressed in an orange outfit with matching boots, and it has a cheerful expression. The doll is holding a stick with a smiling sun at the top, which appears to be made from felt or similar material.
In the background, there are two green felt trees, each with a simple triangular shape and a brown trunk. Above the trees, there are four fluffy white clouds, each adorned with a brown button-like detail. The sky is painted in a soft blue color, providing a serene backdrop for the scene.
The overall composition is whimsical and playful, likely intended to evoke a sense of joy and innocence. The use of bright colors and simple shapes contributes to the overall aesthetic appeal of the image.
To answer the question about the number of clouds: There are four clouds visible in the sky above the trees.
</Answer>
Two points are circled on the image, labeled by A and B. Which point is closer to the camera?
<Think>
Because the segmentation of the image is <SEGMENT Token>, the depth map of the image is <DEPTH Token>.
</Think>
<Answer>
The answer is point B.
</Answer>
Which object is closer to the camera: the books (red box) or the chair (blue box)?
<Think>
The depth map of the image is <DEPTH Token>, and the edge map of the image is <EDGE Token>.
</Think>
<Answer>
The books are closer to the camera than the chair.
</Answer>
How many white vertical lines are drawn on the ground?
<Think>
Because the segmentation of the image is <SEGMENT Token>, the edge map of the image is <EDGE Token>.
</Think>
<Answer>
There are five white vertical lines.
</Answer>
Token Playground
Single CoVT Token Visualization
Inspect how a single type of CoVT token guides reasoning.
Gallery
Experiment Results
BibTeX
Coming soon...