Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Yiming Qin1, Bomin Wei2, Jiaxin Ge1, Konstantinos Kallidromitis3,
Stephanie Fu1, Trevor Darrell1, XuDong Wang1*

UC Berkeley1        UCLA2        Panasonic AI Research3

*Corresponding author

Teaser Image

Abstract

Vision–Language Models (VLMs) excel at reasoning in linguistic space but struggle with tasks that demand dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have few mechanisms for capturing dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (CoVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens — compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, CoVT distills knowledge from lightweight vision experts that capture complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, a VLM equipped with CoVT autoregressively predicts these visual tokens and reconstructs dense supervision signals from them (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual-token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating CoVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16%, demonstrating that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.

Method

CoVT first generates interleaved discrete (textual) and continuous (visual) thinking tokens, and then conditions next-token prediction on these thinking tokens to reason toward the final answer.
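The sketch below is a conceptual illustration (not the released implementation) of what reasoning in a continuous visual-token space could look like at inference: when the model reaches a visual-thought position, the last hidden state is kept continuous and fed back as the next input embedding instead of being discretized into a vocabulary token. For brevity, the interleaving with textual thought tokens is collapsed into a single block of visual thoughts, and `model`, `embed`, `lm_head`, and `to_visual_embed` are hypothetical stand-ins.

```python
import torch

@torch.no_grad()
def generate_with_visual_thoughts(model, embed, lm_head, to_visual_embed,
                                  input_embeds, num_visual_thoughts=20,
                                  max_new_text_tokens=256, eos_id=2):
    """input_embeds: (1, T, d) embeddings of the image + question prompt."""
    # 1) Emit a small budget of continuous visual thinking tokens: each step's
    #    last hidden state is projected and appended as the next input embedding.
    for _ in range(num_visual_thoughts):
        hidden = model(inputs_embeds=input_embeds).last_hidden_state  # (1, T, d)
        visual_thought = to_visual_embed(hidden[:, -1:])              # stays continuous
        input_embeds = torch.cat([input_embeds, visual_thought], dim=1)

    # 2) Continue with ordinary discrete decoding, now conditioned on the
    #    visual thoughts sitting in the context.
    text_ids = []
    for _ in range(max_new_text_tokens):
        hidden = model(inputs_embeds=input_embeds).last_hidden_state
        next_id = lm_head(hidden[:, -1]).argmax(dim=-1)               # greedy decode
        if next_id.item() == eos_id:
            break
        text_ids.append(next_id.item())
        input_embeds = torch.cat([input_embeds, embed(next_id)[:, None]], dim=1)
    return text_ids
```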

CoVT Method Overview

The training pipeline of CoVT. CoVT first generates the thinking process, which contains visual thinking tokens, and then conditions next-token prediction on these visual thoughts to reason toward the final answer. To endow the tokens with perceptual meaning, we align them with lightweight vision experts (SAM, DepthAnything, PIDINet, DINO) on their respective tasks during training. Specifically: SAM uses 8 visual tokens as mask prompts; DepthAnything uses 4 tokens to reconstruct depth; PIDINet uses 4 tokens to reconstruct edges; and DINO uses 4 tokens to match patch-level features. The VLM is finetuned with LoRA, and all projection layers are trainable. Note: during inference, dense predictions are decoded only when interpretability is desired; otherwise, reasoning occurs entirely in the latent visual space.
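As a rough illustration of how these expert alignments could be wired up, the sketch below projects the continuous visual thinking tokens into each expert's space and applies simple reconstruction and matching losses. The 8/4/4/4 token split follows the description above, but every module, dimension, and loss weight here is an assumption for illustration; the actual expert decoders are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoVTProjectors(nn.Module):
    """Hypothetical projection heads from the VLM hidden size into each expert's space."""
    def __init__(self, d_model=4096):
        super().__init__()
        self.to_sam_prompt = nn.Linear(d_model, 256)   # 8 tokens -> SAM mask prompts
        self.to_depth      = nn.Linear(d_model, 1024)  # 4 tokens -> depth reconstruction
        self.to_edge       = nn.Linear(d_model, 1024)  # 4 tokens -> edge reconstruction
        self.to_dino       = nn.Linear(d_model, 768)   # 4 tokens -> DINO patch features

def alignment_losses(visual_tokens, projectors, targets):
    """visual_tokens: (B, 20, d_model) hidden states of the ~20 CoVT tokens,
    split as [8 segment | 4 depth | 4 edge | 4 dino]."""
    seg_t, depth_t, edge_t, dino_t = visual_tokens.split([8, 4, 4, 4], dim=1)

    # Depth / edge tokens are regressed against the expert outputs
    # (the dense decoders are omitted; plain MSE shown for brevity).
    depth_loss = F.mse_loss(projectors.to_depth(depth_t), targets["depth"])
    edge_loss  = F.mse_loss(projectors.to_edge(edge_t),  targets["edge"])

    # DINO tokens are matched to patch-level features via cosine similarity.
    dino_pred = F.normalize(projectors.to_dino(dino_t), dim=-1)
    dino_tgt  = F.normalize(targets["dino"], dim=-1)
    dino_loss = 1.0 - (dino_pred * dino_tgt).sum(-1).mean()

    # Segmentation tokens become SAM prompt embeddings; their mask loss
    # (Hungarian matching + Dice/focal) is sketched in the 2D Perception part.
    sam_prompts = projectors.to_sam_prompt(seg_t)
    return depth_loss + edge_loss + dino_loss, sam_prompts
```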

👀 Explore how each type of CoVT token contributes to reasoning

01 · 2D Perception

Segmentation Tokens · SAM

Segmentation tokens capture instance-level masks, equipping VLMs with 2D perception capabilities.

  • 8 segmentation tokens are aligned with the SAM decoder space at the prompt level.
  • During training, the decoded masks are supervised via the Hungarian matching algorithm, with a matching cost composed of Dice loss and focal loss (a sketch of this supervision follows the list).
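Below is a minimal sketch of that mask supervision, assuming hypothetical shapes: masks decoded from the 8 segmentation tokens are matched to ground-truth masks with the Hungarian algorithm over a Dice-plus-focal cost, and the matched pairs are then supervised with the same losses. The helpers and weights are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def dice_loss(pred, target, eps=1.0):
    # pred, target: (N, H*W); pred already in [0, 1]
    inter = (pred * target).sum(-1)
    return 1.0 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # logits, target: (N, H*W)
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    a_t = alpha * target + (1 - alpha) * (1 - target)
    return (a_t * (1 - p_t) ** gamma * ce).mean(-1)

def hungarian_mask_loss(pred_logits, gt_masks):
    """pred_logits: (8, H, W) masks decoded from the 8 segmentation tokens;
    gt_masks: (M, H, W) binary ground-truth masks."""
    P, M = pred_logits.size(0), gt_masks.size(0)
    pred_flat = pred_logits.flatten(1)        # (8, H*W)
    gt_flat = gt_masks.flatten(1).float()     # (M, H*W)

    # Pairwise cost: Dice + focal between every (prediction, ground truth) pair.
    # Matching itself is non-differentiable, so it runs under no_grad.
    with torch.no_grad():
        cost = torch.zeros(P, M)
        for i in range(P):
            rep = pred_flat[i].unsqueeze(0).expand(M, -1)
            cost[i] = dice_loss(rep.sigmoid(), gt_flat) + focal_loss(rep, gt_flat)
        rows, cols = linear_sum_assignment(cost.cpu().numpy())

    matched_pred, matched_gt = pred_flat[rows], gt_flat[cols]
    return (dice_loss(matched_pred.sigmoid(), matched_gt)
            + focal_loss(matched_pred, matched_gt)).mean()
```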

Multiple CoVT Tokens Visualization

Different CoVT tokens contribute complementary cues that enable the model to solve complex perceptual reasoning tasks.

Click different <CoVT Tokens> to view the decoded visualization results.

demo1 visual
Question

Describe the scene in the picture in detail, and find out how many clouds there are in the sky.

Answer

<Think> Because the segmentation of the image is <SEGMENT Token>, the depth map of the image is <DEPTH Token>, and the edge map of the image is <EDGE Token>. </Think>

<Answer> The image depicts a charming, handcrafted scene featuring a small doll-like figure standing on a grassy patch. The figure is dressed in an orange outfit with matching boots, and it has a cheerful expression. The doll is holding a stick with a smiling sun at the top, which appears to be made from felt or similar material.
In the background, there are two green felt trees, each with a simple triangular shape and a brown trunk. Above the trees, there are four fluffy white clouds, each adorned with a brown button-like detail. The sky is painted in a soft blue color, providing a serene backdrop for the scene.
The overall composition is whimsical and playful, likely intended to evoke a sense of joy and innocence. The use of bright colors and simple shapes contributes to the overall aesthetic appeal of the image.
To answer the question about the number of clouds: There are four clouds visible in the sky above the trees. </Answer>

demo1 visual
Question

Two points are circled on the image, labeled by A and B. Which point is closer to the camera?

Answer

<Think> Because the segmentation of the image is <SEGMENT Token>, the depth map of the image is <DEPTH Token>. <SEGMENT Token> </Think>

<Answer> The answer is the point B. </Answer>

demo2 visual
Question

Which object is closer to the camera: the books (red box) or the chair (blue box)?

Answer

<Think> The depth map of the image is <DEPTH Token>, and the edge map of the image is <EDGE Token>. </Think>

<Answer> The books are closer to the camera than the chair. </Answer>

demo4 visual
Question

How many white vertical lines are drawn on the ground?

Answer

<Think> Because the segmentation of the image is <SEGMENT Token>, the edge map of the image is <EDGE Token>. </Think>

<Answer> There are five white vertical lines. </Answer>

Token Playground

Drag the slider to view the decoded visualization results.


Single CoVT Token Visualization

Inspect how a single type of CoVT token guides reasoning.


Gallery

Click arrows to navigate through reasoning examples.


Experiment Results

Major Benchmarks
Comparison of CoVT with the baseline and closed-source models. CoVT delivers consistent improvements across all vision-centric benchmarks and further shows that each visual-token type contributes most to the tasks aligned with the information it encodes.
More Experiments 1
Comparison between CoVT and Aurora based on LLaVA-v1.5-13B. † indicates our reproduced results based on the provided checkpoints.
More Experiments 2
Beyond the gains on vision-centric benchmarks, CoVT also achieves slight improvements on non–vision-centric tasks.

BibTeX

Coming soon...