Major Benchmarks
Comparison of CoVT with the baseline and closed-source models across CV-Bench and other vision-centric benchmarks.
| Setting | Seg token | Depth token | DINO token | Edge token | CV-Bench (overall) | CV-Bench Count | CV-Bench Depth | CV-Bench Dist. | BLINK | RW-QA | MMT | MMStar-P | MMVP | MME-RW | V* | HR4K | HR8K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | | | | | | | | | | | | |
| Claude-4-Sonnet | - | - | - | - | 76.3 | 62.2 | 77.7 | 80.5 | 39.6 | 63.7 | - | 58.8 | 48.7 | - | 15.2 | 32.3 | 22.7 |
| GPT-4o | - | - | - | - | 79.2 | 65.6 | 86.7 | 81.0 | 63.0 | 69.7 | - | 65.2 | 72.0 | - | 42.9 | 50.6 | 46.7 |
| Qwen2.5-VL-7B (baseline) | - | - | - | - | 74.5 | 65.0 | 72.8 | 75.5 | 55.7 | 68.6 | 61.7 | 67.1 | 56.0 | 60.0 | 76.4 | 68.6 | 64.9 |
| CoVT (1 Visual Token) | | | | | | | | | | | | | | | | | |
| Seg only | ✓ | - | - | - | 77.9 | 66.0 | 80.8 | 80.5 | 57.4 | 71.1 | 62.1 | 68.5 | 58.7 | 62.1 | 79.1 | 71.9 | 69.0 |
| Depth only | - | ✓ | - | - | 78.7 | 65.4 | 83.2 | 78.2 | 56.4 | 71.5 | 62.7 | 69.9 | 58.7 | 62.0 | 79.1 | 71.9 | 69.4 |
| DINO only | - | - | ✓ | - | 71.3 | 64.7 | 72.3 | 66.7 | 55.8 | 71.5 | 62.5 | 67.9 | 57.3 | 61.1 | 77.5 | 71.0 | 68.6 |
| CoVT (3 Visual Tokens) | | | | | | | | | | | | | | | | | |
| Seg + Depth + DINO | ✓ | ✓ | ✓ | - | 80.0 | 66.2 | 86.8 | 82.5 | 56.0 | 71.6 | 62.1 | 69.2 | 58.7 | 63.7 | 78.0 | 72.9 | 69.4 |
| Δ (vs Baseline) | - | - | - | - | +5.5 | +1.2 | +14.0 | +7.0 | +0.3 | +3.0 | +0.4 | +2.1 | +2.7 | +3.7 | +1.6 | +4.3 | +4.5 |
| CoVT (4 Visual Tokens) | | | | | | | | | | | | | | | | | |
| Seg + Depth + DINO + Edge | ✓ | ✓ | ✓ | ✓ | 79.8 | 66.1 | 89.2 | 80.5 | 56.2 | 71.8 | 61.9 | 68.4 | 56.7 | 63.3 | 78.5 | 72.5 | 69.9 |
| Δ (vs Baseline) | - | - | - | - | +5.3 | +1.1 | +16.4 | +5.0 | +0.5 | +3.2 | +0.2 | +1.3 | +0.7 | +3.3 | +2.1 | +3.9 | +5.0 |
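With three or four visual tokens, CoVT improves on the Qwen2.5-VL-7B baseline on every benchmark in the table, with the largest gains on the CV-Bench Depth subtask (+14.0 and +16.4). The single-token rows also indicate that each visual token type helps most on the tasks that match the information it encodes: for example, the depth token yields the highest single-token score on CV-Bench Depth (83.2).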
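
The Δ rows can be reproduced from the table itself: each entry is the CoVT score minus the score in the Qwen2.5-VL-7B row on the same benchmark. The following is a minimal Python sketch of that arithmetic, assuming the Δ rows are taken against the Qwen2.5-VL-7B row (the values agree with that reading). The names `BASELINE`, `COVT_3`, `COVT_4`, `deltas`, and `format_delta_row` are illustrative and do not come from the authors' code.

```python
# Scores copied from the table, in the order of the 13 benchmark columns.
BENCHMARKS = ["CVBench", "Count", "Depth", "Dist.", "BLINK", "RW-QA", "MMT",
              "MMStar-P", "MMVP", "MME-RW", "V*", "HR4K", "HR8K"]

# Assumption: the Qwen2.5-VL-7B row is the baseline used for the delta rows.
BASELINE = [74.5, 65.0, 72.8, 75.5, 55.7, 68.6, 61.7,
            67.1, 56.0, 60.0, 76.4, 68.6, 64.9]
COVT_3 = [80.0, 66.2, 86.8, 82.5, 56.0, 71.6, 62.1,
          69.2, 58.7, 63.7, 78.0, 72.9, 69.4]   # Seg + Depth + DINO
COVT_4 = [79.8, 66.1, 89.2, 80.5, 56.2, 71.8, 61.9,
          68.4, 56.7, 63.3, 78.5, 72.5, 69.9]   # Seg + Depth + DINO + Edge


def deltas(scores, baseline=BASELINE):
    """Per-benchmark gain over the baseline, rounded to one decimal place."""
    return [round(s - b, 1) for s, b in zip(scores, baseline)]


def format_delta_row(scores):
    """Render the gains as signed strings, matching the table's notation."""
    return [f"{d:+.1f}" for d in deltas(scores)]


if __name__ == "__main__":
    for name, scores in [("3 tokens", COVT_3), ("4 tokens", COVT_4)]:
        print(name, dict(zip(BENCHMARKS, format_delta_row(scores))))
```

Rounding to one decimal place removes floating-point noise from the subtraction, so the results can be compared directly against the published Δ values.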