visres-Bench — Level 1 Visual Reasoning Traces
Paper (CVPR 2026): arxiv.org/abs/2512.21194 | Benchmark site: visres-bench.github.io
the reality of visual reasoning in vlms
over the last few years, vlms like gemini, claude, and gpt-5 have gotten really good at semantic understanding — they can tell you what’s in an image, describe a scene, answer questions about it. but that’s not the same as actually seeing.
strip away the text context. no captions, no descriptions, no linguistic scaffolding. just raw visual structure. and even the best frontier models struggle hard.
that’s the whole point of visres-bench.
what is visres-bench?
we built a benchmark to answer one question: can vlms actually reason visually, or are they relying on text shortcuts?
19k real images across 3 levels of difficulty:
| level | description |
|---|---|
| l1 — basic perception | complete a pattern or fix an occlusion. can the model see what’s missing? |
| l2 — single rules | raven’s matrices-style tasks with real objects — color, count, orientation |
| l3 — multi-attribute | complex rules mixing multiple attributes at the same time |
accepted at CVPR 2026.
this repo contains the level 1 reasoning traces from testing Meta’s Muse Park model on our benchmark.
why muse park?
muse park is one of the first large vlms to do genuine visual chain-of-thought — instead of just describing what it sees in words, it actively crops regions, zooms into patches, and compares local structure before committing to an answer. that makes it a really interesting case study: does visual cot actually help on tasks where the answer is purely perceptual with zero language involved?
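to make "crops, zooms, compares" concrete, here is roughly what that loop looks like in code. to be clear: muse park's internals are not public, so this is a purely hypothetical sketch; the nearest-neighbor `zoom=2` upsampling and the boundary-continuity score are inventions for illustration only.

```python
import numpy as np

def crop_zoom_compare(image, box, candidates, zoom=2):
    """Hypothetical crop-zoom-compare loop (NOT muse park's actual
    procedure, which is not public). Crops the context just above
    the occlusion, nearest-neighbor-upsamples each candidate patch
    to "zoom in", and scores boundary continuity."""
    top, left, _, w = box
    trace = []
    best_label, best_score = None, float("-inf")
    # crop step: the pixel row directly above the hole
    context_row = image[top - 1, left:left + w].astype(float)
    for label, patch in candidates.items():
        # zoom step: nearest-neighbor upsample to inspect fine structure
        zoomed = np.kron(patch, np.ones((zoom, zoom)))
        # compare step: resample the patch's top row back to the
        # original scale and measure how smoothly it continues the context
        top_row = zoomed[0, ::zoom].astype(float)
        score = -np.abs(context_row - top_row).mean()
        trace.append(f"candidate {label}: boundary score {score:.2f}")
        if score > best_score:
            best_label, best_score = label, score
    return best_label, trace
```

only the top boundary is checked here to keep the sketch short; a real system would compare all four, and much more than raw intensities.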
level 1 is the right starting point because the tasks require no language at all. the model just needs to match a missing patch to its surrounding context. we test this across four sub-tasks, each designed to isolate a specific failure mode:
- blur — patches differ in sharpness. the model has to ignore texture and focus on structure
- brightness — lighting varies across candidates. the model has to reason about continuity, not color
- edges — the correct patch must align edge directions at all four boundaries simultaneously
- rotation — one patch is correct; the other candidates are rotated copies of it
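to make the edges sub-task concrete, here is what "aligning all four boundaries simultaneously" means as a score. a minimal sketch assuming grayscale numpy arrays, using raw intensity continuity as a crude stand-in for true edge-direction matching — this is not our evaluation code:

```python
import numpy as np

def edge_continuity_score(context, box, patch):
    """Score how well `patch` continues `context` across all four
    boundaries of the occluded region at (top, left); higher is
    better. Intensity continuity is a crude proxy for edge alignment."""
    top, left = box
    h, w = patch.shape
    score = 0.0
    # Compare the pixel rows/columns just outside the hole with the
    # adjacent rows/columns of the candidate: a good patch changes
    # smoothly across every boundary at once.
    score -= np.abs(context[top - 1, left:left + w] - patch[0, :]).mean()   # top
    score -= np.abs(context[top + h, left:left + w] - patch[-1, :]).mean()  # bottom
    score -= np.abs(context[top:top + h, left - 1] - patch[:, 0]).mean()    # left
    score -= np.abs(context[top:top + h, left + w] - patch[:, -1]).mean()   # right
    return score
```

on a smooth gradient image, the true patch scores strictly higher than any rotated copy of itself, which is exactly the discrimination the rotation sub-task demands as well.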
each example in this repo shows:
- the main image with the occluded region
- the four candidate patches (A, B, C, D)
- muse park’s full visual reasoning trace — the crops, the comparisons, the logic
- the final answer with justification
structure
```
l1_blur/           ← blur-invariant patch completion tasks
l1_brightness/     ← brightness-invariant patch completion tasks
l1_edges/          ← edge-continuity patch completion tasks
l1_rotation/       ← rotation-discriminating patch completion tasks
  image-xxxxx/
    answer.md      ← full reasoning trace + answer
    *.png          ← supporting images referenced in the trace
```
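to iterate over the traces programmatically, something like this works against the layout above (a sketch, not an official loader; python 3.9+ for `str.removeprefix`):

```python
from pathlib import Path

def collect_traces(root):
    """Walk the repo layout and return one record per example:
    sub-task name, image id, path to answer.md, and supporting PNGs."""
    records = []
    for answer in sorted(Path(root).glob("l1_*/image-*/answer.md")):
        example_dir = answer.parent
        records.append({
            "sub_task": example_dir.parent.name.removeprefix("l1_"),
            "image_id": example_dir.name,
            "answer": answer,
            "images": sorted(example_dir.glob("*.png")),
        })
    return records
```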