
visres-Bench — Level 1 Visual Reasoning Traces

Paper (CVPR 2026): arxiv.org/abs/2512.21194  |  Benchmark site: visres-bench.github.io


the reality of visual reasoning in vlms

over the last few years, vlms like gemini, claude, and gpt-5 have gotten really good at semantic understanding — they can tell you what's in an image, describe a scene, answer questions about it. but that's not the same as actually seeing.

strip away the text context. no captions, no descriptions, no linguistic scaffolding. just raw visual structure. and even the best frontier models struggle hard.

that’s the whole point of visres-bench.


what is visres-bench?

we built a benchmark to answer one question: can vlms actually reason visually, or are they relying on text shortcuts?

19k real images across 3 levels of difficulty:

level                    description
l1 — basic perception    complete a pattern or fix an occlusion. can the model see what's missing?
l2 — single rules        raven's matrices-style tasks with real objects — color, count, orientation
l3 — multi-attribute     complex rules mixing multiple attributes at the same time

accepted at CVPR 2026.


this repo collects level 1 reasoning traces from testing Meta's Muse Park model on our benchmark.

why muse park?

muse park is one of the first large vlms to do genuine visual chain-of-thought — instead of just describing what it sees in words, it actively crops regions, zooms into patches, and compares local structure before committing to an answer. that makes it a really interesting case study: does visual cot actually help on tasks where the answer is purely perceptual with zero language involved?
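to make that loop concrete, here is a rough sketch of what a crop-zoom-compare answer loop could look like. muse park's actual interface isn't reproduced here: vlm, locate_occlusion, and compare are all hypothetical stand-ins, not the real api.

# illustrative sketch only; vlm and its methods are hypothetical
# stand-ins, not muse park's real api.
from PIL import Image

def visual_cot_answer(vlm, image_path, candidate_paths):
    scene = Image.open(image_path)
    # 1. propose a region of interest around the occlusion
    box = vlm.locate_occlusion(scene)              # hypothetical call
    context = scene.crop(box).resize((448, 448))   # zoom into the patch
    # 2. compare each candidate patch against the zoomed context
    scores = [vlm.compare(context, Image.open(p).resize((448, 448)))
              for p in candidate_paths]             # hypothetical call
    # 3. commit to the best-matching candidate
    return "ABCD"[max(range(len(scores)), key=scores.__getitem__)]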

level 1 is the right starting point because the tasks require no language at all. the model just needs to match a missing patch to its surrounding context (a toy baseline for this matching setup is sketched after the list). we test this across four sub-tasks, each designed to isolate a specific failure mode:

  • blur — patches differ in sharpness. the model has to ignore texture and focus on structure
  • brightness — lighting varies across candidates. the model has to reason about continuity, not color
  • edges — the correct patch must align edge directions at all four boundaries simultaneously
  • rotation — one patch is in the correct orientation; the others are rotated copies of it
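for a sense of what "no language" means here, this is roughly the whole task: pick the candidate whose borders continue the scene most smoothly. a toy non-learned baseline, assuming a known hole box (x0, y0, x1, y1) and made-up file names:

import numpy as np
from PIL import Image

def boundary_mismatch(scene, box, patch):
    # mean absolute difference between the patch border and the ring of
    # scene pixels just outside the hole. lower = smoother continuation.
    x0, y0, x1, y1 = box
    s = np.asarray(scene.convert("L"), dtype=np.float32)
    p = np.asarray(patch.convert("L").resize((x1 - x0, y1 - y0)),
                   dtype=np.float32)
    return np.mean([
        np.abs(s[y0 - 1, x0:x1] - p[0, :]).mean(),   # top border
        np.abs(s[y1, x0:x1] - p[-1, :]).mean(),      # bottom border
        np.abs(s[y0:y1, x0 - 1] - p[:, 0]).mean(),   # left border
        np.abs(s[y0:y1, x1] - p[:, -1]).mean(),      # right border
    ])

# hypothetical paths and hole box, for illustration only
scene = Image.open("image.png")
box = (128, 128, 192, 192)
candidates = [Image.open(f"{c}.png") for c in "ABCD"]
best = min(range(4), key=lambda i: boundary_mismatch(scene, box, candidates[i]))
print("answer:", "ABCD"[best])

raw pixel differences like this fall apart once candidates vary in sharpness or lighting, which is exactly what the blur and brightness sub-tasks probe.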

each example in this repo shows:

  1. the main image with the occluded region
  2. the four candidate patches (A, B, C, D)
  3. muse park’s full visual reasoning trace — the crops, the comparisons, the logic
  4. the final answer with justification

structure

l1_blur/          ← blur-invariant patch completion tasks
l1_brightness/    ← brightness-invariant patch completion tasks
l1_edges/         ← edge-continuity patch completion tasks
l1_rotation/      ← rotation-discriminating patch completion tasks
  image-xxxxx/
    answer.md     ← full reasoning trace + answer
    *.png         ← supporting images referenced in the trace
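a few lines of python to walk the traces. the answer.md format itself may vary, so this just counts files and lines rather than parsing answers:

from pathlib import Path

root = Path(".")  # repo root
for subtask in ["l1_blur", "l1_brightness", "l1_edges", "l1_rotation"]:
    examples = sorted((root / subtask).glob("image-*"))
    print(f"{subtask}: {len(examples)} examples")
    for ex in examples[:1]:  # peek at the first example of each sub-task
        trace = (ex / "answer.md").read_text()
        images = sorted(ex.glob("*.png"))
        print(f"  {ex.name}: {len(images)} supporting images, "
              f"{len(trace.splitlines())} trace lines")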