
visres-Bench — Level 1 Visual Reasoning Traces

Paper (CVPR 2026): arxiv.org/abs/2512.21194  |  Benchmark site: visres-bench.github.io


the reality of visual reasoning in vlms

over the last few years, vlms like gemini, claude, and gpt-5 have gotten really good at semantic understanding — they can tell you what's in an image, describe a scene, answer questions about it. but that's not the same as actually seeing.

strip away the text context. no captions, no descriptions, no linguistic scaffolding. just raw visual structure. and even the best frontier models struggle hard.

that’s the whole point of visres-bench.


what is visres-bench?

we built a benchmark to answer one question: can vlms actually reason visually, or are they relying on text shortcuts?

19k real images across 3 levels of difficulty:

level                    description
l1 — basic perception    complete a pattern or fix an occlusion. can the model see what's missing?
l2 — single rules        raven's matrices-style tasks with real objects — color, count, orientation
l3 — multi-attribute     complex rules mixing multiple attributes at the same time

accepted at CVPR 2026.


this repo collects level 1 reasoning traces from testing Meta's Muse Park model on our benchmark.

why muse park?

muse park is one of the first large vlms to do genuine visual chain-of-thought — instead of just describing what it sees in words, it actively crops regions, zooms into patches, and compares local structure before committing to an answer. that makes it a really interesting case study: does visual cot actually help on tasks where the answer is purely perceptual with zero language involved?
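to make that loop concrete, here is a rough sketch of what a crop-zoom-compare answer loop could look like. muse park's actual interface isn't reproduced here: vlm, locate_occlusion, and compare are all hypothetical stand-ins, not the real api.

# illustrative sketch only; vlm and its methods are hypothetical
# stand-ins, not muse park's real api.
from PIL import Image

def visual_cot_answer(vlm, image_path, candidate_paths):
    scene = Image.open(image_path)
    # 1. propose a region of interest around the occlusion
    box = vlm.locate_occlusion(scene)              # hypothetical call
    context = scene.crop(box).resize((448, 448))   # zoom into the patch
    # 2. compare each candidate patch against the zoomed context
    scores = [vlm.compare(context, Image.open(p).resize((448, 448)))
              for p in candidate_paths]             # hypothetical call
    # 3. commit to the best-matching candidate
    return "ABCD"[max(range(len(scores)), key=scores.__getitem__)]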

level 1 is the right starting point because the tasks require no language at all. the model just needs to match a missing patch to its surrounding context (a toy baseline for this matching setup is sketched after the list). we test this across four sub-tasks, each designed to isolate a specific failure mode:

  • blur — patches differ in sharpness. the model has to ignore texture and focus on structure
  • brightness — lighting varies across candidates. the model has to reason about continuity, not color
  • edges — the correct patch must align edge directions at all four boundaries simultaneously
  • rotation — one patch is in the correct orientation; the others are rotated copies of it
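for a sense of what "no language" means here, this is roughly the whole task: pick the candidate whose borders continue the scene most smoothly. a toy non-learned baseline, assuming a known hole box (x0, y0, x1, y1) and made-up file names:

import numpy as np
from PIL import Image

def boundary_mismatch(scene, box, patch):
    # mean absolute difference between the patch border and the ring of
    # scene pixels just outside the hole. lower = smoother continuation.
    x0, y0, x1, y1 = box
    s = np.asarray(scene.convert("L"), dtype=np.float32)
    p = np.asarray(patch.convert("L").resize((x1 - x0, y1 - y0)),
                   dtype=np.float32)
    return np.mean([
        np.abs(s[y0 - 1, x0:x1] - p[0, :]).mean(),   # top border
        np.abs(s[y1, x0:x1] - p[-1, :]).mean(),      # bottom border
        np.abs(s[y0:y1, x0 - 1] - p[:, 0]).mean(),   # left border
        np.abs(s[y0:y1, x1] - p[:, -1]).mean(),      # right border
    ])

# hypothetical paths and hole box, for illustration only
scene = Image.open("image.png")
box = (128, 128, 192, 192)
candidates = [Image.open(f"{c}.png") for c in "ABCD"]
best = min(range(4), key=lambda i: boundary_mismatch(scene, box, candidates[i]))
print("answer:", "ABCD"[best])

raw pixel differences like this fall apart once candidates vary in sharpness or lighting, which is exactly what the blur and brightness sub-tasks probe.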

each example in this repo shows:

  1. the main image with the occluded region
  2. the four candidate patches (A, B, C, D)
  3. muse park’s full visual reasoning trace — the crops, the comparisons, the logic
  4. the final answer with justification

structure

l1_blur/          ← blur-invariant patch completion tasks
l1_brightness/    ← brightness-invariant patch completion tasks
l1_edges/         ← edge-continuity patch completion tasks
l1_rotation/      ← rotation-discriminating patch completion tasks
  image-xxxxx/
    answer.md     ← full reasoning trace + answer
    *.png         ← supporting images referenced in the trace
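a few lines of python to walk the traces. the answer.md format itself may vary, so this just counts files and lines rather than parsing answers:

from pathlib import Path

root = Path(".")  # repo root
for subtask in ["l1_blur", "l1_brightness", "l1_edges", "l1_rotation"]:
    examples = sorted((root / subtask).glob("image-*"))
    print(f"{subtask}: {len(examples)} examples")
    for ex in examples[:1]:  # peek at the first example of each sub-task
        trace = (ex / "answer.md").read_text()
        images = sorted(ex.glob("*.png"))
        print(f"  {ex.name}: {len(images)} supporting images, "
              f"{len(trace.splitlines())} trace lines")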