WaldoBench: Where’s Waldo?

A benchmark for AI visual search.

Finding Waldo requires spatial reasoning, pattern recognition, and attention to detail across densely cluttered illustrations. We hand-annotated ground-truth coordinates for Waldo in each scene, then tested whether top vision models from Anthropic, OpenAI, and Google can predict his location. Models return normalized coordinates, and a prediction counts as a hit if it falls within a small percentage-based tolerance of the true position.
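The hit criterion described above can be sketched roughly as follows. This is an illustration, not the benchmark's scoring code: the `Point` type, the `is_hit` and `hit_rate` names, the per-axis comparison, and the 5% tolerance are all assumptions, since the text does not state the exact threshold or distance measure.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Point:
    """A location in normalized image coordinates, each in [0, 1]."""
    x: float
    y: float


def is_hit(pred: Point, truth: Point, tolerance: float = 0.05) -> bool:
    """Return True if the prediction is within `tolerance` of the truth on both axes.

    ASSUMPTION: the 0.05 default and the per-axis check are placeholders;
    the actual benchmark threshold and metric are not given in the text.
    """
    return (abs(pred.x - truth.x) <= tolerance
            and abs(pred.y - truth.y) <= tolerance)


def hit_rate(pairs: Iterable[Tuple[Point, Point]], tolerance: float = 0.05) -> float:
    """Fraction of (prediction, truth) pairs that count as hits."""
    results = [is_hit(pred, truth, tolerance) for pred, truth in pairs]
    return sum(results) / len(results) if results else 0.0
```

Because coordinates are normalized, the same tolerance applies to scenes of any resolution, which may be one reason to score this way rather than in pixels.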

Each model is tested in two modes: a single-shot mode where the model must find Waldo from the full image in one pass, and an agentic mode where the model is given a crop_image tool and can zoom into up to 20 regions of the scene before giving its final answer.
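One way the crop budget in agentic mode might be enforced is sketched below. Only the tool name `crop_image` and the limit of 20 come from the description above; the `Box` and `CropSession` types, the normalized box format, and the choice to refuse (rather than raise) once the budget is spent are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_CROPS = 20  # from the text: up to 20 regions per scene


@dataclass(frozen=True)
class Box:
    """A crop region in normalized coordinates (hypothetical format)."""
    left: float
    top: float
    right: float
    bottom: float

    def is_valid(self) -> bool:
        return 0 <= self.left < self.right <= 1 and 0 <= self.top < self.bottom <= 1

    def to_pixels(self, width: int, height: int) -> Tuple[int, int, int, int]:
        """Convert to a (left, top, right, bottom) pixel box for an image of the given size."""
        return (round(self.left * width), round(self.top * height),
                round(self.right * width), round(self.bottom * height))


@dataclass
class CropSession:
    """Tracks how many crops a model has requested for one scene."""
    max_crops: int = MAX_CROPS
    used: List[Box] = field(default_factory=list)

    def crop_image(self, box: Box) -> bool:
        """Record a crop request; return False once the budget is exhausted.

        ASSUMPTION: invalid boxes raise and do not count against the budget.
        """
        if len(self.used) >= self.max_crops:
            return False
        if not box.is_valid():
            raise ValueError(f"invalid crop box: {box}")
        self.used.append(box)
        return True
```

In a real harness, a `True` result would be followed by actually cropping the scene (for example with the pixel box from `to_pixels`) and returning that region to the model.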

Benchmark results coming soon

We’re running evaluations across models and scenes. Full rankings, cost analysis, and detailed breakdowns will be published here.