
WaldoBench: Where’s Waldo?
LLMs have been getting much better, and now hit high-80% scores on many popular benchmarks. But the real test is one that every 7-year-old did while waiting at the dentist: finding Waldo.
Finding Waldo requires spatial reasoning, pattern recognition, and attention to detail across densely cluttered illustrations. We hand-annotated ground truth coordinates for Waldo in each scene, then tested whether top vision models could find him. Models returned normalized coordinates, and a prediction was a hit if it fell within 5% of the true position.
We evaluated models in an agentic mode where the model was given a crop_image tool and could zoom into up to 20 regions of the scene before giving its final answer.
We recognize the dataset is smaller than we'd like at just 0 images, so this benchmark should be taken with a grain of salt and used as a fun way to compare the vision capabilities of new models.
Model ranking
Models ranked by accuracy in correctly identifying Waldo’s location across all scenes. See our full methodology for validation details.
* Some models were run through OpenRouter. The rest were run through their native provider.
Scenes
0 scenes across all difficulties. Each scene varies in crowd density, visual distractors, and Waldo placement.
Model-scene matrix
A detailed view of which scenes each model solved or failed. This helps identify models that handle specific visual patterns well, even if their overall score is lower.
Cost efficiency
We map total API cost (or token volume) against success rate. The Pareto frontier (red line) highlights the most cost-efficient models for a given performance level.
Pareto frontier
| Model | Pass Rate | Tokens |
|---|---|---|
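The frontier itself is straightforward to compute: a model is on it if no other model is both cheaper and at least as accurate. A minimal sketch (model names and numbers below are made up for illustration):

```python
def pareto_frontier(models):
    """Return the models not dominated on (cost, pass_rate): a model is
    dominated if another model is no more expensive, no less accurate,
    and strictly better on at least one of the two."""
    frontier = []
    for name, cost, rate in models:
        dominated = any(
            c <= cost and r >= rate and (c < cost or r > rate)
            for _, c, r in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# hypothetical (cost in $, pass rate) points
models = [
    ("model-a", 1.00, 0.40),
    ("model-b", 2.50, 0.70),
    ("model-c", 3.00, 0.65),  # dominated by model-b: cheaper and more accurate
    ("model-d", 5.00, 0.90),
]
print(pareto_frontier(models))  # ['model-a', 'model-b', 'model-d']
```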
Speed vs. quality
Average response time versus detection accuracy. Faster models aren’t always less accurate.
Confidence alignment
X = average confidence, Y = empirical accuracy. Points closer to the diagonal line are better calibrated.
In this benchmark, all models were wildly overconfident in their ability to find Waldo.
Methodology
Benchmark Data
We selected a wide array of scenes across many years and settings to stress visual search in dense, cluttered layouts. Waldo's visibility varies widely across scenes: in some only a tiny head is visible, while in others his full body is in plain view. Some scenes also include Wenda, who can act as a visual red herring because her outfit closely resembles Waldo's.
Each scene was manually annotated with ground truth pixel coordinates for Waldo and other characters (Wenda, Wizard Whitebeard, Odlaw, and scrolls). Coordinates were stored as normalized values (0–1) relative to image dimensions. Models were prompted with the full scene image and asked to return predicted coordinates in structured JSON via each provider’s structured output / function calling API.
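To make the annotation format concrete, here is a minimal sketch of what a scene record and the pixel-to-normalized conversion might look like (the field names are our own illustration, not the exact schema):

```python
# hypothetical annotation record for one scene
annotation = {
    "scene": "beach.jpg",
    "width": 3000,
    "height": 1900,
    "characters": {
        "waldo": {"x": 0.62, "y": 0.31},  # normalized (0-1)
        "wenda": {"x": 0.18, "y": 0.74},
    },
}

def normalize(px, py, width, height):
    """Convert pixel coordinates to normalized (0-1) coordinates
    relative to the image dimensions."""
    return px / width, py / height

print(normalize(1860, 589, 3000, 1900))
```

Storing normalized values means the same ground truth works regardless of how the image is resized before being sent to a provider.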
Crop Tool
Models in agentic mode were given a crop_image tool that accepted normalized coordinates x_min, y_min, x_max, y_max and returned the cropped image region for closer inspection. Each run allowed up to 20 crops total before the final answer.
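The core of such a tool is just mapping the normalized request onto pixel space and clamping it to the image bounds. A minimal sketch (the actual tool implementation is not published; this is our assumption of its shape):

```python
def crop_box_to_pixels(x_min, y_min, x_max, y_max, width, height):
    """Convert a normalized crop request into integer pixel coordinates,
    clamped to the image bounds."""
    clamp = lambda v: min(max(v, 0.0), 1.0)
    left = int(clamp(x_min) * width)
    top = int(clamp(y_min) * height)
    right = int(clamp(x_max) * width)
    bottom = int(clamp(y_max) * height)
    return left, top, right, bottom

# e.g. with Pillow, the returned box feeds straight into
# image.crop((left, top, right, bottom))
print(crop_box_to_pixels(0.25, 0.5, 0.75, 1.0, 2000, 1000))
```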
System Prompt + Configuration
The system prompt constrained the output format, emphasized visual precision, and directed the model to search for Waldo-specific features while avoiding look-alikes. We kept the instructions consistent across providers to ensure comparable results.
All models used temperature 0 for deterministic output. Extended thinking / reasoning was enabled where supported (Anthropic thinking budget, OpenAI reasoning effort, Google thinking config). Models ran in parallel across a thread pool, with results saved incrementally to prevent data loss.
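The parallel-with-incremental-saves pattern can be sketched like this (the `evaluate` stub stands in for the real API call, which we don't reproduce here):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import json
import os
import tempfile

def evaluate(model, scene):
    """Placeholder for one model/scene API call (hypothetical)."""
    return {"model": model, "scene": scene, "hit": True}

def run_all(models, scenes, out_path, max_workers=8):
    """Run every (model, scene) pair across a thread pool, writing
    results to disk after each completion so a crash loses nothing."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate, m, s) for m in models for s in scenes]
        for fut in as_completed(futures):
            results.append(fut.result())
            with open(out_path, "w") as f:
                json.dump(results, f)
    return results
```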
Output
We requested a single-point coordinate output, plus a confidence score. Models also provided a brief textual description of where they believed Waldo was located to aid qualitative review.
Scoring
Models were asked to return a single coordinate for Waldo’s location. A prediction was graded as a hit if it landed within 5% of the final ground-truth Waldo position.
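Under one plausible reading of the 5% criterion (Euclidean distance in normalized coordinate space; the write-up doesn't specify the exact metric), the check is a one-liner:

```python
import math

def is_hit(pred, truth, tol=0.05):
    """A prediction is a hit if its Euclidean distance to the
    ground-truth point, in normalized coordinates, is within tol."""
    return math.dist(pred, truth) <= tol

print(is_hit((0.62, 0.31), (0.60, 0.30)))  # ~0.022 away -> True
print(is_hit((0.50, 0.50), (0.60, 0.60)))  # ~0.141 away -> False
```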
Example Failures
A few representative runs, with images and captions. Each numbered box marks a crop the model requested to inspect that region of the image.

Future work
WaldoBench is an ongoing project. Here’s what we’re planning next.
Find the whole gang
Our annotations already include coordinates for Wenda, Wizard Whitebeard, Odlaw, and the scroll. Future runs will benchmark per-character detection and multi-target search.
Open source tooling
Open source the benchmark runner and tooling, while we evaluate whether and how to publish the underlying images given licensing constraints.
Bounding box mode
The schema supports bounding box output (x_min/y_min/x_max/y_max), but current runs use single-point predictions only. We plan to evaluate explicit bounding boxes scored with IoU.
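IoU (intersection over union) scoring would reduce to a small geometric check on the two boxes. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as
    (x_min, y_min, x_max, y_max) in normalized coordinates."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 1, 1), (0.5, 0, 1.5, 1)))  # half-overlapping -> 1/3
```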
Reasoning effort analysis
Some providers support configurable reasoning effort. We want to measure how accuracy scales with reasoning budget and whether the cost tradeoff is worth it.
Smarter agentic strategies
Explicitly provide systematic grid-search strategies and multi-scale pyramid analysis, and measure how efficiently models use their 20-crop budget.
Confidence calibration
Models can optionally return a confidence score. We plan to analyze whether confidence is well-calibrated — does 90% confidence actually correspond to 90% accuracy?
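A standard way to check this is reliability binning: bucket predictions by reported confidence, then compare each bucket's average confidence to its empirical hit rate. A sketch (the sample results below are invented for illustration):

```python
from collections import defaultdict

def calibration_bins(results, n_bins=10):
    """Bucket (confidence, hit) pairs by confidence and return, per bucket,
    (average confidence, empirical accuracy). Well-calibrated models have
    these two values close together in every bucket."""
    bins = defaultdict(list)
    for conf, hit in results:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    out = {}
    for idx, items in bins.items():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(h for _, h in items) / len(items)
        out[idx] = (avg_conf, accuracy)
    return out

# hypothetical runs: high confidence, mostly misses -> overconfident
results = [(0.95, False), (0.92, False), (0.90, True), (0.35, True)]
print(calibration_bins(results))
```

The "wildly overconfident" pattern mentioned above shows up as buckets where average confidence far exceeds empirical accuracy.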
Scene difficulty tiers
Auto-classify images by crowd density, number of red-herring distractors, and degree of occlusion to create stratified difficulty benchmarks.