A small but mighty benchmark for computer-using web agents
🐻 BearCubs 🐻 evaluates the capability of web agents to search, browse, and extract factual information from the live web through complex and diverse text-based and multimodal interactions. For more details, check out our paper! ✨
About the benchmark: BearCubs comprises 111 carefully crafted questions covering a wide range of topics, including but not limited to music, maps, videos, games, and virtual tours. Each question is designed to be adversarial to closed-book LLMs and simple Google searches. Answers are concise and uniquely formulated, ruling out ambiguity and paraphrased variants. Additionally, all questions can be answered without accessing content behind paywalls or login restrictions.
Data updates: We continuously validate existing questions and answers while introducing new, more challenging ones. Check the bottom of the webpage for the latest update date. If you're interested in pushing the boundaries of state-of-the-art agents, consider contributing to the BearCubs dataset! 🚀
In the table below, a CU agent is an agent with computer-use capabilities: it performs interactive browsing by processing the pixels on the screen and controlling a virtual keyboard and mouse.
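The perceive–act loop described above can be sketched in a few lines. This is a minimal illustration, not any agent's actual implementation: all names (`Click`, `TypeText`, `cu_agent_step`, the toy policy) are hypothetical, and the "policy" here is a trivial stand-in for the vision-language model a real CU agent would use.

```python
from dataclasses import dataclass

# Hypothetical action types a CU agent might emit (illustrative only).
@dataclass
class Click:
    x: int
    y: int

@dataclass
class TypeText:
    text: str

def cu_agent_step(screenshot_pixels, policy):
    """One perceive-act step: the policy maps raw screen pixels
    to a single keyboard/mouse action."""
    return policy(screenshot_pixels)

# Toy policy: click the brightest pixel on screen -- a stand-in for
# the learned model that would actually interpret the screenshot.
def brightest_pixel_policy(pixels):
    row = max(range(len(pixels)), key=lambda r: max(pixels[r]))
    col = max(range(len(pixels[row])), key=lambda c: pixels[row][c])
    return Click(x=col, y=row)

# A tiny fake grayscale "screenshot" (2 rows x 3 columns of brightness values).
screen = [[0, 10, 3], [7, 2, 9]]
action = cu_agent_step(screen, brightest_pixel_policy)
```

In practice the loop repeats: capture a new screenshot after each action, feed it back to the policy, and stop when the agent decides it has found the answer.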
| Category | Model | Overall Accuracy | Text-based | Multimodal |
|---|---|---|---|---|
| Human | Human | 84.7% | 83.6% | 85.7% |
| Non-CU agents | OpenAI Deep Research | 35.1% | 60.7% | 9.1% |
| CU agents | OpenAI Operator | 25.2% | 37.5% | 12.7% |
| CU agents | Anthropic Computer Use | 14.4% | 19.6% | 9.1% |
| CU agents | Convergence AI Proxy | 12.6% | 16.1% | 9.1% |
| Non-CU agents | Grok3 DeepSearch | 11.7% | 21.4% | 1.8% |
| LLM baselines | DeepSeek R1 zero-shot | 8.1% | 10.7% | 5.5% |
| LLM baselines | GPT-4o zero-shot | 2.7% | 5.4% | 0.0% |
| LLM baselines | DeepSeek R1 + Google Search | 1.8% | 3.6% | 0.0% |
| LLM baselines | GPT-4o + Google Search | 0.0% | 0.0% | 0.0% |
Yixiao Song · Katherine Thai · Chau Minh Pham · Yapei Chang · Mazin Nadaf · Mohit Iyyer
Website last updated March 11, 2025