BearCubs


A small but mighty benchmark for computer-using web agents

🐻 BearCubs 🐻 evaluates the capability of web agents to search, browse, and extract factual information from the live web through complex and diverse text-based and multimodal interactions. For more details, check out our paper! ✨

About the benchmark: BearCubs comprises 111 carefully crafted questions covering a wide range of topics, including but not limited to music, maps, videos, games, and virtual tours. Each question is designed to be adversarial to closed-book LLMs and simple Google searches. Answers are concise and uniquely formulated to eliminate ambiguity and paraphrasing. Additionally, all questions can be answered without accessing content behind paywalls or login restrictions.

Data updates: We continuously validate existing questions and answers while introducing new, more challenging ones. Check the bottom of the webpage for the latest update date. If you're interested in pushing the boundaries of state-of-the-art agents, consider contributing to the BearCubs dataset! 🚀

In the table below, a CU agent is an agent with computer-use capabilities: it browses interactively by processing the pixels on the screen and controlling a virtual keyboard and mouse.

Accuracy (%) on all questions, on text-based questions, and on multimodal questions:

| Category | Model | All | Text | Multimodal |
| --- | --- | --- | --- | --- |
| Human | Human | 84.7% | 83.6% | 85.7% |
| Non-CU agents | OpenAI Deep Research | 35.1% | 60.7% | 9.1% |
| CU agents | OpenAI Operator | 25.2% | 37.5% | 12.7% |
| CU agents | Anthropic Computer Use | 14.4% | 19.6% | 9.1% |
| CU agents | Convergence AI Proxy | 12.6% | 16.1% | 9.1% |
| Non-CU agents | Grok3 DeepSearch | 11.7% | 21.4% | 1.8% |
| LLM baselines | DeepSeek R1 zero-shot | 8.1% | 10.7% | 5.5% |
| LLM baselines | GPT-4o zero-shot | 2.7% | 5.4% | 0.0% |
| LLM baselines | DeepSeek R1 + Google Search | 1.8% | 3.6% | 0.0% |
| LLM baselines | GPT-4o + Google Search | 0.0% | 0.0% | 0.0% |
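For illustration, the accuracy figures above are simply the fraction of correctly answered questions. A minimal sketch, assuming exact-match scoring of short-form answers (the actual BearCubs evaluation protocol may differ, e.g. by using human judgment):

```python
def accuracy(predictions, references):
    """Percentage of questions answered correctly (case-insensitive exact match)."""
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# Hypothetical toy data, not from the benchmark:
preds = ["Paris", "42", "blue whale"]
refs = ["paris", "41", "Blue Whale"]
print(f"{accuracy(preds, refs):.1f}%")  # prints 66.7% (2 of 3 correct)
```

Because BearCubs answers are concise and uniquely formulated, a strict matching scheme like this leaves little room for ambiguity or paraphrasing.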

Example Question


Team

Yixiao Song
Katherine Thai
Chau Minh Pham
Yapei Chang
Mazin Nadaf
Mohit Iyyer

Website last updated March 11, 2025