PinchBench launches an open-source benchmark that evaluates LLM performance on 23 real OpenClaw agent tasks.

Testing & Debugging 📅 2026/03/28
#Agent Interop #Developer #GitHub #LLM Optimization #Low Risk #Manual Trigger #Reusable #Semi-automated #Code Repository #Report #Evaluation
PinchBench dashboard comparing success rates and costs of 32+ LLMs on real OpenClaw tasks such as email triage and calendar scheduling
Finally, an open benchmark for evaluating LLMs on OpenClaw!

PinchBench is an open-source benchmark that tests LLMs on 23 real OpenClaw tasks like scheduling, coding, and email management.

Most LLM benchmarks test isolated capabilities. Can the model answer questions? Can it reason through math problems? Can it write code snippets?

Those benchmarks don't predict how a model performs when it's running an actual agent. An agent needs to choose the right tools, chain together multiple actions, handle ambiguous instructions, and recover from failures. A model can score high on MMLU and still struggle when OpenClaw asks it to schedule a meeting or triage email.

PinchBench fixes this by testing models on what OpenClaw actually does.

The benchmark includes 23 real-world tasks across scheduling, coding, email management, research, writing, data analysis, and memory operations. Examples include creating calendar events, scraping stock prices, triaging an inbox, generating blog posts, building weather scripts, and processing spreadsheets.

Each task is graded programmatically, by an LLM judge, or both. Success means the agent actually completed the task. Did it create the file? Send the email? Schedule the meeting?
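A hybrid grading step like the one described above can be sketched as follows. This is a minimal illustration, not PinchBench's real grading API: the function names, the `workspace` dict, and the string-matching "judge" are all assumptions made for the example.

```python
# Hypothetical sketch of hybrid task grading: a deterministic check on the
# agent's side effects, combined with an LLM-judge verdict on the transcript.
# All names here are illustrative, not the benchmark's actual interface.

def programmatic_check(workspace: dict) -> bool:
    """Pass only if the agent actually produced the expected artifact."""
    return workspace.get("calendar_event_created", False)

def llm_judge(transcript: str) -> bool:
    """Stand-in for an LLM judge scoring the agent transcript.
    A real judge would call a model; here we just look for a marker."""
    return "event scheduled" in transcript.lower()

def grade(workspace: dict, transcript: str) -> bool:
    # A task passes only when both graders agree it was completed.
    return programmatic_check(workspace) and llm_judge(transcript)

print(grade({"calendar_event_created": True}, "Event scheduled for 3pm."))  # True
```

Requiring both signals to agree mirrors the post's point: answering plausibly isn't enough, the artifact has to exist.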

Results show up on a public leaderboard. It tracks success rate, speed, and cost across 32+ models. You can see which models finish tasks fastest, which ones cost least per successful task, and which ones balance performance with price.
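The leaderboard metrics above reduce to simple arithmetic over per-run records. A rough sketch, assuming hypothetical field names for the run data (the real leaderboard's schema may differ):

```python
# Illustrative computation of two leaderboard metrics: success rate and
# cost per successful task. The record fields are assumptions for this sketch.

runs = [
    {"model": "model-a", "success": True,  "cost_usd": 0.12},
    {"model": "model-a", "success": False, "cost_usd": 0.30},
    {"model": "model-a", "success": True,  "cost_usd": 0.18},
]

total_cost = sum(r["cost_usd"] for r in runs)   # 0.60
successes = sum(r["success"] for r in runs)     # 2

success_rate = successes / len(runs)            # 2/3
cost_per_success = total_cost / successes       # 0.30

print(f"{success_rate:.1%}, ${cost_per_success:.2f} per successful task")
```

Note that cost per *successful* task charges failed attempts against the wins, which is why a cheap-but-flaky model can still rank poorly on price.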

You can run the benchmarks yourself. Clone the repo, point it at your model, and the results upload automatically.

The benchmark is fully open source. All tasks and grading criteria are public on GitHub.

Built by the team at Kilo (the folks behind KiloClaw).

Link to PinchBench in the comments!