Posted on 2025-8-6 02:14:47
Getting it right, like a human would
So, how does Tencent's AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
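Just to make the setup concrete, a task in such a catalogue might look like the Python sketch below. The field names and example tasks are my own illustration; the article doesn't show the real schema.

[code]
import random
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    category: str   # e.g. "data-viz", "web-app", "mini-game"
    prompt: str     # the natural-language request handed to the model

CATALOGUE = [
    Task("viz-0001", "data-viz", "Build an interactive chart of monthly sales."),
    Task("game-0042", "mini-game", "Make a playable Snake game in the browser."),
    # ... roughly 1,800 tasks in the real catalogue
]

task = random.choice(CATALOGUE)  # one challenge is given to the model under test
print(task.prompt)
[/code]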
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a secure, sandboxed environment.
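Here's a minimal sketch of that build-and-run step, assuming the generated artifact is a single self-contained HTML page. This is my own illustration, not Tencent's harness; a production setup would isolate the untrusted code in a container or VM rather than a bare subprocess.

[code]
import subprocess
import tempfile
from pathlib import Path

def serve_artifact(generated_code: str, port: int = 8000):
    """Write the model's output to an isolated temp dir and serve it locally."""
    workdir = Path(tempfile.mkdtemp(prefix="artifact_"))
    (workdir / "index.html").write_text(generated_code)
    # A bare subprocess stands in for real isolation here.
    server = subprocess.Popen(
        ["python", "-m", "http.server", str(port), "--directory", str(workdir)]
    )
    return f"http://localhost:{port}/index.html", server
[/code]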
To see how the application behaves, it captures a series of screenshots over time. This lets it check for things like animations, state changes after a button click, and other dynamic user feedback.
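A screenshot timeline like that could be captured with any headless browser. The version below uses Playwright (my choice for illustration; the article doesn't say what ArtifactsBench drives), taking frames at fixed intervals so animations and post-interaction state changes show up.

[code]
from playwright.sync_api import sync_playwright

def capture_timeline(url: str, shots: int = 5, interval_ms: int = 1000) -> list:
    """Grab a sequence of screenshots so dynamic behaviour is visible."""
    paths = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for i in range(shots):
            path = f"shot_{i}.png"
            page.screenshot(path=path)          # one frame of the timeline
            paths.append(path)
            page.wait_for_timeout(interval_ms)  # let animations/state evolve
        # an interaction could be inserted here, e.g. page.click("button"),
        # followed by more screenshots to catch the resulting state change
        browser.close()
    return paths
[/code]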
Finally, it hands all this evidence – the original request, the AI's code, and the screenshots – to a Multimodal LLM (MLLM), which acts as a judge.
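That evidence bundle might be packed into a single multimodal message along these lines. The OpenAI-style content format is just a stand-in; the article doesn't say which MLLM or API ArtifactsBench actually uses.

[code]
import base64

def build_judge_message(prompt: str, code: str, screenshot_paths: list) -> dict:
    """Bundle the request, the generated code, and the screenshots for the judge."""
    images = []
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        images.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Original request:\n{prompt}"},
            {"type": "text", "text": f"Generated code:\n{code}"},
            *images,  # the screenshot timeline, in capture order
        ],
    }
[/code]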
This MLLM judge doesn't just give a vague overall opinion; instead, it uses a detailed, per-task checklist to score the result across ten distinct metrics. Scoring covers functionality, user experience, and even aesthetic quality. This makes the scoring fair, consistent, and thorough.
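Conceptually, checklist scoring could be structured like this. Only the three named metrics come from the article; the item structure and the 0-10 scale are my guesses.

[code]
from dataclasses import dataclass
from statistics import mean

@dataclass
class ChecklistItem:
    metric: str     # e.g. "functionality", "user_experience", "aesthetics"
    criterion: str  # task-specific, e.g. "the chart renders all 12 months"
    score: float    # filled in by the MLLM judge (0-10 scale assumed)

def score_task(items: list) -> dict:
    """Collapse judged checklist items into one score per metric."""
    per_metric = {}
    for item in items:
        per_metric.setdefault(item.metric, []).append(item.score)
    return {metric: mean(vals) for metric, vals in per_metric.items()}
[/code]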
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. That is a huge leap over older automated benchmarks, which managed only around 69.4% consistency.
On top of this, the framework's judgments showed over 90% agreement with professional human developers.
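For context on what a consistency figure like 94.4% could mean in practice, one common way to compare two leaderboards is pairwise ranking agreement: the share of model pairs that both rankings put in the same order. Whether ArtifactsBench computes exactly this statistic is an assumption on my part.

[code]
from itertools import combinations

def pairwise_consistency(rank_a: dict, rank_b: dict) -> float:
    """Fraction of model pairs that both rankings order the same way."""
    models = sorted(rank_a)  # assumes both dicts rank the same models
    agree = total = 0
    for m1, m2 in combinations(models, 2):
        total += 1
        if (rank_a[m1] < rank_a[m2]) == (rank_b[m1] < rank_b[m2]):
            agree += 1
    return agree / total

# pairwise_consistency({"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 3, "C": 2})
# -> 0.666..., i.e. two of the three pairs are ordered consistently
[/code]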
https://www.artificialintelligence-news.com/