A new result from the AI evaluation nonprofit METR has pushed the conversation around autonomous AI systems into new territory. According to METR’s latest reporting, Claude Opus 4.5 has achieved the longest “time horizon” the group has ever measured on its autonomy benchmark: a 50 percent time horizon of roughly four hours and forty-nine minutes, meaning the model succeeds about half the time on tasks that take human experts that long. On the surface, that sounds like a dramatic leap toward machines that can work independently for long stretches. The reality is more nuanced, but no less significant.
https://www.msn.com/en-in/technology/artificial-intelligence/five-hours-of-expert-level-autonomy-metr-s-claude-opus-4-5-s-crazy-results/ar-AA1SNP9z?ocid=BingNewsVerp
Anthropic has a 2-hour engineering take-home test. It says its new Claude Opus 4.5 model outscored every human who took it.