Model Scoring and Benchmarking

10d on MSN

Yann LeCun: Meta ‘fudged a little bit’ when benchmark-testing Llama 4 model

Yann LeCun, Meta’s outgoing chief AI scientist, says his employer tested its latest Llama model in a way that may have made ...

Artificial Analysis overhauls its AI Intelligence Index, replacing popular benchmarks with 'real-world' tests

Artificial Analysis overhauls its AI Intelligence Index, replacing saturated benchmarks with real-world tests measuring ...

Business Wire

iAsk AI Outperforms OpenAI’s o1 Model in Comprehensive Generative AI Benchmark Test

CHICAGO--(BUSINESS WIRE)--iAsk, a Generative AI-powered answer engine designed for Gen Z, today announced that iAsk Pro, its most advanced model, has surpassed both human experts and the OpenAI o1 ...

Tech Times

OpenAI o3 Model: Lower Benchmark Scores Raise Questions About Claims, Transparency Over AI

OpenAI has long been touting the capabilities of its artificial intelligence (AI) developments, especially its o-series models, which are capable of reasoning and more advanced capabilities. The ...

Hosted on MSN

Popular AI model performance benchmark may be flawed, Meta researchers warn

'We've identified multiple loopholes with SWE-bench Verified,' the manager at Meta Platforms' AI research lab FAIR says. A popular benchmark for measuring the performance of artificial intelligence ...

TechCrunch

Meta’s vanilla Maverick AI model ranks below rivals on a popular chat benchmark

Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident ...

CCN on MSN

OpenAI Accused of Manipulating Benchmark Results as Chinese Models Close AI Performance Gap

It was recently revealed that OpenAI secretly funded and accessed data related to the FrontierMath AI benchmark. The ...
