Tencent improves testing creative AI models with new benchmark

by Tarun Khanna
July 10, 2025
in Artificial Intelligence

Photo Credit: https://telecom.economictimes.indiatimes.com/


Tencent has introduced a new benchmark, ArtifactsBench, that aims to fix current problems with testing creative AI models.

Ever asked an AI to create something like a simple webpage or a chart and received something that works but has a poor user experience? The buttons might be in the wrong place, the colours might clash, or the animations might feel clunky. It’s a common problem, and it highlights a big challenge in the world of AI development: how do you teach a machine to have good taste?

For a long time, we’ve been testing AI models on their ability to write code that is functionally correct. These tests could confirm the code would run, but they were completely “unaware of the visual fidelity and interactive integrity that define modern user experiences.”

This is exactly the problem ArtifactsBench was designed to solve. It’s less of a test and more of an automated art critic for AI-generated code.

Getting it right, like a human would

So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, ranging from building data visualizations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe and sandboxed environment.

To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.

Finally, it hands over all this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.

This MLLM judge doesn’t just give a vague opinion. Instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This is meant to keep the scoring fair, consistent, and thorough.
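The steps above (bundle the request, the code and the screenshots, then score against a checklist) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tencent’s implementation: the `Evidence` fields, the 0 to 10 scale, the plain average used for the overall score, and the `stub_judge` function standing in for the MLLM are all hypothetical, and only three of the ten metrics are named because those are the only ones the article mentions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Evidence:
    """Everything the judge sees for one task (field names are hypothetical)."""
    task_prompt: str
    generated_code: str
    screenshots: list[bytes]  # frames captured at successive moments


# Hypothetical checklist: the article says there are ten metrics but names
# only functionality, user experience and aesthetic quality.
CHECKLIST = ["functionality", "user_experience", "aesthetic_quality"]

# A judge takes the evidence and one checklist item and returns a score.
Judge = Callable[[Evidence, str], float]


def score_submission(
    evidence: Evidence, judge: Judge, checklist: list[str]
) -> dict[str, float]:
    """Score each checklist item, then add an unweighted average as 'overall'."""
    if not evidence.screenshots:
        raise ValueError("at least one screenshot is required")
    scores = {item: judge(evidence, item) for item in checklist}
    overall = mean(scores.values())  # assumption: simple average
    return {**scores, "overall": overall}


def stub_judge(evidence: Evidence, item: str) -> float:
    """Placeholder for the MLLM judge: returns fixed scores on an assumed 0-10 scale."""
    fixed = {"functionality": 8.0, "user_experience": 6.0, "aesthetic_quality": 7.0}
    return fixed[item]


if __name__ == "__main__":
    evidence = Evidence(
        task_prompt="Build an interactive bar chart",
        generated_code="<html>...</html>",
        screenshots=[b"frame-0", b"frame-1", b"frame-2"],
    )
    print(score_submission(evidence, stub_judge, CHECKLIST))
```

In a real setup the stub would be replaced by a call to a multimodal model that receives the screenshots alongside the checklist item, and scoring every item separately is what keeps the result from collapsing into a single vague opinion.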

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a big jump from older automated benchmarks, which only managed around 69.4% consistency.

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
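The article does not say how the ranking consistency figures above were calculated. One common way to compare two leaderboards, shown here purely as an assumption, is pairwise agreement: for every pair of models, check whether both rankings put them in the same order. The model names and ranks below are made up for illustration, and the sketch assumes no tied ranks.

```python
from itertools import combinations


def pairwise_consistency(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Fraction of model pairs that both rankings order the same way.

    Ranks are positions (1 = best). Only models present in both rankings
    are compared, and ties are assumed not to occur.
    """
    shared = sorted(set(rank_a) & set(rank_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two models ranked by both sources")
    agreeing = sum(
        (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y]) for x, y in pairs
    )
    return agreeing / len(pairs)


if __name__ == "__main__":
    # Hypothetical leaderboards: the two sources disagree only on model_b vs model_c.
    benchmark_rank = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
    human_rank = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
    print(f"{pairwise_consistency(benchmark_rank, human_rank):.1%}")
```

With four models there are six pairs, so a single swapped pair already pulls the agreement down to five out of six, which gives a feel for how demanding a figure in the mid-90s is across more than 30 models.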

Tencent evaluates the creativity of top AI models with its new benchmark

When Tencent put more than 30 of the world’s top AI models through their paces, the leaderboard was revealing. While top commercial models from Google (Gemini-2.5-Pro) and Anthropic (Claude 4.0-Sonnet) took the lead, the tests unearthed a fascinating insight.

You might assume that an AI specialized in writing code would be the best at these tasks. But the opposite was true. The research found that “the holistic capabilities of generalist models often surpass those of specialized ones.”

A general-purpose model, Qwen-2.5-Instruct, actually beat its more specialized siblings, Qwen-2.5-coder (a code-specific model) and Qwen2.5-VL (a vision-specialized model).

The researchers believe this is because generating a great visual application isn’t just about coding or visual understanding in isolation; it demands a blend of skills.

The researchers highlight “robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics” as examples of vital skills. These are the sorts of well-rounded, almost human-like abilities that the best generalist models are starting to develop.

Tencent hopes its ArtifactsBench benchmark can reliably evaluate those qualities and thus measure future progress in the ability of AI to create things that aren’t just functional but that users actually want to use.
