Global players look to create baseline to evaluate generative AI applications

The Singapore-led initiative aims to provide a standard set of benchmarks to assess generative AI products, pooling resources from big market players that include Anthropic, Google, and Microsoft.
Written by Eileen Yu, Senior Contributing Editor

Efforts are underway to provide a common set of benchmarks to assess generative artificial intelligence (AI) products and to create a "body of knowledge" on how these tools should be tested. 

The aim is to provide a standard approach to the evaluation of generative AI applications and to galvanize efforts to address the risks. This common approach is a shift away from existing "piecemeal" efforts.

Dubbed Sandbox, the initiative is led by Singapore's Infocomm Media Development Authority (IMDA) and AI Verify Foundation, and has garnered support from global market players, such as Amazon Web Services (AWS), Anthropic, Google, and Microsoft. These organizations are part of a current group of 15 participants, which also includes Deloitte, EY, and IBM, as well as Singapore-based OCBC Bank and telco Singtel.

Sandbox is guided by a new draft catalog that categorizes current benchmarks and methods used to evaluate large language models (LLMs). The catalog compiles commonly used technical testing tools, organizing these according to what they test for and their methods, and recommends a baseline set of tests to evaluate generative AI products, IMDA said. 

The goal is to establish a common language and support "broader, safe and trustworthy adoption of generative AI", it said. 

"Systematic and robust evaluation of models is a critical component of LLM governance and helps form the bedrock of trust in the use of these technologies," IMDA said. 

"Through rigorous evaluation, the capabilities of a model are revealed, which can assist in determining its intended uses and potential limitations. Evaluation [also] provides a vital roadmap for developers to make improvements."

Achieving this common language requires a standardized taxonomy and baseline set of pre-deployment safety evaluations for LLMs, it noted. The Singapore government agency hopes the draft catalog offers a starting point for global discussions, with the aim of driving consensus on safety standards for LLMs. 

Moving toward common standards also means involving other stakeholders in the ecosystem, beyond the model developers, such as application developers that build on top of the models and developers of third-party testing tools. 

Through Sandbox, IMDA wants to offer use cases that involve a generative AI model developer, an application deployer, and a third-party tester to demonstrate how the different players can work together. For instance, model developers, such as Anthropic or Google, can work with app developers, such as OCBC or Singtel, and third-party testers, such as Deloitte and EY, on generative AI use cases for the financial services or telecommunications sectors.

Regulators, such as Singapore's Personal Data Protection Commission, should also be involved, so Sandbox can provide an environment for experimentation and development where all parties in the ecosystem can be "transparent" about their needs, IMDA said. 

IMDA expects Sandbox to uncover gaps in the current state of generative AI evaluations, including in domain-specific applications, such as human resources, and culture-specific areas, which are currently underdeveloped.

"Sandbox will develop benchmarks for evaluating model performance in specific areas that are important for use cases, and for countries like Singapore because of cultural and language specificities," IMDA said. 

The Singapore agency said it is collaborating with Anthropic on a Sandbox project that uses the catalog to identify aspects for red teaming, which challenges the policies and assumptions used in AI systems by taking an adversarial approach.

IMDA will tap Anthropic's models and research tooling platform to develop red-teaming methodologies customized for Singapore's diverse linguistic and cultural landscape. For instance, AI models will be evaluated for their ability to perform within the country's multilingual context.

In July, the Singapore government launched two sandboxes running on Google Cloud's generative AI toolsets, one of which is used exclusively by government agencies to develop and test generative AI applications. The other sandbox is available to local organizations and provided at no cost for three months, for up to 100 use cases.
