Scale AI Partners with DoD’s Chief Digital and Artificial Intelligence Office to Test and Evaluate Large Language Models

Scale AI, the leading test and evaluation (T&E) partner for frontier artificial intelligence companies, is partnering with the U.S. Department of Defense’s (DoD) Chief Digital and Artificial Intelligence Office (CDAO) to create a comprehensive T&E framework for the responsible use of large language models (LLMs) within the DoD.

Through this partnership, Scale will develop benchmark tests tailored to DoD use cases, integrate them into Scale’s T&E platform, and support CDAO’s T&E strategy for using LLMs. The outcomes will provide the CDAO with a framework to deploy AI safely by measuring model performance, offering real-time feedback for warfighters, and creating specialized public sector evaluation sets to test AI models for military support applications, such as organizing the findings from after-action reports.

This work will enable the DoD to mature its T&E policies to address generative AI by measuring quantitative performance through benchmarking and assessing qualitative feedback from users. The evaluation metrics will help identify generative AI models that are ready to support military applications with accurate and relevant results using DoD terminology and knowledge bases. The rigorous T&E process aims to enhance the robustness and resilience of AI systems in classified environments, enabling the adoption of LLM technology in secure settings.

Alexandr Wang, founder and CEO of Scale AI, emphasized Scale’s commitment to protecting the integrity of future AI applications for defense and solidifying the U.S.’s global leadership in the adoption of safe, secure, and trustworthy AI. “Testing and evaluating generative AI will help the DoD understand the strengths and limitations of the technology, so it can be deployed responsibly. Scale is honored to partner with the DoD on this framework,” said Wang.

For decades, T&E has been standard in product development across industries, ensuring products meet safety requirements for market readiness, but AI safety standards have yet to be codified. Scale’s methodology, published last summer, is the industry’s first comprehensive technical methodology for LLM T&E. Its adoption by the DoD reflects Scale’s commitment to understanding the opportunities and limitations of LLMs, mitigating risks, and meeting the unique needs of the military.

Learn more about Scale’s approach to test and evaluation at https://scale.com/llm-test-evaluation

About Scale AI

Scale is fueling the Generative AI revolution. Built on a foundation of high-quality data and human insight, Scale’s proprietary Data Engine powers the world’s most advanced models. Our years of deep partnership with every major model builder enable us to provide the roadmap for any organization to apply AI. Scale is trusted by industry leaders including Meta, Microsoft, the U.S. Army, the DoD’s Defense Innovation Unit, OpenAI, Cohere, Anthropic, General Motors, Toyota Research Institute, and NVIDIA.