
My benchmark for large language models

I’ve just released a new benchmark for large language models on my GitHub. It’s a collection of nearly 100 tests I’ve extracted from my actual conversation history with various LLMs. Among them are tests that ask a model to
• convert a Python function to an equivalent-but-faster C function;
• identify the encoding format (in this case,…

This story appeared on nicholas.carlini.com.
