I’ve just released a new benchmark for large language models on my GitHub. It’s a collection of nearly 100 tests I’ve extracted from my actual conversation history with various LLMs. The benchmark includes tests that ask a model to
• convert a Python function to an equivalent-but-faster C function;
• identify the encoding format (in this case,…
This story appeared on nicholas.carlini.com.