Accuracy Improves When Large Language Models Collaborate


Our culture often values the notion of the self-reliant, rugged individual who isn’t easily swayed by the opinions of others, but in reality, we know that teamwork and building consensus among a group of individuals can be just as essential to getting a project off the ground.

Not surprisingly, this idea of group-based collaboration also makes sense with large language models (LLMs), as recent research from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) is now showing. In particular, the study focused on getting a group of these powerful AI systems to work with each other using a kind of “discuss and debate” approach, in order to arrive at the best and most factually accurate answer.

Powerful large language model AI systems, like OpenAI’s GPT-4 and Meta’s open source LLaMA 2, have been attracting a lot of attention lately with their ability to generate convincing, human-like textual responses about history, politics and mathematical problems, as well as to produce passable code, marketing copy and poetry.

However, the tendency of these AI tools to “hallucinate”, or come up with plausible but false answers, is well documented, making LLMs potentially unreliable as a source of verified information.

To tackle this problem, the MIT team claims that its collaborative approach significantly reduces the tendency of LLMs to generate inaccurate information, especially when combined with other methods such as better prompt design, verification and scratchpads that break a larger computational task into smaller, intermediate steps.

Multi-Agent Debate

The team’s process involves posing the same mathematics, reasoning or chess question to several AI agents. Each agent answers independently, then assesses and critiques the other agents’ answers to that question. The process is repeated for several rounds, with the collective feedback reincorporated into each agent’s own response in order to update it.

“This process induces models to construct answers that are consistent with both their internal critic as well as sensible in light of the responses of other agents,” explained the team.

“The resulting quorum of models can hold and maintain multiple chains of reasoning and possible answers simultaneously before proposing the final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. Also, our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to.”
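For readers who want a concrete picture of the loop, here is a minimal sketch in Python. It assumes a hypothetical ask_llm helper standing in for whatever chat-completion API is available (it is not the authors’ released code), with the number of agents and debate rounds as tunable parameters; the paper’s actual prompts and aggregation details differ.

```python
# Minimal sketch of a multi-agent debate loop, based on the description above.
# `ask_llm` is a hypothetical placeholder for any chat-completion call
# (GPT-4, LLaMA 2, etc.); it is not part of the paper's released code.

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its reply."""
    raise NotImplementedError("wire this up to the LLM API of your choice")


def multi_agent_debate(question: str, num_agents: int = 3, num_rounds: int = 2) -> list[str]:
    # Round 0: every agent answers the question independently.
    answers = [ask_llm(f"Answer the following question:\n{question}")
               for _ in range(num_agents)]

    # Debate rounds: each agent reads the other agents' answers,
    # critiques them, and produces an updated answer of its own.
    for _ in range(num_rounds):
        updated = []
        for i, own_answer in enumerate(answers):
            others = "\n\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (
                f"Question: {question}\n\n"
                f"Your previous answer:\n{own_answer}\n\n"
                f"Other agents' answers:\n{others}\n\n"
                "Critique these answers, then give your updated final answer."
            )
            updated.append(ask_llm(prompt))
        answers = updated

    # A final consensus step (e.g. a majority vote over the answers) can follow.
    return answers
```

In practice, the final round’s answers would be aggregated in some way, for example by majority vote or by asking one model to summarize the consensus.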

Diagram showing how the multi-agent debate process can help to generate more accurate biographies of a computer scientist (only the first three generated bullets are shown here).

As the team points out, this process is similar to how group discussions might unfold, with individuals hashing out all the different aspects of an issue or problem before arriving at some kind of consensus.

Even though it’s not guaranteed that all agents will eventually come to a definitive agreement, the researchers discovered that they could tweak certain parameters of this multi-agent debate method to make a timely consensus and better outcomes much more likely.

“We found that we could control the duration of debates by changing how much a language model trusts its own outputs over those generated by other models through different prompts,” noted the team. “In general, we found that prompts that encouraged models to be more ‘stubborn’ based on their own solutions led to longer debates and better final solutions. Overall, we observed that language model agents were relatively ‘agreeable’, perhaps as a result of instruction tuning or reinforcement learning based on human feedback [RLHF].”
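As a rough illustration of that knob (these prompts are hypothetical paraphrases, not the wording used in the paper), the difference between an “agreeable” and a “stubborn” agent can come down to a single instruction appended to the debate prompt:

```python
# Hypothetical prompt suffixes (not the paper's exact wording) illustrating the
# "stubbornness" knob: the more an agent is told to trust its own reasoning,
# the longer the debate tends to run before the agents converge.
AGREEABLE_SUFFIX = (
    "Consider the other agents' answers carefully and update your response "
    "whenever their reasoning seems sound."
)
STUBBORN_SUFFIX = (
    "Treat the other agents' answers only as additional evidence. Keep your "
    "own answer unless you find a clear error in your reasoning."
)
```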

The research team also suggested that one big advantage of this iterative process is that it can be seamlessly applied to so-called “black box” AI models, because it doesn’t require developers or other experts to tinker with the internal workings of poorly understood models. This would make the verification process for LLMs more consistent and easier to implement.

However, the researchers acknowledge that processing longer contexts and more complex group discussions may present additional challenges, and may require more computational resources. Nevertheless, the team believes that these difficulties might be alleviated with better-performing models in the future, and could very well be a worthy trade-off in the long run for LLMs as they evolve.

“Not only does this approach offer a pathway to elevate the performance of existing language models, but it also presents an automatic means of self-improvement,” said the paper’s lead author Yilun Du in a statement. “By utilizing the debate process as supervised data, language models can enhance their factuality and reasoning autonomously, reducing reliance on human feedback and offering a scalable approach to self-improvement. As researchers continue to refine and explore this approach, we can get closer to a future where language models not only mimic human-like language but also exhibit more systematic and reliable thinking, forging a new era of language understanding and application.”
