Introduction
In Natural Language Processing (NLP), the development of Large Language Models (LLMs) has proven to be a transformative endeavor. These models, equipped with massive parameter counts and trained on extensive datasets, have demonstrated unprecedented proficiency across many NLP tasks. However, the exorbitant cost of training such models from scratch has prompted researchers to explore alternative strategies. One pioneering strategy for enhancing LLM capabilities is knowledge fusion, a concept explored in depth in the research paper “Knowledge Fusion of Large Language Models” by Wan, Huang, Cai, Quan, and others.
This approach tackles the redundancy among the functionalities of newly developed LLMs. The paper walks through the process of merging the knowledge of multiple LLMs, presenting a promising avenue to refine and amplify the performance of these language models.
The fundamental idea is to combine the strengths of existing pre-trained LLMs into a single, more powerful model that transcends the limitations and surpasses the capabilities of each individual source model.
Understanding the Knowledge Fusion of LLMs
The paper begins by highlighting the challenges and costs of training LLMs from scratch. The authors propose knowledge fusion as an efficient and cost-effective alternative. Rather than merging weights directly, the approach focuses on externalizing the collective knowledge of source LLMs and transferring it to a target model. The research introduces FUSELLM, a method that leverages the generative distributions of source LLMs, aiming to enhance the target model’s capabilities beyond any individual source LLM.
The primary objective of LLM fusion is to externalize the knowledge embedded within multiple source LLMs and integrate their capabilities into a target LLM. The paper stimulates the LLMs to manifest this knowledge by having them predict the next token in a given text. The probabilistic distributions generated by the different source LLMs for the same text are then fused into a single representation, creating a unified probabilistic understanding of the text, which the target model is trained to match alongside its standard language-modeling objective.
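To make this concrete, here is a minimal PyTorch-style sketch of what such a combined objective could look like, assuming the target model’s logits, the fused source distribution, and the gold labels are already available as tensors. The function name fusellm_style_loss and the weighting parameter lam are illustrative and not taken from the paper’s released code.

```python
import torch
import torch.nn.functional as F

def fusellm_style_loss(target_logits, fused_probs, labels, lam=0.9):
    """Illustrative combined objective: a causal-LM loss on the gold tokens
    plus a fusion loss that pulls the target model's distribution toward the
    fused distribution produced by the source LLMs.

    target_logits: (batch, seq_len, vocab) logits from the target model
    fused_probs:   (batch, seq_len, vocab) fused probabilities from source LLMs
    labels:        (batch, seq_len) gold next-token ids
    lam:           weight balancing the two terms (hypothetical value)
    """
    # Standard next-token prediction (causal LM) loss.
    clm_loss = F.cross_entropy(
        target_logits.view(-1, target_logits.size(-1)), labels.view(-1)
    )
    # Fusion loss: cross-entropy between the fused distribution and the
    # target model's predicted distribution.
    log_q = F.log_softmax(target_logits, dim=-1)
    fusion_loss = -(fused_probs * log_q).sum(dim=-1).mean()
    return lam * clm_loss + (1.0 - lam) * fusion_loss
```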
Implementation Details: Token Alignment and Fusion Strategies
The paper introduces two crucial implementation details to ensure effective knowledge fusion: token alignment and fusion strategies.
Token alignment handles the fact that the source LLMs use different tokenizers: a Minimum Edit Distance (MinED) strategy improves the success rate of mapping tokens from one vocabulary to another.
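As a rough illustration of the idea rather than the paper’s actual implementation, the sketch below maps a token from a source tokenizer to the target-vocabulary candidate with the smallest Levenshtein edit distance; the helper names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (ca != cb),  # substitution (free if chars match)
            )
    return dp[len(b)]

def align_token_min_ed(source_token: str, target_candidates: list[str]) -> str:
    """Pick the target-vocabulary candidate closest to the source token;
    exact matches win automatically with distance 0."""
    return min(target_candidates, key=lambda t: edit_distance(source_token, t))
```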
Fusion strategies, namely MinCE and AvgCE, evaluate the quality of the different source LLMs on the text at hand and assign varying levels of importance to their distribution matrices based on their cross-entropy scores.
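The snippet below sketches how these two strategies might be realized, assuming the source distributions have already been aligned to a shared vocabulary. The softmax weighting used for the averaging branch is one plausible choice and may differ from the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def fuse_distributions(source_probs, labels, strategy="min_ce"):
    """Combine per-token probability distributions from several source LLMs.

    source_probs: list of tensors, each (seq_len, vocab), aligned to a
                  common vocabulary (see the token-alignment step above).
    labels:       (seq_len,) gold next-token ids used to score each source.
    strategy:     "min_ce" keeps the distribution of the source with the
                  lowest cross-entropy; "avg_ce" takes a weighted average,
                  giving lower-cross-entropy sources higher weight.
    """
    # Cross-entropy of each source model's predictions against the gold tokens.
    ces = torch.stack([
        F.nll_loss(torch.log(p + 1e-12), labels) for p in source_probs
    ])
    if strategy == "min_ce":
        return source_probs[int(torch.argmin(ces))]
    # avg_ce: weight sources via a softmax over negated cross-entropies.
    weights = torch.softmax(-ces, dim=0)
    stacked = torch.stack(source_probs)        # (num_sources, seq_len, vocab)
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)
```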
Experiments and Evaluation
The research evaluates a challenging LLM-fusion scenario in which the source models share minimal commonalities. Three representative open-source models, Llama-2, OpenLLaMA, and MPT, are selected as source LLMs, with another Llama-2 serving as the target LLM. The experiments span benchmarks assessing reasoning, commonsense, and code-generation capabilities.
Performance Across Different Benchmarks
The comprehensive evaluation of FUSELLM’s performance across various benchmarks provides valuable insights into its efficacy. Table 1 showcases the overall results of FUSELLM in comparison to baseline methods on the Big-Bench Hard (BBH). Notably, FUSELLM demonstrates an average relative performance gain of 5.16% over the original Llama-2 across all 27 tasks. The specific tasks, such as Hyperbaton, show substantial enhancements, underscoring FUSELLM’s ability to leverage collective knowledge for improved performance.
Moving on to the Common Sense (CS) benchmark in Table 2, FUSELLM consistently outperforms baselines across all tasks, achieving a relative performance improvement of 1.25% over Llama-2. This trend holds true even in challenging tasks like ARC-challenge and OpenBookQA, where FUSELLM exhibits significant improvements, highlighting its effectiveness in addressing intricate problems.
In the context of code generation, Table 3 reports the zero-shot performance of FUSELLM on the MultiPL-E (ME) benchmark. FUSELLM outperforms Llama-2 in 9 out of 10 tasks, with a notable lift in the pass@1 score for specific programming languages such as R. Although it still trails OpenLLaMA and MPT on this benchmark, FUSELLM achieves an average relative performance gain of 6.36%, well above the 1.37% improvement observed for Llama-2 CLM.
The Fused Probabilistic Distributions: Accelerating Optimization
A crucial aspect of FUSELLM’s success lies in its use of fused probabilistic distributions from multiple LLMs. Figure 2 compares the few-shot Chain-of-Thought (CoT) performance of Llama-2 CLM and FUSELLM with varying amounts of training data on BBH. FUSELLM improves exact-match (EM) accuracy by 2.5% and reaches the best accuracy of Llama-2 CLM while consuming only 0.52 billion tokens, a 3.9× reduction in token requirements. This indicates that the probabilistic distributions derived from the source LLMs contain knowledge that is easier to learn than the original text sequences, thereby accelerating optimization.
Analysis of the Implementation Process
Delving into the implementation details of FUSELLM reveals critical considerations for its success. The number of source LLMs, token alignment criteria, and the choice of fusion function play pivotal roles in shaping FUSELLM’s performance.
- Number of Source LLMs: Table 4 reports FUSELLM’s performance with varying numbers of source models. Performance clearly improves as the number of models increases from one to three, with consistent gains observed on BBH.
- Criteria for Token Alignment: Proper token alignment is crucial during the fusion of LLMs. The proposed MinED method consistently outperforms the EM method, showcasing the effectiveness of MinED in aligning tokens from multiple models.
- Fusion Function: The choice of the fusion function is critical, and FUSELLM with MinCE consistently outperforms AvgCE across all benchmarks. This emphasizes the importance of the fusion function in preserving the distinct advantages of individual LLMs.
FUSELLM vs. Knowledge Distillation and Ensemble/Merging
Comparative analyses with traditional techniques like knowledge distillation and ensemble/merging shed light on FUSELLM’s unique strengths.
- FUSELLM vs. Knowledge Distillation: FUSELLM outperforms knowledge distillation, especially in BBH, where the improvement achieved by FUSELLM (5.16%) surpasses the modest gain of knowledge distillation (2.97%). This highlights FUSELLM’s ability to harness collective knowledge from multiple LLMs more effectively.
- FUSELLM vs. Ensemble/Merging: In scenarios where multiple LLMs originated from the same base model but were trained on distinct corpora, FUSELLM consistently achieves the lowest average perplexity across three domains compared to ensemble and weight merging methods. This reinforces FUSELLM’s potential to leverage collective knowledge more effectively than traditional fusion methods.
The code, model weights, and data are publicly available here: GitHub FUSELLM.
Conclusion: Unveiling Future Possibilities
The paper concludes with compelling results, showcasing the effectiveness of FUSELLM over individual source LLMs and established baselines. The study opens a promising avenue for future exploration in LLM fusion. The findings emphasize the potential of combining the diverse capabilities and strengths of structurally different LLMs, pointing to a cost-effective and powerful approach to developing large language models.
The knowledge fusion of large language models is an innovative solution in a world where the demand for advanced natural language processing capabilities continues to rise. This research paves the way for future endeavors in creating unified models that harness the collective intelligence of diverse LLMs, pushing the boundaries of what is achievable in the realm of natural language understanding and generation.
I’m eager to hear your thoughts on the knowledge fusion of large language models. Feel free to share other noteworthy and informative papers you have come across in the comments section.
Also read: A Comprehensive Guide to Fine-Tuning Large Language Models