
BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics



Abstract

Large Language Models (LLMs) have shown great promise in their knowledge integration and problem-solving capabilities, but their ability to assist in bioinformatics research has not been systematically evaluated. To bridge this gap, we present BioLLMBench, a novel benchmarking framework coupled with a scoring metric scheme for comprehensively evaluating LLMs in solving bioinformatics tasks. Through BioLLMBench, we conducted a thorough evaluation of 2,160 experimental runs of the three most widely used models, GPT-4, Bard and LLaMA, focusing on 36 distinct tasks within the field of bioinformatics. The tasks come from six key areas of emphasis within bioinformatics that directly relate to the daily challenges and tasks faced by individuals within the field. These areas are domain expertise, mathematical problem-solving, coding proficiency, data visualization, summarizing research papers, and developing machine learning models. The tasks also span varying levels of complexity, ranging from fundamental concepts to expert-level challenges. Each key area was evaluated using seven specifically designed task metrics, which were then used to conduct an overall evaluation of the LLMs' responses. To enhance our understanding of model responses under varying conditions, we implemented a Contextual Response Variability Analysis. Our results reveal a diverse spectrum of model performance, with GPT-4 leading in all tasks except mathematical problem solving. GPT-4 was able to achieve an overall proficiency score of 91.3% in domain knowledge tasks, while Bard excelled in mathematical problem-solving with a 97.5% success rate. While GPT-4 outperformed in machine learning model development tasks with an average accuracy of 65.32%, both Bard and LLaMA were unable to generate executable end-to-end code. All models faced considerable challenges in research paper summarization, with none of them exceeding a 40% score in our evaluation using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, highlighting a significant area for future improvement. We observed an increase in model performance variance when using a new chat window compared to using the same chat, although the average scores between the two contextual environments remained similar. Lastly, we discuss various limitations of these models and acknowledge the risks associated with their potential misuse.
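
The summarization evaluation mentioned above relies on ROUGE, which scores a generated summary by its n-gram and longest-common-subsequence overlap with a human-written reference. The sketch below is illustrative only and is not the authors' exact pipeline: it assumes the widely used rouge-score Python package, and the reference_summary and model_summary strings are placeholder text, not data from the paper.

    # Minimal sketch of a ROUGE-based summarization score (illustrative placeholders).
    # Requires: pip install rouge-score
    from rouge_score import rouge_scorer

    reference_summary = "The study benchmarks three large language models on bioinformatics tasks."
    model_summary = "Three LLMs are evaluated on a set of bioinformatics benchmarks."

    # ROUGE-1 measures unigram overlap; ROUGE-L measures longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference_summary, model_summary)

    for name, score in scores.items():
        # Each entry reports precision, recall, and F-measure in [0, 1].
        print(f"{name}: F1 = {score.fmeasure:.3f}")

Under this kind of metric, a summary scoring below 0.40 F-measure shares less than 40% of its weighted overlapping content with the reference, which is the threshold none of the evaluated models exceeded.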

Competing Interest Statement

The authors have declared no competing interest.


