Large Language Models for Automated Open-domain Scientific Hypotheses Discovery, by Zonglin Yang and 5 other authors
Abstract: Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction has been conducted under a constrained setting: (1) the observation annotations in the dataset are carefully hand-picked sentences (resulting in a closed-domain setting); and (2) the ground-truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first NLP dataset for social science academic hypothesis discovery, consisting of 50 recent top social science publications, together with a raw web corpus that contains enough information to derive all the research hypotheses in the 50 papers. The final goal is to create systems that automatically generate valid, novel, and helpful scientific hypotheses given only a raw web corpus. Different from previous settings, the new dataset requires (1) using open-domain data (a raw web corpus) as observations; and (2) proposing hypotheses that may be new even to humanity. A multi-module framework is developed for the task, along with three different feedback mechanisms that empirically show performance gains over the base framework. Finally, our framework exhibits superior performance in terms of both GPT-4-based evaluation and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel ("not existing in the literature") and valid ("reflecting reality") scientific hypotheses.
Submission history
From: Zonglin Yang
[v1]
Wed, 6 Sep 2023 05:19:41 UTC (3,023 KB)
[v2]
Fri, 16 Feb 2024 14:26:28 UTC (3,030 KB)