Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://arxiv.org/abs/2108.07258 (2021).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).
Chowdhery, A. et al. PaLM: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 1–113 (2023).
Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. Adv. Neural Inf. Process. Syst. 35, 30016–30030 (2022).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Edwards, C. N., Lai, T., Ros, K., Honke, G. & Ji, H. Translation between molecules and natural language. In Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y. et al.) 375–413 (Association for Computational Linguistics, 2022).
Hocky, G. M. & White, A. D. Natural language processing models that automate programming will transform chemistry research and teaching. Digit. Discov. 1, 79–83 (2022).
White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digit. Discov. 2, 368–376 (2023).
Taylor, R. et al. Galactica: a large language model for science. Preprint at https://arxiv.org/abs/2211.09085 (2022).
Dunn, A. et al. Structured information extraction from complex scientific text with fine-tuned large language models. Preprint at https://arxiv.org/abs/2212.05238 (2022).
Choudhary, K. & Kelley, M. L. ChemNLP: a natural language-processing-based library for materials chemistry text data. J. Phys. Chem. C 127, 17545–17555 (2023).
Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon. Digit. Discov. 2, 1233–1250 (2023).
Dinh, T. et al. LIFT: language-interfaced fine-tuning for non-language machine learning tasks. Adv. Neural Inf. Process. Syst. 35, 11763–11784 (2022).
Karpov, P., Godin, G. & Tetko, I. V. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J. Cheminform. 12, 17 (2020).
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
Born, J. & Manica, M. Regression transformer enables concurrent sequence regression and generation for molecular language modelling. Nat. Mach. Intell. 5, 432–444 (2023).
Yüksel, A., Ulusoy, E., Ünlü, A. & Doğan, T. SELFormer: molecular representation learning via SELFIES language models. Mach. Learn. Sci. Technol. 4, 025035 (2023).
van Deursen, R., Ertl, P., Tetko, I. V. & Godin, G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J. Cheminform. 12, 22 (2020).
Flam-Shepherd, D., Zhu, K. & Aspuru-Guzik, A. Language models can learn complex molecular distributions. Nat. Commun. 13, 3293 (2022).
Grisoni, F. Chemical language models for de novo drug design: challenges and opportunities. Curr. Opin. Struct. Biol. 79, 102527 (2023).
Ramos, M. C., Michtavy, S. S., Porosoff, M. D. & White, A. D. Bayesian optimization of catalysts with in-context learning. Preprint at https://arxiv.org/abs/2304.05341 (2023).
Guo, T. et al. What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks. Preprint at https://arxiv.org/abs/2305.18365 (2023).
Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 328–339 (Association for Computational Linguistics, 2018); https://aclanthology.org/P18-1031
Pei, Z., Yin, J., Hawk, J. A., Alman, D. E. & Gao, M. C. Machine-learning informed prediction of high-entropy solid solution formation: beyond the Hume–Rothery rules. npj Comput. Mater. https://doi.org/10.1038/s41524-020-0308-7 (2020).
Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Comput. Mater. https://doi.org/10.1038/s41524-020-00406-3 (2020).
Goldblum, M., Finzi, M., Rowan, K. & Wilson, A. G. The no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning. OpenReview https://openreview.net/forum?id=X7nz6ljg9Y (2023).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Winter, B., Winter, C., Schilling, J. & Bardow, A. A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing. Digit. Discov. 1, 859–869 (2022).
Dai, D. et al. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. Preprint at https://arxiv.org/abs/2212.10559 (2022).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Krenn, M. et al. SELFIES and the future of molecular string representations. Patterns 3, 100588 (2022).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).
Yao, Z. et al. Inverse design of nanoporous crystalline reticular materials with deep generative models. Nat. Mach. Intell. 3, 76–86 (2021).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Kim, B., Lee, S. & Kim, J. Inverse design of porous materials using artificial neural networks. Sci. Adv. 6, eaax9324 (2020).
Lee, S., Kim, B. & Kim, J. Predicting performance limits of methane gas storage in zeolites with an artificial neural network. J. Mater. Chem. A 7, 2709–2716 (2019).
Nigam, A., Friederich, P., Krenn, M. & Aspuru-Guzik, A. Augmenting genetic algorithms with deep neural networks for exploring the chemical space. In International Conference on Learning Representations (2020).
Jablonka, K. M., Mcilwaine, F., Garcia, S., Smit, B. & Yoo, B. A reproducibility study of ‘augmenting genetic algorithms with deep neural networks for exploring the chemical space’. Preprint at https://arxiv.org/abs/2102.00700 (2021).
Chung, Y. G. et al. In silico discovery of metal-organic frameworks for precombustion CO2 capture using a genetic algorithm. Sci. Adv. 2, e1600909 (2016).
Lee, S. et al. Computational screening of trillions of metal–organic frameworks for high-performance methane storage. ACS Appl. Mater. Interfaces 13, 23647–23654 (2021).
Collins, S. P., Daff, T. D., Piotrkowski, S. S. & Woo, T. K. Materials design by evolutionary optimization of functional groups in metal–organic frameworks. Sci. Adv. https://doi.org/10.1126/sciadv.1600954 (2016).
Griffiths, R.-R. et al. Data-driven discovery of molecular photoswitches with multioutput Gaussian processes. Chem. Sci. 13, 13541–13551 (2022).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
Jablonka, K. M., Jothiappan, G. M., Wang, S., Smit, B. & Yoo, B. Bias free multiobjective active learning for materials design and discovery. Nat. Commun. https://doi.org/10.1038/s41467-021-22437-0 (2021).
Bannwarth, C., Ehlert, S. & Grimme, S. GFN2-xTB—an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. J. Chem. Theory Comput. 15, 1652–1671 (2019).
Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs: quantum mechanical properties of drug-like molecules. ETH Zurich Research Collection https://doi.org/10.3929/ethz-b-000482129 (2021).
Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs, quantum mechanical properties of drug-like molecules. Sci. Data 9, 273 (2022).
Westermayr, J., Gilkes, J., Barrett, R. & Maurer, R. J. High-throughput property-driven generative design of functional organic molecules. Nat. Comput. Sci. 3, 139–148 (2023).
Jablonka, K. M., Patiny, L. & Smit, B. Making the collective knowledge of chemistry open and machine actionable. Nat. Chem. 14, 365–376 (2022).
Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Wang, B. Mesh-Transformer-JAX: model-parallel implementation of transformer language model with JAX. GitHub https://github.com/kingoflolz/mesh-transformer-jax (2021).
Wang, B. & Komatsuzaki, A. GPT-J-6B: a 6 billion parameter autoregressive language model. GitHub https://github.com/kingoflolz/mesh-transformer-jax (2021).
Gao, L. et al. The Pile: an 800GB dataset of diverse text for language modeling. Preprint at https://arxiv.org/abs/2101.00027 (2020).
Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Adv. Neural Inf. Process. Syst. 35, 30318–30332 (2022).
Dettmers, T., Lewis, M., Shleifer, S. & Zettlemoyer, L. 8-bit optimizers via block-wise quantization. In The Tenth International Conference on Learning Representations (2022).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (2022).
Jablonka, K. M. kjappelbaum/gptchem: initial release. Zenodo https://doi.org/10.5281/zenodo.7806672 (2023).
Jablonka, K. M. chemlift. Zenodo https://doi.org/10.5281/zenodo.10233422 (2023).
Dubbeldam, D., Calero, S. & Vlugt, T. J. iRASPA: GPU-accelerated visualization software for materials scientists. Mol. Simul. 44, 653–676 (2018).
Le, T. T., Fu, W. & Moore, J. H. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36, 250–256 (2020).
Wang, A. Y.-T., Kauwe, S. K., Murdock, R. J. & Sparks, T. D. Compositionally restricted attention-based network for materials property predictions. npj Comput. Mater. 7, 77 (2021).
RDKit contributors. RDKit: open-source cheminformatics. http://www.rdkit.org (2023).
Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58, 1736–1741 (2018).
Probst, D. & Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 12, 12 (2020).
Probst, D. & Reymond, J.-L. A probabilistic molecular fingerprint for big data settings. J. Cheminform. 10, 66 (2018).
Ertl, P. & Rohde, B. The Molecule Cloud—compact visualization of large collections of molecules. J. Cheminform. 4, 12 (2012).
Wang, Y., Wang, J., Cao, Z. & Farimani, A. B. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).
De Breuck, P.-P., Evans, M. L. & Rignanese, G.-M. Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet. J. Phys. Condens. Matter 33, 404002 (2021).
Hollmann, N., Müller, S., Eggensperger, K. & Hutter, F. TabPFN: a transformer that solves small tabular classification problems in a second. Preprint at https://arxiv.org/abs/2207.01848 (2022).
Griffiths, R.-R. et al. Gauche: a library for Gaussian processes in chemistry. In ICML 2022 2nd AI for Science Workshop https://openreview.net/forum?id=i9MKI7zrWal (2022).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).
Moosavi, S. M. et al. Understanding the diversity of the metal–organic framework ecosystem. Nat. Commun. 11, 4068 (2020).
Moosavi, S. M. et al. A data-science approach to predict the heat capacity of nanoporous materials. Nat. Mater. 21, 1419–1425 (2022).
Probst, D., Schwaller, P. & Reymond, J.-L. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digit. Discov. 1, 91–97 (2022).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551 (2020).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
Mobley, D. L. & Guthrie, J. P. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J. Comput. Aided Mol. Des. 28, 711–720 (2014).
Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44, 1000–1005 (2004).
Mitchell, J. B. O. DLS-100 solubility dataset. University of St Andrews https://risweb.st-andrews.ac.uk:443/portal/en/datasets/dls100-solubility-dataset(3a3a5abc-8458-4924-8e6c-b804347605e8).html (2017).
Walters, P. Predicting aqueous solubility—it’s harder than it looks. Practical Cheminformatics https://practicalcheminformatics.blogspot.com/2018/09/predicting-aqueous-solubility-its.html (2018).
Bento, A. P. et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083–D1090 (2014).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Nagasawa, S., Al-Naamani, E. & Saeki, A. Computer-aided screening of conjugated polymers for organic solar cell: classification by random forest. J. Phys. Chem. Lett. 9, 2639–2646 (2018).
Kawazoe, Y., Yu, J.-Z., Tsai, A.-P. & Masumoto, T. (eds) Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys. Landolt-Börnstein: Numerical Data and Functional Relationships in Science and Technology, New Series (Springer, 2006).
Zhuo, Y., Tehrani, A. M. & Brgoch, J. Predicting the band gaps of inorganic solids by machine learning. J. Phys. Chem. Lett. 9, 1668–1673 (2018).
Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190 (2018).
Perera, D. et al. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 359, 429–434 (2018).