Uncategorized

Large language models improve annotation of prokaryotic viral proteins



  • Roux, S. et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537, 689–693 (2016).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Paez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430 (2016).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Gregory, A. C. et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell 177, 1109–1123.e14 (2019).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • ter Horst, A. M. et al. Minnesota peat viromes reveal terrestrial and aquatic niche partitioning for local and global viral populations. Microbiome 9, 233 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740.e8 (2020).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e9 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: mining viral signal from microbial genomic data. PeerJ 3, e985 (2015).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69 (2017).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Ren, J. et al. Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 90 (2020).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Tisza, M. J., Belford, A. K., Domínguez-Huerta, G., Bolduc, B. & Buck, C. B. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 7, veaa100 (2021).

    Article 
    PubMed 

    Google Scholar
     

  • Glickman, C., Hendrix, J. & Strong, M. Simulation study and comparative evaluation of viral contiguous sequence identification tools. BMC Bioinformatics 22, 329 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Camargo, A.P., et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01953-y (2023).

  • Meier-Kolthoff, J. P. & Göker, M. VICTOR: genome-based phylogeny and classification of prokaryotic viruses. Bioinformatics 33, 3396–3404 (2017).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).

    Article 

    Google Scholar
     

  • Moraru, C. Virclust-a tool for hierarchical clustering, core gene detection and annotation of (prokaryotic) viruses. Viruses 15, 1007 (2023).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Pons, J. C. et al. VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics 37, 1805–1813 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Terzian, P. et al. PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR Genom. Bioinform. 3, lqab067 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Zayed, A. A. et al. efam: an expanded, metaproteome-supported HMM profile database of viral protein families. Bioinformatics 37, 4202–4208 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Abdelkareem, A. O., Khalil, M. I., Elaraby, M., Abbas, H. & Elbehery, A. H. A. VirNet: deep attention model for viral reads identification. In 2018 13th Int. Conf. Computer Engineering and Systems (ICCES) 623–626 (IEEE, 2018).

  • Tynecki, P. et al. PhageAI—bacteriophage life cycle recognition with machine learning and natural language processing. Preprint at bioRxiv https://www.biorxiv.org/content/early/2020/07/12/2020.07.11.198606 (2020).

  • Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287 (2015).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20, 723 (2019).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Elnaggar, A. et al. ProtTrans: towards cracking the language of lifes code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

    Article 

    Google Scholar
     

  • Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110 (2022).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Bepler, T. & Berger, B. Learning the protein language: evolution, structure, and function. Cell Syst. 12, 654–669.e3 (2021).

    PubMed 
    PubMed Central 

    Google Scholar
     

  • Dohan, D., Gane, A., Bileschi, M. L., Belanger, D. & Colwell, L. Improving protein function annotation via unsupervised pre-training: robustness, efficiency, and insights. In Proc. 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21 2782–2791 (Association for Computing Machinery, 2021); https://doi.org/10.1145/3447548.3467163

  • Gane, A. et al. ProtNLM: model-based natural language protein annotation. Preprint at https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf (2022).

  • Nasir, A. & Caetano-Anollés, G. A phylogenomic data-driven exploration of viral origins and evolution. Sci. Adv. 1, e1500527 (2015).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Balaji, S. & Srinivasan, N. Comparison of sequence-based and structure-based phylogenetic trees of homologous proteins: inferences on protein evolution. J. Biosci. 32, 83–96 (2007).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Meng, C., Zhang, J., Ye, X., Guo, F. & Zou, Q. Review and comparative analysis of machine learning-based phage virion protein identification methods. Biochim. Biophys. Acta Proteins Proteom. 1868, 140406 (2020).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Fang, Z., Feng, T., Zhou, H. & Chen, M. DeePVP: identification and classification of phage virion proteins using deep learning. GigaScience 11, giac076 (2022).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Cantu, V. A. et al. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLoS Comput. Biol. 16, e1007845 (2020).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Mizuno, C. M., Ghai, R., Saghaï, A., López-García, P. & Rodriguez-Valera, F. Genomes of abundant and widespread viruses from the deep ocean. mBio 7, e00805–16 (2016).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Hackl, T. et al. Novel integrative elements and genomic plasticity in ocean ecosystems. Cell 186, 47–62.e16 (2023).

    Article 
    PubMed 

    Google Scholar
     

  • Eppley, J. M., Biller, S. J., Luo, E., Burger, A. & DeLong, E. F. Marine viral particles reveal an expansive repertoire of phage-parasitizing mobile elements. Proc. Natl Acad. Sci. USA 119, e2212722119 (2022).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Smyshlyaev, G., Bateman, A. & Barabas, O. Sequence analysis of tyrosine recombinases allows annotation of mobile genetic elements in prokaryotic genomes. Mol. Syst. Biol. 17, e9880 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Gibb, B. et al. Requirements for catalysis in the Cre recombinase active site. Nucleic Acids Res. 38, 5817–5832 (2010).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Williams, K. P. Integration sites for genetic elements in prokaryotic tRNA and tmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res. 30, 866–875 (2002).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Hatfull, G. F. & Hendrix, R. W. Bacteriophages and their genomes. Curr. Opin. Virol. 1, 298–303 (2011).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Koonin, E. V., Krupovic, M. & Dolja, V. V. The global virome: how much diversity and how many independent origins? Environ. Microbiol. 25, 40–44 (2023).

    Article 
    PubMed 

    Google Scholar
     

  • Shen, A. & Millard, A. Phage genome annotation: where to begin and end. Phage 2, 183–193 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Borodovich, T., Shkoporov, A. N., Ross, R. P. & Hill, C. Phage-mediated horizontal gene transfer and its implications for the human gut microbiome. Gastroenterol. Rep. 10, goac012 (2022).

    Article 

    Google Scholar
     

  • Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Nicolas, E. et al. The tn3-family of replicative transposons. Microbiol. Spectr. 3, 3.4.14 (2015).

    Article 

    Google Scholar
     

  • Mavrich, T. N. & Hatfull, G. F. Bacteriophage evolution differs by host, lifestyle and genome. Nat. Microbiol. 2, 1–9, 17112 (2017).

  • Mohssen, M., et al. efam. CyVerse Data Commons https://datacommons.cyverse.org/browse/iplant/home/shared/iVirus/Zayed_efam_2020.1 (2021).

  • Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).


    Google Scholar
     

  • Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems https://www.tensorflow.org/ (2015).

  • Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145 (2018).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).

  • Charlier, F. et al. Statannotations. Zenodo (2022); https://doi.org/10.5281/zenodo.7213391

  • Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. In Proc. 7th Python in Science Conference (eds Varoquaux, G. et al.) 11–15 (2008).

  • Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2022).

    Article 
    PubMed Central 

    Google Scholar
     

  • Gabler, F. et al. Protein sequence analysis using the MPI Bioinformatics Toolkit. Curr. Protoc. Bioinformatics 72, e108 (2020).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Zimmermann, L. et al. A completely reimplemented MPI Bioinformatics Toolkit with a new HHpred server at its core. J. Mol. Biol. 430, 2237–2243 (2018).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Potter, S. C. et al. HMMER web server: 2018 update. Nucleic Acids Res. 46, W200–W204 (2018).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578 (2019).

    PubMed Central 

    Google Scholar
     

  • Roux, S. et al. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res. 49, D764–D775 (2020).

    Article 
    PubMed Central 

    Google Scholar
     

  • Paez-Espino, D. et al. IMG/VR v2.0: an integrated data management and analysis system for cultivated and environmental viral genomes. Nucleic Acids Res. 47, D678–D686 (2018).

    Article 
    PubMed Central 

    Google Scholar
     

  • Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2-approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01773-0 (2023).

  • Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Camargo, A. geNomad Database. Zenodo (2023); https://doi.org/10.5281/zenodo.7793532

  • Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Kauffman, K. M. et al. Viruses of the Nahant Collection, characterization of 251 marine Vibrionaceae viruses. Sci. Data 5, 180114 (2018).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Waterhouse, A. et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46, W296–W303 (2018).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Biswas, T. et al. A structural basis for allosteric control of DNA recombination by lambda integrase. Nature 435, 1059–1066 (2005).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D. 66, 12–21 (2010).

    Article 
    CAS 
    PubMed 

    Google Scholar
     

  • Flamholz, Z. kellylab/viral-protein-function-plm: v1.0. Zenodo (2023); https://doi.org/10.5281/zenodo.10182747

  • Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).

    Article 
    CAS 
    PubMed 
    PubMed Central 

    Google Scholar
     

  • pandas development team pandas-dev/pandas: Pandas. Zenodo (2020); https://doi.org/10.5281/zenodo.3509134

  • Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).

    Article 

    Google Scholar
     

  • Waskom, M. L. seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).

    Article 

    Google Scholar
     

  • Granger, B. E. & Pérez, F. Jupyter: thinking and storytelling with code and data. Comput. Sci. Eng. 23, 7–14 (2021).

    Article 

    Google Scholar
     



  • Source link

    Leave a Reply

    Your email address will not be published. Required fields are marked *