PARSID: a Python script for automatic analysis of local BLAST results for a rapid molecular taxonomic identification

Implementation

PARSID.py was developed in Python 3.8 (www.python.org), and its functioning requires the Biopython package (version 1.78; [9]; https://biopython.org), specifically the Bio.Blast, Bio.SearchIO, and Bio.SeqIO subpackages, and the XlsxWriter package (version 3.0.9; https://xlsxwriter.readthedocs.io/index.html). The source code and test files are available on GitHub [10] with detailed instructions. Before starting, BLAST+ [11] has to be locally installed to run BLAST from command line.

Functioning

PARSID.py automates the process of obtaining taxonomically-accurate molecular species identifications of multiple sequences from an input file in five steps and facilitates the visualization of the results (Fig. 1). The RSL is a customized multi-sequence database used to perform the local BLAST against the input sequence file. It contains single-marker reference sequences with unique identifiers for each described and candidate species of the analysed taxonomic group. While running the script, the user sets the cut-off percentage for sequence identity (user’s parameter: cutoff_pident), below which BLAST results are not saved, and the interspecific sequence divergence percentage (user’s parameter: intersp_div), which corresponds to the divergence threshold defining different taxa. Sequences with larger degrees of divergence in the BLAST analysis than the ones defined with the intersp_div are tagged for further checking. The user can provide supplementary tags for specific taxa through a.csv file that the script integrates in the final results file. The five main steps are explained below.

1.

Generation of the local BLAST database using the customized RSL available with the user (Bio.Blast command makeblastdb). This step is the only one that runs from the shell. This step is executed prior using each different RSL.
2.

Local BLAST of the input sequences and generation of the BLAST result file to parse. The output format is set on 12 standard fields in tabular format, adding the query length (qlen), the subject length (slen), the number of gaps (gaps), and the query coverage per subject (qcovs). The result file is saved in plain-text format that make it easier to consult and uses lower computer memory than the .xml format. It contains the hits with an identity percentage above the user’s cutoff_pident.
3.

Parsing and filtering the local BLAST result. For each hit, the query coverage is calculated as:

$$ qcov=\frac{(qend-qstart+1)}{qlen}*100$$

Then the hits are filtered out by query coverage (< 75%), and by alignment length (< 90% of the subject length). These two parameters allow filtering those results that have high percentage of identity calculated on a shorter alignment length, in the case of a short query or short reference sequence. The results are then sorted by percentage of identity. The best-match for each query is retrieved.
4.

Checking the parsed result by user’s parameter (intersp_div) and .csv supplementary file. The results that have a percentage of identity below the sequence similarity threshold, calculated as:

$$ similarity threshold=100-intersp\_div$$

or that have a query coverage below 75%, are tagged for further checking. The variability of the selected genetic marker is not always enough to enable accurate species identification for some taxa. In this case the user can integrate a supplementary file containing taxa-specific warnings so that the results of these specific sequences can be flagged and the metadata of these sequences (e.g. collection locality, voucher information, etc.) can be analysed to enhance identification.
5.

Generation of the final output file. The output file can be visualized in a spreadsheet. The first sheet contains the list of the input sequences with the respective BLAST’s best-matches, including percentage of identity and query coverage. The results below the similarity threshold or with query coverage below 75% are color-coded for an easier visualization. The second sheet contains all the hits for those sequences tagged for further checking, either due to low percentage of identity or low query coverage. The third sheet contains the full list of lineages of the customized RSL with the respective sequences identifiers found in the input file, and also the list of sequences with a percentage of identity below the user’s cutoff_pident. The last sheet contains some general information, as the total number of lineages in the RSL, the taxonomic coverage of the RSL (i.e. the percentage of taxa of the RSL found in the input file), the total number of input sequences, the percentage of sequences processed (i.e. 100% if the script worked correctly), and the number of results below the user’s cutoff_pident.

Testing

To simulate the molecular taxonomic identification of multiple sequences, we built a RSL using 25 sequences available in GenBank belonging to the frog family of Mantellidae of Madagascar for which the taxonomic identification was considered reliable (based on an expert-based approach). We tested the functioning of the script using an input file of 10 sequences available in GenBank (nine belonging to frogs of the family Mantellidae and one belonging to a species of the family Hyperoliidae). We set the cutoff_pident at 90% and the intersp_div at 3%. We included the optional.csv file to flag results of the genus “Mantella” or the RSL sequences of the herpetological collection of Museo Regionale di Scienze Naturali di Torino (“MRSN”). These test files are available in GitHub [10]. The input file included sequences with outdated taxonomic information, an outgroup (i.e. Heterixalus), and some sequences that needed further checking. The script successfully processed all sequences. The highest percentage of identity (%id) match was assigned to the nine mantellid sequences. Each result also included the following information to evaluate the query-subject alignment: %qcov, qlen, slen, aln_len, evalue, and notes. Of these nine sequences, three were highlighted in red and were tagged for further checking as the %id was below the similarity threshold initially set (i.e. 97%). Further results ordered by %id for these three sequences were listed in the second sheet of the result spreadsheet. Other sequences tagged for further checking contained information included in the.csv file. In the third sheet, results are organized by lineage and outgroups’ identifiers (i.e. sequences not belonging to Mantellidae were pooled together). In the fourth sheet, the script provided the taxonomic coverage in the RSL (28%), the number of taxa in the RSL without a sample (18 taxa), and the number of outgroups (1 sequence).