Uncategorized

RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci | Genome Biology



Identification of significant data features

To select features that may be relevant for classifying TR loci as pathogenic or benign, we downloaded 30 different annotation datasets from a variety of sources including UCSC’s Table Browser, PsychENCODE, and relevant publications (Additional file 1: Table S1). We converted all dataset files into BED format. First, we created a list of 40 known pathogenic TRs which are listed as represented in the training dataset in Table 1. These TRs were included in initial analyses because they are well established in their causal link to disease [24,25,26]. Next, we obtained a dataset of all TRs cataloged in the reference genome from UCSC’s Table Browser (Simple Repeats table in hg19). Finally, we used this reference TR dataset to create a set of “matched TRs,” which included all TRs that have a disease motif and occur in the same genic region. For example, all TRs with a CAG motif that occur in exons would be part of the matched TR dataset.

Table 1 Pathogenic TRs used in training/LOOCV and testing datasets. RExPRT ensemble classification, as well as both SVM and XGB scores are provided for each pathogenic TR. Additionally, information is provided for each locus including associated disease, motif, the gene region in which the TR is located, age-of-onset, inheritance pattern of the associated disease, whether a motif change is required to pathogenic, and whether the motif was uniquely represented among other pathogenic TRs in the training dataset

The three TR datasets (pathogenic TRs, reference TRs, and matched TRs) were then compared individually with each of the 30 annotation files. When evaluating a feature, we employed bedtools fisher, which takes two sets of genomic intervals as input (e.g., pathogenic TRs and ORegAnno regions) and performs Fisher’s exact test, evaluating the amount of overlap between the two sets of intervals. This test is designed to determine whether there is a statistically significant nonrandom association between these two sets of genomic intervals. For the example provided, the presence of a significant p-value in Fisher’s exact test would indicate that the intervals in the pathogenic TRs are enriched in ORegAnno regions. It highlights that pathogenic TRs are more likely to occur in ORegAnno regions than expected by chance. Using bedtools fisher, we generated contingency tables for all pairs of comparisons, then calculated 95% confidence intervals for the odds ratios using fisher.test() from the R stats package. If the 95% confidence interval of the odds ratio for an annotation’s intersection with pathogenic TRs was higher than the 95% confidence interval for the matched TRs, the annotation feature was considered significantly enriched for pathogenic TRs. For genic annotation features (3′UTR, 5′UTR, exon, intron, and promoter), as presented in Fig. 2b and c, the comparison was made between pathogenic TRs and reference TRs instead to avoid bias, since matched TRs already selected for certain genic regions.

Quantitative and multivalued features

We further selected additional annotations to test in our models (Additional file 1: Table S2). These included datasets with numerical values rather than the purely categorical datasets discussed above. For the features described above, TRs were given a value of 1 if the locus intersected with the annotation feature, and a value of 0 if there was no intersection. Quantitative datasets that we incorporated include pLI scores, LOEUF scores [27], and GERP scores [28]. The basis for including these is due to their significance in assessing pathogenicity of SNVs. We also created a feature that calculated the distance between the TR locus and the nearest gene, which was important for TRs that are intergenic.

Additionally, we incorporated information regarding the motif assuming a motif-phenotype correlation exists as proposed by Ishiura et al., where TRs with the same motifs, occurring in the same genic regions—albeit in different genes entirely—can produce the same phenotypes, suggesting TR motif may be important in its pathogenicity [29]. We calculated the percentages of each nucleotide in the motif and provided a GC content as well. For motif analysis, we also used the Sequence to Star Network (S2SNet) approach [30], which can transform any character-based sequence into a graph-based star-shaped complex network. We characterized the star network’s topological indices (Tis) with calculations of different metrics, including its Shannon entropy, spectral moments, Harray number, Wiener index, Gutman topological index, Schultz topological index, Balaban distance connectivity index, Kier-Hall connectivity index, and Randic connectivity index [30]. S2SNet was downloaded from GitHub and run using the default parameters on the command line. For the input, we created a sequence of 10 repeating units of the TR motif.

Finally, we accounted for tissue expression of the genes that harbored TRs. We created a categorical feature reflecting the tissue where each gene had its maximum expression according to data from the GTEx Portal (date of accession: September 2021). The TR was then assigned to one of three possible categories: expressed in neurological tissue, expressed in another, or unknown expression.

Training and testing dataset creation and preprocessing

To train RExPRT, we used the same 40 known pathogenic loci that were assessed with Fisher’s exact tests above. Based on our previous work, [8] we used a set of 754 TR loci that were commonly expanded in the 1000 Genomes Project controls as negative training data. Specifically, we selected loci that were present at a size of > 175 bp in more than 1% of the control samples. Since these TRs are commonly expanded, we presumed that they are benign. To annotate our TRs with the features discussed above, we created an input file containing the reference coordinates of the locus in hg19 (chromosome, start position, and end position), as well as the motif. After annotation with the features described, we used the OneHotEncoder from Scikit-Learn to preprocess the data. This was used for categorical variables including GTEx, gene region (intron/exon/intergenic), and gene location (first/middle/last).

For the testing dataset, we used the 21 remaining pathogenic TR loci listed in Table 1. These were loci that were either discovered more recently, have only been found in a small number of affected families, or were previously missed [3]. For the negative loci in the test set, we used sites that can be presumed benign, but are still rare since RExPRT will have to distinguish between rare benign TRs and rare pathogenic TRs. We decided to use 83 rare TRs that were candidate TRs resulting from running the outlier pipeline described subsequently on our 102 positive controls. Since these genomes correspond to patients that were all diagnosed with a Mendelian repeat expansion disorder (caused by a single gene), other TR expansions in these genomes are presumed to be benign. Therefore, we combined all the candidate TRs from these genomes and excluded all known pathogenic TRs to produce a list of 83 rare, benign TRs.

Model testing and assessment

We tested seven different statistical learning models: logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), decision tree (DT), gradient boosted decision tree (XGB), and linear discriminant analysis (LDA). To evaluate the performance of these models, we used a leave-one-out cross validation (LOOCV) approach. The LOOCV approach is useful here because we have a small positive training dataset of known pathogenic TRs. For models that are sensitive to scaling such as SVM, we used StandardScaler to standardize features by subtracting the mean and scaling by the standard deviation. To assess model performance, we obtained the confusion matrix and determined overall accuracy, precision, recall, and F1 metrics. We also produced receiver operating characteristic curve (ROC) and precision-recall curves (PRC) and calculated the area under the curve for each one. All coding was done in Python using Scikit-Learn. Formulas for calculations are listed below:

$$Accuracy=\frac{true\;positives\;\left(tp\right)+true\;negatives\;(tn)}{tp+tn+false\;positives\;\left(fp\right)+false\;negatives\;(fn)}\times100$$

(1)

$$Precision=\frac{tp}{tp+fp}$$

(2)

$$Recall= \frac{tp}{tp+fn}$$

(3)

$$F1=\frac{2 \times Precision \times Recall}{Precision+Recall}$$

(4)

Feature selection

To further improve the accuracy of the two best performing models, we plotted a feature importance graph (XGB) and a permutation importance graph (SVM) to visualize the significance of each feature to the models. We then removed features that were ranked low on the list, indicating that the feature was not adding significant value in the decision-making process. We only removed features if they did not impact confusion matrix scores or the overall area under the ROC curve (AUROC) score.

Additionally, we created a correlation plot of all features and removed features that were highly correlated. In particular, correlated features that produced the highest AUROC value if retained were kept, ensuring that removal of features did not reduce the AUROC score.

Hyperparameter tuning

We used Scikit-Learn’s GridSearchCV to fine-tune the SVM and XGB models. For SVM, we tuned three parameters: C, gamma, and kernel. For XGB, we tuned two parameters: number of estimators, and max depth.

Ensemble method

To improve the overall recall of RExPRT, we devised an ensemble approach, which encompasses the two different models described above (SVM and XGB) [31]. The classification of the TR by the ensemble model will be pathogenic if either of the models predict likelihood of pathogenicity. The threshold is set to a standard 0.5 probability, above which the TR is classified as pathogenic. Furthermore, RExPRT also provides a confidence score, which is the sum of the SVM and XGB scores. Figure 1 is a schematic illustrating the complete methodology and workflow of RExPRT, covering training, LOOCV, model selection, and predictions on the testing dataset.

Random splitting of data

We combined our initial training and testing datasets, then generated five separate training and testing datasets by random division (2/3 for training and 1/3 for testing). This was done using scikit-learn’s train_test_split() function. We then employed the ensemble model and generated summary metrics for these datasets.

Age of onset analysis

We calculated the midpoint of the observed age of onset ranges for each disorder. Next, we generated a scatter plot where each point corresponds to a disease-associated pathogenic TR locus. The x-axis represents the midpoint of the observed range of onset ages for the associated disease (refer to Table 1), while the y-axis reflects the RExPRT ensemble score. Subsequently, we calculated Spearman’s rank correlation coefficient (rho) for these two variables.

Analysis of ~ 800,000 reference repeats classified by RExPRT

We applied RExPRT to 836,545 TRs with motif lengths between 3 and 8 bp listed in the hg19 reference genome. We separated TRs into those classified as pathogenic by RExPRT and those classified as benign. We then characterized the genomic features of pathogenic TRs and benign TRs. For each feature, we created two plots:

  1. (a)

    Bar graph or pie chart: this represents the fraction of pathogenic or benign TRs that overlap a particular feature relative to the total number of TRs within the group.

  2. (b)

    Odds ratio graph: this graph evaluates whether the observed percentage above is significantly deviated from what would be expected by random chance, considering the coverage of that feature within the entire genome.

Bedtools fisher was used to perform Fisher’s exact tests and calculate odds ratios for gene regions and regulatory regions as described above. For gene expression characterization, we excluded intergenic TRs. We then performed Fisher’s exact tests in R, first creating a contingency table for benign and pathogenic TRs. To do this, we calculated the number of genes that overlapped in the different sets of tissues with the two groups of classified TRs.

For analysis of Online Mendelian Inheritance in Man (OMIM) genes, we downloaded the list of genes from their website, and filtered for those that are Mendelian. For those genes that are dominant and recessive, we used this same dataset and filtered for the respective subtype. The ataxia gene list was obtained from GeneDx’s Ataxia Xpanded panel (https://www.genedx.com/tests/detail/ataxia-xpanded-panel-887). The odds ratios were obtained by calculating Fisher’s exact tests in R, as described for the gene expression data.

To analyze the disease motifs, we filtered all pathogenic and benign TRs for those which contained a known repeat expansion disease motif (Table 1). We included each window shift of the motif as well as its reverse complement. For pure GC motifs, we filtered for TRs that only contained G and C in their motif. For polyglutamine (polyQ) and polyalanine (polyA) motifs, we began with only coding TRs in each group and filtered for the trinucleotides which code for these amino acids. Since we do not have the coding frame information, many of these will not actually code for glutamine or alanine, so this is an overestimation. Odds ratios were calculated in R, as described for the gene expression, and disease gene categories.

Undiagnosed Diseases Network (UDN) sample processing

Patient fastq files were aligned with Burrows–Wheeler aligner (BWA) to the GRCh38 reference genome [32]. The resulting SAM files were converted to BAM files. Duplicates were removed with Picard tools MarkDuplicates—https://github.com/broadinstitute/picard. After sorting and reindexing, base quality score recalibration (BQSR) was performed using genome analysis toolkit (GATK) [33]. Next, ExpansionHunter Denovo (EHDn) was run on the resulting BAM files, using additional parameters “–min-unit-len 3 –max-unit-len 8” [6]. The outputs of EHDn for each case was then aggregated separately to the outputs for the 2405 samples from the 1000 Genomes Project controls using EHDn’s helper scripts to allow depth normalization. Finally, the bed file output was expanded from its sparse encoding into a dense matrix format using R. This dense matrix of depth-normalized anchored in-repeat read (IRR) counts was used as the input for the outlier pipeline.

Outlier pipeline

After sample processing, each case is represented by a dense matrix of TRs with anchored IRR counts for the case as well as all controls. Summary statistics were calculated for all TRs before filtering. For each TR, the outlier pipeline calculates a z-score, a kernel density estimation, and the percentage of controls with anchored IRR counts above that seen for the patient sample. The z-score for a case is a measure of their anchored IRR count in terms of the number of standard deviations above or below the mean anchored IRR counts observed in controls. The kernel density estimation gives the expected proportion of controls with anchored IRR counts above that of the patient, based on the density curve distribution of controls. All three measures give us an indication of the likelihood of the anchored IRR count for the patient sample being part of the distribution of controls. Downstream filtering selects for TRs with high z-scores, and low values for the other two measures. Specifically, the pipeline filters out TRs based on anchored IRR counts in patients (≥ 5), frequency of controls with at least 1 allele over 175 bp (< 1%), genomic region the TR is located in (≠ intergenic), and whether the TR stems from a reference repeat locus or Alu element. Moreover, we also filter out a list of 126 false positive sites that were selected based on their presence in a heterogenous group of ~ 600 disease genomes and low occurrence in the 1000 Genomes controls. It is thought that since these variants are too common in a cohort of rare disease cases, they cannot be causal variants. The final output is a list of candidate TRs for each patient genome. Therefore, here we define candidate TRs as TRs that are rare and expanded in a patient, occur within or close to genes, and stem from a reference repeat locus.

Analysis of UDN genomes

We ran 2982 genomes (968 probands and affected/unaffected family members) from the UDN through our outlier pipeline, resulting in a list of candidate TRs that can be defined as rare, genic TR expansions that stem from a reference repeat locus. Since these TRs were called by ExpansionHunter Denovo (EHDn), they do not have precise locus specificity. EHDn provides coordinates of ~ 1–2000 bp surrounding the repeat. To run ExpansionHunter on these loci for determining repeat number and zygosity, we used ehdn-to-eh (https://github.com/francesca-lucas/ehdn-to-eh), which provides precise coordinates for the TR. These coordinates were then used to create the variant catalogs for running ExpansionHunter. Since the UDN genomes are aligned to hg38, we converted the loci into hg19 using UCSC’s LiftOver tool [34], and then ran RExPRT on the candidate TRs.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *