Annotation of Human Exome Gene Variants with Consensus Pathogenicity

Victor Jaravine; James Balmford; Patrick Metzger; Melanie Boerries; Harald Binder; Martin Boeker

doi:10.3390/genes11091076

Genes ◽

10.3390/genes11091076 ◽

2020 ◽

Vol 11 (9) ◽

pp. 1076

Author(s):

Victor Jaravine ◽

James Balmford ◽

Patrick Metzger ◽

Melanie Boerries ◽

Harald Binder ◽

...

Keyword(s):

Conservation Score ◽

Species Conservation ◽

Gradient Boosting ◽

Biological Applications ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Novel Approach ◽

Phenotypic Variant ◽

Variant Effect ◽

Direct Use

A novel approach is developed to address the challenge of annotating phenotypic effects for exome variants for which relevant empirical data are lacking or minimal. The predictive annotation method is implemented as a stacked ensemble of supervised base-learners, including distributed random forest and gradient boosting machines. Ensemble models were trained and cross-validated on evidence-based categorical variant effect annotations from the ClinVar database, and were applied to 84 million non-synonymous single nucleotide variants (SNVs). The consensus model combined 39 functional mutation impact scores, a cross-species conservation score, and a gene indispensability score. The indispensability score, which accounts for differences in variant pathogenicity between essential and mutation-tolerant genes, considerably improved the predictions. The consensus combination is consistent with as many input scores as possible while minimizing false predictions. The input scores are ranked by their ability to predict effects. The score rankings and categorical phenotypic variant effect predictions are intended for direct use in clinical and biological applications to prioritize human exome variants and mutations.
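The ranking of input scores by predictive ability can be illustrated with a small sketch. The abstract does not say which metric the authors used for ranking, so the area under the ROC curve (AUC) is an assumption here, and the score names and values are invented toy data rather than dbNSFP or ClinVar content.

```python
from typing import Dict, List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: 1 = pathogenic, 0 = benign. A tie between a pathogenic and a
    benign variant counts as half a correct ordering.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_scores(table: Dict[str, List[float]],
                labels: List[int]) -> List[Tuple[str, float]]:
    """Return (score_name, auc) pairs sorted from most to least predictive."""
    ranked = [(name, auc(values, labels)) for name, values in table.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy data: three hypothetical input scores for six labelled variants.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "score_a": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],  # separates the classes
    "score_b": [0.9, 0.2, 0.7, 0.3, 0.8, 0.1],  # partly informative
    "score_c": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # uninformative
}
ranking = rank_scores(table, labels)
for name, value in ranking:
    print(f"{name}\t{value:.3f}")
```

The pairwise loop is quadratic in the number of variants, which is fine for a toy table but would need a sort-based rank computation at the scale of millions of variants.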

Fido-SNP: the first webserver for scoring the impact of single nucleotide variants in the dog genome

Nucleic Acids Research ◽

10.1093/nar/gkz420 ◽

2019 ◽

Vol 47 (W1) ◽

pp. W136-W141 ◽

Cited By ~ 1

Author(s):

Emidio Capriotti ◽

Ludovica Montanucci ◽

Giuseppe Profiti ◽

Ivan Rossi ◽

Diana Giannuzzi ◽

...

Keyword(s):

Matthews Correlation Coefficient ◽

Genomic Variation ◽

Gradient Boosting ◽

Binary Classifier ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Coding Regions ◽

Variation Data ◽

Boosting Algorithm ◽

The Impact

As the amount of genomic variation data increases, tools that can score the functional impact of single nucleotide variants become increasingly necessary. While several prediction servers are available for interpreting the effects of variants in the human genome, only a few have been developed for other species, and none were specifically designed for species of veterinary interest such as the dog. Here, we present Fido-SNP, the first predictor able to discriminate between Pathogenic and Benign single-nucleotide variants in the dog genome. Fido-SNP is a binary classifier based on the Gradient Boosting algorithm. It classifies and scores the impact of variants in both coding and non-coding regions based on sequence features within seconds. When validated on a previously unseen set of annotated variants from the OMIA database, Fido-SNP reaches 88% overall accuracy, a Matthews correlation coefficient of 0.77, and an area under the ROC curve of 0.91.
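For readers less familiar with the reported metrics, the following minimal sketch shows how overall accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The counts are invented for illustration and are not the Fido-SNP validation data.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all variants that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a 2x2 confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: Pathogenic is the positive class.
tp, tn, fp, fn = 40, 45, 5, 10
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={acc:.2f} mcc={mcc:.2f}")
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when the Pathogenic and Benign classes are unbalanced.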

Novel approach for CES1 genotyping: integrating single nucleotide variants and structural variation

Pharmacogenomics ◽

10.2217/pgs-2016-0145 ◽

2018 ◽

Vol 19 (4) ◽

pp. 349-359 ◽

Cited By ~ 1

Author(s):

Ditte Bjerre ◽

Henrik Berg Rasmussen ◽

The INDICES Consortium

Keyword(s):

Structural Variation ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Novel Approach

Comprehensive variant effect predictions of single nucleotide variants in model organisms

10.1101/313031 ◽

2018 ◽

Cited By ~ 3

Author(s):

Omar Wagih ◽

Bede Busby ◽

Marco Galardini ◽

Danish Memon ◽

Athanasios Typas ◽

...

Keyword(s):

Amino Acid ◽

Protein Complex ◽

Model Organisms ◽

Single Nucleotide Variants ◽

Cellular Mechanisms ◽

Single Nucleotide ◽

Post Translational Modifications ◽

Variant Effect ◽

Coding Variants ◽

The Impact

The effect of single nucleotide variants (SNVs) in coding and non-coding regions is of great interest in genetics. Although many computational methods aim to elucidate the effects of SNVs on cellular mechanisms, it is not straightforward to comprehensively cover different molecular effects. To address this, we compiled and benchmarked sequence- and structure-based variant effect predictors and analyzed the impact of nearly all possible amino acid and nucleotide variants in the reference genomes of H. sapiens, S. cerevisiae and E. coli. Studied mechanisms include protein stability, interaction interfaces, post-translational modifications and transcription factor binding sites. We apply this resource to the study of natural and disease coding variants. We also show how variant effects can be aggregated to generate protein complex burden scores that uncover protein complex to phenotype associations, based on a set of newly generated growth profiles of 93 sequenced S. cerevisiae strains in 43 conditions. This resource is available through mutfunc, a tool with which users can query precomputed predictions by providing amino acid or nucleotide-level variants.

Annotation of Human Exome Gene Variants with Consensus Pathogenicity

10.20944/preprints202007.0735.v1 ◽

2020 ◽

Author(s):

Victor Zharavin ◽

James Balmford ◽

Patrick Metzger ◽

Melanie Boerries ◽

Harald Binder ◽

...

Keyword(s):

Prediction Models ◽

Human Gene ◽

Gene Variants ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Functional Conservation ◽

Mutation Impact ◽

Uncertain Significance ◽

Direct Use ◽

Variant Effect Prediction

Pathogenicity is unknown for the majority of human gene variants. In silico approaches can be used to prioritize sequenced somatic and germline variants. In this study, 84 million non-synonymous Single Nucleotide Variants (SNVs) in the human coding genome were annotated using the consensus Variant Effect Prediction (cVEP) method. An algorithm, implemented as a stacked ensemble of supervised learners, combined the 39 functional and conservation mutation impact scores from dbNSFP4.0. Adding a gene indispensability score, which accounts for differences in the pathogenicity of variants in essential and mutation-tolerant genes, improved the predictions. For each SNV the consensus combination gives either a continuous-value pathogenicity score or a categorical score in five classes: pathogenic, likely pathogenic, uncertain significance, likely benign, benign. The provided class database is intended for direct use in clinical practice. The trained prediction models were 5-fold cross-validated on the evidence-based categorical annotations from the ClinVar database. Rankings of the scores by their ability to predict pathogenicity were obtained. A two-step strategy using the rankings, scores and class annotations is suggested for filtering and prioritizing human exome mutations in clinical and biological applications of NGS technology.
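The relation between the continuous score and the five categorical classes can be pictured as a simple binning step. The sketch below is only an illustration: the abstract does not give the thresholds or the score range, so the 0 to 1 range and the evenly spaced cut points are assumptions, and the trained models may assign classes in a different way.

```python
from bisect import bisect_right
from typing import List

# The five classes named in the abstract, ordered from least to most severe.
CLASSES: List[str] = [
    "benign",
    "likely benign",
    "uncertain significance",
    "likely pathogenic",
    "pathogenic",
]

# Hypothetical cut points on an assumed 0..1 pathogenicity score.
CUTS: List[float] = [0.2, 0.4, 0.6, 0.8]


def classify(score: float) -> str:
    """Map a continuous pathogenicity score to one of the five classes.

    A score equal to a cut point falls into the higher class.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return CLASSES[bisect_right(CUTS, score)]


for s in (0.05, 0.2, 0.5, 0.95):
    print(s, classify(s))
```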

IntSplice2: Prediction of the Splicing Effects of Intronic Single-Nucleotide Variants Using LightGBM Modeling

Frontiers in Genetics ◽

10.3389/fgene.2021.701076 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jun-ichi Takeda ◽

Sae Fukami ◽

Akira Tamura ◽

Akihide Shibata ◽

Kinji Ohno

Keyword(s):

Allelic Frequency ◽

Training Dataset ◽

Gradient Boosting ◽

Support Vector ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Test Dataset ◽

Average Recall ◽

Statistical Measures ◽

Light Gradient

Predicting the effect of a single-nucleotide variant (SNV) in an intronic region on aberrant pre-mRNA splicing is challenging, except for SNVs affecting the canonical GU/AG splice sites (ss). To predict pathogenicity of SNVs at intronic positions −50 (Int-50) to −3 (Int-3) close to the 3’ ss, we developed light gradient boosting machine (LightGBM)-based IntSplice2 models using pathogenic SNVs in the human gene mutation database (HGMD) and ClinVar and common SNVs in dbSNP with 0.01 ≤ minor allelic frequency (MAF) < 0.50. The LightGBM models were generated using features representing splicing cis-elements. The average recall/sensitivity and specificity of IntSplice2 by fivefold cross-validation (CV) of the training dataset were 0.764 and 0.884, respectively. The recall/sensitivity of IntSplice2 was lower than the average recall/sensitivity of 0.800 of IntSplice, which we previously built with support vector machine (SVM) modeling for the same intronic positions. In contrast, the specificity of IntSplice2 was higher than the average specificity of 0.849 of IntSplice. For benchmarking (BM) of IntSplice2 against IntSplice, we made a test dataset that was not used to train IntSplice. After excluding the test dataset from the training dataset, we generated IntSplice2-BM and compared it with IntSplice using the test dataset. IntSplice2-BM was superior to IntSplice in all seven statistical measures: accuracy, precision, recall/sensitivity, specificity, F1 score, negative predictive value (NPV), and Matthews correlation coefficient (MCC). The IntSplice2 web service is available at https://www.med.nagoya-u.ac.jp/neurogenetics/IntSplice2.
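The two selection criteria stated above, the intronic window Int-50 to Int-3 and the MAF range for common SNVs, can be written as a small filter. This is a sketch of the stated thresholds only; the record layout, field names and variant identifiers are hypothetical and do not reflect the actual IntSplice2 pipeline or the dbSNP file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IntronicSNV:
    variant_id: str        # hypothetical identifier
    intron_pos: int        # position relative to the 3' splice site (negative)
    maf: Optional[float]   # minor allelic frequency; None when unknown


def in_modelled_window(snv: IntronicSNV) -> bool:
    """True for intronic positions -50 (Int-50) through -3 (Int-3)."""
    return -50 <= snv.intron_pos <= -3


def is_common(snv: IntronicSNV) -> bool:
    """True when 0.01 <= MAF < 0.50, the range quoted in the abstract."""
    return snv.maf is not None and 0.01 <= snv.maf < 0.50


def select_common_snvs(snvs: List[IntronicSNV]) -> List[IntronicSNV]:
    """Keep SNVs that lie in the modelled window and are common."""
    return [s for s in snvs if in_modelled_window(s) and is_common(s)]


# Invented records that exercise each boundary of the two criteria.
candidates = [
    IntronicSNV("v1", -10, 0.05),   # kept
    IntronicSNV("v2", -2, 0.20),    # too close to the splice site
    IntronicSNV("v3", -50, 0.01),   # kept: both lower bounds are inclusive
    IntronicSNV("v4", -3, 0.50),    # MAF upper bound is exclusive
    IntronicSNV("v5", -30, None),   # MAF unknown
    IntronicSNV("v6", -51, 0.10),   # outside the window
]
selected_ids = [s.variant_id for s in select_common_snvs(candidates)]
print(selected_ids)
```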

Decoding the effects of synonymous variants

10.1101/2021.05.20.445019 ◽

2021 ◽

Author(s):

Zishuo Zeng ◽

Ariel A Aptekmann ◽

Yana Bromberg

Keyword(s):

Human Genome ◽

Training Data ◽

Gradient Boosting ◽

Species Variation ◽

Single Nucleotide Variants ◽

Biological Impact ◽

Single Nucleotide ◽

Extreme Gradient Boosting ◽

Synonymous Variant ◽

Standard Training

Synonymous single nucleotide variants (sSNVs) are common in the human genome but are often overlooked. However, sSNVs can have significant biological impact and may lead to disease. Existing computational methods for evaluating the effect of sSNVs suffer from the lack of gold-standard training/evaluation data and exhibit over-reliance on sequence conservation signals. We developed synVep (synonymous Variant effect predictor), a machine learning-based method that overcomes both of these limitations. Our training data was a combination of variants reported by gnomAD (observed) and those unreported, but possible in the human genome (generated). We used positive-unlabeled learning to purify the generated variant set of any likely unobservable variants. We then trained two sequential extreme gradient boosting models to identify subsets of the remaining variants putatively enriched and depleted in effect. Our method attained 90% precision/recall on a previously unseen set of variants. Furthermore, although synVep does not explicitly use conservation, its scores correlated with evolutionary distances between orthologs in cross-species variation analysis. synVep was also able to differentiate pathogenic vs. benign variants, as well as splice-site disrupting variants (SDV) vs. non-SDVs. Thus, synVep provides an important improvement in annotation of sSNVs, allowing users to focus on variants that most likely harbor effects. Availability: the synVep webserver for online queries is at https://services.bromberglab.org/synvep; for local runs, a Python script (https://bitbucket.org/bromberglab/synvep_local) and a prediction database (https://zenodo.org/record/4763256) are also available.

A resource of variant effect predictions of single nucleotide variants in model organisms

Molecular Systems Biology ◽

10.15252/msb.20188430 ◽

2018 ◽

Vol 14 (12) ◽

Cited By ~ 30

Author(s):

Omar Wagih ◽

Marco Galardini ◽

Bede P Busby ◽

Danish Memon ◽

Athanasios Typas ◽

...

Keyword(s):

Model Organisms ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Variant Effect

Faculty Opinions recommendation of Phylogenetic and physicochemical analyses enhance the classification of rare nonsynonymous single nucleotide variants in type 1 and 2 long-QT syndrome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.717960422.793463950 ◽

2012 ◽

Author(s):

Jeffrey Noebels ◽

Tara Klassen

Keyword(s):

Long Qt Syndrome ◽

Single Nucleotide Variants ◽

Long Qt ◽

Single Nucleotide ◽

Qt Syndrome

Single-Nucleotide Variants in microRNAs Sequences or in their Target Genes Might Influence the Risk of Epilepsy: A Review

Cellular and Molecular Neurobiology ◽

10.1007/s10571-021-01058-7 ◽

2021 ◽

Author(s):

Renata Parissi Buainain ◽

Matheus Negri Boschiero ◽

Bruno Camporeze ◽

Paulo Henrique Pires de Aguiar ◽

Fernando Augusto Lima Marson ◽

...

Keyword(s):

Target Genes ◽

Single Nucleotide Variants ◽

Single Nucleotide

Combination of Genome-Wide Polymorphisms and Copy Number Variations of Pharmacogenes in Koreans

Journal of Personalized Medicine ◽

10.3390/jpm11010033 ◽

2021 ◽

Vol 11 (1) ◽

pp. 33

Author(s):

Nayoung Han ◽

Jung Mi Oh ◽

In-Wha Kim

Keyword(s):

Copy Number ◽

Genome Wide Association Study ◽

Copy Number Gain ◽

Copy Number Variations ◽

Gene Gain ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Haplotype Blocks ◽

Genome Wide ◽

Control And Prevention

Predicting phenotypes and practicing precision medicine require combined analysis of single nucleotide variant (SNV) genotyping and copy number variations (CNVs). The aim of this study was to discover SNVs or common CNVs and examine the combined frequencies of SNVs and CNVs in pharmacogenes using the Korean genome and epidemiology study (KoGES), a consortium project. The genotypes (N = 72,299) and CNV data (N = 1000) were provided by the Korean National Institute of Health, Korea Centers for Disease Control and Prevention. The allele frequencies of SNVs, CNVs, and combined SNVs with CNVs were calculated and haplotype analysis was performed. CYP2D6 rs1065852 (c.100C>T, p.P34S) was the most common variant allele (48.23%). A total of 8454 haplotype blocks in 18 pharmacogenes were estimated. DMD ranked the highest in frequency for gene gain (64.52%), while TPMT ranked the highest in frequency for gene loss (51.80%). Copy number gain of CYP4F2 was observed in 22 subjects; 13 of those subjects were carriers with CYP4F2*3 gain. In the case of TPMT, approximately one-half of the participants (N = 308) had loss of the TPMT*1*1 diplotype. The frequencies of SNVs and CNVs in pharmacogenes were determined using the Korean cohort-based genome-wide association study.
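As a reminder of how a variant allele frequency such as the one reported for rs1065852 is derived, the sketch below computes it from genotype counts at a biallelic autosomal site. The counts are invented and are not KoGES data, and the function name is illustrative only.

```python
def variant_allele_frequency(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """Variant allele frequency at a biallelic autosomal site.

    Each subject carries two alleles: heterozygotes contribute one
    variant allele and variant homozygotes contribute two.
    """
    n_subjects = n_ref_hom + n_het + n_alt_hom
    if n_subjects == 0:
        raise ValueError("no genotyped subjects")
    return (n_het + 2 * n_alt_hom) / (2 * n_subjects)


# Hypothetical genotype counts for 1000 subjects.
freq = variant_allele_frequency(n_ref_hom=300, n_het=500, n_alt_hom=200)
print(f"variant allele frequency = {freq:.2%}")
```

This simple count assumes two gene copies per subject. When a copy number gain or loss is present, the denominator changes, which is one reason the study reports combined SNV and CNV frequencies.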
