scholarly journals DeepPVP: phenotype-based prioritization of causative variants using deep learning

2018 ◽  
Author(s):  
Imane Boudellioua ◽  
Maxat Kulmanov ◽  
Paul N Schofield ◽  
Georgios V Gkoutos ◽  
Robert Hoehndorf

AbstractBackgroundPrioritization of variants in personal genomic data is a major challenge. Recently, computational methods that rely on comparing phenotype similarity have shown to be useful to identify causative variants. In these methods, pathogenicity prediction is combined with a semantic similarity measure to prioritize not only variants that are likely to be dysfunctional but those that are likely involved in the pathogenesis of a patient’s phenotype.ResultsWe have developed DeepPVP, a variant prioritization method that combined automated inference with deep neural networks to identify the likely causative variants in whole exome or whole genome sequence data. We demonstrate that DeepPVP performs significantly better than existing methods, including phenotype-based methods that use similar features. DeepPVP is freely available at https://github.com/bio-ontology-research-group/phenomenet-vp.ConclusionsDeepPVP further improves on existing variant prioritization methods both in terms of speed as well as accuracy.

2018 ◽  
Author(s):  
Alfredo Iacoangeli ◽  
Ahmad Al Khleifat ◽  
William Sproviero ◽  
Aleksey Shatunov ◽  
Ashley R Jones ◽  
...  

AbstractAmyotrophic lateral sclerosis (ALS, MND) is a neurodegenerative disease of upper and lower motor neurons resulting in death from neuromuscular respiratory failure, typically within two years of first symptoms. Genetic factors are an important cause of ALS, with variants in more than 25 genes having strong evidence, and weaker evidence available for variants in more than 120 genes. With the increasing availability of Next-Generation sequencing data, non-specialists, including health care professionals and patients, are obtaining their genomic information without a corresponding ability to analyse and interpret it. Furthermore, the relevance of novel or existing variants in ALS genes is not always apparent. Here we present ALSgeneScanner, a tool that is easy to install and use, able to provide an automatic, detailed, annotated report, on a list of ALS genes from whole genome sequence data in a few hours and whole exome sequence data in about one hour on a readily available mid-range computer. This will be of value to non-specialists and aid in the interpretation of the relevance of novel and existing variants identified in DNA sequencing data.


2016 ◽  
Author(s):  
Paolo Devanna ◽  
Xiaowei Sylvia Chen ◽  
Joses Ho ◽  
Dario Gajewski ◽  
Alessandro Gialluisi ◽  
...  

ABSTRACTNext generation sequencing has opened the way for the large scale interrogation of cohorts at the whole exome, or whole genome level. Currently, the field largely focuses on potential disease causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA; the 3′UTRome. Variants in the 3'UTR region of genes are of particular interest because 3'UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore they are amenable to large scale analysis as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3'UTRome and demonstrate the efficacy of this pipeline in a cohort of language impaired children. Using whole exome sequence data from 43 probands, we extracted variants that lay within 3'UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to SLI. This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome. We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.


Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Alana R. Rodney ◽  
Reuben M. Buckley ◽  
Robert S. Fulton ◽  
Catrina Fronick ◽  
Todd Richmond ◽  
...  

AbstractOver 94 million domestic cats are susceptible to cancers and other common and rare diseases. Whole exome sequencing (WES) is a proven strategy to study these disease-causing variants. Presented is a 35.7 Mb exome capture design based on the annotated Felis_catus_9.0 genome assembly, covering 201,683 regions of the cat genome. Whole exome sequencing was conducted on 41 cats with known and unknown genetic diseases and traits, of which ten cats had matching whole genome sequence (WGS) data available, used to validate WES performance. At 80 × mean exome depth of coverage, 96.4% of on-target base coverage had a sequencing depth > 20-fold, while over 98% of single nucleotide variants (SNVs) identified by WGS were also identified by WES. Platform-specific SNVs were restricted to sex chromosomes and a small number of olfactory receptor genes. Within the 41 cats, we identified 31 previously known causal variants and discovered new gene candidate variants, including novel missense variance for polycystic kidney disease and atrichia in the Peterbald cat. These results show the utility of WES to identify novel gene candidate alleles for diseases and traits for the first time in a feline model.


Data in Brief ◽  
2020 ◽  
Vol 33 ◽  
pp. 106416
Author(s):  
Asset Daniyarov ◽  
Askhat Molkenov ◽  
Saule Rakhimova ◽  
Ainur Akhmetova ◽  
Zhannur Nurkina ◽  
...  

Author(s):  
Pamela Wiener ◽  
Christelle Robert ◽  
Abulgasim Ahbara ◽  
Mazdak Salavati ◽  
Ayele Abebe ◽  
...  

Abstract Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures, are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.


2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Lynsey K. Whitacre ◽  
Jesse L. Hoff ◽  
Robert D. Schnabel ◽  
Sara Albarella ◽  
Francesca Ciotola ◽  
...  

Author(s):  
Viola Kurm ◽  
Ilse Houwers ◽  
Claudia E. Coipan ◽  
Peter Bonants ◽  
Cees Waalwijk ◽  
...  

AbstractIdentification and classification of members of the Ralstonia solanacearum species complex (RSSC) is challenging due to the heterogeneity of this complex. Whole genome sequence data of 225 strains were used to classify strains based on average nucleotide identity (ANI) and multilocus sequence analysis (MLSA). Based on the ANI score (>95%), 191 out of 192(99.5%) RSSC strains could be grouped into the three species R. solanacearum, R. pseudosolanacearum, and R. syzygii, and into the four phylotypes within the RSSC (I,II, III, and IV). R. solanacearum phylotype II could be split in two groups (IIA and IIB), from which IIB clustered in three subgroups (IIBa, IIBb and IIBc). This division by ANI was in accordance with MLSA. The IIB subgroups found by ANI and MLSA also differed in the number of SNPs in the primer and probe sites of various assays. An in-silico analysis of eight TaqMan and 11 conventional PCR assays was performed using the whole genome sequences. Based on this analysis several cases of potential false positives or false negatives can be expected upon the use of these assays for their intended target organisms. Two TaqMan assays and two PCR assays targeting the 16S rDNA sequence should be able to detect all phylotypes of the RSSC. We conclude that the increasing availability of whole genome sequences is not only useful for classification of strains, but also shows potential for selection and evaluation of clade specific nucleic acid-based amplification methods within the RSSC.


Sign in / Sign up

Export Citation Format

Share Document