Characterising the loss-of-function impact of 5’ untranslated region variants in whole genome sequence data from 15,708 individuals

Mapping Intimacies ◽

10.1101/543504 ◽

2019 ◽

Cited By ~ 6

Author(s):

Nicola Whiffin ◽

Konrad J Karczewski ◽

Xiaolei Zhang ◽

Sonia Chothani ◽

Miriam J Smith ◽

...

Keyword(s):

Human Disease ◽

Large Scale ◽

Sequence Data ◽

Case Reports ◽

Protein Translation ◽

Whole Genome Sequence ◽

Whole Genome ◽

Loss Of Function ◽

Genetic Sequencing ◽

Strong Negative Selection

Abstract: Upstream open reading frames (uORFs) are important tissue-specific cis-regulators of protein translation. Although isolated case reports have shown that variants that create or disrupt uORFs can cause disease, genetic sequencing approaches typically focus on protein-coding regions and ignore these variants. Here, we describe a systematic genome-wide study of variants that create and disrupt human uORFs, and explore their role in human disease using 15,708 whole genome sequences collected by the Genome Aggregation Database (gnomAD) project. We show that 14,897 variants that create new start codons upstream of the canonical coding sequence (CDS), and 2,406 variants disrupting the stop site of existing uORFs, are under strong negative selection. Furthermore, variants creating uORFs that overlap the CDS show signals of selection equivalent to coding loss-of-function variants, and uORF-perturbing variants are under strong selection when arising upstream of known disease genes and genes intolerant to loss-of-function variants. Finally, we identify specific genes where perturbation of uORFs is likely to represent an important disease mechanism, and report a novel uORF frameshift variant upstream of NF2 in families with neurofibromatosis. Our results highlight uORF-perturbing variants as an important and under-recognised functional class that can contribute to penetrant human disease, and demonstrate the power of large-scale population sequencing data to study the deleteriousness of specific classes of non-coding variants.
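
As an illustration of the variant class described in this abstract, the following is a minimal sketch of how single-nucleotide variants that create a new upstream ATG could be enumerated in a 5′ UTR sequence. It is not the authors' pipeline: the function name `uaug_creating_snvs`, the 0-based coordinates, and the assumption that the input string ends immediately before the canonical start codon are all choices made for this example.

```python
from typing import List, Tuple

BASES = "ACGT"


def uaug_creating_snvs(utr5: str) -> List[Tuple[int, str, str, int, bool]]:
    """List SNVs that create a new ATG in a 5'UTR (hypothetical helper).

    Returns tuples of (variant_pos, ref, alt, atg_start, in_frame_with_cds).
    Positions are 0-based indices into `utr5`. The frame flag ASSUMES that
    `utr5` ends immediately before the canonical CDS start codon.
    """
    utr5 = utr5.upper()
    n = len(utr5)
    hits = []
    for pos, ref in enumerate(utr5):
        for alt in BASES:
            if alt == ref:
                continue
            mutated = utr5[:pos] + alt + utr5[pos + 1:]
            # Only the (up to three) codon windows overlapping `pos` can change.
            for start in range(max(0, pos - 2), min(pos, n - 3) + 1):
                created = mutated[start:start + 3] == "ATG"
                existed = utr5[start:start + 3] == "ATG"
                if created and not existed:
                    in_frame = (n - start) % 3 == 0
                    hits.append((pos, ref, alt, start, in_frame))
                    break
    return hits


if __name__ == "__main__":
    # Toy sequence: changing C to T at index 3 turns "ACG" into "ATG".
    for hit in uaug_creating_snvs("CCACGTT"):
        print(hit)
```

A real analysis would also need transcript annotation, strand handling, and the downstream stop codon to decide whether the new uORF overlaps the CDS; none of that is modelled here.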


MEN1 Mutations in Hürthle Cell (Oncocytic) Thyroid Carcinoma

The Journal of Clinical Endocrinology & Metabolism ◽

10.1210/jc.2014-3622 ◽

2015 ◽

Vol 100 (4) ◽

pp. E611-E615 ◽

Cited By ~ 11

Author(s):

Katayoon Kasaian ◽

Ana-Maria Chindris ◽

Sam M. Wiseman ◽

Karen L. Mungall ◽

Thomas Zeng ◽

...

Keyword(s):

Thyroid Carcinoma ◽

Sequence Data ◽

Whole Genome Sequence ◽

Regulation Of Transcription ◽

Whole Genome ◽

Loss Of Function ◽

Transcription Control ◽

Thyroid Carcinomas ◽

Normal Tissues ◽

Hürthle Cell

Context and Objective: Oncocytic thyroid carcinoma, also known as Hürthle cell thyroid carcinoma, accounts for only a small percentage of all thyroid cancers. However, this malignancy often presents at an advanced stage and poses unique challenges to patients and clinicians. Surgical resection of the tumor accompanied in some cases by radioactive iodine treatment, radiation, and chemotherapy are the established modes of therapy. Knowledge of the perturbed oncogenic pathways can provide better understanding of the mechanism of disease and thus opportunities for more effective clinical management. Design and Patients: Initially, two oncocytic thyroid carcinomas and their matched normal tissues were profiled using whole genome sequencing. Subsequently, 72 oncocytic thyroid carcinomas, one cell line, and five Hürthle cell adenomas were examined by targeted sequencing for the presence of mutations in the multiple endocrine neoplasia I (MEN1) gene. Results: Here we report the identification of MEN1 loss-of-function mutations in 4% of patients diagnosed with oncocytic thyroid carcinoma. Whole genome sequence data also revealed large regions of copy number variation encompassing nearly the entire genomes of these tumors. Conclusion: Menin, a ubiquitously expressed nuclear protein, is a well-characterized tumor suppressor whose loss is the cause of MEN1 syndrome. Menin is involved in several major cellular pathways such as regulation of transcription, control of cell cycle, apoptosis, and DNA damage repair pathways. Mutations of this gene in a subset of Hürthle cell tumors point to a potential role for this protein and its associated pathways in thyroid tumorigenesis.


Finding functional disease-associated non-coding variation using next-generation sequencing

10.1101/060285 ◽

2016 ◽

Author(s):

Paolo Devanna ◽

Xiaowei Sylvia Chen ◽

Joses Ho ◽

Dario Gajewski ◽

Alessandro Gialluisi ◽

...

Keyword(s):

Next Generation Sequencing ◽

Binding Sites ◽

Large Scale ◽

Sequence Data ◽

Whole Genome Sequence ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Next Generation ◽

Whole Exome ◽

Generation Sequencing

Abstract: Next generation sequencing has opened the way for the large scale interrogation of cohorts at the whole exome or whole genome level. Currently, the field largely focuses on potential disease-causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein-coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTR region of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large scale analysis, as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language-impaired children. Using whole exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to specific language impairment (SLI). This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome. We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.
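
The in silico prediction mentioned in this abstract rests on complementary base pairing between a microRNA and its 3′UTR target. The sketch below illustrates that idea under simplifying assumptions: it treats a site as an exact 7-mer match to the reverse complement of microRNA nucleotides 2 to 8 (a common seed definition) and calls a variant disruptive if it removes such a match. The function names and the toy UTR are hypothetical, and this is not the pipeline used in the paper.

```python
from typing import List

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(_COMPLEMENT[b] for b in reversed(seq))


def seed_sites(utr3: str, mirna: str) -> List[int]:
    """0-based starts of 7-mer seed matches in a 3'UTR.

    ASSUMPTION: a site is an exact match to the reverse complement of
    microRNA nucleotides 2-8; real predictors use richer rules.
    """
    mirna_dna = mirna.upper().replace("U", "T")
    target = revcomp(mirna_dna[1:8])
    utr3 = utr3.upper().replace("U", "T")
    return [i for i in range(len(utr3) - 6) if utr3[i:i + 7] == target]


def variant_disrupts_site(utr3: str, mirna: str, pos: int, alt: str) -> bool:
    """True if substituting `alt` at 0-based `pos` removes a seed match."""
    before = set(seed_sites(utr3, mirna))
    mutated = utr3[:pos] + alt + utr3[pos + 1:]
    after = set(seed_sites(mutated, mirna))
    return bool(before - after)


if __name__ == "__main__":
    mirna = "UGAGGUAGUAGGUUGUAUAGUU"   # example microRNA sequence
    utr = "AAACTACCTCAAA"              # toy 3'UTR with one seed match
    print(seed_sites(utr, mirna))
    print(variant_disrupts_site(utr, mirna, 5, "G"))
```

Exact seed matching is only a first-pass filter; candidate variants would still need functional testing, as the abstract describes.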


Soybean Haplotype Map (GmHapMap): A Universal Resource for Soybean Translational and Functional Genomics

10.1101/534578 ◽

2019 ◽

Cited By ~ 6

Author(s):

Davoud Torkamaneh ◽

Jérôme Laroche ◽

Babu Valliyodan ◽

Louise O’Donoughue ◽

Elroy Cober ◽

...

Keyword(s):

Glycine Max ◽

Gene Function ◽

Oil Content ◽

Seed Oil ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Loss Of Function ◽

Seed Oil Content ◽

Extensive Coverage

Abstract: Here we describe the first worldwide haplotype map for soybean (GmHapMap) constructed using whole-genome sequence data for 1,007 Glycine max accessions and yielding 15 million variants. The number of unique haplotypes plateaued within this collection (4.3 million tag SNPs) suggesting extensive coverage of diversity within the cultivated germplasm. We imputed GmHapMap variants onto 21,618 previously genotyped (50K array/210K GBS) accessions with up to 96% success for common alleles. A GWAS performed with imputed data enabled us to identify a causal SNP residing in the NPC1 gene and to demonstrate its role in controlling seed oil content. We identified 405,101 haplotypes for the 55,589 genes and show that such haplotypes can help define alleles. Finally, we predicted 18,031 putative loss-of-function (LOF) mutations in 10,662 genes and illustrate how such a resource can be used to explore gene function. The GmHapMap provides a unique worldwide resource for soybean genomics and breeding.


DeepVariant-on-Spark: Small-Scale Genome Analysis Using a Cloud-Based Computing Framework

Computational and Mathematical Methods in Medicine ◽

10.1155/2020/7231205 ◽

2020 ◽

Vol 2020 ◽

pp. 1-7

Author(s):

Po-Jung Huang ◽

Jui-Huan Chang ◽

Hou-Hsien Lin ◽

Yu-Xuan Li ◽

Chi-Ching Lee ◽

...

Keyword(s):

Genome Analysis ◽

Genetic Variants ◽

Large Scale ◽

Sequence Data ◽

Classification Model ◽

Whole Genome Sequence ◽

Small Scale ◽

Whole Genome ◽

Gold Standard Method ◽

Computing Framework

Although sequencing a human genome has become affordable, identifying genetic variants from whole-genome sequence data is still a hurdle for researchers without adequate computing equipment or bioinformatics support. GATK is a gold standard method for the identification of genetic variants and has been widely used in genome projects and population genetic studies for many years. This was the case until the Google Brain team developed a new method, DeepVariant, which utilizes deep neural networks to construct an image classification model to identify genetic variants. However, the superior accuracy of DeepVariant comes at the cost of computational intensity, largely constraining its applications. Accordingly, we present DeepVariant-on-Spark to optimize resource allocation, enable multi-GPU support, and accelerate the processing of the DeepVariant pipeline. To make DeepVariant-on-Spark more accessible to everyone, we have deployed it to the Google Cloud Platform (GCP). Users can deploy DeepVariant-on-Spark on the GCP within 20 minutes by following our instructions and start to analyze at least ten whole-genome sequencing datasets using free credits provided by the GCP. DeepVariant-on-Spark is freely available for small-scale genome analysis using a cloud-based computing framework, which is suitable for pilot testing or preliminary study, while reserving the flexibility and scalability for large-scale sequencing projects.


Faculty Opinions recommendation of Optimal algorithms for haplotype assembly from whole-genome sequence data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13339986.14707085 ◽

2011 ◽

Author(s):

Alejandro Schaffer

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Optimal Algorithms ◽

Genome Sequence Data ◽

Haplotype Assembly


TIGER: inferring DNA replication timing from whole-genome sequence data

Bioinformatics ◽

10.1093/bioinformatics/btab166 ◽

2021 ◽

Cited By ~ 1

Author(s):

Amnon Koren ◽

Dashiell J Massey ◽

Alexa N Bracci

Keyword(s):

Dna Replication ◽

Genome Sequence ◽

Genomic Dna ◽

Sequence Data ◽

Replication Timing ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Genome Sequence Data ◽

Dna Replication Timing

Motivation: Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results: We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation: TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information: Supplementary data are available at Bioinformatics online.
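
The core idea in this abstract, that earlier-replicating loci are over-represented in reads from proliferating cells, can be illustrated with a toy calculation. The sketch below is not TIGER's implementation: it assumes read counts have already been tallied in equal-size windows along one chromosome, and it simply z-scores and smooths them. The function name `replication_profile` and the `smooth` parameter are hypothetical, and none of the corrections for other sources of coverage variation that the abstract mentions are attempted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def replication_profile(counts: Sequence[float], smooth: int = 3) -> List[float]:
    """Toy replication-timing profile from per-window read counts.

    Counts are z-scored, then smoothed with a centred moving average of
    width `smooth` (truncated at the chromosome ends). Higher values mean
    relatively more DNA copies, read here as earlier replication.
    """
    mu = mean(counts)
    sd = pstdev(counts)
    if sd == 0:
        # Flat coverage carries no timing signal.
        return [0.0] * len(counts)
    z = [(c - mu) / sd for c in counts]
    half = smooth // 2
    return [mean(z[max(0, i - half): i + half + 1]) for i in range(len(z))]


if __name__ == "__main__":
    # Coverage falling along the chromosome suggests progressively later
    # replication under the assumptions above.
    print(replication_profile([120, 110, 100, 90, 80]))
```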


Whole genome sequence data of Bacillus australimaris strain B28A, isolated from Marine Water in India

Data in Brief ◽

10.1016/j.dib.2021.107240 ◽

2021 ◽

pp. 107240

Author(s):

Wael Ali Mohammed Hadi ◽

Boby T Edwin ◽

A Jayakumaran Nair

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Marine Water ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data


Whole genome sequence data of Mycobacterium tuberculosis XDR strain, isolated from patient in Kazakhstan

Data in Brief ◽

10.1016/j.dib.2020.106416 ◽

2020 ◽

Vol 33 ◽

pp. 106416

Author(s):

Asset Daniyarov ◽

Askhat Molkenov ◽

Saule Rakhimova ◽

Ainur Akhmetova ◽

Zhannur Nurkina ◽

...

Keyword(s):

Mycobacterium Tuberculosis ◽

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data


Whole-genome sequence data suggests environmental adaptation of Ethiopian sheep populations

Genome Biology and Evolution ◽

10.1093/gbe/evab014 ◽

2021 ◽

Author(s):

Pamela Wiener ◽

Christelle Robert ◽

Abulgasim Ahbara ◽

Mazdak Salavati ◽

Ayele Abebe ◽

...

Keyword(s):

High Altitude ◽

Environmental Variables ◽

Large Scale ◽

Sequence Data ◽

Strong Association ◽

Environmental Adaptation ◽

Whole Genome Sequence ◽

Single Nucleotide Variants ◽

High Altitude Adaptation ◽

Altitude Adaptation

Abstract: Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation, although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.


A Phylogenomic Supertree of Birds

Diversity ◽

10.3390/d11070109 ◽

2019 ◽

Vol 11 (7) ◽

pp. 109 ◽

Cited By ~ 17

Author(s):

Rebecca T. Kimball ◽

Carl H. Oliveros ◽

Ning Wang ◽

Noor D. White ◽

F. Keith Barker ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bird Species ◽

Divide And Conquer ◽

Clear Understanding ◽

Whole Genome ◽

Efficient Manner ◽

Sequence Capture ◽

Branch Lengths ◽

Supertree Methods

It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.
