Loss of critical developmental and human disease-causing genes in 58 mammals

2019. DOI: 10.1101/819169

Author(s): Yatish Turakhia, Heidi I. Chen, Amir Marcovitz, Gill Bejerano

Keyword(s): Evolutionary Biology, Large Scale, Gene Annotation, Synonymous Substitution, Specific Gene, High Confidence, Protein Coding, Congenital Diseases, Manual Curation, Human Genes

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools and protein databases focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (deletion and non-synonymous substitution) as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence protein-coding gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using the hg38 human assembly as a reference, we discovered over 500 unique human genes affected by such high-confidence erosion events in different clades across 58 mammals. While most of these events likely have benign consequences, we also found dozens of clade-specific gene losses that result in early lethality in outgroup mammals or are associated with severe congenital diseases in humans. Our discoveries yield intriguing potential for translational medical genetics and for evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.


A fully-automated method discovers loss of mouse-lethal and human-monogenic disease genes in 58 mammals

Nucleic Acids Research, 2020, Vol 48 (16), pp. e91-e91. DOI: 10.1093/nar/gkaa550

Author(s): Yatish Turakhia, Heidi I Chen, Amir Marcovitz, Gill Bejerano

Keyword(s): Evolutionary Biology, Large Scale, Gene Annotation, Monogenic Disease, Disease Genes, Congenital Diseases, Manual Curation, Automated Method, Human Ortholog, Early Mouse

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (amino acid deletions and substitutions) and sister species support as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using human as reference, we discovered over 400 unique human ortholog erosion events across 58 mammals. This includes dozens of clade-specific losses of genes that result in early mouse lethality or are associated with severe human congenital diseases. Our discoveries yield intriguing potential for translational medical genetics and evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.
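The loss signature described in this abstract (heavy amino acid deletion and substitution, confirmed in a sister species) can be illustrated with a toy sketch. This is not the authors' pipeline: the function names, the pairwise-alignment input format, and the 0.5 cutoff are all assumptions made here for illustration.

```python
from typing import Tuple

AlignedPair = Tuple[str, str]  # (reference row, query row) of a pairwise protein alignment

# Hypothetical cutoff chosen for illustration only; not a value from the paper.
EROSION_THRESHOLD = 0.5


def erosion_fraction(ref_aln: str, query_aln: str) -> float:
    """Fraction of reference residues that are deleted ('-') or substituted in the query."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")
    ref_residues = 0
    eroded = 0
    for ref_char, query_char in zip(ref_aln, query_aln):
        if ref_char == "-":  # query-only insertion: not a reference residue
            continue
        ref_residues += 1
        if query_char != ref_char:  # gap (deletion) or a different amino acid
            eroded += 1
    return eroded / ref_residues if ref_residues else 0.0


def is_supported_loss(species: AlignedPair, sister: AlignedPair,
                      threshold: float = EROSION_THRESHOLD) -> bool:
    """Call a loss only if both the species and its sister species look eroded."""
    return (erosion_fraction(*species) >= threshold
            and erosion_fraction(*sister) >= threshold)


reference = "MKTAYIAK"  # made-up peptide, not a real gene
print(erosion_fraction(reference, "MKTAYIAK"))
print(is_supported_loss((reference, "M---QIAK"), (reference, "M--AQIAR")))
```

Requiring agreement from a sister species is what guards against calling a loss from a single poor assembly; a real implementation would also need orthology mapping and assembly-quality filters that this sketch omits.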


Natural Selection at an Exceptionally Long GGC Repeat in the Human RASGEF1C and Divergent Genotypes in Late-onset Neurocognitive Disorder

2021. DOI: 10.21203/rs.3.rs-517583/v1

Author(s): Z Jafarian, S Khamse, H Afshar, Khorram Khorshid HR, A Delbari, ...

Keyword(s): Natural Selection, Evolutionary Biology, Late Onset, Core Promoter, Human Subjects, Specific Gene, Protein Coding, Repeat Allele, Complex Disorders, Selection For

Across the human protein-coding genes, the neuron-specific gene RASGEF1C contains the longest (GGC)-repeat, spanning its core promoter and 5′ untranslated region (RASGEF1C-201 ENST00000361132.9). RASGEF1C expression dysregulation occurs in late-onset neurocognitive disorders (NCDs), such as Alzheimer’s disease. Here we sequenced the GGC-repeat in a sample of human subjects (N = 269), consisting of late-onset NCD patients (N = 115) and controls (N = 154). We also studied the status of this STR across vertebrates. The 6-repeat allele was the predominant allele in both the controls (frequency = 0.85) and the NCD patients (frequency = 0.78). The NCD genotype compartment contained an excess of genotypes that lacked the 6-repeat (Mid-P exact = 0.004). We also detected divergent genotypes that were present in five NCD patients and not in the controls (Mid-P exact = 0.007). This STR expanded beyond 2-repeats specifically in primates, and was at maximum length in humans. We conclude that there is natural selection for the 6-repeat allele of the RASGEF1C (GGC)-repeat in humans, and significant divergence from that allele in late-onset NCDs. Indication of natural selection for predominantly abundant STR alleles and divergent genotypes enhances the perspective of evolutionary biology and disease pathogenesis in human complex disorders.


Development of predictive models to distinguish metals from non-metal toxicants, and individual metal from one another

BMC Bioinformatics, 2020, Vol 21 (S9). DOI: 10.1186/s12859-020-3525-7

Author(s): Zongtao Yu, Yuanyuan Fu, Junmei Ai, Jicai Zhang, Gang Huang, ...

Keyword(s): Large Scale, Gene Marker, Specific Gene, Protein Coding, Gene Markers, Independent Dataset, Individual Metal, High Prediction, Toxic Mechanisms, Metal Contaminant

Background: Evaluating the toxicity of chemical mixtures and their possible mechanisms of action is still a challenge for humans and other organisms. Microarray classifier analysis has shown promise in toxicogenomics by identifying biomarkers to predict unknown samples. Our study focuses on identifying gene markers with better sensitivity and specificity, building predictive models to distinguish metals from non-metal toxicants and individual metals from one another, and helping to understand the underlying toxic mechanisms. Results: Based on an independent dataset test, using only 15 gene markers, we were able to distinguish metals from non-metal toxicants with 100% accuracy. Of these, 6 and 9 genes were commonly down- and up-regulated, respectively, by most of the metals. Eight of the 15 genes are membrane protein-coding genes. Functionally well-annotated genes in the list include ADORA2B, ARNT, S100G, and DIO3. Also, a 10-gene marker list was identified that can discriminate individual metals from one another with 100% accuracy. We could find a specific gene marker for each metal in the 10-gene marker list. Functionally well-annotated genes in this list include GSTM2, HSD11B, AREG, and C8B. Conclusions: Our findings suggest that, using microarray classifier analysis, we can not only create diagnostic classifiers that predict an exact metal contaminant from a large pool of contaminants with high accuracy, but also identify valuable biomarkers that help explain the common and underlying toxic mechanisms induced by metals.


Large Scale Profiling of Protein Isoforms Using Label-Free Quantitative Proteomics Revealed the Regulation of Nonsense-Mediated Decay in Moso Bamboo (Phyllostachys edulis)

Cells, 2019, Vol 8 (7), pp. 744. DOI: 10.3390/cells8070744

Cited By ~ 5

Author(s): Xiaolan Yu, Yongsheng Wang, Markus V. Kohnen, Mingxin Piao, Min Tu, ...

Keyword(s): Mass Spectrometry, Quantitative Proteomics, Large Scale, Gene Annotation, Function Analysis, Moso Bamboo, Protein Isoforms, Label Free, Total Proteins, Protein Coding

Moso bamboo is an important forest species with a variety of ecological, economic, and cultural values. However, the gene annotation of moso bamboo is based only on transcriptome sequencing and lacks proteomic evidence. Lignification and fiber in moso bamboo make it difficult to extract protein using conventional methods, which seriously hinders proteomics research on this species. The purpose of this study was to establish efficient methods for extracting total proteins from moso bamboo for subsequent mass spectrometry-based quantitative proteome identification. Here, we have established a set of efficient methods for extracting total proteins from moso bamboo, followed by mass spectrometry-based label-free quantitative proteome identification, which further improved the protein annotation of moso bamboo genes. In this study, 10,376 predicted coding genes were confirmed by quantitative proteomics, accounting for 35.8% of all annotated protein-coding genes. Proteome analysis also revealed the protein-coding potential of 1015 predicted long noncoding RNAs (lncRNAs), accounting for 51.03% of annotated lncRNAs. Thus, mass spectrometry-based proteomics provides a reliable method for gene annotation. In particular, quantitative proteomics revealed the translation patterns of proteins in moso bamboo. In addition, 3284 transcript isoforms from 2663 genes identified by Pacific BioSciences (PacBio) single-molecule real-time long-read isoform sequencing (Iso-Seq) were confirmed at the protein level by mass spectrometry. Furthermore, domain analysis of mass spectrometry-identified proteins encoded in the same genomic locus revealed variations in domain composition, pointing towards a functional diversification of protein isoforms. Finally, we found that some transcripts targeted by nonsense-mediated mRNA decay (NMD) could also be translated into proteins.
In summary, the proteomic analysis in this study improves the proteomics-assisted genome annotation of moso bamboo, is valuable to large-scale functional genomics research in moso bamboo, and provides a theoretical basis and technical support for directional gene function analysis at the proteomics level.


Double triage to identify poorly annotated genes in maize: The missing link in community curation

2019. DOI: 10.1101/654848

Author(s): Marcela K. Tello-Ruiz, Cristina F. Marco, Fei-Man Hsu, Rajdeep S. Khangura, Pengfei Qiao, ...

Keyword(s): Gene Annotation, Gene Prediction, Gene Tree, Quality Metrics, Maize Genome, Protein Coding, Transcript Model, Manual Curation, P Gene, Community Curation

The sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set with models of low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and a complementary method using the Gramene gene tree visualizer. All of the triaged models had significant errors, including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100% and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.


RecBlast: Cloud-Based Large Scale Orthology Detection

2017. DOI: 10.1101/112946

Author(s): Efrat Rapoport, Moran Neuhof

Keyword(s): Comparative Genomics, Evolutionary Biology, Large Scale, Valuable Insight, Multiple Genes, Human Genes, Amazon Web Services, Low Efficiency, Downstream Analysis, Reciprocal Blast

Background: The effective detection and comparison of orthologues is crucial for answering many questions in comparative genomics, phylogenetics and evolutionary biology. One of the most common methods for discovering orthologues is widely known as ‘Reciprocal Blast’. While this method is simple when comparing only two genomes, performing a large-scale comparison of multiple genes across multiple taxa becomes a labor-intensive and inefficient task. The low efficiency of this complicated process limits the scope and breadth of questions that would otherwise benefit from this powerful method. Findings: Here we present RecBlast, an intuitive and easy-to-use pipeline that enables fast and easy discovery of orthologues along and across the evolutionary tree. RecBlast is capable of running heavy, large-scale and complex Reciprocal Blast comparisons across multiple genes and multiple taxa, in a completely automatic way. RecBlast is available as a cloud-based web server, which includes an easy-to-use user interface, implemented using cloud computing and an elastic and scalable server architecture. RecBlast is also available as powerful standalone software supporting multi-processing for large datasets, and as a cloud image which can be easily deployed on the Amazon Web Services cloud. We also include sample results spanning 448 human genes, which illustrate the potential of RecBlast in detecting orthologues and in highlighting patterns and trends across multiple taxa. Conclusions: RecBlast provides fast, inexpensive and valuable insight into trends and phenomena across distant phyla, and provides data, visualizations and directions for downstream analysis. RecBlast's fully automatic pipeline provides a new and intuitive discovery platform for researchers from any domain in biology who are interested in evolution, comparative genomics and phylogenetics, regardless of their computational skills.
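The general reciprocal-best-hit idea behind ‘Reciprocal Blast’ can be sketched in a few lines. This is a generic illustration, not RecBlast's implementation: it assumes similarity scores (for example BLAST bit scores) have already been parsed into nested dictionaries, and all gene identifiers and scores below are made up.

```python
from typing import Dict, List, Tuple

# query ID -> {subject ID: similarity score}; the structure is an assumption of this sketch
Hits = Dict[str, Dict[str, float]]


def best_hits(hits: Hits) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    return {query: max(subjects, key=subjects.get)
            for query, subjects in hits.items() if subjects}


def reciprocal_best_hits(a_vs_b: Hits, b_vs_a: Hits) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores for two hypothetical genomes A and B.
a_vs_b = {"geneA": {"m1": 200.0, "m2": 150.0}, "geneB": {"m2": 90.0, "m3": 80.0}}
b_vs_a = {"m1": {"geneA": 210.0}, "m2": {"geneA": 160.0, "geneB": 95.0}}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here `geneB` is dropped because its best hit `m2` points back to `geneA` rather than to `geneB`, which is exactly the asymmetry the reciprocal step is meant to filter out.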


Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise

2018. DOI: 10.1101/332825

Cited By ~ 22

Author(s): Mihaela Pertea, Alaina Shumate, Geo Pertea, Ales Varabyou, Yu-Chi Chang, ...

Keyword(s): RNA Sequencing, Large Scale, Human Gene, Splice Variants, Gene List, Transcriptional Noise, Protein Coding, Human Genes, Gene Database, Per Gene

We assembled the sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the GTEx project, to create a new, comprehensive catalog of human genes and transcripts. The new human gene database contains 43,162 genes, of which 21,306 are protein-coding and 21,856 are noncoding, and a total of 323,824 transcripts, for an average of 7.5 transcripts per gene. Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs. We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.
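The counts quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only the figures stated above; the variable names are illustrative.

```python
# Figures as quoted in the abstract.
protein_coding_genes = 21_306
noncoding_genes = 21_856
total_genes = 43_162
total_transcripts = 323_824

# The two gene classes should add up to the stated total.
gene_sum = protein_coding_genes + noncoding_genes

# Average number of transcripts per gene.
transcripts_per_gene = total_transcripts / total_genes

# Breakdown of the novel genes as quoted (coding + noncoding).
novel_breakdown_sum = 1_178 + 3_819

print(gene_sum == total_genes)
print(round(transcripts_per_gene, 1))
print(novel_breakdown_sum)
```

The gene total and the 7.5 transcripts-per-gene average are internally consistent. The novel-gene breakdown sums to 4,997, one short of the stated 4,998; the abstract does not explain the difference.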


Construction of Disease-specific Cytokine Profiles by Associating Disease Genes with Immune Responses

2021. DOI: 10.1101/2021.09.10.459816

Author(s): Tianyun Liu, Shiyin Wang, Russ B Altman

Keyword(s): Immune System, Immune Responses, Inflammatory Responses, P Value, Specific Gene, High Confidence, Cytokine Profiles, Gene Sets, Human Genes, Disease Specific

The pathogenesis of many inflammatory diseases is a coordinated process involving metabolic dysfunctions and immune response, usually modulated by the production of cytokines and associated inflammatory molecules. In this work, we seek to understand how genes involved in pathogenesis, which are often not associated with the immune system in an obvious way, communicate with the immune system. We have embedded a network of human protein-protein interactions (PPI) from the STRING database with 14,707 human genes using feature learning that captures high confidence edges. We have found that our predicted Association Scores, derived from the features extracted from STRING’s high confidence edges, are useful for predicting novel connections between genes, thus enabling the construction of a full map of predicted associations for all possible pairs between 14,707 human genes. In particular, we analyzed the pattern of associations for 126 cytokines and found that the six patterns of cytokine interaction with human genes are consistent with their functional classifications. In order to define the disease-specific roles of cytokines, we have collected gene sets for 11,944 diseases from DisGeNET. We used these gene sets to predict disease-specific gene associations with cytokines by calculating the normalized average Association Scores between disease-associated gene sets and the 126 cytokines; this creates a unique profile of inflammatory genes (both known and predicted) for each disease. We validated our predicted cytokine associations by comparing them to known associations for 171 diseases. The predicted cytokine profiles correlate (p-value < 0.05) with the known ones in 147 diseases. We further characterized the profiles of each disease by calculating an “Inflammation Score” that summarizes different modes of immune responses.
Finally, by analyzing subnetworks formed between disease-specific pathogenesis genes, hormones, receptors, and cytokines, we identified the key genes responsible for interactions between pathogenesis and inflammatory responses. These genes and the corresponding cytokines used by different immune disorders suggest unique targets for drug discovery.


Chromosome-level genome assembly and manually-curated proteome of model necrotroph Parastagonospora nodorum Sn15 reveals a genome-wide trove of candidate effector homologs, and redundancy of virulence-related functions within an accessory chromosome

BMC Genomics, 2021, Vol 22 (1). DOI: 10.1186/s12864-021-07699-8

Author(s): Stefania Bertazzoni, Darcy A. B. Jones, Huyen T. Phan, Kar-Chun Tan, James K. Hane

Keyword(s): Genome Assembly, Plant Pathogens, Gene Annotation, Specific Gene, Accessory Chromosome, Reference Isolate, Manual Curation, A Genome, Depth Analysis, Parastagonospora Nodorum

Background: The fungus Parastagonospora nodorum causes septoria nodorum blotch (SNB) of wheat (Triticum aestivum) and is a model species for necrotrophic plant pathogens. The genome assembly of reference isolate Sn15 was first reported in 2007. P. nodorum infection is promoted by its production of proteinaceous necrotrophic effectors, three of which are characterised – ToxA, Tox1 and Tox3. Results: A chromosome-scale genome assembly of P. nodorum Australian reference isolate Sn15, which combined long read sequencing, optical mapping and manual curation, produced 23 chromosomes with 21 chromosomes possessing both telomeres. New transcriptome data were combined with fungal-specific gene prediction techniques and manual curation to produce a high-quality predicted gene annotation dataset, which comprises 13,869 high confidence genes, and an additional 2534 lower confidence genes retained to assist pathogenicity effector discovery. Comparison to a panel of 31 internationally-sourced isolates identified multiple hotspots within the Sn15 genome for mutation or presence-absence variation, which was used to enhance subsequent effector prediction. Effector prediction resulted in 257 candidates, of which 98 higher-ranked candidates were selected for in-depth analysis and revealed a wealth of functions related to pathogenicity. Additionally, 11 out of the 98 candidates also exhibited orthology conservation patterns that suggested lateral gene transfer with other cereal-pathogenic fungal species. Analysis of the pan-genome indicated the smallest chromosome of 0.4 Mbp length to be an accessory chromosome (AC23). AC23 was notably absent from an avirulent isolate and is predominated by mutation hotspots with an increase in non-synonymous mutations relative to other chromosomes. Surprisingly, AC23 was deficient in effector candidates, but contained several predicted genes with redundant pathogenicity-related functions.
Conclusions: We present an updated series of genomic resources for P. nodorum Sn15 – an important reference isolate and model necrotroph – with a comprehensive survey of its predicted pathogenicity content.


Pairs of compensatory frameshifting mutations contribute to evolution of protein-coding sequences in vertebrates and insects

2020. DOI: 10.1101/2020.12.25.424394

Author(s): Dmitry Biba, Galya Klink, Georgii Bazykin

Keyword(s): Negative Selection, Amino Acid Sequences, Fitness Cost, Vertebrate Species, High Confidence, Protein Coding, Coding Sequences, Reading Frame, Human Genes

Insertions and deletions of lengths not divisible by 3 in protein-coding sequences cause frameshifts that usually induce premature stop codons and may carry a high fitness cost. However, this cost can be circumvented by a second compensatory indel restoring the reading frame. The role of such compensatory frameshifting mutations (CFMs) in evolution has not been studied systematically. Here, we use whole-genome alignments of protein coding genes of 100 vertebrate species, and of 122 insect species, studying the prevalence of CFMs in their divergence. After stringent filtering, we detect a total of 11 high-confidence genes carrying pairs of CFMs, including three human genes: RAB36, ARHGAP6 and NCR3LG1. CFMs tended to occur in genes under relaxed negative selection, indicating that they are typically prevented at functionally important genes. In some instances, mutations closely predating or following the CFMs restored the biochemical similarity of the frameshifted segment to the ancestral sequence, possibly reducing or negating the fitness cost of a CFM. Typically, however, the resulting sequence bore no similarity to the ancestral one, indicating that the CFMs can uncover radically novel regions of sequence space. In total, CFMs represent a potentially important and previously overlooked source of novel variation in amino acid sequences.
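The frame arithmetic behind a compensatory pair can be sketched as follows. This is an illustrative toy, not the authors' filtering procedure: the indel representation, the function names, and the `max_distance` window of 60 nucleotides are assumptions introduced here.

```python
from typing import List, Tuple

# (position in the coding sequence, net length change: positive = insertion, negative = deletion)
Indel = Tuple[int, int]


def is_frameshift(length_change: int) -> bool:
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return length_change % 3 != 0


def compensatory_pairs(indels: List[Indel],
                       max_distance: int = 60) -> List[Tuple[Indel, Indel]]:
    """Pairs of nearby frameshifting indels whose combined length restores the frame.

    max_distance is a hypothetical window used only for this illustration.
    """
    shifts = sorted(indel for indel in indels if is_frameshift(indel[1]))
    pairs = []
    for index, first in enumerate(shifts):
        for second in shifts[index + 1:]:
            if second[0] - first[0] > max_distance:
                break  # later indels are even farther away
            if (first[1] + second[1]) % 3 == 0:
                pairs.append((first, second))
    return pairs


# Made-up indels: a +1 insertion compensated by a -1 deletion 15 nt downstream,
# an in-frame 3 nt deletion, and a distant +2 insertion.
example = [(10, 1), (25, -1), (40, -3), (200, 2)]
print(compensatory_pairs(example))
```

The in-frame deletion at position 40 is ignored because it does not shift the frame, and the insertion at position 200 is left unpaired because it lies outside the assumed window.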
