TGFam-Finder: An optimal solution for target-gene family annotation in eukaryotic genomes

Mapping Intimacies ◽

10.1101/372433 ◽

2018 ◽

Cited By ~ 1

Author(s):

Seungill Kim ◽

Kyeongchae Cheong ◽

Jieun Park ◽

Myung-Shin Kim ◽

Ji-Hyun Kim ◽

...

Keyword(s):

Gene Family ◽

Large Scale ◽

Target Gene ◽

Optimal Solution ◽

Target Domain ◽

Structural Annotation ◽

Protein Coding ◽

Protein Coding Genes ◽

New Gene ◽

Eukaryotic Genomes

Abstract Whole-genome annotation errors that omit essential protein-coding genes hinder further research. We developed Target Gene Family Finder (TGFam-Finder), an optimal tool for structural annotation of protein-coding genes containing target domain(s) of interest in eukaryotic genomes. Large-scale re-annotation of 100 publicly available eukaryotic genomes led to the discovery of essential genes that were missed in previous annotations. An average of 117 (346%) and 148 (45%) additional FAR1 and NLR genes, respectively, were newly identified in 50 plant genomes. Furthermore, 117 (47%) additional C2H2 zinc finger genes were detected in 50 animal genomes including human and mouse. Accuracy of the newly annotated genes was validated by RT-PCR and cDNA sequencing in human, mouse and rice. In the human genome, 26 newly annotated genes were identical to known functional genes. TGFam-Finder, along with the new gene models, provides an optimized platform for unbiased functional and comparative genomics and comprehensive evolutionary study in eukaryotes.


PaperBLAST: Text-mining papers for information about homologs

10.1101/133041 ◽

2017 ◽

Author(s):

Morgan N. Price ◽

Adam P. Arkin

Keyword(s):

Text Mining ◽

Genome Sequencing ◽

Full Text ◽

Large Scale ◽

Scientific Literature ◽

Protein Sequences ◽

Protein Coding ◽

Link Protein ◽

Protein Coding Genes ◽

Link Type

Abstract Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.


Discovering long noncoding RNA predictors of anticancer drug sensitivity beyond protein-coding genes

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1909998116 ◽

2019 ◽

Vol 116 (44) ◽

pp. 22020-22029 ◽

Cited By ~ 9

Author(s):

Aritro Nath ◽

Eunice Y. T. Lau ◽

Adam M. Lee ◽

Paul Geeleher ◽

William C. S. Cho ◽

...

Keyword(s):

Anticancer Drug ◽

Noncoding Rna ◽

Large Scale ◽

Drug Response ◽

Cancer Cell Line ◽

Systematic Evaluation ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Biomarkers ◽

Response Predictors

Large-scale cancer cell line screens have identified thousands of protein-coding genes (PCGs) as biomarkers of anticancer drug response. However, systematic evaluation of long noncoding RNAs (lncRNAs) as pharmacogenomic biomarkers has so far proven challenging. Here, we study the contribution of lncRNAs as drug response predictors beyond spurious associations driven by correlations with proximal PCGs, tissue lineage, or established biomarkers. We show that, as a whole, the lncRNA transcriptome is equally potent as the PCG transcriptome at predicting response to hundreds of anticancer drugs. Analysis of individual lncRNA transcripts associated with drug response reveals that nearly half of the significant associations are in fact attributable to proximal cis-PCGs. However, adjusting for effects of cis-PCGs revealed significant lncRNAs that augment drug response predictions for most drugs, including those with well-established clinical biomarkers. In addition, we identify lncRNA-specific somatic alterations associated with drug response by adopting a statistical approach to determine lncRNAs carrying somatic mutations that undergo positive selection in cancer cells. Lastly, we experimentally demonstrate that 2 lncRNAs, EGFR-AS1 and MIR205HG, are functionally relevant predictors of anti-epidermal growth factor receptor (EGFR) drug response.


TSEBRA: transcript selector for BRAKER

BMC Bioinformatics ◽

10.1186/s12859-021-04482-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lars Gabriel ◽

Katharina J. Hoff ◽

Tomáš Brůna ◽

Mark Borodovsky ◽

Mario Stanke

Keyword(s):

Statistical Models ◽

Gene Prediction ◽

Software Tool ◽

Genome Project ◽

Rna Seq ◽

Protein Coding ◽

Homologous Protein ◽

Protein Coding Genes ◽

Overlapping Transcripts ◽

Eukaryotic Genomes

Abstract Background: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and the statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. Results: We present TSEBRA, a software tool that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
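
The general idea of choosing among overlapping transcripts by their evidence support can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `Transcript` fields, the support score (fraction of introns backed by evidence) and the greedy rule are hypothetical and are not TSEBRA's actual rules, scores or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical candidate transcript from one of two prediction sets."""
    name: str
    start: int
    end: int
    supported_introns: int  # introns backed by some extrinsic evidence
    total_introns: int

    def support(self) -> float:
        # Assumed score: fraction of introns with evidence support.
        if self.total_introns == 0:
            return 0.0
        return self.supported_introns / self.total_introns


def overlaps(a: Transcript, b: Transcript) -> bool:
    """True if the two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select(candidates: list[Transcript]) -> list[Transcript]:
    """Greedily keep the best-supported transcript in each overlapping group."""
    kept: list[Transcript] = []
    for t in sorted(candidates, key=lambda t: t.support(), reverse=True):
        if not any(overlaps(t, k) for k in kept):
            kept.append(t)
    return sorted(kept, key=lambda t: t.start)


candidates = [
    Transcript("a", 1, 100, 3, 4),
    Transcript("b", 50, 150, 4, 4),
    Transcript("c", 200, 300, 0, 2),
]
selected_names = [t.name for t in select(candidates)]
```

In this toy input, `a` and `b` overlap, so only the better-supported `b` is kept, while the non-overlapping `c` is kept regardless of its score.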


An Exploration of the Sequence of a 2.9-Mb Region of the Genome of Drosophila melanogaster: The Adh Region

Genetics ◽

10.1093/genetics/153.1.179 ◽

1999 ◽

Vol 153 (1) ◽

pp. 179-219 ◽

Cited By ~ 15

Author(s):

M Ashburner ◽

S Misra ◽

J Roote ◽

S E Lewis ◽

R Blazej ◽

...

Keyword(s):

Drosophila Melanogaster ◽

Transposable Element ◽

Large Scale ◽

Chromosome Region ◽

Complete Sequence ◽

Test Methods ◽

P Element ◽

Cdna Libraries ◽

Protein Coding ◽

Protein Coding Genes

Abstract A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized “Adh region.” A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.
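
The densities and percentages quoted in this abstract can be checked with simple arithmetic. The sketch below assumes that "nearly 3 Mb" corresponds to the 2.9 Mb given in the title, and that the 44% and 66% figures are computed over the 218 predicted protein-coding genes; both are assumptions, not statements from the paper.

```python
REGION_BP = 2_900_000  # assumption: 2.9 Mb, taken from the title
GENES = 218            # predicted protein-coding genes
TES = 17               # transposable element sequences


def spacing_kb(region_bp: int, count: int) -> float:
    """Average spacing in kb between features spread over the region."""
    return region_bp / count / 1000


def percent(part: int, whole: int) -> int:
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


gene_spacing = spacing_kb(REGION_BP, GENES)  # about 13.3 kb per gene
te_spacing = spacing_kb(REGION_BP, TES)      # about 170.6 kb per element
est_match = percent(95, GENES)               # genes matching a Drosophila EST
protein_match = percent(144, GENES)          # genes similar to other proteins
```

Under these assumptions the computed values round to the figures reported in the abstract: one gene every 13 kb, one transposable element every 171 kb, 44% and 66%.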


Ab Initio Construction and Evolutionary Analysis of Protein-Coding Gene Families with Partially Homologous Relationships: Closely Related Drosophila Genomes as a Case Study

Genome Biology and Evolution ◽

10.1093/gbe/evaa041 ◽

2020 ◽

Vol 12 (3) ◽

pp. 185-202

Author(s):

Xia Han ◽

Jindan Guo ◽

Erli Pang ◽

Hongtao Song ◽

Kui Lin

Keyword(s):

Gene Family ◽

De Novo ◽

Gene Families ◽

Gene Family Evolution ◽

Evolutionary Analysis ◽

Protein Coding ◽

Protein Coding Genes ◽

Genome Phylogeny ◽

Partial Homology

Abstract How have genes evolved within a well-known genome phylogeny? Many protein-coding genes should have evolved as a whole at the gene level, and some should have evolved partly through fragments at the subgene level. To comprehensively explore such complex homologous relationships and better understand gene family evolution, here, with de novo-identified modules, the subgene units which could consecutively cover proteins within a set of closely related species, we applied a new phylogeny-based approach that considers evolutionary models with partial homology to classify all protein-coding genes in nine Drosophila genomes. Compared with two other popular methods for gene family construction, our approach improved practical gene family classifications with a more reasonable view of homology and provided a much more complete landscape of gene family evolution at the gene and subgene levels. In the case study, we found that most expanded gene families might have evolved mainly through module rearrangements rather than gene duplications and mainly generated single-module genes through partial gene duplication, suggesting that there might be pervasive subgene rearrangement in the evolution of protein-coding gene families. The use of a phylogeny-based approach with partial homology to classify and analyze protein-coding gene families may provide us with a more comprehensive landscape depicting how genes evolve within a well-known genome phylogeny.


Mitochondrial genome organization and vertebrate phylogenetics

Genetics and Molecular Biology ◽

10.1590/s1415-47572000000400008 ◽

2000 ◽

Vol 23 (4) ◽

pp. 745-752 ◽

Cited By ~ 57

Author(s):

Sérgio Luiz Pereira

Keyword(s):

Mitochondrial Genome ◽

Genome Organization ◽

Tandem Duplication ◽

Probable Mechanism ◽

Mitochondrial Genomes ◽

Protein Coding ◽

Protein Coding Genes ◽

Use Of Data ◽

New Gene ◽

Conserved Gene

With the advent of DNA sequencing techniques the organization of the vertebrate mitochondrial genome shows variation between higher taxonomic levels. The most conserved gene order is found in placental mammals, turtles, fishes, some lizards and Xenopus. Birds, other species of lizards, crocodilians, marsupial mammals, snakes, tuatara, lamprey, and some other amphibians and one species of fish have gene orders that are less conserved. The most probable mechanism for new gene rearrangements seems to be tandem duplication and multiple deletion events, always associated with tRNA sequences. Some new rearrangements seem to be typical of monophyletic groups and the use of data from these groups may be useful for answering phylogenetic questions involving vertebrate higher taxonomic levels. Other features such as the secondary structure of tRNA, and the start and stop codons of protein-coding genes may also be useful in comparisons of vertebrate mitochondrial genomes.


Large-Scale Parsimony Analysis of Metazoan Indels in Protein-Coding Genes

Molecular Biology and Evolution ◽

10.1093/molbev/msp263 ◽

2009 ◽

Vol 27 (2) ◽

pp. 441-451 ◽

Cited By ~ 32

Author(s):

F. Belinky ◽

O. Cohen ◽

D. Huchon

Keyword(s):

Large Scale ◽

Parsimony Analysis ◽

Protein Coding ◽

Protein Coding Genes


The shiftability of protein coding genes: the genetic code was optimized for frameshift tolerating

10.7287/peerj.preprints.806 ◽

2015 ◽

Cited By ~ 1

Author(s):

Xiaolong Wang ◽

Xuxiang Wang ◽

Gang Chen ◽

Jianye Zhang ◽

Yongqiang Liu ◽

...

Keyword(s):

Genetic Code ◽

Model Organisms ◽

Large Dataset ◽

Protein Coding ◽

E Coli ◽

Protein Coding Genes ◽

New Gene ◽

Sense Codon ◽

The Relationship ◽

Reading Frames

The genetic code defines the relationship between a protein and its coding DNA sequence. It was presumed that most frameshifts would yield non-functional, truncated or cytotoxic products. In this study, we report that in E. coli, a frameshift β-lactamase (bla) gene is still functional if all of the inner stop codons are read through or replaced by a sense codon. By analyzing a large dataset including all available protein-coding genes in major model organisms, it is demonstrated that in any species, and in any protein-coding gene, the three translational products from the three different reading frames are always similar to each other, with constant ~50% similarities and ~100% coverages, and that the similarities are predefined by the genetic code rather than by the sequences themselves. Because a coding gene can likely be translated into three isoforms, one from each of the three reading frames, we propose a new gene expression paradigm, “one transcript, three translations”, which is an amendment to the traditional “one gene, one/multiple peptides” hypotheses. Finally, we conclude that the genetic code was optimized for frameshift tolerance in early evolution, which endows every protein-coding gene with shiftability, an inherent and everlasting ability to tolerate frameshift mutations, and serves as an innate mechanism for cells to deal with the frameshift problem.
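
To make "three translational products from the three different reading frames" concrete, here is a minimal sketch that translates one DNA sequence in each forward frame using the standard genetic code. The function names and the example sequence are illustrative assumptions; the sketch does not reproduce the authors' similarity analysis.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate `seq` starting at offset `frame` (0, 1 or 2).

    Trailing bases that do not form a full codon are ignored;
    stop codons are written as '*'.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def three_frames(seq: str) -> list[str]:
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


frames = three_frames("ATGGCCTAA")
```

For the hypothetical input `ATGGCCTAA`, frame 0 reads ATG GCC TAA, frame 1 reads TGG CCT, and frame 2 reads GGC CTA, so each frame yields a different peptide from the same transcript.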


High‐quality genomes reveal significant genetic divergence and cryptic speciation in the model organism Folsomia candida (Collembola)

10.22541/au.164018558.87095695/v1 ◽

2021 ◽

Author(s):

Yun-Xia Luan ◽

Yingying Cui ◽

Wan-Jun Chen ◽

Jianfeng Jin ◽

Ai-Min Liu ◽

...

Keyword(s):

Large Scale ◽

Test Organism ◽

Gene Families ◽

Species Differentiation ◽

Folsomia Candida ◽

Cryptic Speciation ◽

High Quality ◽

Protein Coding ◽

Protein Coding Genes ◽

Soil Arthropod

The collembolan Folsomia candida Willem, 1902, is an important representative soil arthropod that is widely distributed throughout the world and has been frequently used as a test organism in soil ecology and ecotoxicology studies. However, it is questioned as an ideal “standard” because of differences in reproductive modes and cryptic genetic diversity between strains from various geographical origins. In this study, we present two high-quality chromosome-level genomes of F. candida, for the parthenogenetic Danish strain (FCDK, 219.08 Mb, N50 of 38.47 Mb, 25,139 protein-coding genes) and the sexual Shanghai strain (FCSH, 153.09 Mb, N50 of 25.75 Mb, 21,609 protein-coding genes). The seven chromosomes of FCDK are each 25–54% larger than the corresponding chromosomes of FCSH, showing obvious repetitive element expansions and large-scale inversions and translocations but no whole-genome duplication. The strain-specific genes, expanded gene families and genes in nonsyntenic chromosomal regions identified in FCDK are highly related to its broader environmental adaptation. In addition, the overall sequence identity of the two mitogenomes is only 78.2%, and FCDK has fewer strain-specific microRNAs than FCSH. In conclusion, FCDK and FCSH have accumulated independent genetic changes and evolved into distinct species since diverging 10 Mya. Our work shows that F. candida represents a good model of rapid cryptic speciation. Moreover, it provides important genomic resources for studying the mechanisms of species differentiation, soil arthropod adaptation to soil ecosystems, and Wolbachia-induced parthenogenesis as well as the evolution of Collembola, a pivotal phylogenetic clade between Crustacea and Insecta.


Transcriptomic and Proteomic Profiling of Human Stable and Unstable Carotid Atherosclerotic Plaques

Frontiers in Genetics ◽

10.3389/fgene.2021.755507 ◽

2021 ◽

Vol 12 ◽

Author(s):

Mei-hua Bao ◽

Ruo-qi Zhang ◽

Xiao-shan Huang ◽

Ji Zhou ◽

Zhen Guo ◽

...

Keyword(s):

Cellular Response ◽

Target Gene ◽

Clinical Manifestations ◽

Differentially Expressed ◽

Atherosclerotic Plaques ◽

Proteomic Profiling ◽

Illumina Hiseq ◽

Protein Coding ◽

Protein Coding Genes ◽

The Stability

Atherosclerosis is a chronic inflammatory disease with high prevalence and mortality. The rupture of atherosclerotic plaque is the main reason for the clinical events caused by atherosclerosis. Clarifying the transcriptomic and proteomic profiles of stable and unstable atherosclerotic plaques is crucial to preventing these clinical manifestations. In the present study, 5 stable and 5 unstable human carotid atherosclerotic plaques were obtained by carotid endarterectomy. The samples were used for whole transcriptome sequencing (RNA-Seq) by next-generation sequencing on the Illumina HiSeq, and for proteome analysis by HPLC-MS/MS. The lncRNA-targeted genes and circRNA-originated genes were identified by analyzing their location and sequence. Gene Ontology and KEGG enrichment analyses were carried out to analyze the functions of differentially expressed RNAs and proteins. The protein-protein interaction (PPI) network was constructed with the online tool STRING. The consistency of the transcriptome and proteome was analyzed, and the lncRNA/circRNA-miRNA-mRNA interactions were predicted. As a result, 202 mRNAs, 488 lncRNAs, 91 circRNAs, and 293 proteins were identified as differentially expressed between stable and unstable atherosclerotic plaques. The 488 lncRNAs might target 381 protein-coding genes by cis-acting mechanisms. Sequence analysis indicated that the 91 differentially expressed circRNAs originated from 97 protein-coding genes. These differentially expressed RNAs and proteins were mainly enriched in terms related to the cellular response to stress or stimulus, the regulation of gene transcription, the immune response, nervous system functions, hematologic activities, and the endocrine system. These results were consistent with previously reported data in the dataset GSE41571. Further analysis identified CD5L, S100A12, CKB (target gene of lncRNA MSTRG.11455.17), CEMIP (target gene of lncRNA MSTRG.12845), and SH3GLB1 (originated gene of hsacirc_000411) as critical genes in regulating the stability of atherosclerotic plaques. Our results provide comprehensive transcriptomic and proteomic knowledge on the stability of atherosclerotic plaques.
