Large-Scale Parsimony Analysis of Metazoan Indels in Protein-Coding Genes

Abstract Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.

Discovering long noncoding RNA predictors of anticancer drug sensitivity beyond protein-coding genes

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1909998116 ◽

2019 ◽

Vol 116 (44) ◽

pp. 22020-22029 ◽

Cited By ~ 9

Author(s):

Aritro Nath ◽

Eunice Y. T. Lau ◽

Adam M. Lee ◽

Paul Geeleher ◽

William C. S. Cho ◽

...

Keyword(s):

Anticancer Drug ◽

Noncoding Rna ◽

Large Scale ◽

Drug Response ◽

Cancer Cell Line ◽

Systematic Evaluation ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Biomarkers ◽

Response Predictors

Large-scale cancer cell line screens have identified thousands of protein-coding genes (PCGs) as biomarkers of anticancer drug response. However, systematic evaluation of long noncoding RNAs (lncRNAs) as pharmacogenomic biomarkers has so far proven challenging. Here, we study the contribution of lncRNAs as drug response predictors beyond spurious associations driven by correlations with proximal PCGs, tissue lineage, or established biomarkers. We show that, as a whole, the lncRNA transcriptome is equally potent as the PCG transcriptome at predicting response to hundreds of anticancer drugs. Analysis of individual lncRNA transcripts associated with drug response reveals nearly half of the significant associations are in fact attributable to proximal cis-PCGs. However, adjusting for effects of cis-PCGs revealed significant lncRNAs that augment drug response predictions for most drugs, including those with well-established clinical biomarkers. In addition, we identify lncRNA-specific somatic alterations associated with drug response by adopting a statistical approach to determine lncRNAs carrying somatic mutations that undergo positive selection in cancer cells. Lastly, we experimentally demonstrate that 2 lncRNAs, EGFR-AS1 and MIR205HG, are functionally relevant predictors of anti-epidermal growth factor receptor (EGFR) drug response.

An Exploration of the Sequence of a 2.9-Mb Region of the Genome of Drosophila melanogaster: The Adh Region

Genetics ◽

10.1093/genetics/153.1.179 ◽

1999 ◽

Vol 153 (1) ◽

pp. 179-219 ◽

Cited By ~ 15

Author(s):

M Ashburner ◽

S Misra ◽

J Roote ◽

S E Lewis ◽

R Blazej ◽

...

Keyword(s):

Drosophila Melanogaster ◽

Transposable Element ◽

Large Scale ◽

Chromosome Region ◽

Complete Sequence ◽

Test Methods ◽

P Element ◽

Cdna Libraries ◽

Protein Coding ◽

Protein Coding Genes

Abstract A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized “Adh region.” A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.
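The densities and percentages quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes a region length of 2.9 Mb (taken from the title; the abstract says "nearly 3 Mb") and assumes that the percentages are computed against the 218 predicted protein-coding genes; neither denominator is stated explicitly in the abstract.

```python
# Consistency check of the figures quoted in the abstract.
# Assumptions (not stated explicitly in the abstract):
#   - region length is 2.9 Mb, as given in the title
#   - percentages use the 218 predicted protein-coding genes as denominator
REGION_KB = 2900
N_GENES = 218
N_TRANSPOSABLE_ELEMENTS = 17

kb_per_gene = REGION_KB / N_GENES
kb_per_element = REGION_KB / N_TRANSPOSABLE_ELEMENTS
pct_est_match = 100 * 95 / N_GENES
pct_protein_similarity = 100 * 144 / N_GENES

print(f"one gene every {kb_per_gene:.1f} kb")
print(f"one transposable element every {kb_per_element:.1f} kb")
print(f"EST matches: {pct_est_match:.1f}%")
print(f"similar to proteins in other organisms: {pct_protein_similarity:.1f}%")
```

Under these assumptions the rounded values agree with the abstract (13 kb, 171 kb, 44% and 66%).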

High‐quality genomes reveal significant genetic divergence and cryptic speciation in the model organism Folsomia candida (Collembola)

10.22541/au.164018558.87095695/v1 ◽

2021 ◽

Author(s):

Yun-Xia Luan ◽

Yingying Cui ◽

Wan-Jun Chen ◽

Jianfeng Jin ◽

Ai-Min Liu ◽

...

Keyword(s):

Large Scale ◽

Test Organism ◽

Gene Families ◽

Species Differentiation ◽

Folsomia Candida ◽

Cryptic Speciation ◽

High Quality ◽

Protein Coding ◽

Protein Coding Genes ◽

Soil Arthropod

The collembolan Folsomia candida Willem, 1902, is an important representative soil arthropod that is widely distributed throughout the world and has been frequently used as a test organism in soil ecology and ecotoxicology studies. However, it is questioned as an ideal “standard” because of differences in reproductive modes and cryptic genetic diversity between strains from various geographical origins. In this study, we present two high-quality chromosome-level genomes of F. candida, for the parthenogenetic Danish strain (FCDK, 219.08 Mb, N50 of 38.47 Mb, 25,139 protein-coding genes) and the sexual Shanghai strain (FCSH, 153.09 Mb, N50 of 25.75 Mb, 21,609 protein-coding genes). The seven chromosomes of FCDK are each 25–54% larger than the corresponding chromosomes of FCSH, showing obvious repetitive element expansions and large-scale inversions and translocations but no whole-genome duplication. The strain-specific genes, expanded gene families and genes in nonsyntenic chromosomal regions identified in FCDK are highly related to its broader environmental adaptation. In addition, the overall sequence identity of the two mitogenomes is only 78.2%, and FCDK has fewer strain-specific microRNAs than FCSH. In conclusion, FCDK and FCSH have accumulated independent genetic changes and evolved into distinct species since diverging 10 Mya. Our work shows that F. candida represents a good model of rapid cryptic speciation. Moreover, it provides important genomic resources for studying the mechanisms of species differentiation, soil arthropod adaptation to soil ecosystems, and Wolbachia-induced parthenogenesis as well as the evolution of Collembola, a pivotal phylogenetic clade between Crustacea and Insecta.

Large-scale analysis of human gene expression variability associates highly variable drug targets with lower drug effectiveness and safety

Bioinformatics ◽

10.1093/bioinformatics/btz023 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3028-3037 ◽

Cited By ~ 8

Author(s):

Eyal Simonovsky ◽

Ronen Schuster ◽

Esti Yeger-Lotem

Keyword(s):

Drug Target ◽

Large Scale ◽

Target Genes ◽

Supplementary Information ◽

Protein Coding ◽

Expression Levels ◽

Expression Variability ◽

Protein Coding Genes ◽

Approved Drugs ◽

Variable Genes

Abstract

Motivation: The effectiveness of drugs tends to vary between patients. One of the well-known reasons for this phenomenon is genetic polymorphisms in drug target genes among patients. Here, we propose that differences in expression levels of drug target genes across individuals can also contribute to this phenomenon.

Results: To explore this hypothesis, we analyzed the expression variability of protein-coding genes, and particularly drug target genes, across individuals. For this, we developed a novel variability measure, termed local coefficient of variation (LCV), which ranks the expression variability of each gene relative to genes with similar expression levels. Unlike commonly used methods, LCV neutralizes expression levels biases without imposing any distribution over the variation and is robust to data incompleteness. Application of LCV to RNA-sequencing profiles of 19 human tissues and to target genes of 1076 approved drugs revealed that drug target genes were significantly more variable than protein-coding genes. Analysis of 113 drugs with available effectiveness scores showed that drugs targeting highly variable genes tended to be less effective in the population. Furthermore, comparison of approved drugs to drugs that were withdrawn from the market showed that withdrawn drugs targeted significantly more variable genes than approved drugs. Last, upon analyzing gender differences we found that the variability of drug target genes was similar between men and women. Altogether, our results suggest that expression variability of drug target genes could contribute to the variable responsiveness and effectiveness of drugs, and is worth considering during drug treatment and development.

Availability and implementation: LCV is available as a python script in GitHub (https://github.com/eyalsim/LCV).

Supplementary information: Supplementary data are available at Bioinformatics online.
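The idea of ranking each gene's variability against genes with similar expression levels can be illustrated with a minimal sketch. This is not the authors' LCV script: the function name `local_cv_rank`, the fixed-size neighbour window, and the use of the population standard deviation are all assumptions made for illustration. Each gene's coefficient of variation (CV) is compared only with the CVs of its nearest neighbours in mean-expression order, and the score is the percentage of those neighbours with a lower CV.

```python
import statistics


def local_cv_rank(expr, window=50):
    """Hypothetical sketch of a local variability ranking.

    expr: dict mapping gene name -> list of expression values across individuals.
    Returns a dict mapping gene name -> percentage (0-100) of neighbouring genes,
    in mean-expression order, whose CV is lower than this gene's CV.
    """
    stats = []
    for gene, values in expr.items():
        mean = statistics.fmean(values)
        if mean <= 0:
            continue  # CV is undefined for non-positive mean expression
        cv = statistics.pstdev(values) / mean
        stats.append((mean, cv, gene))
    stats.sort()  # order genes by mean expression

    scores = {}
    for i, (_, cv, gene) in enumerate(stats):
        lo = max(0, i - window)
        hi = min(len(stats), i + window + 1)
        # CVs of up to `window` genes on each side, excluding the gene itself
        neighbours = [stats[j][1] for j in range(lo, hi) if j != i]
        if not neighbours:
            scores[gene] = 50.0  # a single gene has no reference set
            continue
        below = sum(c < cv for c in neighbours)
        scores[gene] = 100.0 * below / len(neighbours)
    return scores


example = {
    "a": [10, 10, 10],
    "b": [9, 11, 10],
    "c": [5, 15, 10],
    "d": [100, 100, 100],
    "e": [50, 150, 100],
}
scores = local_cv_rank(example, window=1)
print(scores)
```

Because the comparison set moves with expression level, a lowly expressed gene is not penalised merely because low expression tends to come with a higher CV. The window size here is arbitrary and would need tuning on real data.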

Functional and transcriptional profiling of non-coding RNAs in yeast reveal context-dependent phenotypes and in trans effects on the protein regulatory network

PLoS Genetics ◽

10.1371/journal.pgen.1008761 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008761

Author(s):

Laura Natalia Balarezo-Cisneros ◽

Steven Parker ◽

Marcin G. Fraczek ◽

Soukaina Timouma ◽

Ping Wang ◽

...

Keyword(s):

Large Scale ◽

Transcriptional Profiling ◽

Growth Conditions ◽

Phenotypic Data ◽

Protein Coding ◽

Protein Coding Genes ◽

Genome Wide ◽

In Trans ◽

Non Coding Rnas ◽

The Impact

Non-coding RNAs (ncRNAs), including the more recently identified Stable Unannotated Transcripts (SUTs) and Cryptic Unstable Transcripts (CUTs), are increasingly being shown to play pivotal roles in the transcriptional and post-transcriptional regulation of genes in eukaryotes. Here, we carried out a large-scale screening of ncRNAs in Saccharomyces cerevisiae, and provide evidence for SUT and CUT function. Phenotypic data on 372 ncRNA deletion strains in 23 different growth conditions were collected, identifying ncRNAs responsible for significant cellular fitness changes. Transcriptome profiles were assembled for 18 haploid ncRNA deletion mutants and 2 essential ncRNA heterozygous deletants. Guided by the resulting RNA-seq data we analysed the genome-wide dysregulation of protein coding genes and non-coding transcripts. Novel functional ncRNAs, SUT125, SUT126, SUT035 and SUT532 that act in trans by modulating transcription factors were identified. Furthermore, we described the impact of SUTs and CUTs in modulating coding gene expression in response to different environmental conditions, regulating important biological processes such as respiration (SUT125, SUT126, SUT035, SUT432), steroid biosynthesis (CUT494, SUT053, SUT468) or rRNA processing (SUT075 and snR30). Overall, these data capture and integrate the regulatory and phenotypic network of ncRNAs and protein-coding genes, providing genome-wide evidence of the impact of ncRNAs on cellular homeostasis.

The shrinking human protein coding complement: are there fewer than 20,000 genes?

10.1101/001909 ◽

2014 ◽

Cited By ~ 2

Author(s):

Iakes Ezkurdia ◽

David Juan ◽

Jose Manuel Rodriguez ◽

Adam Frankish ◽

Mark Diekhans ◽

...

Keyword(s):

Protein Expression ◽

Human Genome ◽

Genome Annotation ◽

Large Scale ◽

Cellular Protein ◽

Human Protein ◽

Protein Coding ◽

Detection Rates ◽

Protein Coding Genes ◽

Peptide Mass

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we map the peptides detected in 7 large-scale proteomics studies to almost 60% of the protein coding genes in the GENCODE annotation of the human genome. We find that conservation across vertebrate species and the age of the gene family are key indicators of whether a peptide will be detected in proteomics experiments. We find peptides for most highly conserved genes and for practically all genes that evolved before bilateria. At the same time there is almost no evidence of protein expression for genes that have appeared since primates, or for genes that do not have any protein-like features or cross-species conservation. We identify 19 non-protein-like features such as weak conservation, no protein features or ambiguous annotations in major databases that are indicators of low peptide detection rates. We use these features to describe a set of 2,001 genes that are potentially non-coding, and show that many of these genes behave more like non-coding genes than protein-coding genes. We detect peptides for just 3% of these genes. We suggest that many of these 2,001 genes do not code for proteins under normal circumstances and that they should not be included in the human protein coding gene catalogue. These potential non-coding genes will be revised as part of the ongoing human genome annotation effort.

Discovering novel long non-coding RNA predictors of anticancer drug sensitivity beyond protein-coding genes

10.1101/666156 ◽

2019 ◽

Author(s):

Aritro Nath ◽

Eunice Y.T. Lau ◽

Adam M. Lee ◽

Paul Geeleher ◽

William C.S. Cho ◽

...

Keyword(s):

Anticancer Drug ◽

Large Scale ◽

Drug Response ◽

Cancer Cell Line ◽

Systematic Evaluation ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Biomarkers ◽

Response Predictors ◽

Drugs Analysis

Abstract Large-scale cancer cell line screens have identified thousands of protein-coding genes (PCGs) as biomarkers of anticancer drug response. However, systematic evaluation of long non-coding RNAs (lncRNAs) as pharmacogenomic biomarkers has so far proven challenging. Here, we study the contribution of lncRNAs as drug response predictors beyond spurious associations driven by correlations with proximal PCGs, tissue-lineage or established biomarkers. We show that, as a whole, the lncRNA transcriptome is equally potent as the PCG transcriptome at predicting response to hundreds of anticancer drugs. Analysis of individual lncRNA transcripts associated with drug response reveals nearly half of the significant associations are in fact attributable to proximal cis-PCGs. However, adjusting for effects of cis-PCGs revealed significant lncRNAs that augment drug response predictions for most drugs, including those with well-established clinical biomarkers. In addition, we identify lncRNA-specific somatic alterations associated with drug response by adopting a statistical approach to determine lncRNAs carrying somatic mutations that undergo positive selection in cancer cells. Lastly, we experimentally demonstrate that two novel lncRNAs, EGFR-AS1 and MIR205HG, are functionally relevant predictors of anti-EGFR drug response.

MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics

10.1101/851964 ◽

2019 ◽

Cited By ~ 1

Author(s):

Eli Levy Karin ◽

Milot Mirdita ◽

Johannes Söding

Keyword(s):

High Throughput ◽

Large Scale ◽

Sequence Similarity ◽

Direct Sequencing ◽

Metagenomic Data ◽

Reference Database ◽

Protein Coding ◽

Protein Coding Genes ◽

Highly Sensitive ◽

Computational Procedures

Abstract

Background: Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics.

Results: MetaEuk is a toolkit for high-throughput, reference-based discovery and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk’s power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in eight days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups.

Conclusion: The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation.
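The first step described in the Results, translating a contig in six reading frames to obtain candidate protein fragments, can be sketched as follows. This is an illustrative sketch, not MetaEuk's implementation: the function names are hypothetical, the standard genetic code is assumed, and splitting each translated frame at stop codons with a minimum fragment length is an assumption about how fragments might be delimited.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(dna, min_len=2):
    """Hypothetical helper: list (strand, frame, peptide) for every stretch
    between stop codons in all six reading frames that has >= min_len residues."""
    dna = dna.upper()
    fragments = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            for peptide in protein.split("*"):
                if len(peptide) >= min_len:
                    fragments.append((strand, frame, peptide))
    return fragments


frags = six_frame_fragments("ATGGCCTAAGGT")
for strand, frame, peptide in frags:
    print(strand, frame, peptide)
```

In a reference-based workflow, fragments like these would then be searched against a protein database and compatible hits combined into multi-exon gene calls; that search and combination step is outside the scope of this sketch.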

TGFam-Finder: An optimal solution for target-gene family annotation in eukaryotic genomes

10.1101/372433 ◽

2018 ◽

Cited By ~ 1

Author(s):

Seungill Kim ◽

Kyeongchae Cheong ◽

Jieun Park ◽

Myung-Shin Kim ◽

Ji-Hyun Kim ◽

...

Keyword(s):

Gene Family ◽

Large Scale ◽

Target Gene ◽

Optimal Solution ◽

Target Domain ◽

Structural Annotation ◽

Protein Coding ◽

Protein Coding Genes ◽

New Gene ◽

Eukaryotic Genomes

Abstract Whole genome annotation errors that omit essential protein-coding genes hinder further research. We developed Target Gene Family Finder (TGFam-Finder), an optimal tool for structural annotation of protein-coding genes containing target domain(s) of interest in eukaryotic genomes. Large-scale re-annotation of 100 publicly available eukaryotic genomes led to the discovery of essential genes that were missed in previous annotations. An average of 117 (346%) and 148 (45%) additional FAR1 and NLR genes were newly identified in 50 plant genomes. Furthermore, 117 (47%) additional C2H2 zinc finger genes were detected in 50 animal genomes including human and mouse. Accuracy of the newly annotated genes was validated by RT-PCR and cDNA sequencing in human, mouse and rice. In the human genome, 26 newly annotated genes were identical with known functional genes. TGFam-Finder along with the new gene models provide an optimized platform for unbiased functional and comparative genomics and comprehensive evolutionary study in eukaryotes.