MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics

10.1101/851964 ◽

2019 ◽

Cited By ~ 1

Author(s):

Eli Levy Karin ◽

Milot Mirdita ◽

Johannes Söding

Keyword(s):

High Throughput ◽

Large Scale ◽

Sequence Similarity ◽

Direct Sequencing ◽

Metagenomic Data ◽

Reference Database ◽

Protein Coding ◽

Protein Coding Genes ◽

Highly Sensitive ◽

Computational Procedures

Abstract

Background: Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics.

Results: MetaEuk is a toolkit for high-throughput, reference-based discovery and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk’s power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in eight days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups.

Conclusion: The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation.
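The abstract mentions searching with 6-frame-translated fragments. As a rough illustration of that general idea only, the sketch below translates a DNA string in all six reading frames and splits each translation at stop codons. It is not MetaEuk's implementation: the function names, the `min_len` parameter, and the use of the standard genetic code are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_fragments(dna, min_len=2):
    """Hypothetical helper: stop-free peptide fragments from all six frames.

    Returns (strand, frame_offset, peptide) tuples for every fragment
    of at least min_len residues.
    """
    dna = dna.upper()
    fragments = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            # '*' marks a stop codon; split the frame into stop-free pieces.
            for piece in protein.split("*"):
                if len(piece) >= min_len:
                    fragments.append((strand, offset, piece))
    return fragments


if __name__ == "__main__":
    for strand, offset, peptide in six_frame_fragments("ATGGCCTAAGGCATT"):
        print(strand, offset, peptide)
```

A real tool would additionally record the coordinates of each fragment on the contig so that matches can later be combined into multi-exon genes; that step is omitted here.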

PaperBLAST: Text-mining papers for information about homologs

10.1101/133041 ◽

2017 ◽

Author(s):

Morgan N. Price ◽

Adam P. Arkin

Keyword(s):

Text Mining ◽

Genome Sequencing ◽

Full Text ◽

Large Scale ◽

Scientific Literature ◽

Protein Sequences ◽

Protein Coding ◽

Link Protein ◽

Protein Coding Genes ◽

Link Type

Abstract: Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.

Discovering long noncoding RNA predictors of anticancer drug sensitivity beyond protein-coding genes

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1909998116 ◽

2019 ◽

Vol 116 (44) ◽

pp. 22020-22029 ◽

Cited By ~ 9

Author(s):

Aritro Nath ◽

Eunice Y. T. Lau ◽

Adam M. Lee ◽

Paul Geeleher ◽

William C. S. Cho ◽

...

Keyword(s):

Anticancer Drug ◽

Noncoding Rna ◽

Large Scale ◽

Drug Response ◽

Cancer Cell Line ◽

Systematic Evaluation ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Biomarkers ◽

Response Predictors

Large-scale cancer cell line screens have identified thousands of protein-coding genes (PCGs) as biomarkers of anticancer drug response. However, systematic evaluation of long noncoding RNAs (lncRNAs) as pharmacogenomic biomarkers has so far proven challenging. Here, we study the contribution of lncRNAs as drug response predictors beyond spurious associations driven by correlations with proximal PCGs, tissue lineage, or established biomarkers. We show that, as a whole, the lncRNA transcriptome is as potent as the PCG transcriptome at predicting response to hundreds of anticancer drugs. Analysis of individual lncRNA transcripts associated with drug response reveals that nearly half of the significant associations are in fact attributable to proximal cis-PCGs. However, adjusting for the effects of cis-PCGs revealed significant lncRNAs that augment drug response predictions for most drugs, including those with well-established clinical biomarkers. In addition, we identify lncRNA-specific somatic alterations associated with drug response by adopting a statistical approach to determine lncRNAs carrying somatic mutations that undergo positive selection in cancer cells. Lastly, we experimentally demonstrate that 2 lncRNAs, EGFR-AS1 and MIR205HG, are functionally relevant predictors of anti-epidermal growth factor receptor (EGFR) drug response.

An Exploration of the Sequence of a 2.9-Mb Region of the Genome of Drosophila melanogaster: The Adh Region

Genetics ◽

10.1093/genetics/153.1.179 ◽

1999 ◽

Vol 153 (1) ◽

pp. 179-219 ◽

Cited By ~ 15

Author(s):

M Ashburner ◽

S Misra ◽

J Roote ◽

S E Lewis ◽

R Blazej ◽

...

Keyword(s):

Drosophila Melanogaster ◽

Transposable Element ◽

Large Scale ◽

Chromosome Region ◽

Complete Sequence ◽

Test Methods ◽

P Element ◽

Cdna Libraries ◽

Protein Coding ◽

Protein Coding Genes

Abstract: A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized “Adh region.” A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.
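The density and percentage figures quoted in this abstract can be checked with simple arithmetic. The sketch below assumes a region length of 2.9 Mb (taken from the title, since the abstract only says "nearly 3 Mb") and assumes that the 218 predicted protein-coding genes are the denominator for both percentages; the helper names are hypothetical. Under those assumptions the results round to the values stated in the abstract.

```python
# Assumed inputs: region length from the title, counts from the abstract.
REGION_BP = 2_900_000          # 2.9 Mb (assumption: title value used as length)
PROTEIN_CODING_GENES = 218
TRANSPOSABLE_ELEMENTS = 17
EST_MATCHES = 95
HOMOLOG_MATCHES = 144


def kb_per_feature(region_bp, count):
    """Average spacing in kilobases: region length divided by feature count."""
    return region_bp / count / 1000


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100 * part / whole


gene_spacing_kb = kb_per_feature(REGION_BP, PROTEIN_CODING_GENES)
te_spacing_kb = kb_per_feature(REGION_BP, TRANSPOSABLE_ELEMENTS)
est_match_pct = percent(EST_MATCHES, PROTEIN_CODING_GENES)
homolog_pct = percent(HOMOLOG_MATCHES, PROTEIN_CODING_GENES)

if __name__ == "__main__":
    print(f"one gene every {gene_spacing_kb:.1f} kb")
    print(f"one transposable element every {te_spacing_kb:.1f} kb")
    print(f"EST matches: {est_match_pct:.1f}%")
    print(f"similar to proteins in other organisms: {homolog_pct:.1f}%")
```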

Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large-scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM, an amount available on many clusters, or with 16 GB of DRAM plus a 24 GB low-cost commodity flash drive (NVRAM), a cost-effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis on 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real-world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.

Genome Sequence of Gordonia Phage Yvonnetastic

Genome Announcements ◽

10.1128/genomea.00594-16 ◽

2016 ◽

Vol 4 (4) ◽

Cited By ~ 1

Author(s):

Welkin H. Pope ◽

Anshika Bandyopadhyay ◽

Meghan L. Carlton ◽

Meghan T. Kane ◽

Niyati J. Panchal ◽

...

Keyword(s):

Genome Sequence ◽

Sequence Similarity ◽

Trna Genes ◽

Protein Coding ◽

Protein Coding Genes ◽

A Genome

Gordonia bacteriophage Yvonnetastic was isolated from soil in Pittsburgh, PA, using Gordonia terrae 3612 as a host. Yvonnetastic has siphoviral morphology and a genome of 98,136 bp, with 198 predicted protein-coding genes and five tRNA genes. Yvonnetastic does not share substantial sequence similarity with other sequenced bacteriophage genomes.

Dynamic Expression of Long Non-Coding RNAs Throughout Parasite Sexual and Neural Maturation in Schistosoma Japonicum

Non-Coding RNA ◽

10.3390/ncrna6020015 ◽

2020 ◽

Vol 6 (2) ◽

pp. 15 ◽

Cited By ~ 1

Author(s):

Lucas Maciel ◽

David Morales-Vicente ◽

Sergio Verjovski-Almeida

Keyword(s):

Schistosoma Japonicum ◽

Sexual Maturation ◽

System Development ◽

Sequence Similarity ◽

Nervous System Development ◽

Rna Seq ◽

Protein Coding ◽

Protein Coding Genes ◽

Synteny Conservation ◽

Non Coding Rnas

Schistosoma japonicum is a flatworm that causes schistosomiasis, a neglected tropical disease. S. japonicum RNA-Seq analyses have previously been reported in the literature for females and males obtained during sexual maturation from 14 to 28 days post-infection in mice, resulting in the identification of protein-coding genes and pathways whose expression levels were related to sexual development. However, this work did not include an analysis of long non-coding RNAs (lncRNAs). Here, we applied a pipeline to identify and annotate lncRNAs in 66 publicly available S. japonicum RNA-Seq libraries from different life-cycle stages. We also performed co-expression analyses to find stage-specific lncRNAs possibly related to sexual maturation. We identified 12,291 expressed S. japonicum lncRNAs. Sequence similarity search and synteny conservation indicated that some 14% of S. japonicum intergenic lncRNAs have synteny conservation with S. mansoni intergenic lncRNAs. Co-expression analyses showed that lncRNAs and protein-coding genes in S. japonicum males and females have a dynamic co-expression throughout sexual maturation, showing differential expression between the sexes; the protein-coding genes were related to nervous system development, lipid and drug metabolism, and overall parasite survival. The co-expression pattern suggests that lncRNAs possibly regulate these processes or are regulated by the same activation program as the protein-coding genes.

Large-Scale Parsimony Analysis of Metazoan Indels in Protein-Coding Genes

Molecular Biology and Evolution ◽

10.1093/molbev/msp263 ◽

2009 ◽

Vol 27 (2) ◽

pp. 441-451 ◽

Cited By ~ 32

Author(s):

F. Belinky ◽

O. Cohen ◽

D. Huchon

Keyword(s):

Large Scale ◽

Parsimony Analysis ◽

Protein Coding ◽

Protein Coding Genes

High‐quality genomes reveal significant genetic divergence and cryptic speciation in the model organism Folsomia candida (Collembola)

10.22541/au.164018558.87095695/v1 ◽

2021 ◽

Author(s):

Yun-Xia Luan ◽

Yingying Cui ◽

Wan-Jun Chen ◽

Jianfeng Jin ◽

Ai-Min Liu ◽

...

Keyword(s):

Large Scale ◽

Test Organism ◽

Gene Families ◽

Species Differentiation ◽

Folsomia Candida ◽

Cryptic Speciation ◽

High Quality ◽

Protein Coding ◽

Protein Coding Genes ◽

Soil Arthropod

The collembolan Folsomia candida Willem, 1902, is an important representative soil arthropod that is widely distributed throughout the world and has been frequently used as a test organism in soil ecology and ecotoxicology studies. However, its suitability as an ideal “standard” has been questioned because of differences in reproductive modes and cryptic genetic diversity between strains from various geographical origins. In this study, we present two high-quality chromosome-level genomes of F. candida, for the parthenogenetic Danish strain (FCDK, 219.08 Mb, N50 of 38.47 Mb, 25,139 protein-coding genes) and the sexual Shanghai strain (FCSH, 153.09 Mb, N50 of 25.75 Mb, 21,609 protein-coding genes). The seven chromosomes of FCDK are each 25–54% larger than the corresponding chromosomes of FCSH, showing obvious repetitive element expansions and large-scale inversions and translocations but no whole-genome duplication. The strain-specific genes, expanded gene families and genes in nonsyntenic chromosomal regions identified in FCDK are highly related to its broader environmental adaptation. In addition, the overall sequence identity of the two mitogenomes is only 78.2%, and FCDK has fewer strain-specific microRNAs than FCSH. In conclusion, FCDK and FCSH have accumulated independent genetic changes and evolved into distinct species since diverging 10 Mya. Our work shows that F. candida represents a good model of rapid cryptic speciation. Moreover, it provides important genomic resources for studying the mechanisms of species differentiation, soil arthropod adaptation to soil ecosystems, and Wolbachia-induced parthenogenesis as well as the evolution of Collembola, a pivotal phylogenetic clade between Crustacea and Insecta.

Piggy: A Rapid, Large-Scale Pan-Genome Analysis Tool for Intergenic Regions in Bacteria

10.1101/179515 ◽

2017 ◽

Cited By ~ 3

Author(s):

Harry A. Thorpe ◽

Sion C. Bayliss ◽

Samuel K. Sheppard ◽

Edward J. Feil

Keyword(s):

Large Scale ◽

Reference Database ◽

Analysis Tool ◽

Protein Coding ◽

Coding Sequences ◽

Large Genome ◽

Pan Genome ◽

Overwhelming Evidence ◽

Intergenic Regions ◽

Genome Analyses

Abstract: Despite overwhelming evidence that variation in intergenic regions (IGRs) in bacteria impacts on phenotypes, most current approaches for analysing pan-genomes focus exclusively on protein-coding sequences. To address this we present Piggy, a novel pipeline that emulates Roary except that it is based only on IGRs. We demonstrate the use of Piggy for pan-genome analyses of Staphylococcus aureus and Escherichia coli using large genome datasets. For S. aureus, we show that highly divergent (“switched”) IGRs are associated with differences in gene expression, and we establish a multi-locus reference database of IGR alleles (igMLST; implemented in BIGSdb). Piggy is available at https://github.com/harry-thorpe/piggy.

Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009428 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1009428

Author(s):

Ryota Sugimoto ◽

Luca Nishimura ◽

Phuong Thanh Nguyen ◽

Jumpei Ito ◽

Nicholas F. Parrish ◽

...

Keyword(s):

De Novo ◽

Sequence Similarity ◽

Metagenomic Data ◽

Marker Genes ◽

Biological Entity ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Human Gut ◽

Protein Coding ◽

Viral Sequences

Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
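The abstract treats terminally redundant sequences as likely complete circular genomes: when the start of a linear sequence repeats at its end, the two ends plausibly overlap on a circle. A minimal sketch of that check follows, assuming exact (mismatch-free) matching; the function name and the `min_overlap` threshold are hypothetical, and this is not the authors' pipeline.

```python
def terminal_redundancy_length(seq, min_overlap=10):
    """Length of the longest prefix that is repeated exactly as a suffix.

    Only overlaps between min_overlap and half the sequence length are
    considered. Returns 0 when no such terminal repeat exists.
    """
    seq = seq.upper()
    max_len = len(seq) // 2
    # Try the longest candidate first so the first hit is the longest repeat.
    for k in range(max_len, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


if __name__ == "__main__":
    repeat = "ACGTACGTAC"
    contig = repeat + "TTTTGGGGCCCC" + repeat
    k = terminal_redundancy_length(contig)
    print("terminal repeat length:", k)
    if k:
        # Dropping one copy of the repeat gives a single circular unit.
        print("circular unit:", contig[:-k])
```

Real assemblies contain sequencing errors, so a practical version would likely tolerate mismatches (for example by aligning the two ends) rather than require identity.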
