Accurate and Complete Genomes from Metagenomes

Mapping Intimacies ◽

10.1101/808410 ◽

2019 ◽

Cited By ~ 10

Author(s):

Lin-Xing Chen ◽

Karthik Anantharaman ◽

Alon Shaiber ◽

A. Murat Eren ◽

Jillian F. Banfield

Keyword(s):

Bacterial Genome ◽

Biological Information ◽

Microbial Life ◽

Repeat Sequences ◽

Bacterial Genome Sequence ◽

Gc Skew ◽

Local Assembly ◽

Bioinformatic Approaches ◽

Reference Genomes ◽

Very High

Abstract: Genomes are an integral component of the biological information about an organism and, logically, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs), but gaps, local assembly errors, chimeras and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and in some cases achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of ~7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. Interestingly, analysis of cumulative GC skew identified potential mis-assemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.
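The cumulative GC skew mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition of GC skew, (G - C) / (G + C), computed over non-overlapping windows and summed along the sequence. The function names, the window size and the windowing scheme are illustrative assumptions, not the authors' implementation.

```python
def cumulative_gc_skew(seq, window=1000):
    """Return the running sum of per-window GC skew, one value per window.

    Skew for a window is (G - C) / (G + C); windows with no G or C
    contribute 0. The window size here is an arbitrary illustrative default.
    """
    seq = seq.upper()
    totals = []
    running = 0.0
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        running += skew
        totals.append(running)
    return totals


def skew_extrema(totals):
    """Return (index of minimum, index of maximum) of the cumulative curve."""
    lo = min(range(len(totals)), key=totals.__getitem__)
    hi = max(range(len(totals)), key=totals.__getitem__)
    return lo, hi


if __name__ == "__main__":
    toy = "GGGG" + "CCCC"  # toy sequence, not real genome data
    curve = cumulative_gc_skew(toy, window=4)
    print(curve, skew_extrema(curve))
```

For many circular bacterial chromosomes the cumulative curve has a single minimum and a single maximum, usually interpreted as the replication origin and terminus. Extra peaks or abrupt changes in slope are the kind of signal that may prompt a closer look for a mis-assembly.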


GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads

10.1101/125534 ◽

2017 ◽

Author(s):

Chong Chu ◽

Xin Li ◽

Yufeng Wu

Keyword(s):

Sequence Data ◽

Bacterial Genome ◽

Software Tool ◽

Sea Bass ◽

Short Sequence ◽

Asian Sea Bass ◽

Long Reads ◽

Local Assembly ◽

Genomic Repeats ◽

Gap Closing

Abstract. Background: Closing gaps in draft genomes is an important post-processing step in genome assembly. It leads to more complete genomes, which benefits downstream genome analysis such as annotation and genotyping. Several tools have been developed for gap closing. However, these tools don’t fully utilize the information contained in the sequence data. For example, while it is known that many gaps are caused by genomic repeats, existing tools often ignore many sequence reads that originate from a repeat-related gap. Results: In this paper, we propose a new approach called GAPPadder for gap closing. The main advantage of GAPPadder is that it uses more information in sequence data for gap closing. In particular, GAPPadder finds and uses reads that originate from repeat-related gaps. We show that these repeat-associated reads are useful for gap closing, even though they are ignored by all existing tools. Other main features of GAPPadder include utilizing the information in sequence reads with different insert sizes and performing two-stage local assembly of gap sequences. We compare GAPPadder with GapCloser, GapFiller and Sealer on one bacterial genome, human chromosome 14 and the human whole genome with paired-end and mate-paired reads with both short and long insert sizes. Empirical results show that GAPPadder can close more gaps than these existing tools. Besides closing gaps on draft genomes assembled only from short sequence reads, GAPPadder can also be used to close gaps for draft genomes assembled with long reads. We show GAPPadder can close gaps on the bed bug genome and the Asian sea bass genome that are assembled partially and fully with long reads respectively. We also show GAPPadder is efficient in both time and memory usage. The software tool, GAPPadder, is available for download at https://github.com/Reedwarbler/GAPPadder.
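As a small illustration of the input to gap closing, the sketch below lists gap coordinates in a scaffold. It assumes the common convention that gaps are written as runs of `N`. The function name `find_gaps` and the `min_len` parameter are hypothetical, and this is not GAPPadder's own code or interface.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return half-open (start, end) coordinates for each run of N or n.

    Assumes gaps are encoded as runs of N, as is conventional in scaffold
    FASTA files; min_len filters out runs shorter than the given length.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    toy_scaffold = "ACGTNNNNACGTNNAC"  # toy data for illustration only
    print(find_gaps(toy_scaffold))
    print(find_gaps(toy_scaffold, min_len=3))
```

A gap closer would then try to replace each listed interval with sequence assembled locally from reads recruited to its flanks.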


Multi-omics and metabolic modelling pipelines: challenges and tools for systems microbiology

10.1101/013532 ◽

2015 ◽

Author(s):

Marco Fondi ◽

Pietro Liò

Keyword(s):

Systems Biology ◽

Large Scale ◽

Biological Information ◽

Special Focus ◽

Metabolic Modelling ◽

Microbial Life ◽

Multi Scale ◽

Integrated Omics ◽

Molecular Components ◽

Omics Data Integration

Integrated -omics approaches are quickly spreading across microbiology research labs, leading to i) the possibility of detecting previously hidden features of microbial cells like multi-scale spatial organisation and ii) tracing molecular components across multiple cellular functional states. This promises to reduce the knowledge gap between genotype and phenotype and poses new challenges for computational microbiologists. We underline how the capability to unravel the complexity of microbial life will strongly depend on the integration of the huge and diverse amount of information that can be derived today from -omics experiments. In this work, we present opportunities and challenges of multi-omics data integration in current systems biology pipelines. We here discuss which layers of biological information are important for biotechnological and clinical purposes, with a special focus on bacterial metabolism and modelling procedures. A general review of the most recent computational tools for performing large-scale dataset integration is also presented, together with a possible framework to guide the design of systems biology experiments by microbiologists.


Complete and Circularized Bacterial Genome Sequence of Gordonia sp. Strain X0973

Microbiology Resource Announcements ◽

10.1128/mra.01479-20 ◽

2021 ◽

Vol 10 (9) ◽

Author(s):

Christopher A. Gulvik ◽

Dhwani Batra ◽

Lori A. Rowe ◽

Milli Sheth ◽

Sarah Nobles ◽

...

Keyword(s):

Genome Sequence ◽

Bacterial Genome ◽

Illumina Miseq ◽

Gram Positive ◽

Coding Sequences ◽

Content Type ◽

Weakly Acid ◽

Circular Genome ◽

Bacterial Genome Sequence ◽

Phylogenetic Neighbor

ABSTRACT Gordonia sp. strain X0973 is a Gram-positive, weakly acid-fast, aerobic actinomycete obtained from a human abscess with Gordonia araii NBRC 100433T as its closest phylogenetic neighbor. Here, we report using Illumina MiSeq and PacBio reads to assemble the complete and circular genome sequence of 3.75 Mbp with 3,601 predicted coding sequences.


Optical mapping as a routine tool for bacterial genome sequence finishing

BMC Genomics ◽

10.1186/1471-2164-8-321 ◽

2007 ◽

Vol 8 (1) ◽

pp. 321 ◽

Cited By ~ 85

Author(s):

Phil Latreille ◽

Stacie Norton ◽

Barry S Goldman ◽

John Henkhaus ◽

Nancy Miller ◽

...

Keyword(s):

Genome Sequence ◽

Optical Mapping ◽

Bacterial Genome ◽

Bacterial Genome Sequence


Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies

10.1101/2021.12.14.472616 ◽

2021 ◽

Author(s):

David A Yarmosh ◽

Juan G Lopera ◽

Nikhita P Puthuveetil ◽

Patrick Ford Combs ◽

Amy L Reese ◽

...

Keyword(s):

Bacterial Genome ◽

Data Provenance ◽

Microbial Genomics ◽

Refseq Database ◽

Biological Source ◽

Genome Assemblies ◽

Public Health Epidemiology ◽

Source Materials ◽

Reference Genomes

The quality and traceability of microbial genomics data in public databases is deteriorating as they rapidly expand and struggle to cope with data curation challenges. While the availability of public genomic data has become essential for modern life sciences research, the curation of the data is a growing area of concern that has significant real-world impacts on public health epidemiology, drug discovery, and environmental biosurveillance research. While public microbial genome databases such as NCBI's RefSeq database leverage the scalability of crowd sourcing for growth, they do not require data provenance to the original biological source materials or accurate descriptions of how the data was produced. Here, we describe the de novo assembly of 1,113 bacterial genome references produced from authenticated materials sourced from the American Type Culture Collection (ATCC), each with full data provenance. Over 98% of these ATCC Standard Reference Genomes (ASRGs) are superior to assemblies for comparable strains found in NCBI's RefSeq database. Comparative genomics analysis revealed significant issues in RefSeq bacterial genome assemblies related to genome completeness, mutations, structural differences, metadata errors, and gaps in traceability to the original biological source materials. For example, nearly half of RefSeq assemblies lack details on sample source information, sequencing technology, or bioinformatics methods. We suggest there is an intrinsic connection between the quality of genomic metadata, the traceability of the data, and the methods used to produce them with the quality of the resulting genome assemblies themselves. Our results highlight common problems with "reference genomes" and underscore the importance of data provenance for precision science and reproducibility. 
These gaps in metadata accuracy and data provenance represent an "elephant in the room" for microbial genomics research, but addressing these issues would require raising the level of accountability for data depositors and our own expectations of data quality.


Polypolish: short-read polishing of long-read bacterial genome assemblies

10.1101/2021.10.14.464465 ◽

2021 ◽

Author(s):

Ryan R Wick ◽

Kathryn E Holt

Keyword(s):

Bacterial Genome ◽

Short Read ◽

Read Alignment ◽

Short Reads ◽

Repeat Sequences ◽

Short Read Alignment ◽

Long Read ◽

Genome Assemblies ◽

Residual Errors

Long-read-only bacterial genome assemblies usually contain residual errors, most commonly homopolymer-length errors. Short-read polishing tools can use short reads to fix these errors, but most rely on short-read alignment which is unreliable in repeat regions. Errors in such regions are therefore challenging to fix and often remain after short-read polishing. Here we introduce Polypolish, a new short-read polisher which uses all-per-read alignments to repair errors in repeat sequences that other polishers cannot. In benchmarking tests using both simulated and real reads, we find that Polypolish performs well, and the best results are achieved by using Polypolish in combination with other short-read polishers.
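To make "homopolymer-length errors" concrete, the sketch below run-length encodes two sequences and reports the runs whose lengths differ. It only illustrates the error type. It is not Polypolish's algorithm, and the function names are hypothetical.

```python
from itertools import groupby


def run_lengths(seq):
    """Run-length encode a sequence as a list of (base, run length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq.upper())]


def homopolymer_length_diffs(assembly, truth):
    """List runs that differ only in length between two sequences.

    Returns None when the sequences differ in more than run lengths
    (that is, their order of bases after collapsing runs is not the same).
    Each reported item is (run index, base, assembly length, truth length).
    """
    a, t = run_lengths(assembly), run_lengths(truth)
    if [base for base, _ in a] != [base for base, _ in t]:
        return None
    return [
        (i, base, len_a, len_t)
        for i, ((base, len_a), (_, len_t)) in enumerate(zip(a, t))
        if len_a != len_t
    ]


if __name__ == "__main__":
    # Toy example: the assembly has a run of 4 G where the truth has 5.
    print(homopolymer_length_diffs("ACGGGGT", "ACGGGGGT"))
```

A polisher would use read evidence rather than a known truth sequence to decide the correct run length. The comparison here only shows what such an error looks like.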
