Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

Valerie A. Schneider; Tina Graves-Lindsay; Kerstin Howe; Nathan Bouk; Hsiu-Chuan Chen; Paul A. Kitts; Terence D. Murphy; Kim D. Pruitt; Françoise Thibaud-Nissen; Derek Albracht; Robert S. Fulton; Milinn Kremitzki; Vincent Magrini; Chris Markovic; Sean McGrath; Karyn Meltz Steinberg; Kate Auger; William Chow; Joanna Collins; Glenn Harden; Timothy Hubbard; Sarah Pelan; Jared T. Simpson; Glen Threadgold; James Torrance; Jonathan M. Wood; Laura Clarke; Sergey Koren; Matthew Boitano; Paul Peluso; Heng Li; Chen-Shan Chin; Adam M. Phillippy; Richard Durbin; Richard K. Wilson; Paul Flicek; Evan E. Eichler; Deanna M. Church

doi:10.1101/gr.213611.116

10.1101/072116 ◽

2016 ◽

Cited By ~ 8

Author(s):

Valerie A. Schneider ◽

Tina Graves-Lindsay ◽

Kerstin Howe ◽

Nathan Bouk ◽

Hsiu-Chuan Chen ◽

...

Keyword(s):

De Novo ◽

Genome Mapping ◽

Population Variation ◽

Reference Assembly ◽

Sequence Generation ◽

Long Read ◽

Genomic Regions ◽

Genome Assemblies ◽

First Time

Abstract: The human reference genome assembly plays a central role in nearly all aspects of today’s basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009 and reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that while the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

VGEA: an RNA viral assembly toolkit

PeerJ ◽

10.7717/peerj.12129 ◽

2021 ◽

Vol 9 ◽

pp. e12129

Author(s):

Paul E. Oluniyi ◽

Fehintola Ajogbasile ◽

Judith Oguzie ◽

Jessica Uwanibe ◽

Adeyemi Kayode ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Workflow Management ◽

Viral Population ◽

Lassa Virus ◽

Viral Genomes ◽

Bioinformatics Tools ◽

Reference Sequences ◽

Genome Assemblies

Next generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information; these data can be used to study viral epidemiology, evolution, and transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, represent a great challenge to bioinformatics because of their high mutation rate and their tendency to form quasispecies within the same infected host, creating the need for advanced bioinformatics tools that assemble consensus genomes representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry out de novo or reference-assisted assembly of viral genomes, and assess the quality of the genomes obtained. Most of these tools, however, exist as standalone workflows and usually require huge computational resources. Here we present VGEA (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split bam files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences, and evaluate/compare genome assemblies. We designed a project with the aim of creating a flexible, easy-to-use and all-in-one pipeline from existing/stand-alone bioinformatics tools for viral genome analysis that can be deployed on a personal computer.
VGEA was built on the Snakemake workflow management system and utilizes existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control, BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome, SAMtools (Li et al., 2009) for extracting unmapped reads and also for splitting bam files into fastq files, IVA (Hunt et al., 2015) for de novo assembly to generate contigs, shiver (Wymant et al., 2018) to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit (Shen et al., 2016) for cleaning shiver assembly for QUAST, QUAST (Gurevich et al., 2013) to evaluate/assess the quality of genome assemblies and MultiQC (Ewels et al., 2016) for aggregation of the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa Virus (n = 20) datasets all of which have been made publicly available. VGEA is freely available on GitHub at: https://github.com/pauloluniyi/VGEA under the GNU General Public License.
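
The tool chain described above can be summarized as an ordered list. The following is only an illustrative sketch, not VGEA's actual Snakemake rules: the tool names and roles are taken from the abstract in the order it lists them, while the `PIPELINE_STEPS` structure and the `describe_pipeline` helper are hypothetical names introduced here.

```python
# Tool names and roles come from the abstract; the list structure and the
# helper function are hypothetical and are not part of VGEA itself.
PIPELINE_STEPS = [
    ("fastp", "read trimming and read-level quality control"),
    ("BWA", "mapping reads to the human reference genome"),
    ("SAMtools", "extracting unmapped reads and splitting bam files into fastq files"),
    ("IVA", "de novo assembly to generate contigs"),
    ("shiver", "read pre-processing and mapping to a sample-tailored reference"),
    ("SeqKit", "cleaning the shiver assembly for QUAST"),
    ("QUAST", "assessing the quality of genome assemblies"),
    ("MultiQC", "aggregating results from fastp, BWA and QUAST"),
]


def describe_pipeline(steps):
    """Return one numbered line per step, in the order the steps are listed."""
    return [f"{i}. {tool}: {role}" for i, (tool, role) in enumerate(steps, start=1)]


for line in describe_pipeline(PIPELINE_STEPS):
    print(line)
```

For the real rule definitions and their dependencies, the repository linked in the abstract is the authoritative source.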

dnAQET: a framework to compute a consolidated metric for benchmarking quality of de novo assemblies

BMC Genomics ◽

10.1186/s12864-019-6070-x ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Gokhan Yavas ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Quality Assessment ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Quality Score ◽

De Novo Genome Assembly ◽

Genome Assemblies ◽

Reference Genomes ◽

Better Than

Abstract Background: Accurate de novo genome assembly has become reality with the advancements in sequencing technology. With the ever-increasing number of de novo genome assembly tools, assessing the quality of assemblies has become of great importance in genome research. Although many quality metrics have been proposed and software tools for calculating those metrics have been developed, the existing tools do not produce a unified measure to reflect the overall quality of an assembly. Results: To address this issue, we developed the de novo Assembly Quality Evaluation Tool (dnAQET) that generates a unified metric for benchmarking the quality assessment of assemblies. Our framework first calculates individual quality scores for the scaffolds/contigs of an assembly by aligning them to a reference genome. Next, it computes a quality score for the assembly using its overall reference genome coverage, the quality score distribution of its scaffolds and the redundancy identified in it. Using synthetic assemblies randomly generated from the latest human genome build, various builds of the reference genomes for five organisms and six de novo assemblies for sample NA24385, we tested dnAQET to assess its capability for benchmarking quality evaluation of genome assemblies. For synthetic data, our quality score increased with decreasing number of misassemblies and redundancy and increasing average contig length and coverage, as expected. For genome builds, dnAQET quality score calculated for a more recent reference genome was better than the score for an older version. To compare with some of the most frequently used measures, 13 other quality measures were calculated. The quality score from dnAQET was found to be better than all other measures in terms of consistency with the known quality of the reference genomes, indicating that dnAQET is reliable for benchmarking quality assessment of de novo genome assemblies.
Conclusions: The dnAQET is a scalable framework designed to evaluate a de novo genome assembly based on the aggregated quality of its scaffolds (or contigs). Our results demonstrated that dnAQET quality score is reliable for benchmarking quality assessment of genome assemblies. The dnAQET can help researchers to identify the most suitable assembly tools and to select high quality assemblies generated.

Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352 ◽

2015 ◽

Author(s):

Alejandro Hernandez Wences ◽

Michael Schatz

Keyword(s):

Open Source ◽

Genome Assembly ◽

De Novo ◽

A Genome ◽

Genome Assemblies ◽

Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.
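
The figure of 120 is consistent with the number of orderings of five items (5! = 120). As a small worked check, assuming that each "permutation" is one order in which all five assemblies are merged, the orderings can be enumerated as follows; the labels `A` to `E` and the `merge_orders` helper are placeholders, since the abstract does not name the assemblies.

```python
from itertools import permutations
from math import factorial

# Hypothetical labels; the abstract does not name the five assemblies.
assemblies = ["A", "B", "C", "D", "E"]


def merge_orders(names):
    """Return every ordering in which the given assemblies could be merged."""
    return list(permutations(names))


orders = merge_orders(assemblies)
print(len(orders), factorial(len(assemblies)))
```

The sketch only counts orderings; it does not perform any merging.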

High-Quality Assembly of an Individual of Yoruban Descent

10.1101/067447 ◽

2016 ◽

Cited By ~ 9

Author(s):

Karyn Meltz Steinberg ◽

Tina Graves Lindsay ◽

Valerie A. Schneider ◽

Mark J.P. Chaisson ◽

Chad Tomlinson ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Bac Library ◽

Segmental Duplications ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Genome Assemblies ◽

Complete Genomic

Abstract: De novo assembly of human genomes is now a tractable effort due in part to advances in sequencing and mapping technologies. We use PacBio single-molecule, real-time (SMRT) sequencing and BioNano genomic maps to construct the first de novo assembly of NA19240, a Yoruban individual from Africa. This chromosome-scaffolded assembly of 3.08 Gb with a contig N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb represents one of the most contiguous high-quality human genomes. We utilize a BAC library derived from NA19240 DNA and novel haplotype-resolving sequencing technologies and algorithms to characterize regions of complex genomic architecture that are normally lost due to compression to a linear haploid assembly. Our results demonstrate that multiple technologies are still necessary for complete genomic representation, particularly in regions of highly identical segmental duplications. Additionally, we show that diploid assembly has utility in improving the quality of de novo human genome assemblies.
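
The contig and scaffold N50 values quoted above use the standard contiguity statistic: the length L such that sequences of length L or longer together contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions and are not data from the NA19240 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Walk the lengths from longest to shortest and return the first length
    at which the running total reaches at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (not real data): total = 290, half = 145.
# Running totals are 80, then 150, so the N50 is 70.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract pairs it with other quality evidence.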

Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C

Nature Communications ◽

10.1038/s41467-020-20536-y ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zev N. Kronenberg ◽

Arang Rhie ◽

Sergey Koren ◽

Gregory T. Concepcion ◽

Paul Peluso ◽

...

Keyword(s):

Zebra Finch ◽

Cultured Cells ◽

De Novo ◽

Single Cells ◽

Variant Calling ◽

Chromatin Interaction ◽

Extended Haplotype ◽

Benchmark Datasets ◽

And Performance ◽

Genome Assemblies

Abstract: Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially-phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without having parental data and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97% compared to 80–91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.

Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies

Briefings in Bioinformatics ◽

10.1093/bib/bbr074 ◽

2011 ◽

Vol 14 (2) ◽

pp. 213-224 ◽

Cited By ~ 32

Author(s):

M. C. Schatz ◽

A. M. Phillippy ◽

D. D. Sommer ◽

A. L. Delcher ◽

D. Puiu ◽

...

Keyword(s):

Genome Assemblies

What Is Known and Unknown About Twice-Weekly Hemodialysis

Blood Purification ◽

10.1159/000441577 ◽

2015 ◽

Vol 40 (4) ◽

pp. 298-305 ◽

Cited By ~ 21

Author(s):

Yoshitsugu Obi ◽

Rieko Eriguchi ◽

Shuo-Ming Ou ◽

Connie M. Rhee ◽

Kamyar Kalantar-Zadeh

Keyword(s):

Quality Of Life ◽

De Novo ◽

Urine Volume ◽

Cost Effective ◽

Standard Of Care ◽

Treatment Regimens ◽

Interdialytic Weight Gain ◽

Cost Effective Treatment ◽

Incremental Hemodialysis

Background: The 2006 Kidney Disease Outcomes Quality Initiative guidelines suggest twice-weekly or incremental hemodialysis for patients with substantial residual kidney function (RKF). However, in most affluent nations de novo and abrupt transition to thrice-weekly hemodialysis is routinely prescribed for all dialysis-naïve patients regardless of their RKF. We review historical developments in hemodialysis therapy initiation and revisit twice-weekly hemodialysis as an individualized, incremental treatment especially upon first transitioning to hemodialysis therapy. Summary: In the 1960s, hemodialysis treatment was first offered as a life-sustaining treatment in the form of long sessions (≥10 hours) administered every 5 to 7 days. Twice- and then thrice-weekly treatment regimens were subsequently developed to prevent uremic symptoms on a long-term basis. The thrice-weekly regimen has since become the ‘standard of care’ despite a lack of comparative studies. Some clinical studies have shown benefits of high hemodialysis dose by more frequent or longer treatment times mainly among patients with limited or no RKF. Conversely, in selected patients with higher levels of RKF and particularly higher urine volume, incremental or twice-weekly hemodialysis may preserve RKF and vascular access longer without compromising clinical outcomes. Proposed criteria for twice-weekly hemodialysis include urine output >500 ml/day, limited interdialytic weight gain, smaller body size relative to RKF, and favorable nutritional status, quality of life, and comorbidity profile. Key Messages: Incremental hemodialysis including twice-weekly regimens may be safe and cost-effective treatment regimens that provide better quality of life for incident dialysis patients who have substantial RKF. These proposed criteria may guide incremental hemodialysis frequency and warrant future randomized controlled trials.

Donkey genome and insight into the imprinting of fast karyotype evolution

Scientific Reports ◽

10.1038/srep14106 ◽

2015 ◽

Vol 5 (1) ◽

Cited By ~ 12

Author(s):

Jinlong Huang ◽

Yiping Zhao ◽

Dongyi Bai ◽

Wunierfu Shiraigol ◽

Bei Li ◽

...

Keyword(s):

Target Genes ◽

Karyotype Evolution ◽

De Novo ◽

Cycle Phase ◽

Satellite Sequences ◽

Chromatid Segregation ◽

Karyotypic Instability ◽

Wild Ass ◽

Genome Assemblies ◽

Insight Into

Abstract: The donkey, like the horse, is a promising model for exploring karyotypic instability. We report the de novo whole-genome assemblies of the donkey and the Asiatic wild ass. Our results reflect the distinct characteristics of donkeys, including more effective energy metabolism and better immunity than horses. The donkey shows a steady demographic trajectory. We detected abundant satellite sequences in some inactive centromere regions but not in neocentromere regions, while ribosomal RNAs frequently emerged in neocentromere regions but not in the obsolete centromere regions. Expanded miRNA families and five newly discovered miRNA target genes involved in meiosis may be associated with fast karyotype evolution. APC/C, which controls sister chromatid segregation, cytokinesis and the establishment of the G1 cell cycle phase, was identified by analysis of miRNA targets and rapidly evolving genes.

Facile, High Quality Sequencing of Bacterial Genomes from Small Amounts of DNA

International Journal of Genomics ◽

10.1155/2014/434575 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8

Author(s):

Momchilo Vuyisich ◽

Ayesha Arefin ◽

Karen Davenport ◽

Shihai Feng ◽

Cheryl Gleasner ◽

...

Keyword(s):

Genomic Dna ◽

De Novo ◽

Gc Content ◽

Library Preparation ◽

Sequencing Data ◽

Bacterial Genomes ◽

Dna Amount ◽

High Quality ◽

Preparation Methods

Sequencing bacterial genomes has traditionally required large amounts of genomic DNA (~1 μg). There have been few studies to determine the effects of the input DNA amount or library preparation method on the quality of sequencing data. Several new commercially available library preparation methods enable shotgun sequencing from as little as 1 ng of input DNA. In this study, we evaluated the NEBNext Ultra library preparation reagents for sequencing bacterial genomes. We have evaluated the utility of NEBNext Ultra for resequencing and de novo assembly of four bacterial genomes and compared its performance with the TruSeq library preparation kit. The NEBNext Ultra reagents enable high quality resequencing and de novo assembly of a variety of bacterial genomes when using 100 ng of input genomic DNA. For the two most challenging genomes (Burkholderia spp.), which have the highest GC content and are the longest, we also show that the quality of both resequencing and de novo assembly is not decreased when only 10 ng of input genomic DNA is used.
