Comparative Annotation Toolkit (CAT) - simultaneous clade and personal genome annotation

Mapping Intimacies ◽

10.1101/231118 ◽

2017 ◽

Cited By ~ 6

Author(s):

Ian T. Fiddes ◽

Joel Armstrong ◽

Mark Diekhans ◽

Stefanie Nachtweide ◽

Zev N. Kronenberg ◽

...

Keyword(s):

Genome Annotation ◽

De Novo ◽

Low Cost ◽

Great Apes ◽

Personal Genome ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Read ◽

Genome Assemblies ◽

Rat Genome

The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality de novo sequence assembly a realistic proposition. The result is an explosion of new, ultra-contiguous genome assemblies. To compare these genomes we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate personal, diploid human genomes. We demonstrate the resulting discovery of novel genes, isoforms and structural variants, even in genomes as well studied as rat and the great apes, and how these annotations improve cross-species RNA expression experiments.


WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

10.1101/840447 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alex Di Genova ◽

Elena Buena-Atienza ◽

Stephan Ossowski ◽

Marie-France Sagot

Keyword(s):

De Novo ◽

Computational Cost ◽

Sequence Information ◽

Sequencing Data ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 16.67-62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan
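To illustrate the contiguity metrics quoted in this abstract, the following minimal Python sketch computes N50 and NG50 from a list of contig lengths. The function names and the toy lengths are assumptions made for illustration and are not taken from WENGAN. NGA50, which additionally breaks contigs at misassemblies against a reference, is not covered here.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 for the given contigs.

    NG50 is the length L of the contig at which the running total of
    contig lengths, taken from longest to shortest, first reaches half
    of the (estimated) genome size. Returns 0 when the contigs together
    cover less than half of the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


def n50(contig_lengths):
    """Return the N50: the same statistic, with the assembly's own total
    length used in place of the genome size."""
    return ng50(contig_lengths, sum(contig_lengths))


# Toy contig lengths (illustrative values only, not real data).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))         # uses the assembly total of 290
print(ng50(toy_contigs, 400))   # uses an assumed genome size of 400
```

Because NG50 is normalized by genome size rather than assembly size, it allows assemblies of different total length to be compared against the same target, which is why the abstract can compare its contig NG50 with that of GRCh38.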


Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

10.1101/715722 ◽

2019 ◽

Cited By ~ 21

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Ryan Lorig-Roach ◽

Marina Haukness ◽

Hugh E. Olsen ◽

...

Keyword(s):

Human Genome ◽

De Novo ◽

Proximity Ligation ◽

Current State ◽

Human Genomes ◽

Sequencing Method ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Assembly Performance

Present workflows for producing human genome assemblies from long-read technologies have cost and production time bottlenecks that prohibit efficient scaling to large cohorts. We demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. The sequencing, performed on one machine in nine days, achieved an average 63x coverage, 42 Kb read N50, 90% median read identity and 6.5x coverage in 100 Kb+ reads using just three flow cells per sample. To assemble these data we introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms. On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. We evaluate assembly performance for diploid, haploid and trio-binned human samples in terms of accuracy, cost, and time and demonstrate improvements relative to current state-of-the-art methods in all areas. We further show that addition of proximity ligation (Hi-C) sequencing yields near chromosome-level scaffolds for all eleven genomes.
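The pairing of 99.9% identity with QV30 in this abstract follows from the standard Phred scaling, QV = -10 * log10(error rate). The short Python sketch below shows the conversion in both directions. The function names are hypothetical and are not part of Shasta, MarginPolish or HELEN.

```python
import math


def qv_to_identity(qv):
    """Convert a Phred-scaled quality value to a fractional identity.

    The per-base error rate is 10 ** (-QV / 10), and identity is
    one minus that error rate.
    """
    return 1.0 - 10.0 ** (-qv / 10.0)


def identity_to_qv(identity):
    """Convert a fractional identity (strictly below 1.0) to a Phred QV."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1)")
    return -10.0 * math.log10(1.0 - identity)


print(round(qv_to_identity(30), 6))     # identity implied by QV30
print(round(identity_to_qv(0.999), 3))  # QV implied by 99.9% identity
```

Each additional 10 QV points corresponds to a tenfold reduction in the per-base error rate, so QV40 would correspond to 99.99% identity.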


Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


How well can we create phased, diploid, human genomes?: An assessment of FALCON-Unzip phasing using a human trio

10.1101/262196 ◽

2018 ◽

Cited By ~ 4

Author(s):

Arkarachai Fungtammasan ◽

Brett Hannigan

Keyword(s):

De Novo ◽

Personal Genome ◽

Specific Expression ◽

Human Genomes ◽

Future Improvement ◽

Long Read ◽

Allele Specific ◽

Personal Genomes ◽

Reference Genomes ◽

Haplotype Information

Long read sequencing technology has allowed researchers to create de novo assemblies with impressive continuity[1,2]. This advancement has dramatically increased the number of reference genomes available and hints at the possibility of a future where personal genomes are assembled rather than resequenced. In 2016 Pacific Biosciences released the FALCON-Unzip framework, which can provide long, phased haplotype contigs from de novo assemblies. This phased genome algorithm enhances assembly accuracy for highly heterozygous organisms and allows researchers to explore questions that require haplotype information such as allele-specific expression and regulation. However, validation of this technique has been limited to small genomes or inbred individuals[3]. As a roadmap to personal genome assembly and phasing, we assess the phasing accuracy of FALCON-Unzip in humans using publicly available data for the Ashkenazi trio from the Genome in a Bottle Consortium[4]. To assess the accuracy of the Unzip algorithm, we assembled the genome of the son using FALCON and FALCON-Unzip, genotyped publicly available short read data for the mother and the father, and observed the inheritance pattern of the parental SNPs along the phased genome of the son. We found that 72.8% of haplotype contigs share SNPs with only one parent, suggesting that these contigs are correctly phased. Most mis-phased SNPs are random but present in high frequency toward the end of haplotype contigs. Approximately 20.7% of mis-phased haplotype contigs contain clusters of mis-phased SNPs, suggesting that haplotypes were mis-joined by FALCON-Unzip. Mis-joined boundaries in those contigs are located in areas of low SNP density. This research demonstrates that the FALCON-Unzip algorithm can be used to create long and accurate haplotypes for humans and identifies problematic regions that could benefit from future improvement.
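The trio-based check described in this abstract can be sketched roughly as follows. This is a simplified illustration only: it assumes each informative SNP on a haplotype contig has already been labelled with the parent it can be traced to ("M" for mother, "F" for father), and it is not the authors' pipeline. The function names, the labels and the toy contigs are all hypothetical.

```python
def classify_contig(snp_origins):
    """Classify one haplotype contig from its parent-of-origin SNP labels.

    snp_origins is a list of "M" / "F" labels, one per informative SNP.
    A contig whose SNPs all trace to the same parent is treated as
    consistently phased; a contig with SNPs from both parents is "mixed".
    """
    parents_seen = set(snp_origins)
    if not parents_seen:
        return "uninformative"
    if len(parents_seen) == 1:
        return "single-parent"
    return "mixed"


def fraction_single_parent(contigs):
    """Return the fraction of informative contigs whose SNPs come from
    only one parent. contigs maps a contig name to its list of labels."""
    labels = [classify_contig(origins) for origins in contigs.values()]
    informative = [label for label in labels if label != "uninformative"]
    if not informative:
        return 0.0
    return informative.count("single-parent") / len(informative)


# Toy example (made-up labels, not data from the study).
toy = {
    "hap_000": ["M", "M", "M"],
    "hap_001": ["F", "F"],
    "hap_002": ["M", "F", "M"],
    "hap_003": [],
}
print(fraction_single_parent(toy))
```

A fuller analysis would also look at where along a mixed contig the parent label switches, since the abstract distinguishes scattered mis-phased SNPs from clustered ones that suggest a mis-join.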


High-Quality Assembly of an Individual of Yoruban Descent

10.1101/067447 ◽

2016 ◽

Cited By ~ 9

Author(s):

Karyn Meltz Steinberg ◽

Tina Graves Lindsay ◽

Valerie A. Schneider ◽

Mark J.P. Chaisson ◽

Chad Tomlinson ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Bac Library ◽

Segmental Duplications ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Genome Assemblies ◽

Complete Genomic

De novo assembly of human genomes is now a tractable effort due in part to advances in sequencing and mapping technologies. We use PacBio single-molecule, real-time (SMRT) sequencing and BioNano genomic maps to construct the first de novo assembly of NA19240, a Yoruban individual from Africa. This chromosome-scaffolded assembly of 3.08 Gb with a contig N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb represents one of the most contiguous high-quality human genomes. We utilize a BAC library derived from NA19240 DNA and novel haplotype-resolving sequencing technologies and algorithms to characterize regions of complex genomic architecture that are normally lost due to compression to a linear haploid assembly. Our results demonstrate that multiple technologies are still necessary for complete genomic representation, particularly in regions of highly identical segmental duplications. Additionally, we show that diploid assembly has utility in improving the quality of de novo human genome assemblies.


HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution

10.1101/062117 ◽

2016 ◽

Cited By ~ 4

Author(s):

Govinda M. Kamath ◽

Ilan Shomorony ◽

Fei Xia ◽

Thomas A. Courtade ◽

David N. Tse

Keyword(s):

Gold Standard ◽

De Novo ◽

Error Resilience ◽

De Bruijn Graph ◽

Sequencing Technologies ◽

Long Read ◽

De Bruijn ◽

Genome Assemblies

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce mis-assemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding "hinges" to reads for constructing an overlap graph where only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read bacterial datasets from the NCTC project. HINGE produces more finished assemblies than Miniasm and the manual pipeline of NCTC based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 datasets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.


Haplotype-resolved diverse human genomes and integrated analysis of structural variation

Science ◽

10.1126/science.abf7117 ◽

2021 ◽

Vol 372 (6537) ◽

pp. eabf7117 ◽

Cited By ~ 4

Author(s):

Peter Ebert ◽

Peter A. Audano ◽

Qihui Zhu ◽

Bernardo Rodriguez-Martin ◽

David Porubsky ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Mobile Element ◽

Integrated Analysis ◽

Base Pairs ◽

Adaptive Selection ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Read

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.


Higher quality de novo genome assemblies from degraded museum specimens: a linked-read approach to museomics

10.1101/716506 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jocelyn P. Colella ◽

Anna Tigano ◽

Matthew D. MacManes

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Deer Mouse ◽

Cost Effective ◽

Molecular Data ◽

Degraded Dna ◽

Museum Specimens ◽

Sequencing Technologies ◽

Long Read ◽

Genome Assemblies

High-throughput sequencing technologies are a proposed solution for accessing the molecular data in historic specimens. However, degraded DNA combined with the computational demands of short-read assemblies has posed significant laboratory and bioinformatics challenges. Linked-read or ‘synthetic long-read’ sequencing technologies, such as 10X Genomics, may provide a cost-effective alternative solution to assemble higher quality de novo genomes from degraded specimens. Here, we compare assembly quality (e.g., genome contiguity and completeness, presence of orthogroups) between four published genomes assembled from a single shotgun library and four deer mouse (Peromyscus spp.) genomes assembled using 10X Genomics technology. At a similar price-point, these approaches produce vastly different assemblies, with linked-read assemblies having overall higher quality, measured by larger N50 values and greater gene content. Although not without caveats, our results suggest that linked-read sequencing technologies may represent a viable option to build de novo genomes from historic museum specimens, which may prove particularly valuable for extinct, rare, or difficult to collect taxa.


Yet another de novo genome assembler

10.1101/656306 ◽

2019 ◽

Cited By ~ 4

Author(s):

Robert Vaser ◽

Mile Šikić

Keyword(s):

De Novo ◽

Sequence Classification ◽

De Novo Genome Assembly ◽

Development Fund ◽

European Regional Development Fund ◽

Sequencing Technologies ◽

Single Genome ◽

Long Read ◽

Metagenome Assembly ◽

Genome Assemblies

Advances in sequencing technologies have pushed the limits of genome assemblies beyond imagination. The sheer amount of long read data that is being generated enables the assembly of even the largest and most complex organisms, for which efficient algorithms are needed. We present a new tool, called Ra, for de novo genome assembly of long uncorrected reads. It is a fast and memory friendly assembler based on sequence classification and assembly graphs, developed with large genomes in mind. It is freely available at https://github.com/lbcb-sci/ra. This work has been supported in part by the Croatian Science Foundation under the project Single genome and metagenome assembly (IP-2018-01-5886), and in part by the European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS). In addition, M.Š. is partly supported by funding from A*STAR, Singapore.


Deep repeat resolution—the assembly of the Drosophila Histone Complex

Nucleic Acids Research ◽

10.1093/nar/gky1194 ◽

2018 ◽

Vol 47 (3) ◽

pp. e18-e18 ◽

Cited By ~ 2

Author(s):

Philipp Bongartz ◽

Siegfried Schloissnig

Keyword(s):

De Novo ◽

Machine Learning Algorithms ◽

Single Nucleotide Variants ◽

Major Step ◽

Base Pairs ◽

Sequencing Technologies ◽

Long Reads ◽

Wide Range ◽

Long Read ◽

Genome Assemblies

Though the advent of long-read sequencing technologies has led to a leap in contiguity of de novo genome assemblies, current reference genomes of higher organisms still do not provide unbroken sequences of complete chromosomes. Despite reads in excess of 30 000 base pairs, there are still repetitive structures that cannot be resolved by current state-of-the-art assemblers. The most challenging of these structures are tandemly arrayed repeats, which occur in the genomes of all eukaryotes. Untangling tandem repeat clusters is exceptionally difficult, since the rare differences between repeat copies are obscured by the high error rate of long reads. Solving this problem would constitute a major step towards computing fully assembled genomes. Here, we demonstrate by example of the Drosophila Histone Complex that via machine learning algorithms, it is possible to exploit the underlying distinguishing patterns of single nucleotide variants of repeats from very noisy data to resolve a large and highly conserved repeat cluster. The ideas explored in this paper are a first step towards the automated assembly of complex repeat structures and promise to be applicable to a wide range of eukaryotic genomes.
