Identifying and removing haplotypic duplication in primary genome assemblies

Dengfeng Guan; Shane A McCarthy; Jonathan Wood; Kerstin Howe; Yadong Wang; Richard Durbin

doi:10.1093/bioinformatics/btaa025


Bioinformatics ◽

10.1093/bioinformatics/btaa025 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2896-2898 ◽

Cited By ~ 22

Author(s):

Dengfeng Guan ◽

Shane A McCarthy ◽

Jonathan Wood ◽

Kerstin Howe ◽

Yadong Wang ◽

...

Keyword(s):

Gene Annotation ◽

Sequence Similarity ◽

Rapid Development ◽

Relevant Information ◽

Read Depth ◽

Supplementary Information ◽

Long Read ◽

Reference Quality ◽

Fully Automatic ◽

Genome Assemblies

Abstract Motivation Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. Availability and implementation The source code is written in C and is available at https://github.com/dfguan/purge_dups. Supplementary information Supplementary data are available at Bioinformatics online.
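The abstract names two signals, read depth and sequence similarity, without giving the algorithm. The following is a minimal sketch of how those two signals could be combined, not the purge_dups implementation: the function names, the 25% depth tolerance, the 0.8 overlap cutoff, and the rule of flagging the shorter contig are all assumptions made for illustration.

```python
def classify_by_depth(mean_depth, diploid_depth, tolerance=0.25):
    """Label contigs by mean read depth relative to the diploid depth.

    Hypothetical rule: a contig sequenced at about half the diploid depth
    probably represents only one haplotype ("half"); a contig at about the
    full diploid depth has both haplotypes collapsed into it ("full").
    """
    half = diploid_depth / 2.0
    labels = {}
    for name, depth in mean_depth.items():
        if abs(depth - half) <= tolerance * half:
            labels[name] = "half"
        elif abs(depth - diploid_depth) <= tolerance * diploid_depth:
            labels[name] = "full"
        else:
            labels[name] = "other"
    return labels


def flag_duplicates(mean_depth, lengths, overlaps, diploid_depth, min_cov=0.8):
    """Return contigs flagged as likely haplotypic duplicates.

    `overlaps` holds (contig_a, contig_b, covered_fraction) tuples from a
    self-alignment of the assembly (an assumed input format). A pair is
    flagged only when both contigs are at half depth AND they align well,
    and the shorter contig of the pair is the one flagged.
    """
    labels = classify_by_depth(mean_depth, diploid_depth)
    flagged = set()
    for a, b, cov in overlaps:
        if cov >= min_cov and labels[a] == "half" and labels[b] == "half":
            flagged.add(a if lengths[a] < lengths[b] else b)
    return flagged


# Toy input: ctg2 and ctg3 are half-depth and align to each other.
depths = {"ctg1": 60.0, "ctg2": 31.0, "ctg3": 29.0, "ctg4": 5.0}
lengths = {"ctg1": 900_000, "ctg2": 500_000, "ctg3": 120_000, "ctg4": 40_000}
overlaps = [("ctg2", "ctg3", 0.9), ("ctg1", "ctg4", 0.95)]

labels = classify_by_depth(depths, diploid_depth=60.0)
flagged = flag_duplicates(depths, lengths, overlaps, diploid_depth=60.0)
```

Requiring both signals is the point of the sketch: depth alone cannot tell a haplotig from a low-coverage region, and similarity alone cannot tell a haplotig from a genuine repeat.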


Identifying and removing haplotypic duplication in primary genome assemblies

10.1101/729962 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dengfeng Guan ◽

Shane A. McCarthy ◽

Jonathan Wood ◽

Kerstin Howe ◽

Yadong Wang ◽

...

Keyword(s):

Gene Annotation ◽

Sequence Similarity ◽

Rapid Development ◽

Relevant Information ◽

Read Depth ◽

Current Standard ◽

Long Read ◽

Reference Quality ◽

Fully Automatic ◽

Genome Assemblies

Abstract Motivation Rapid development in long read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either only focus on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results Here we present a novel tool “purge_dups” that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with the current standard, purge_haplotigs, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can be easily integrated into assembly pipelines. Availability The source code is written in C and is available at https://github.com/dfguan/[email protected], [email protected]


Understanding the Early Evolutionary Stages of a Tandem Drosophila melanogaster-Specific Gene Family: A Structural and Functional Population Study

Molecular Biology and Evolution ◽

10.1093/molbev/msaa109 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2584-2600 ◽

Cited By ~ 3

Author(s):

Bryan D Clifton ◽

Jamie Jimenez ◽

Ashlyn Kimura ◽

Zeinab Chahine ◽

Pablo Librado ◽

...

Keyword(s):

Gene Family ◽

Sequence Similarity ◽

Gene Families ◽

Read Depth ◽

Specific Gene ◽

Protein Variant ◽

Protein Coding ◽

Expression Levels ◽

Number Variation ◽

Reference Quality

Abstract Gene families underlie genetic innovation and phenotypic diversification. However, our understanding of the early genomic and functional evolution of tandemly arranged gene families remains incomplete as paralog sequence similarity hinders their accurate characterization. The Drosophila melanogaster-specific gene family Sdic is tandemly repeated and impacts sperm competition. We scrutinized Sdic in 20 geographically diverse populations using reference-quality genome assemblies, read-depth methodologies, and qPCR, finding that ∼90% of the individuals harbor 3–7 copies as well as evidence of population differentiation. In strains with reliable gene annotations, copy number variation (CNV) and differential transposable element insertions distinguish one structurally distinct version of the Sdic region per strain. All 31 annotated copies featured protein-coding potential and, based on the protein variant encoded, were categorized into 13 paratypes differing in their 3′ ends, with 3–5 paratypes coexisting in any strain examined. Despite widespread gene conversion, the only copy present in all strains has functionally diverged at both coding and regulatory levels under positive selection. Contrary to artificial tandem duplications of the Sdic region that resulted in increased male expression, CNV in cosmopolitan strains did not correlate with expression levels, likely as a result of differential genome modifier composition. Duplicating the region did not enhance sperm competitiveness, suggesting a fitness cost at high expression levels or a plateau effect. Beyond facilitating a minimally optimal expression level, Sdic CNV acts as a catalyst of protein and regulatory diversity, showcasing a possible evolutionary path recently formed tandem multigene families can follow toward long-term consolidation in eukaryotic genomes.


Purge Haplotigs: Synteny Reduction for Third-gen Diploid Genome Assemblies

10.1101/286252 ◽

2018 ◽

Cited By ~ 7

Author(s):

Michael J Roach ◽

Simon Schmidt ◽

Anthony R Borneman

Keyword(s):

De Novo ◽

Haplotype Reconstruction ◽

Minimal Impact ◽

Variant Discovery ◽

Rapid Release ◽

Long Read ◽

Recent Developments ◽

Reference Quality ◽

Downstream Analysis ◽

Genome Assemblies

Abstract Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembling highly heterozygous genomes is still facing a major problem where the two haplotypes for a region are highly polymorphic and the synteny is not recognised during assembly. This causes issues with downstream analysis, for example variant discovery using the haploid assembly, or haplotype reconstruction using the diploid assembly. A new pipeline, Purge Haplotigs, was developed specifically for third-gen assemblies to identify and reassign the duplicate contigs. The pipeline takes a draft haplotype-fused assembly or a diploid assembly, and read alignments to produce an improved assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing. All assemblies after processing with Purge Haplotigs were less duplicated with minimal impact on genome completeness. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.


rMETL: sensitive mobile element insertion detection with long read realignment

Bioinformatics ◽

10.1093/bioinformatics/btz106 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3484-3486 ◽

Cited By ~ 3

Author(s):

Tao Jiang ◽

Bo Liu ◽

Junyi Li ◽

Yadong Wang

Keyword(s):

Rapid Development ◽

Mobile Element ◽

Error Rates ◽

Supplementary Information ◽

Sequencing Error ◽

Complex Signals ◽

Element Insertion ◽

Sequencing Technologies ◽

Long Read ◽

Mobile Element Insertion

Abstract Summary Mobile element insertion (MEI) is a major category of structural variations (SVs). The rapid development of long read sequencing technologies provides the opportunity to detect MEIs sensitively. However, the signals of MEI implied by noisy long reads are highly complex due to the repetitiveness of mobile elements as well as the high sequencing error rates. Herein, we propose the Realignment-based Mobile Element insertion detection Tool for Long read (rMETL). Benchmarking results of simulated and real datasets demonstrate that rMETL can handle the complex signals to discover MEIs sensitively. It is suited to produce high-quality MEI callsets in many genomics studies. Availability and implementation rMETL is available from https://github.com/hitbc/rMETL. Supplementary information Supplementary data are available at Bioinformatics online.


Graph analysis of fragmented long-read bacterial genome assemblies

Bioinformatics ◽

10.1093/bioinformatics/btz219 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4239-4246 ◽

Cited By ~ 3

Author(s):

Pierre Marijon ◽

Rayan Chikhi ◽

Jean-Stéphane Varré

Keyword(s):

Simple Procedure ◽

Bacterial Genome ◽

Careful Analysis ◽

Supplementary Information ◽

Hamiltonian Cycles ◽

Computational Techniques ◽

Bacterial Genomes ◽

Sequencing Project ◽

Long Read ◽

Genome Assemblies

Abstract Motivation Long-read genome assembly tools are expected to reconstruct bacterial genomes nearly perfectly; however, they still produce fragmented assemblies in some cases. It would be beneficial to understand whether these cases are intrinsically impossible to resolve, or if assemblers are at fault, implying that genomes could be refined or even finished with little to no additional experimental cost. Results We propose a set of computational techniques to assist inspection of fragmented bacterial genome assemblies, through careful analysis of assembly graphs. By finding paths of overlapping raw reads between pairs of contigs, we recover potential short-range connections between contigs that were lost during the assembly process. We show that our procedure recovers 45% of missing contig adjacencies in fragmented Canu assemblies, on samples from the NCTC bacterial sequencing project. We also observe that a simple procedure based on enumerating weighted Hamiltonian cycles can suggest likely contig orderings. In our tests, the correct contig order is ranked first in half of the cases and within the top-three predictions in nearly all evaluated cases, providing a direction for finishing fragmented long-read assemblies. Availability and implementation https://gitlab.inria.fr/pmarijon/knot . Supplementary information Supplementary data are available at Bioinformatics online.
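The ordering step mentioned in the abstract, enumerating weighted Hamiltonian cycles over contigs, can be illustrated with a brute-force sketch. This is an assumed simplification, not the authors' tool: the function name is hypothetical, edge weights are taken to be "higher means better supported" (for example, the number of read paths found between two contigs), and exhaustive enumeration is only practical for a handful of contigs.

```python
from itertools import permutations


def rank_contig_orders(contigs, weight):
    """Enumerate every circular ordering of `contigs` and rank by total weight.

    `weight` maps frozenset({a, b}) to the support for placing contigs a and
    b next to each other (assumed convention: higher is better, missing
    pairs count as 0). Returns a list of (score, ordering) sorted best first.
    """
    first, rest = contigs[0], contigs[1:]
    n = len(contigs)
    seen = set()
    ranked = []
    for perm in permutations(rest):
        cycle = (first,) + perm
        # A cycle read backwards is the same circular ordering, so keep
        # only one canonical representative of each pair.
        reverse = (first,) + tuple(reversed(perm))
        key = min(cycle, reverse)
        if key in seen:
            continue
        seen.add(key)
        score = sum(
            weight.get(frozenset((key[i], key[(i + 1) % n])), 0)
            for i in range(n)
        )
        ranked.append((score, key))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked


# Toy graph with four contigs; weights are made-up support counts.
weights = {
    frozenset(("A", "B")): 5,
    frozenset(("B", "C")): 4,
    frozenset(("C", "D")): 6,
    frozenset(("D", "A")): 3,
    frozenset(("A", "C")): 1,
    frozenset(("B", "D")): 1,
}
ranking = rank_contig_orders(["A", "B", "C", "D"], weights)
best_score, best_order = ranking[0]
```

Fixing the first contig and discarding reversed cycles leaves (n-1)!/2 distinct orderings, which is why four contigs yield only three candidates here.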


Hayai-Annotation Plants: an ultra-fast and comprehensive functional gene annotation system in plants

Bioinformatics ◽

10.1093/bioinformatics/btz380 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4427-4429 ◽

Cited By ~ 7

Author(s):

Andrea Ghelfi ◽

Kenta Shirasawa ◽

Hideki Hirakawa ◽

Sachiko Isobe

Keyword(s):

Enzyme Commission ◽

Gene Annotation ◽

Sequence Similarity ◽

Enzyme Commission Number ◽

Functional Gene ◽

Supplementary Information ◽

Annotation System ◽

Evidence Type ◽

Similarity Searches ◽

Functional Gene Annotation

Abstract Summary Hayai-Annotation Plants is a browser-based interface for an ultra-fast and accurate functional gene annotation system for plant species using R. The pipeline combines the sequence-similarity searches, using USEARCH against UniProtKB (taxonomy Embryophyta), with a functional annotation step. Hayai-Annotation Plants provides five layers of annotation: i) protein name; ii) gene ontology terms consisting of its three main domains (Biological Process, Molecular Function and Cellular Component); iii) enzyme commission number; iv) protein existence level; and v) evidence type. It implements a new algorithm that gives priority to protein existence level to propagate GO and EC information and annotated Arabidopsis thaliana representative peptide sequences (Araport11) within 5 min at the PC level. Availability and implementation The software is implemented in R and runs on Macintosh and Linux systems. It is freely available at https://github.com/kdri-genomics/Hayai-Annotation-Plants under the GPLv3 license. Supplementary information Supplementary data are available at Bioinformatics online.


immuneSIM: tunable multi-feature simulation of B- and T-cell receptor repertoires for immunoinformatics benchmarking

Bioinformatics ◽

10.1093/bioinformatics/btaa158 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3594-3596 ◽

Cited By ~ 5

Author(s):

Cédric R Weber ◽

Rahmad Akbar ◽

Alexander Yermanos ◽

Milena Pavlović ◽

Igor Snapkov ◽

...

Keyword(s):

T Cell ◽

T Cell Receptor ◽

Network Architecture ◽

Gene Annotation ◽

Sequence Similarity ◽

Cell Receptor ◽

Supplementary Information ◽

Germline Gene ◽

Immune Receptor ◽

Estimation Sequence

Abstract Summary B- and T-cell receptor repertoires of the adaptive immune system have become a key target for diagnostics and therapeutics research. Consequently, there is a rapidly growing number of bioinformatics tools for immune repertoire analysis. Benchmarking of such tools is crucial for ensuring reproducible and generalizable computational analyses. Currently, however, it remains challenging to create standardized ground truth immune receptor repertoires for immunoinformatics tool benchmarking. Therefore, we developed immuneSIM, an R package that allows the simulation of native-like and aberrant synthetic full-length variable region immune receptor sequences by tuning the following immune receptor features: (i) species and chain type (BCR, TCR, single and paired), (ii) germline gene usage, (iii) occurrence of insertions and deletions, (iv) clonal abundance, (v) somatic hypermutation and (vi) sequence motifs. Each simulated sequence is annotated by the complete set of simulation events that contributed to its in silico generation. immuneSIM permits the benchmarking of key computational tools for immune receptor analysis, such as germline gene annotation, diversity and overlap estimation, sequence similarity, network architecture, clustering analysis and machine learning methods for motif detection. Availability and implementation The package is available via https://github.com/GreiffLab/immuneSIM and on CRAN at https://cran.r-project.org/web/packages/immuneSIM. The documentation is hosted at https://immuneSIM.readthedocs.io. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


HLA*PRG:LA – HLA typing from linearly projected graph alignments

10.1101/453555 ◽

2018 ◽

Cited By ~ 1

Author(s):

Alexander T Dilthey ◽

Alexander J Mentzer ◽

Raphael Carapito ◽

Clare Cutland ◽

Nezih Cereb ◽

...

Keyword(s):

Type Inference ◽

Hla Typing ◽

Supplementary Information ◽

Whole Genome ◽

Typical Sample ◽

Whole Exome ◽

Oxford Nanopore ◽

Hla Type ◽

Long Read ◽

Genome Assemblies

Abstract Summary: HLA*PRG:LA implements a new graph alignment model for HLA type inference, based on the projection of linear alignments onto a variation graph. It enables accurate HLA type inference from whole-genome (99% accuracy) and whole-exome (93% accuracy) Illumina data; from long-read Oxford Nanopore and Pacific Biosciences data (98% accuracy for whole-genome and targeted data); and from genome assemblies. Computational requirements for a typical sample vary between 0.7 and 14 CPU hours per sample. Availability and Implementation: HLA*PRG:LA is implemented in C++ and Perl and freely available from https://github.com/DiltheyLab/HLA-PRG-LA (GPL v3). Contact: [email protected] Supplementary information: Supplementary data are available online.


AnAms1.0: A high-quality chromosome-scale assembly of a domestic cat Felis catus of American Shorthair breed

10.1101/2020.05.19.103788 ◽

2020 ◽

Author(s):

Sachiko Isobe ◽

Yuki Matsumoto ◽

Claire Chung ◽

Mika Sakamoto ◽

Ting-Fung Chan ◽

...

Keyword(s):

Gene Annotation ◽

Companion Animals ◽

Felis Catus ◽

Domestic Cat ◽

Genome Database ◽

Genomic Resources ◽

Genomic Technologies ◽

Long Read ◽

Limited Power ◽

Genome Assemblies

Abstract The domestic cat (Felis catus) is one of the most popular companion animals in the world. Comprehensive genomic resources will aid the development and application of veterinary medicine including to improve feline health, in particular, to enable precision medicine which is promising in human application. However, currently available cat genome assemblies were mostly built based on the Abyssinian cat breed which is highly inbred and has limited power in representing the vast diversity of the cat population. Moreover, the current reference assembly remains fragmented with sequences contained in thousands of scaffolds. We constructed a reference-grade chromosome-scale genome assembly of a domestic cat, Felis catus genome of American Shorthair breed, Anicom American shorthair 1.0 (AnAms1.0) with high contiguity (scaffold N50 > 120 Mb), by combining multiple advanced genomic technologies, including PacBio long-read sequencing as well as sequence scaffolding by long-range genomic information obtained from Hi-C and optical mapping data. Homology-based and ab initio gene annotation was performed with the Iso-Seq data. Analyzed data are publicly accessible on Cats genome informatics (Cats-I, https://cat.annotation.jp/), a cat genome database established as a platform to facilitate the accumulation and sharing of genomic resources to improve veterinary care.
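The contiguity figure quoted above is a scaffold N50. For readers unfamiliar with the metric, a minimal sketch of the standard definition follows: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The function name and the toy lengths are illustrative and are not taken from the AnAms1.0 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly: total length 290, so the half-way point is 145.
# 80 alone covers 80; 80 + 70 covers 150, which crosses 145.
toy_n50 = n50([80, 70, 50, 40, 30, 20])
```

Because N50 depends only on the length distribution, it rewards contiguity but says nothing about correctness, which is why it is usually reported alongside completeness measures.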


Rapid characterization of complex genomic regions using Cas9 enrichment and Nanopore sequencing

10.1101/2021.03.11.434935 ◽

2021 ◽

Author(s):

Jesse Bruijnesteijn ◽

Marit van der Wiel ◽

Natasja G. de Groot ◽

Ronald E. Bontrop

Keyword(s):

Sequence Similarity ◽

Gene Clusters ◽

Oxford Nanopore ◽

Long Read ◽

Number Variation ◽

Rapid Characterization ◽

Multiple Species ◽

Genomic Regions ◽

Genome Assemblies

Abstract Long-read sequencing approaches have considerably improved the quality and contiguity of genome assemblies. Such platforms bear the potential to resolve even extremely complex regions, such as multigenic families and repetitive stretches of DNA. Deep sequencing coverage, however, is required to overcome low nucleotide accuracy, especially in regions with high homopolymer density, copy number variation, and sequence similarity, such as the MHC and KIR gene clusters of the immune system. Therefore, we have adapted a targeted enrichment protocol in combination with long-read sequencing to efficiently annotate complex genomic regions. Using Cas9 endonuclease activity, segments of the complex KIR gene cluster were enriched and sequenced on an Oxford Nanopore Technologies platform. This provided sufficient coverage to accurately resolve and phase highly complex KIR haplotypes. Our strategy facilitates rapid characterization of large and complex multigenic regions, including their epigenetic footprint, in multiple species, even in the absence of a reference genome.
