Identifying and removing haplotypic duplication in primary genome assemblies

Mapping Intimacies ◽

10.1101/729962 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dengfeng Guan ◽

Shane A. McCarthy ◽

Jonathan Wood ◽

Kerstin Howe ◽

Yadong Wang ◽

...

Keyword(s):

Gene Annotation ◽

Sequence Similarity ◽

Rapid Development ◽

Relevant Information ◽

Read Depth ◽

Current Standard ◽

Long Read ◽

Reference Quality ◽

Fully Automatic ◽

Genome Assemblies

Abstract Motivation Rapid development in long read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either only focus on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results Here we present a novel tool “purge_dups” that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with the current standard, purge_haplotigs, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can be easily integrated into assembly pipelines. Availability The source code is written in C and is available at https://github.com/dfguan/.

Identifying and removing haplotypic duplication in primary genome assemblies

Bioinformatics ◽

10.1093/bioinformatics/btaa025 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2896-2898 ◽

Cited By ~ 22

Author(s):

Dengfeng Guan ◽

Shane A McCarthy ◽

Jonathan Wood ◽

Kerstin Howe ◽

Yadong Wang ◽

...

Keyword(s):

Gene Annotation ◽

Sequence Similarity ◽

Rapid Development ◽

Relevant Information ◽

Read Depth ◽

Supplementary Information ◽

Long Read ◽

Reference Quality ◽

Fully Automatic ◽

Genome Assemblies

Abstract Motivation Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. Availability and implementation The source code is written in C and is available at https://github.com/dfguan/purge_dups. Supplementary information Supplementary data are available at Bioinformatics online.
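
The idea of combining read depth with sequence similarity can be illustrated with a minimal sketch. This is not the purge_dups implementation: the `Contig` fields, the `flag_duplicates` function and both thresholds are hypothetical, and the per-contig depth and alignment coverage are assumed to have been computed beforehand. A contig is flagged only when its depth sits near half the diploid depth and most of it aligns to another contig.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    mean_depth: float         # mean read depth across the contig
    best_hit_coverage: float  # fraction (0..1) of the contig aligned to another contig


def flag_duplicates(contigs, diploid_depth, depth_tolerance=0.25, min_coverage=0.8):
    """Return names of contigs that look like haplotypic duplicates.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    half = diploid_depth / 2.0
    low, high = half * (1 - depth_tolerance), half * (1 + depth_tolerance)
    flagged = []
    for contig in contigs:
        near_half_depth = low <= contig.mean_depth <= high
        mostly_aligned = contig.best_hit_coverage >= min_coverage
        # Require both signals: depth alone or similarity alone is not enough.
        if near_half_depth and mostly_aligned:
            flagged.append(contig.name)
    return flagged


contigs = [
    Contig("ctg1", 60.0, 0.05),  # full depth, unique sequence
    Contig("ctg2", 31.0, 0.95),  # half depth and highly similar to another contig
    Contig("ctg3", 29.0, 0.10),  # half depth but unique sequence
]
print(flag_duplicates(contigs, diploid_depth=60.0))
```

Requiring both signals is the point of the sketch: a half-depth contig with no similar partner (ctg3) is left in place, which is one way to avoid discarding sequence that is merely under-covered.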

Understanding the Early Evolutionary Stages of a Tandem Drosophila melanogaster-Specific Gene Family: A Structural and Functional Population Study

Molecular Biology and Evolution ◽

10.1093/molbev/msaa109 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2584-2600 ◽

Cited By ~ 3

Author(s):

Bryan D Clifton ◽

Jamie Jimenez ◽

Ashlyn Kimura ◽

Zeinab Chahine ◽

Pablo Librado ◽

...

Keyword(s):

Gene Family ◽

Sequence Similarity ◽

Gene Families ◽

Read Depth ◽

Specific Gene ◽

Protein Variant ◽

Protein Coding ◽

Expression Levels ◽

Number Variation ◽

Reference Quality

Abstract Gene families underlie genetic innovation and phenotypic diversification. However, our understanding of the early genomic and functional evolution of tandemly arranged gene families remains incomplete as paralog sequence similarity hinders their accurate characterization. The Drosophila melanogaster-specific gene family Sdic is tandemly repeated and impacts sperm competition. We scrutinized Sdic in 20 geographically diverse populations using reference-quality genome assemblies, read-depth methodologies, and qPCR, finding that ∼90% of the individuals harbor 3–7 copies as well as evidence of population differentiation. In strains with reliable gene annotations, copy number variation (CNV) and differential transposable element insertions distinguish one structurally distinct version of the Sdic region per strain. All 31 annotated copies featured protein-coding potential and, based on the protein variant encoded, were categorized into 13 paratypes differing in their 3′ ends, with 3–5 paratypes coexisting in any strain examined. Despite widespread gene conversion, the only copy present in all strains has functionally diverged at both coding and regulatory levels under positive selection. Contrary to artificial tandem duplications of the Sdic region that resulted in increased male expression, CNV in cosmopolitan strains did not correlate with expression levels, likely as a result of differential genome modifier composition. Duplicating the region did not enhance sperm competitiveness, suggesting a fitness cost at high expression levels or a plateau effect. Beyond facilitating a minimally optimal expression level, Sdic CNV acts as a catalyst of protein and regulatory diversity, showcasing a possible evolutionary path recently formed tandem multigene families can follow toward long-term consolidation in eukaryotic genomes.

Purge Haplotigs: Synteny Reduction for Third-gen Diploid Genome Assemblies

10.1101/286252 ◽

2018 ◽

Cited By ~ 7

Author(s):

Michael J Roach ◽

Simon Schmidt ◽

Anthony R Borneman

Keyword(s):

De Novo ◽

Haplotype Reconstruction ◽

Minimal Impact ◽

Variant Discovery ◽

Rapid Release ◽

Long Read ◽

Recent Developments ◽

Reference Quality ◽

Downstream Analysis ◽

Genome Assemblies

Abstract Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembling highly heterozygous genomes still faces a major problem: where the two haplotypes for a region are highly polymorphic, the synteny is not recognised during assembly. This causes issues with downstream analysis, for example variant discovery using the haploid assembly, or haplotype reconstruction using the diploid assembly. A new pipeline, Purge Haplotigs, was developed specifically for third-gen assemblies to identify and reassign the duplicate contigs. The pipeline takes a draft haplotype-fused assembly or a diploid assembly, together with read alignments, to produce an improved assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing. All assemblies after processing with Purge Haplotigs were less duplicated with minimal impact on genome completeness. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.

Efficient iterative Hi-C scaffolder based on N-best neighbors

BMC Bioinformatics ◽

10.1186/s12859-021-04453-5 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dengfeng Guan ◽

Shane A. McCarthy ◽

Zemin Ning ◽

Guohua Wang ◽

Yadong Wang ◽

...

Keyword(s):

De Novo ◽

A Priori ◽

Sequencing Technology ◽

Current Standard ◽

A Genome ◽

Eukaryotic Species ◽

Long Read ◽

Reference Quality ◽

Comparable Accuracy ◽

Chromosomal Profile

Abstract Background Efficient and effective genome scaffolding tools are still in high demand for generating reference-quality assemblies. While long read data alone is unlikely to produce a chromosome-scale assembly for most eukaryotic species, the inexpensive Hi-C sequencing technology, capable of capturing the chromosomal profile of a genome, is now widely used to complete the task. However, the existing Hi-C based scaffolding tools either require an a priori chromosome number as input, or lack the ability to build highly continuous scaffolds. Results We design and develop a novel Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. Subsequent to scaffolding, it identifies potential misjoins and breaks them to preserve scaffolding accuracy. Through our tests on three long read based de novo assemblies from three different species, we demonstrate that pin_hic is more efficient than current state-of-the-art tools, and that it can generate much more continuous scaffolds while achieving higher or comparable accuracy. Conclusions Pin_hic is an efficient Hi-C based scaffolding tool, which can be useful for building chromosome-scale assemblies. As many sequencing projects have been launched in recent years, we believe pin_hic has the potential to be applied in these projects and make a meaningful contribution.
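
A rough sketch of what restricting a scaffolding graph to N-best neighbors could look like is given below. It is not the pin_hic algorithm: the function name `n_best_edges`, the input layout (a dictionary of contig pairs to Hi-C contact counts) and the choice to keep an edge when either endpoint ranks the other in its top N are all assumptions made for illustration.

```python
from collections import defaultdict


def n_best_edges(contacts, n=2):
    """Keep only edges that are among the n strongest for at least one endpoint.

    contacts: dict mapping (contig_a, contig_b) -> Hi-C contact count.
    Returns a set of edges, each stored as an alphabetically sorted pair.
    """
    neighbours = defaultdict(list)
    for (a, b), weight in contacts.items():
        neighbours[a].append((weight, b))
        neighbours[b].append((weight, a))

    kept = set()
    for contig, candidates in neighbours.items():
        # Sort by contact count, strongest first, and keep the top n.
        for _, other in sorted(candidates, reverse=True)[:n]:
            kept.add(tuple(sorted((contig, other))))
    return kept


contacts = {
    ("A", "B"): 50,
    ("A", "C"): 5,
    ("B", "C"): 40,
    ("A", "D"): 1,
    ("C", "D"): 30,
}
print(sorted(n_best_edges(contacts, n=1)))
```

With n=1 the weak links A-C and A-D drop out and the remaining edges form the chain A-B-C-D, which shows how pruning to the best neighbors can simplify the graph before contigs are ordered.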

AnAms1.0: A high-quality chromosome-scale assembly of a domestic cat Felis catus of American Shorthair breed

10.1101/2020.05.19.103788 ◽

2020 ◽

Author(s):

Sachiko Isobe ◽

Yuki Matsumoto ◽

Claire Chung ◽

Mika Sakamoto ◽

Ting-Fung Chan ◽

...

Keyword(s):

Gene Annotation ◽

Companion Animals ◽

Felis Catus ◽

Domestic Cat ◽

Genome Database ◽

Genomic Resources ◽

Genomic Technologies ◽

Long Read ◽

Limited Power ◽

Genome Assemblies

Abstract The domestic cat (Felis catus) is one of the most popular companion animals in the world. Comprehensive genomic resources will aid the development and application of veterinary medicine, including improving feline health and, in particular, enabling precision medicine, which is promising in human application. However, currently available cat genome assemblies were mostly built from the Abyssinian cat breed, which is highly inbred and has limited power to represent the vast diversity of the cat population. Moreover, the current reference assembly remains fragmented, with sequences contained in thousands of scaffolds. We constructed a reference-grade chromosome-scale genome assembly of a domestic cat of the American Shorthair breed, Anicom American shorthair 1.0 (AnAms1.0), with high contiguity (scaffold N50 > 120 Mb) by combining multiple advanced genomic technologies, including PacBio long-read sequencing as well as sequence scaffolding by long-range genomic information obtained from Hi-C and optical mapping data. Homology-based and ab initio gene annotation was performed with the Iso-Seq data. Analyzed data are publicly accessible on Cats genome informatics (Cats-I, https://cat.annotation.jp/), a cat genome database established as a platform to facilitate the accumulation and sharing of genomic resources to improve veterinary care.
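
For readers unfamiliar with the contiguity metric quoted above, scaffold N50 is the length of the scaffold at which the running total, taken from the longest scaffold downwards, first reaches half of the assembly size. A minimal sketch follows; the function name `n50` and the toy lengths are illustrative only and are not taken from the AnAms1.0 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The first scaffold that brings the running sum to half the total is the N50.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: total length 30, and 10 + 8 = 18 is the first running sum >= 15.
print(n50([10, 8, 6, 4, 2]))
```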

Rapid characterization of complex genomic regions using Cas9 enrichment and Nanopore sequencing

10.1101/2021.03.11.434935 ◽

2021 ◽

Author(s):

Jesse Bruijnesteijn ◽

Marit van der Wiel ◽

Natasja G. de Groot ◽

Ronald E. Bontrop

Keyword(s):

Sequence Similarity ◽

Gene Clusters ◽

Oxford Nanopore ◽

Long Read ◽

Number Variation ◽

Rapid Characterization ◽

Multiple Species ◽

Genomic Regions ◽

Genome Assemblies

Abstract Long-read sequencing approaches have considerably improved the quality and contiguity of genome assemblies. Such platforms bear the potential to resolve even extremely complex regions, such as multigenic families and repetitive stretches of DNA. Deep sequencing coverage, however, is required to overcome low nucleotide accuracy, especially in regions with high homopolymer density, copy number variation, and sequence similarity, such as the MHC and KIR gene clusters of the immune system. Therefore, we have adapted a targeted enrichment protocol in combination with long-read sequencing to efficiently annotate complex genomic regions. Using Cas9 endonuclease activity, segments of the complex KIR gene cluster were enriched and sequenced on an Oxford Nanopore Technologies platform. This provided sufficient coverage to accurately resolve and phase highly complex KIR haplotypes. Our strategy facilitates rapid characterization of large and complex multigenic regions, including their epigenetic footprint, in multiple species, even in the absence of a reference genome.

Rapid Characterization of Complex Killer Cell Immunoglobulin-Like Receptor (KIR) Regions Using Cas9 Enrichment and Nanopore Sequencing

Frontiers in Immunology ◽

10.3389/fimmu.2021.722181 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jesse Bruijnesteijn ◽

Marit van der Wiel ◽

Natasja G. de Groot ◽

Ronald E. Bontrop

Keyword(s):

Sequence Similarity ◽

Killer Cell ◽

Gene Clusters ◽

Oxford Nanopore ◽

Long Read ◽

Number Variation ◽

Rapid Characterization ◽

Multiple Species ◽

Genome Assemblies

Long-read sequencing approaches have considerably improved the quality and contiguity of genome assemblies. Such platforms bear the potential to resolve even extremely complex regions, such as multigenic immune families and repetitive stretches of DNA. Deep sequencing coverage, however, is required to overcome low nucleotide accuracy, especially in regions with high homopolymer density, copy number variation, and sequence similarity, such as the MHC and KIR gene clusters of the immune system. Therefore, we have adapted a targeted enrichment protocol in combination with long-read sequencing to efficiently annotate complex KIR gene regions. Using Cas9 endonuclease activity, segments of the KIR gene cluster were enriched and sequenced on an Oxford Nanopore Technologies platform. This provided sufficient coverage to accurately resolve and phase highly complex KIR haplotypes. Our strategy eliminates PCR-induced amplification errors, facilitates rapid characterization of large and complex multigenic regions, including their epigenetic footprint, and is applicable in multiple species, even in the absence of a reference genome.

Chromosome-level genome assembly and annotation of two lineages of the ant Cataglyphis hispanica: steppingstones towards genomic studies of hybridogenesis and thermal adaptation in desert ants

10.1101/2022.01.07.475286 ◽

2022 ◽

Author(s):

Hugo Darras ◽

Natalia de Souza Araujo ◽

Lyam Baudry ◽

Nadege Guiglielmoni ◽

Pedro Lorite ◽

...

Keyword(s):

Molecular Mechanisms ◽

Gene Annotation ◽

Thermal Adaptation ◽

Thermal Limit ◽

Chromosome Conformation ◽

Long Read ◽

Genomic Studies ◽

Genetic Mechanisms ◽

Genome Assemblies ◽

High Quality Genome

Cataglyphis are thermophilic ants that forage during the day when temperatures are highest and sometimes close to their critical thermal limit. Several Cataglyphis species have evolved unusual reproductive systems such as facultative queen parthenogenesis or social hybridogenesis, which have not yet been investigated in detail at the molecular level. We generated high-quality genome assemblies for two hybridogenetic lineages of the Iberian ant Cataglyphis hispanica using long-read Nanopore sequencing and exploited chromosome conformation capture (3C) sequencing to assemble contigs into 26 and 27 chromosomes, respectively. Males of one lineage were karyotyped to confirm the number of chromosomes inferred from 3C data. We obtained transcriptomic data to assist gene annotation and built custom repeat libraries for each of the two assemblies. Comparative analyses with 19 other published ant genomes were also conducted. These new genomic resources pave the way for exploring the genetic mechanisms underlying the remarkable thermal adaptation and the molecular mechanisms associated with transitions between different genetic systems characteristic of the ant genus Cataglyphis.

BITACORA: A comprehensive tool for the identification and annotation of gene families in genome assemblies

10.1101/593889 ◽

2019 ◽

Cited By ~ 1

Author(s):

Joel Vizueta ◽

Alejandro Sánchez-Gracia ◽

Julio Rozas

Keyword(s):

DNA Sequences ◽

Gene Annotation ◽

Sequence Similarity ◽

Gene Families ◽

Genomic Research ◽

Model Organisms ◽

Large Gene ◽

Genomic Annotation ◽

Gene Models ◽

Genome Assemblies

Abstract Gene annotation is a critical bottleneck in genomic research, especially for the comprehensive study of very large gene families in the genomes of non-model organisms. Despite the recent progress in automatic methods, the tools developed for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, which require considerable extra effort to amend. Here we present BITACORA, a bioinformatics solution that integrates sequence similarity search tools and Perl scripts to facilitate both the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly from DNA sequences. We tested the performance of the BITACORA pipeline in annotating the members of two chemosensory gene families of different sizes in seven available chelicerate genome drafts. Despite the relatively high fragmentation of some of these drafts, BITACORA was able to improve the annotation of many members of these families and detected thousands of new chemoreceptors encoded in genome sequences. The program generates an output file in general feature format (GFF), with both curated and novel gene models, and a FASTA file with the predicted proteins. These outputs can be easily integrated in genomic annotation editors, greatly facilitating subsequent manual annotation and downstream evolutionary analyses.

Identification of Novel Toxin Genes from the Stinging Nettle Caterpillar Parasa lepida (Cramer, 1799): Insights into the Evolution of Lepidoptera Toxins

Insects ◽

10.3390/insects12050396 ◽

2021 ◽

Vol 12 (5) ◽

pp. 396

Author(s):

Natrada Mitpuangchon ◽

Kwan Nualcharoen ◽

Singtoe Boonrotpong ◽

Patamarerk Engsontia

Keyword(s):

Protease Inhibitors ◽

Proteolytic Enzymes ◽

Gene Annotation ◽

Sequence Similarity ◽

New Drugs ◽

Toxin Gene ◽

Cone Snail ◽

Stinging Nettle ◽

Toxin Genes ◽

Nettle Caterpillar

Many animal species can produce venom for defense, predation, and competition. The venom usually contains diverse peptide and protein toxins, including neurotoxins, proteolytic enzymes, protease inhibitors, and allergens. Some drugs for cancer and neurological disorders, as well as some analgesics, were developed based on animal toxin structures and functions. Several caterpillar species possess venoms that cause varying effects on humans both locally and systemically. However, toxins from only a few species have been investigated, limiting the full understanding of Lepidoptera toxin diversity and evolution. We used the RNA-seq technique to identify toxin genes from the stinging nettle caterpillar, Parasa lepida (Cramer, 1799). We constructed a transcriptome from caterpillar urticating hairs and reported 34,968 unique transcripts. Using our toxin gene annotation pipeline, we identified 168 candidate toxin genes, including protease inhibitors, proteolytic enzymes, and allergens. The 21 novel P. lepida Knottin-like peptides, which do not show sequence similarity to any known peptide, have predicted 3D structures similar to tarantula, scorpion, and cone snail neurotoxins. We highlighted the importance of convergent evolution in Lepidoptera toxin evolution and the possible mechanisms. This study opens a new path to understanding the hidden diversity of Lepidoptera toxins, which could be a fruitful source for developing new drugs.
