De Novo Annotation of Transposable Elements: Tackling the Fat Genome Issue

Software Evaluation for de novo Detection of Transposons

10.1101/2021.02.08.430290 ◽

2021 ◽

Author(s):

Matias Rodriguez ◽

Wojciech Makałowski

Keyword(s):

Transposable Elements ◽

Genome Evolution ◽

De Novo ◽

Simulated Data ◽

Genomic Sequences ◽

Software Evaluation ◽

Easy Task ◽

Eukaryotic Genomes

Abstract Transposable elements (TEs) are major genomic components in most eukaryotic genomes and play an important role in genome evolution. However, despite their relevance, the identification of TEs is not an easy task, and a number of tools have been developed to tackle this problem. To better understand how they perform, we tested several widely used tools for de novo TE detection and compared their performance on both simulated data and well-curated genomic sequences. The results will be helpful for identifying common issues associated with TE annotation and for evaluating how comparable the results obtained with different tools are.

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

Genome Biology ◽

10.1186/s13059-019-1905-y ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

Shujun Ou ◽

Weijia Su ◽

Yi Liao ◽

Kapeel Chougule ◽

Jireh R. A. Agda ◽

...

Keyword(s):

Transposable Elements ◽

Animal Species ◽

Performance Metrics ◽

De Novo ◽

Terminal Inverted Repeat ◽

Miniature Inverted Transposable Elements ◽

Sensitivity Specificity ◽

Genomic Regions ◽

Assembly Algorithms ◽

Eukaryotic Genomes

Abstract Background: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. Results: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. Conclusions: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
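
The performance metrics named in this abstract (sensitivity, specificity, accuracy, precision, FDR, and F1) all follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; it is not taken from the EDTA code base, and the function name `te_annotation_metrics`, the helper `_ratio`, and the choice of what one count represents (for example, one annotated base pair) are assumptions made for illustration.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def te_annotation_metrics(tp, fp, tn, fn):
    """Compute standard classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are hypothetical counts (e.g. base pairs) of true
    positives, false positives, true negatives, and false negatives
    obtained by comparing a tool's annotation with a curated reference.
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),            # share of true TE sequence recovered
        "specificity": _ratio(tn, tn + fp),            # share of non-TE sequence left unannotated
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),              # share of annotated sequence that is truly TE
        "fdr": _ratio(fp, tp + fp),                    # false discovery rate, equal to 1 - precision
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),        # harmonic mean of precision and sensitivity
    }


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the function.
    for name, value in te_annotation_metrics(80, 20, 880, 20).items():
        print(f"{name}: {value:.3f}")
```

Returning NaN for an empty denominator is a design choice of this sketch; a real benchmark may prefer to raise an error or report the metric as undefined.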

Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements

10.1101/001834 ◽

2014 ◽

Cited By ~ 1

Author(s):

Rajiv C McCoy ◽

Ryan W Taylor ◽

Timothy A Blauwkamp ◽

Joanna L Kelley ◽

Michael Kertesz ◽

...

Keyword(s):

Transposable Elements ◽

Reference Genome ◽

De Novo ◽

Model Organism ◽

Genomic Analysis ◽

High Sequence Identity ◽

Current Reference ◽

Sequencing Technologies ◽

Long Reads ◽

Whole Genomes

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp), achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long reads, offer a powerful approach to improve de novo assemblies of whole genomes.
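
The N50 contig size quoted in this abstract is a common assembly-contiguity statistic: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly length. The sketch below follows that usual definition; the function name `n50` and the example lengths are illustrative assumptions, not values or code from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length first reaches at least half
    of the summed length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Comparing 2 * running with total avoids fractional halves.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in Kbp.
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why the statistic is reported alongside coverage of the reference genome.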

RepeatModeler2: automated genomic discovery of transposable element families

10.1101/856591 ◽

2019 ◽

Cited By ~ 12

Author(s):

Jullien M. Flynn ◽

Robert Hubley ◽

Clément Goubert ◽

Jeb Rosen ◽

Andrew G. Clark ◽

...

Keyword(s):

Transposable Elements ◽

De Novo ◽

False Positive Rate ◽

Fruit Fly ◽

Sequence Coverage ◽

Genome Sequences ◽

Model Species ◽

Link Type ◽

Eukaryotic Species ◽

LTR Retroelements

Abstract The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a new pipeline that greatly facilitates this process. This new program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete LTR retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately three times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. The program had an extremely low false positive rate when applied to simulated genomes devoid of TEs. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, https://github.com/Dfam-consortium/TETools). Significance: Genome sequences are being produced for more and more eukaryotic species. The bulk of these genomes is composed of parasitic, self-mobilizing transposable elements (TEs) that play important roles in organismal evolution. Thus there is a pressing need for developing software that can accurately identify the diverse set of TEs dispersed in genome sequences. Here we introduce RepeatModeler2, an easy-to-use package for the curation of reference TE libraries which can be applied to any eukaryotic species. Through several major improvements over the previous version, RepeatModeler2 is able to produce libraries that recapitulate the known composition of three model species with some of the most complex TE landscapes. Thus RepeatModeler2 will greatly enhance the discovery and annotation of TEs in genome sequences.

De novo identification of satellite DNAs in the sequenced genomes of Drosophila virilis and D. americana using the RepeatExplorer and TAREAN pipelines

10.1101/781146 ◽

2019 ◽

Author(s):

Bráulio S.M.L. Silva ◽

Pedro Heringer ◽

Guilherme B. Dias ◽

Marta Svartman ◽

Gustavo C.S. Kuhn

Keyword(s):

Transposable Elements ◽

Tandem Repeat ◽

Tandem Repeats ◽

De Novo ◽

Chromosome Mapping ◽

Drosophila Virilis ◽

Satellite DNAs ◽

Bioinformatic Tools ◽

A Genome ◽

Genome Assemblies

Abstract Satellite DNAs are among the most abundant repetitive DNAs found in eukaryote genomes, where they participate in a variety of biological roles, from being components of important chromosome structures to gene regulation. Experimental methodologies used before the genomic era were insufficient, as well as too laborious and time-consuming, to recover the collection of all satDNAs from a genome. Today, the availability of whole sequenced genomes combined with the development of specific bioinformatic tools is expected to foster the identification of virtually all of the “satellitome” of a particular species. While whole genome assemblies are important to obtain a global view of genome organization, most assemblies are incomplete and lack repetitive regions. Here, we applied short-read sequencing and similarity clustering in order to perform a de novo identification of the most abundant satellite families in two Drosophila species from the virilis group: Drosophila virilis and D. americana. These species were chosen because they have been used as a model to understand satDNA biology since the early 1970s. We combined computational tandem repeat detection via similarity-based read clustering (implemented in Tandem Repeat Analyzer pipeline – “TAREAN”) with data from the literature and chromosome mapping to obtain an overview of satDNAs in D. virilis and D. americana. The fact that all of the abundant tandem repeats we detected were previously identified in the literature allowed us to evaluate the efficiency of TAREAN in correctly identifying true satDNAs. Our results indicate that raw sequencing reads can be efficiently used to detect satDNAs, but that abundant tandem repeats present in dispersed arrays or associated with transposable elements are frequent false positives. We demonstrate that TAREAN, together with its parent method RepeatExplorer, may be used as a resource to detect tandem repeats associated with transposable elements and also to reveal families of dispersed tandem repeats.

The de novo genome of the “Spanish” slug Arion vulgaris Moquin-Tandon, 1855 (Gastropoda: Panpulmonata): massive expansion of transposable elements in a major pest species

10.1101/2020.11.30.403303 ◽

2020 ◽

Author(s):

Zeyuan Chen ◽

Özgül Doğan ◽

Nadège Guiglielmoni ◽

Anne Guichard ◽

Michael Schrödl

Keyword(s):

Transposable Elements ◽

Genome Assembly ◽

De Novo ◽

Repetitive Sequences ◽

Land Snails ◽

Whole Genome ◽

Pest Species ◽

High Quality ◽

Genome Duplication Event ◽

Arion Vulgaris

Abstract Background: The “Spanish” slug, Arion vulgaris Moquin-Tandon, 1855, is considered to be among the 100 worst pest species in Europe. It is common and invasive to at least northern and eastern parts of Europe, probably benefitting from climate change and the modern human lifestyle. The origin and expansion of this species, the mechanisms behind its outstanding adaptive success and ability to outcompete other land slugs are worth exploring on a genomic level. However, a high-quality chromosome-level genome is still lacking. Findings: The final assembly of A. vulgaris was obtained by combining short reads, linked reads, Nanopore long reads, and Hi-C data. The genome assembly size is 1.54 Gb with a contig N50 length of 8.6 Mb. We found a recent expansion of transposable elements (TEs) which results in repetitive sequences accounting for more than 75% of the A. vulgaris genome, which is the highest among all known gastropod species. We identified 32,518 protein coding genes, and 2,763 species-specific genes were functionally enriched in response to stimuli, nervous system and reproduction. With 1,237 single-copy orthologs from A. vulgaris and other related mollusks with whole-genome data available, we reconstructed the phylogenetic relationships of gastropods and estimated the divergence time of stylommatophoran land snails (Achatina) and Arion slugs at around 126 million years ago, and confirmed the whole genome duplication event shared by them. Conclusions: To our knowledge, the A. vulgaris genome is the first land slug genome assembly published to date. The high-quality genomic data will provide valuable genetic resources for further phylogeographic studies of A. vulgaris origin and expansion, invasiveness, as well as molluscan aquatic-land transition and shell formation.

Mosquito genomes are frequently invaded by transposable elements through horizontal transfer

PLoS Genetics ◽

10.1371/journal.pgen.1008946 ◽

2020 ◽

Vol 16 (11) ◽

pp. e1008946

Author(s):

Elverson Soares de Melo ◽

Gabriel Luz Wallau

Keyword(s):

Transposable Elements ◽

Horizontal Transfer ◽

De Novo ◽

Mosquito Species ◽

Wuchereria Bancrofti ◽

Model Organisms ◽

Eukaryotic Species ◽

Horizontal Spread ◽

Horizontal Transfers

Transposable elements (TEs) are mobile genetic elements that parasitize the genomes of basically all eukaryotic species. Due to their complexity, an in-depth TE characterization is only available for a handful of model organisms. In the present study, we performed a de novo and homology-based characterization of TEs in the genomes of 24 mosquito species and investigated their mode of inheritance. More than 40% of the genome of Aedes aegypti, Aedes albopictus, and Culex quinquefasciatus is composed of TEs, while it varied substantially among Anopheles species (0.13%–19.55%). Class I TEs are the most abundant among mosquitoes and at least 24 TE superfamilies were found. Interestingly, TEs have been extensively exchanged by horizontal transfer (172 TE families of 16 different superfamilies) among mosquitoes in the last 30 million years. Horizontally transferred TEs represent around 7% of the genome in Aedes species and a small fraction in Anopheles genomes. Most of these horizontally transferred TEs are from the three ubiquitous LTR superfamilies: Gypsy, Bel-Pao and Copia. Searching more than 32,000 genomes, we also uncovered transfers between mosquitoes and two different phyla (Cnidaria and Nematoda) and two subphyla (Chelicerata and Crustacea), identifying a vector, the worm Wuchereria bancrofti, that enabled the horizontal spread of a Tc1-mariner element among various Anopheles species. These data also allowed us to reconstruct the horizontal transfer network of this TE involving more than 40 species. In summary, our results suggest that TEs are frequently exchanged by horizontal transfers among mosquitoes, influencing mosquito genome size and variability.

Transposable elements employ distinct integration strategies with respect to transcriptional landscapes in eukaryotic genomes

Nucleic Acids Research ◽

10.1093/nar/gkaa370 ◽

2020 ◽

Vol 48 (12) ◽

pp. 6685-6698 ◽

Cited By ~ 2

Author(s):

Xinyan Zhang ◽

Meixia Zhao ◽

Donald R McCarty ◽

Damon Lisch

Keyword(s):

Transposable Elements ◽

RNA Polymerase II ◽

De Novo ◽

Open Chromatin ◽

Passive Targeting ◽

Site Distribution ◽

Highly Expressed Genes ◽

Negative Impacts ◽

Integration Strategies ◽

Eukaryotic Genomes

Abstract Transposable elements (TEs) are ubiquitous DNA segments capable of moving from one site to another within host genomes. The extant distributions of TEs in eukaryotic genomes have been shaped by both bona fide TE integration preferences in eukaryotic genomes and by selection following integration. Here, we compare TE target site distribution in host genomes using multiple de novo transposon insertion datasets in both plants and animals and compare them in the context of genome-wide transcriptional landscapes. We showcase two distinct types of transcription-associated TE targeting strategies that suggest a process of convergent evolution among eukaryotic TE families. The integration of two precision-targeting elements is specifically associated with initiation of RNA Polymerase II transcription of highly expressed genes, suggesting the existence of novel mechanisms of precision TE targeting in addition to passive targeting of open chromatin. We also highlight two features that can facilitate TE survival and rapid proliferation: tissue-specific transposition and minimization of negative impacts on nearby gene function due to precision targeting.

Evidence for De Novo rearrangements of Drosophila transposable elements induced by the passage to the cell culture

Genetica ◽

10.1007/bf00120994 ◽

1992 ◽

Vol 87 (2) ◽

pp. 65-73 ◽

Cited By ~ 15

Author(s):

C. Di Franco ◽

C. Pisano ◽

F. Fourcade-Peronnet ◽

G. Echalier ◽

N. Junakovic

Keyword(s):

Cell Culture ◽

Transposable Elements ◽

De Novo

Corrections to “De Novo Annotation of Transposable Elements: Tackling the Fat Genome Issue” [Jamilloux et al., Proc. IEEE, vol. 105, no. 3, pp. 474–481, Mar. 2017, DOI: 10.1109/JPROC.2016.2590833]

Proceedings of the IEEE ◽

10.1109/jproc.2017.2680218 ◽

2017 ◽

Vol 105 (5) ◽

pp. 978-978 ◽

Cited By ~ 1

Author(s):

Veronique Jamilloux ◽

Josquin Daron ◽

Frederic Choulet ◽

Hadi Quesneville

Keyword(s):

Transposable Elements ◽

De Novo
