LTR_retriever: a highly accurate and sensitive program for identification of LTR retrotransposons

Mapping Intimacies ◽

10.1101/137141 ◽

2017 ◽

Cited By ~ 4

Author(s):

Shujun Ou ◽

Ning Jiang

Keyword(s):

De Novo ◽

Gene Annotation ◽

Long Terminal Repeat Retrotransposons ◽

Ltr Retrotransposons ◽

High Quality ◽

Genome Coverage ◽

Plant Genomes ◽

False Discovery ◽

Genome Wide ◽

Perl Program

ABSTRACTLong terminal-repeat retrotransposons (LTR-RTs) are prevalent in plant genomes. Identification of LTR-RTs is critical for achieving high-quality gene annotation. Based on the well-conserved structure, multiple programs were developed for de novo identification of LTR-RTs; however, these programs are associated with low specificity and high false discovery rate (FDR). Here we report LTR_retriever, a multithreading empowered Perl program that identifies LTR-RTs and generates high-quality LTR libraries from genomic sequences. LTR_retriever demonstrated significant improvements by achieving high levels of sensitivity (91.8%), specificity (94.7%), accuracy (94.3%), and precision (90.6%) in model plants. LTR_retriever is also compatible with long sequencing reads. With 40k self-corrected PacBio reads equivalent to 4.5X genome coverage in Arabidopsis, the constructed LTR library showed excellent sensitivity and specificity. In addition to canonical LTR-RTs with 5'-TG..CA-3' termini, LTR_retriever also identifies non-canonical LTR-RTs (non-TGCA), which have been largely ignored in genome-wide studies. We identified seven types of non-canonical LTRs from 42 out of 50 plant genomes. The majority of non-canonical LTRs are Copia elements, with which the LTR is four times shorter than that of other Copia elements, which may be a result of their target specificity. Strikingly, non-TGCA Copia elements are often located in genic regions and preferentially insert nearby or within genes, indicating their impact on the evolution of genes and potential as mutagenesis tools.

Download Full-text

Genome-wide characterization of LTR retrotransposons in the non-model deep-sea annelid Lamellibrachia luymesi

BMC Genomics ◽

10.1186/s12864-021-07749-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Oluchi Aroh ◽

Kenneth M. Halanych

Keyword(s):

Deep Sea ◽

Long Terminal Repeat Retrotransposons ◽

Marine Organisms ◽

Model Systems ◽

Ltr Retrotransposons ◽

Genome Wide ◽

Lamellibrachia Luymesi ◽

Vestimentiferan Tubeworm ◽

Retrotransposon Activity

Abstract Background Long Terminal Repeat retrotransposons (LTR retrotransposons) are mobile genetic elements composed of a few genes between terminal repeats and, in some cases, can comprise over half of a genome’s content. Available data on LTR retrotransposons have facilitated comparative studies and provided insight on genome evolution. However, data are biased to model systems and marine organisms, including annelids, have been underrepresented in transposable elements studies. Here, we focus on genome of Lamellibrachia luymesi, a vestimentiferan tubeworm from deep-sea hydrocarbon seeps, to gain knowledge of LTR retrotransposons in a deep-sea annelid. Results We characterized LTR retrotransposons present in the genome of L. luymesi using bioinformatic approaches and found that intact LTR retrotransposons makes up about 0.1% of L. luymesi genome. Previous characterization of the genome has shown that this tubeworm hosts several known LTR-retrotransposons. Here we describe and classify LTR retrotransposons in L. luymesi as within the Gypsy, Copia and Bel-pao superfamilies. Although, many elements fell within already recognized families (e.g., Mag, CSRN1), others formed clades distinct from previously recognized families within these superfamilies. However, approximately 19% (41) of recovered elements could not be classified. Gypsy elements were the most abundant while only 2 Copia and 2 Bel-pao elements were present. In addition, analysis of insertion times indicated that several LTR-retrotransposons were recently transposed into the genome of L. luymesi, these elements had identical LTR’s raising possibility of recent or ongoing retrotransposon activity. Conclusions Our analysis contributes to knowledge on diversity of LTR-retrotransposons in marine settings and also serves as an important step to assist our understanding of the potential role of retroelements in marine organisms. We find that many LTR retrotransposons, which have been inserted in the last few million years, are similar to those found in terrestrial model species. However, several new groups of LTR retrotransposons were discovered suggesting that the representation of LTR retrotransposons may be different in marine settings. Further study would improve understanding of the diversity of retrotransposons across animal groups and environments.

Download Full-text

Genome assembly of the JD17 soybean provides a new reference genome for Comparative genomics

10.1101/2021.11.23.469778 ◽

2021 ◽

Author(s):

Xinxin Yi ◽

Jing Liu ◽

Shengcai Chen ◽

Hao Wu ◽

Min Liu ◽

...

Keyword(s):

Nitrogen Fixation ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Genomic Analysis ◽

Comparative Genomic ◽

High Quality ◽

Genome Wide ◽

A Genome ◽

Cultivated Soybean

Cultivated soybean (Glycine max) is an important source for protein and oil. Many elite cultivars with different traits have been developed for different conditions. Each soybean strain has its own genetic diversity, and the availability of more high-quality soybean genomes can enhance comparative genomic analysis for identifying genetic underpinnings for its unique traits. In this study, we constructed a high-quality de novo assembly of an elite soybean cultivar Jidou 17 (JD17) with chromsome contiguity and high accuracy. We annotated 52,840 gene models and reconstructed 74,054 high-quality full-length transcripts. We performed a genome-wide comparative analysis based on the reference genome of JD17 with three published soybeans (WM82, ZH13 and W05) , which identified five large inversions and two large translocations specific to JD17, 20,984 - 46,912 PAVs spanning 13.1 - 46.9 Mb in size, and 5 - 53 large PAV clusters larger than 500kb. 1,695,741 - 3,664,629 SNPs and 446,689 - 800,489 Indels were identified and annotated between JD17 and them. Symbiotic nitrogen fixation (SNF) genes were identified and the effects from these variants were further evaluated. It was found that the coding sequences of 9 nitrogen fixation-related genes were greatly affected. The high-quality genome assembly of JD17 can serve as a valuable reference for soybean functional genomics research.

Download Full-text

De Novo Sequencing, Assembly, and Annotation of Four Threespine Stickleback Genomes Based on Microfluidic Partitioned DNA Libraries

Genes ◽

10.3390/genes10060426 ◽

2019 ◽

Vol 10 (6) ◽

pp. 426 ◽

Cited By ~ 3

Author(s):

Daniel Berner ◽

Marius Roesti ◽

Steven Bilobram ◽

Simon K. Chan ◽

Heather Kirk ◽

...

Keyword(s):

Reference Genome ◽

De Novo ◽

Gene Annotation ◽

Model System ◽

Threespine Stickleback ◽

Evolutionary Genomics ◽

Digital Repository ◽

High Quality ◽

De Novo Gene ◽

Dna Libraries

The threespine stickleback is a geographically widespread and ecologically highly diverse fish that has emerged as a powerful model system for evolutionary genomics and developmental biology. Investigations in this species currently rely on a single high-quality reference genome, but would benefit from the availability of additional, independently sequenced and assembled genomes. We present here the assembly of four new stickleback genomes, based on the sequencing of microfluidic partitioned DNA libraries. The base pair lengths of the four genomes reach 92–101% of the standard reference genome length. Together with their de novo gene annotation, these assemblies offer a resource enhancing genomic investigations in stickleback. The genomes and their annotations are available from the Dryad Digital Repository (https://doi.org/10.5061/dryad.113j3h7).

Download Full-text

LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons

Mobile DNA ◽

10.1186/s13100-019-0193-0 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 6

Author(s):

Shujun Ou ◽

Ning Jiang

Keyword(s):

Triticum Aestivum ◽

Long Terminal Repeat ◽

Bread Wheat ◽

Repetitive Sequences ◽

Long Terminal Repeat Retrotransposons ◽

Rapid Identification ◽

Terminal Repeat ◽

Parallel Operation ◽

Ltr Retrotransposons ◽

Plant Genomes

AbstractAnnotation of plant genomes is still a challenging task due to the abundance of repetitive sequences, especially long terminal repeat (LTR) retrotransposons. LTR_FINDER is a widely used program for the identification of LTR retrotransposons but its application on large genomes is hindered by its single-threaded processes. Here we report an accessory program that allows parallel operation of LTR_FINDER, resulting in up to 8500X faster identification of LTR elements. It takes only 72 min to process the 14.5 Gb bread wheat (Triticum aestivum) genome in comparison to 1.16 years required by the original sequential version. LTR_FINDER_parallel is freely available at https://github.com/oushujun/LTR_FINDER_parallel.

Download Full-text

An integrated personal and population-based Egyptian genome reference

10.1101/681254 ◽

2019 ◽

Author(s):

Inken Wohlers ◽

Axel Künstner ◽

Matthias Munz ◽

Michael Olbrich ◽

Anke Fähnrich ◽

...

Keyword(s):

Genetic Variation ◽

Human Genome ◽

De Novo ◽

Population Based ◽

European Ancestry ◽

Personal Genome ◽

High Quality ◽

Data Set ◽

Egyptian Population ◽

Genome Wide

AbstractThe human genome is composed of chromosomal DNA sequences consisting of bases A, C, G and T – the blueprint to implement the molecular functions that are the basis of every individual’s life. Deciphering the first human genome was a consortium effort that took more than a decade and considerable cost. With the latest technological advances, determining an individual’s entire personal genome with manageable cost and effort has come within reach. Although the benefits of the all-encompassing genetic information that entire genomes provide are manifold, only a small number of de novo assembled human genomes have been reported to date 1–3, and few have been complemented with population-based genetic variation 4, which is particularly important for North Africans who are not represented in current genome-wide data sets 5–7. Here, we combine long- and short-read whole-genome next-generation sequencing data with recent assembly approaches into the first de novo assembly of the genome of an Egyptian individual. The resulting assembly demonstrates well-balanced quality metrics and is complemented with high-quality variant phasing via linked reads into haploblocks, which we can associate with gene expression changes in blood. To construct an Egyptian genome reference, we further assayed genome-wide genetic variation occurring in the Egyptian population within a representative cohort of 110 Egyptian individuals. We show that differences in allele frequencies and linkage disequilibrium between Egyptians and Europeans may compromise the transferability of European ancestry-based genetic disease risk and polygenic scores, substantiating the need for multi-ethnic genetic studies and corresponding genome references. The Egyptian genome reference represents a comprehensive population data set based on a high-quality personal genome. It is a proof of concept to be considered by the many national and international genome initiatives underway. More importantly, we anticipate that the Egyptian genome reference will be a valuable resource for precision medicine targeting the Egyptian population and beyond.

Download Full-text

Biodiversity genomics of small metazoans: high quality de novo genomes from single specimens of field-collected and ethanol-preserved springtails

10.1101/2020.08.10.244541 ◽

2020 ◽

Cited By ~ 2

Author(s):

Clément Schneider ◽

Christian Woehle ◽

Carola Greve ◽

Cyrille A. D’Haese ◽

Magnus Wolf ◽

...

Keyword(s):

Genome Sequencing ◽

De Novo ◽

Genetic Model ◽

Nuclear Genome ◽

High Quality ◽

Sequencing Technologies ◽

Natural Product Research ◽

Genome Wide ◽

Long Read ◽

Hmw Dna

ABSTRACTGenome sequencing of all known eukaryotes on Earth promises unprecedented advances in evolutionary sciences, ecology, systematics and in biodiversity-related applied fields such as environmental management and natural product research. Advances in DNA sequencing technologies make genome sequencing feasible for many non-genetic model species. However, genome sequencing today relies on large quantities of high quality, high molecular weight (HMW) DNA which is mostly obtained from fresh tissues. This is problematic for biodiversity genomics of Metazoa as most species are small and yield minute amounts of DNA. Furthermore, briging living specimens to the lab bench not realistic for the majority of species.Here we overcome those difficulties by sequencing two species of springtails (Collembola) from single specimens preserved in ethanol. We used a newly developed, genome-wide amplification-based protocol to generate PacBio libraries for HiFi long-read sequencing.The assembled genomes were highly continuous. They can be considered complete as we recovered over 95% of BUSCOs. Genome-wide amplification does not seem to bias genome recovery. Presence of almost complete copies of the mitochondrial genome in the nuclear genome were pitfalls for automatic assemblers. The genomes fit well into an existing phylogeny of springtails. A neotype is designated for one of the species, blending genome sequencing and creation of taxonomic references.Our study shows that it is possible to obtain high quality genomes from small, field-preserved sub-millimeter metazoans, thus making their vast diversity accessible to the fields of genomics.

Download Full-text

LeafGo: Leaf to Genome, a quick workflow to produce high-quality de novo plant genomes using long-read sequencing technology

Genome Biology ◽

10.1186/s13059-021-02475-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Patrick Driguez ◽

Salim Bougouffa ◽

Karen Carty ◽

Alexander Putra ◽

Kamel Jabbari ◽

...

Keyword(s):

De Novo ◽

Plant Genome ◽

Time Cost ◽

Sequencing Technology ◽

High Quality ◽

Plant Genomes ◽

Long Read ◽

Generate Plant ◽

Sequencing Platforms ◽

Chromosome Level

AbstractCurrently, different sequencing platforms are used to generate plant genomes and no workflow has been properly developed to optimize time, cost, and assembly quality. We present LeafGo, a complete de novo plant genome workflow, that starts from tissue and produces genomes with modest laboratory and bioinformatic resources in approximately 7 days and using one long-read sequencing technology. LeafGo is optimized with ten different plant species, three of which are used to generate high-quality chromosome-level assemblies without any scaffolding technologies. Finally, we report the diploid genomes of Eucalyptus rudis and E. camaldulensis and the allotetraploid genome of Arachis hypogaea.

Download Full-text

Refinement of Draft Genome Assemblies of Pigeonpea (Cajanus cajan)

10.1101/2020.08.10.243949 ◽

2020 ◽

Author(s):

Soma Marla ◽

Pallavi Mishra ◽

Ranjeet Maurya ◽

Mohar Singh ◽

D. P. Wankhede ◽

...

Keyword(s):

Cajanus Cajan ◽

De Novo ◽

Draft Genome ◽

Data Sets ◽

Segmental Duplications ◽

Genome Coverage ◽

Plant Genomes ◽

Original Genome ◽

Specific Trait ◽

Important Quality

AbstractGenome assembly of short reads from large plant genomes remains a challenge in computational biology despite major developments in Next Generation sequencing. Of late multiple draft assemblies of plant genomes are reported in many organisms. The draft assemblies of Cajanus cajan are with different levels of genome completeness; contain large number of repeats, gaps and segmental duplications. Draft assemblies with portions of genome missing, are shorter than the referenced original genome. These assemblies come with low map accuracy affecting further functional annotation and prediction of gene component as desired by crop researchers. Genome coverage i.e. number of sequenced raw reads mapped on to certain locations of the genome is an important quality indicator of completeness and assembly quality in draft assemblies. Present work was aimed at improvement of coverage in reported de novo sequenced draft genomes (GCA_000340665.1 and GCA_000230855.2) of Pigeonpea, a legume widely cultivated in India. The two assemblies comprised 72% and 75% of estimated coverage of genome respectively. We employed assembly reconciliation approach to compare draft assemblies and merged them to generate a high quality near complete assembly with enhanced contiguity. Finished assembly has reduced number of gaps than reported in draft assemblies and improved genome coverage of 82.4%. Quality of the finished assembly was evaluated using various quality metrics and for presence of specific trait related functional genes. Employed pair-end and mate-pair local library data sets enabled to resolve gaps, repeats and other sequence errors yielding lengthier scaffolds compared to two draft assemblies. We report prediction of putative host resistance genes from improved sequence against Fusarium wilt disease and evaluated them in both wet laboratory and field phenotypic conditions.

Download Full-text

De novo assembly of a Tibetan genome and identification of novel structural variants associated with high altitude adaptation

10.1101/753186 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ouzhuluobu ◽

Yaoxi He ◽

Haiyi Lou ◽

Chaoying Cui ◽

Lian Deng ◽

...

Keyword(s):

High Altitude ◽

De Novo ◽

Population Analysis ◽

Extreme Environments ◽

Structural Variants ◽

Base Pairs ◽

High Quality ◽

Genome Wide ◽

Long Read ◽

Altitude Adaptation

AbstractStructural variants (SVs) may play important roles in human adaption to extreme environments such as high altitude but have been under-investigated. Here, combining long-read sequencing with multiple scaffolding techniques, we assembled a high-quality Tibetan genome (ZF1), with a contig N50 length of 24.57 mega-base pairs (Mb) and a scaffold N50 length of 58.80 Mb. The ZF1 assembly filled 80 remaining N-gaps (0.25 Mb in total length) in the reference human genome (GRCh38). Markedly, we detected 17,900 SVs, among which the ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway. Further population analysis uncovered a 163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese. This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans. Moreover, with the use of the high quality de novo assembly, we observed a much higher rate of genome-wide archaic hominid (Altai Neanderthal and Denisovan) shared non-reference sequences in ZF1 (1.32%-1.53%) compared to other East Asian genomes (0.70%-0.98%), reflecting a unique genomic composition of Tibetans. One such archaic-hominid shared sequence, a 662-bp intronic insertion in the SCUBE2 gene, is enriched and associated with better lung function (the FEV1/FVC ratio) in Tibetans. Collectively, we generated the first high-resolution Tibetan reference genome, and the identified SVs may serve as valuable resources for future evolutionary and medical studies.

Download Full-text

LtrDetector: A modern tool-suite for detecting long terminal repeat retrotransposons de-novo on the genomic scale

10.1101/448969 ◽

2018 ◽

Author(s):

Joseph D Valencia ◽

Hani Z Girgis

Keyword(s):

Long Terminal Repeat ◽

Large Scale ◽

De Novo ◽

False Positive Rate ◽

Long Terminal Repeat Retrotransposons ◽

Terminal Repeat ◽

Plant Genomes ◽

Positive Rate ◽

Genomic Scale ◽

Species Specific

AbstractLong terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and genomic evolution. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None are concurrent or have features to support manual review of new elements. To overcome these limitations, we developed LtrDetector, which uses signal-processing techniques. LtrDetector is easy to install and use. It is not species specific. It utilizes multi-core processors available in personal computers. It is more sensitive than other tools by 14.4%–50.8% while maintaining a low false positive rate on six plant genomes.

Download Full-text