IsoPlotter+: A Tool for Studying the Compositional Architecture of Genomes

2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Eran Elhaik ◽  
Dan Graur

Eukaryotic genomes, particularly animal genomes, have a complex, nonuniform, and nonrandom internal compositional organization. The compositional organization of animal genomes can be described as a mosaic of discrete genomic regions, called “compositional domains,” each with a distinct GC content that significantly differs from those of its upstream and downstream neighboring domains. A typical animal genome consists of a mixture of compositionally homogeneous and nonhomogeneous domains of varying lengths and nucleotide compositions that are interspersed with one another. We have devised IsoPlotter, an unbiased segmentation algorithm for inferring the compositional organization of genomes. IsoPlotter has become an indispensable tool for describing genomic composition and has been used in the analysis of more than a dozen genomes. Applications include describing new genomes, correlating domain composition with gene composition and density, studying the evolution of genomes, testing phylogenomic hypotheses, and detecting regions of potential interbreeding between humans and extinct hominines. To extend the use of IsoPlotter, we designed a completely automated pipeline, called IsoPlotter+, to carry out all segmentation analyses, including graphical display, and built a repository for compositional domain maps of all fully sequenced vertebrate and invertebrate genomes. The IsoPlotter+ pipeline and repository offer a comprehensive solution to the study of genome compositional architecture. Here, we demonstrate IsoPlotter+ by applying it to human and insect genomes. The computational tools and data repository are available online.
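As a rough illustration of the quantity that compositional domains are defined on, the sketch below computes GC content for a sequence and for consecutive fixed-size windows. It is not the IsoPlotter segmentation algorithm, which the abstract does not describe in detail; the function names `gc_content` and `windowed_gc` and the choice to ignore non-ACGT symbols are assumptions made for this example.

```python
import math
from typing import List


def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T).

    Symbols such as N are ignored (an assumption of this sketch).
    Returns NaN when the sequence has no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else math.nan


def windowed_gc(seq: str, window: int) -> List[float]:
    """GC content of consecutive, non-overlapping windows of a fixed size.

    The last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]
```

For example, `windowed_gc("GGGGAAAA", 4)` yields one GC-rich and one GC-poor window. A segmentation method would instead place domain boundaries where composition changes significantly, rather than at fixed offsets.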

Genes ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 9
Author(s):  
Mikhail Biryukov ◽  
Kirill Ustyantsev

Retrotransposons comprise a substantial fraction of eukaryotic genomes, reaching the highest proportions in plants. Therefore, identification and annotation of retrotransposons is an important task in studying the regulation and evolution of plant genomes. The majority of computational tools for mining transposable elements (TEs) are designed for subsequent genome repeat masking, often leaving aside the element lineage classification and its protein domain composition. Additionally, studies focused on the diversity and evolution of a particular group of retrotransposons often require substantial customization efforts from researchers to adapt existing software to their needs. Here, we developed DARTS (Domain-Associated Retrotransposon Search), a computational pipeline to mine sequences of protein-coding retrotransposons based on the sequences of their conserved protein domains. Using the most abundant group of TEs in plants, long terminal repeat (LTR) retrotransposons (LTR-RTs), we show that DARTS has radically higher sensitivity for LTR-RT identification compared to the widely accepted tool LTRharvest. DARTS can be easily customized for specific user needs. As a result, DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses. DARTS may help researchers interested in the discovery and detailed analysis of the diversity and evolution of retrotransposons, LTR-RTs, and other protein-coding TEs.


2021 ◽  
Author(s):  
Mikhail Biryukov ◽  
Kirill Ustyantsev

Retrotransposons comprise a substantial fraction of eukaryotic genomes, reaching the highest proportions in plants. Therefore, identification and annotation of retrotransposons is an important task in studying the regulation and evolution of plant genomes. The majority of computational tools for mining transposable elements (TEs) are designed for subsequent genome repeat masking, often leaving aside the element lineage classification and its protein domain composition. Additionally, studies focused on the diversity and evolution of a particular group of retrotransposons often require substantial customization efforts from researchers to adapt existing software to their needs. Here, we developed DARTS, a computational pipeline to mine sequences of protein-coding retrotransposons based on the sequences of their conserved protein domains. Using the most abundant group of TEs in plants, long terminal repeat (LTR) retrotransposons (LTR-RTs), we show that DARTS has radically higher sensitivity for LTR-RT identification compared to the widely accepted LTRharvest tool. DARTS can be easily customized for specific user needs. As a result, DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses. DARTS should help researchers interested in the discovery and detailed analysis of the diversity and evolution of retrotransposons, LTR-RTs, and other protein-coding TEs.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Alexandre Perochon ◽  
Harriet R. Benbow ◽  
Katarzyna Ślęczka-Brady ◽  
Keshav B. Malla ◽  
Fiona M. Doohan

There is increasing evidence that some functionally related, co-expressed genes cluster within eukaryotic genomes. We present a novel pipeline that delineates such eukaryotic gene clusters. Using this tool for bread wheat, we uncovered 44 clusters of genes that are responsive to the fungal pathogen Fusarium graminearum. As expected, these Fusarium-responsive gene clusters (FRGCs) included metabolic gene clusters, many of which are associated with disease resistance but have not hitherto been described for wheat. However, the majority of the FRGCs are non-metabolic, and many of them contain clusters of paralogues, including those implicated in plant disease responses, such as glutathione transferases, MAP kinases, and germin-like proteins. Twenty of the FRGCs encode nonhomologous, non-metabolic genes (including defence-related genes). One of these clusters includes the characterised Fusarium resistance orphan gene, TaFROG. Eight of the FRGCs map within six FHB resistance loci. One small QTL on chromosome 7D (4.7 Mb) encodes eight Fusarium-responsive genes, five of which are within an FRGC. This study provides a new tool to identify genomic regions enriched in genes responsive to specific traits of interest; applied here, it highlighted gene families, genetic loci and biological pathways of importance in the response of wheat to disease.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Shujun Ou ◽  
Weijia Su ◽  
Yi Liao ◽  
Kapeel Chougule ◽  
Jireh R. A. Agda ◽  
...  

Background: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. Results: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. Conclusions: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
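The six performance metrics named in the abstract all derive from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions; it is not taken from the EDTA code base, and the names `Confusion` and `metrics` are hypothetical. What the counts represent (annotated elements or base pairs) is left to the caller, since the abstract does not specify it.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Confusion:
    """Counts of true/false positives and negatives for one annotation run."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(num: float, den: float) -> float:
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return num / den if den else math.nan


def metrics(c: Confusion) -> Dict[str, float]:
    """Standard definitions of the six metrics listed in the abstract."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),          # recall
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "fdr": _ratio(c.fp, c.tp + c.fp),                  # 1 - precision
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),    # harmonic mean of precision and recall
    }
```

Reporting FDR alongside precision is redundant numerically (they sum to 1), but both appear in the abstract's list, so both are returned here.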


2015 ◽  
Author(s):  
Rob W Ness ◽  
Andrew D Morgan ◽  
Radhakrishnan B Vasanthakrishnan ◽  
Nick Colegrave ◽  
Peter D Keightley

Describing the process of spontaneous mutation is fundamental for understanding the genetic basis of disease, the threat posed by declining population size in conservation biology, and much of evolutionary biology. However, directly studying spontaneous mutation is difficult because of the rarity of de novo mutations. Mutation accumulation (MA) experiments overcome this by allowing mutations to build up over many generations in the near absence of natural selection. In this study, we sequenced the genomes of 85 MA lines derived from six genetically diverse wild strains of the green alga Chlamydomonas reinhardtii. We identified 6,843 spontaneous mutations, more than any other study of spontaneous mutation. We observed seven-fold variation in the mutation rate among strains and found that mutator genotypes arose, increasing the mutation rate dramatically in some replicates. We also found evidence for fine-scale heterogeneity in the mutation rate, driven largely by the sequence flanking mutated sites, and by clusters of multiple mutations at closely linked sites. There was little evidence, however, for mutation rate heterogeneity between chromosomes or over large genomic regions of 200 kbp. Using logistic regression, we generated a predictive model of the mutability of sites based on their genomic properties, including local GC content, gene expression level and local sequence context. Our model accurately predicted the average mutation rate and natural levels of genetic diversity of sites across the genome. Notably, trinucleotides vary 17-fold in rate between the most mutable and least mutable sites. Our results uncover a rich heterogeneity in the process of spontaneous mutation both among individuals and across the genome.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Linda Beauclair ◽  
Christelle Ramé ◽  
Peter Arensburger ◽  
Benoît Piégu ◽  
Florian Guillou ◽  
...  

Background: More and more eukaryotic genomes are sequenced and assembled, most of them presented as a complete model in which missing chromosomal regions are filled by Ns and where a few chromosomes may be lacking. Avian genomes often contain sequences with high GC content, which has been hypothesized to be at the origin of many missing sequences in these genomes. We investigated features of these missing sequences to discover why some may not have been integrated into genomic libraries and/or sequenced. Results: The sequences of five red jungle fowl cDNA models with high GC content were used as queries to search publicly available datasets of Illumina and PacBio sequencing reads. These were used to reconstruct the leptin, TNFα, MRPL52, PCP2 and PET100 genes, all of which are absent from the red jungle fowl genome model. These gene sequences displayed elevated GC contents, had intron sizes that were sometimes larger than non-avian orthologues, and had non-coding regions that contained numerous tandem and inverted repeat sequences with motifs able to assemble into stable G-quadruplexes and intrastrand dyadic structures. Our results suggest that Illumina technology was unable to sequence the non-coding regions of these genes. On the other hand, PacBio technology was able to sequence these regions, but with dramatically lower efficiency than would typically be expected. Conclusions: High GC content was not the principal reason why numerous GC-rich regions of avian genomes are missing from genome assembly models. Instead, it is the presence of tandem repeats containing motifs capable of assembling into very stable secondary structures that is likely responsible.


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Alessandra Borgognone ◽  
Walter Sanseverino ◽  
Riccardo Aiese Cigliano ◽  
Raúl Castanera

Long noncoding RNAs (lncRNAs) have been thoroughly studied in plants, animals, and yeasts, where they play important roles as regulators of transcription. Nevertheless, almost nothing is known about their presence and characteristics in filamentous fungi, especially in basidiomycetes. In the present study, we have carried out an exhaustive annotation and characterization of lncRNAs in two lignin-degrading basidiomycetes, Coniophora puteana and Serpula lacrymans. We identified 2,712 putative lncRNAs in the former and 2,242 in the latter, mainly originating from intergenic locations of transposon-sparse genomic regions. The lncRNA length, GC content, expression levels, and stability of the secondary structure differ from those of coding transcripts but are similar in these two species and resemble those of other eukaryotes. Nevertheless, they lack sequence conservation. Also, we found that lncRNAs are transcriptionally regulated in the same proportion as genes when the fungus actively decomposes soil organic matter. Finally, up to 7% of the upstream gene regions of Coniophora puteana and Serpula lacrymans are transcribed and produce lncRNAs. The study of expression trends in these gene-lncRNA pairs uncovered groups with similar and opposite transcriptional profiles, which may be the result of cis-transcriptional regulation.


2020 ◽  
Vol 37 (8) ◽  
pp. 2197-2210 ◽  
Author(s):  
Rodrigo Pracana ◽  
Adam D Hargreaves ◽  
John F Mulley ◽  
Peter W H Holland

Recombination increases the local GC-content in genomic regions through GC-biased gene conversion (gBGC). The recent discovery of a large genomic region with extreme GC-content in the fat sand rat Psammomys obesus provides a model to study the effects of gBGC on chromosome evolution. Here, we compare the GC-content and GC-to-AT substitution patterns across protein-coding genes of four gerbil species and two murine rodents (mouse and rat). We find that the known high-GC region is present in all the gerbils, and is characterized by high substitution rates for all mutational categories (AT-to-GC, GC-to-AT, and GC-conservative) both at synonymous and nonsynonymous sites. A higher AT-to-GC than GC-to-AT rate is consistent with the high GC-content. Additionally, we find more than 300 genes outside the known region with outlying values of AT-to-GC synonymous substitution rates in gerbils. Of these, over 30% are organized into at least 17 large clusters observable at the megabase-scale. The unusual GC-skewed substitution pattern suggests the evolution of genomic regions with very high recombination rates in the gerbil lineage, which can lead to a runaway increase in GC-content. Our results imply that rapid evolution of GC-content is possible in mammals, with gerbil species providing a powerful model to study the mechanisms of gBGC.
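The three mutational categories named above can be made concrete with a small classifier. This is only a sketch of the usual weak (A, T) versus strong (G, C) base convention; the function name `classify_substitution` is hypothetical, and the code says nothing about how the study itself inferred ancestral states or estimated rates.

```python
WEAK = frozenset("AT")    # bases forming two hydrogen bonds
STRONG = frozenset("GC")  # bases forming three hydrogen bonds


def classify_substitution(ancestral: str, derived: str) -> str:
    """Classify a single-nucleotide substitution by its effect on GC content.

    Returns "AT-to-GC", "GC-to-AT", or "GC-conservative" (A<->T or G<->C).
    """
    a, d = ancestral.upper(), derived.upper()
    valid = WEAK | STRONG
    if a not in valid or d not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if a == d:
        raise ValueError("ancestral and derived bases are identical")
    if a in WEAK and d in STRONG:
        return "AT-to-GC"
    if a in STRONG and d in WEAK:
        return "GC-to-AT"
    return "GC-conservative"
```

Under this scheme, an excess of "AT-to-GC" over "GC-to-AT" changes in a region is the pattern the abstract describes as consistent with high GC-content.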


2019 ◽  
Author(s):  
Eric J. Foss ◽  
Smitha Sripathy ◽  
Tonibelle Gatbonton-Schwager ◽  
Hyunchang Kwak ◽  
Adam H. Thiesen ◽  
...  

The spatio-temporal program of genome replication across eukaryotes is thought to be driven both by the uneven loading of pre-replication complexes (pre-RCs) across the genome at the onset of S-phase, and by differences in the timing of activation of these complexes during S-phase. To determine the degree to which distribution of pre-RC loading alone could account for chromosomal replication patterns, we mapped the binding sites of the Mcm2-7 helicase complex (MCM) in budding yeast, fission yeast, mouse and humans. We observed identical MCM double-hexamer footprints across the species, but notable differences in their distribution: in budding yeast, complexes were present in sharp peaks composed largely of single double-hexamers; in fission yeast, corresponding peaks typically contained 4 to 8 double-hexamers, were more disperse, and showed a striking correlation with AT content. In mouse and humans, complexes were even more disperse, with a preference for regions of high GC content. Nonetheless, most fluctuations in replication timing in all four organisms could be accounted for by differences in chromosomal MCM distribution. This analysis also identified genomic regions whose replication timing was clearly not attributable to MCM density. The most notable was the inactive X-chromosome, which replicates late in S-phase even though both MCM abundance and chromosomal distribution were comparable to those on the early replicating active X-chromosome. We conclude that, although certain genomic regions, most notably the inactive X-chromosome, are subject to post-licensing regulation, most differences in replication timing along the chromosome reflect uneven chromosomal distribution of stochastically firing pre-replication complexes.


2021 ◽  
Author(s):  
Ruben Chevez-Guardado ◽  
Lourdes Pena-Castillo

Promoters are genomic regions where the transcription machinery binds to initiate the transcription of specific genes. Computational tools for identifying bacterial promoters have been around for decades. However, most of these tools were designed to recognize promoters in one or few bacterial species. Here, we present Promotech, a machine-learning-based method for promoter recognition in a wide range of bacterial species. We compared Promotech's performance with the performance of five other promoter prediction methods. Promotech outperformed these other programs in terms of area under the precision-recall curve (AUPRC) or precision at the same level of recall. Promotech is available at https://github.com/BioinformaticsLabAtMUN/PromoTech.
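The area under the precision-recall curve mentioned above can be estimated in several ways. The sketch below computes average precision, one common step-wise estimator of AUPRC; it is not Promotech's evaluation code, and the function name `average_precision` is an assumption of this example. Tied scores are ranked in input order, which is a simplification.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values at each true positive,
    with candidates ranked by descending score.

    `labels` holds 1 for a true promoter and 0 otherwise (hypothetical encoding).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank candidate indices from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos
```

A predictor that ranks every true promoter above every non-promoter scores 1.0; precision-based summaries such as this are often preferred over ROC-based ones when positives are rare, as promoters are relative to genome size.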

