large genomes
Recently Published Documents


2021 ◽  
Author(s):  
Thaissa Boldieri de Souza ◽  
Leticia Maria Parteka ◽  
Rafael de Assis ◽  
Andre Luis Laforga Vanzela

Abstract Cestrum species present large genomes (~24 pg), a high occurrence of B chromosomes, and great diversity in heterochromatin bands. Despite this, chromosome shape and karyotype symmetry are maintained. To deepen our knowledge of Cestrum genome composition, low-coverage sequencing data of C. strigilatum and C. elegans were compared. Bioinformatics analyses showed retrotransposons comprising more than 70% of the repetitive fraction, followed by transposons (~18%). The four satDNA families that accumulated the most in the datasets were used as probes in FISH assays, and showed different distribution profiles along chromosomes. Most hybridization signals were located in the C-CMA/DAPI banding sites, including those related to AT-rich Cold-Sensitive Regions (CSRs) and heterochromatin. Although satellite probes hybridized in all tested species, a satDNA family named CsSat49 was highlighted as it predominates in centromeric regions. Data suggest that the satDNA fraction is still conserved in the genus, although there is variation in the number of FISH signals between karyotypes, as well as in the B chromosomes. This study brings an important advance in the knowledge of genome organization and heterochromatin composition in Cestrum, especially of the distribution and differentiation mechanisms of the satellite fraction between species of a genus of Solanaceae with large genomes.


2021 ◽  
Author(s):  
Eerik Aunin ◽  
Matthew Berriman ◽  
Adam James Reid

Abstract Genome architecture describes how genes and other features are arranged in genomes. These arrangements reflect the evolutionary pressures on genomes and underlie biological processes such as chromosomal segregation and the regulation of gene expression. We present a new tool called Genome Decomposition Analysis (GDA) that characterises genome architectures and acts as an accessible approach for discovering hidden features of a genome assembly. With the imminent deluge of high quality genome assemblies from projects such as the Darwin Tree of Life and the Earth BioGenome Project, GDA has been designed to facilitate their exploration and the discovery of novel genome biology. We highlight the effectiveness of our approach in characterising the genome architectures of single-celled eukaryotic parasites from the phylum Apicomplexa and show that it scales well to large genomes. Significance Genome sequencing has revealed that there are functionally important arrangements of genes, repetitive elements and regulatory sequences within chromosomes. Identifying these arrangements requires extensive computation and analysis. Furthermore, improvements in genome sequencing technology and the establishment of consortia aiming to sequence all species of eukaryotes mean that there is a need for high throughput methods for discovering new genome biology. Here we present a software pipeline, named GDA, which determines the patterns of genomic features across chromosomes and uses these to characterise genome architecture. We show that it recapitulates the known genome architecture of several Apicomplexan parasites and use it to identify features in a recently sequenced, less well-characterised genome. GDA scales well to large genomes and is freely available.


2021 ◽  
Author(s):  
Dieke Boezen ◽  
Ghulam Ali ◽  
Manli Wang ◽  
Xi Wang ◽  
Wopke van der Werf ◽  
...  

Abstract Mutation rates are of key importance for understanding evolutionary processes and predicting their outcomes. Empirical estimates of mutation rate are available for a number of RNA viruses, but few are available for DNA viruses, which tend to have larger genomes. Whilst some viruses have very high mutation rates, lower mutation rates are expected for viruses with large genomes to ensure genome integrity. Alphabaculoviruses are insect viruses with large genomes and often have high levels of polymorphism, suggesting high mutation rates despite evidence of proofreading activity by the replication machinery. Here, we report an empirical estimate of the mutation rate per base per strand copying (s/n/r) of Autographa californica multiple nucleopolyhedrovirus (AcMNPV). To avoid biases due to selection, we analyzed mutations that occurred in a stable, non-functional genomic insert after five serial passages in Spodoptera exigua larvae. Population bottlenecks, viral mode of replication and thresholds for mutation detection likely affect mutation rate estimates, and we therefore used population genetic models that account for these processes to infer the mutation rate. We estimated a mutation rate of 1 × 10⁻⁷ s/n/r. This estimate was not sensitive to different model assumptions or including whole genome data. The rates at which different classes of mutations accumulate provide good evidence for neutrality of mutations occurring within the inserted region. We therefore present a robust approach for mutation rate estimation for viruses with stable genomes, and strong evidence of a much lower alphabaculovirus mutation rate than supposed based on the high levels of polymorphism observed. Author Summary Virus populations can evolve rapidly, driven by the large number of mutations that occur during virus replication. It is challenging to measure mutation rates because selection will affect which mutations are observed: beneficial mutations are overrepresented in virus populations, while deleterious mutations are selected against and therefore underrepresented. Few mutation rates have been estimated for viruses with large DNA genomes, and there are no estimates for any insect virus. Here, we estimate the mutation rate for an alphabaculovirus, a virus that infects caterpillars and has a large, 134 kilobase pair DNA genome. To ensure that selection did not bias our estimate of mutation rate, we studied which mutations occurred in a large artificial region inserted into the virus genome, where mutations did not affect viral fitness. We deep sequenced evolved virus populations, and compared the distribution of observed mutants to predictions from a simulation model to estimate mutation rate. We found evidence for a relatively low mutation rate, of one mutation in every 10 million bases replicated. This estimate is in line with expectations for a virus with self-correcting replication machinery and a large genome.
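
As a rough illustration of what the quoted figures imply, the per-base rate and the genome size given in the abstract above can be multiplied to get an expected number of new substitutions per genome copy. This is only back-of-the-envelope arithmetic, not the population genetic model the authors used; the function name is hypothetical and the genome length is rounded to the 134 kilobase pairs stated in the text.

```python
# Values quoted in the abstract above; everything else is illustrative.
MUTATION_RATE = 1e-7        # substitutions per nucleotide per strand copying (s/n/r)
GENOME_LENGTH_BP = 134_000  # approximate genome size, 134 kilobase pairs


def expected_mutations_per_copy(rate: float, genome_length: int) -> float:
    """Expected number of new substitutions each time one genome strand is copied."""
    return rate * genome_length


per_copy = expected_mutations_per_copy(MUTATION_RATE, GENOME_LENGTH_BP)
copies_per_mutation = 1 / per_copy

print(f"expected substitutions per strand copying: {per_copy:.4f}")
print(f"strand copies per expected substitution:   {copies_per_mutation:.1f}")
```

Under these assumptions most genome copies would carry no new substitution, which is consistent with the abstract's description of the rate as relatively low.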


2021 ◽  
Author(s):  
Slawomir Michniewski ◽  
Branko Rihtman ◽  
Ryan Cook ◽  
Michael Jones ◽  
William Wilson ◽  
...  

Megaphages - bacteriophages harbouring extremely large genomes - have recently been found to be ubiquitous, being described from a variety of microbiomes ranging from the animal gut to soil and freshwater systems. However, no complete marine megaphage has been identified to date. Here, using both short and long read sequencing, we assembled >900 high-quality draft viral genomes from water in the English Channel. One of these genomes was a novel megaphage, Mar_Mega_1, at >650 Kb, making it one of the largest phage genomes assembled to date. Utilising phylogenetic and network approaches, we found this phage represents a new family of bacteriophages. Genomic analysis showed Mar_Mega_1 shares relatively few homologues with its closest relatives, but, as with other megaphages, Mar_Mega_1 contained a variety of auxiliary metabolic genes responsible for carbon metabolism and nucleotide biosynthesis, including isocitrate dehydrogenase [NADP] and nicotinamide-nucleotide amidohydrolase [PncC], which have not previously been identified in megaphages. The results of this study indicate that phages containing extremely large genomes can be found in abundance in the marine environment and augment host metabolism by mechanisms not previously described.


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Thomas Gatter ◽  
Sarah von Löhneysen ◽  
Jörg Fallmann ◽  
Polina Drozdova ◽  
Tom Hartmann ◽  
...  

Abstract Background Advances in genome sequencing over the last years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, “hybrid” methods that integrate short and long read data have been devised to address this need. Results LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB stepwise extracts subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Conclusions LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. Availability The prototype is available at https://github.com/TGatter/LazyB.
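
The step of extracting contigs as maximum weight paths from a directed acyclic graph can be illustrated with a generic textbook routine. The sketch below is not taken from the LazyB code base and does not use the proper-interval-graph properties the abstract mentions; it only shows the standard dynamic program over a topological order. The function name, the read labels and the edge weights (imagined here as overlap scores) are all hypothetical.

```python
from collections import defaultdict, deque


def max_weight_path(edges):
    """Return (total_weight, path) for a maximum-weight path in a DAG.

    `edges` is a list of (source, target, weight) tuples with positive weights.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        successors[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly emit nodes with no unprocessed predecessors.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Dynamic program: best[v] is the heaviest path weight ending at v.
    best = {n: 0 for n in nodes}
    parent = {n: None for n in nodes}
    for u in order:
        for v, w in successors[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                parent[v] = u

    # Walk back from the best endpoint to recover the path.
    node = max(order, key=lambda n: best[n])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return total, path[::-1]


# Hypothetical long-read overlap edges: (read, next read, overlap score).
example_edges = [("r1", "r2", 5), ("r2", "r3", 4), ("r1", "r3", 7), ("r3", "r4", 2)]
total, path = max_weight_path(example_edges)
print(total, path)
```

In this toy input the route through `r2` outweighs the direct `r1` to `r3` edge, so the recovered path visits all four reads.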


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Lijuan Hou ◽  
Jin Xie ◽  
Yaoyao Wu ◽  
Jiaojiao Wang ◽  
Anqi Duan ◽  
...  

Abstract Background Only 1.5% of the human genome encodes proteins, while a large part of the remainder encodes noncoding RNAs (ncRNA). Many ncRNAs form structures and perform many important functions. Accurately identifying structured ncRNAs in the human genome and discovering their biological functions remain a major challenge. Results Here, we have established a pipeline (CM-line) with the following features for analyzing the large genomes of humans and other animals. First, we selected species with larger genetic distances to facilitate the discovery of covariations and compatible mutations. Second, we used CMfinder, which can generate useful alignments even with low sequence conservation. Third, we removed repetitive sequences and known structured ncRNAs to reduce the workload of CMfinder. Fourth, we used Infernal to find more representatives and refine the structure. We reported 11 classes of structured ncRNA candidates with significant covariations in humans. Functional analysis showed that these ncRNAs may have variable functions. Some may regulate circadian clock genes through poly(A) signals (PAS); some may regulate the elongation factor (EEF1A) and the T-cell receptor signaling pathway by cooperating with RNA binding proteins. Conclusions By searching for important features of RNA structure from large genomes, the CM-line has revealed the existence of a variety of novel structured ncRNAs. Functional analysis suggests that some newly discovered ncRNA motifs may have biological functions. The pipeline we have established for the discovery of structured ncRNAs and the identification of their functions can also be applied to analyze other large genomes.


2021 ◽  
Author(s):  
Mallory J Choudoir ◽  
Marko J Järvenpää ◽  
Pekka Marttinen ◽  
Daniel H Buckley

Abstract The evolution of microbial genome size is driven by gene acquisition and loss events that occur at scales from individual genomes to entire pangenomes. The equilibrium between gene gain and loss is shaped by evolutionary forces, including selection and drift, which are in turn influenced by population demographics. There is a well-known bias towards deletion in microbial genomes, which promotes genome streamlining. Less well described are mechanisms that promote genome expansion, giving rise to the many microbes, such as Streptomyces, that have unusually large genomes. We find evidence of genome expansion in Streptomyces sister-taxa, and we hypothesize that a recent demographic range expansion drove increases in genome size through a non-adaptive mechanism. These Streptomyces sister-taxa, NDR (northern-derived) and SDR (southern-derived), represent recently diverged lineages that occupy distinct geographic ranges. Relative to SDR genomes, NDR genomes are larger, have more genes, and their genomes are enriched in intermediate frequency genes. We also find evidence of relaxed selection in NDR genomes relative to SDR genomes. We hypothesize that geographic range expansion, coupled with relaxed selection, facilitated the introgression of non-adaptive horizontally acquired genes, which accumulated at intermediate frequencies through a mechanism known as genome surfing. We show that similar patterns of pangenome structure and genome expansion occur in a simulation that models the effects of population expansion on genome dynamics. We show that non-adaptive evolutionary phenomena can explain expansion of microbial genome size, and suggest that this mechanism might explain why some bacteria with large genomes can be found in soil.


2021 ◽  
Vol 70 (1) ◽  
pp. 156-169
Author(s):  
Deepak Ohri

Abstract Gymnosperms show a significantly higher mean (1C=18.16, 1Cx=16.80) and a narrow range (16.89-fold) of genome sizes as compared with angiosperms. Among the 12 families, the largest ranges of 1C values are shown by Ephedraceae (4.73-fold) and Cupressaceae (4.45-fold), which are partly due to polyploidy, as 1Cx values vary 2.41-fold and 1.37-fold respectively. In the rest of the families, which have only diploid taxa, the range of 1C values is from 1.18-fold (Cycadaceae) to 4.36-fold (Podocarpaceae). The question is how gymnosperms acquired such big genome sizes despite the rarity of recent instances of polyploidy. A general survey of different families and genera shows that gymnosperms have experienced both increase and decrease in their genome size during evolution. Various genomic components which have accounted for these large genomes are discussed. The major contributors are the transposable elements, particularly LTR-retrotransposons comprising the Ty3gypsy, Ty1copia and gymny superfamilies, which are the most widespread. The genomes of gymnosperms have been acquiring diverse LTR-RTs over their long evolution in the absence of any efficient mechanism for their elimination. The epigenetic machinery which silences these large tracts of repeat sequences into stretches of heterochromatin, and the adaptive value of these silenced repeat sequences, need further investigation.


2020 ◽  
Author(s):  
Anton Bankevich ◽  
Andrey Bzikadze ◽  
Mikhail Kolmogorov ◽  
Pavel A. Pevzner

Abstract Although the de Bruijn graphs represent the basis of many genome assemblers, it remains unclear how to construct these graphs for large genomes and large k-mer sizes. This algorithmic challenge has become particularly important with the emergence of long and accurate high-fidelity (HiFi) reads that were recently utilized to generate a semi-manual telomere-to-telomere assembly of the human genome using the alternative string graph assembly approach. To enable fully automated high-quality HiFi assemblies of various genomes, we developed an efficient jumboDB algorithm for constructing the de Bruijn graph for large genomes and large k-mer sizes and the LJA genome assembler that error-corrects HiFi reads and uses jumboDB to construct the de Bruijn graph on the error-corrected reads. Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a new concept of a multiplex de Bruijn graph with varying k-mer sizes. We demonstrate that LJA produces contiguous assemblies of complex repetitive regions in genomes including automated assemblies of various highly-repetitive human centromeres.
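
For readers unfamiliar with the data structure, a de Bruijn graph for a fixed k-mer size can be built naively as sketched below. This is the plain in-memory textbook construction, not the jumboDB algorithm or the multiplex graph described in the abstract, and it would not scale to large genomes or large k; the function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; real HiFi reads are thousands of bases long.
toy_reads = ["ACGTAC", "GTACG"]
toy_graph = de_bruijn_graph(toy_reads, k=3)
for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

Small values of k merge unrelated repeats into shared nodes (a tangled graph), while large values leave gaps wherever reads do not overlap by at least k-1 bases (a fragmented graph), which is the trade-off the abstract refers to.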


2020 ◽  
Author(s):  
Lijuan Hou ◽  
Jin Xie ◽  
Yaoyao Wu ◽  
Jiaojiao Wang ◽  
Anqi Duan ◽  
...  

Abstract Background Only 1.5% of the human genome encodes proteins, while most of the remainder encodes noncoding RNAs (ncRNA). Many ncRNAs form structures and perform many important functions. Accurately identifying structured ncRNAs in the human genome and discovering their biological functions remain a major challenge. Results Here, we have established a pipeline (CM-line) with the following features for analyzing the large genomes of humans and other animals. First, we selected species with larger genetic distances to facilitate the discovery of covariations and compatible mutations. Second, we used CMfinder, which can generate useful alignments even with low sequence conservation. Third, we removed repetitive sequences and known structured ncRNAs to reduce the workload of CMfinder. Fourth, we used Infernal to find more representatives and refine the structure. We reported 11 classes of structured ncRNA candidates with significant covariations in humans. Functional analysis showed that these ncRNAs have variable functions. Some may regulate circadian clock genes through poly(A) signals (PAS); some may regulate the elongation factor (EEF1A) and the T-cell receptor signaling pathway by cooperating with RNA binding proteins. Conclusions By searching for important features of RNA structure from large genomes, the CM-line has revealed the existence of a variety of novel structured ncRNAs. Functional analysis provides evidence for the potential biological functions of some newly found ncRNA motifs. The pipeline we have established for the discovery of structured ncRNAs and the identification of their functions can also be applied to analyze other large genomes.

