Allele Phasing Greatly Improves the Phylogenetic Utility of Ultraconserved Elements

Mapping Intimacies ◽

10.1101/255752 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Alexandre M. Fernandes ◽

Urban Olsson ◽

Mats Töpel ◽

Bernard Pfeil ◽

...

Keyword(s):

Statistical Power ◽

High Throughput Sequencing ◽

De Novo ◽

Allelic Variation ◽

Sequencing Depth ◽

High Rate ◽

Model Organisms ◽

Sequence Capture ◽

Phylogenetic Studies ◽

Biogeographic History

Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for non-model organisms. Many phylogenetic studies reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences, which result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences into analyses performed under the Multispecies Coalescent (MSC) model. Our empirical analyses of Ultraconserved Element (UCE) locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last three million years. The phylogenetic results support the recognition of two species, and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River. Our simulations provide evidence that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences.
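
The difference between a contig sequence and a pair of phased allele sequences can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `build_alleles` and the input format (0-based positions with one base per haplotype) are assumptions made for illustration, and the phasing itself is taken as already done by an external tool.

```python
from typing import List, Tuple

# Hypothetical input format: (0-based position, base on haplotype 0, base on haplotype 1).
PhasedSite = Tuple[int, str, str]


def build_alleles(contig: str, phased_sites: List[PhasedSite]) -> Tuple[str, str]:
    """Return two allele sequences derived from one contig.

    The contig carries a single canonical base at every position, so a
    heterozygous site is represented by only one of its two bases. Here the
    phased bases are written back into two copies of the contig.
    """
    allele0 = list(contig)
    allele1 = list(contig)
    for pos, base0, base1 in phased_sites:
        if not 0 <= pos < len(contig):
            raise ValueError(f"position {pos} is outside the contig")
        allele0[pos] = base0
        allele1[pos] = base1
    return "".join(allele0), "".join(allele1)


contig = "ACGTACGT"
sites = [(2, "G", "A"), (5, "C", "T")]  # two heterozygous sites, already phased
allele_a, allele_b = build_alleles(contig, sites)
```

In this toy case the contig is identical to one allele, and the two variant bases carried by the other allele would be invisible to an analysis that used the contig alone.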

Assessing Genome-Wide Diversity in European Hantaviruses through Sequence Capture from Natural Host Samples

Viruses ◽

10.3390/v12070749 ◽

2020 ◽

Vol 12 (7) ◽

pp. 749 ◽

Cited By ~ 3

Author(s):

Melanie Hiltbrunner ◽

Gerald Heckel

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Phylogenetic Analyses ◽

Common Vole ◽

Natural Host ◽

Sequence Information ◽

Coding Region ◽

Sequence Capture ◽

Genome Wide ◽

The Impact

Research on the ecology and evolution of viruses is often hampered by the limitation of sequence information to short parts of the genomes or single genomes derived from cultures. In this study, we use hybrid sequence capture enrichment in combination with high-throughput sequencing to provide efficient access to full genomes of European hantaviruses from rodent samples obtained in the field. We applied this methodology to Tula (TULV) and Puumala (PUUV) orthohantaviruses for which analyses from natural host samples are typically restricted to partial sequences of their tri-segmented RNA genome. We assembled a total of ten novel hantavirus genomes de novo with very high coverage (on average >99%) and sequencing depth (average >247×). A comparison with partial Sanger sequences indicated an accuracy of >99.9% for the assemblies. An analysis of two common vole (Microtus arvalis) samples infected with two TULV strains each allowed for the de novo assembly of all four TULV genomes. Combining the novel sequences with all available TULV and PUUV genomes revealed very similar patterns of sequence diversity along the genomes, except for remarkably higher diversity in the non-coding region of the S-segment in PUUV. The genomic distribution of polymorphisms in the coding sequence was similar between the species, but differed between the segments with the highest sequence divergence of 0.274 for the M-segment, 0.265 for the S-segment, and 0.248 for the L-segment (overall 0.258). Phylogenetic analyses showed the clustering of genome sequences consistent with their geographic distribution within each species. Genome-wide data yielded extremely high node support values, despite the impact of strong mutational saturation that is expected for hantavirus sequences obtained over large spatial distances. 
We conclude that genome sequencing based on capture enrichment protocols provides an efficient means for ecological and evolutionary investigations of hantaviruses at an unprecedented completeness and depth.
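
The two assembly metrics quoted above, coverage and sequencing depth, can be computed from per-position read depths. The sketch below is a minimal Python illustration under stated assumptions: the function name `coverage_stats` and the plain list of integer depths are hypothetical, and it is not the procedure used in the study.

```python
from typing import Sequence, Tuple


def coverage_stats(depths: Sequence[int], min_depth: int = 1) -> Tuple[float, float]:
    """Return (breadth of coverage, mean depth) for one reference segment.

    depths    -- number of reads covering each reference position
    min_depth -- a position counts as covered when its depth is >= min_depth
    """
    if len(depths) == 0:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    breadth = covered / len(depths)
    mean_depth = sum(depths) / len(depths)
    return breadth, mean_depth


# Toy segment of four positions, one of which has no reads.
breadth, mean_depth = coverage_stats([0, 10, 20, 30])
```

Per-position depths of this kind are typically exported from an alignment file by a depth-reporting tool; how they were obtained for the hantavirus genomes is not stated in the abstract.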

Assembly by Reduced Complexity (ARC): a hybrid approach for targeted assembly of homologous sequences.

10.1101/014662 ◽

2015 ◽

Cited By ~ 17

Author(s):

Samual S Hunter ◽

Robert T Lyon ◽

Brice A.J. Sarver ◽

Kayla Hardwick ◽

Larry J Forney ◽

...

Keyword(s):

De Novo Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

Hybrid Approach ◽

Reference Sequence ◽

Model Organisms ◽

Exome Capture ◽

Mitochondrial Genomes ◽

Homologous Sequences ◽

Reduced Complexity

Analysis of high-throughput sequencing (HTS) data is a difficult problem, especially in the context of non-model organisms where comparison of homologous sequences may be hindered by the lack of a close reference genome. Current mapping-based methods rely on the availability of a highly similar reference sequence, whereas de novo assemblies produce anonymous (unannotated) contigs that are not easily compared across samples. Here, we present Assembly by Reduced Complexity (ARC), a hybrid mapping and assembly approach for targeted assembly of homologous sequences. ARC is an open-source project (http://ibest.github.io/ARC/) implemented in the Python language and consists of the following stages: 1) align sequence reads to reference targets, 2) use alignment results to distribute reads into target-specific bins, 3) perform assemblies for each bin (target) to produce contigs, and 4) replace previous reference targets with assembled contigs and iterate. We show that ARC is able to assemble high-quality, unbiased mitochondrial genomes seeded from 11 progressively divergent references, and is able to assemble full mitochondrial genomes starting from short, poor-quality ancient DNA reads. We also show that ARC compares favorably to de novo assembly of a large exome capture dataset for CPU and memory requirements: it assembled 7,627 individual targets across 55 samples, completing over 1.3 million assemblies in less than 78 hours, while using under 32 Gb of system memory. ARC breaks the assembly problem down into many smaller problems, solving the anonymous contig and poor scaling inherent in some de novo assembly methods and the reference bias inherent in traditional read mapping.
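
The four stages listed above can be sketched as a toy loop in Python. This is not ARC's code: shared k-mers stand in for a real read aligner, a greedy suffix/prefix extension stands in for a real assembler, and every function name and parameter below is a hypothetical chosen for illustration.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_reads(reads: List[str], targets: Dict[str, str], k: int = 4) -> Dict[str, List[str]]:
    """Stages 1-2 (stand-in): put a read in a target's bin if they share a k-mer."""
    bins: Dict[str, List[str]] = {name: [] for name in targets}
    for read in reads:
        read_kmers = kmers(read, k)
        for name, seq in targets.items():
            if read_kmers & kmers(seq, k):
                bins[name].append(read)
    return bins


def greedy_extend(seed: str, reads: List[str], min_overlap: int = 4) -> str:
    """Stage 3 (stand-in): extend the seed rightwards with reads whose prefix
    matches the current contig suffix. Each read is used at most once."""
    contig = seed
    used: Set[int] = set()
    extended = True
    while extended:
        extended = False
        for i, read in enumerate(reads):
            if i in used:
                continue
            longest = min(len(read) - 1, len(contig))
            for ov in range(longest, min_overlap - 1, -1):
                if contig.endswith(read[:ov]):
                    contig += read[ov:]
                    used.add(i)
                    extended = True
                    break
    return contig


def iterate_targets(reads: List[str], targets: Dict[str, str], n_iter: int = 3) -> Dict[str, str]:
    """Stage 4: replace each target with its assembled contig and repeat."""
    for _ in range(n_iter):
        bins = bin_reads(reads, targets)
        targets = {name: greedy_extend(seq, bins[name]) for name, seq in targets.items()}
    return targets


reads = ["TGCAAGGCTT", "GGCTTACCGA", "ACCGATG"]
result = iterate_targets(reads, {"t1": "ACGTTGCA"}, n_iter=3)
```

In this toy data only the first read shares a k-mer with the initial seed. The other two reads are recruited in later rounds, once the target has been replaced by a longer contig, which is the point of iterating.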

CircParser: a novel streamlined pipeline for circular RNA structure and host gene prediction in non-model organisms

PeerJ ◽

10.7717/peerj.8757 ◽

2020 ◽

Vol 8 ◽

pp. e8757 ◽

Cited By ~ 2

Author(s):

Artem Nedoluzhko ◽

Fedor Sharko ◽

Md. Golam Rbbani ◽

Anton Teslyuk ◽

Ioannis Konstantinidis ◽

...

Keyword(s):

Stress Responses ◽

High Throughput Sequencing ◽

De Novo ◽

Gene Prediction ◽

Host Gene ◽

Model Organisms ◽

Circular Rnas ◽

Prediction Tools ◽

A Genome ◽

Host Genes

Circular RNAs (circRNAs) are long noncoding RNAs that play a significant role in various biological processes, including embryonic development and stress responses. These regulatory molecules can modulate microRNA activity and are involved in different molecular pathways as indirect regulators of gene expression. Thousands of circRNAs have been described in diverse taxa owing to recent advances in high-throughput sequencing technologies, which have made a large volume of total RNA sequencing data publicly available. A number of de novo circRNA and host gene prediction tools are available to date, but their ability to accurately predict circRNA host genes is limited in the case of low-quality genome assemblies or annotations. Here, we present CircParser, a simple and fast Unix/Linux pipeline that uses the outputs from the most common in silico circular RNA prediction tools (CIRI, CIRI2, CircExplorer2, find_circ, and circFinder) to annotate circular RNAs, assigning presumptive host genes from local or public databases such as the National Center for Biotechnology Information (NCBI). This pipeline can also discriminate circular RNAs based on their structural components (exonic, intronic, exon-intronic or intergenic) using a genome annotation file.
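
One way to assign the four structural classes from an annotation is sketched below in Python. The decision rules are an illustrative assumption, not CircParser's documented logic: a circRNA is called exonic when both of its ends fall inside annotated exons, exon-intronic when it overlaps an exon but at least one end does not, intronic when it lies within a gene but touches no exon, and intergenic otherwise. Strand is ignored, and the data layout and names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(iv: Interval, pos: int) -> bool:
    """True when pos lies inside the closed interval."""
    return iv[0] <= pos <= iv[1]


def classify_circ(circ: Interval, genes: List[Dict]) -> str:
    """Classify one circRNA against genes given as
    {"span": (start, end), "exons": [(start, end), ...]} on the same chromosome."""
    hit_genes = [g for g in genes if overlaps(circ, g["span"])]
    if not hit_genes:
        return "intergenic"
    exons = [e for g in hit_genes for e in g["exons"]]
    if not any(overlaps(circ, e) for e in exons):
        return "intronic"
    start_in_exon = any(contains(e, circ[0]) for e in exons)
    end_in_exon = any(contains(e, circ[1]) for e in exons)
    return "exonic" if start_in_exon and end_in_exon else "exon-intronic"


genes = [{"span": (100, 1000), "exons": [(100, 200), (400, 500), (800, 1000)]}]
labels = [classify_circ(c, genes) for c in [(150, 450), (250, 350), (150, 600), (2000, 2100)]]
```

A real annotation file would first have to be parsed into per-chromosome gene and exon intervals; that step is left out here.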

The Oyster River Protocol: A Multi Assembler and Kmer Approach For de novo Transcriptome Assembly

10.1101/177253 ◽

2017 ◽

Cited By ~ 2

Author(s):

Matthew D. MacManes

Keyword(s):

High Throughput Sequencing ◽

Population Genomics ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome ◽

Link Type ◽

Biological Phenomena ◽

Complicated Process ◽

Downstream Analysis

Characterizing transcriptomes in non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. The Oyster River Protocol (ORP), described here, implements a standardized and benchmarked set of bioinformatic processes, resulting in an assembly with enhanced qualities over other standard assembly methods. Specifically, ORP-produced assemblies have higher Detonate and TransRate scores and mapping rates, largely because ORP leverages a multi-assembler and kmer assembly process, thereby bypassing the shortcomings of any one approach. These improvements are important, as previously unassembled transcripts are included in ORP assemblies, resulting in a significant enhancement of the power of downstream analysis. Further, as part of this study, I show that, above 30 million reads, assembly quality is unrelated to the number of reads generated. Code Availability: The version-controlled open-source code is available at https://github.com/macmanes-lab/Oyster_River_Protocol. Instructions for software installation and use, and other details are available at http://oyster-river-protocol.rtfd.org/.

The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly

PeerJ ◽

10.7717/peerj.5428 ◽

2018 ◽

Vol 6 ◽

pp. e5428 ◽

Cited By ~ 22

Author(s):

Matthew D. MacManes

Keyword(s):

High Throughput Sequencing ◽

Population Genomics ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome Assembly ◽

De Novo Transcriptome ◽

Biological Phenomena ◽

Complicated Process ◽

Downstream Analysis

Characterizing transcriptomes in non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary, and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. The Oyster River Protocol (ORP), described here, implements a standardized and benchmarked set of bioinformatic processes, resulting in an assembly with enhanced qualities over other standard assembly methods. Specifically, ORP-produced assemblies have higher Detonate and TransRate scores and mapping rates, largely because ORP leverages a multi-assembler and kmer assembly process, thereby bypassing the shortcomings of any one approach. These improvements are important, as previously unassembled transcripts are included in ORP assemblies, resulting in a significant enhancement of the power of downstream analysis. Further, as part of this study, I show that, above 30 million reads, assembly quality is unrelated to the number of reads generated. Code Availability: The version-controlled open-source code is available at https://github.com/macmanes-lab/Oyster_River_Protocol. Instructions for software installation and use, and other details are available at http://oyster-river-protocol.rtfd.org/.

Target-sequence Capture and High Throughput Sequencing Identify a De novo CARD14 Mutation in an Infant with Erythrodermic Pityriasis Rubra Pilaris

Acta Dermato Venereologica ◽

10.2340/00015555-2446 ◽

2016 ◽

Vol 96 (7) ◽

pp. 989-990 ◽

Cited By ~ 9

Author(s):

C Has ◽

A Schwieger-Briel ◽

N Schlipf ◽

I Hausser ◽

N Chmel ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Target Sequence ◽

Sequence Capture ◽

Pityriasis Rubra Pilaris

Establishing evidenced-based best practice for the de novo assembly and evaluation of transcriptomes from non-model organisms

10.1101/035642 ◽

2015 ◽

Cited By ~ 25

Author(s):

Matthew D MacManes

Keyword(s):

Best Practice ◽

High Throughput Sequencing ◽

Population Genomics ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

Single Individual ◽

Biological Phenomena ◽

Or Gene ◽

Evidenced Based

Characterizing transcriptomes in both model and non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. Each step may be accomplished in one of several different ways, using different software packages, each producing different results. This analytical complexity raises the question: which method(s) are optimal? Using reference and non-reference based evaluative methods, I propose a set of guidelines that aim to standardize and facilitate the process of transcriptome assembly. These recommendations include the generation of between 20 million and 40 million sequencing reads from a single individual where possible, error correction of reads, gentle quality trimming, assembly filtering using Transrate and/or gene expression, annotation using dammit, and appropriate reporting. These recommendations have been extensively benchmarked and applied to publicly available transcriptomes, resulting in improvements in both content and contiguity. To facilitate the implementation of the proposed standardized methods, I have released a set of version-controlled open-source code, The Oyster River Protocol for Transcriptome Assembly, available at http://oyster-river-protocol.rtfd.org/.

HP1 drives de novo 3D genome reorganization in early Drosophila embryos

Nature ◽

10.1038/s41586-021-03460-z ◽

2021 ◽

Author(s):

Fides Zenk ◽

Yinxiu Zhan ◽

Pavel Kos ◽

Eva Löser ◽

Nazerke Atinbayeva ◽

...

Keyword(s):

Genome Organization ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

De Novo ◽

Early Embryo ◽

Heterochromatin Protein ◽

Chromosome Conformation ◽

3D Genome ◽

Genome Reorganization

Fundamental features of 3D genome organization are established de novo in the early embryo, including clustering of pericentromeric regions, the folding of chromosome arms and the segregation of chromosomes into active (A-) and inactive (B-) compartments. However, the molecular mechanisms that drive de novo organization remain unknown [1,2]. Here, by combining chromosome conformation capture (Hi-C), chromatin immunoprecipitation with high-throughput sequencing (ChIP–seq), 3D DNA fluorescence in situ hybridization (3D DNA FISH) and polymer simulations, we show that heterochromatin protein 1a (HP1a) is essential for de novo 3D genome organization during Drosophila early development. The binding of HP1a at pericentromeric heterochromatin is required to establish clustering of pericentromeric regions. Moreover, HP1a binding within chromosome arms is responsible for overall chromosome folding and has an important role in the formation of B-compartment regions. However, depletion of HP1a does not affect the A-compartment, which suggests that a different molecular mechanism segregates active chromosome regions. Our work identifies HP1a as an epigenetic regulator that is involved in establishing the global structure of the genome in the early embryo.

Removing the bad apples: A simple bioinformatic method to improve loci‐recovery in de novo RADseq data for non‐model organisms

Methods in Ecology and Evolution ◽

10.1111/2041-210x.13562 ◽

2021 ◽

Cited By ~ 1

Author(s):

José Cerca ◽

Marius F. Maurstad ◽

Nicolas C. Rochette ◽

Angel G. Rivera‐Colón ◽

Niraj Rayamajhi ◽

...

Keyword(s):

De Novo ◽

Model Organisms

In Search of Species-Specific SNPs in a Non-Model Animal (European Bison (Bison bonasus))—Comparison of De Novo and Reference-Based Integrated Pipeline of STACKS Using Genotyping-by-Sequencing (GBS) Data

Animals ◽

10.3390/ani11082226 ◽

2021 ◽

Vol 11 (8) ◽

pp. 2226

Author(s):

Sazia Kunvar ◽

Sylwia Czarnomska ◽

Cino Pertoldi ◽

Małgorzata Tokarska

Keyword(s):

Reference Genome ◽

De Novo ◽

Bos Taurus ◽

Model Organism ◽

Genotyping By Sequencing ◽

Model Organisms ◽

European Bison ◽

Model Animal ◽

Pcr Duplicates ◽

Species Specific

The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without a reference genome) and the other is a reference pipeline (with a reference genome). Moreover, we used the reference pipeline with two different genomes, i.e., Bos taurus and European bison. GBS is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline than in the de novo pipeline. The lower number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. It has been confirmed that all SNPs obtained with the de novo/Bos taurus and Bos taurus reference pipelines were unique and not included in the 800 K BovineHD BeadChip.
