IterCluster: a barcode clustering algorithm for long fragment read analysis

Background: RNA-Seq analyses can benefit from performing both a genome-guided and a de novo assembly, in particular for species where the reference genome or the annotation is incomplete. However, tools for integrating an assembled transcriptome with a reference annotation are lacking. Findings: Necklace is a software pipeline that runs genome-guided and de novo assembly and combines the resulting transcriptomes with reference genome annotations. Necklace constructs a compact but comprehensive superTranscriptome out of the assembled and reference data. Reads are subsequently aligned and counted in preparation for differential expression testing. Conclusions: Necklace builds a transcriptome from a combination of assembled and annotated transcripts, which yields a more comprehensive transcriptome for the majority of organisms. In addition, RNA-seq data are mapped back to this newly created superTranscript reference to enable differential expression testing with standard methods. Necklace is available from https://github.com/Oshlack/necklace/wiki under GPL 3.0.

Draft sequencing and assembly of the genome of the world’s largest fish, the whale shark: Rhincodon typus Smith 1828

10.7287/peerj.preprints.837v1 ◽

2015 ◽

Cited By ~ 1

Author(s):

Timothy D Read ◽

Robert A Petit III ◽

Sandeep J Joseph ◽

Md T Alam ◽

Ryan Weil ◽

...

Keyword(s):

De Novo ◽

Nuclear Genome ◽

Receptor Protein ◽

Shark Species ◽

Whale Shark ◽

Conservation Units ◽

Data Set ◽

Rhincodon Typus ◽

A Genome ◽

Extant Species

The whale shark (Rhincodon typus) has by far the largest body size of any elasmobranch (shark or ray) species and is therefore also the largest extant species of the paraphyletic assemblage commonly referred to as “fishes”. As both a phenotypic extreme and a member of the group basal to the remaining gnathostomes, which includes all tetrapods and therefore also humans, its genome is of substantial comparative interest. Whale sharks are also listed as a “vulnerable” species on the International Union for Conservation of Nature (IUCN)'s Red List of threatened species and are of growing popularity as both a target of ecotourism and as a charismatic conservation ambassador for the pelagic ecosystem. A genome map for this species would aid in defining effective conservation units and understanding global population structure. We characterised the nuclear genome of the whale shark using next generation sequencing (454, Illumina) and de novo assembly and annotation methods, based on material collected from the Georgia Aquarium. The data set consisted of 878,654,233 reads, which assembled into 11,347,816 contigs and 3,606,038 scaffolds. The estimated genome size was 3.44 Gb. As expected, the proteome of the whale shark was most closely related to that of the only other cartilaginous fish with a complete genome, the holocephalan elephant shark. The whale shark contained a novel Toll-like receptor protein with sequence conservation to both the TLR4 and TLR13 proteins of mammals. The data are publicly available on a Galaxy bioinformatic server (http://whaleshark.georgiaaquarium.org). This represents the first shotgun elasmobranch genome and will aid studies of molecular systematics, biogeography, genetic differentiation, and conservation genetics in this and other shark species, as well as providing comparative data for studies of evolutionary biology and immunology across the jawed vertebrate lineages.

De novo Assembly of a Genome

Bioinformatics ◽

10.1142/9789813144750_0006 ◽

2017 ◽

pp. 107-125

Author(s):

Joel Zi-Bin Low ◽

Martti T. Tammi

Keyword(s):

De Novo Assembly ◽

De Novo ◽

A Genome

PyRAD: assembly of de novo RADseq loci for phylogenetic analyses

10.1101/001081 ◽

2013 ◽

Cited By ~ 1

Author(s):

Deren A. R. Eaton

Keyword(s):

Clustering Algorithm ◽

De Novo ◽

Phylogenetic Analyses ◽

Data Sets ◽

Data Generation ◽

Clustering Methods ◽

Alternative Source ◽

Data Set ◽

Phylogenetic Data ◽

Indel Variation

Restriction-site associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an alternative source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale. PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic data sets. It utilizes a wrapper around an alignment-clustering algorithm which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g., paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq data set that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through re-analysis of an empirical RADseq data set that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD utilizes parallel processing as well as an optional hierarchical clustering method, which allow it to rapidly assemble phylogenetic data sets with hundreds of sampled individuals.
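The idea of clustering reads under a similarity threshold that tolerates indels can be illustrated with a minimal sketch. This is not PyRAD's implementation: it assumes a plain Levenshtein edit distance as a stand-in for the alignment step, and the function names, the 0.85 threshold, and the toy reads are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Fraction of the longer sequence left unchanged by the edits."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def greedy_cluster(reads, threshold=0.85):
    """Assign each read to the first cluster whose seed is similar enough,
    otherwise start a new cluster with the read as its seed."""
    clusters = []  # list of (seed, members) pairs
    for read in reads:
        for seed, members in clusters:
            if similarity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


# Hypothetical toy reads: the second has a one-base deletion and the third
# a one-base substitution relative to the first; the fourth is unrelated.
toy_reads = ["ACGTACGTAC", "ACGTACGTA", "ACGTTCGTAC", "TTTTGGGGCC"]
for seed, members in greedy_cluster(toy_reads):
    print(seed, len(members))
```

Because the distance counts insertions and deletions as single edits, the read with the deletion joins the same cluster as its seed, whereas a comparison limited to mismatches at fixed positions would tend to split them.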

Genome Report: A draft genome of Alliaria petiolata (garlic mustard) as a model system for invasion genetics

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab339 ◽

2021 ◽

Author(s):

Nikolay Alabi ◽

Yihan Wu ◽

Oliver Bossdorf ◽

Loren H Rieseberg ◽

Robert I Colautti

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Genome Mapping ◽

Draft Genome ◽

Alliaria Petiolata ◽

Conservation Strategies ◽

Ecological Knowledge ◽

Garlic Mustard ◽

A Genome ◽

Invasion Genetics

The emerging field of invasion genetics examines the genetic causes and consequences of biological invasions, but few study systems are available that integrate deep ecological knowledge with genomic tools. Here we report on the de novo assembly and annotation of a genome for the biennial herb Alliaria petiolata (M. Bieb.) Cavara & Grande (Brassicaceae), which is widespread in Eurasia and invasive across much of temperate North America. Our goal was to sequence and annotate a genome to complement resources available from hundreds of published ecological studies, a global field survey, and hundreds of genetic lines maintained in Germany and Canada. We sequenced a genotype (EFCC3-3-20) collected from the native range near Venice, Italy, using paired-end and mate-pair libraries at ∼70× coverage. A de novo assembly resulted in a highly continuous draft genome (N50 = 121 Mb; L50 = 2) with 99.7% of the 1.1 Gb genome mapping to scaffolds of at least 50 Kb in length. A total of 64,770 predicted genes in the annotated genome include 99% of plant BUSCO genes and 98% of transcriptome reads. Consistent with previous reports of (auto)hexaploidy in western Europe, we found that almost one third of BUSCO genes (390/1440) mapped to two or more scaffolds despite <2% genome-wide average heterozygosity. The continuity and gene space quality of our draft assembly will enable molecular and functional genomic studies of A. petiolata to address questions relevant to invasion genetics and conservation strategies.
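The N50 and L50 statistics quoted in this abstract follow a standard definition: sort the scaffolds from longest to shortest and accumulate lengths until half of the total assembly is covered. N50 is the length of the scaffold that reaches that point, and L50 is the number of scaffolds needed. A minimal sketch follows; the function name and the toy lengths are hypothetical and are not taken from the A. petiolata assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold or contig lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # This scaffold pushes the running sum past half the assembly.
            return length, count


# Hypothetical toy assembly, lengths in arbitrary units.
toy_lengths = [50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 40 2
```

In the toy case the total is 150, so half is 75; the two longest scaffolds (50 + 40 = 90) are enough to pass it, giving N50 = 40 and L50 = 2.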

Draft sequencing and assembly of the genome of the world’s largest fish, the whale shark: Rhincodon typus Smith 1828

10.7287/peerj.preprints.837 ◽

2018 ◽

Author(s):

Timothy D Read ◽

Robert A Petit III ◽

Sandeep J Joseph ◽

Md T Alam ◽

Ryan Weil ◽

...

Keyword(s):

De Novo ◽

Nuclear Genome ◽

Receptor Protein ◽

Shark Species ◽

Whale Shark ◽

Conservation Units ◽

Data Set ◽

Rhincodon Typus ◽

A Genome ◽

Extant Species

Faculty Opinions recommendation of Efficient de novo assembly of single-cell bacterial genomes from short-read data sets.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13296960.14657061 ◽

2011 ◽

Author(s):

Steven Salzberg

Keyword(s):

Single Cell ◽

De Novo Assembly ◽

De Novo ◽

Data Sets ◽

Bacterial Genomes ◽

Short Read

Faculty Opinions recommendation of The sequence and de novo assembly of the giant panda genome.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.2367956.1997054 ◽

2010 ◽

Author(s):

Victoria Prince

Keyword(s):

De Novo Assembly ◽

Giant Panda ◽

De Novo

An Unbiased Predictive Model to Detect DNA Methylation Propensity of CpG Islands in the Human Genome

Current Bioinformatics ◽

10.2174/1574893615999200724145835 ◽

2020 ◽

Vol 15 ◽

Author(s):

Dicle Yalcin ◽

Hasan H. Otu

Keyword(s):

Model Building ◽

De Novo ◽

Cpg Islands ◽

Treatment Strategies ◽

Area Under The Curve ◽

Global Methylation ◽

Sequence Features ◽

A Genome ◽

Combined Features ◽

Epigenetic Repression

Background: Epigenetic repression mechanisms play an important role in gene regulation, specifically in cancer development. In many cases, a CpG island’s (CGI) susceptibility or resistance to methylation has been shown to depend on local DNA sequence features. Objective: To develop unbiased machine learning models, individually and combined for different biological features, that predict the methylation propensity of a CGI. Methods: We developed our model consisting of CGI sequence features on a dataset of 75 sequences (28 prone, 47 resistant) representing a genome-wide methylation structure. We tested our model on two independent datasets, one chromosome-specific (132 sequences) and one disease-specific (70 sequences). Results: We provided improvements in prediction accuracy over previous models. Our results indicate that combined features better predict the methylation propensity of a CGI (area under the curve (AUC) ~0.81). Our global methylation classifier performs well on independent datasets, reaching an AUC of ~0.82 for the complete model and an AUC of ~0.88 for the model using select sequences that better represent their classes in the training set. We report certain de novo motifs and transcription factor binding site (TFBS) motifs that are consistently better in separating prone and resistant CGIs. Conclusion: Predictive models for the methylation propensity of CGIs lead to a better understanding of disease mechanisms and can be used to classify genes based on their tendency to contain methylation-prone CGIs, which may lead to preventative treatment strategies. MATLAB and Python™ scripts used for model building, prediction, and downstream analyses are available at https://github.com/dicleyalcin/methylProp_predictor.
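The AUC values reported above can be read as the probability that a randomly chosen methylation-prone CGI receives a higher score than a randomly chosen resistant one. A minimal sketch of that pairwise definition follows. It is not the authors' code: the function name, the labels (1 = prone, 0 = resistant), and the scores are hypothetical illustration values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives
    and negatives; tied scores count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for six CGIs.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(roc_auc(toy_labels, toy_scores), 3))  # 0.889
```

Here eight of the nine prone/resistant pairs are ranked correctly, so the AUC is 8/9. The quadratic loop is fine for datasets of the size described in the abstract; a rank-based formula would be preferable for large inputs.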