Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software

Mapping Intimacies ◽

10.1101/706531 ◽

2019 ◽

Author(s):

Melanie E. F. LaCava ◽

Ellen O. Aikens ◽

Libby C. Megna ◽

Gregg Randolph ◽

Charley Hubbard ◽

...

Keyword(s):

Dna Sequences ◽

De Novo ◽

Homo Sapiens ◽

Simulated Data ◽

Model Organisms ◽

Reduced Representation ◽

Insertion And Deletion ◽

Large Sets ◽

Sequencing Library ◽

Software Programs

Abstract Advances in DNA sequencing have made it feasible to gather genomic data for non-model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD-HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated datasets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD-HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
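
To make the idea of a double-digest library concrete, the following is a minimal sketch of an in silico double digest. It is not the simulation procedure used in the study above, which the abstract does not describe in detail. The function name `double_digest`, its parameters, and the simplification of placing each cut at the start of the recognition site are all assumptions made for illustration.

```python
import re


def double_digest(seq, site_a, site_b, min_len=0, max_len=10**9):
    """Return fragments bounded by one site_a cut and one site_b cut.

    Hypothetical helper: real enzymes cut at a fixed offset inside the
    recognition site; here the cut is placed at the site's first base.
    """
    # Collect (position, enzyme label) for every recognition-site match.
    cuts = [(m.start(), "A") for m in re.finditer(re.escape(site_a), seq)]
    cuts += [(m.start(), "B") for m in re.finditer(re.escape(site_b), seq)]
    cuts.sort()

    fragments = []
    # Walk over adjacent cut sites; keep fragments whose two ends come
    # from different enzymes and whose length passes size selection.
    for (start, enz1), (end, enz2) in zip(cuts, cuts[1:]):
        if enz1 != enz2 and min_len <= end - start <= max_len:
            fragments.append(seq[start:end])
    return fragments


if __name__ == "__main__":
    toy = "TTGAATTCAAAACCGGTTTTGAATTCGG"
    # GAATTC and CCGG are the EcoRI and MspI recognition sequences.
    print(double_digest(toy, "GAATTC", "CCGG"))
```

Only fragments with a different enzyme at each end are kept, and the optional length bounds mimic size selection; this is why such libraries cover a small, repeatable subset of the genome.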

GBStools: A Unified Approach for Reduced Representation Sequencing and Genotyping

10.1101/030494 ◽

2015 ◽

Author(s):

Thomas F Cooke ◽

Muh-Ching Yee ◽

Marina Muzzio ◽

Alexandra Sockell ◽

Ryan Bell ◽

...

Keyword(s):

Restriction Site ◽

Variant Calling ◽

Simulated Data ◽

Error Rates ◽

Genomic Diversity ◽

Model Organisms ◽

Data Sets ◽

Reduced Representation ◽

Restriction Site Polymorphisms ◽

Reduced Representation Sequencing

Reduced representation sequencing methods such as genotyping-by-sequencing (GBS) enable low-cost measurement of genetic variation without the need for a reference genome assembly. These methods are widely used in genetic mapping and population genetics studies, especially with non-model organisms. Variant calling error rates, however, are higher in GBS than in standard sequencing, in particular due to restriction site polymorphisms, and few computational tools exist that specifically model and correct these errors. We developed a statistical method to remove errors caused by restriction site polymorphisms, implemented in the software package GBStools. We evaluated it in several simulated data sets, varying in number of samples, mean coverage and population mutation rate, and in two empirical human data sets (N = 8 and N = 63 samples). In our simulations, GBStools improved genotype accuracy more than commonly used filters such as Hardy-Weinberg equilibrium p-values. GBStools is most effective at removing genotype errors in data sets over 100 samples when coverage is 40X or higher, and the improvement is most pronounced in species with high genomic diversity. We also demonstrate the utility of GBS and GBStools for human population genetic inference in Argentine populations and reveal widely varying individual ancestry proportions and an excess of singletons, consistent with recent population growth.
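
The Hardy-Weinberg filter mentioned above as a common baseline can be sketched as follows. This is a generic one-degree-of-freedom chi-square test and not the GBStools method; the function name `hwe_chi2_pvalue` and its interface are assumptions made for illustration.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic site.

    Takes observed genotype counts (AA, AB, BB) and returns
    (chi2 statistic, p-value) with one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")

    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Skip classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_chi2_pvalue(25, 50, 25))  # genotypes at HWE proportions
    print(hwe_chi2_pvalue(50, 0, 50))   # complete heterozygote deficit
```

A site with a strong heterozygote deficit, as can arise when a restriction site polymorphism causes allele dropout, yields a very small p-value and would be removed by such a filter.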

MobiSeq: De Novo SNP discovery in model and non-model species through sequencing the flanking region of transposable elements

10.1101/349290 ◽

2018 ◽

Author(s):

Alba Rey-Iglesia ◽

Shyam Gopalakrishan ◽

Christian Carøe ◽

David E. Alquezar-Planas ◽

Anne Ahlmann Nielsen ◽

...

Keyword(s):

Transposable Elements ◽

Dna Sequences ◽

Population Genomics ◽

De Novo ◽

Model Organisms ◽

Snp Discovery ◽

High Molecular Weight Dna ◽

A Genome ◽

Wide Range ◽

Flanking Region

Abstract In recent years, the availability of reduced representation library (RRL) methods has catalysed an expansion of genome-scale studies to characterize both model and non-model organisms. Most of these methods rely on the use of restriction enzymes to obtain DNA sequences at a genome-wide level. These approaches have been widely used to sequence thousands of markers across individuals for many organisms at a reasonable cost, revolutionizing the field of population genomics. However, there are still some limitations associated with these methods, in particular, the high molecular weight DNA required as starting material, the reduced number of common loci among investigated samples, and the short length of the sequenced site-associated DNA. Here, we present MobiSeq, an RRL protocol exploiting simple laboratory techniques, that generates genomic data based on PCR targeted-enrichment of transposable elements and the sequencing of the associated flanking region. We validate its performance across 103 DNA extracts derived from three mammalian species: grey wolf (Canis lupus), red deer complex (Cervus sp.), and brown rat (Rattus norvegicus). MobiSeq enables the sequencing of hundreds of thousands of loci across the genome, and performs SNP discovery with relatively low rates of clonality. Given the ease and flexibility of the MobiSeq protocol, the method has the potential to be implemented for marker discovery and population genomics across a wide range of organisms – enabling the exploration of diverse evolutionary and conservation questions.

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

10.21203/rs.3.rs-777702/v1 ◽

2021 ◽

Author(s):

Angela Brooks ◽

Francisco Pardo-Palacios ◽

Fairlie Reese ◽

Silvia Carbonell-Sala ◽

Mark Diekhans ◽

...

Keyword(s):

De Novo ◽

Simulated Data ◽

Model Organisms ◽

Rna Seq ◽

Sequencing Platform ◽

Systematic Assessment ◽

Sequencing Technologies ◽

Long Read ◽

Community Effort ◽

Transcriptome Analyses

Abstract With increased usage of long-read sequencing technologies to perform transcriptome analyses, there is a growing need to evaluate different methodologies, including library preparation, sequencing platform, and computational analysis tools. Here, we report the study design of a community effort called the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, whose goals are to characterize the strengths and remaining challenges in using long-read approaches to identify and quantify the transcriptomes of both model and non-model organisms. The LRGASP organizers have generated cDNA and direct RNA datasets in human, mouse, and manatee samples using different protocols followed by sequencing on Illumina, Pacific Biosciences, and Oxford Nanopore Technologies platforms. Participants will use the provided data to submit predictions for three challenges: transcript isoform detection with a high-quality genome, transcript isoform quantification, and de novo transcript isoform identification. Evaluators from different institutions will determine which pipelines have the highest accuracy for a variety of metrics using benchmarks that include spike-in synthetic transcripts, simulated data, and a set of undisclosed, manually curated transcripts by GENCODE. We also describe plans for experimental validation of predictions that are platform-specific and computational tool-specific. We believe that a community effort to evaluate long-read RNA-seq methods will help move the field toward a better consensus on the best approaches to use for transcriptome analyses.

Removing the bad apples: A simple bioinformatic method to improve loci‐recovery in de novo RADseq data for non‐model organisms

Methods in Ecology and Evolution ◽

10.1111/2041-210x.13562 ◽

2021 ◽

Cited By ~ 1

Author(s):

José Cerca ◽

Marius F. Maurstad ◽

Nicolas C. Rochette ◽

Angel G. Rivera‐Colón ◽

Niraj Rayamajhi ◽

...

Keyword(s):

De Novo ◽

Model Organisms

In Search of Species-Specific SNPs in a Non-Model Animal (European Bison (Bison bonasus))—Comparison of De Novo and Reference-Based Integrated Pipeline of STACKS Using Genotyping-by-Sequencing (GBS) Data

Animals ◽

10.3390/ani11082226 ◽

2021 ◽

Vol 11 (8) ◽

pp. 2226

Author(s):

Sazia Kunvar ◽

Sylwia Czarnomska ◽

Cino Pertoldi ◽

Małgorzata Tokarska

Keyword(s):

Reference Genome ◽

De Novo ◽

Bos Taurus ◽

Model Organism ◽

Genotyping By Sequencing ◽

Model Organisms ◽

European Bison ◽

Model Animal ◽

Pcr Duplicates ◽

Species Specific

The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without reference genome) and the other is a reference pipeline (with reference genome). Moreover, we used a reference pipeline with two different genomes, i.e., Bos taurus and European bison. Genotyping by sequencing (GBS) is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline in comparison to the de novo pipeline. The decreased number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. It has been confirmed that all SNPs obtained from the de novo/Bos taurus and Bos taurus reference pipelines were unique and not included in the 800 K BovineHD BeadChip.

The brain transcriptome of the wolf spider, Schizocosa ocreata

BMC Research Notes ◽

10.1186/s13104-021-05648-y ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Daniel Stribling ◽

Peter L. Chang ◽

Justin E. Dalton ◽

Christopher A. Conow ◽

Malcolm Rosenthal ◽

...

Keyword(s):

Gene Expression ◽

De Novo ◽

Transcriptome Assembly ◽

Model Organisms ◽

De Novo Transcriptome Assembly ◽

De Novo Transcriptome ◽

Wolf Spiders ◽

Schizocosa Ocreata ◽

Genomic Studies ◽

The Brain

Abstract Objectives Arachnids have fascinating and unique biology, particularly for questions on sex differences and behavior, creating the potential for development of powerful emerging models in this group. Recent advances in genomic techniques have paved the way for a significant increase in the breadth of genomic studies in non-model organisms. One growing area of research is comparative transcriptomics. When phylogenetic relationships to model organisms are known, comparative genomic studies provide context for analysis of homologous genes and pathways. The goal of this study was to lay the groundwork for comparative transcriptomics of sex differences in the brain of wolf spiders, a non-model organism of the phylum Euarthropoda, by generating transcriptomes and analyzing gene expression. Data description To examine sex-differential gene expression, short read transcript sequencing and de novo transcriptome assembly were performed. Messenger RNA was isolated from brain tissue of male and female subadult and mature wolf spiders (Schizocosa ocreata). The raw data consist of sequences for the two different life stages in each sex. Computational analyses on these data include de novo transcriptome assembly and differential expression analyses. Sample-specific and combined transcriptomes, gene annotations, and differential expression results are described in this data note and are available from publicly-available databases.

Comparative Analysis of SNP Discovery and Genotyping in Fagus sylvatica L. and Quercus robur L. Using RADseq, GBS, and ddRAD Methods

Forests ◽

10.3390/f12020222 ◽

2021 ◽

Vol 12 (2) ◽

pp. 222

Author(s):

Bartosz Ulaszewski ◽

Joanna Meger ◽

Jaroslaw Burczyk

Keyword(s):

Population Genomics ◽

De Novo ◽

Genetic Studies ◽

Genomic Libraries ◽

Reduced Representation ◽

Large Numbers ◽

Broadleaved Tree Species ◽

Fagus Sylvatica L ◽

Reference Genomes ◽

Future Population

Next-generation sequencing of reduced representation genomic libraries (RRL) is capable of providing large numbers of genetic markers for population genetic studies at relatively low costs. However, one major concern of these types of markers is the precision of genotyping, which is related to the common problem of missing data, which appears to be particularly important in association and genomic selection studies. We evaluated three RRL approaches (GBS, RADseq, ddRAD) and different SNP identification methods (de novo or based on a reference genome) to find the best solutions for future population genomics studies in two economically and ecologically important broadleaved tree species, namely F. sylvatica and Q. robur. We found that the use of the ddRAD method coupled with SNP calling based on reference genomes provided the largest numbers of markers (28 k and 36 k for beech and oak, respectively), given standard filtering criteria. Using technical replicates of samples, we demonstrated that more than 80% of SNP loci should be considered as reliable markers in GBS and ddRAD, but not in RADseq data. According to the reference genomes’ annotations, more than 30% of the identified ddRAD loci appeared to be related to genes. Our findings provide solid support for using ddRAD-based SNPs for future population genomics studies in beech and oak.

Genome-wide methylation sequencing identifies progression-related epigenetic drivers in myelodysplastic syndromes

Cell Death and Disease ◽

10.1038/s41419-020-03213-2 ◽

2020 ◽

Vol 11 (11) ◽

Author(s):

Jing-dong Zhou ◽

Ting-juan Zhang ◽

Zi-jun Xu ◽

Zhao-qun Deng ◽

Yu Gu ◽

...

Keyword(s):

Cancer Progression ◽

Myelodysplastic Syndromes ◽

Bisulfite Sequencing ◽

De Novo ◽

Dna Hypermethylation ◽

Reduced Representation ◽

Targeted Bisulfite Sequencing ◽

Specific Pcr ◽

Genome Wide ◽

Potential Biomarker

Abstract The potential mechanism of myelodysplastic syndromes (MDS) progressing to acute myeloid leukemia (AML) remains poorly elucidated. It has been proved that epigenetic alterations play crucial roles in the pathogenesis of cancer progression including MDS. However, few studies have explored the whole-genome methylation alterations during MDS progression. Reduced representation bisulfite sequencing was conducted in four paired MDS/secondary AML (MDS/sAML) patients to explore the underlying methylation-associated epigenetic drivers in MDS progression. In four paired MDS/sAML patients, cases at sAML stage exhibited significantly increased methylation level as compared with the matched MDS stage. A total of 1090 differentially methylated fragments (DMFs) (441 hypermethylated and 649 hypomethylated) were identified as involved in MDS pathogenesis, whereas 103 DMFs (96 hypermethylated and 7 hypomethylated) were involved in MDS progression. Targeted bisulfite sequencing further identified that aberrant GFRA1, IRX1, NPY, and ZNF300 methylation were frequent events in an additional group of de novo MDS and AML patients, of which only ZNF300 methylation was associated with ZNF300 expression. Subsequently, ZNF300 hypermethylation in larger cohorts of de novo MDS and AML patients was confirmed by real-time quantitative methylation-specific PCR. It was illustrated that ZNF300 methylation could act as a potential biomarker for the diagnosis and prognosis in MDS and AML patients. Functional experiments demonstrated the anti-proliferative and pro-apoptotic role of ZNF300 overexpression in MDS-derived AML cell-line SKM-1. Collectively, genome-wide DNA hypermethylation was a frequent event during MDS progression. Among these changes, ZNF300 methylation, a regulator of ZNF300 expression, acted as an epigenetic driver in MDS progression. These findings provided a theoretical basis for the usage of demethylation drugs in MDS patients against disease progression.

Genome sequence, transcriptome, and annotation of rodent malaria parasite Plasmodium yoelii nigeriensis N67

BMC Genomics ◽

10.1186/s12864-021-07555-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Cui Zhang ◽

Cihan Oguz ◽

Sue Huse ◽

Lu Xia ◽

Jian Wu ◽

...

Keyword(s):

Dna Sequences ◽

Malaria Parasite ◽

Vaccine Development ◽

De Novo ◽

Gene Families ◽

Plasmodium Yoelii ◽

Control Measures ◽

Effective Control ◽

Malaria Parasites ◽

Rodent Malaria

Abstract Background Rodent malaria parasites are important models for studying host-malaria parasite interactions such as host immune response, mechanisms of parasite evasion of host killing, and vaccine development. One of the rodent malaria parasites is Plasmodium yoelii, and multiple P. yoelii strains or subspecies that cause different disease phenotypes have been widely employed in various studies. The genomes and transcriptomes of several P. yoelii strains have been analyzed and annotated, including the lethal strains of P. y. yoelii YM (or 17XL) and non-lethal strains of P. y. yoelii 17XNL/17X. Genomic DNA sequences and cDNA reads from another subspecies P. y. nigeriensis N67 have been reported for studies of genetic polymorphisms and parasite response to drugs, but its genome has not been assembled and annotated. Results We performed genome sequencing of the N67 parasite using the PacBio long-read sequencing technology, de novo assembled its genome and transcriptome, and predicted 5383 genes with high overall annotation quality. Comparison of the annotated genome of the N67 parasite with those of YM and 17X parasites revealed a set of genes with N67-specific orthology, expansion of gene families, particularly the homologs of the Plasmodium chabaudi erythrocyte membrane antigen, large numbers of SNPs and indels, and proteins predicted to interact with host immune responses based on their functional domains. Conclusions The genomes of N67 and 17X parasites are highly diverse, having approximately one polymorphic site per 50 base pairs of DNA. The annotated N67 genome and transcriptome provide searchable databases for fast retrieval of genes and proteins, which will greatly facilitate our efforts in studying the parasite biology and gene function and in developing effective control measures against malaria.

Myosinome: A Database of Myosins from Select Eukaryotic Genomes to Facilitate Analysis of Sequence-Structure-Function Relationships

Bioinformatics and Biology Insights ◽

10.4137/bbi.s9902 ◽

2012 ◽

Vol 6 ◽

pp. BBI.S9902 ◽

Cited By ~ 3

Author(s):

Divya P. Syamaladevi ◽

Margaret S Sunitha ◽

S. Kalaimathy ◽

Chandrashekar C. Reddy ◽

Mohammed Iftekhar ◽

...

Keyword(s):

Conformational Changes ◽

Atp Hydrolysis ◽

Homo Sapiens ◽

Relevant Literature ◽

Myosin Ii ◽

Coiled Coil ◽

Structural Features ◽

Model Organisms ◽

Congenital Diseases ◽

C Elegans

Myosins are one of the largest protein superfamilies with 24 classes. They have conserved structural features and catalytic domains yet show huge variation at different domains resulting in a variety of functions. Myosins are molecules driving various kinds of cellular processes and motility up to the level of organisms. These are ATPases that utilize the chemical energy released by ATP hydrolysis to bring about conformational changes leading to a motor function. Myosins are important as they are involved in almost all cellular activities ranging from cell division to transcriptional regulation. They are crucial due to their involvement in many congenital diseases symptomatized by muscular malfunctions, cardiac diseases, deafness, neural and immunological dysfunction, and so on, many of which lead to death at an early age. We present Myosinome, a database of selected myosin classes (myosin II, V, and VI) from five model organisms. This knowledge base provides the sequences, phylogenetic clustering, domain architectures of myosins and molecular models, structural analyses, and relevant literature of their coiled-coil domains. In the current version of Myosinome, information about 71 myosin sequences belonging to three myosin classes (myosin II, V, and VI) in five model organisms (Homo sapiens, Mus musculus, D. melanogaster, C. elegans and S. cerevisiae) identified using bioinformatics surveys is presented, and several of them are yet to be functionally characterized. As these proteins are involved in congenital diseases, such a database would be useful in short-listing candidates for gene therapy and drug development. The database can be accessed from http://caps.ncbs.res.in/myosinome.
