Massive influence of DNA isolation and library preparation approaches on palaeogenomic sequencing data

Mapping Intimacies ◽

10.1101/075911 ◽

2016 ◽

Cited By ~ 8

Author(s):

Axel Barlow ◽

Gloria G. Fortes ◽

Love Dalén ◽

Ron Pinhasi ◽

Boris Gasparyan ◽

...

Keyword(s):

Sample Preparation ◽

Dna Isolation ◽

Length Distribution ◽

Gc Content ◽

Nucleotide Composition ◽

Library Preparation ◽

Sequencing Data ◽

Dna Yield ◽

Specific Effects ◽

Laboratory Procedures

ABSTRACT The ability to access genomic information from ancient samples has provided many important biological insights. Generating such palaeogenomic data requires specialised methodologies, and a variety of procedures for all stages of sample preparation have been proposed. However, the specific effects and biases introduced by alternative laboratory procedures are insufficiently understood. Here, we investigate the effects of three DNA isolation and two library preparation protocols on palaeogenomic data obtained from four Pleistocene subfossil bones. We find that alternative methodologies can significantly and substantially affect total DNA yield, the mean length and length distribution of recovered fragments, nucleotide composition, and the total amount of usable data generated. Furthermore, we also detect significant interaction effects between these stages of sample preparation on many of these factors. Effects and biases introduced in the laboratory can be sufficient to confound estimates of DNA degradation, sample authenticity and genomic GC content, and likely also estimates of genetic diversity and population structure. Future palaeogenomic studies need to carefully consider the effects of laboratory procedures during both experimental design and data analysis, particularly when studies involve multiple datasets generated using a mixture of methodologies.


Facile, High Quality Sequencing of Bacterial Genomes from Small Amounts of DNA

International Journal of Genomics ◽

10.1155/2014/434575 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8

Author(s):

Momchilo Vuyisich ◽

Ayesha Arefin ◽

Karen Davenport ◽

Shihai Feng ◽

Cheryl Gleasner ◽

...

Keyword(s):

Genomic Dna ◽

De Novo ◽

Gc Content ◽

Library Preparation ◽

Sequencing Data ◽

Bacterial Genomes ◽

Dna Amount ◽

High Quality ◽

Preparation Methods

Sequencing bacterial genomes has traditionally required large amounts of genomic DNA (~1 μg). There have been few studies to determine the effects of the input DNA amount or library preparation method on the quality of sequencing data. Several new commercially available library preparation methods enable shotgun sequencing from as little as 1 ng of input DNA. In this study, we evaluated the NEBNext Ultra library preparation reagents for sequencing bacterial genomes. We have evaluated the utility of NEBNext Ultra for resequencing and de novo assembly of four bacterial genomes and compared its performance with the TruSeq library preparation kit. The NEBNext Ultra reagents enable high quality resequencing and de novo assembly of a variety of bacterial genomes when using 100 ng of input genomic DNA. For the two most challenging genomes (Burkholderia spp.), which have the highest GC content and are the longest, we also show that the quality of both resequencing and de novo assembly is not decreased when only 10 ng of input genomic DNA is used.


Pacific Biosciences long reads-based genome sequencing data from a widespread bee fungal parasite, Nosema ceranae

10.1101/2020.04.05.026849 ◽

2020 ◽

Author(s):

Huazhi Chen ◽

Wende Zhang ◽

Yu Du ◽

Xiaoxue Fan ◽

Jie Wang ◽

...

Keyword(s):

Genome Sequencing ◽

Average Length ◽

Length Distribution ◽

Gc Content ◽

Nosema Ceranae ◽

Sequencing Data ◽

Smrt Sequencing ◽

Fungal Parasite ◽

Total Length ◽

Pacific Biosciences

ABSTRACT Nosema ceranae is a widespread fungal parasite that infects both adult honeybees and honeybee larvae, leading to microsporidiosis, which seriously affects bee health and the apicultural industry. In this article, genome sequencing of clean spores of N. ceranae was conducted using third-generation Pacific Biosciences (PacBio) single molecule real time (SMRT) sequencing technology. In total, 152671 subreads were obtained after quality control of raw reads from PacBio SMRT sequencing, with an N50 and average length of 14422 bp and 11310 bp, respectively. Additionally, the length distribution of subreads ranged from 10000 bp to more than 50000 bp. Nineteen scaffolds with a total length of 7354221 bp were assembled, and the N50, N90 and maximum scaffold length were 728543 bp, 198795 bp and 1917792 bp, respectively. The GC content was 25.97%. Furthermore, by integration of genes predicted from de novo and homology-based methods, 3112 N. ceranae genes were finally assembled, with a total length of 2730179 bp and mean length of 877.31 bp. In addition, the total length and mean length of exons were 2657637 bp and 854 bp, respectively; and the total length and mean length of introns were 72542 bp and 23.31 bp, respectively. The genome sequencing data documented here will give deep insights into the molecular biology of N. ceranae, facilitate exploration of genes and pathways associated with toxin factors and infection-related factors, and benefit research on comparative genomics and phylogenetic diversity of Nosema species.
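The N50 and N90 figures quoted above follow a common convention: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of all bases. The sketch below illustrates that convention only; the function name `nx` and the toy lengths are hypothetical and are not taken from the study, whose exact tooling is not stated.

```python
def nx(lengths, percent=50):
    """Return the Nx statistic (N50 by default) for a list of sequence lengths.

    Nx is the length L such that sequences of length >= L together
    contain at least `percent` percent of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length


# Toy example with made-up scaffold lengths (not data from the abstract).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = nx(toy_lengths, 50)
toy_n90 = nx(toy_lengths, 90)
```

With the toy lengths, the total is 20 bases, so N50 is reached once the running sum passes 10 and N90 once it reaches 18.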


Long fragments achieve lower base quality in Illumina paired-end sequencing

10.1101/397158 ◽

2018 ◽

Author(s):

Ge Tan ◽

Lennart Opitz ◽

Ralph Schlapbach ◽

Hubert Rehrauer

Keyword(s):

Fragment Length ◽

Length Distribution ◽

Error Rates ◽

Average Error ◽

Library Preparation ◽

Sequencing Data ◽

Average Error Rate ◽

Lower Base ◽

Illumina Data ◽

Paired End Sequencing

Abstract Illumina’s technology provides high quality reads of DNA fragments with error rates below 1/1000 per base. Runs typically generate millions of reads, the vast majority of which also have an average error rate below 1/1000. However, some paired-end sequencing data show the presence of a subpopulation of reads where the second read has lower average qualities. We show that the fragment length is a major driver of increased error rates in the R2 reads. Fragments above 500 nt tend to yield lower base qualities and higher error rates than shorter fragments. We demonstrate the fragment length dependency of the R2 read qualities using publicly available Illumina data generated by various library protocols, in different labs and using different sequencer models. Our finding extends the understanding of Illumina read quality and has implications for error models for Illumina reads. It also sheds light on the importance of fragmentation during library preparation and the resulting fragment length distribution.
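For readers who want to relate base qualities to the 1/1000 error rate mentioned above, the following is a minimal sketch. It assumes the standard Phred relation, error probability p = 10^(-Q/10), and the Phred+33 ASCII encoding used in most current FASTQ files; neither is stated in the abstract, and the function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def mean_error_rate(quality_string, offset=33):
    """Average per-base error probability of one read.

    `quality_string` is the FASTQ quality line; `offset` is the ASCII
    offset of the encoding (33 is assumed here).
    """
    if not quality_string:
        raise ValueError("quality string must not be empty")
    probabilities = [phred_to_error(ord(ch) - offset) for ch in quality_string]
    return sum(probabilities) / len(probabilities)


# '?' encodes Q30 under Phred+33, which corresponds to an error rate of 1/1000.
q30_read = "????"
q30_error = mean_error_rate(q30_read)
```

The error probabilities are averaged rather than the quality scores, because a mean of Q values understates the error rate when qualities vary within a read.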


Adapterama IV: Sequence Capture of Dual-digest RADseq Libraries with Identifiable Duplicates (RADcap)

10.1101/044651 ◽

2016 ◽

Cited By ~ 2

Author(s):

Sandra L. Hoffberg ◽

Troy J. Kieran ◽

Julian M. Catchen ◽

Alison Devault ◽

Brant C. Faircloth ◽

...

Keyword(s):

Sample Preparation ◽

Restriction Site ◽

Preparation Method ◽

Low Cost ◽

Minimal Cost ◽

Library Preparation ◽

Sequencing Data ◽

Sequence Capture ◽

Pcr Duplicates ◽

Library Preparation Method

Abstract Molecular ecologists seek to genotype hundreds to thousands of loci from hundreds to thousands of individuals at minimal cost per sample. Current methods such as restriction site associated DNA sequencing (RADseq) and sequence capture are constrained by costs associated with inefficient use of sequencing data and sample preparation, respectively. Here, we demonstrate RADcap, an approach that combines the major benefits of RADseq (low cost with specific start positions) with those of sequence capture (repeatable sequencing of specific loci) to significantly increase efficiency and reduce costs relative to current approaches. The RADcap approach uses a new version of dual-digest RADseq (3RAD) to identify candidate SNP loci for capture bait design, and subsequently uses custom sequence capture baits to consistently enrich candidate SNP loci across many individuals. We combined this approach with a new library preparation method for identifying and removing PCR duplicates from 3RAD libraries, which allows researchers to process RADseq data using traditional pipelines, and we tested the RADcap method by genotyping sets of 96 to 384 Wisteria plants. Our results demonstrate that our RADcap method: (1) can methodologically reduce (to <5%) and computationally remove PCR duplicate reads from data; (2) achieves 80-90% reads-on-target in 11 of 12 enrichments; (3) returns consistent coverage (≥4x) across >90% of individuals at up to 99.9% of the targeted loci; (4) produces consistently high occupancy matrices of genotypes across hundreds of individuals; and (5) is inexpensive, with reagent and sequencing costs totaling <$6/sample and adapter and primer costs of only a few hundred dollars.


The Evolution of Isochores: Evidence From SNP Frequency Distributions

Genetics ◽

10.1093/genetics/162.4.1805 ◽

2002 ◽

Vol 162 (4) ◽

pp. 1805-1810 ◽

Cited By ~ 1

Author(s):

Martin J Lercher ◽

Nick G C Smith ◽

Adam Eyre-Walker ◽

Laurence D Hurst

Keyword(s):

Population Genetics ◽

Large Scale ◽

Gc Content ◽

Nucleotide Composition ◽

Compositional Variation ◽

Mutation Bias ◽

Single Nucleotide ◽

Frequency Distributions ◽

Noncoding Regions ◽

Standard Population

Abstract The large-scale systematic variation in nucleotide composition along mammalian and avian genomes has been a focus of the debate between neutralist and selectionist views of molecular evolution. Here we test whether the compositional variation is due to mutation bias using two new tests, which do not assume compositional equilibrium. In the first test we assume a standard population genetics model, but in the second we make no assumptions about the underlying population genetics. We apply the tests to single-nucleotide polymorphism data from noncoding regions of the human genome. Both models of neutral mutation bias fit the frequency distributions of SNPs segregating in low- and medium-GC-content regions of the genome adequately, although both suggest compositional nonequilibrium. However, neither model fits the frequency distribution of SNPs from the high-GC-content regions. In contrast, a simple population genetics model that incorporates selection or biased gene conversion cannot be rejected. The results suggest that mutation biases are not solely responsible for the compositional biases found in noncoding regions.


Evidence of gene nucleotide composition favoring replication and growth in a fastidious plant pathogen

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab076 ◽

2021 ◽

Author(s):

Andreina I Castillo ◽

Rodrigo P P Almeida

Keyword(s):

Plant Pathogen ◽

Geographic Origin ◽

Gc Content ◽

Nucleotide Composition ◽

Plant Host ◽

Mixed Effect ◽

Genomic Changes ◽

Evolutionary Forces ◽

Multiple Variables ◽

Vector Borne

Abstract Nucleotide composition (GC content) varies across bacterial species, genome regions, and specific genes. In Xylella fastidiosa, a vector-borne fastidious plant pathogen infecting multiple crops, GC content ranges between ∼51% and 52%; however, these values were gathered using limited genomic data. We evaluated GC content variations across X. fastidiosa subspecies fastidiosa (N = 194), subsp. pauca (N = 107), and subsp. multiplex (N = 39). Genomes were classified based on plant host and geographic origin; individual genes within each genome were classified based on gene function, strand, length, ortholog group, Core vs. Accessory, and Recombinant vs. Non-recombinant. GC content was calculated for each gene within each evaluated genome. The effects of genome and gene level variables were evaluated with a mixed effect ANOVA, and the marginal-GC content was calculated for each gene. Also, the correlation between gene-specific GC content and both natural selection (dN/dS) and recombination/mutation (r/m) was estimated. Our analyses show that intra-genomic changes in nucleotide composition in X. fastidiosa are small and influenced by multiple variables. Higher AT-richness is observed in genes involved in replication and translation, and in genes on the leading strand. In addition, we observed a negative correlation between high-AT and dN/dS in subsp. pauca. The relationship between recombination and GC content varied between core and accessory genes. We hypothesize that distinct evolutionary forces and energetic constraints both drive and limit these small variations in nucleotide composition.
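The per-gene GC content described above is, in its simplest form, the percentage of G and C bases among the unambiguous bases of a sequence. The following is a minimal sketch of that calculation. The function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for illustration, not details reported by the authors.

```python
def gc_content(sequence):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    counted = gc + at
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100 * gc / counted


# Toy gene sequences (made up, not from X. fastidiosa genomes).
toy_genes = {"geneA": "ATGCGC", "geneB": "ATATAT"}
toy_gc = {name: gc_content(seq) for name, seq in toy_genes.items()}
```

Applying the function to every gene in every genome, as in the dictionary comprehension, yields the kind of gene-level table that a mixed-effect model could then take as input.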


Comparative analysis of codon usage patterns in SARS-CoV-2, its mutants and other respiratory viruses

10.1101/2021.03.03.433699 ◽

2021 ◽

Author(s):

Neetu Tyagi ◽

Rahila Sardar ◽

Dinesh Gupta

Keyword(s):

Codon Usage ◽

Codon Usage Bias ◽

Gc Content ◽

Respiratory Illness ◽

Respiratory Viruses ◽

Nucleotide Composition ◽

Health Crisis ◽

Study Results ◽

Usage Patterns ◽

The Difference

Abstract The Coronavirus disease 2019 (COVID-19) outbreak caused by Severe Acute Respiratory Syndrome Coronavirus 2 virus (SARS-CoV-2) poses a worldwide human health crisis, causing respiratory illness with a high mortality rate. To investigate the factors governing codon usage bias in all the respiratory viruses, including SARS-CoV-2 isolates from different geographical locations (~62K) and two recently emerging strains from the United Kingdom (UK), i.e., VUI202012/01, and South Africa (SA), i.e., 501.Y.V2, codon usage bias (CUBs) analysis was performed. The analysis includes RSCU analysis, GC content calculation, ENC analysis, dinucleotide frequency and neutrality plot analysis. We were motivated to conduct the study to fulfil two primary aims: first, to identify the difference in codon usage bias amongst all SARS-CoV-2 genomes and, secondly, to compare their CUBs properties with other respiratory viruses. A biased nucleotide composition was found, as most of the highly preferred codons were A/U-ending in all the respiratory viruses studied here. Compared with the human host, the RSCU analysis led to the identification of 11 over-represented codons and 9 under-represented codons in SARS-CoV-2 genomes. Correlation analysis of ENC and GC3s revealed that mutational pressure is the leading force determining the CUBs. The present study results yield a better understanding of codon usage preferences for SARS-CoV-2 genomes and reveal the possible evolutionary determinants responsible for the biases found among the respiratory viruses, thus unveiling a unique feature of SARS-CoV-2 evolution and adaptation. To the best of our knowledge, this is the first attempt at comparative CUBs analysis on the worldwide genomes of SARS-CoV-2, including novel emerged strains and other respiratory viruses.
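The RSCU analysis named above is usually taken to mean relative synonymous codon usage: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below follows that usual definition under stated assumptions: it uses the standard genetic code, converts RNA (U) to the DNA alphabet (T), and skips stop codons. The function name `rscu` and the toy sequence are hypothetical, and the authors' actual software is not stated in the abstract.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(coding_sequence):
    """Return {codon: RSCU} for amino acids that occur in the sequence."""
    seq = coding_sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by amino acid, skipping stop codons.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined
        expected = total / len(family)
        for c in family:
            values[c] = counts[c] / expected
    return values


# Toy sequence: three CUG and one UUA codon, all encoding leucine.
toy_rscu = rscu("CUGCUGCUGUUA")
```

An RSCU above 1 marks a codon used more often than expected under equal usage, and a value below 1 marks one used less often. The thresholds applied in the study to call codons over- or under-represented are not given in the abstract.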


Current Perspectives on High-Throughput Sequencing (HTS) for Adventitious Virus Detection: Upstream Sample Processing and Library Preparation

Viruses ◽

10.3390/v10100566 ◽

2018 ◽

Vol 10 (10) ◽

pp. 566 ◽

Cited By ~ 9

Author(s):

Siemon Ng ◽

Cassandra Braxton ◽

Marc Eloit ◽

Szi Feng ◽

Romain Fragnoud ◽

...

Keyword(s):

Sample Preparation ◽

Nucleic Acids ◽

High Throughput ◽

High Throughput Sequencing ◽

Virus Detection ◽

Extraction Methods ◽

Control Measures ◽

Acid Extraction ◽

Library Preparation ◽

Sample Processing

A key step for broad viral detection using high-throughput sequencing (HTS) is optimizing the sample preparation strategy for extracting viral-specific nucleic acids since viral genomes are diverse: They can be single-stranded or double-stranded RNA or DNA, and can vary from a few thousand bases to over millions of bases, which might introduce biases during nucleic acid extraction. In addition, viral particles can be enveloped or non-enveloped with variable resistance to pre-treatment, which may influence their susceptibility to extraction procedures. Since the identity of the potential adventitious agents is unknown prior to their detection, efficient sample preparation should be unbiased toward all different viral types in order to maximize the probability of detecting any potential adventitious viruses using HTS. Furthermore, the quality assessment of each step for sample processing is also a critical but challenging aspect. This paper presents our current perspectives for optimizing upstream sample processing and library preparation as part of the discussion in the Advanced Virus Detection Technologies Interest group (AVDTIG). The topics include: Use of nuclease treatment to enrich for encapsidated nucleic acids, techniques for amplifying low amounts of virus nucleic acids, selection of different extraction methods, relevant controls, the use of spike recovery experiments, and quality control measures during library preparation.


Transcriptome analysis of two contrasting genotypes of pearl millet to gain insight into heat stress responses

10.21203/rs.3.rs-23605/v1 ◽

2020 ◽

Author(s):

Albert Maibam ◽

Sunil Nigombam ◽

Harinder Vishwakarma ◽

Showkat Ahmad Lone ◽

Kishor Gaikwad ◽

...

Keyword(s):

Heat Stress ◽

Pearl Millet ◽

Stress Responses ◽

Simple Sequence Repeats ◽

Pennisetum Glaucum ◽

Average Length ◽

Gc Content ◽

Functional Markers ◽

Sequencing Data ◽

Simple Sequence

Abstract Background: Pennisetum glaucum (L.) R. Br. is mainly grown in arid and semi-arid regions. Being naturally tolerant to various adverse conditions, it is a good biological resource for deciphering the molecular basis of abiotic stresses such as heat stress in plants, but only limited studies have been carried out to date to this effect. Here, we performed RNA-sequencing from the leaf of two contrasting genotypes of pearl millet (841-B and PPMI-69) subjected to heat stress (42 °C for 6 h). Results: Over 274 million high quality reads with an average length of 150 nt were generated. Assembly was carried out using Trinity, obtaining 47,310 unigenes having an average length of 1254 nucleotides, N50 length of 1853 nucleotides and GC content of 53.11%. Blastx resulted in annotation of 35,628 unigenes, and functional classification showed 15,950 unigenes designated to 51 Gene Ontology terms, 13,786 unigenes allocated to 23 Clusters of Orthologous Groups and 4,255 unigenes distributed into 132 functional KEGG pathways. 12,976 simple sequence repeats were identified from 10,294 unigenes for the development of functional markers. A total of 305,759 SNPs were observed in the transcriptome data. Out of 2,301 differentially expressed genes, 10 potential candidate genes were selected based on log2 fold change and adjusted p-value parameters for their differential gene expression by qRT-PCR. Conclusions: The dynamic expression changes in two genotypes of P. glaucum reflect transcriptome regulation of signaling pathways in heat stress response. In order to develop genetic markers, 12,976 simple sequence repeats (SSRs) were identified. The sequencing data generated in this study shall serve as an important resource for further research in the area of crop biotechnology.


Evolution of Nucleotide Composition in the SARS-CoV-2 Lineage: Implications for Vaccine Design

10.20944/preprints202006.0250.v1 ◽

2020 ◽

Author(s):

Sankar Subramanian

Keyword(s):

Base Composition ◽

Scientific Community ◽

Significant Positive Correlation ◽

Gc Content ◽

Nucleotide Composition ◽

Reverse Direction ◽

Novel Coronavirus ◽

Near Future ◽

Systematic Reduction ◽

Gc Contents

The worldwide outbreak of a novel coronavirus, SARS-CoV-2, has caused a pandemic of respiratory disease. Due to this emergency, researchers around the globe have been investigating the evolution of the genome of SARS-CoV-2 in order to design vaccines. Here I examined the evolution of GC content of SARS-CoV-2 by comparing the genomes of the members of the group Betacoronavirus. The results of this investigation revealed a highly significant positive correlation between the GC contents of betacoronaviruses and their divergence from SARS-CoV-2. The betacoronaviruses that are distantly related to SARS-CoV-2 have much higher GC contents than the latter. Conversely, the closely related ones have low GC contents, which are only slightly higher than that of SARS-CoV-2. This suggests a systematic reduction in the GC content in the SARS-CoV-2 lineage over time. The declining trend in this lineage predicts a much-reduced GC content in the coronaviruses that will descend/evolve from SARS-CoV-2 in the future. Due to the three consecutive outbreaks (MERS-CoV, SARS-CoV and SARS-CoV-2) caused by the members of the SARS-CoV-2, the scientific community is emphasizing the need for universal vaccines that are effective across many strains, including those that will inevitably emerge in the near future. The reduction in GC contents implies a higher rate of GC→AT mutations than of mutational changes in the reverse direction. Therefore, understanding the evolution of base composition and mutational patterns of SARS-CoV-2 could be useful in designing broad-spectrum vaccines that could identify and neutralize the present and future strains of this virus.
