DeepSNVMiner: a sequence analysis tool to detect emergent, rare mutations in subsets of cell populations

PeerJ ◽

10.7717/peerj.2074 ◽

2016 ◽

Vol 4 ◽

pp. e2074 ◽

Cited By ~ 12

Author(s):

T. Daniel Andrews ◽

Yogesh Jeelall ◽

Dipti Talaulikar ◽

Christopher C. Goodnow ◽

Matthew A. Field

Keyword(s):

Sequence Analysis ◽

Massively Parallel Sequencing ◽

Sequence Data ◽

False Negative ◽

Synthetic Data ◽

Dna Amplification ◽

Sequence Variants ◽

Analysis Tool ◽

Dna Molecules ◽

Read Group

Background. Massively parallel sequencing technology is being used to sequence highly diverse populations of DNA such as that derived from heterogeneous cell mixtures containing both wild-type and disease-related states. At the core of such molecule tagging techniques is the tagging and identification of sequence reads derived from individual input DNA molecules, which must be first computationally disambiguated to generate read groups sharing common sequence tags, with each read group representing a single input DNA molecule. This disambiguation typically generates huge numbers of read groups, each of which requires additional variant detection analysis steps to be run specific to each read group, thus representing a significant computational challenge. While sequencing technologies for producing these data are approaching maturity, the lack of available computational tools for analysing such heterogeneous sequence data represents an obstacle to the widespread adoption of this technology. Results. Using synthetic data we successfully detect unique variants at dilution levels of 1 in 1,000,000 molecules, and find DeepSNVMiner obtains significantly lower false positive and false negative rates compared to popular variant callers GATK, SAMTools, FreeBayes and LoFreq, particularly as the variant concentration levels decrease. In a dilution series with genomic DNA from two cell lines, we find DeepSNVMiner identifies a known somatic variant when present at concentrations of only 1 in 1,000 molecules in the input material, the lowest concentration amongst all variant callers tested. Conclusions. Here we present DeepSNVMiner, a tool to disambiguate tagged sequence groups and robustly identify sequence variants specific to subsets of starting DNA molecules that may indicate the presence of a disease.
DeepSNVMiner is an automated workflow of custom sequence analysis utilities and open source tools able to differentiate somatic DNA variants from artefactual sequence variants that likely arose during DNA amplification. The workflow remains flexible such that it may be customised to variants of the data production protocol used, and supports reproducible analysis through detailed logging and reporting of results. DeepSNVMiner is available for academic non-commercial research purposes at https://github.com/mattmattmattmatt/DeepSNVMiner.
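
The read-group idea described in this abstract can be illustrated with a minimal Python sketch. It is not DeepSNVMiner's implementation: the function names, the fixed-length leading tag, the 90% agreement threshold and the assumption that every read is already aligned to the start of the reference are all hypothetical simplifications chosen for illustration.

```python
from collections import Counter, defaultdict


def group_reads_by_tag(reads, tag_len):
    """Group reads by their leading tag; returns {tag: [read_without_tag, ...]}."""
    groups = defaultdict(list)
    for read in reads:
        tag, body = read[:tag_len], read[tag_len:]
        groups[tag].append(body)
    return dict(groups)


def consensus(seqs, min_fraction=0.9):
    """Per-position majority base; 'N' where agreement falls below min_fraction."""
    length = min(len(s) for s in seqs)
    out = []
    for i in range(length):
        base, count = Counter(s[i] for s in seqs).most_common(1)[0]
        out.append(base if count / len(seqs) >= min_fraction else "N")
    return "".join(out)


def call_group_variants(reads, reference, tag_len, min_reads=3):
    """Return {tag: [(position, ref_base, alt_base), ...]} for well-supported groups."""
    calls = {}
    for tag, seqs in group_reads_by_tag(reads, tag_len).items():
        if len(seqs) < min_reads:
            continue  # too few reads to trust a consensus for this molecule
        cons = consensus(seqs)
        variants = [
            (i, ref, alt)
            for i, (ref, alt) in enumerate(zip(reference, cons))
            if alt != "N" and alt != ref
        ]
        if variants:
            calls[tag] = variants
    return calls


reference = "ACGT"
reads = (
    ["AAA" + "ACGT"] * 3                      # wild-type molecule
    + ["CCC" + "ATGT"] * 3                    # molecule carrying a C>T change
    + ["GGG" + s for s in ("ACGT", "ACGT", "ACTT")]  # one discordant read
    + ["TTT" + "AGGT"]                        # single read, below min_reads
)
calls = call_group_variants(reads, reference, tag_len=3)
```

In this toy input only the `CCC` group is reported: a change seen in every read of a group is treated as a property of the input molecule, while a change seen in a minority of reads within a group is masked as a likely amplification or sequencing artefact.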

Genetic Diversity and Pathogenic Variability Among Isolates of Colletotrichum Species from Strawberry

Phytopathology ◽

10.1094/phyto.2003.93.2.219 ◽

2003 ◽

Vol 93 (2) ◽

pp. 219-228 ◽

Cited By ~ 51

Author(s):

Béatrice Denoyes-Rothan ◽

Guy Guérin ◽

Christophe Délye ◽

Barbara Smith ◽

Dror Minz ◽

...

Keyword(s):

Sequence Analysis ◽

Sequence Data ◽

Random Amplified Polymorphic Dna ◽

Molecular Data ◽

Its2 Sequence ◽

Host Specialization ◽

Pathogenicity Tests ◽

Colletotrichum Spp ◽

Rapd Polymorphism ◽

Pathogenic Variability

Ninety-five isolates of Colletotrichum including 81 isolates of C. acutatum (62 from strawberry) and 14 isolates of C. gloeosporioides (13 from strawberry) were characterized by various molecular methods and pathogenicity tests. Results based on random amplified polymorphic DNA (RAPD) polymorphism and internal transcribed spacer (ITS) 2 sequence data provided clear genetic evidence of two subgroups in C. acutatum. The first subgroup, characterized as CA-clonal, included only isolates from strawberry and exhibited identical RAPD patterns and nearly identical ITS2 sequence analysis. A larger genetic group, CA-variable, included isolates from various hosts and exhibited variable RAPD patterns and divergent ITS2 sequence analysis. Within the C. acutatum population isolated from strawberry, the CA-clonal group is prevalent in Europe (54 isolates of 62). A subset of European C. acutatum isolates isolated from strawberry and representing the CA-clonal and CA-variable groups was assigned to two pathogenicity groups. No correlation could be drawn between genetic and pathogenicity groups. On the basis of molecular data, it is proposed that the CA-clonal subgroup contains closely related, highly virulent C. acutatum isolates that may have developed host specialization to strawberry. C. gloeosporioides isolates from Europe, which were rarely observed, were either slightly pathogenic or nonpathogenic on strawberry. The absence of correlation between genetic polymorphism and geographical origin in Colletotrichum spp. suggests a worldwide dissemination of isolates, probably through international plant exchanges.

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Briefings in Bioinformatics ◽

10.1093/bib/bby017 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1542-1559 ◽

Cited By ~ 44

Author(s):

Damla Senol Cali ◽

Jeremie S Kim ◽

Saugata Ghose ◽

Can Alkan ◽

Onur Mutlu

Keyword(s):

Sequence Analysis ◽

Genome Assembly ◽

Sequence Data ◽

Error Rates ◽

Nanopore Sequencing ◽

Memory Usage ◽

Sequencing Technology ◽

Assembly Pipeline ◽

And Performance ◽

Polishing Tool

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

Sequence analysis of heparan sulphate and heparin oligosaccharides

Biochemical Journal ◽

10.1042/bj3390767 ◽

1999 ◽

Vol 339 (3) ◽

pp. 767-773 ◽

Cited By ~ 35

Author(s):

Romain R. VIVÈS ◽

David A. PYE ◽

Markku SALMIVIRTA ◽

John J. HOPWOOD ◽

Ulf LINDAHL ◽

...

Keyword(s):

Sequence Analysis ◽

Protein Interactions ◽

Sequence Data ◽

Specific Binding ◽

Heparan Sulphate ◽

Biologically Active ◽

Simple Method ◽

Gag Protein ◽

Specific Binding Sites ◽

Strong Anion Exchange

The biological activity of heparan sulphate (HS) and heparin largely depends on internal oligosaccharide sequences that provide specific binding sites for an extensive range of proteins. Identification of such structures is crucial for the complete understanding of glycosaminoglycan (GAG)-protein interactions. We describe here a simple method of sequence analysis relying on the specific tagging of the sugar reducing end by 3H radiolabelling, the combination of chemical scission and specific enzymic digestion to generate intermediate fragments, and the analysis of the generated products by strong-anion-exchange HPLC. We present full sequence data on microgram quantities of four unknown oligosaccharides (three HS-derived hexasaccharides and one heparin-derived octasaccharide) which illustrate the utility and relative simplicity of the technique. The results clearly show that it is also possible to read sequences of inhomogeneous preparations. Application of this technique to biologically active oligosaccharides should accelerate progress in the understanding of HS and heparin structure-function relationships and provide new insights into the primary structure of these polysaccharides.

GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Abstract. Summary: When analyzing sequence data, genetic variants are considered one by one, taking no account of whether or not they are found in the same individual. However, variant combinations might be key players in some diseases, as variants that are neutral on their own can become deleterious when associated together. GEMPROT is a new analysis tool that takes a phased VCF file and visualizes the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation: GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information: Supplementary data are available at Bioinformatics online.

Aspergillus fuscicans (Aspergillaceae, Eurotiales), a new species in section Usti from Argentinean semi-arid soil

Phytotaxa ◽

10.11646/phytotaxa.343.1.6 ◽

2018 ◽

Vol 343 (1) ◽

pp. 67 ◽

Cited By ~ 2

Author(s):

STELLA M. ROMERO ◽

RICARDO M. COMERIO ◽

VIVIANA A. BARRERA ◽

ANDREA I. ROMERO

Keyword(s):

New Species ◽

Sequence Analysis ◽

Sequence Data ◽

Arid Soil ◽

Physiological Studies ◽

A New Species ◽

Semi Arid ◽

Aspergillus Section ◽

Β Tubulin

Aspergillus fuscicans, a new species within Aspergillus section Usti from Argentinean semi-arid soil, is introduced. Molecular, morphological and physiological studies were conducted, based on sequence analysis of partial β-tubulin and calmodulin sequence data. Aspergillus fuscicans formed a distinct, well-defined clade related to A. calidoustus and A. pseudodeflectus. In addition, A. fuscicans was able to grow and sporulate at 37 °C, and had a negative Ehrlich reaction. Morphological and physiological features could be used to differentiate the new species from its phylogenetically related taxa.

Comparison of Normalization Methods for Construction of Large, Multiplex Amplicon Pools for Next-Generation Sequencing

Applied and Environmental Microbiology ◽

10.1128/aem.02585-09 ◽

2010 ◽

Vol 76 (12) ◽

pp. 3863-3868 ◽

Cited By ~ 48

Author(s):

J. Kirk Harris ◽

Jason W. Sahl ◽

Todd A. Castoe ◽

Brandie D. Wagner ◽

David D. Pollock ◽

...

Keyword(s):

Next Generation Sequencing ◽

Massively Parallel Sequencing ◽

Sequence Data ◽

Cost Savings ◽

Massively Parallel ◽

Next Generation ◽

Normalization Methods ◽

The Cost ◽

Generation Sequencing

ABSTRACT Constructing mixtures of tagged or bar-coded DNAs for sequencing is an important requirement for the efficient use of next-generation sequencers in applications where limited sequence data are required per sample. There are many applications in which next-generation sequencing can be used effectively to sequence large mixed samples; an example is the characterization of microbial communities where ≤1,000 sequences per sample are adequate to address research questions. Thus, it is possible to examine hundreds to thousands of samples per run on massively parallel next-generation sequencers. However, the cost savings for efficient utilization of sequence capacity are realized only if the production and management costs associated with construction of multiplex pools are also scalable. One critical step in multiplex pool construction is the normalization process, whereby equimolar amounts of each amplicon are mixed. Here we compare three approaches (spectroscopy, size-restricted spectroscopy, and quantitative binding) for normalization of large, multiplex amplicon pools for performance and efficiency. We found that the quantitative binding approach was superior and represents an efficient scalable process for construction of very large, multiplex pools with hundreds and perhaps thousands of individual amplicons included. We demonstrate the increased sequence diversity identified with higher throughput. Massively parallel sequencing can dramatically accelerate microbial ecology studies by allowing appropriate replication of sequence acquisition to account for temporal and spatial variations. Further, population studies to examine genetic variation, which require even lower levels of sequencing, should be possible where thousands of individual bar-coded amplicons are examined in parallel.
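
The normalization step mentioned in this abstract, mixing equimolar amounts of each amplicon, reduces to a small calculation once a concentration has been measured. The Python sketch below is an illustration only and does not reproduce any of the three approaches the paper compares. It assumes concentrations measured in ng/µL, known amplicon lengths, and the common approximation of 660 g/mol per base pair of double-stranded DNA; the function and sample names are hypothetical.

```python
def nanomolar(conc_ng_per_ul, length_bp, mass_per_bp=660.0):
    """Convert a dsDNA concentration in ng/µL to nM (numerically equal to fmol/µL)."""
    return conc_ng_per_ul / (mass_per_bp * length_bp) * 1e6


def equimolar_volumes(amplicons, fmol_each):
    """Volume in µL of each amplicon needed to contribute fmol_each femtomoles.

    amplicons maps a sample name to (concentration in ng/µL, length in bp).
    """
    volumes = {}
    for name, (conc, length) in amplicons.items():
        volumes[name] = fmol_each / nanomolar(conc, length)
    return volumes


# Hypothetical measurements for two bar-coded amplicons of equal length.
pool = equimolar_volumes(
    {"sample_A": (6.6, 500), "sample_B": (13.2, 500)},
    fmol_each=40,
)
```

With these inputs `sample_A` is about 20 nM and `sample_B` about 40 nM, so roughly 2 µL and 1 µL respectively deliver the same number of molecules to the pool.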

Genomic prediction based on selected variants from imputed whole-genome sequence data in Australian sheep populations

Genetics Selection Evolution ◽

10.1186/s12711-019-0514-2 ◽

2019 ◽

Vol 51 (1) ◽

Cited By ~ 6

Author(s):

Nasir Moghaddar ◽

Majid Khansefid ◽

Julius H. J. van der Werf ◽

Sunduimijid Bolormaa ◽

Naomi Duijvesteijn ◽

...

Keyword(s):

Genome Sequence ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Sequence Data ◽

Whole Genome Sequence ◽

Sequence Variants ◽

Whole Genome ◽

Absolute Increase ◽

Genome Sequence Data ◽

Australian Sheep

Abstract. Background: Whole-genome sequence (WGS) data could contain information on genetic variants at or in high linkage disequilibrium with causative mutations that underlie the genetic variation of polygenic traits. Thus far, genomic prediction accuracy has shown limited increase when using such information in dairy cattle studies, in which one or few breeds with limited diversity predominate. The objective of our study was to evaluate the accuracy of genomic prediction in a multi-breed Australian sheep population of relatively less related target individuals, when using information on imputed WGS genotypes. Methods: Between 9626 and 26,657 animals with phenotypes were available for nine economically important sheep production traits and all had WGS imputed genotypes. About 30% of the data were used to discover predictive single nucleotide polymorphisms (SNPs) based on a genome-wide association study (GWAS) and the remaining data were used for training and validation of genomic prediction. Prediction accuracy using selected variants from imputed sequence data was compared to that using a standard array of 50k SNP genotypes, thereby comparing genomic best linear prediction (GBLUP) and Bayesian methods (BayesR/BayesRC). Accuracy of genomic prediction was evaluated in two independent populations that were each lowly related to the training set, one being purebred Merino and the other crossbred Border Leicester x Merino sheep. Results: A substantial improvement in prediction accuracy was observed when selected sequence variants were fitted alongside 50k genotypes as a separate variance component in GBLUP (2GBLUP) or in Bayesian analysis as a separate category of SNPs (BayesRC). From an average accuracy of 0.27 in both validation sets for the 50k array, the average absolute increase in accuracy across traits with 2GBLUP was 0.083 and 0.073 for purebred and crossbred animals, respectively, whereas with BayesRC it was 0.102 and 0.087.
The average gain in accuracy was smaller when selected sequence variants were treated in the same category as 50k SNPs. Very little improvement over 50k prediction was observed when using all WGS variants. Conclusions: Accuracy of genomic prediction in diverse sheep populations increased substantially by using variants selected from whole-genome sequence data based on an independent multi-breed GWAS, when compared to genomic prediction using standard 50k genotypes.

From Sequence Data to Patient Result: A Solution for HIV Drug Resistance Genotyping With Exatype, End to End Software for Pol-HIV-1 Sanger Based Sequence Analysis and Patient HIV Drug Resistance Result Generation

Journal of the International Association of Providers of AIDS Care (JIAPAC) ◽

10.1177/2325958220962687 ◽

2020 ◽

Vol 19 ◽

pp. 232595822096268

Author(s):

Leonard Kingwara ◽

Muthoni Karanja ◽

Catherine Ngugi ◽

Geoffrey Kangogo ◽

Kipkerich Bera ◽

...

Keyword(s):

Drug Resistance ◽

Sequence Analysis ◽

Standard Method ◽

Sequence Data ◽

Scale Up ◽

Reference Laboratory ◽

Hiv Viral Load ◽

Hiv Drug Resistance ◽

Base Calling ◽

Hands On

Introduction: With the rapid scale-up of antiretroviral therapy (ART) to treat HIV infection, there are ongoing concerns regarding probable emergence and transmission of HIV drug resistance (HIVDR) mutations. This scale-up has led to an increased need for routine HIVDR testing to inform the clinical decision on a regimen switch. Although the majority of wet laboratory processes are standardized, slow, labor-intensive data transfer and subjective manual sequence interpretation steps are still required to finalize and release patient results. We thus set out to validate the applicability of a software package to generate HIVDR patient results from raw sequence data independently. Methods: We assessed the performance characteristics of Hyrax Bioscience’s Exatype (a sequence data to patient result, fully automated sequence analysis software, which consolidates RECall, MEGA X and the Stanford HIV database) against the standard method (RECall and Stanford database). Exatype is a web-based HIV drug resistance bioinformatic pipeline available at sanger.exatype.com. To validate Exatype, we used a test set of 135 remnant HIV viral load samples at the National HIV Reference Laboratory (NHRL). Result: We analyzed, and successfully generated results for, 126 sequences out of 135 specimens by both the Standard and Exatype software. Result production using Exatype required minimal hands-on time in comparison to the Standard (6 computation-hours using the standard method versus 1.5 Exatype computation-hours). Concordance between the 2 systems was 99.8% for 311,227 bases compared. 99.7% of the 0.2% discordant bases were attributed to nucleotide mixtures as a result of the sequence editing in RECall. Both methods identified similar (99.1%) critical antiretroviral resistance-associated mutations, resulting in a 99.2% concordance of resistance susceptibility interpretations.
The base-calling comparison between the 2 methods had a Cohen’s kappa of 0.97 to 0.99, implying an almost perfect agreement with minimal base-calling variation. On a predefined dataset, RECall editing displayed the highest probability to score mixtures accurately (1 vs. 0.71) and the lowest chance to inaccurately assign mixtures to pure nucleotides (0.002–0.0008). This advantage is attributable to the manual sequence editing in RECall. Conclusion: The reduction in hands-on time needed is a benefit when using the Exatype HIV DR sequence analysis platform and result generation tool. There is a minimal difference in base calling between Exatype and standard methods. Although the discrepancy has minimal impact on drug resistance interpretation, allowing sequence editing in Exatype, as in RECall, can significantly improve its performance.
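
Cohen's kappa, the agreement statistic quoted in this abstract, corrects the observed agreement between two callers for the agreement expected by chance. The following minimal Python sketch shows the standard calculation on two equal-length strings of base calls. It is not taken from the study: the function name and the toy sequences are hypothetical, and the study's own alignment and mixture handling are not modelled.

```python
from collections import Counter


def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa for two equal-length sequences of categorical calls."""
    if len(calls_a) != len(calls_b) or len(calls_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of positions where both methods agree.
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Chance agreement: product of each method's marginal call frequencies.
    freq_a, freq_b = Counter(calls_a), Counter(calls_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both methods made one identical call everywhere
    return (observed - expected) / (1.0 - expected)


# Toy example: the two methods disagree at one of four positions.
kappa = cohens_kappa("AACC", "AACA")
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, giving a kappa of 0.5. Values near 1, such as the 0.97 to 0.99 reported above, indicate agreement far beyond what chance alone would produce.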

Collaborative RISC-Score Database: Creation of an International Database for Retroviral Integration Analysis.

Blood ◽

10.1182/blood.v104.11.2110.2110 ◽

2004 ◽

Vol 104 (11) ◽

pp. 2110-2110

Author(s):

Stephanie Laufs ◽

Frank Giordano ◽

Daniel Lauterborn ◽

K. Zsuzsanna Nagy ◽

Kurt Fellenberg ◽

...

Keyword(s):

Gene Therapy ◽

T Cells ◽

Sequence Analysis ◽

Data Base ◽

Hematopoietic Cells ◽

Retroviral Vector ◽

Analysis Tool ◽

Vector Integration ◽

Sequence Analysis Tool ◽

Set Up

Abstract Increasing use of hematopoietic stem cells for retroviral vector-mediated gene therapy and recent reports on leukemogenesis in mice and humans have created intense interest in characterizing vector integrations on the genomic level. As techniques to determine insertion sites are more commonly applied in gene therapy laboratories, there is a need to systematically collect and analyze the data arising from such studies in a vector insertion database. This will allow determining factors responsible for preferential integration of various vector types in specific chromosomal regions, genes or gene sections. The information derived from a vector insertion data base will be useful to recognize more “dangerous” vector types and may provide useful information for vector design. We have set up an automatic sequence analysis tool (ensuring quality criteria e.g. verification of LTR- and adapter sequence, score >40, e-value >10e-40, hit RefSeq, next RefSeq etc.) which simplifies data input enormously while ensuring high quality standards. Our group is establishing the "collaborative RISC (retroviral insertion estimation into chromosome) -Score Database (CRSD)" assessment project, based on the M-CHIPS (Multi-Conditional Hybridisation Intensity Processing System) microarray data warehouse and analysis software (K. Fellenberg et al. 2001, 2002). The data obtained from the sequence analysis tool were automatically fed into the data base. A total of 287 retroviral vector integration sites were isolated and sequence analysis was performed with the above-described analysis tool. In human bone marrow repopulating cells they occurred with significantly increased frequency in chromosomes 17 and 19 (n=189). Analysis of targeted RefSeq genes showed a favored integration (48%) within the first intron.
In comparison, retroviral vector integrations in T-cells (n=98) showed an entirely different chromosomal distribution pattern while the percentage of the targeted RefSeq genes was similar (46%). Further, more than 1200 sequences were submitted to the data base, originating from different vectors (SF-MDR-, MoLV-based TK/neoR-Mo3TIN-, Moloney-MGMT-, Harvey-based Neo-, Harvey-based MDR-, and lentiviral GFP-SIN-vectors) and different transduced cells (mouse hematopoietic cells, mouse fibroblasts, rhesus hematopoietic cells, human hematopoietic cells, human T-cells). The set-up and internal structure of the data base will be presented. Collaborations have been forged to include further groups and vector types. Bioinformatical analysis will allow recognizing even complex vector integration patterns and will broaden our understanding for the determinants of vector integration into the genome. This in turn can lead to the construction of "favorable" vectors and help to reduce the genotoxicity of retroviral or lentiviral vector-mediated gene transfer.

Development of Isothermal LAMP Assay for Detection of Intimin Gene (eae) in E. coli Associated with Piglet Diarrhea

Indian Journal of Animal Research ◽

10.18805/ijar.b-4173 ◽

2020 ◽

Author(s):

Sanjeev Kumar ◽

Jagan Mohanarao Gali ◽

T.K. Dutta ◽

P. Roychoudhury ◽

P.K. Subudhi

Keyword(s):

Bacterial Species ◽

False Negative ◽

Acute Diarrhoea ◽

Lamp Assay ◽

Dna Amplification ◽

Analytical Sensitivity ◽

Conventional Pcr ◽

Isolation And Identification ◽

Reliable Technique ◽

E Coli

Background: Diarrhoeagenic Escherichia coli (DEC), including enteropathogenic E. coli (EPEC) and enterohaemorrhagic E. coli (EHEC), is associated with acute diarrhoea in children and young animals. Virulence is associated with attaching and effacing lesions encoded by the eaeA gene, which is considered a marker for EPEC and EHEC. Laboratory diagnosis of such infections is carried out by traditional bacteriological techniques and by conventional PCR assays. Those techniques often provide false negative results and at the same time are costly as well as difficult to perform at the field level. Loop-mediated isothermal amplification (LAMP), a new-generation DNA amplification assay, was developed for detection of the eaeA gene in E. coli isolated from diarrhoeic piglets. Methods: Samples were collected from diarrhoeic piglets for isolation and identification of E. coli. All isolates were tested for the eaeA gene by conventional PCR using specific primers. The LAMP assay was standardized for detection of the eaeA gene. Analytical sensitivity of LAMP was evaluated using 10-fold serially diluted E. coli genomic DNA. The specificity of the LAMP assay was determined by evaluating the cross reactivity with 19 other enteric and non-enteric bacterial species. Standardized LAMP was applied for detection of the eaeA gene in the field isolates. Result: A total of 37 (24.67%) isolates were recorded as positive for the eaeA gene by conventional PCR, while 49 (32.67%) isolates were recorded as positive for the eaeA gene by LAMP assay. The LAMP assay was 10 times more sensitive than conventional PCR. The LAMP assay was found to be a more sensitive, specific, cost-effective, user-friendly and reliable technique than conventional PCR, and can be applied for screening of the clinical isolates for confirmation of EPEC and/or EHEC.
