chewBBACA: A complete suite for gene-by-gene schema creation and strain identification

2017 ◽  
Author(s):  
Mickael Silva ◽  
Miguel Machado ◽  
Diogo N. Silva ◽  
Mirko Rossi ◽  
Jacob Moran-Gilad ◽  
...  

ABSTRACT Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run on a laptop or on high-performance clusters, making it useful for both small laboratories and large reference centers. ChewBBACA is available at https://github.com/B-UMMI/chewBBACA or as a docker image at https://hub.docker.com/r/ummidock/chewbbaca/.
DATA SUMMARY Assembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted as Streptococcus agalactiae taxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available. Code for the chewBBACA suite is available at https://github.com/B-UMMI/chewBBACA while the tutorial example is found at https://github.com/B-UMMI/chewBBACA_tutorial. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
IMPACT STATEMENT The chewBBACA software offers a computational solution for the creation, evaluation and use of whole genome (wg) and core genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability. The software performs allele calling in a matter of seconds to minutes per strain on a laptop but is easily scalable for the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.
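The idea of gene-by-gene allele calling described above can be illustrated with a minimal Python sketch. This is not chewBBACA's implementation: the function name `call_alleles`, the hash-keyed schema layout and the exact-match rule are all assumptions made for illustration, and real tools also handle similarity searches, partial genes and paralogs.

```python
import hashlib


def call_alleles(cds_by_locus, schema):
    """Assign one allele ID per schema locus by exact sequence match.

    cds_by_locus: {locus: DNA sequence found in the genome} (hypothetical input)
    schema: {locus: {sequence_hash: allele_id}}, updated in place with novel alleles
    Returns {locus: allele_id, or None when the locus was not found}.
    """
    profile = {}
    for locus, known_alleles in schema.items():
        seq = cds_by_locus.get(locus)
        if seq is None:
            profile[locus] = None  # locus absent from this genome
            continue
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest not in known_alleles:
            # A sequence never seen before becomes the next allele number.
            known_alleles[digest] = len(known_alleles) + 1
        profile[locus] = known_alleles[digest]
    return profile
```

Calling the function on successive genomes with the same `schema` dictionary yields comparable allelic profiles, because identical sequences at a locus always map to the same integer.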

2021 ◽  
Vol 7 (11) ◽  
Author(s):  
Isabelle Bernaquez ◽  
Christiane Gaudreau ◽  
Pierre A. Pilon ◽  
Sadjia Bekal

Many public health laboratories across the world have implemented whole-genome sequencing (WGS) for the surveillance and outbreak detection of foodborne pathogens. PulseNet-affiliated laboratories have determined that most single-strain foodborne outbreaks are contained within 0–10 multi-locus sequence typing (MLST)-based allele differences and/or core genome single-nucleotide variants (SNVs). In addition to being a food- and travel-associated outbreak pathogen, most Shigella spp. cases occur through continuous person-to-person transmission, predominantly involving men who have sex with men (MSM), leading to long-term and recurrent outbreaks. Continuous transmission patterns coupled to genetic evolution under antibiotic treatment pressure require an assessment of existing WGS-based subtyping methods and interpretation criteria for cluster inclusion/exclusion. An evaluation of 4 WGS-based subtyping methods [SNVPhyl, coreMLST, core genome MLST (cgMLST) and whole-genome MLST (wgMLST)] was performed on 9 foodborne-, travel- and MSM-related retrospective outbreaks from a collection of 91 Shigella flexneri and 232 Shigella sonnei isolates to determine the methods’ epidemiological concordance, discriminatory power, robustness and ability to generate stable interpretation criteria. The discriminatory powers were ranked as follows: coreMLST < SNVPhyl < cgMLST < wgMLST (range: 0.970–1.000). The genetic differences observed for non-MSM-related Shigella spp. outbreaks respect the standard 0–10 allele/SNV guideline; however, mobile genetic element (MGE)-encoded loci caused inflated genetic variation and discrepant phylogenies for prolonged MSM-related S. sonnei outbreaks via wgMLST. The S. sonnei correlation coefficients of wgMLST were also the lowest at 0.680, 0.703 and 0.712 for SNVPhyl, coreMLST and cgMLST, respectively.
Plasmid maintenance, mobilization and conjugation-associated genes were found to be the main source of genetic distance inflation in addition to prophage-related genes. Duplicated alleles arising from the repeated nature of IS elements were also responsible for many false cg/wgMLST differences. The coreMLST approach was shown to be the most robust, followed by SNVPhyl and wgMLST for inter-laboratory comparability. Our results highlight the need for validating species-specific subtyping methods based on microbial genome plasticity and outbreak dynamics in addition to the importance of filtering confounding MGEs for cluster detection.
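Two of the quantities mentioned in this abstract, discriminatory power and allele differences, can be sketched in a few lines of Python. The abstract does not state which formula was used; the sketch assumes Simpson's index of diversity in the Hunter and Gaston form, a common choice for typing methods, and the function names are hypothetical.

```python
from collections import Counter


def discriminatory_index(type_labels):
    """Simpson's index of diversity (Hunter-Gaston form, assumed here).

    type_labels: one subtype label per isolate.
    Returns the probability that two isolates drawn without replacement
    receive different subtypes.
    """
    n = len(type_labels)
    if n < 2:
        raise ValueError("need at least two isolates")
    counts = Counter(type_labels)
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1 - same_type_pairs / (n * (n - 1))


def allele_distance(profile_a, profile_b):
    """Count loci with different allele IDs, skipping loci missing (None) in either profile."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )
```

Under the 0–10 allele guideline quoted above, two isolates would be candidates for the same cluster when `allele_distance` returns a value of 10 or less; whether that threshold is appropriate for a given organism is exactly what the study evaluates.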


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Kathy E. Raven ◽  
Sophia T. Girgis ◽  
Asha Akram ◽  
Beth Blane ◽  
Danielle Leek ◽  
...  

Abstract Whole-genome sequencing is likely to become increasingly used by local clinical microbiology laboratories, where sequencing volume is low compared with national reference laboratories. Here, we describe a universal protocol for simultaneous DNA extraction and sequencing of numerous different bacterial species, allowing mixed species sequence runs to meet variable laboratory demand. We assembled test panels representing 20 clinically relevant bacterial species. The DNA extraction process used the QIAamp mini DNA kit, to which different combinations of reagents were added. Thereafter, a common protocol was used for library preparation and sequencing. The addition of lysostaphin, lysozyme or buffer ATL (a tissue lysis buffer) alone did not produce sufficient DNA for library preparation across the species tested. By contrast, lysozyme plus lysostaphin produced sufficient DNA across all 20 species. DNA from 15 of 20 species could be extracted from a 24-h culture plate, while the remainder required 48–72 h. The process demonstrated 100% reproducibility. Sequencing of the resulting DNA was used to recapitulate previous findings for species, outbreak detection, antimicrobial resistance gene detection and capsular type. This single protocol for simultaneous processing and sequencing of multiple bacterial species supports low volume and rapid turnaround time by local clinical microbiology laboratories.


2017 ◽  
Author(s):  
Lennard Epping ◽  
Andries J. van Tonder ◽  
Rebecca A. Gladstone ◽  
Stephen D. Bentley ◽  
Andrew J. Page ◽  
...  

ABSTRACT Streptococcus pneumoniae is responsible for 240,000–460,000 deaths in children under 5 years of age each year. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines. Recent efforts have been made to infer serotypes directly from genomic data but current software approaches are limited and do not scale well. Here, we introduce a novel method, SeroBA, which uses a hybrid assembly and mapping approach. We compared SeroBA against real and simulated data and present results on the concordance and computational performance against a validation dataset, the robustness and scalability when analysing a large dataset, and the impact of varying the depth of coverage in the cps locus region on sequence-based serotyping. SeroBA can predict serotypes, by identifying the cps locus, directly from raw whole genome sequencing read data with 98% concordance using a k-mer based method, can process 10,000 samples in just over 1 day using a standard server and can call serotypes at a coverage as low as 10x. SeroBA is implemented in Python3 and is freely available under an open source GPLv3 license from: https://github.com/sanger-pathogens/seroba.
DATA SUMMARY The reference genome Streptococcus pneumoniae ATCC 700669 is available from National Center for Biotechnology Information (NCBI) with the accession number: FM211187. Simulated paired end reads for experiment 2 have been deposited in FigShare: https://doi.org/10.6084/m9.figshare.5086054.v1. Accession numbers for all other experiments are listed in Supplementary Table S1 and Supplementary Table S2. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
IMPACT STATEMENT This article describes SeroBA, a k-mer based method for predicting the serotypes of Streptococcus pneumoniae from Whole Genome Sequencing (WGS) data. SeroBA can identify 92 serotypes and 2 subtypes with constant memory usage and low computational costs. We showed that SeroBA is able to reliably predict serotypes at a depth of coverage as low as 10x and is scalable to large datasets.
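As a rough illustration of what a k-mer based comparison between reads and reference loci can look like, the Python sketch below scores each candidate reference by the fraction of its k-mers that occur in the reads. It is not SeroBA's algorithm, which combines this kind of screening with assembly and mapping; the function names, the scoring rule and the default `k` are assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_reference(reads, references, k=5):
    """Pick the reference whose k-mers are best covered by the reads.

    reads: iterable of read sequences
    references: {name: reference sequence}, e.g. one cps locus per serotype
    Returns (best_name, {name: fraction of reference k-mers seen in reads}).
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    scores = {}
    for name, ref in references.items():
        ref_kmers = kmers(ref, k)
        # References shorter than k have no k-mers and score zero.
        scores[name] = len(ref_kmers & read_kmers) / len(ref_kmers) if ref_kmers else 0.0
    return max(scores, key=scores.get), scores
```

A real application would use a much larger `k` and would have to tolerate sequencing errors and closely related references, which is where assembly and mapping steps become necessary.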


2014 ◽  
Author(s):  
Todd J. Treangen ◽  
Brian D. Ondov ◽  
Sergey Koren ◽  
Adam M. Phillippy

Though many microbial species or clades now have hundreds of sequenced genomes, existing whole-genome alignment methods do not efficiently handle comparisons on this scale. Here we present the Harvest suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Combined they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.


2021 ◽  
Vol 7 (12) ◽  
Author(s):  
Kyrylo Bessonov ◽  
Chad Laing ◽  
James Robertson ◽  
Irene Yong ◽  
Kim Ziebell ◽  
...  

Escherichia coli is a priority foodborne pathogen of public health concern and phenotypic serotyping provides critical information for surveillance and outbreak detection activities. Public health and food safety laboratories are increasingly adopting whole-genome sequencing (WGS) for characterizing pathogens, but it is imperative to maintain serotype designations in order to minimize disruptions to existing public health workflows. Multiple in silico tools have been developed for predicting serotypes from WGS data, including SRST2, SerotypeFinder and EToKi EBEis, but these tools were not designed with the specific requirements of diagnostic laboratories, which include: speciation, input data flexibility (fasta/fastq), quality control information and easily interpretable results. To address these specific requirements, we developed ECTyper (https://github.com/phac-nml/ecoli_serotyping) for performing both speciation within Escherichia and Shigella, and in silico serotype prediction. We compared the serotype prediction performance of each tool on a newly sequenced panel of 185 isolates with confirmed phenotypic serotype information. We found that all tools were highly concordant, with 92–97 % for O-antigens and 98–100 % for H-antigens, and ECTyper having the highest rate of concordance. We extended the benchmarking to a large panel of 6954 publicly available E. coli genomes to assess the performance of the tools on a more diverse dataset. On the public data, there was a considerable drop in concordance, with 75–91 % for O-antigens and 62–90 % for H-antigens, and ECTyper and SerotypeFinder being the most concordant. This study highlights that in silico predictions show high concordance with phenotypic serotyping results, but there are notable differences in tool performance. ECTyper provides highly accurate and sensitive in silico serotype predictions, in addition to speciation, and is designed to be easily incorporated into bioinformatic workflows.
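The concordance figures quoted above are percentages of isolates for which a predicted antigen matches the phenotypic result. A minimal Python sketch of that calculation follows; the function name and the choice to score only isolates present in both tables are assumptions, and the abstract does not say how missing or ambiguous calls were handled.

```python
def concordance(predicted, reference):
    """Percentage of shared isolates whose predicted label equals the reference label.

    predicted, reference: {isolate_id: antigen label, e.g. "O157" or "H7"}
    Only isolates present in both dictionaries are scored (an assumption).
    """
    shared = predicted.keys() & reference.keys()
    if not shared:
        raise ValueError("no isolates in common")
    matches = sum(predicted[isolate] == reference[isolate] for isolate in shared)
    return 100.0 * matches / len(shared)
```

Computing the value separately for O-antigens and H-antigens, as the study does, only requires calling the function once per antigen with the corresponding label dictionaries.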


2020 ◽  
Vol 87 (1) ◽  
Author(s):  
Swarnali Louha ◽  
Richard J. Meinersmann ◽  
Zaid Abdo ◽  
Mark E. Berrang ◽  
Travis C. Glenn

ABSTRACT A reliable and standardized classification of Listeria monocytogenes is important for accurate strain identification during outbreak investigations. Current whole-genome sequencing (WGS)-based approaches for strain characterization are either difficult to standardize, rendering them less suitable for data exchange, or are not freely available. Thus, we developed a portable and open-source tool, Haplo-ST, to improve standardization and provide maximum discriminatory potential to WGS data tied to a multilocus sequence typing (MLST) framework. Haplo-ST performs whole-genome MLST (wgMLST) for L. monocytogenes while allowing for data exchangeability worldwide. This tool takes in (i) raw WGS reads as input, (ii) cleans the raw data according to user-specified parameters, (iii) assembles genes across loci by mapping to genes from reference strains, and (iv) assigns allelic profiles to assembled genes and provides a wgMLST subtyping for each isolate. Data exchangeability relies on the tool assigning allelic profiles based on a centralized nomenclature defined by the widely used BIGSdb-Lm database. Tests of Haplo-ST’s performance with simulated reads from L. monocytogenes reference strains demonstrated high sensitivity (97.5%), and coverage depths of ≥20× were found to be sufficient for wgMLST profiling. We then used Haplo-ST to characterize and differentiate between two groups of L. monocytogenes isolates derived from the natural environment and poultry processing plants. Phylogenetic reconstruction identified lineages within each group, and no lineage specificity was observed with isolate phenotypes (transient versus persistent) or origins. Genetic differentiation analyses between isolate groups identified 21 significantly differentiated loci, potentially enriched for adaptation and persistence of L. monocytogenes within poultry processing plants. 
IMPORTANCE We have developed an open-source tool (https://github.com/swarnalilouha/Haplo-ST) that provides allele-based subtyping of L. monocytogenes isolates at the whole-genome level. Along with allelic profiles, this tool also generates allele sequences and identifies paralogs, which is useful for phylogenetic tree reconstruction and deciphering relationships between closely related isolates. More broadly, Haplo-ST is flexible and can be adapted to characterize the genome of any haploid organism simply by installing an organism-specific gene database. Haplo-ST also allows for scalable subtyping of isolates; fewer reference genes can be used for low-resolution typing, whereas higher resolution can be achieved by increasing the number of genes used in the analysis. Our tool enabled clustering of L. monocytogenes isolates into lineages and detection of potential loci for adaptation and persistence in food processing environments. Findings from these analyses highlight the effectiveness of Haplo-ST in subtyping and evaluating relationships among isolates in studies of bacterial population genetics.


2018 ◽  
Author(s):  
Alexander M. Wailan ◽  
Francesc Coll ◽  
Eva Heinz ◽  
Gerry Tonkin-Hill ◽  
Jukka Corander ◽  
...  

ABSTRACT The ability to distinguish between pathogens is a fundamental requirement to understand the epidemiology of infectious diseases. Phylogenetic analysis of genomic data can provide a powerful platform to identify lineages within bacterial populations, and thus inform outbreak investigation and transmission dynamics. However, resolving differences between pathogens associated with low variant (LV) populations, which carry low median pairwise single nucleotide variant (SNV) distances, remains a major challenge. Here we present rPinecone, an R package designed to define sub-lineages within closely related LV populations. rPinecone uses a root-to-tip directional approach to define sub-lineages within a phylogenetic tree according to SNV distance from the ancestral node. The utility of this program was demonstrated using genomic data of two LV populations: a hospital outbreak of methicillin-resistant Staphylococcus aureus and endemic Salmonella Typhi from rural Cambodia. rPinecone identified the transmission branches of the hospital outbreak and geographically-confined lineages in Cambodia. Sub-lineages identified by rPinecone in both analyses were phylogenetically robust. It is anticipated that rPinecone can be used to discriminate between lineages of bacteria from LV populations where other methods fail, enabling a deeper understanding of infectious disease epidemiology for public health purposes.
DATA SUMMARY Source code for rPinecone is available on GitHub under the open source licence GNU GPL 3 (url: https://github.com/alexwailan/rpinecone). Newick format files for both phylogenetic trees have been deposited in Figshare (url: https://doi.org/10.6084/m9.figshare.7022558). Geographical analysis of the S. Typhi dataset using Microreact is available at https://microreact.org/project/r1IqkrN1X. Accession numbers, metadata and sample lineage results of both datasets used in this paper are listed in the supplementary tables. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.
IMPACT STATEMENT Whole genome sequence data from bacterial pathogens is increasingly used in the epidemiological investigation of infectious disease, both in outbreak and endemic situations. However, distinguishing bacterial species which are both very similar and which are likely to come from a small geographical and temporal range presents a major technical challenge for epidemiologists. rPinecone was designed to address this challenge and utilises phylogenetic data to define lineages within bacterial populations that have limited variation. This approach is therefore of great interest to epidemiologists as it adds a further level of clarity beyond that offered by existing approaches, which have not been designed to consider bacterial isolates containing variation that exists only transiently but is epidemiologically informative. rPinecone has the flexibility to be applied to multiple pathogens and has direct application for investigations of clinical outbreaks and endemic disease to understand transmission dynamics or geographical hotspots of disease.


2020 ◽  
Author(s):  
Swarnali Louha ◽  
Richard J. Meinersmann ◽  
Zaid Abdo ◽  
Mark E. Berrang ◽  
Travis C. Glenn

ABSTRACT A reliable and standardized classification of Listeria monocytogenes (Lm) is important for accurate strain identification during outbreak investigations. Current whole-genome sequencing (WGS) based approaches for strain characterization either lack standardization, rendering them less suitable for data exchange, or are not freely available. Thus, we developed a portable and open-source tool, Haplo-ST, to improve standardization and provide maximum discriminatory potential to WGS data tied to an MLST (multilocus sequence typing) framework. Haplo-ST performs whole-genome MLST (wgMLST) for Lm while allowing for data exchangeability worldwide. This tool (i) takes in raw WGS reads as input, (ii) cleans the raw data according to user-specified parameters, (iii) assembles genes across loci by mapping to genes from reference strains, and (iv) assigns allelic profiles to assembled genes and provides a wgMLST subtyping for each isolate. Data exchangeability relies on the tool assigning allelic profiles based on a centralized nomenclature defined by the widely used BIGSdb-Lm database. Tests of Haplo-ST’s performance with simulated reads from Lm reference strains yielded a high sensitivity of 97.5%, and coverage depths of ≥20× were found to be sufficient for wgMLST profiling. We used Haplo-ST to characterize and differentiate between two groups of Lm isolates, derived from the natural environment and poultry processing plants. Phylogenetic reconstruction showed sharp delineation of lineages within each group and no lineage-specificity was observed with isolate phenotypes (transient vs. persistent) or origins. Genetic differentiation analyses between isolate groups identified 21 significantly differentiated loci, potentially enriched for adaptation and persistence of Lm within poultry processing plants.
IMPORTANCE We have developed an open-source tool that provides allele-based subtyping of Lm isolates at the whole genome level. Along with allelic profiles, this tool also generates allele sequences and identifies paralogs, which is useful for phylogenetic tree reconstruction and deciphering relationships between closely related isolates. More broadly, Haplo-ST is flexible and can be adapted to characterize the genome of any haploid organism simply by installing an organism-specific gene database. Haplo-ST also allows for scalable subtyping of isolates; fewer reference genes can be used for low-resolution typing, whereas higher resolution can be achieved by increasing the number of genes used in the analysis. Our tool enabled clustering of Lm isolates into lineages and detection of potential loci for adaptation and persistence in food processing environments. Findings from these analyses highlight the effectiveness of Haplo-ST in subtyping and evaluating relationships among isolates for routine surveillance, outbreak investigations and source tracking.


2019 ◽  
Author(s):  
Thomas Sakoparnig ◽  
Chris Field ◽  
Erik van Nimwegen

Abstract Although homologous recombination is accepted to be common in bacteria, so far it has been challenging to accurately quantify its impact on genome evolution within bacterial species. We here introduce methods that use the statistics of single-nucleotide polymorphism (SNP) splits in the core genome alignment of a set of strains to show that, for many bacterial species, recombination dominates genome evolution. Each genomic locus has been overwritten so many times by recombination that it is impossible to reconstruct the clonal phylogeny and, instead of a consensus phylogeny, the phylogeny typically changes many thousands of times along the core genome alignment. We also show how SNP splits can be used to quantify the relative rates with which different subsets of strains have recombined in the past. We find that virtually every strain has a unique pattern of frequencies with which its lineages have recombined with those of other strains, and that the relative rates with which different subsets of strains share SNPs follow long-tailed distributions. Our findings show that bacterial populations are neither clonal nor freely recombining, but structured such that recombination rates between different lineages vary along a continuum spanning several orders of magnitude, with a unique pattern of rates for each lineage. Thus, rather than reflecting clonal ancestry, whole genome phylogenies reflect these long-tailed distributions of recombination rates.
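To make the notion of a SNP split concrete, the Python sketch below extracts, from each biallelic alignment column, the subset of strains that differ from a fixed reference strain, and then tests whether two such splits can sit on the same tree. This is an illustration only, not the authors' method: the function names are hypothetical, and the pairwise compatibility rule used here is the standard four-gamete criterion, which assumes each site mutated at most once.

```python
from collections import Counter


def snp_splits(alignment):
    """Count the strain bipartitions induced by biallelic alignment columns.

    alignment: {strain: sequence}, all sequences of equal length.
    Each split is stored as the frozenset of strains whose base differs from
    that of the alphabetically first strain, so a bipartition has one canonical form.
    """
    strains = sorted(alignment)
    first = strains[0]
    length = len(alignment[first])
    splits = Counter()
    for i in range(length):
        column = {s: alignment[s][i].upper() for s in strains}
        bases = set(column.values())
        # Keep only clean biallelic sites (no gaps or ambiguity codes).
        if len(bases) != 2 or not bases <= set("ACGT"):
            continue
        splits[frozenset(s for s in strains if column[s] != column[first])] += 1
    return splits


def compatible(split_a, split_b, all_strains):
    """Four-gamete test: True if both splits can be explained by a single tree."""
    a, b = set(split_a), set(split_b)
    rest_a, rest_b = set(all_strains) - a, set(all_strains) - b
    return any(not (x & y) for x in (a, rest_a) for y in (b, rest_b))
```

In this simplified picture, a purely clonal history with no repeated mutation would produce only mutually compatible splits, so the frequency of incompatible pairs along an alignment gives an intuition for why the abstract describes the phylogeny as changing many times along the core genome.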


2019 ◽  
Vol 69 (4) ◽  
pp. 998-1000 ◽  
Author(s):  
Wenjing Wu ◽  
Zhiyong Zong

The aim of this study was to further clarify the taxonomic relationship between the two recently described bacterial species, Lelliottia jeotgali sp. nov. and Lelliottia aquatilis sp. nov. Whole genome sequences of the type strains of the two species are available for analysis. Average nucleotide identity (ANI) and in silico DNA–DNA hybridization (isDDH) values between the two type strains were determined. The ANI and isDDH values between type strains of the two species are 98.7 and 91.0 %, respectively, which are higher than the cut-offs used to define a bacterial species. It is therefore clear that the two species actually belong to the same species. The name of L. jeotgali was published at an earlier date than that of L. aquatilis. We therefore propose that L. aquatilis is a later heterotypic synonym of L. jeotgali.

