Bayesian inference of infectious disease transmission from whole genome sequence data

2013
Author(s):  
Xavier Didelot ◽  
Jennifer Gardy ◽  
Caroline Colijn

Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered -- how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via Markov chain Monte Carlo. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.
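The sampling step can be illustrated with a toy Metropolis-Hastings sketch. All distances, the penalty rate, and the single-site proposal below are invented for illustration; the paper's actual model accounts for timing and within-host diversity, and this toy ignores cycle constraints on the network:

```python
import math
import random

# Toy Metropolis-Hastings sampler over "who infected whom" assignments.
# Each case except the index case 0 is assigned an infector, and a
# made-up likelihood penalises the SNP distance along each link.

def log_likelihood(infector, dist, rate=2.0):
    return sum(-rate * dist[c][infector[c]] for c in infector)

def sample_networks(dist, n_iter=2000, seed=42):
    rng = random.Random(seed)
    cases = list(range(1, len(dist)))
    infector = {c: 0 for c in cases}            # start: all infected by case 0
    ll = log_likelihood(infector, dist)
    counts = {c: {} for c in cases}             # tallies over the chain
    for _ in range(n_iter):
        c = rng.choice(cases)                   # re-propose one case's infector
        proposal = rng.choice([x for x in range(len(dist)) if x != c])
        old = infector[c]
        infector[c] = proposal
        new_ll = log_likelihood(infector, dist)
        if math.log(rng.random()) < new_ll - ll:
            ll = new_ll                         # accept the move
        else:
            infector[c] = old                   # reject, restore
        for k, v in infector.items():
            counts[k][v] = counts[k].get(v, 0) + 1
    return counts

# Case 1 is genetically close to case 0; case 2 is close to case 1.
dist = [[0, 1, 9],
        [1, 0, 2],
        [9, 2, 0]]
counts = sample_networks(dist)
best = {c: max(tally, key=tally.get) for c, tally in counts.items()}
print(best)
```

Tallying visits rather than keeping a single best network is what yields the posterior uncertainty the abstract emphasises: weakly supported links show up as split tallies.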

2013
Vol 368 (1614)
pp. 20120202
Author(s):  
Nicholas J. Croucher ◽  
Simon R. Harris ◽  
Yonatan H. Grad ◽  
William P. Hanage

Sequence data are well established in the reconstruction of the phylogenetic and demographic scenarios that have given rise to outbreaks of viral pathogens. The application of similar methods to bacteria has been hindered mainly by the lack of high-resolution nucleotide sequence data from quality samples. Genomic methods, both emerging and already available, have greatly increased the amount of data that can be used to characterize an isolate and its relationship to others. However, differences in sequencing platforms and data analysis mean that these enhanced data come with a cost in terms of portability: results from one laboratory may not be directly comparable with those from another. Moreover, genomic data for many bacteria bear the mark of a history of extensive recombination, which has the potential to greatly confound phylogenetic and coalescent analyses. Here, we discuss the exacting requirements of genomic epidemiology, and means by which the distorting signal of recombination can be minimized to permit the growing genomic datasets from bacterial pathogens to be leveraged.
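One simple way to minimise the recombination signal, in the spirit of tools such as Gubbins, is to flag genome windows with anomalously dense SNPs, since clustered substitutions suggest import by recombination rather than independent point mutation. A minimal sketch, with window size, threshold, and all positions invented:

```python
# Flag windows whose SNP density far exceeds the genome-wide mean;
# such windows can then be masked before phylogenetic analysis.

def flag_recombinant_windows(snp_positions, genome_len, window=100, factor=5):
    mean_density = len(snp_positions) / genome_len
    flagged = []
    for start in range(0, genome_len, window):
        n = sum(start <= p < start + window for p in snp_positions)
        if n / window > factor * mean_density:
            flagged.append((start, start + window))
    return flagged

# 10 kb alignment: sparse background SNPs plus a dense cluster at 4000-4100.
snps = [500, 1500, 2500, 3500] + list(range(4000, 4100, 5)) + [6000, 8000]
print(flag_recombinant_windows(snps, 10_000))   # only the dense window is flagged
```

Real tools model the expected substitution process per branch rather than using a single global threshold, but the density intuition is the same.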


2018
Author(s):  
Jesse Eaton ◽  
Jingyi Wang ◽  
Russell Schwartz

Phylogenetic reconstruction of tumor evolution has emerged as a crucial tool for making sense of the complexity of emerging cancer genomic data sets. Despite the growing use of phylogenetics in cancer studies, the field has only slowly adapted to the many ways that tumor evolution differs from classic species evolution. One crucial question in that regard is how to handle inference of structural variations (SVs), which are a major mechanism of evolution in cancers but have been largely neglected in tumor phylogenetics to date, in part due to the challenges of reliably detecting and typing SVs and interpreting them phylogenetically. We present a novel method for reconstructing evolutionary trajectories of SVs from bulk whole-genome sequence data via joint deconvolution and phylogenetics, to infer clonal subpopulations and reconstruct their ancestry. We establish a novel likelihood model for joint deconvolution and phylogenetic inference on bulk SV data and formulate an associated optimization algorithm. We demonstrate the approach to be efficient and accurate for realistic scenarios of SV mutation on simulated data. Application to breast cancer genomic data from The Cancer Genome Atlas (TCGA) shows it to be practical and effective at reconstructing features of SV-driven evolution in single tumors. All code can be found at https://github.com/jaebird123/tusv
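The deconvolution half of such a problem can be illustrated in isolation: treat observed bulk variant frequencies as a mixture over clone genotypes and search for mixture weights minimising squared error. The genotypes and frequencies below are invented, and the paper's method is far richer, inferring the genotypes, the phylogeny, and the mixture jointly for structural variants rather than assuming the clones:

```python
import itertools

# Toy bulk deconvolution: bulk variant frequencies are modelled as a
# weighted sum of (assumed known) clone genotypes; a grid search finds
# the mixture weights minimising squared error.

clones = [
    [1, 0, 0],   # clone A carries variant 1
    [1, 1, 0],   # clone B, a descendant of A, adds variant 2
    [1, 1, 1],   # clone C adds variant 3
]
observed = [1.0, 0.7, 0.2]       # bulk frequency of each variant

def predicted(weights):
    return [sum(w * g[v] for w, g in zip(weights, clones))
            for v in range(len(observed))]

grid = [i / 10 for i in range(11)]
best = min(
    (w for w in itertools.product(grid, repeat=3)
     if abs(sum(w) - 1) < 1e-9),                 # weights must sum to 1
    key=lambda w: sum((p - o) ** 2 for p, o in zip(predicted(w), observed)),
)
print(best)   # recovered clone proportions
```

The nested genotype structure (A ⊂ B ⊂ C) is what makes the answer identifiable here; the same perfect-phylogeny-style constraint is what couples deconvolution to tree inference in the full method.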


PLoS Genetics
2021
Vol 17 (12)
pp. e1009944
Author(s):  
Torsten Pook ◽  
Adnane Nemri ◽  
Eric Gerardo Gonzalez Segovia ◽  
Daniel Valle Torres ◽  
Henner Simianer ◽  
...  

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing genotyping technologies when resources are limited. In this work, we propose a new imputation pipeline (“HBimpute”) that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputation error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par or even slightly better than results obtained with high-density array data (600k). In particular, for genomic prediction, we observe slightly higher data quality for the sequence data compared to the 600k array in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.
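The key idea, pooling reads across locally similar lines, can be sketched as follows. The line names, read lists, and majority-vote caller below are all invented; the real pipeline uses HaploBlocker's haplotype blocks and a proper variant caller:

```python
from collections import Counter

# Sketch of the pooling idea: if several lines share a haplotype block,
# their reads over that block can be combined, turning many shallow
# 0.5X samples into one deep pileup for variant calling.

def pooled_call(block_members, reads_by_line, site):
    """Call one site for a haplotype block by pooling members' reads."""
    pooled = Counter()
    for line in block_members:
        pooled.update(reads_by_line[line].get(site, []))
    return pooled.most_common(1)[0][0] if pooled else None

# Three doubled haploid lines sharing a block; each alone has only one
# or two reads at site 42, but the pooled depth is four.
reads = {
    "DH1": {42: ["A"]},
    "DH2": {42: ["A", "C"]},   # "C" plays the role of a sequencing error
    "DH3": {42: ["A"]},
}
print(pooled_call(["DH1", "DH2", "DH3"], reads, 42))
```

Pooling is safe only because the lines were first established as locally identical; the block assignment, not the caller, carries the statistical weight.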


2014
Author(s):  
Caroline Colijn ◽  
Jennifer Gardy

Whole genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize outbreak-level overall transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterised infectious periods and epidemiological and clinical meta-data, which may not always be available, and typically require computationally intensive analysis focussing on the branch lengths in phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the overall transmission patterns underlying an outbreak. Here we use simulated outbreaks to train and then test computational classifiers. We test the method on data from two real-world outbreaks. We find that different transmission patterns result in quantitatively different phylogenetic tree shapes. We describe five topological features that summarize a phylogeny’s structure and find that computational classifiers based on these are capable of predicting an outbreak’s transmission dynamics. The method is robust to variations in the transmission parameters and network types, and recapitulates known epidemiology of previously characterized real-world outbreaks. We conclude that there are simple structural properties of phylogenetic trees which, when combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous transmission, and chains of transmission. This is possible using genome data alone, and can be done during an outbreak. We discuss the implications for management of outbreaks.
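The idea of summarising tree topology numerically can be illustrated with two classic shape statistics, the cherry count and the Colless imbalance. These are illustrative, not necessarily the five features used in the paper, and the tree encoding is a toy:

```python
# Trees as nested 2-tuples of leaf labels; two simple topology features.

def leaves(t):
    return 1 if not isinstance(t, tuple) else sum(leaves(c) for c in t)

def cherries(t):
    """Count internal nodes whose two children are both leaves."""
    if not isinstance(t, tuple):
        return 0
    left, right = t
    if leaves(left) == 1 and leaves(right) == 1:
        return 1
    return cherries(left) + cherries(right)

def colless(t):
    """Sum over internal nodes of the left/right leaf-count imbalance."""
    if not isinstance(t, tuple):
        return 0
    left, right = t
    return abs(leaves(left) - leaves(right)) + colless(left) + colless(right)

ladder = ((("a", "b"), "c"), "d")      # chain-of-transmission-like shape
balanced = (("a", "b"), ("c", "d"))    # homogeneous-transmission-like shape

for name, tree in [("ladder", ladder), ("balanced", balanced)]:
    print(name, {"cherries": cherries(tree), "colless": colless(tree)})
```

Feature vectors like these, computed over many simulated outbreaks, are exactly the kind of input a standard classifier can be trained on.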


2017
Author(s):  
Phelim Bradley ◽  
Henk C Den Bakker ◽  
Eduardo P. C. Rocha ◽  
Gil McVean ◽  
Zamin Iqbal

Genome sequencing of pathogens is now ubiquitous in microbiology, and the sequence archives are effectively no longer searchable for arbitrary sequences. Furthermore, the exponential increase of these archives is likely to be further spurred by automated diagnostics. To unlock their use for scientific research and real-time surveillance we have combined knowledge about bacterial genetic variation with ideas used in web-search, to build a DNA search engine for microbial data that can grow incrementally. We indexed the complete global corpus of bacterial and viral whole genome sequence data (447,833 genomes), using four orders of magnitude less storage than previous methods. The method allows future scaling to millions of genomes. This renders the global archive accessible to sequence search, which we demonstrate with three applications: ultra-fast search for resistance genes MCR1-3, analysis of host-range for 2827 plasmids, and quantification of the rise of antibiotic resistance prevalence in the sequence archives.
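The indexing idea can be sketched as a Bloom-filter-per-genome design (in the spirit of BIGSI-style indexes): a query hits a genome only if every query k-mer is present in that genome's filter. Sizes, hash count, and sequences below are toy assumptions:

```python
import hashlib

# One tiny Bloom filter of k-mers per genome; a query sequence matches a
# genome only if all of its k-mers are found in that genome's filter.

K, BITS, HASHES = 5, 512, 3

def kmer_bits(kmer):
    for i in range(HASHES):
        h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % BITS

def build_filter(seq):
    bf = bytearray(BITS)            # one byte per bit position, for clarity
    for j in range(len(seq) - K + 1):
        for b in kmer_bits(seq[j:j + K]):
            bf[b] = 1
    return bf

def search(query, filters):
    kmers = [query[j:j + K] for j in range(len(query) - K + 1)]
    return [name for name, bf in filters.items()
            if all(bf[b] for km in kmers for b in kmer_bits(km))]

genomes = {
    "isolate_1": "ACGTACGTGGTTAACC",
    "isolate_2": "TTTTGGGGCCCCAAAA",
}
filters = {n: build_filter(s) for n, s in genomes.items()}
print(search("ACGTGGTT", filters))
```

The storage win in the real system comes from packing these columns bit-wise and querying many filters at once; the membership logic is the same.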


Author(s):  
Marjolein E.M. Toorians ◽  
Ailene MacPherson ◽  
T. Jonathan Davies

With the worldwide decrease of biodiversity coinciding with an increase in disease outbreaks, investigating this link is more important than ever before. This review outlines the modelling methods commonly used for pathogen transmission in animal host systems. There are a multitude of ways a pathogen can invade and spread through a host population. The assumptions of the transmission model used to capture disease propagation determine the outbreak potential, the basic reproduction number (R0). This review offers insight into the assumptions and motivation behind common transmission mechanisms and introduces a general framework in which contact rates, the most important parameter in disease dynamics, determine the transmission mechanism. Using the general function introduced here and this general transmission model framework, we provide a guide for future disease ecologists on how to pick the contact function that best suits their system. Additionally, this manuscript attempts to bridge the gap between mathematical disease modelling and the controversial and heavily debated disease-diversity relationship, by expanding the summarized models to multi-host systems and explaining the role of host diversity in disease transmission. By breaking the mechanisms of transmission down into a stepwise process, this review will serve as a guide for modelling pathogens in multi-host systems. We further describe these models in the greater context of host diversity and its effect on disease outbreaks, by introducing a novel method to include host species’ evolutionary history in the framework.
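The role of the contact function can be made concrete with the two classic limits, density-dependent and frequency-dependent transmission, whose basic reproduction numbers differ in whether the population size N appears. Parameter values below are illustrative:

```python
# Density-dependent transmission: contacts scale with population size,
# force of infection beta*S*I, so R0 = beta*N/gamma.
# Frequency-dependent transmission: contacts fixed per individual,
# force of infection beta*S*I/N, so R0 = beta/gamma, independent of N.

def r0(beta, gamma, N, mode):
    if mode == "density":
        return beta * N / gamma
    if mode == "frequency":
        return beta / gamma
    raise ValueError(f"unknown transmission mode: {mode}")

print(r0(beta=0.001, gamma=0.1, N=500, mode="density"))    # grows with N
print(r0(beta=0.3, gamma=0.1, N=500, mode="frequency"))    # independent of N
```

The contrast matters for the diversity question the review raises: under density dependence, adding host individuals raises outbreak potential, whereas under frequency dependence it need not.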


2017
Vol 22 (45)
Author(s):  
Markus Petzold ◽  
Karola Prior ◽  
Jacob Moran-Gilad ◽  
Dag Harmsen ◽  
Christian Lück

Introduction: Whole genome sequencing (WGS) is increasingly used in Legionnaires’ disease (LD) outbreak investigations, owing to its higher resolution than sequence-based typing, the gold standard typing method for Legionella pneumophila, when analysing endemic strains. Recently, a gene-by-gene typing approach based on 1,521 core genes, called core genome multilocus sequence typing (cgMLST), was described that enables robust and standardised typing of L. pneumophila. Methods: We applied this cgMLST scheme to isolates obtained during the largest outbreak of LD reported so far in Germany. In this outbreak, the epidemic clone ST345 had been isolated from patients and four different environmental sources. In total, 42 clinical and environmental isolates were retrospectively typed. Results: Epidemiologically unrelated ST345 isolates were clearly distinguishable from the epidemic clone. Remarkably, the epidemic isolates split into two distinct clusters, ST345-A and ST345-B, each containing a mix of clinical and epidemiologically related environmental samples. Discussion/Conclusion: The outbreak was therefore likely caused by both variants of the single sequence type, which pre-existed in the environmental reservoirs. The two clusters differed by 40 alleles located in two neighbouring genomic regions of ca 42 and 26 kb. Additional analysis supported horizontal gene transfer of the two regions as responsible for the difference between the variants. Both regions comprise virulence genes and have previously been reported to be involved in recombination events. This corroborates the notion that genomic outbreak investigations should always take epidemiological information into consideration when making inferences. Overall, cgMLST proved helpful in disentangling the complex genomic epidemiology of the outbreak.
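The gene-by-gene comparison underlying cgMLST reduces each isolate to an allele-number profile, with distance being the count of differing non-missing loci. The five-locus profiles below are invented stand-ins for the 1,521-gene scheme:

```python
# Each isolate is a profile of allele numbers over the core genes;
# None marks a locus that failed to be called and is skipped.

def allele_distance(p, q):
    return sum(a != b for a, b in zip(p, q)
               if a is not None and b is not None)

outbreak_A = [1, 4, 2, 7, 3]
outbreak_B = [1, 9, 2, 8, 3]     # differs at two loci, cf. the 40-allele split
unrelated  = [5, 6, 6, 1, 2]

print(allele_distance(outbreak_A, outbreak_B))
print(allele_distance(outbreak_A, unrelated))
```

Because distances are counts of discrete allele differences rather than SNPs on a shared alignment, profiles from different laboratories are directly comparable, which is the standardisation benefit the abstract describes.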


2022
Author(s):  
Benjamin Sobkowiak ◽  
Kamila Romanowski ◽  
Inna Sekirov ◽  
Jennifer L Gardy ◽  
James Johnston

Pathogen genomic epidemiology is now routinely used worldwide to interrogate infectious disease dynamics. Multiple computational tools that reconstruct transmission networks by coupling genomic data with epidemiological modelling have been developed. The resulting inferences are often used to inform outbreak investigations, yet to date, the performance of these transmission reconstruction tools has not been compared specifically for tuberculosis, a disease process with complex epidemiology that includes variable latency periods and within-host heterogeneity. Here, we carried out a systematic comparison of seven publicly available transmission reconstruction tools, evaluating their accuracy in predicting transmission events in both simulated and real-world Mycobacterium tuberculosis outbreaks. No tool was able to fully resolve transmission networks, though both the single-tree and multi-tree input implementations of TransPhylo identified the most epidemiologically supported transmission events and the fewest false positive links. We observed a high degree of variability in the transmission networks inferred by each approach. Our findings may inform the choice of tools in future tuberculosis transmission analyses and underscore the need for caution when interpreting transmission networks produced using probabilistic approaches.
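The kind of comparison described, scoring each tool's inferred links against epidemiologically supported ones, can be sketched as precision and recall over link sets. The links below are invented:

```python
# Score an inferred transmission network against the set of
# epidemiologically supported links (directed infector -> infectee pairs).

def evaluate(inferred, supported):
    tp = len(inferred & supported)          # epidemiologically supported links found
    precision = tp / len(inferred) if inferred else 0.0
    recall = tp / len(supported) if supported else 0.0
    return precision, recall

supported = {("A", "B"), ("B", "C"), ("B", "D")}
tool_1 = {("A", "B"), ("B", "C"), ("C", "D")}   # one false-positive link
print(evaluate(tool_1, supported))
```

Ranking tools by both scores captures the trade-off the study highlights: a tool can find many supported events yet still report many false positive links.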

