Evaluation of parameters affecting performance and reliability of machine learning-based antibiotic susceptibility testing from whole genome sequencing data

10.1101/607127 ◽

2019 ◽

Author(s):

Allison L. Hicks ◽

Nicole Wheeler ◽

Leonor Sánchez-Busó ◽

Jennifer L. Rakeman ◽

Simon R. Harris ◽

...

Keyword(s):

Machine Learning ◽

Antibiotic Resistance ◽

Antibiotic Susceptibility ◽

Sequence Data ◽

Model Performance ◽

Outcome Data ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Abstract: Prediction of antibiotic resistance phenotypes from whole genome sequencing data by machine learning methods has been proposed as a promising platform for the development of sequence-based diagnostics. However, there has been no systematic evaluation of factors that may influence performance of such models, how they might apply to and vary across clinical populations, and what the implications might be in the clinical setting. Here, we performed a meta-analysis of seven large Neisseria gonorrhoeae datasets, as well as Klebsiella pneumoniae and Acinetobacter baumannii datasets, with whole genome sequence data and antibiotic susceptibility phenotypes using set covering machine classification, random forest classification, and random forest regression models to predict resistance phenotypes from genotype. We demonstrate how model performance varies by drug, dataset, resistance metric, and species, reflecting the complexities of generating clinically relevant conclusions from machine learning-derived models. Our findings underscore the importance of incorporating relevant biological and epidemiological knowledge into model design and assessment and suggest that doing so can inform tailored modeling for individual drugs, pathogens, and clinical populations. We further suggest that continued comprehensive sampling and incorporation of up-to-date whole genome sequence data, resistance phenotypes, and treatment outcome data into model training will be crucial to the clinical utility and sustainability of machine learning-based molecular diagnostics.

Author Summary: Machine learning-based prediction of antibiotic resistance from bacterial genome sequences represents a promising tool to rapidly determine the antibiotic susceptibility profile of clinical isolates and reduce the morbidity and mortality resulting from inappropriate and ineffective treatment. However, while there has been much focus on demonstrating the diagnostic potential of these modeling approaches, there has been little assessment of potential caveats and prerequisites associated with implementing predictive models of drug resistance in the clinical setting. Our results highlight significant biological and technical challenges facing the application of machine learning-based prediction of antibiotic resistance as a diagnostic tool. By outlining specific factors affecting model performance, our findings provide a framework for future work on modeling drug resistance and underscore the necessity of continued comprehensive sampling and reporting of treatment outcome data for building reliable and sustainable diagnostics.

MTBseq: a comprehensive pipeline for whole genome sequence analysis of Mycobacterium tuberculosis complex isolates

PeerJ ◽

10.7717/peerj.5895 ◽

2018 ◽

Vol 6 ◽

pp. e5895 ◽

Cited By ~ 35

Author(s):

Thomas Andreas Kohl ◽

Christian Utpatel ◽

Viola Schleusener ◽

Maria Rosaria De Filippo ◽

Patrick Beckert ◽

...

Keyword(s):

Antibiotic Resistance ◽

Mycobacterium Tuberculosis ◽

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Phylogenomic Analysis ◽

Whole Genome ◽

Sequencing Data ◽

Desktop Computer

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance at the highest resolution, up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference-mapping-based workflow, MTBseq reports detected variant positions annotated with known associations to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable and expandable, and can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation are available from https://github.com/ngs-fzb/MTBseq_source.

Evaluation of parameters affecting performance and reliability of machine learning-based antibiotic susceptibility testing from whole genome sequencing data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1007349 ◽

2019 ◽

Vol 15 (9) ◽

pp. e1007349 ◽

Cited By ~ 20

Author(s):

Allison L. Hicks ◽

Nicole Wheeler ◽

Leonor Sánchez-Busó ◽

Jennifer L. Rakeman ◽

Simon R. Harris ◽

...

Keyword(s):

Machine Learning ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Antibiotic Susceptibility ◽

Susceptibility Testing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Antibiotic Susceptibility Testing ◽

Sequencing Data ◽

Evaluation Of Parameters

KoVariome: Korean National Standard Reference Variome database of whole genomes with comprehensive SNV, indel, CNV, and SV analyses

10.1101/187096 ◽

2017 ◽

Author(s):

Jungeun Kim ◽

Jessica A. Weber ◽

Sungwoong Jho ◽

Jinho Jang ◽

JeHoon Jun ◽

...

Keyword(s):

Sequence Data ◽

Copy Number Variations ◽

Genetic Variations ◽

Korean Population ◽

National Standard ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Personal Genome ◽

Whole Genome ◽

Sequencing Data

Abstract: High-coverage whole-genome sequencing data of a single ethnicity can provide a useful catalogue of population-specific genetic variations. Herein, we report a comprehensive analysis of the Korean population, and present the Korean National Standard Reference Variome (KoVariome). As a part of the Korean Personal Genome Project (KPGP), we constructed the KoVariome database using 5.5 terabases of whole genome sequence data from 50 healthy Korean individuals with an average coverage depth of 31×. In total, KoVariome includes 12.7M single-nucleotide variants (SNVs), 1.7M short insertions and deletions (indels), 4K structural variations (SVs), and 3.6K copy number variations (CNVs). Among them, 2.4M (19%) SNVs and 0.4M (24%) indels were identified as novel. We also discovered selective enrichment of 3.8M SNVs and 0.5M indels in Korean individuals, which were used to filter out 1,271 coding-SNVs not originally removed from the 1,000 Genomes Project data when prioritizing disease-causing variants. CNV analyses revealed gene losses related to bone mineral densities and duplicated genes involved in brain development and fat reduction. Finally, KoVariome health records were used to identify novel disease-causing variants in the Korean population, demonstrating the value of high-quality ethnic variation databases for the accurate interpretation of individual genomes and the precise characterization of genetic variations.

Learning From Limited Data: Towards Best Practice Techniques for Antimicrobial Resistance Prediction From Whole Genome Sequencing Data

Frontiers in Cellular and Infection Microbiology ◽

10.3389/fcimb.2021.610348 ◽

2021 ◽

Vol 11 ◽

Author(s):

Lukas Lüftinger ◽

Peter Májek ◽

Stephan Beisken ◽

Thomas Rattei ◽

Andreas E. Posch

Keyword(s):

Machine Learning ◽

Antimicrobial Resistance ◽

Best Practice ◽

Cross Validation ◽

Model Performance ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Limited Data ◽

Sequencing Data ◽

Resistance Prediction

Antimicrobial resistance prediction from whole genome sequencing (WGS) data is an emerging application of machine learning, promising to improve antimicrobial resistance surveillance and outbreak monitoring. Despite significant reductions in sequencing cost, the availability and sampling diversity of WGS data with matched antimicrobial susceptibility testing (AST) profiles required for training of WGS-AST prediction models remain limited. Best practice machine learning techniques are required to ensure trained models generalize to independent data for optimal predictive performance. Limited data restricts the choice of machine learning training and evaluation methods and can result in overestimation of model performance. We demonstrate that the widely used random k-fold cross-validation method is ill-suited for application to small bacterial genomics datasets and offer an alternative cross-validation method based on genomic distance. We benchmarked three machine learning architectures previously applied to the WGS-AST problem on a set of 8,704 genome assemblies from five clinically relevant pathogens across 77 species-compound combinations collated from public databases. We show that individual models can be effectively ensembled to improve model performance. By combining models via stacked generalization with cross-validation, a model ensembling technique suitable for small datasets, we improved average sensitivity and specificity of individual models by 1.77% and 3.20%, respectively. Furthermore, stacked models exhibited improved robustness and were thus less prone to outlier performance drops than individual component models. In this study, we highlight best practice techniques for antimicrobial resistance prediction from WGS data and introduce the combination of genome distance aware cross-validation and stacked generalization for robust and accurate WGS-AST.
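
The idea of cross-validation based on genomic distance can be illustrated with a minimal sketch. It is not the authors' implementation: the single-linkage clustering, the `threshold` parameter, the greedy fold balancing, and the function names `cluster_by_distance` and `distance_aware_folds` are all assumptions chosen for illustration. The point it shows is that closely related genomes are kept in the same fold, so a model is never tested on near-duplicates of its training genomes.

```python
from collections import defaultdict


def cluster_by_distance(dist, threshold):
    """Single-linkage clustering: genomes closer than `threshold` share a cluster.

    `dist` is a symmetric matrix (list of lists) of pairwise genomic distances.
    This clustering rule is an illustrative assumption, not the published method.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return list(clusters.values())


def distance_aware_folds(dist, threshold, k):
    """Assign whole clusters to k folds, largest cluster first, smallest fold first."""
    folds = [[] for _ in range(k)]
    clusters = cluster_by_distance(dist, threshold)
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


# Toy distance matrix: genomes 0-2 are near-identical, 3-4 are near-identical,
# and genome 5 is distant from everything else.
dist = [
    [0.00, 0.01, 0.02, 0.30, 0.31, 0.40],
    [0.01, 0.00, 0.01, 0.29, 0.30, 0.41],
    [0.02, 0.01, 0.00, 0.28, 0.32, 0.39],
    [0.30, 0.29, 0.28, 0.00, 0.02, 0.35],
    [0.31, 0.30, 0.32, 0.02, 0.00, 0.36],
    [0.40, 0.41, 0.39, 0.35, 0.36, 0.00],
]
folds = distance_aware_folds(dist, threshold=0.05, k=2)
print(folds)
```

With random k-fold splitting, genomes 0, 1, and 2 would often be separated, which is one way performance can be overestimated on small, clonal datasets. Grouping by distance trades some fold-size balance for independence between training and test sets.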

Accurate Phasing of Pedigree Genotypes Using Whole Genome Sequence Data

10.1101/148510 ◽

2017 ◽

Author(s):

A.N. Blackburn ◽

M.Z. Kos ◽

N.B. Blackburn ◽

J.M. Peralta ◽

P. Stevens ◽

...

Keyword(s):

Error Rate ◽

Sequence Data ◽

Software Implementation ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Genotype Data ◽

Whole Genome ◽

Genotyping Error ◽

Sequencing Data ◽

Missing Genotypes

Abstract: Phasing, the process of predicting haplotypes from genotype data, is an important undertaking in genetics and an ongoing area of research. Phasing methods, and associated software, designed specifically for pedigrees are urgently needed. Here we present a new method for phasing genotypes from whole genome sequencing data in pedigrees: PULSAR (Phasing Using Lineage Specific Alleles / Rare variants). The method is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles, are highly informative for identifying haplotypes that are identical-by-descent between individuals within a pedigree. Through extensive simulation we assess the performance of PULSAR in a variety of pedigree sizes and structures, and we explore the effects of genotyping errors and presence of non-sequenced individuals on its performance. If the genotyping error rate is sufficiently low, PULSAR can phase > 99.9% of heterozygous genotypes with a switch error rate below 1 × 10⁻⁴ in pedigrees where all individuals are sequenced. We demonstrate that the method is highly accurate and consistently outperforms the long-range phasing approach used for comparison in our benchmarking. The method also holds promise for fixing genotype errors or imputing missing genotypes. The software implementation of this method is freely available.
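
The switch error rate quoted above is a standard phasing metric; the sketch below shows one common way to compute it and is not taken from the PULSAR software. The function name `switch_error_rate` and the input encoding (one 0/1 allele per heterozygous site, describing the first haplotype) are assumptions made for illustration.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of adjacent heterozygous-site pairs whose relative phase is wrong.

    Each argument lists, for every heterozygous site in order, the allele (0 or 1)
    carried on the first haplotype. Only relative phase between neighbouring sites
    is scored, so an entirely flipped haplotype counts as error-free.
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same sites")
    if len(true_hap) < 2:
        raise ValueError("need at least two heterozygous sites")
    # match[i] is True when the inferred allele agrees with the truth at site i.
    match = [t == p for t, p in zip(true_hap, inferred_hap)]
    # A switch occurs wherever agreement changes between consecutive sites.
    switches = sum(1 for a, b in zip(match, match[1:]) if a != b)
    return switches / (len(match) - 1)


truth = [0, 1, 1, 0, 1, 0]
inferred = [0, 1, 0, 1, 0, 0]
print(switch_error_rate(truth, inferred))  # two switches over five adjacent pairs
```

Under this definition, a rate below 1 × 10⁻⁴ corresponds to fewer than one switch per 10,000 adjacent pairs of heterozygous sites.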

Genome-scale profiling reveals noncoding loci carry higher proportions of concordant data

Molecular Biology and Evolution ◽

10.1093/molbev/msab026 ◽

2021 ◽

Author(s):

Robert Literman ◽

Rachel Schwartz

Keyword(s):

Sequence Data ◽

Phylogenetic Signal ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Coding Sequences ◽

Evolutionary Forces ◽

Tree Inference ◽

Intergenic Regions

Abstract: Many evolutionary relationships remain controversial despite whole-genome sequencing data. These controversies arise in part due to challenges associated with accurately modeling the complex phylogenetic signal coming from genomic regions experiencing distinct evolutionary forces. Here we examine how different regions of the genome support or contradict well-established hypotheses among three mammal groups using millions of orthologous parsimony-informative biallelic sites [PIBS] distributed across primate, rodent, and Pecora genomes. We compared PIBS concordance percentages among locus types (e.g. coding sequences, introns, intergenic regions), and contrasted PIBS utility over evolutionary timescales. Sites derived from noncoding sequences provided more data and proportionally more concordant sites compared with those from coding sequences [CDS] in all clades. CDS PIBS were also predominant drivers of tree incongruence in two cases of topological conflict. PIBS derived from most locus types provided surprisingly consistent support for splitting events spread across the timescales we examined, although we find evidence that CDS and intronic PIBS may, respectively and to a limited degree, inform disproportionately about older and younger splits. In this era of accessible whole genome sequence data, these results (1) suggest benefits to more intentionally focusing on noncoding loci as robust data for tree inference, and (2) reinforce the importance of accurate modeling, especially when using CDS data.
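
A PIBS concordance percentage can be sketched as follows. This is a simplified illustration rather than the authors' pipeline: it assumes no missing data, treats a site as concordant only when the two-way split of taxa it induces exactly matches a clade of the reference tree, and uses hypothetical names (`is_pibs`, `site_split`, `concordance_percentage`) and toy taxa.

```python
def is_pibs(column):
    """True if a site has exactly two alleles, each present in at least two taxa."""
    counts = {}
    for allele in column.values():
        counts[allele] = counts.get(allele, 0) + 1
    return len(counts) == 2 and all(c >= 2 for c in counts.values())


def site_split(column):
    """Bipartition of taxa induced by a biallelic site."""
    groups = {}
    for taxon, allele in column.items():
        groups.setdefault(allele, set()).add(taxon)
    return frozenset(frozenset(g) for g in groups.values())


def concordance_percentage(columns, reference_clades, taxa):
    """Percentage of PIBS whose split matches a clade of the reference tree."""
    all_taxa = frozenset(taxa)
    reference_splits = {
        frozenset([frozenset(clade), all_taxa - frozenset(clade)])
        for clade in reference_clades
    }
    pibs = [col for col in columns if is_pibs(col)]
    if not pibs:
        return 0.0
    concordant = sum(1 for col in pibs if site_split(col) in reference_splits)
    return 100.0 * concordant / len(pibs)


taxa = ["A", "B", "C", "D", "E"]
reference_clades = [{"A", "B"}, {"A", "B", "C"}]  # toy reference topology
columns = [
    {"A": "G", "B": "G", "C": "T", "D": "T", "E": "T"},  # supports clade (A, B)
    {"A": "A", "B": "C", "C": "A", "D": "C", "E": "C"},  # conflicts with the tree
    {"A": "G", "B": "G", "C": "G", "D": "G", "E": "T"},  # singleton, not a PIBS
    {"A": "C", "B": "C", "C": "C", "D": "T", "E": "T"},  # supports clade (A, B, C)
]
print(concordance_percentage(columns, reference_clades, taxa))
```

Computing this percentage separately for sites drawn from coding sequences, introns, and intergenic regions would give the kind of per-locus-type comparison the abstract describes.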

Ethnically diverse urban transmission networks of Neisseria gonorrhoeae without evidence of HIV serosorting

Sexually Transmitted Infections ◽

10.1136/sextrans-2019-054025 ◽

2019 ◽

Vol 96 (2) ◽

pp. 106-109

Author(s):

Jayshree Dave ◽

John Paul ◽

Thomas Joshua Pasvol ◽

Andy Williams ◽

Fiona Warburton ◽

...

Keyword(s):

Neisseria Gonorrhoeae ◽

Ethnic Groups ◽

Antimicrobial Susceptibility ◽

Sequence Data ◽

Small Sample ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequencing Data ◽

Transmission Networks ◽

Hiv Serosorting

Objective: We aimed to characterise gonorrhoea transmission patterns in a diverse urban population by linking genomic, epidemiological and antimicrobial susceptibility data. Methods: Neisseria gonorrhoeae isolates from patients attending sexual health clinics at Barts Health NHS Trust, London, UK, during an 11-month period underwent whole-genome sequencing and antimicrobial susceptibility testing. We combined laboratory and patient data to investigate the transmission network structure. Results: One hundred and fifty-eight isolates from 158 patients were available with associated descriptive data. One hundred and twenty-nine (82%) patients identified as male and 25 (16%) as female; four (3%) records lacked gender information. Self-described ethnicities were: 51 (32%) English/Welsh/Scottish; 33 (21%) white, other; 23 (15%) black British/black African/black, other; 12 (8%) Caribbean; 9 (6%) South Asian; 6 (4%) mixed ethnicity; and 10 (6%) other; data were missing for 14 (9%). Self-reported sexual orientations were 82 (52%) men who have sex with men (MSM); 49 (31%) heterosexual; 2 (1%) bisexual; data were missing for 25 individuals. Twenty-two (14%) patients were HIV positive. Whole-genome sequence data were generated for 151 isolates, which linked 75 (50%) patients to at least one other case. Using sequencing data, we found no evidence of transmission networks related to specific ethnic groups (p=0.64) or of HIV serosorting (p=0.35). Of 82 MSM/bisexual patients with sequencing data, 45 (55%) belonged to clusters of ≥2 cases, compared with 16/44 (36%) heterosexuals with sequencing data (p=0.06). Conclusion: We demonstrate links between 50% of patients in transmission networks using a relatively small sample in a large cosmopolitan city. We found no evidence of HIV serosorting. Our results do not support assortative selectivity as an explanation for differences in gonorrhoea incidence between ethnic groups.

Integrating Culture-based Antibiotic Resistance Profiles with Whole-genome Sequencing Data for 11,087 Clinical Isolates

Genomics Proteomics & Bioinformatics ◽

10.1016/j.gpb.2018.11.002 ◽

2019 ◽

Vol 17 (2) ◽

pp. 169-182 ◽

Cited By ~ 2

Author(s):

Valentina Galata ◽

Cédric C. Laczny ◽

Christina Backes ◽

Georg Hemmrich-Stanisak ◽

Susanne Schmolke ◽

...

Keyword(s):

Antibiotic Resistance ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Clinical Isolates ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽

2021 ◽

Author(s):

Jiru Han ◽

Jacob E Munro ◽

Anthony Kocoski ◽

Alyssa E Barry ◽

Melanie Bahlo

Keyword(s):

Genetic Diversity ◽

Large Scale ◽

Tandem Repeats ◽

Plasmodium Species ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide ◽

Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in published whole-genome sequence data from more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) was used to study Plasmodium genetic diversity, population structures and genomic signatures of selection, and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax has been made available in an interactive web-based R Shiny application, PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).

TETyper: a bioinformatic pipeline for classifying variation and genetic contexts of transposable elements from short-read whole-genome sequencing data

10.1101/288001 ◽

2018 ◽

Author(s):

Anna E Sheppard ◽

Nicole Stoesser ◽

Ian German-Mesner ◽

Kasi Vegesana ◽

A Sarah Walker ◽

...

Keyword(s):

Antibiotic Resistance ◽

Transposable Elements ◽

Genome Sequencing ◽

Resistance Genes ◽

Whole Genome Sequencing Data ◽

Sequence Variants ◽

Whole Genome ◽

Sequencing Data ◽

Bioinformatic Pipeline ◽

Short Read

Abstract: Much of the worldwide dissemination of antibiotic resistance has been driven by resistance gene associations with mobile genetic elements (MGEs), such as plasmids and transposons. Although increasing, our understanding of resistance spread remains relatively limited, as methods for tracking mobile resistance genes through multiple species, strains and plasmids are lacking. We have developed a bioinformatic pipeline for tracking variation within, and mobility of, specific transposable elements (TEs), such as transposons carrying antibiotic resistance genes. TETyper takes short-read whole-genome sequencing data as input and identifies single-nucleotide mutations and deletions within the TE of interest, to enable tracking of specific sequence variants, as well as the surrounding genetic context(s), to enable identification of transposition events. To investigate global dissemination of Klebsiella pneumoniae carbapenemase (KPC) and its associated transposon Tn4401, we applied TETyper to a collection of >3000 publicly available Illumina datasets containing blaKPC. This revealed surprising diversity, with >200 distinct flanking genetic contexts for Tn4401, indicating high levels of transposition. Integration of sample metadata revealed insights into associations between geographic locations, host species, Tn4401 sequence variants and flanking genetic contexts. To demonstrate the ability of TETyper to cope with high copy number TEs and to track specific short-term evolutionary changes, we also applied it to the insertion sequence IS26 within a defined K. pneumoniae outbreak. TETyper is implemented in Python and is freely available at https://github.com/aesheppard/TETyper.
