MTBseq: a comprehensive pipeline for whole genome sequence analysis of Mycobacterium tuberculosis complex isolates

PeerJ | DOI: 10.7717/peerj.5895 | 2018 | Vol 6 | pp. e5895 | Cited By ~ 35

Author(s): Thomas Andreas Kohl, Christian Utpatel, Viola Schleusener, Maria Rosaria De Filippo, Patrick Beckert, ...

Keyword(s): Antibiotic Resistance, Mycobacterium Tuberculosis, Genome Sequence, Sequence Data, Whole Genome Sequence, Whole Genome Sequencing Data, Phylogenomic Analysis, Whole Genome, Sequencing Data, Desktop Computer

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance at the highest resolution, up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference-mapping-based workflow, MTBseq reports detected variant positions annotated with known associations to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable and expandable, and can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation are available from https://github.com/ngs-fzb/MTBseq_source.
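
To make the idea of grouping related isolates from an alignment of SNP positions concrete, the following is a minimal Python sketch. It is not MTBseq's code: the function names, the toy alignment, and the distance threshold are hypothetical, and single-linkage grouping on pairwise SNP distance is assumed here only as one plausible way to form such groups.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count aligned positions where both sequences have an unambiguous,
    differing base; gaps and ambiguous calls are skipped (an assumption)."""
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x in "ACGT" and y in "ACGT"
    )


def group_isolates(alignment: dict, threshold: int) -> list:
    """Single-linkage grouping: two isolates share a group if a chain of
    pairwise distances <= threshold connects them (union-find)."""
    parent = {name: name for name in alignment}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(alignment, 2):
        if snp_distance(alignment[a], alignment[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in alignment:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy alignment of concatenated SNP positions (hypothetical data).
toy_alignment = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGA",
    "iso3": "TTGTACGA",
    "iso4": "GGCATTAC",
}
toy_groups = group_isolates(toy_alignment, threshold=2)
```

With the toy data, iso1, iso2 and iso3 are chained into one group and iso4 stays alone. A threshold suited to real transmission analysis would have to come from the pipeline documentation or the epidemiological setting.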

Evaluation of parameters affecting performance and reliability of machine learning-based antibiotic susceptibility testing from whole genome sequencing data

DOI: 10.1101/607127 | 2019

Author(s): Allison L. Hicks, Nicole Wheeler, Leonor Sánchez-Busó, Jennifer L. Rakeman, Simon R. Harris, ...

Keyword(s): Machine Learning, Antibiotic Resistance, Antibiotic Susceptibility, Sequence Data, Model Performance, Outcome Data, Whole Genome Sequence, Whole Genome Sequencing Data, Whole Genome, Sequencing Data

Abstract: Prediction of antibiotic resistance phenotypes from whole genome sequencing data by machine learning methods has been proposed as a promising platform for the development of sequence-based diagnostics. However, there has been no systematic evaluation of factors that may influence performance of such models, how they might apply to and vary across clinical populations, and what the implications might be in the clinical setting. Here, we performed a meta-analysis of seven large Neisseria gonorrhoeae datasets, as well as Klebsiella pneumoniae and Acinetobacter baumannii datasets, with whole genome sequence data and antibiotic susceptibility phenotypes using set covering machine classification, random forest classification, and random forest regression models to predict resistance phenotypes from genotype. We demonstrate how model performance varies by drug, dataset, resistance metric, and species, reflecting the complexities of generating clinically relevant conclusions from machine learning-derived models. Our findings underscore the importance of incorporating relevant biological and epidemiological knowledge into model design and assessment and suggest that doing so can inform tailored modeling for individual drugs, pathogens, and clinical populations. We further suggest that continued comprehensive sampling and incorporation of up-to-date whole genome sequence data, resistance phenotypes, and treatment outcome data into model training will be crucial to the clinical utility and sustainability of machine learning-based molecular diagnostics.

Author Summary: Machine learning-based prediction of antibiotic resistance from bacterial genome sequences represents a promising tool to rapidly determine the antibiotic susceptibility profile of clinical isolates and reduce the morbidity and mortality resulting from inappropriate and ineffective treatment. However, while there has been much focus on demonstrating the diagnostic potential of these modeling approaches, there has been little assessment of potential caveats and prerequisites associated with implementing predictive models of drug resistance in the clinical setting. Our results highlight significant biological and technical challenges facing the application of machine learning-based prediction of antibiotic resistance as a diagnostic tool. By outlining specific factors affecting model performance, our findings provide a framework for future work on modeling drug resistance and underscore the necessity of continued comprehensive sampling and reporting of treatment outcome data for building reliable and sustainable diagnostics.

Whole genome sequence data of Mycobacterium tuberculosis XDR strain, isolated from patient in Kazakhstan

Data in Brief | DOI: 10.1016/j.dib.2020.106416 | 2020 | Vol 33 | pp. 106416

Author(s): Asset Daniyarov, Askhat Molkenov, Saule Rakhimova, Ainur Akhmetova, Zhannur Nurkina, ...

Keyword(s): Mycobacterium Tuberculosis, Genome Sequence, Sequence Data, Whole Genome Sequence, Whole Genome, Genome Sequence Data

Integrating standardized whole genome sequence analysis with a global Mycobacterium tuberculosis antibiotic resistance knowledgebase

Scientific Reports | DOI: 10.1038/s41598-018-33731-1 | 2018 | Vol 8 (1) | Cited By ~ 26

Author(s): Matthew Ezewudo, Amanda Borens, Álvaro Chiner-Oms, Paolo Miotto, Leonid Chindelevitch, ...

Keyword(s): Antibiotic Resistance, Mycobacterium Tuberculosis, Sequence Analysis, Genome Sequence, Whole Genome Sequence, Whole Genome, Genome Sequence Analysis

KoVariome: Korean National Standard Reference Variome database of whole genomes with comprehensive SNV, indel, CNV, and SV analyses

DOI: 10.1101/187096 | 2017

Author(s): Jungeun Kim, Jessica A. Weber, Sungwoong Jho, Jinho Jang, JeHoon Jun, ...

Keyword(s): Sequence Data, Copy Number Variations, Genetic Variations, Korean Population, National Standard, Whole Genome Sequence, Whole Genome Sequencing Data, Personal Genome, Whole Genome, Sequencing Data

Abstract: High-coverage whole-genome sequencing data of a single ethnicity can provide a useful catalogue of population-specific genetic variations. Herein, we report a comprehensive analysis of the Korean population, and present the Korean National Standard Reference Variome (KoVariome). As a part of the Korean Personal Genome Project (KPGP), we constructed the KoVariome database using 5.5 terabases of whole genome sequence data from 50 healthy Korean individuals with an average coverage depth of 31×. In total, KoVariome includes 12.7M single-nucleotide variants (SNVs), 1.7M short insertions and deletions (indels), 4K structural variations (SVs), and 3.6K copy number variations (CNVs). Among them, 2.4M (19%) SNVs and 0.4M (24%) indels were identified as novel. We also discovered selective enrichment of 3.8M SNVs and 0.5M indels in Korean individuals, which were used to filter out 1,271 coding-SNVs not originally removed from the 1,000 Genomes Project data when prioritizing disease-causing variants. CNV analyses revealed gene losses related to bone mineral densities and duplicated genes involved in brain development and fat reduction. Finally, KoVariome health records were used to identify novel disease-causing variants in the Korean population, demonstrating the value of high-quality ethnic variation databases for the accurate interpretation of individual genomes and the precise characterization of genetic variations.

The whole genome sequence data analyses of a Mycobacterium tuberculosis strain SBH321 isolated in Sabah, Malaysia, belongs to Ural family of Lineage 4

Data in Brief | DOI: 10.1016/j.dib.2020.106388 | 2020 | Vol 33 | pp. 106388

Author(s): Jaeyres Jani, Zainal Arifin Mustapha, Chin Kai Ling, Amabel Seow Ming Hui, Roddy Teo, ...

Keyword(s): Mycobacterium Tuberculosis, Genome Sequence, Sequence Data, Whole Genome Sequence, Whole Genome, Tuberculosis Strain, Genome Sequence Data, Data Analyses, Mycobacterium Tuberculosis Strain

Accurate Phasing of Pedigree Genotypes Using Whole Genome Sequence Data

DOI: 10.1101/148510 | 2017

Author(s): A.N. Blackburn, M.Z. Kos, N.B. Blackburn, J.M. Peralta, P. Stevens, ...

Keyword(s): Error Rate, Sequence Data, Software Implementation, Whole Genome Sequence, Whole Genome Sequencing Data, Genotype Data, Whole Genome, Genotyping Error, Sequencing Data, Missing Genotypes

Abstract: Phasing, the process of predicting haplotypes from genotype data, is an important undertaking in genetics and an ongoing area of research. Phasing methods, and associated software, designed specifically for pedigrees are urgently needed. Here we present a new method for phasing genotypes from whole genome sequencing data in pedigrees: PULSAR (Phasing Using Lineage Specific Alleles / Rare variants). The method is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles, are highly informative for identifying haplotypes that are identical-by-descent between individuals within a pedigree. Through extensive simulation we assess the performance of PULSAR in a variety of pedigree sizes and structures, and we explore the effects of genotyping errors and the presence of non-sequenced individuals on its performance. If the genotyping error rate is sufficiently low, PULSAR can phase > 99.9% of heterozygous genotypes with a switch error rate below 1 × 10⁻⁴ in pedigrees where all individuals are sequenced. We demonstrate that the method is highly accurate and consistently outperforms the long-range phasing approach used for comparison in our benchmarking. The method also holds promise for fixing genotype errors or imputing missing genotypes. The software implementation of this method is freely available.
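
The abstract reports a switch error rate without defining it. The Python sketch below assumes the commonly used definition: the fraction of adjacent heterozygous-site pairs whose relative phase disagrees with the truth. The function name and the toy haplotypes are hypothetical, and this is not taken from the PULSAR software.

```python
def switch_error_rate(truth, inferred):
    """Switch error rate over heterozygous sites.

    Both arguments list the allele (0 or 1) assigned to the first haplotype
    at each heterozygous site, in genomic order. A switch is counted whenever
    agreement with the truth flips between two consecutive sites.
    """
    if len(truth) != len(inferred):
        raise ValueError("haplotypes must cover the same sites")
    if len(truth) < 2:
        return 0.0  # no adjacent pair, so no switch is possible

    # True where the inferred first haplotype carries the true allele.
    matches = [t == i for t, i in zip(truth, inferred)]
    switches = sum(1 for prev, cur in zip(matches, matches[1:]) if prev != cur)
    return switches / (len(truth) - 1)


# Toy example (hypothetical data): two phase switches across five adjacent pairs.
toy_truth = [0, 1, 1, 0, 1, 0]
toy_inferred = [0, 1, 0, 1, 0, 0]
toy_rate = switch_error_rate(toy_truth, toy_inferred)
```

Under this definition, an inferred haplotype that is the exact complement of the truth has a rate of zero, because swapping the two haplotype labels is not a phasing error.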

Identification of bacterial antibiotic resistance genes in next-generation sequencing data (review of literature)

Russian Clinical Laboratory Diagnostics | DOI: 10.51620/0869-2084-2021-66-11-684-688 | 2021 | Vol 66 (11) | pp. 684-688

Author(s): A. V. Chaplin, M. Korzhanova, D. O. Korostin

Keyword(s): Antibiotic Resistance, Genome Sequence, Protein Sequence, Susceptibility Testing, Sequence Data, Genetic Resistance, Whole Genome Sequence, Whole Genome, Genome Sequence Data, Resistance Determinants

The spread of antibiotic-resistant human bacterial pathogens is a serious threat to modern medicine. Antibiotic susceptibility testing is essential for optimizing treatment regimens and preventing the dissemination of antibiotic resistance. Therefore, the development of antibiotic susceptibility testing methods is a priority challenge for laboratory medicine. The aim of this review is to analyze the capabilities of bioinformatics tools for processing bacterial whole genome sequence data. The PubMed database, the Russian scientific electronic library eLIBRARY, and the information networks of the World Health Organization and the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) were used during the analysis. The review describes whole genome sequencing platforms suitable for detecting bacterial genetic resistance determinants. The classic step in searching for genetic resistance determinants is an alignment between the query nucleotide/protein sequence and the subject (database) nucleotide/protein sequence, performed against nucleotide and protein sequence databases. The most commonly used databases are ResFinder, CARD, and the Bacterial Antimicrobial Resistance Reference Gene Database. Searches for resistance determinants in genome assemblies give more accurate results than searches in contigs. Newer bioinformatics approaches to finding resistance genes, such as neural networks and machine learning, are also discussed. After critical appraisal of the current antibiotic resistance databases, we designed a protocol for predicting antibiotic resistance using whole genome sequence data. The designed protocol can serve as the basis of an algorithm for qualitative and quantitative antimicrobial susceptibility testing based on whole genome sequence data.

Author Correction: Integrating standardized whole genome sequence analysis with a global Mycobacterium tuberculosis antibiotic resistance knowledgebase

Scientific Reports | DOI: 10.1038/s41598-020-58955-y | 2020 | Vol 10 (1)

Author(s): Matthew Ezewudo, Amanda Borens, Álvaro Chiner-Oms, Paolo Miotto, Leonid Chindelevitch, ...

Keyword(s): Antibiotic Resistance, Mycobacterium Tuberculosis, Sequence Analysis, Genome Sequence, Whole Genome Sequence, Whole Genome, Genome Sequence Analysis

Identifying mixed Mycobacterium tuberculosis infections from whole genome sequence data

BMC Genomics | DOI: 10.1186/s12864-018-4988-z | 2018 | Vol 19 (1) | Cited By ~ 28

Author(s): Benjamin Sobkowiak, Judith R. Glynn, Rein M. G. J. Houben, Kim Mallard, Jody E. Phelan, ...

Keyword(s): Mycobacterium Tuberculosis, Genome Sequence, Sequence Data, Whole Genome Sequence, Whole Genome, Genome Sequence Data

Genome-scale profiling reveals noncoding loci carry higher proportions of concordant data

Molecular Biology and Evolution | DOI: 10.1093/molbev/msab026 | 2021

Author(s): Robert Literman, Rachel Schwartz

Keyword(s): Sequence Data, Phylogenetic Signal, Whole Genome Sequence, Whole Genome Sequencing Data, Whole Genome, Sequencing Data, Coding Sequences, Evolutionary Forces, Tree Inference, Intergenic Regions

Abstract: Many evolutionary relationships remain controversial despite whole-genome sequencing data. These controversies arise in part due to challenges associated with accurately modeling the complex phylogenetic signal coming from genomic regions experiencing distinct evolutionary forces. Here we examine how different regions of the genome support or contradict well-established hypotheses among three mammal groups using millions of orthologous parsimony-informative biallelic sites [PIBS] distributed across primate, rodent, and Pecora genomes. We compared PIBS concordance percentages among locus types (e.g. coding sequences, introns, intergenic regions), and contrasted PIBS utility over evolutionary timescales. Sites derived from noncoding sequences provided more data and proportionally more concordant sites compared with those from coding sequences [CDS] in all clades. CDS PIBS were also predominant drivers of tree incongruence in two cases of topological conflict. PIBS derived from most locus types provided surprisingly consistent support for splitting events spread across the timescales we examined, although we find evidence that CDS and intronic PIBS may, respectively and to a limited degree, inform disproportionately about older and younger splits. In this era of accessible whole genome sequence data, these results (1) suggest benefits to more intentionally focusing on noncoding loci as robust data for tree inference, and (2) reinforce the importance of accurate modeling, especially when using CDS data.
