M. tuberculosis microvariation is common and is associated with transmission: analysis of three years prospective universal sequencing in England

Mapping Intimacies ◽

10.1101/681502 ◽

2019 ◽

Cited By ~ 3

Author(s):

David Wyllie ◽

Trien Do ◽

Richard Myers ◽

Vlad Nikolayevskyy ◽

Derrick Crook ◽

...

Keyword(s):

Public Health ◽

Mixed Infection ◽

English Language ◽

Sequence Data ◽

Added Value ◽

Next Generation Sequencing Data ◽

Nucleotide Position ◽

Sequencing Data ◽

Increased Risk ◽

Low Incidence

Abstract

Background: The prevalence, association with disease status, and public health impact of infection with mixtures of M. tuberculosis strains is unclear, in part due to limitations of existing methods for detecting mixed infections.

Methods: We developed an algorithm to identify mixtures of M. tuberculosis strains using next generation sequencing data, assessing performance using simulated sequences. We identified mixed M. tuberculosis strains when there was at least one mixed nucleotide position, and where both the mixture’s components were present in similar isolates from other individuals. We determined risk factors for mixed infection among isolations of M. tuberculosis in England using logistic regression. We used survival analyses to assess the association between mixed infection and putative transmission.

Findings: 6,560 isolations of TB were successfully sequenced in England 2016-2018. Of 3,691 (56%) specimens for which similar sequences had been isolated from at least two other individuals, 341 (9.2%) were mixed. Infection with lineages other than Lineage 4 was associated with mixed infection. Among the 1,823 individuals with pulmonary infection with Lineage 4 M. tuberculosis, mixed infection was associated with significantly increased risk of subsequent isolation of closely related organisms from a different individual (HR 1.43, 95% CI 1.05, 1.94), indicative of transmission.

Interpretation: Mixtures of transmissible strains occur in at least 5% of tuberculosis infections in England; when present in pulmonary disease, such mixtures are associated with an increased risk of tuberculosis transmission.

Funding: Public Health England; NIHR Health Protection Research Unit Oxford; European Union.

Research in Context

Evidence Before This Study: We searched Pubmed using the search terms ‘tuberculosis’ and ‘mixed’ or ‘mixture’ for English Language articles published up to 1 April 2019. Studies, most performed without the benefit of genomic sequencing, report mixed TB infection from a range of medium and high prevalence areas and show it to be associated with delayed treatment response. Modelling suggests detection and treatment of mixed TB infection is an important goal for TB eradication campaigns. Although routine DNA sequencing of M. tuberculosis isolates is becoming widespread, efficient methods for detecting mixed infection from such data are underdeveloped, and the true prevalence of mixed infection and its association with transmission is unclear.

Added Value of This Study: This study investigated a large series of TB isolations obtained as part of a routine Mycobacterial sequencing program by two reference laboratories, in a low incidence area, England. We developed an efficient generalisable approach to identify transmitted mixed M. tuberculosis infection; our approach is capable of sensitive and specific detection of a single mixed nucleotide position. We identified mixed infection of similar strains (‘microvariation’) in about 9.2% of the M. tuberculosis samples which we were able to assess, and found evidence of increased transmission from individuals with mixed infection.

Implications of All the Available Evidence: TB microvariation is a risk factor for TB transmission, even in the low incidence area studied. Although an efficient and highly specific technique identifying microvariation exists, it relies on comparison with similar sequences isolated from other patients. Sharing of sequence data from the many TB sequencing programs being deployed globally will increase the sensitivity of microvariation detection, and may assist targeted public health interventions.
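
As a rough illustration of what a "mixed nucleotide position" could look like in read-count data, the sketch below flags positions where a second base is supported by a meaningful share of reads. The function name, the input layout, and both thresholds (`min_depth`, `min_minor_fraction`) are assumptions chosen for illustration. This is not the authors' algorithm, which additionally requires that both components of the mixture are present in similar isolates from other individuals.

```python
from typing import Dict, List


def mixed_positions(
    base_counts: Dict[int, Dict[str, int]],
    min_depth: int = 20,               # hypothetical minimum read depth
    min_minor_fraction: float = 0.1,   # hypothetical minor-base threshold
) -> List[int]:
    """Return positions whose second most common base reaches a threshold.

    base_counts maps a genome position to per-base read counts,
    for example {1042: {"A": 55, "G": 45}}.
    """
    flagged = []
    for pos, counts in sorted(base_counts.items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything about a mixture
        ordered = sorted(counts.values(), reverse=True)
        minor = ordered[1] if len(ordered) > 1 else 0
        if minor / depth >= min_minor_fraction:
            flagged.append(pos)
    return flagged


example = {
    100: {"A": 98, "C": 2},   # minor fraction 0.02: not flagged
    200: {"G": 60, "T": 40},  # minor fraction 0.40: flagged
    300: {"C": 5, "T": 4},    # depth 9: skipped
}
print(mixed_positions(example))
```

A fixed frequency cut-off like this one cannot separate a true mixture from sequencing or mapping error, which is one reason the abstract's second criterion (both components seen in other individuals) matters.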


ALSgeneScanner: a pipeline for the analysis and interpretation of DNA NGS data of ALS patients

10.1101/378158 ◽

2018 ◽

Author(s):

Alfredo Iacoangeli ◽

Ahmad Al Khleifat ◽

William Sproviero ◽

Aleksey Shatunov ◽

Ashley R Jones ◽

...

Keyword(s):

Motor Neurons ◽

Health Care Professionals ◽

Sequence Data ◽

Whole Genome Sequence ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Whole Exome ◽

Exome Sequence Data ◽

Als Patients ◽

Ngs Data

Abstract: Amyotrophic lateral sclerosis (ALS, MND) is a neurodegenerative disease of upper and lower motor neurons resulting in death from neuromuscular respiratory failure, typically within two years of first symptoms. Genetic factors are an important cause of ALS, with variants in more than 25 genes having strong evidence, and weaker evidence available for variants in more than 120 genes. With the increasing availability of Next-Generation sequencing data, non-specialists, including health care professionals and patients, are obtaining their genomic information without a corresponding ability to analyse and interpret it. Furthermore, the relevance of novel or existing variants in ALS genes is not always apparent. Here we present ALSgeneScanner, a tool that is easy to install and use, able to provide an automatic, detailed, annotated report, on a list of ALS genes from whole genome sequence data in a few hours and whole exome sequence data in about one hour on a readily available mid-range computer. This will be of value to non-specialists and aid in the interpretation of the relevance of novel and existing variants identified in DNA sequencing data.


Multi-gene incongruence consistent with hybridisation in Cladocopium (Symbiodiniaceae), an ecologically important genus of coral reef symbionts

10.7287/peerj.preprints.27614 ◽

2019 ◽

Author(s):

Joshua I Brian ◽

Simon K Davy ◽

Shaun P Wilkinson

Keyword(s):

Sequence Data ◽

Incomplete Lineage Sorting ◽

Reticulate Evolution ◽

Next Generation Sequencing Data ◽

Putative Hybrid ◽

Its2 Region ◽

Sequencing Data ◽

Lineage Sorting ◽

Evolutionary Potential ◽

Unbiased Test

Coral reefs rely on their intracellular dinoflagellate symbionts (family Symbiodiniaceae) for nutritional provision in nutrient-poor waters, yet this association is threatened by thermally stressful conditions. Despite this, the evolutionary potential of these symbionts remains poorly characterised. In this study, we tested the potential for divergent Symbiodiniaceae types to sexually reproduce (i.e. hybridise) within Cladocopium, the most ecologically prevalent genus in this family. With sequence data from three organelles (cob gene, mitochondria; psbAncr region, chloroplast; and ITS2 region, nucleus), we utilised the Incongruence Length Difference test, Approximately Unbiased test, tree hybridisation analyses and visual inspection of raw data in stepwise fashion to highlight incongruences between organelles, and thus provide evidence of reticulate evolution. Using this approach, we identified three putative hybrid Cladocopium samples among the 158 analysed, at two of the seven sites sampled. These samples were identified as the common Cladocopium types C40 or C1 with respect to the mitochondria and chloroplasts, but the rarer types C3z, C3u and C1# with respect to their nuclear identity. These five Cladocopium types have previously been confirmed as evolutionarily distinct and were also recovered in non-incongruent samples multiple times, which is strongly suggestive that they sexually reproduced to produce the incongruent samples. A concomitant inspection of Next Generation Sequencing data for these samples suggests that other plausible explanations, such as incomplete lineage sorting, are much less likely. The approach taken in this study allows incongruences between gene regions to be identified with confidence, and brings new light to the evolutionary potential within Symbiodiniaceae.


Multi-gene incongruence consistent with hybridisation in Cladocopium (Symbiodiniaceae), an ecologically important genus of coral reef symbionts

PeerJ ◽

10.7717/peerj.7178 ◽

2019 ◽

Vol 7 ◽

pp. e7178 ◽

Cited By ~ 1

Author(s):

Joshua I. Brian ◽

Simon K. Davy ◽

Shaun P. Wilkinson

Keyword(s):

Sequence Data ◽

Incomplete Lineage Sorting ◽

Reticulate Evolution ◽

Next Generation Sequencing Data ◽

Putative Hybrid ◽

Its2 Region ◽

Sequencing Data ◽

Lineage Sorting ◽

Evolutionary Potential ◽

Unbiased Test

Coral reefs rely on their intracellular dinoflagellate symbionts (family Symbiodiniaceae) for nutritional provision in nutrient-poor waters, yet this association is threatened by thermally stressful conditions. Despite this, the evolutionary potential of these symbionts remains poorly characterised. In this study, we tested the potential for divergent Symbiodiniaceae types to sexually reproduce (i.e. hybridise) within Cladocopium, the most ecologically prevalent genus in this family. With sequence data from three organelles (cob gene, mitochondrion; psbAncr region, chloroplast; and ITS2 region, nucleus), we utilised the Incongruence Length Difference test, Approximately Unbiased test, tree hybridisation analyses and visual inspection of raw data in stepwise fashion to highlight incongruences between organelles, and thus provide evidence of reticulate evolution. Using this approach, we identified three putative hybrid Cladocopium samples among the 158 analysed, at two of the seven sites sampled. These samples were identified as the common Cladocopium types C40 or C1 with respect to the mitochondria and chloroplasts, but the rarer types C3z, C3u and C1# with respect to their nuclear identity. These five Cladocopium types have previously been confirmed as evolutionarily distinct and were also recovered in non-incongruent samples multiple times, which is strongly suggestive that they sexually reproduced to produce the incongruent samples. A concomitant inspection of next generation sequencing data for these samples suggests that other plausible explanations, such as incomplete lineage sorting or the presence of co-dominance, are much less likely. The approach taken in this study allows incongruences between gene regions to be identified with confidence, and brings new light to the evolutionary potential within Symbiodiniaceae.


Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples

10.1101/560839 ◽

2019 ◽

Cited By ~ 5

Author(s):

Sergey Aganezov ◽

Benjamin J. Raphael

Keyword(s):

Sequence Data ◽

Evolutionary Model ◽

Response To Treatment ◽

Nucleotide Position ◽

Structural Variants ◽

Sequencing Data ◽

Somatic Evolution ◽

A Genome ◽

Cancer Genomes ◽

Specific Cancer

Abstract: Many cancer genomes are extensively rearranged with highly aberrant chromosomal karyotypes. These genome rearrangements, or structural variants, can be detected in tumor DNA sequencing data by abnormal mapping of sequence reads to the reference genome. However, nearly all cancer sequencing to date is of bulk tumor samples which consist of a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic structural variants. We introduce a novel algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes, or clones, that best explain the read alignments from a bulk tumor sample. RCK leverages specific evolutionary constraints on the somatic mutation process in cancer to reduce ambiguity in the deconvolution of admixed DNA sequence data into multiple haplotype-specific cancer karyotypes. In particular, RCK relies on generalizations of the infinite sites assumption that a genome rearrangement is highly unlikely to occur at the same nucleotide position more than once during somatic evolution. RCK’s comprehensive model allows us to incorporate information from both short- and long-read sequencing technologies and is applicable to bulk tumor samples containing a mixture of an arbitrary number of derived genomes. We compared RCK to the state-of-the-art method ReMixT on a dataset of 17 primary and metastatic prostate cancer samples. We demonstrate that ReMixT’s limited support for heterogeneity and lack of evolutionary constraints lead to reconstruction of implausible karyotypes. In contrast, RCK infers cancer karyotypes that better explain read alignments from bulk tumor samples and are consistent with a reasonable evolutionary model. RCK’s reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment.
RCK is available at https://github.com/raphael-group/RCK.
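
The infinite sites assumption mentioned in the abstract can be made concrete with a small check: in its simplest form, no breakpoint position should take part in more than one novel adjacency. The sketch below tests only that simplest form. The `Breakend` and `Adjacency` types and both function names are hypothetical and are not taken from the RCK code base, and the abstract states that RCK relies on generalizations of this assumption rather than on this literal check.

```python
from collections import Counter
from typing import List, Tuple

Breakend = Tuple[str, int, str]        # (chromosome, position, strand), hypothetical layout
Adjacency = Tuple[Breakend, Breakend]  # a novel join between two breakends


def reused_breakends(adjacencies: List[Adjacency]) -> List[Breakend]:
    """List breakends that take part in more than one novel adjacency."""
    # set(adj) counts each breakend once per adjacency, so a join of a
    # breakend to itself is not reported as reuse.
    usage = Counter(end for adj in adjacencies for end in set(adj))
    return sorted(end for end, n in usage.items() if n > 1)


def satisfies_infinite_sites(adjacencies: List[Adjacency]) -> bool:
    """True when every breakend is used by at most one adjacency."""
    return not reused_breakends(adjacencies)


adjacencies = [
    (("chr1", 100, "+"), ("chr2", 500, "-")),
    (("chr1", 100, "+"), ("chr3", 50, "+")),  # chr1:100 is used a second time
    (("chr4", 10, "-"), ("chr5", 20, "+")),
]
print(reused_breakends(adjacencies))
print(satisfies_infinite_sites(adjacencies))
```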


Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data

10.1101/2021.04.19.440427 ◽

2021 ◽

Author(s):

Jean-Pierre Kocher ◽

Zachary Stephens ◽

Daniel O'Brien ◽

Mrunal Dehankar ◽

Lewis Roberts ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Long Read ◽

Breakpoint Detection ◽

Targeted Capture ◽

Genome Heterogeneity ◽

Generation Sequencing

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene's read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with those found in long read validation sets. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are validated by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq or targeted capture.


Pyfastx: a robust Python package for fast random access to sequences from plain and gzipped FASTA/Q files

Briefings in Bioinformatics ◽

10.1093/bib/bbaa368 ◽

2020 ◽

Author(s):

Lianming Du ◽

Qin Liu ◽

Zhenxin Fan ◽

Jie Tang ◽

Xiuyue Zhang ◽

...

Keyword(s):

Sequence Data ◽

Random Access ◽

Biological Data ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Limited Memory ◽

Data Formats ◽

Low Efficiency ◽

Python Package ◽

Generation Sequencing

Abstract: FASTA and FASTQ are the most widely used biological data formats and have become the de facto standard for exchanging sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data being deposited and accessed in FASTA/Q formats is increasing dramatically. However, the existing tools have very low efficiency at random retrieval of subsequences due to the requirement of loading the entire index into memory. In addition, most existing tools are unable to build an index for large FASTA/Q files because of limited memory. Furthermore, the tools do not support random access to sequences in FASTA/Q files compressed by gzip, which is extensively adopted by most public databases to save storage. In this study, we developed pyfastx as a versatile Python package with commonly used command-line tools to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in terms of index building and random access to sequences, particularly when dealing with large FASTA/Q files with hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip compressed FASTA/Q files without needing to uncompress beforehand. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
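
To make the idea of index-based random access concrete, here is a minimal standard-library sketch that records byte offsets for each record in a plain-text FASTA file and then seeks straight to a requested subsequence. It is not pyfastx and does not use the pyfastx API; the names `build_index` and `fetch`, the in-memory dictionary index, and the assumption that every sequence line of a record (except the last) has the same width are all illustrative choices. It does not attempt the gzip case that the abstract highlights.

```python
import os
import tempfile
from typing import Dict, Tuple

# name -> (sequence length, byte offset of first base, bases per line, bytes per line)
Index = Dict[str, Tuple[int, int, int, int]]


def build_index(path: str) -> Index:
    """Scan a plain FASTA file once and record where each sequence starts."""
    index: Index = {}
    name = None
    length = offset = line_bases = line_bytes = 0
    pos = 0  # byte offset of the current line
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = (length, offset, line_bases, line_bytes)
                name = raw[1:].split()[0].decode()
                length, offset = 0, pos + len(raw)
                line_bases = line_bytes = 0
            else:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_bytes = bases, len(raw)
                length += bases
            pos += len(raw)
    if name is not None:
        index[name] = (length, offset, line_bases, line_bytes)
    return index


def fetch(path: str, index: Index, name: str, start: int, end: int) -> str:
    """Return bases in the 0-based half-open range [start, end) by seeking."""
    length, offset, line_bases, line_bytes = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    first = offset + (start // line_bases) * line_bytes + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_bytes + (end - 1) % line_bases
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Small demonstration on a temporary two-record file.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False, newline="\n") as tmp:
    tmp.write(">seq1 demo\nACGTACGT\nTTGG\n>seq2\nGGCC\n")
    demo_path = tmp.name
demo_index = build_index(demo_path)
print(fetch(demo_path, demo_index, "seq1", 6, 10))
os.remove(demo_path)
```

Only the bytes covering the requested range are read, so the cost of a lookup depends on the length of the subsequence rather than on the size of the file.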


Exogene: A performant workflow for detecting viral integrations from paired-end next-generation sequencing data

PLoS ONE ◽

10.1371/journal.pone.0250915 ◽

2021 ◽

Vol 16 (9) ◽

pp. e0250915

Author(s):

Zachary Stephens ◽

Daniel O’Brien ◽

Mrunal Dehankar ◽

Lewis R. Roberts ◽

Ravishankar K. Iyer ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Long Read ◽

Breakpoint Detection ◽

Targeted Capture ◽

Genome Heterogeneity ◽

Generation Sequencing

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene’s read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with long read validation. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are also supported by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq and targeted capture.


Using a fast clustering method for viral segment lineage determination, applied to the H9 influenza hemagglutinin.

10.7287/peerj.preprints.3166 ◽

2017 ◽

Author(s):

Andrew Dalby ◽

Lorna Tinworth ◽

Joshua Sealy ◽

Munir Iqbal

Keyword(s):

Phylogenetic Trees ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Viral Sequence ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

Influenza Hemagglutinin ◽

Tree Construction ◽

Alternative Approach ◽

Generation Sequencing

Lineage determination is an important part of the analysis of viral sequence data. Previously this has depended on phylogenetic analysis in order to identify distinct clades within the phylogenetic trees. This method is time consuming and dependent on a set of empirical rules for clade identification. An alternative approach is to use clustering. Clustering is commonly used to identify operational taxonomic units in next generation sequencing data. In this paper we use clustering in order to rapidly identify viral segment lineages and clades without the need for tree construction.
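
The contrast between tree building and clustering can be sketched with a simple greedy, centroid-based scheme: each sequence joins the first cluster whose representative it matches above an identity threshold, otherwise it founds a new cluster. The abstract does not say which clustering tool or threshold the authors used, so the algorithm, the function names, and the 0.9 threshold below are assumptions for illustration only, and the sketch expects sequences that are already aligned to equal length.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float = 0.9) -> List[List[int]]:
    """Group sequence indices by similarity to the first member of each cluster."""
    centroids: List[int] = []       # index of the representative of each cluster
    clusters: List[List[int]] = []  # member indices, parallel to centroids
    for i, seq in enumerate(seqs):
        for c, centroid in enumerate(centroids):
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            centroids.append(i)
            clusters.append([i])
    return clusters


sequences = [
    "AAAAAAAAAA",
    "AAAAAAAAAT",
    "CCCCCCCCCC",
    "CCCCCCCCCA",
    "AAAAAAAATT",
]
print(greedy_cluster(sequences, threshold=0.9))
```

Greedy clustering of this kind is order dependent: the result can change if the input sequences are shuffled, which is worth keeping in mind when comparing clusters with clades from a phylogenetic tree.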


Multi-gene incongruence consistent with hybridisation in Cladocopium (Symbiodiniaceae), an ecologically important genus of coral reef symbionts

10.7287/peerj.preprints.27614v1 ◽

2019 ◽

Author(s):

Joshua I Brian ◽

Simon K Davy ◽

Shaun P Wilkinson

Keyword(s):

Sequence Data ◽

Incomplete Lineage Sorting ◽

Reticulate Evolution ◽

Next Generation Sequencing Data ◽

Putative Hybrid ◽

Its2 Region ◽

Sequencing Data ◽

Lineage Sorting ◽

Evolutionary Potential ◽

Unbiased Test


Using a fast clustering method for viral segment lineage determination, applied to the H9 influenza hemagglutinin.

10.7287/peerj.preprints.3166v1 ◽

2017 ◽

Author(s):

Andrew Dalby ◽

Lorna Tinworth ◽

Joshua Sealy ◽

Munir Iqbal

Keyword(s):

Phylogenetic Trees ◽

Sequence Data ◽

Next Generation Sequencing Data ◽

Viral Sequence ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

Influenza Hemagglutinin ◽

Tree Construction ◽

Alternative Approach ◽

Generation Sequencing
