Identifying tagging SNPs for African specific genetic variation from the African Diaspora Genome

Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications

Wellcome Open Research ◽

10.12688/wellcomeopenres.14826.1 ◽

2018 ◽

Vol 3 ◽

pp. 124 ◽

Cited By ~ 286

Author(s):

Keith A. Jolley ◽

James E. Bray ◽

Martin C. J. Maiden

Keyword(s):

Genetic Variation ◽

Open Access ◽

Genome Sequence ◽

Population Genomics ◽

Sequence Data ◽

Single Gene ◽

Cross Reactivity ◽

Third Party ◽

Whole Genome Sequence ◽

Wide Range

The PubMLST.org website hosts a collection of open-access, curated databases that integrate population sequence data with provenance and phenotype information for over 100 different microbial species and genera. Although the PubMLST website was conceived as part of the development of the first multi-locus sequence typing (MLST) scheme in 1998, the software it uses, the Bacterial Isolate Genome Sequence database (BIGSdb, published in 2010), enables PubMLST to include all levels of sequence data, from single gene sequences up to and including complete, finished genomes. Here we describe developments in the BIGSdb software made from publication to June 2018 and show how the platform realises microbial population genomics for a wide range of applications. The system is based on the gene-by-gene analysis of microbial genomes, with each deposited sequence annotated and curated to identify the genes present and systematically catalogue their variation. Originally intended as a means of characterising isolates with typing schemes, the synthesis of sequences and records of genetic variation with provenance and phenotype data permits highly scalable (whole genome sequence data for tens of thousands of isolates) means of addressing a wide range of functional questions, including: the prediction of antimicrobial resistance; likely cross-reactivity with vaccine antigens; and the functional activities of different variants that lead to key phenotypes. There are no limitations to the number of sequences, genetic loci, allelic variants or schemes (combinations of loci) that can be included, enabling each database to represent an expanding catalogue of the genetic variation of the population in question. In addition to providing web-accessible analyses and links to third-party analysis and visualisation tools, the BIGSdb software includes a RESTful application programming interface (API) that enables access to all the underlying data for third-party applications and data analysis pipelines.

Whole-genome sequence data suggests environmental adaptation of Ethiopian sheep populations

Genome Biology and Evolution ◽

10.1093/gbe/evab014 ◽

2021 ◽

Author(s):

Pamela Wiener ◽

Christelle Robert ◽

Abulgasim Ahbara ◽

Mazdak Salavati ◽

Ayele Abebe ◽

...

Keyword(s):

High Altitude ◽

Environmental Variables ◽

Large Scale ◽

Sequence Data ◽

Strong Association ◽

Environmental Adaptation ◽

Whole Genome Sequence ◽

Single Nucleotide Variants ◽

High Altitude Adaptation ◽

Altitude Adaptation

Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation, although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.

The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

10.7287/peerj.preprints.220v1 ◽

2014 ◽

Author(s):

Jason W Sahl ◽

Greg Caporaso ◽

David A Rasko ◽

Paul S Keim

Keyword(s):

Large Scale ◽

Sequence Data ◽

Parallel Implementation ◽

Genetic Relationships ◽

Clinical Diagnostics ◽

Whole Genome Sequence ◽

Bacterial Isolates ◽

Bacterial Genomes ◽

E Coli ◽

Blast Score

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade-specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60 h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
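
As a rough illustration of the matrix described in this abstract, the sketch below computes a BLAST score ratio for each CDS in each genome, taking the ratio to be the bit score of the CDS against a genome divided by the bit score of the CDS aligned against itself. It is not the LS-BSR code: the function names, the input dictionaries of precomputed bit scores, and the presence/absence thresholds are all hypothetical and chosen only for the example.

```python
from typing import Dict, Iterable, List

BsrMatrix = Dict[str, Dict[str, float]]


def blast_score_ratio(query_bits: float, self_bits: float) -> float:
    """Return the query bit score divided by the CDS self-alignment bit score."""
    if self_bits <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return query_bits / self_bits


def bsr_matrix(self_scores: Dict[str, float],
               genome_scores: Dict[str, Dict[str, float]]) -> BsrMatrix:
    """Build a CDS x genome matrix of BSR values.

    self_scores maps each CDS to its self-alignment bit score.
    genome_scores maps each genome to the best bit score per CDS;
    a CDS with no hit in a genome is treated as a score of 0.
    """
    matrix: BsrMatrix = {}
    for cds, self_bits in self_scores.items():
        matrix[cds] = {
            genome: blast_score_ratio(hits.get(cds, 0.0), self_bits)
            for genome, hits in genome_scores.items()
        }
    return matrix


def clade_markers(matrix: BsrMatrix, in_group: Iterable[str],
                  present: float = 0.8, absent: float = 0.4) -> List[str]:
    """List CDSs conserved in every in-group genome and absent elsewhere.

    The two thresholds are illustrative assumptions, not values taken
    from the abstract.
    """
    group = set(in_group)
    markers = []
    for cds, row in matrix.items():
        inside = all(row[g] >= present for g in row if g in group)
        outside = all(row[g] < absent for g in row if g not in group)
        if inside and outside:
            markers.append(cds)
    return markers


# Hypothetical bit scores for two CDSs across two genomes.
self_scores = {"cdsA": 200.0, "cdsB": 150.0}
genome_scores = {
    "g1": {"cdsA": 200.0, "cdsB": 75.0},
    "g2": {"cdsA": 60.0},  # cdsB has no hit in g2
}
matrix = bsr_matrix(self_scores, genome_scores)
markers = clade_markers(matrix, in_group=["g1"])
```

In this toy input, `cdsA` is fully conserved in `g1` and only weakly matched in `g2`, so it is the kind of CDS a group-specific marker search would report.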

Ecological Structuring of Temperate Bacteriophages in the Inflammatory Bowel Disease-Affected Gut

Microorganisms ◽

10.3390/microorganisms8111663 ◽

2020 ◽

Vol 8 (11) ◽

pp. 1663

Author(s):

Hiroki Nishiyama ◽

Hisashi Endo ◽

Romain Blanc-Mathieu ◽

Hiroyuki Ogata

Keyword(s):

Inflammatory Bowel Disease ◽

Bowel Disease ◽

Large Scale ◽

Sequence Data ◽

Vital Role ◽

Temperate Bacteriophage ◽

Human Gut ◽

Potential Association ◽

Wide Range ◽

Inflammatory Bowel

The aim of this study was to elucidate the ecological structure of the human gut temperate bacteriophage community and its role in inflammatory bowel disease (IBD). Temperate bacteriophages make up a large proportion of the human gut microbiota and are likely to play a role in IBD pathogenesis. However, many of these bacteriophages await characterization in reference databases. Therefore, we conducted a large-scale reconstruction of temperate bacteriophage and bacterial genomes from the whole-metagenome sequence data generated by the IBD Multi’omics Database project. By associating phages with their hosts via genome comparisons, we found that temperate bacteriophages infect a phylogenetically wide range of bacteria. The majority of variance in bacteriophage community composition was explained by variation among individuals, but differences in the abundance of temperate bacteriophages were identified between IBD and non-IBD patients. Of note, in active ulcerative colitis patients, temperate bacteriophages infecting Bacteroides uniformis and Bacteroides thetaiotaomicron—two species experimentally proven to be beneficial to gut homeostasis—were over-represented, whereas their hosts were under-represented in comparison with non-IBD patients. Supporting the mounting evidence that gut viral community plays a vital role in IBD, our results show potential association between temperate bacteriophages and IBD pathogenesis.

Finding functional disease-associated non-coding variation using next-generation sequencing

10.1101/060285 ◽

2016 ◽

Author(s):

Paolo Devanna ◽

Xiaowei Sylvia Chen ◽

Joses Ho ◽

Dario Gajewski ◽

Alessandro Gialluisi ◽

...

Keyword(s):

Next Generation Sequencing ◽

Binding Sites ◽

Large Scale ◽

Sequence Data ◽

Whole Genome Sequence ◽

Next Generation Sequencing Data ◽

Whole Genome ◽

Next Generation ◽

Whole Exome ◽

Generation Sequencing

Next generation sequencing has opened the way for the large-scale interrogation of cohorts at the whole exome or whole genome level. Currently, the field largely focuses on potential disease-causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTR region of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large-scale analysis as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language impaired children. Using whole exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to specific language impairment (SLI). This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome. We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.

Life-history traits and habitat availability shape genomic diversity in birds: implications for conservation

Proceedings of The Royal Society B Biological Sciences ◽

10.1098/rspb.2021.1441 ◽

2021 ◽

Vol 288 (1961) ◽

Author(s):

Anna Brüniche-Olsen ◽

Kenneth F. Kellner ◽

Jerrold L. Belant ◽

J. Andrew DeWoody

Keyword(s):

Life History ◽

Life History Traits ◽

Management Practices ◽

Sequence Data ◽

Extinction Risk ◽

Conservation Status ◽

Genomic Variation ◽

Genomic Diversity ◽

Whole Genome Sequence ◽

Habitat Availability

More than 25% of species assessed by the International Union for Conservation of Nature (IUCN) are threatened with extinction. Understanding how environmental and biological processes have shaped genomic diversity may inform management practices. Using 68 extant avian species, we parsed the effects of habitat availability and life-history traits on genomic diversity over time to provide a baseline for conservation efforts. We used published whole-genome sequence data to estimate overall genomic diversity as indicated by historical long-term effective population sizes (Ne) and current genomic variability (H), then used environmental niche modelling to estimate Pleistocene habitat dynamics for each species. We found that Ne and H were positively correlated with habitat availability and related to key life-history traits (body mass and diet), suggesting the latter contribute to the overall genomic variation. We found that H decreased with increasing species extinction risk, suggesting that H may serve as a leading indicator of demographic trends related to formal IUCN conservation status in birds. Our analyses illustrate that genome-wide summary statistics estimated from sequence data reflect meaningful ecological attributes relevant to species conservation.

Aquila: diploid personal genome assembly and comprehensive variant detection based on linked reads

10.1101/660605 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xin Zhou ◽

Lu Zhang ◽

Ziming Weng ◽

David L. Dill ◽

Arend Sidow

Keyword(s):

Genetic Variation ◽

Genome Sequence ◽

Genome Assembly ◽

Sequence Data ◽

Association Studies ◽

Cost Effective ◽

Whole Genome Sequence ◽

Personal Genome ◽

Whole Genome ◽

Nucleotide Polymorphisms

Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.

Pathogenic and uncertain genetic variants have clinical cardiac correlates in diverse biobank participants

10.1101/716662 ◽

2019 ◽

Author(s):

Tess D. Pottinger ◽

Megan J. Puckelwartz ◽

Lorenzo L. Pesce ◽

Avery Robinson ◽

Samuel Kearns ◽

...

Keyword(s):

Genetic Variation ◽

African Ancestry ◽

Left Ventricular ◽

Whole Genome Sequence ◽

Internal Diameter ◽

Variants Of Uncertain Significance ◽

Pathogenic Variants ◽

Rare Genetic Variation ◽

Uncertain Significance ◽

Ventricle Size

Background. Genome sequencing coupled with electronic health record data can uncover medically important genetic variation. Interpretation of rare genetic variation and its role in mediating cardiovascular phenotypes is confounded by variants of uncertain significance. Methods and Results. We analyzed the whole genome sequence of 900 racially and ethnically diverse biobank participants selected from a single US center. Participants were equally divided among European, African, Hispanic, and mixed race/ethnicities. We evaluated the American College of Medical Genetics and Genomics medically actionable list of 59 genes, focusing on the cardiac genes. Variation was interpreted using the most recent reports in ClinVar, a database of medically relevant human variation. We identified 19 individuals with pathogenic/likely pathogenic variants in cardiac actionable genes (2%) and found evidence for clinical correlates in the electronic health record. African ancestry participants had more variants of uncertain significance in the medically actionable genes, including the 30 cardiac actionable genes, even when normalized to total variant count per person. Longitudinal measures of left ventricle size, corrected for body surface area, from approximately 400 biobank participants (1,723 patient years) correlated with genetic findings. The presence of one or more uncertain variants in the actionable cardiac genes and a cardiomyopathy diagnosis correlated with increased left ventricular internal diameter in diastole and in systole. In particular, MYBPC3 was identified as a gene with excess variants of uncertain significance. Conclusions. These data indicate a subset of uncertain variants may confer risk and should not be considered benign.

Glutton: large-scale integration of non-model organism transcriptome data for comparative analysis

10.1101/077511 ◽

2016 ◽

Cited By ~ 2

Author(s):

Alan Medlar ◽

Laura Laakso ◽

Andreia Miraldo ◽

Ari Löytynoja

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Model Organism ◽

Model Organisms ◽

Rna Seq ◽

Reference Species ◽

Wide Range ◽

The Impact

High-throughput RNA-seq data has become ubiquitous in the study of non-model organisms, but its use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data has to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires the removal of redundant isoforms, assignment of orthologs and converting fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton’s performance across a wide range of divergence times between study and reference species. We demonstrate the impact choice of assembler has on both the number of alignments and the correctness of ortholog assignment and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full length gene sequences even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.
