Identifying tagging SNPs for African specific genetic variation from the African Diaspora Genome

2017 ◽  
Author(s):  
Henry Richard Johnston ◽  
Yi-Juan Hu ◽  
Jingjing Gao ◽  
Timothy D. O’Connor ◽  
Goncalo Abecasis ◽  
...  

A primary goal of The Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) is to develop an ‘African Diaspora Power Chip’ (ADPC), a genotyping array consisting of tagging SNPs, useful in comprehensively identifying African-specific genetic variation. This array is designed based on the novel variation identified in 642 CAAPA samples of African ancestry with high coverage whole genome sequence data (~30x depth). This novel variation extends the pattern of variation catalogued in the 1000 Genomes and Exome Sequencing Projects to a spectrum of populations representing the wide range of West African genomic diversity. These individuals from CAAPA also comprise a large swath of the African Diaspora population and incorporate historical genetic diversity covering nearly the entire Atlantic coast of the Americas. Here we show the results of designing and producing such a microchip array. This novel array covers African-specific variation far better than other commercially available arrays, and will enable better GWAS analyses for researchers with individuals of African descent in their study populations. A recent study [1] cataloging variation in continental African populations suggests this type of African-specific genotyping array is both necessary and valuable for facilitating large-scale GWAS in populations of African ancestry.

2018 ◽  
Vol 3 ◽  
pp. 124 ◽  
Author(s):  
Keith A. Jolley ◽  
James E. Bray ◽  
Martin C. J. Maiden

The PubMLST.org website hosts a collection of open-access, curated databases that integrate population sequence data with provenance and phenotype information for over 100 different microbial species and genera. Although the PubMLST website was conceived as part of the development of the first multi-locus sequence typing (MLST) scheme in 1998, the software it uses, the Bacterial Isolate Genome Sequence database (BIGSdb, published in 2010), enables PubMLST to include all levels of sequence data, from single gene sequences up to and including complete, finished genomes. Here we describe developments in the BIGSdb software made from publication to June 2018 and show how the platform realises microbial population genomics for a wide range of applications. The system is based on the gene-by-gene analysis of microbial genomes, with each deposited sequence annotated and curated to identify the genes present and systematically catalogue their variation. Originally intended as a means of characterising isolates with typing schemes, the synthesis of sequences and records of genetic variation with provenance and phenotype data permits highly scalable (whole genome sequence data for tens of thousands of isolates) means of addressing a wide range of functional questions, including: the prediction of antimicrobial resistance; likely cross-reactivity with vaccine antigens; and the functional activities of different variants that lead to key phenotypes. There are no limitations to the number of sequences, genetic loci, allelic variants or schemes (combinations of loci) that can be included, enabling each database to represent an expanding catalogue of the genetic variation of the population in question.
In addition to providing web-accessible analyses and links to third-party analysis and visualisation tools, the BIGSdb software includes a RESTful application programming interface (API) that enables access to all the underlying data for third-party applications and data analysis pipelines.


Author(s):  
Pamela Wiener ◽  
Christelle Robert ◽  
Abulgasim Ahbara ◽  
Mazdak Salavati ◽  
Ayele Abebe ◽  
...  

Abstract Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.


2014 ◽  
Author(s):  
Jason W Sahl ◽  
Greg Caporaso ◽  
David A Rasko ◽  
Paul S Keim

Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes, and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed in order to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR. Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, which is a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade specific CDS markers and validate the LS-BSR pipeline based on molecular markers that delineate between classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60h using 16 processors. Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. 
Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
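
The abstract above does not spell out how a BLAST score ratio is computed. A minimal sketch, assuming the usual definition (the bit score of a CDS aligned against a genome, divided by the bit score of that CDS aligned against itself), is given below. The function names and the dictionary layout are hypothetical and chosen for illustration only; this is not the LS-BSR code itself, and it starts from pre-computed bit scores rather than running BLAST.

```python
# Hypothetical sketch of a BLAST score ratio (BSR) matrix.
# Assumption: BSR = best bit score of a CDS in a genome / self bit score of that CDS.

def blast_score_ratio(query_bits, self_bits):
    """Return the ratio of a query bit score to the CDS self-alignment bit score."""
    if self_bits <= 0:
        raise ValueError("self bit score must be positive")
    return query_bits / self_bits


def bsr_matrix(self_scores, hit_scores):
    """Build a CDS-by-genome matrix of BSR values.

    self_scores: {cds_id: bit score of the CDS aligned against itself}
    hit_scores:  {genome_id: {cds_id: best bit score of the CDS in that genome}}
    A CDS with no hit in a genome is given a BSR of 0.0.
    """
    matrix = {}
    for cds, self_bits in self_scores.items():
        matrix[cds] = {
            genome: round(blast_score_ratio(hits.get(cds, 0.0), self_bits), 2)
            for genome, hits in hit_scores.items()
        }
    return matrix


if __name__ == "__main__":
    # Made-up scores for two CDSs and two genomes.
    self_scores = {"cdsA": 200.0, "cdsB": 100.0}
    hit_scores = {
        "genome1": {"cdsA": 200.0, "cdsB": 50.0},
        "genome2": {"cdsA": 100.0},  # cdsB has no hit in genome2
    }
    for cds, row in bsr_matrix(self_scores, hit_scores).items():
        print(cds, row)
```

Under this definition a value near 1 indicates a near-identical CDS in the genome and a value near 0 indicates absence, which is what makes a matrix of this kind straightforward to parse for clade-specific markers.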


2020 ◽  
Vol 8 (11) ◽  
pp. 1663
Author(s):  
Hiroki Nishiyama ◽  
Hisashi Endo ◽  
Romain Blanc-Mathieu ◽  
Hiroyuki Ogata

The aim of this study was to elucidate the ecological structure of the human gut temperate bacteriophage community and its role in inflammatory bowel disease (IBD). Temperate bacteriophages make up a large proportion of the human gut microbiota and are likely to play a role in IBD pathogenesis. However, many of these bacteriophages await characterization in reference databases. Therefore, we conducted a large-scale reconstruction of temperate bacteriophage and bacterial genomes from the whole-metagenome sequence data generated by the IBD Multi’omics Database project. By associating phages with their hosts via genome comparisons, we found that temperate bacteriophages infect a phylogenetically wide range of bacteria. The majority of variance in bacteriophage community composition was explained by variation among individuals, but differences in the abundance of temperate bacteriophages were identified between IBD and non-IBD patients. Of note, in active ulcerative colitis patients, temperate bacteriophages infecting Bacteroides uniformis and Bacteroides thetaiotaomicron—two species experimentally proven to be beneficial to gut homeostasis—were over-represented, whereas their hosts were under-represented in comparison with non-IBD patients. Supporting the mounting evidence that gut viral community plays a vital role in IBD, our results show potential association between temperate bacteriophages and IBD pathogenesis.


2016 ◽  
Author(s):  
Paolo Devanna ◽  
Xiaowei Sylvia Chen ◽  
Joses Ho ◽  
Dario Gajewski ◽  
Alessandro Gialluisi ◽  
...  

Abstract Next generation sequencing has opened the way for the large-scale interrogation of cohorts at the whole exome or whole genome level. Currently, the field largely focuses on potential disease-causing variants that fall within coding sequences and that are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTRs of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large-scale analysis as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language-impaired children. Using whole exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to SLI. This work demonstrates that biologically relevant variants are currently being under-investigated despite the wealth of next-generation sequencing data available and presents a simple strategy for interrogating non-coding regions of the genome.
We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.
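
The in silico prediction mentioned in this abstract rests on complementary base pairing between a microRNA and its 3′UTR target. A minimal sketch of that idea follows, assuming the common simplification that a candidate site is a perfect match to the reverse complement of microRNA nucleotides 2 to 8 (the "seed"). The function names and the toy sequences are hypothetical; this is not the authors' pipeline, and real predictors also weigh conservation, site context and other features.

```python
# Hypothetical sketch: find seed-match sites in a 3'UTR and test whether a
# single-nucleotide substitution removes one.
# Assumption: a site is a perfect match to the reverse complement of miRNA
# positions 2-8 (1-based, counted from the 5' end).

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq):
    """Upper-case a sequence and convert DNA 'T' to RNA 'U'."""
    return seq.upper().replace("T", "U")


def seed_site(mirna, start=2, end=8):
    """Return the target-site sequence that pairs with the miRNA seed."""
    seed = to_rna(mirna)[start - 1:end]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_sites(utr, mirna):
    """Return 0-based start positions of perfect seed matches in the 3'UTR."""
    utr = to_rna(utr)
    site = seed_site(mirna)
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


def variant_disrupts_site(utr, mirna, pos, alt):
    """True if substituting `alt` at 0-based `pos` removes an existing seed site."""
    before = find_seed_sites(utr, mirna)
    mutated = utr[:pos] + alt + utr[pos + 1:]
    after = find_seed_sites(mutated, mirna)
    return any(start not in after for start in before)


if __name__ == "__main__":
    mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # toy miRNA sequence for illustration
    utr = "AAACUACCUCAAA"             # toy 3'UTR fragment containing one site
    print(find_seed_sites(utr, mirna))
    print(variant_disrupts_site(utr, mirna, 5, "G"))
```

A substitution inside the matched window removes the site, while one outside it leaves the site intact; variants of the first kind are the ones a screen like this would pass on for functional testing.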


2021 ◽  
Vol 288 (1961) ◽  
Author(s):  
Anna Brüniche-Olsen ◽  
Kenneth F. Kellner ◽  
Jerrold L. Belant ◽  
J. Andrew DeWoody

More than 25% of species assessed by the International Union for Conservation of Nature (IUCN) are threatened with extinction. Understanding how environmental and biological processes have shaped genomic diversity may inform management practices. Using 68 extant avian species, we parsed the effects of habitat availability and life-history traits on genomic diversity over time to provide a baseline for conservation efforts. We used published whole-genome sequence data to estimate overall genomic diversity as indicated by historical long-term effective population sizes (Ne) and current genomic variability (H), then used environmental niche modelling to estimate Pleistocene habitat dynamics for each species. We found that Ne and H were positively correlated with habitat availability and related to key life-history traits (body mass and diet), suggesting the latter contribute to the overall genomic variation. We found that H decreased with increasing species extinction risk, suggesting that H may serve as a leading indicator of demographic trends related to formal IUCN conservation status in birds. Our analyses illustrate that genome-wide summary statistics estimated from sequence data reflect meaningful ecological attributes relevant to species conservation.


2019 ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Ziming Weng ◽  
David L. Dill ◽  
Arend Sidow

Abstract Variant discovery in personal, whole genome sequence data is critical for uncovering the genetic contributions to health and disease. We introduce a new approach, Aquila, that uses linked-read data for generating a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. Assemblies cover >95% of the human reference genome, with over 98% in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective evolution of whole-genome reconstruction that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.


2019 ◽  
Author(s):  
Tess D. Pottinger ◽  
Megan J. Puckelwartz ◽  
Lorenzo L. Pesce ◽  
Avery Robinson ◽  
Samuel Kearns ◽  
...  

Abstract Background: Genome sequencing coupled with electronic health record data can uncover medically important genetic variation. Interpretation of rare genetic variation and its role in mediating cardiovascular phenotypes is confounded by variants of uncertain significance. Methods and Results: We analyzed the whole genome sequence of 900 racially and ethnically diverse biobank participants selected from a single US center. Participants were equally divided among European, African, Hispanic, and mixed race/ethnicities. We evaluated the American College of Medical Genetics and Genomics medically actionable list of 59 genes focusing on the cardiac genes. Variation was interpreted using the most recent reports in ClinVar, a database of medically relevant human variation. We identified 19 individuals with pathogenic/likely pathogenic variants in cardiac actionable genes (2%) and found evidence for clinical correlates in the electronic health record. African ancestry participants had more variants of uncertain significance in the medically actionable genes including the 30 cardiac actionable genes, even when normalized to total variant count per person. Longitudinal measures of left ventricle size, corrected for body surface area, from approximately 400 biobank participants (1,723 patient years) correlated with genetic findings. The presence of one or more uncertain variants in the actionable cardiac genes and a cardiomyopathy diagnosis correlated with increased left ventricular internal diameter in diastole and in systole. In particular, MYBPC3 was identified as a gene with excess variants of uncertain significance. Conclusions: These data indicate a subset of uncertain variants may confer risk and should not be considered benign.


2016 ◽  
Author(s):  
Alan Medlar ◽  
Laura Laakso ◽  
Andreia Miraldo ◽  
Ari Löytynoja

Abstract High-throughput RNA-seq data has become ubiquitous in the study of non-model organisms, but its use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data has to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires the removal of redundant isoforms, assignment of orthologs and converting fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton’s performance across a wide range of divergence times between study and reference species. We demonstrate the impact choice of assembler has on both the number of alignments and the correctness of ortholog assignment and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full length gene sequences even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.

