A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference

2016 ◽  
Author(s):  
Sorina Maciuca ◽  
Carlos del Ojo Elias ◽  
Gil McVean ◽  
Zamin Iqbal

Abstract: We show how positional markers can be used to encode genetic variation within a Burrows-Wheeler Transform (BWT), and use this to construct a generalisation of the traditional “reference genome”, incorporating known variation within a species. Our goal is to support the inference of the closest mosaic of previously known sequences to the genome(s) under analysis. Our scheme results in an increased alphabet size, and by using a wavelet tree encoding of the BWT we reduce the performance impact on rank operations. We give a specialised form of the backward search that allows variation-aware exact matching. We implement this, and demonstrate the cost of constructing an index of the whole human genome with 8 million genetic variants is 25 GB of RAM. We also show that inferring a closer reference can close large kilobase-scale coverage gaps in P. falciparum.
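As a rough illustration of the backward search the abstract builds on, here is a minimal sketch of plain (variation-unaware) exact matching over a BWT. It is not the authors' implementation: the function names `bwt_index` and `backward_search` are hypothetical, the suffix array is built naively, and a simple occurrence table stands in for the wavelet tree that the paper uses to answer rank queries.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    # Naive suffix array: fine for a demo, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (text[-1] is the sentinel).
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = occurrences of c in bwt[:i]; this plain table plays the
    # role of the rank structure (a wavelet tree in the paper).
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of suffix array rows
    for c in reversed(pattern):  # extend the match one character leftwards
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = bwt_index("ACGTACGA")
print(backward_search("ACG", sa, C, occ))
```

The variation-aware search described in the abstract extends this loop so that the interval can branch when it meets a variant-site marker; that extension is not shown here.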

2019 ◽  
Author(s):  
Rebecca A. Zabinsky ◽  
Jonathan Mares ◽  
Richard She ◽  
Michelle K. Zeman ◽  
Thomas R. Silvers ◽  
...  

Abstract: Rapid mutation fuels the evolution of many cancers and pathogens. Much of the ensuing genetic variation is detrimental, but cells can survive by limiting the cost of accumulating mutation burden. We investigated this behavior by propagating hypermutating yeast lineages to create independent populations harboring thousands of distinct genetic variants. Mutation rate and spectrum remained unchanged throughout the experiment, yet lesions that arose early were more deleterious than those that arose later. Although the lineages shared no mutations in common, each mounted a similar transcriptional response to mutation burden. The proteins involved in this response formed a highly connected network that has not previously been identified. Inhibiting this response increased the cost of accumulated mutations, selectively killing highly mutated cells. A similar gene expression program exists in hypermutating human cancers and is linked to survival. Our data thus define a conserved stress response that buffers the cost of accumulating genetic lesions and further suggest that this network could be targeted therapeutically.


2018 ◽  
Author(s):  
Jacob Pritt ◽  
Nae-Chyun Chen ◽  
Ben Langmead

Abstract: There is growing interest in using genetic variants to augment the reference genome into a “graph genome” to improve read alignment accuracy and reduce allelic bias. While adding a variant has the positive effect of removing an undesirable alignment-score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead.


2019 ◽  
Author(s):  
Thomas Büchler ◽  
Enno Ohlebusch

Abstract

Motivation: In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers (Li and Durbin, 2009; Langmead and Salzberg, 2012) are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. (2013) encoded SNPs in a BWT by the IUPAC nucleotide code (Cornish-Bowden, 1985). In a different approach, Maciuca et al. (2016) provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation.

Results: In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, MNPs, indels, duplications, transpositions, inversions, and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in (Huang et al., 2013) and the encoding of the other kinds of genetic variation relies on the idea introduced in (Maciuca et al., 2016). In contrast to Maciuca et al. (2016), however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it to BWBBLE (Huang et al., 2013) and gramtools (Maciuca et al., 2016).

Availability: https://www.uni-ulm.de/in/theo/research/seqana/

Contact: [email protected]
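The IUPAC encoding of SNPs mentioned in this abstract can be illustrated with a small sketch. The code table is the standard IUPAC nucleotide ambiguity code; everything else is an assumption made for illustration: the function names `encode_snps` and `matches`, the dictionary input format for SNPs, and the naive position-by-position comparison, which stands in for the BWT-based backward search that the paper and BWBBLE actually use.

```python
# Standard IUPAC nucleotide codes, each mapped to the set of bases it covers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> single IUPAC symbol.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def encode_snps(reference, snps):
    """Replace each SNP position by the IUPAC code covering all its alleles.

    snps maps a 0-based position to a string of alternative bases
    (a hypothetical input format chosen for this sketch).
    """
    seq = list(reference)
    for pos, alts in snps.items():
        alleles = frozenset(alts) | {reference[pos]}
        seq[pos] = CODE_FOR[alleles]
    return "".join(seq)


def matches(read, encoded, start):
    """True if read fits the encoded reference at start.

    Each IUPAC symbol is treated as the set of bases it stands for.
    """
    window = encoded[start:start + len(read)]
    if len(window) != len(read):
        return False
    return all(base in IUPAC[code] for base, code in zip(read, window))


encoded = encode_snps("ACGTAC", {1: "T", 4: "GC"})
print(encoded)
print(matches("ATGT", encoded, 0), matches("AAGT", encoded, 0))
```

Note that this alphabet grows by at most eleven symbols regardless of how many SNPs are encoded, which is the efficiency argument the abstract makes against per-site markers.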


2019 ◽  
Author(s):  
Thomas Büchler ◽  
Enno Ohlebusch

Abstract

Motivation: In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers are based on the Burrows–Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. encoded single nucleotide polymorphisms (SNPs) in a BWT by the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. In a different approach, Maciuca et al. provided a ‘natural encoding’ of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation.

Results: In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, multi-nucleotide polymorphisms, insertions or deletions, duplications, transpositions, inversions and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in Huang et al. (2013, Short read alignment with populations of genomes. Bioinformatics, 29, i361–i370) and the encoding of the other kinds of genetic variation relies on the idea introduced in Maciuca et al. (2016, A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th International Workshop on Algorithms in Bioinformatics, Volume 9838 of Lecture Notes in Computer Science, pp. 222–233. Springer). In contrast to Maciuca et al., however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ‘marked chromosome’. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it with BWBBLE and gramtools.

Availability and implementation: https://www.uni-ulm.de/in/theo/research/seqana/

Contact: [email protected]


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Abstract: Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


2008 ◽  
Vol 36 (3) ◽  
pp. 471-477 ◽  
Author(s):  
Jennifer A. Hamilton

In 2000, researchers from the Human Genome Project (HGP) proclaimed that the initial sequencing of the human genome definitively proved, among other things, that there was no genetic basis for race. The genetic fact that most humans were 99.9% the same at the level of their DNA was widely heralded and circulated in the English-speaking press, especially in the United States. This pronouncement seemed proof that long-term antiracist efforts to de-biologize race were legitimized by scientific findings. Yet, despite the seemingly widespread acceptance of the social construction of race, post-HGP genetic science has seen a substantial shift toward the use of race variables in genetic research and, according to a number of prominent scholars, is re-invoking the specter of earlier forms of racial science in some rather discomfiting ways. During the past seven years, the main thrust of human genetic research, especially in the realm of biomedicine, has shifted from a concern with the 99.9% of the shared genome — what is thought to make humans alike — towards an explicit focus on the 0.1% that constitutes human genetic variation. Here I briefly explore some of the potential implications of the conceptualization and practice of early 21st century genetic variation research, especially as it relates to questions of race.


2018 ◽  
Vol 62 (4) ◽  
pp. 575-582
Author(s):  
Francesco Raimondi ◽  
Robert B. Russell

Genetic variants are currently a major component of system-wide investigations into biological function or disease. Approaches to select variants (often out of thousands of candidates) that are responsible for a particular phenomenon have many clinical applications and can help illuminate differences between individuals. Selecting meaningful variants is greatly aided by integration with information about molecular mechanism, whether known from protein structures or interactions or biological pathways. In this review we discuss the nature of genetic variants, and recent studies highlighting what is currently known about the relationship between genetic variation, biomolecular function, and disease.


2021 ◽  
Author(s):  
Brice Letcher ◽  
Martin Hunt ◽  
Zamin Iqbal

Abstract

Background: Standard approaches to characterising genetic variation revolve around mapping reads to a reference genome and describing variants in terms of differences from the reference; this is based on the assumption that these differences will be small and provides a simple coordinate system. However this fails, and the coordinates break down, when there are diverged haplotypes at a locus (e.g. one haplotype contains a multi-kilobase deletion, a second contains a few SNPs, and a third is highly diverged with hundreds of SNPs). To handle these, we need to model genetic variation that occurs at different length-scales (SNPs to large structural variants) and that occurs on alternate backgrounds. We refer to these together as multiscale variation.

Results: We model the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools. This enables variant calling on different sequence backgrounds. In addition to producing regular VCF files, we introduce a JSON file format based on VCF, which records variant site relationships and alternate sequence backgrounds. We show two applications. First, we benchmark gramtools against existing state-of-the-art methods in joint-genotyping 17 M. tuberculosis samples at long deletions and the overlapping small variants that segregate in a cohort of 1,017 genomes. Second, in 706 African and SE Asian P. falciparum genomes, we analyse a dimorphic surface antigen gene which possesses variation on two diverged backgrounds which appeared to not recombine. This generates the first map of variation on both backgrounds, revealing patterns of recombination that were previously unknown.

Conclusions: We need new approaches to be able to jointly analyse SNP and structural variation in cohorts, and even more to handle variants on different genetic backgrounds. We have demonstrated that by modelling with a directed, acyclic and locally hierarchical genome graph, we can apply new algorithms to accurately genotype dense variation at multiple scales. We also propose a generalisation of VCF for accessing multiscale variation in genome graphs, which we hope will be of wide utility.
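To make the idea of successive, hierarchical variant sites more concrete, here is a toy sketch of a nested-site structure and the haplotypes it spells. The representation is an assumption made for this illustration only and is not the gramtools data model or file format: a `str` is invariant sequence, a `list` is a concatenation of parts, and a `tuple` is a site whose items are alternative alleles, each of which may itself contain further sites.

```python
from itertools import product


def haplotypes(node):
    """Enumerate every sequence spelled by a nested-site structure.

    str   -> invariant sequence
    list  -> concatenation of parts (successive segments and sites)
    tuple -> a variant site: one item is chosen among the alleles
    """
    if isinstance(node, str):
        return [node]
    if isinstance(node, tuple):
        # A site contributes the union of what each allele can spell.
        return [h for allele in node for h in haplotypes(allele)]
    # A list is a concatenation: combine every choice of every part.
    parts = [haplotypes(part) for part in node]
    return ["".join(choice) for choice in product(*parts)]


# One outer site with three alleles: a SNP-like allele "G", a longer allele
# that contains its own nested SNP site, and a deletion (empty string).
graph = ["AC", ("G", ["T", ("A", "C"), "T"], ""), "TT"]
for hap in haplotypes(graph):
    print(hap)
```

The nested site in the second allele only exists on that allele's background, which is a small-scale picture of the variation on alternate backgrounds that the abstract describes. Exhaustive enumeration like this grows exponentially with the number of sites, so it is useful only for tiny examples.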


2021 ◽  
Vol 12 ◽  
Author(s):  
S. A. Durward-Akhurst ◽  
R. J. Schaefer ◽  
B. Grantham ◽  
W. K. Carey ◽  
J. R. Mickelson ◽  
...  

Genetic variation is a key contributor to health and disease. Understanding the link between an individual’s genotype and the corresponding phenotype is a major goal of medical genetics. Whole genome sequencing (WGS) within and across populations enables highly efficient variant discovery and elucidation of the molecular nature of virtually all genetic variation. Here, we report the largest catalog of genetic variation for the horse, a species of importance as a model for human athletic and performance related traits, using WGS of 534 horses. We show the extent of agreement between two commonly used variant callers. In data from ten target breeds that represent major breed clusters in the domestic horse, we demonstrate the distribution of variants, their allele frequencies across breeds, and identify variants that are unique to a single breed. We investigate variants with no homozygotes that may be potential embryonic lethal variants, as well as variants present in all individuals that likely represent regions of the genome with errors, poor annotation or where the reference genome carries a variant. Finally, we show regions of the genome that have higher or lower levels of genetic variation compared to the genome average. This catalog can be used for variant prioritization for important equine diseases and traits, and to provide key information about regions of the genome where the assembly and/or annotation need to be improved.

