A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs

2017 ◽  
Author(s):  
Siddharth Samsi ◽  
Brian Helfer ◽  
Jeremy Kepner ◽  
Albert Reuther ◽  
Darrell O. Ricke

Analysis of DNA samples is an important tool in forensics, and the speed of analysis can impact investigations. Comparison of DNA sequences is based on the analysis of short tandem repeats (STRs), which are short DNA sequences of 2-5 base pairs. Current forensics approaches use 20 STR loci for analysis. The use of single nucleotide polymorphisms (SNPs) has utility for analysis of complex DNA mixtures. The use of tens of thousands of SNP loci for analysis poses significant computational challenges because the forensic analysis scales by the product of the loci count and the number of DNA samples to be analyzed. In this paper, we discuss the implementation of a DNA sequence comparison algorithm by re-casting the algorithm in terms of linear algebra primitives. By developing an overloaded matrix multiplication approach to DNA comparisons, we can leverage advances in GPU hardware and algorithms for dense matrix multiplication (DGEMM) to speed up DNA sample comparisons. We show that it is possible to compare 2048 unknown DNA samples with 20 million known samples in under 6 seconds using an NVIDIA K80 GPU.
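
The recasting described above can be illustrated with a minimal pure-Python sketch. The abstract does not state the encoding or the exact overloaded operators, so everything here is an assumption made for illustration: each profile is a 0/1 vector with one entry per SNP locus (1 = minor allele present), the score counts loci where a reference carries a minor allele that the mixture lacks, and the names `generalized_matmul` and `unexplained_alleles` are hypothetical.

```python
from typing import Callable, List

Profile = List[int]  # assumed encoding: one 0/1 entry per SNP locus


def generalized_matmul(
    a: List[Profile],
    b: List[Profile],
    mul: Callable[[int, int], int],
    add: Callable[[int, int], int],
    zero: int,
) -> List[List[int]]:
    """Compute a (m x k) times the transpose of b (n x k) with overloaded ops.

    With mul = x * y and add = x + y this is an ordinary matrix product.
    """
    result = []
    for row in a:
        out_row = []
        for col in b:
            acc = zero
            for x, y in zip(row, col):
                acc = add(acc, mul(x, y))
            out_row.append(acc)
        result.append(out_row)
    return result


def unexplained_alleles(
    references: List[Profile], mixtures: List[Profile]
) -> List[List[int]]:
    """scores[i][j] = loci where reference i has a minor allele mixture j lacks."""
    return generalized_matmul(
        references,
        mixtures,
        mul=lambda r, m: r & (1 - m),  # overloaded "multiply" (assumed)
        add=lambda s, t: s + t,        # ordinary addition
        zero=0,
    )


# Toy data: 3 known references, 2 unknown mixtures, 6 loci.
references = [
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
]
mixtures = [
    [1, 1, 1, 0, 1, 1],  # union of references 0 and 1
    [1, 1, 0, 1, 0, 0],  # identical to reference 2
]

scores = unexplained_alleles(references, mixtures)
print(scores)  # a score of 0 means the reference is consistent with the mixture
```

Under this assumed binary encoding, the overloaded product equals the ordinary dense product of the reference matrix with the transpose of (1 − mixture matrix), which is the kind of operation a GPU matrix-multiplication routine accelerates. The triple loop also makes the cost visible: work grows with the number of references times the number of mixtures times the number of loci.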

Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 743
Author(s):  
Caiyong Yin ◽  
Kaiyuan Su ◽  
Ziwei He ◽  
Dian Zhai ◽  
Kejian Guo ◽  
...  

Y chromosomal short tandem repeats (Y-STRs) have been widely harnessed for forensic applications, such as pedigree source searching from public security databases and male identification from male–female mixed samples. For various populations, databases composed of Y-STR haplotypes have been built to provide investigative leads for solving difficult or cold cases. Recently, the supplementary application of Y chromosomal haplogroup-determining single-nucleotide polymorphisms (SNPs) for forensic purposes has been under heated debate. This study provides Y-STR haplotypes for 27 markers typed by the Yfiler™ Plus kit and Y-SNP haplogroups defined by 24 loci within the Y-SNP Pedigree Tagging System for Shandong Han (n = 305) and Yunnan Han (n = 565) populations. The genetic backgrounds of these two populations were explicitly characterized by the analysis of molecular variance (AMOVA) and multi-dimensional scaling (MDS) plots based on 27 Y-STRs. Then, population comparisons were conducted by observing Y-SNP allelic frequencies and Y-SNP haplogroup distributions, estimating forensic parameters, and depicting distribution spectra of Y-STR alleles in sub-haplogroups. The Y-STR variants, including null alleles, intermediate alleles, and copy number variations (CNVs), were co-listed, and a strong correlation between Y-STR allele variants (“DYS518~.2” alleles) and the Y-SNP haplogroup QR-M45 was observed. A network was reconstructed to illustrate the evolutionary pathway and to identify the ancestral mutation event. Also, a phylogenetic tree on the individual level was constructed to observe the relevance of the Y-STR haplotypes to the Y-SNP haplogroups. This study provides evidence that basic genetic backgrounds, which were revealed by both Y-STR and Y-SNP loci, would be useful for uncovering detailed population differences and, more importantly, demonstrates the contributing role of Y-SNPs in population differentiation and male pedigree discrimination.


1984 ◽  
Vol 4 (2) ◽  
pp. 254-259 ◽  
Author(s):  
D Carroll ◽  
J E Garrett ◽  
B S Lam

There exist in the Xenopus laevis genome clusters of tandemly repeated DNA sequences, consisting of two types of 393-base-pair repeating unit. Each such cluster contains several units of one of these paired tandem repeats (PTR-1), followed by several units of the other repeat (PTR-2). The number of repeats of each type is variable from cluster to cluster and averages about seven of each type per cluster. Every cluster has ca. 1,000 base pairs of common left flanking sequence (adjacent to the PTR-1 repeats) and 1,000 base pairs of common right flanking sequence (adjacent to the PTR-2 repeats). Beyond these common flanks, the DNA sequences are different in the eight cloned genomic fragments we have studied. Thus, the hundreds of PTR clusters in the genome are dispersed at apparently unrelated sites. Nucleotide sequences of representative PTR-1 and PTR-2 repeats are 64% homologous. These sequences do not reveal an obvious function. However, the related species X. mulleri and X. borealis have sequences homologous to PTR-1 and PTR-2, which show the same repeat lengths and genomic organization. This evolutionary conservation suggests positive selection for the clusters. Maintenance of these sequences at dispersed sites imposes constraints on possible mechanisms of concerted evolution.


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Chun-Tien Chang ◽  
Chi-Neu Tsai ◽  
Chuan Yi Tang ◽  
Chun-Houh Chen ◽  
Jang-Hau Lian ◽  
...  

The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for heterozygous base-calling fluorescence chromatogram data analysis. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format using comparisons with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papilloma virus (HPV) genotypes by searching current viral databases in cases of double infections; and (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3.


Author(s):  
Saeeda Baig

In the recent past, the focus has shifted from identifying intervertebral disc degeneration as being caused by physical exposure and strain to linking it with a variety of genetic variations. The objective of this review is to provide an up-to-date summary of the existing research data regarding the relation of intervertebral disc degeneration to structural protein genes and their polymorphisms, and thus help clearly establish further avenues where research into causation and treatment is needed. A comprehensive search was conducted in PubMed and Google Scholar using the keywords “Collagen”, “COL”, “Aggrecan”, “AGC”, “IVDD”, “intervertebral disc degeneration”, and “lumbar disc degeneration”; English-language literature published from 1991 to 2019 was selected. There are many genes involved in the production of structural components of an intervertebral disc. The issues in production of these components involve the over-expression or under-expression of their genes, and single nucleotide polymorphisms and variable number of tandem repeats affecting their structures. These structural genes include primarily the collagen and the aggrecan genes. While genetic and environmental factors all come into play with a disease process like disc degeneration, the bulk of research now shows the significantly larger impact of heredity over exposure. While further research is needed into some of the lesser studied genes linked to IVDD and also the racial variations in genetic makeup, the focus in the near future should be on establishment of genetic testing to identify individuals at greater risk of disease and deliberation regarding the use of gene therapy to prevent disc degeneration.


2002 ◽  
Vol 80 (11) ◽  
pp. 1151-1159 ◽  
Author(s):  
M Dusabenyagasani ◽  
G Laflamme ◽  
R C Hamelin

We detected nucleotide polymorphisms within the genus Gremmeniella in DNA sequences of β-tubulin, glyceraldehyde phosphate dehydrogenase, and mitochondrial small subunit rRNA (mtSSU rRNA) genes. A group-I intron was present in strains originating from fir (Abies spp.) in the mtSSU rRNA locus. This intron in the mtSSU rRNA locus of strains isolated from Abies sachalinensis (Fridr. Schmidt) M.T. Mast in Asia was also found in strains isolated from Abies balsamea (L.) Mill. in North America. Phylogenetic analyses yielded trees that grouped strains by host of origin with strong branch support. Asian strains of Gremmeniella abietina (Lagerberg) Morelet var. abietina isolated from fir (A. sachalinensis) were more closely related to G. abietina var. balsamea from North America, which is found on spruce (Picea spp.) and balsam fir, and European and North American races of G. abietina var. abietina from pines (Pinus spp.) were distantly related. Likewise, North American isolates of Gremmeniella laricina (Ettinger) O. Petrini, L.E. Petrini, G. Laflamme, & G.B. Ouellette, a pathogen of larch, were more closely related to G. laricina from Europe than to G. abietina var. abietina from North America. These data suggest that host specialization might have been the leading evolutionary force shaping Gremmeniella spp., with geographic separation acting as a secondary factor.

Key words: Gremmeniella, geographic separation, host specialization, mitochondrial rRNA, nuclear genes.


Genetics ◽  
2004 ◽  
Vol 166 (2) ◽  
pp. 661-668
Author(s):  
Mandy Kim ◽  
Erika Wolff ◽  
Tiffany Huang ◽  
Lilit Garibyan ◽  
Ashlee M Earl ◽  
...  

We have applied a genetic system for analyzing mutations in Escherichia coli to Deinococcus radiodurans, an extremophile with an astonishingly high resistance to UV- and ionizing-radiation-induced mutagenesis. Taking advantage of the conservation of the β-subunit of RNA polymerase among most prokaryotes, we derived again in D. radiodurans the rpoB/Rifʳ system that we developed in E. coli to monitor base substitutions, defining 33 base change substitutions at 22 different base pairs. We sequenced >250 mutations leading to Rifʳ in D. radiodurans derived spontaneously in wild-type and uvrD (mismatch-repair-deficient) backgrounds and after treatment with N-methyl-N′-nitro-N-nitrosoguanidine (NTG) and 5-azacytidine (5AZ). The specificities of NTG and 5AZ in D. radiodurans are the same as those found for E. coli and other organisms. There are prominent base substitution hotspots in rpoB in both D. radiodurans and E. coli. In several cases these are at different points in each organism, even though the DNA sequences surrounding the hotspots and their corresponding sites are very similar in both D. radiodurans and E. coli. In one case the hotspots occur at the same site in both organisms.


2021 ◽  
Vol 22 (4) ◽  
pp. 1832
Author(s):  
Eugene Metakovsky ◽  
Laura Pascual ◽  
Patrizia Vaccino ◽  
Viktor Melnik ◽  
Marta Rodriguez-Quijano ◽  
...  

The Gli-B1-encoded γ-gliadins and non-coding γ-gliadin DNA sequences for 15 different alleles of common wheat have been compared using seven tests: electrophoretic mobility (EM) and molecular weight (MW) of the encoded major γ-gliadin, restriction fragment length polymorphism patterns (RFLPs) (three different markers), known single-nucleotide polymorphism (SNP) markers of the Gli-B1 γ-gliadin pseudogene, and sequencing of the pseudogene GAG56B. It was discovered that encoded γ-gliadins with contrasting EM had similar MWs. However, seven allelic variants (designated from I to VII) differed from one another in the other six tests: I (alleles Gli-B1i, k, m, o), II (Gli-B1n, q, s), III (Gli-B1b), IV (Gli-B1e, f, g), V (Gli-B1h), VI (Gli-B1d) and VII (Gli-B1a). Allele Gli-B1c (variant VIII) was identical to the alleles from group IV in four of the tests. Some tests might show a fine difference between alleles belonging to the same variant. Our results attest in favor of the independent origin of at least seven variants at the Gli-B1 locus that might originate from deeply diverged genotypes of the donor(s) of the B genome in hexaploid wheat and therefore might be called “heteroallelic”. The donor’s particularities at the Gli-B1 locus might be conserved since that time and decisively contribute to the current high genetic diversity of common wheat.


Genetics ◽  
1974 ◽  
Vol 77 (1) ◽  
pp. 95-104
Author(s):  
J E Sulston ◽  
S Brenner

Chemical analysis and a study of renaturation kinetics show that the nematode, Caenorhabditis elegans, has a haploid DNA content of 8 × 10⁷ base pairs (20 times the genome of E. coli). Eighty-three percent of the DNA sequences are unique. The mean base composition is 36% GC; a small component, containing the rRNA cistrons, has a base composition of 51% GC. The haploid genome contains about 300 genes for 4S RNA, 110 for 5S RNA, and 55 for (18 + 28)S RNA.

