Compressing population DNA sequences using multiple reference sequences

Author(s):  
Kin-On Cheng ◽  
Ngai-Fong Law ◽  
Wan-Chi Siu
Author(s):  
Dirk Erpenbeck ◽  
Merrick Ekins ◽  
Nicole Enghuber ◽  
John N.A. Hooper ◽  
Helmut Lehnert ◽  
...  

Sponge species are infamously difficult to identify for non-experts due to their high morphological plasticity and the paucity of informative morphological characters. The use of molecular techniques certainly helps with species identification, but unfortunately it requires prior reference sequences. Holotypes constitute the best reference material for species identification; however, their usage in molecular systematics and taxonomy is scarce and frequently not even attempted, mostly due to their antiquity and preservation history. Here we provide case studies in which we demonstrate the importance of using holotype material to answer phylogenetic and taxonomic questions. We also demonstrate the possibility of sequencing DNA fragments out of century-old holotypes. Furthermore, we propose the deposition of DNA sequences in conjunction with new species descriptions.


2014 ◽  
Author(s):  
Alexander Dilthey ◽  
Charles Cox ◽  
Zamin Iqbal ◽  
Matthew R. Nelson ◽  
Gil McVean

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5 Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.
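The idea of reconstructing a sample as a path that may recombine between haplotypes can be illustrated with a toy Viterbi decoder. This is only a minimal sketch under stated assumptions, not the method from the abstract: it assumes equal-length, pre-aligned haplotype strings rather than a graph, and the function name `viterbi_mosaic` and the parameters `switch_prob` and `error_prob` are hypothetical.

```python
import math


def viterbi_mosaic(sample, haplotypes, switch_prob=0.01, error_prob=0.001):
    """Return, for each position, the index of the haplotype most likely copied.

    Assumption: `sample` and every haplotype are aligned strings of equal length.
    States are haplotype indices; switching state models a recombination event.
    """
    if not sample:
        return []
    n_hap = len(haplotypes)
    log_stay = math.log(1.0 - switch_prob)
    # Probability of switching is shared evenly among the other haplotypes.
    log_switch = math.log(switch_prob / (n_hap - 1)) if n_hap > 1 else float("-inf")
    log_match = math.log(1.0 - error_prob)
    log_mismatch = math.log(error_prob)

    def log_emit(h, i):
        return log_match if haplotypes[h][i] == sample[i] else log_mismatch

    # Uniform prior over haplotypes at the first position.
    score = [log_emit(h, 0) - math.log(n_hap) for h in range(n_hap)]
    back = []  # back[i - 1][h] = best predecessor of state h at position i
    for i in range(1, len(sample)):
        new_score, pointers = [], []
        for h in range(n_hap):
            def via(p, h=h):
                return score[p] + (log_stay if p == h else log_switch)
            best_prev = max(range(n_hap), key=via)
            new_score.append(via(best_prev) + log_emit(h, i))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi_mosaic("AAAACCCC", ["AAAAAAAA", "CCCCCCCC"]))
```

In this toy model a single switch between the two haplotypes is cheaper than four mismatches, so the decoded path changes haplotype exactly where the sample does.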


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Qiuhui Li ◽  
Shilin Tian ◽  
Bin Yan ◽  
Chi Man Liu ◽  
Tak-Wah Lam ◽  
...  

Pan-genome sequence analysis of human population ancestry is critical for expanding and better defining human genome sequence diversity. However, the amount of genetic variation still missing from current human reference sequences is unknown. Here, we used 486 deep-sequenced Han Chinese genomes to identify 276 Mbp of DNA sequences that, to our knowledge, are absent in the current human reference. We classified these sequences into individual-specific and common sequences, and propose that the common sequence size is uncapped with a growing population. The 46.646 Mbp common sequences obtained from the 486 individuals improved the accuracy of variant calling and mapping rate when added to the reference genome. We also analyzed the genomic positions of these common sequences and found that they came from genomic regions characterized by high mutation rate and low pathogenicity. Our study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population missing from the GRCh38 reference genome and establishes the newly defined common sequences as candidates to supplement the current human reference.


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12707
Author(s):  
Girum Fitihamlak Ejigu ◽  
Gangman Yi ◽  
Jong Im Kim ◽  
Jaehee Jung

The massively parallel nature of next-generation sequencing technologies has contributed to the generation of massive sequence data in the last two decades. Deciphering the meaning of each generated sequence requires multiple analysis tools, at all stages of analysis, from the reads stage all the way up to the whole-genome level. Homology-based approaches based on related reference sequences are usually the preferred option for gene and transcript prediction in newly sequenced genomes, resulting in the popularity of a variety of BLAST and BLAST-based tools. For organelle genomes, a single-reference–based gene finding tool that uses grouping parameters for BLAST results has been implemented in the Genome Search Plotter (GSP). However, this tool does not accept multiple and user-customized reference sequences required for a broad homology search. Here, we present multiple Reference–based Gene Search and Plot (ReGSP), a simple and convenient web tool that accepts multiple reference sequences for homology-based gene search. The tool incorporates cPlot, a novel dot plot tool, for illustrating nucleotide sequence similarity between the query and the reference sequences. ReGSP has an easy-to-use web interface and is freely accessible at https://ds.mju.ac.kr/regsp.
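A dot plot of nucleotide similarity between a query and a reference can be sketched in a few lines. The example below is a generic word-match dot plot, not the cPlot tool named in the abstract, whose algorithm is not described here; the function names `dot_plot` and `render` and the `word` parameter are hypothetical.

```python
def dot_plot(query, reference, word=3):
    """Return the set of (i, j) where query[i:i+word] equals reference[j:j+word]."""
    # Index every word of the reference by its start position.
    index = {}
    for j in range(len(reference) - word + 1):
        index.setdefault(reference[j:j + word], []).append(j)
    points = set()
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            points.add((i, j))
    return points


def render(points, n_rows, n_cols):
    """Draw the points as text: rows are query positions, columns are reference positions."""
    return "\n".join(
        "".join("*" if (i, j) in points else "." for j in range(n_cols))
        for i in range(n_rows)
    )


if __name__ == "__main__":
    pts = dot_plot("ACGTAC", "TACG", word=3)
    print(sorted(pts))
    print(render(pts, 4, 2))
```

Diagonal runs of points indicate regions where the query and reference share sequence; a larger `word` suppresses chance matches at the cost of sensitivity.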


2020 ◽  
Vol 11 (2) ◽  
pp. 145-152
Author(s):  
Nevenka Ćelepirović ◽  
Sanja Novak Agbaba ◽  
Monika Karija Vlahović

Saprotrophic, endophytic, and parasitic fungi were detected in samples collected in the forest of the management unit East Psunj and Papuk Nature Park in Croatia. Disease symptoms, the morphology of fruiting bodies and fungal cultures, and DNA barcoding were combined to identify the fungi at the genus or species level. DNA barcoding is a standardized and automated identification of species based on recognition of highly variable DNA sequences, and it is widely applied to diagnose fungi in biological specimens. DNA samples for DNA barcoding were isolated from infected tree tissues, fungal fruiting bodies or fungal cultures. The ITS or ITS2 sequences of the fungal DNA were sequenced and aligned with the reference sequences in GenBank (NCBI) using BLAST. The sizes of the ITS and ITS2 sequences were 512-584 bp and 248-326 bp, respectively. The sequences showed a high identity of 97.21%-100% at 98%-100% coverage with reference sequences in GenBank (NCBI). The exception was the species Amphilogia gyrosa, which showed 95.65% identity at 100% coverage. Two fungi were determined at genus level: Cladosporium sp. and Cytospora sp., while 11 fungi were determined at species level: Alternaria alternata, Aureobasidium pullulans, Amphilogia gyrosa, Capronia pilosella, Cryphonectria parasitica, Exidia glandulosa, Epicoccum nigrum, Penicillium glabrum, Pezicula carpinea, Rosellinia corticium, and Stereum hirsutum.
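The identity and coverage figures quoted above can be made concrete with a small calculation over a single gapped pairwise alignment. This is a simplified sketch, assuming identity is the share of alignment columns with identical bases and coverage is the share of the query that takes part in the alignment; BLAST's own reporting (for example, query cover combined over several alignments) may differ. The function name `identity_and_coverage` is hypothetical.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for one gapped alignment.

    Both aligned strings must have the same length, with '-' marking gaps.
    `query_length` is the length of the full, unaligned query sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    if columns == 0 or query_length <= 0:
        raise ValueError("alignment and query must be non-empty")
    # A column counts as a match only when both bases are identical and not a gap.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    aligned_query_bases = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / columns
    coverage = 100.0 * aligned_query_bases / query_length
    return identity, coverage


if __name__ == "__main__":
    ident, cover = identity_and_coverage("ACGT-ACGT", "ACGTTACGA", query_length=10)
    print(f"identity={ident:.2f}% coverage={cover:.2f}%")
```

Reporting both numbers matters because a high identity over a short aligned stretch is weaker evidence than the same identity at near-complete coverage.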


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Chun-Tien Chang ◽  
Chi-Neu Tsai ◽  
Chuan Yi Tang ◽  
Chun-Houh Chen ◽  
Jang-Hau Lian ◽  
...  

The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for heterozygous base-calling fluorescence chromatogram data analysis. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format using comparisons with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papilloma virus (HPV) genotypes by searching current viral databases in cases of double infections; and (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3.


2020 ◽  
Vol 30 (8) ◽  
pp. 1154-1169
Author(s):  
Kiran V. Garimella ◽  
Zamin Iqbal ◽  
Michael A. Krause ◽  
Susana Campino ◽  
Mihir Kekre ◽  
...  

2019 ◽  
Vol 11 (4) ◽  
pp. 88
Author(s):  
Bhaba Amatya

The present study is the first of its kind to use DNA barcoding to identify a fish species from Phewa Lake, Nepal, and to determine its relationships. Mitochondrial DNA from two ethanol-preserved fish samples, randomly collected from Phewa Lake, was extracted using the GeneAll Exgene™ tissue extraction kit. A 650 base pair fragment of mitochondrial cytochrome c oxidase subunit 1 (CO1) was amplified using a cocktail of four primers and was sequenced bidirectionally using the Sanger sequencing method. The DNA sequences were edited using AliView software. The sequences confirmed the species as Chagunius chagunio: their alignment with 16 reference sequences belonging to Chagunius chagunio in NCBI GenBank scored the highest Query Cover (75% to 100%) and Percentage Identity (97.29% to 100%). MEGA software was used to analyse the DNA sequences and obtain their corresponding protein sequences. The DNA sequences were submitted to GenBank and accession numbers (MN087472 and MN087473) were obtained. Clustal Omega was used for multiple sequence alignment of 19 homologous DNA sequences of Chagunius chagunio from India, Bangladesh and Phewa Lake, Nepal. The percentage of similarity among the aligned sequences was calculated as 39.3%. Based on the neighbour-joining tree, the Chagunius chagunio of Phewa Lake is closely related to the Chagunius chagunio of Bangladesh.


Author(s):  
David P. Bazett-Jones ◽  
Mark L. Brown

A multisubunit RNA polymerase enzyme is ultimately responsible for transcription initiation and elongation of RNA, but recognition of the proper start site by the enzyme is regulated by general, temporal and gene-specific trans-factors interacting at promoter and enhancer DNA sequences. To understand the molecular mechanisms which precisely regulate the transcription initiation event, it is crucial to elucidate the structure of the transcription factor/DNA complexes involved. Electron spectroscopic imaging (ESI) provides the opportunity to visualize individual DNA molecules. Enhancement of DNA contrast with ESI is accomplished by imaging with electrons that have interacted with inner shell electrons of phosphorus in the DNA backbone. Phosphorus detection at this intermediately high level of resolution (≈1 nm) permits selective imaging of the DNA, to determine whether the protein factors compact, bend or wrap the DNA. Simultaneously, mass analysis and phosphorus content can be measured quantitatively, using adjacent DNA or tobacco mosaic virus (TMV) as mass and phosphorus standards. These two parameters provide stoichiometric information relating the ratios of protein:DNA content.


Author(s):  
Barbara Trask ◽  
Susan Allen ◽  
Anne Bergmann ◽  
Mari Christensen ◽  
Anne Fertitta ◽  
...  

Using fluorescence in situ hybridization (FISH), the positions of DNA sequences can be discretely marked with a fluorescent spot. The efficiency of marking DNA sequences of the size cloned in cosmids is 90-95%, and the fluorescent spots produced after FISH are ≈0.3 μm in diameter. Sites of two sequences can be distinguished using two-color FISH. Different reporter molecules, such as biotin or digoxigenin, are incorporated into DNA sequence probes by nick translation. These reporter molecules are labeled after hybridization with different fluorochromes, e.g., FITC and Texas Red. The development of dual band pass filters (Chroma Technology) allows these fluorochromes to be photographed simultaneously without registration shift.

