Calibrating seed-based alignment heuristics with Sesame

2019
Author(s): Guillaume J. Filion, Ruggero Cortini, Eduard Zorita

Abstract: The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance, but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip-seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generating function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.
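
As a rough illustration of the kind of quantity this theory targets, the sketch below computes, for the simplest case of exact seeds, the probability that a read contains no error-free stretch long enough to serve as a seed. It assumes independent, uniform per-base errors and ignores duplicated sequences in the genome. The function name and parameters are hypothetical, the computation is a plain dynamic program rather than the generating-function approach described in the abstract, and nothing here reflects the Sesame API.

```python
def prob_no_exact_seed(read_len, seed_len, error_rate):
    """Probability that a read has no run of `seed_len` consecutive correct bases.

    Assumption (not from the source): each base is wrong independently
    with probability `error_rate`.
    """
    match_rate = 1.0 - error_rate
    # state[j]: probability that no seed has been found so far and the
    # prefix read so far ends with exactly j consecutive correct bases.
    state = [0.0] * seed_len
    state[0] = 1.0
    for _ in range(read_len):
        new_state = [0.0] * seed_len
        # An erroneous base resets the current run of matches to zero.
        new_state[0] = error_rate * sum(state)
        # A correct base extends the run by one.
        for j in range(seed_len - 1):
            new_state[j + 1] = match_rate * state[j]
        # Runs that would reach seed_len are dropped: a seed was found.
        state = new_state
    return sum(state)
```

For instance, `prob_no_exact_seed(50, 19, 0.01)` gives the chance, under these assumptions, that a 50-base read with 1% errors offers no error-free 19-mer to seed from.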

2017
Author(s): Daniel W. Bellott, Ting-Jan Cho, Jennifer F. Hughes, Helen Skaletsky, David C. Page

Abstract: Reference sequence of structurally complex regions can only be obtained through highly accurate clone-based approaches. We and others have successfully employed Single-Haplotype Iterative Mapping and Sequencing (SHIMS 1.0) to assemble structurally complex regions across the sex chromosomes of several vertebrate species and in targeted improvements to the reference sequences of human autosomes. However, SHIMS 1.0 was expensive and time-consuming, requiring the resources that only a genome center could command. Here we introduce SHIMS 2.0, an improved SHIMS protocol to allow even a small laboratory to generate high-quality reference sequence from complex genomic regions. Using a streamlined and parallelized library preparation protocol, and taking advantage of high-throughput, inexpensive, short-read sequencing technologies, a small group can sequence and assemble hundreds of clones in a week. Relative to SHIMS 1.0, SHIMS 2.0 reduces the cost and time required by two orders of magnitude, while preserving high sequencing accuracy.


2021
Vol 21 (1)
Author(s): Liuyang Fu, Qian Wang, Lina Li, Tao Lang, Junjia Guo, ...

Abstract: Background. Chromosomal variants play important roles in crop breeding and genetic research. The development of single-stranded oligonucleotide (oligo) probes simplifies the process of fluorescence in situ hybridization (FISH) and facilitates chromosomal identification in many species. Genome sequencing provides rich resources for the development of oligo probes. However, little progress has been made in peanut due to the lack of efficient chromosomal markers. Until now, the identification of chromosomal variants in peanut has remained a challenge. Results. A total of 114 new oligo probes were developed based on the genome-wide tandem repeats (TRs) identified from the reference sequences of the peanut variety Tifrunner (AABB, 2n = 4x = 40) and the diploid species Arachis ipaensis (BB, 2n = 2x = 20). These oligo probes were classified into 28 types based on their positions and overlapping signals in chromosomes. For each type, a representative oligo was selected and modified with green fluorescein 6-carboxyfluorescein (FAM) or red fluorescein 6-carboxytetramethylrhodamine (TAMRA). Two cocktails, Multiplex #3 and Multiplex #4, were developed by pooling the fluorophore conjugated probes. Multiplex #3 included FAM-modified oligo TIF-439, oligo TIF-185-1, oligo TIF-134-3 and oligo TIF-165. Multiplex #4 included TAMRA-modified oligo Ipa-1162, oligo Ipa-1137, oligo DP-1 and oligo DP-5. Each cocktail enabled the establishment of a genome map-based karyotype after sequential FISH/genomic in situ hybridization (GISH) and in silico mapping. Furthermore, we identified 14 chromosomal variants of the peanut induced by radiation exposure. A total of 28 representative probes were further chromosomally mapped onto the new karyotype. Among the probes, eight were mapped in the secondary constrictions, intercalary and terminal regions; four were B genome-specific; one was chromosome-specific; and the remaining 15 were extensively mapped in the pericentric regions of the chromosomes. Conclusions. The development of new oligo probes provides an effective set of tools which can be used to distinguish the various chromosomes of the peanut. Physical mapping by FISH reveals the genomic organization of repetitive oligos in peanut chromosomes. A genome map-based karyotype was established and used for the identification of chromosome variations in peanut following comparisons with their reference sequence positions.


2015
Vol 2015
pp. 1-10
Author(s): Krisztian Buza, Bartek Wilczynski, Norbert Dojer

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.


2019
Author(s): Eleanor Young, Heba Z. Abid, Pui-Yan Kwok, Harold Riethman, Ming Xiao

Abstract: Detailed, comprehensive knowledge of the structures of individual long-range telomere-terminal haplotypes is needed to understand their impact on telomere function, and to delineate the population structure and evolution of subtelomere regions. However, the abundance of large, evolutionarily recent segmental duplications and high levels of large structural variations have complicated both the mapping and sequence characterization of human subtelomere regions. Here, we use high-throughput optical mapping of large single DNA molecules in nanochannel arrays for 154 human genomes from 26 populations to present a comprehensive look at human subtelomere structure and variation. The results catalog many novel long-range subtelomere haplotypes and determine the frequencies and contexts of specific subtelomeric duplicons on each chromosome arm, helping to clarify the currently ambiguous nature of many specific subtelomere structures as represented in the current reference sequence (HG38). The organization and content of some duplicons in subtelomeres appear to show both chromosome arm- and population-specific trends. Based upon these trends we estimate a timeline for the spread of these duplication blocks. Author Summary. The ends of human chromosomes have caps called telomeres that are essential. These telomeres are influenced by the portions of DNA next to them, a region known as the subtelomere. We need to better understand the subtelomeric region to understand how it impacts the telomeres. This subtelomeric region is not well described in the current references. This is due to large variations in this region and portions that are repeated many times, making current sequencing technologies struggle to capture these regions. Many of these variations are evolutionarily recent. Here we use 154 different samples from the 26 geographic regions of the world to gain a better understanding of the variation in these regions. We found many new haplotypes and clarified the haplotypes existing in the current reference. We then examined population- and chromosome-specific trends.


2013
Author(s): Benjamin P. Berman, Yaping Liu, Theresa K. Kelly

Background: Nucleosome organization and DNA methylation are two mechanisms that are important for proper control of mammalian transcription, as well as epigenetic dysregulation associated with cancer. Whole-genome DNA methylation sequencing studies have found that methylation levels in the human genome show periodicities of approximately 190 bp, suggesting a genome-wide relationship between the two marks. A recent report (Chodavarapu et al., 2010) attributed this to higher methylation levels of DNA within nucleosomes. Here, we analyzed a number of published datasets and found a more compelling alternative explanation, namely that methylation levels are highest in linker regions between nucleosomes. Results: Reanalyzing the data from Chodavarapu et al. (2010), we found that nucleosome-associated methylation could be strongly confounded by known sequence-related biases of the next-generation sequencing technologies. By accounting for these biases and using an unrelated nucleosome profiling technology, NOMe-seq, we found that genome-wide methylation was actually highest within linker regions occurring between nucleosomes in multi-nucleosome arrays. This effect was consistent among several methylation datasets generated independently using two unrelated methylation assays. Linker-associated methylation was most prominent within long Partially Methylated Domains (PMDs) and the positioned nucleosomes that flank CTCF binding sites. CTCF-adjacent nucleosomes retained the correct positioning in regions completely devoid of CpG dinucleotides, suggesting that DNA methylation is not required for proper nucleosome positioning. Conclusions: The biological mechanisms responsible for DNA methylation patterns outside of gene promoters remain poorly understood. We identified a significant genome-wide relationship between nucleosome organization and DNA methylation, which can be used to more accurately analyze and understand the epigenetic changes that accompany cancer and other diseases.


2019
Vol 6 (2)
pp. 180608
Author(s): Marvin Choquet, Irina Smolina, Anusha K. S. Dhanasiri, Leocadio Blanco-Bercial, Martina Kopp, ...

Advances in next-generation sequencing technologies and the development of genome-reduced representation protocols have opened the way to genome-wide population studies in non-model species. However, species with large genomes remain challenging, hampering the development of genomic resources for a number of taxa including marine arthropods. Here, we developed a genome-reduced representation method for the ecologically important marine copepod Calanus finmarchicus (haploid genome size of 6.34 Gbp). We optimized a capture enrichment-based protocol based on 2656 single-copy genes, yielding a total of 154 087 high-quality SNPs in C. finmarchicus including 62 372 in common among the three locations tested. The set of capture probes was also successfully applied to the congeneric C. glacialis. Preliminary analyses of these markers revealed similar levels of genetic diversity between the two Calanus species, while populations of C. glacialis showed stronger genetic structure compared to C. finmarchicus. Using this powerful set of markers, we did not detect any evidence of hybridization between C. finmarchicus and C. glacialis. Finally, we propose a shortened version of our protocol, offering a promising solution for population genomics studies in non-model species with large genomes.


Hematology
2012
Vol 2012 (1)
pp. 342-349
Author(s): Gareth J. Morgan, Martin F. Kaiser

Abstract: Recent advances in multiple myeloma (MM) therapy have led to significantly longer median survival rates and some patients being cured. At the same time, our understanding of MM biology and the molecular mechanisms driving the disease is constantly improving. Next-generation sequencing technologies now allow insights into the genetic aberrations in MM at a genome-wide scale and across different developmental stages in the course of an individual tumor. This improved knowledge about MM biology needs to be rapidly translated and transformed into diagnostic and therapeutic applications to finally achieve cure in a larger proportion of patients. As a part of these translational efforts, novel drugs that inhibit oncogenic proteins overexpressed in defined molecular subgroups of the disease, such as FGFR3 and MMSET in t(4;14) MM, are currently being developed. The potential of targeted next-generation diagnostic tests to rapidly identify clinically relevant molecular subgroups is being evaluated. The technical tools to detect and define tumor subclones may potentially become clinically relevant because intraclonal tumor heterogeneity has become apparent in many cancers. The emergence of different MM subclones under the selective pressure of treatment is important in MM, especially in the context of maintenance therapy and treatment for asymptomatic stages of the disease. Finally, novel diagnostic and therapeutic achievements have to be implemented into innovative clinical trial strategies with smaller trials for molecularly defined high-risk patients and large trials with a long follow-up for the patients most profiting from the current treatment protocols. These combined approaches will hopefully transform the current one-for-all care into a more tailored, individual therapeutic strategy for MM patients.


2014
Vol 24 (04)
pp. 697-717
Author(s): Ryszard Rudnicki, Jerzy Tiuryn

We consider a probabilistic model of genome evolution. We are interested in size distribution of gene families. The model is based on three fundamental evolutionary events: gene loss, duplication and accumulated change. We assume that the probability of gene loss and duplication is constant and the probability of gene mutation m_i depends on the size i of a family. We prove that size distribution of paralogous gene families in a genome converges to the equilibrium as time goes to infinity. Moreover, we show how this equilibrium depends on the sequence (m_i). Theoretical results are compared with the available genomic data.


2016
Vol 229 (2)
pp. R43-R56
Author(s): Koen D Flach, Wilbert Zwart

The advent of genome-wide transcription factor profiling has revolutionized the field of breast cancer research. Estrogen receptor α (ERα), the major drug target in hormone receptor-positive breast cancer, has been known as a key transcriptional regulator in tumor progression for over 30 years. Even though this function of ERα is heavily exploited and widely accepted as an Achilles heel for hormonal breast cancer, only in the last decade have we been able to understand how this transcription factor functions on a genome-wide scale. Initial ChIP-on-chip (chromatin immunoprecipitation coupled with tiling array) analyses have taught us that ERα is an enhancer-associated factor binding to many thousands of sites throughout the human genome and revealed the identity of a number of directly interacting transcription factors that are essential for ERα action. More recently, with the development of massively parallel sequencing technologies and refinements thereof in sample processing, a genome-wide interrogation of ERα has become feasible and affordable with unprecedented data quality and richness. These studies have revealed numerous additional biological insights into ERα behavior in cell lines and especially in clinical specimens. Therefore, what have we actually learned during this first decade of cistromics in breast cancer and where may future developments in the field take us?


2020
Author(s): Lauren Coombe, Vladimir Nikolić, Justin Chu, Inanc Birol, René L. Warren

Abstract: Summary. The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies still do not achieve reference-grade quality. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 min, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory. Availability and implementation. ntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]
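
To make the idea of an ordered minimizer sketch concrete, here is a minimal illustration, not ntJoin's implementation. It assumes lexicographic ordering of k-mers (real tools typically order by hash value), ignores reverse complements, and uses a hypothetical function name with parameters `k` (k-mer length) and `w` (window size, counted in k-mers).

```python
def minimizer_sketch(seq, k, w):
    """Return the ordered list of (position, k-mer) minimizers of `seq`.

    One minimizer is chosen per window of `w` consecutive k-mers.
    Assumption (not from the source): the smallest k-mer is chosen
    lexicographically, with ties broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = []
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        pos = min(window, key=lambda i: (kmers[i], i))
        # Adjacent windows often pick the same k-mer; record it once.
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch


if __name__ == "__main__":
    for pos, kmer in minimizer_sketch("ACGTACGT", k=3, w=3):
        print(pos, kmer)
```

Sequences that share long stretches tend to share minimizers in the same order, which is what makes such sketches usable as a lightweight alternative to full alignment.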

