Floating search methodology for combining classification models for site recognition in DNA sequences

2018 ◽  
Author(s):  
Javier Pérez-Rodríguez ◽  
Aida de Haro-García ◽  
Nicolás García-Pedrajas

Abstract Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. It is also a fundamental step in gene structure prediction in the most powerful programs. The best approaches to this type of recognition use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary, as it is unlikely that a single classifier can solve this type of problem with the best possible performance.

A major issue is that the number of possible models to combine is large, and using all of them is impractical. In this paper, we present a framework based on floating search for combining as many classifiers as needed for the recognition of any functional site of a gene. The methodology can be used for the recognition of translation initiation sites, donor and acceptor splice sites and stop codons. Furthermore, we can combine any number of classifiers trained on any species. The method is also scalable to large datasets, as shown in experiments in which the whole human genome is used, and it is applicable to other recognition tasks.

We present experiments on the recognition of these four functional sites in the human genome, which is used as the target genome, with another 20 species used as sources of evidence. In a thorough evaluation, the proposed methodology shows significant improvement over state-of-the-art methods. The proposed method also improves on heuristic selection of the species used as sources of evidence, as the search finds the most useful datasets.

Author summary: In this paper we present a methodology for combining many sources of information to recognize some of the most important functional sites in a genomic sequence. The functional sites of the sequences, such as translation initiation sites, acceptor and donor splice sites and stop codons, play a very relevant role in many bioinformatics tasks. Their accurate recognition is an important task by itself and also as part of gene structure prediction programs.

Our approach uses a methodology usually termed "floating search" in computer science. This is a powerful heuristic applicable when the cost of evaluating each possible solution is high. The methodology is applied to the recognition of four different functional sites in the human genome, using the annotated genomes of twenty other species as additional sources of evidence.

The results show an advantage of the proposed method and also challenge the standard assumption that only genomes neither very close to nor very far from the human genome should be used to improve the recognition of functional sites in the human genome.
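
As the term is commonly used in feature selection, floating search alternates a greedy forward step (add the single most helpful item) with conditional backward steps (drop an item whenever doing so beats the best subset of that smaller size seen so far). The following is a minimal generic sketch under that assumption, not the authors' implementation: `floating_search`, `score`, `toy_score` and the toy weights are hypothetical names introduced here, and `score` stands in for whatever validation measure would be used to evaluate a combination of classifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def floating_search(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    max_size: int,
) -> Tuple[float, List[str]]:
    """Generic sequential floating forward selection (illustrative sketch)."""
    selected: List[str] = []
    # Best (score, subset) found so far for each subset size.
    best: Dict[int, Tuple[float, List[str]]] = {0: (score([]), [])}

    while len(selected) < max_size:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break

        # Forward step: add the candidate that gives the highest score.
        to_add = max(remaining, key=lambda c: score(selected + [c]))
        selected = selected + [to_add]
        size, value = len(selected), score(selected)
        if size not in best or value > best[size][0]:
            best[size] = (value, list(selected))

        # Floating backward steps: remove items while this strictly improves
        # on the best subset previously recorded for the smaller size.
        while len(selected) > 2:
            to_drop = max(
                selected,
                key=lambda c: score([x for x in selected if x != c]),
            )
            reduced = [x for x in selected if x != to_drop]
            reduced_value = score(reduced)
            if reduced_value > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_value, list(reduced))
            else:
                break

    # Return the best-scoring subset over all sizes visited.
    return max(best.values(), key=lambda pair: pair[0])


# Toy usage: each hypothetical "model" contributes a fixed integer gain.
weights = {"a": 5, "b": 3, "c": 3, "d": -1}


def toy_score(subset: List[str]) -> int:
    return sum(weights[name] for name in subset)


best_value, best_subset = floating_search(list(weights), toy_score, max_size=4)
```

In a real setting each call to `score` would involve training or evaluating a combined classifier, which is why a heuristic that limits the number of evaluated subsets is attractive.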

Author(s):  
Rafał Biedrzycki ◽  
Jarosław Arabas

Abstract This paper presents an application of methods from the machine learning domain to solving the task of DNA sequence recognition. We present an algorithm that learns to recognize groups of DNA sequences sharing common features such as sequence functionality. We demonstrate application of the algorithm to find splice sites, i.e., to properly detect donor and acceptor sequences. We compare the results with those of reference methods that have been designed and tuned to detect splice sites. We also show how to use the algorithm to find a human readable model of the IRE (Iron-Responsive Element) and to find IRE sequences. The method, although universal, yields results which are of quality comparable to those obtained by reference methods. In contrast to reference methods, this approach uses models that operate on sequence patterns, which facilitates interpretation of the results by humans.


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Leonhard Wachutka ◽  
Livia Caizzi ◽  
Julien Gagneur ◽  
Patrick Cramer

RNA splicing is an essential part of eukaryotic gene expression. Although the mechanism of splicing has been extensively studied in vitro, in vivo kinetics for the two-step splicing reaction remain poorly understood. Here, we combine transient transcriptome sequencing (TT-seq) and mathematical modeling to quantify RNA metabolic rates at donor and acceptor splice sites across the human genome. Splicing occurs in the range of minutes and is limited by the speed of RNA polymerase elongation. Splicing kinetics strongly depends on the position and nature of nucleotides flanking splice sites, and on structural interactions between unspliced RNA and small nuclear RNAs in spliceosomal intermediates. Finally, we introduce the ‘yield’ of splicing as the efficiency of converting unspliced to spliced RNA and show that it is highest for mRNAs and independent of splicing kinetics. These results lead to quantitative models describing how splicing rates and yield are encoded in the human genome.


2019 ◽  
Vol 63 (6) ◽  
pp. 757-771 ◽  
Author(s):  
Claire Francastel ◽  
Frédérique Magdinier

Abstract Despite the tremendous progress made in recent years in assembling the human genome, tandemly repeated DNA elements remain poorly characterized. These sequences account for the vast majority of methylated sites in the human genome and their methylated state is necessary for this repetitive DNA to function properly and to maintain genome integrity. Furthermore, recent advances highlight the emerging role of these sequences in regulating the functions of the human genome and its variability during evolution, among individuals, or in disease susceptibility. In addition, a number of inherited rare diseases are directly linked to the alteration of some of these repetitive DNA sequences, either through changes in the organization or size of the tandem repeat arrays or through mutations in genes encoding chromatin modifiers involved in the epigenetic regulation of these elements. Although largely overlooked so far in the functional annotation of the human genome, satellite elements play key roles in its architectural and topological organization. This includes functions as boundary elements delimitating functional domains or assembly of repressive nuclear compartments, with local or distal impact on gene expression. Thus, the consideration of satellite repeats organization and their associated epigenetic landmarks, including DNA methylation (DNAme), will become unavoidable in the near future to fully decipher human phenotypes and associated diseases.


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract

Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size.

Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines.

Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK

Supplementary information: Supplementary data are available at Bioinformatics online.
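
To make the idea of "independent counting operations over the possible mismatch positions" concrete, here is a naive sketch of a gapped k-mer kernel: for every choice of `k` informative positions inside a window of length `g`, it counts pairs of windows from the two sequences that agree on those positions. A second function estimates the same quantity by sampling position sets at random. This is an illustration only and does not use or reproduce the FastSK package or its API; the function names and the parameters `g`, `k`, `n_samples` and `seed` are assumptions introduced here.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Tuple


def gapped_counts(seq: str, g: int, positions: Tuple[int, ...]) -> Counter:
    """Count windows of length g, keeping only the characters at `positions`."""
    return Counter(
        tuple(seq[start + p] for p in positions)
        for start in range(len(seq) - g + 1)
    )


def gkm_kernel(x: str, y: str, g: int, k: int) -> int:
    """Exact kernel: sum one independent counting pass per position set."""
    total = 0
    for positions in combinations(range(g), k):
        counts_x = gapped_counts(x, g, positions)
        counts_y = gapped_counts(y, g, positions)
        # Matching pairs of windows = product of counts for each shared key.
        total += sum(n * counts_y[key] for key, n in counts_x.items())
    return total


def gkm_kernel_sampled(
    x: str, y: str, g: int, k: int, n_samples: int, seed: int = 0
) -> float:
    """Monte Carlo estimate: average over randomly sampled position sets."""
    rng = random.Random(seed)
    all_positions = list(combinations(range(g), k))
    acc = 0
    for _ in range(n_samples):
        positions = rng.choice(all_positions)
        counts_x = gapped_counts(x, g, positions)
        counts_y = gapped_counts(y, g, positions)
        acc += sum(n * counts_y[key] for key, n in counts_x.items())
    # Scale the sample mean by the number of position sets.
    return acc / n_samples * len(all_positions)
```

The exact version enumerates every position set, so its cost grows quickly with `g`; the sampled version trades exactness for a fixed number of counting passes.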


2016 ◽  
Author(s):  
Shaun D Jackman ◽  
Benjamin P Vandervalk ◽  
Hamid Mohamadi ◽  
Justin Chu ◽  
Sarah Yeo ◽  
...  

Abstract The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically. Coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, this makes the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes timely.

With ABySS 1.0, we originally showed that assembling the human genome using short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements.

We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today’s standards that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.
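
The idea of representing a de Bruijn graph with a Bloom filter can be sketched as follows: the k-mers of the reads are inserted into a bit array through several hash functions, and the graph's edges are recovered implicitly by asking the filter whether each of the four possible one-base extensions of a k-mer is present. This is a toy illustration, not the ABySS implementation; `BloomFilter`, `build_kmer_filter`, `successors` and the sizes used are assumptions made for the example. A Bloom filter never reports an inserted item as absent but can report false positives, so a real assembler needs additional steps to deal with spurious edges.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Fixed-size bit array queried through several salted hash functions."""

    def __init__(self, n_bits: int, n_hashes: int) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _indexes(self, item: str) -> Iterator[int]:
        # One salted BLAKE2b digest per hash function.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=i.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item)
        )


def build_kmer_filter(
    seq: str, k: int, n_bits: int = 1024, n_hashes: int = 3
) -> BloomFilter:
    """Insert every k-mer of `seq` into a new Bloom filter."""
    bloom = BloomFilter(n_bits, n_hashes)
    for start in range(len(seq) - k + 1):
        bloom.add(seq[start:start + k])
    return bloom


def successors(bloom: BloomFilter, kmer: str) -> List[str]:
    """Implicit de Bruijn edges: extensions of `kmer` the filter reports present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]
```

Because only the bit array is stored, memory use is governed by `n_bits` rather than by the number or length of the k-mers themselves.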


2021 ◽  
Author(s):  
Payam Kelich ◽  
Sanghwa Jeong ◽  
Nicole Navarro ◽  
Jaquesta Adams ◽  
Xiaoqi Sun ◽  
...  

Abstract DNA-wrapped single-walled carbon nanotube (SWNT) conjugates have remarkable optical properties leading to their use in biosensing and imaging applications. A critical limitation in the development of DNA-SWNT sensors is the current inability to predict unique DNA sequences that confer a strong analyte-specific optical response to these sensors. Here, near-infrared (nIR) fluorescence response datasets for ~100 DNA-SWNT conjugates, narrowed down by a selective evolution protocol starting from a pool of ~10^10 unique DNA-SWNT candidates, are used to train machine learning (ML) models to predict new unique DNA sequences with strong optical response to the neurotransmitter serotonin. First, classifier models based on convolutional neural networks (CNN) are trained on sequence features to classify DNA ligands as either high response or low response to serotonin. Second, support vector machine (SVM) regression models are trained to predict relative optical response values for DNA sequences. Finally, we demonstrate with validation experiments that integrating the predictions of ensembles of the highest quality CNN classifiers and SVM regression models leads to the best predictions of both high and low response sequences. With our ML approaches, we discovered five new DNA-SWNT sensors with higher fluorescence intensity response to serotonin than obtained previously. Overall, the explored ML approaches introduce an important new tool to predict useful DNA sequences, which can be used for discovery of new DNA-based sensors and nanobiotechnologies.

