CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification

Gold-Aptamer-Nanoconstructs Engineered to Detect Conserved Enteroviral Nucleic Acid Sequences

10.26434/chemrxiv.8312324.v1 ◽

2019 ◽

Author(s):

Veeren Chauhan ◽

Mohamed M Elsutohy ◽

C Patrick McClure ◽

Will Irving ◽

Neil Roddis ◽

...

Keyword(s):

Nucleic Acid ◽

In Silico ◽

Point Of Care ◽

Lateral Flow ◽

Nucleic Acid Sequence ◽

In Silico Screening ◽

Lateral Flow Assays ◽

Life Threatening ◽

Software And Hardware ◽

Nucleic Acid Sequences

Enteroviruses are ubiquitous mammalian pathogens that can produce mild to life-threatening disease. Bearing this in mind, we have developed a rapid, accurate and economical point-of-care biosensor that can detect nucleic acid sequences conserved amongst 96% of all known enteroviruses. The biosensor harnesses the physicochemical properties of gold nanoparticles and aptamers to provide colourimetric, spectroscopic and lateral flow-based identification of an exclusive enteroviral RNA sequence (23 bases), which was identified through in silico screening. Aptamers were designed to demonstrate specific complementarity towards the target enteroviral RNA to produce aggregated gold-aptamer nanoconstructs. The conserved target enteroviral nucleic acid sequence (≥1 × 10⁻⁷ M, ≥1.4 × 10⁻¹⁴ g/mL) initiates gold-aptamer nanoconstruct disaggregation and a signal transduction mechanism, producing a colourimetric and spectroscopic blueshift from 544 nm (purple) to 524 nm (red). Furthermore, lateral-flow assays that utilise gold-aptamer nanoconstructs were unaffected by contaminating human genomic DNA, demonstrated rapid detection of the conserved target enteroviral nucleic acid sequence (<60 s), and could be interpreted with a bespoke software and hardware electronic interface. We anticipate that our methodology will translate in silico screening of nucleic acid databases to a tangible enteroviral desktop detector, which could be readily translated to related organisms. This will pave the way forward in the clinical evaluation of disease and complement existing strategies to overcome antimicrobial resistance.

NASCUP: Nucleic Acid Sequence Classification by Universal Probability

IEEE Access ◽

10.1109/access.2021.3127957 ◽

2021 ◽

pp. 1-1

Author(s):

Sunyoung Kwon ◽

Gyuwan Kim ◽

Byunghan Lee ◽

Jongsik Chun ◽

Sungroh Yoon ◽

...

Keyword(s):

Nucleic Acid ◽

Nucleic Acid Sequence ◽

Sequence Classification

Reproducible evaluation of classification methods in Alzheimer’s disease: framework and application to MRI and PET data

10.1101/274324 ◽

2018 ◽

Author(s):

Jorge Samper-González ◽

Ninon Burgos ◽

Simona Bottani ◽

Sabrina Fontanella ◽

Pascal Lu ◽

...

Keyword(s):

Alzheimer's Disease ◽

Feature Extraction ◽

FDG PET ◽

Large Scale ◽

Extraction Methods ◽

Classification Performance ◽

Evaluation Framework ◽

Classification Methods ◽

Standard Format

A large number of papers have introduced novel machine learning and feature extraction methods for automatic classification of Alzheimer’s disease (AD). However, while the vast majority of these works use the public dataset ADNI for evaluation, they are difficult to reproduce because key components of the validation are often not readily available. These components include the selected participants and input data, image preprocessing and cross-validation procedures. The performance of the different approaches is also difficult to compare objectively. In particular, it is often difficult to assess which part of the method (e.g. preprocessing, feature extraction or classification algorithms) provides a real improvement, if any. In the present paper, we propose a framework for reproducible and objective classification experiments in AD using three publicly available datasets (ADNI, AIBL and OASIS). The framework comprises: i) automatic conversion of the three datasets into a standard format (BIDS); ii) a modular set of preprocessing pipelines, feature extraction and classification methods, together with an evaluation framework, that provide a baseline for benchmarking the different components. We demonstrate the use of the framework for a large-scale evaluation on 1960 participants using T1 MRI and FDG PET data. In this evaluation, we assess the influence of different modalities, preprocessing, feature types (regional or voxel-based features), classifiers, training set sizes and datasets. Performance was in line with the state of the art. FDG PET outperformed T1 MRI for all classification tasks. No difference in performance was found for the use of different atlases, image smoothing, partial volume correction of FDG PET images, or feature type. Linear SVM and L2-logistic regression resulted in similar performance and both outperformed random forests. The classification performance increased along with the number of subjects used for training. Classifiers trained on ADNI generalized well to AIBL and OASIS, performing better than the classifiers trained and tested on each of these datasets independently. All the code of the framework and the experiments is publicly available.

Classification of Hepatitis Viruses from Sequencing Chromatograms Using Multiscale Permutation Entropy and Support Vector Machines

Entropy ◽

10.3390/e21121149 ◽

2019 ◽

Vol 21 (12) ◽

pp. 1149

Author(s):

Ersoy Öz ◽

Öyküm Esra Aşkın

Keyword(s):

Feature Extraction ◽

Support Vector Machines ◽

Nucleic Acid ◽

Extraction Methods ◽

Classification Performance ◽

Vital Role ◽

Kernel Functions ◽

Permutation Entropy ◽

Support Vector ◽

Vector Machines

Classifying nucleic acid trace files is an important problem in molecular biology research. Classification performance depends heavily on which features are used to represent the properties of the trace files and which classifier is applied to them. In this study, different feature extraction methods based on statistics and entropy theory are used to discriminate deoxyribonucleic acid chromatograms, whose signals are almost impossible to distinguish visually. The extracted features are used as the input feature set for Support Vector Machine (SVM) classifiers with different kernel functions. The proposed framework is applied to a total of 200 hepatitis nucleic acid trace files from Hepatitis B Virus (HBV) and Hepatitis C Virus (HCV). Statistical feature extraction methods represent the properties of the hepatitis nucleic acid trace files with descriptive measures such as mean, median and standard deviation, whereas entropy-based feature extraction methods, including permutation entropy and multiscale permutation entropy, quantify the complexity of these files. The results indicate that using statistical and entropy-based features produces exceptionally high accuracy (nearly 99%) in classifying HBV and HCV.
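The entropy-based features mentioned in this abstract can be illustrated with a minimal sketch of permutation entropy and its multiscale variant. This is not the authors' code: the function names, the default `order` and `delay` values, and the use of non-overlapping averaging for coarse-graining are assumptions chosen for illustration, following the commonly used ordinal-pattern definition.

```python
import math
from collections import Counter


def permutation_entropy(signal, order=3, delay=1, normalize=True):
    """Shannon entropy of the ordinal patterns found in `signal`.

    `order` (pattern length) and `delay` (sample spacing) are illustrative
    defaults, not values taken from the paper.
    """
    n_windows = len(signal) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("signal too short for the chosen order and delay")
    patterns = Counter()
    for start in range(n_windows):
        window = [signal[start + k * delay] for k in range(order)]
        # Ordinal pattern: the index order that sorts the window
        # (ties are broken by position because sorted() is stable).
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        patterns[pattern] += 1
    entropy = -sum(
        (count / n_windows) * math.log(count / n_windows)
        for count in patterns.values()
    )
    if normalize:
        # Divide by the maximum possible entropy, log(order!), to land in [0, 1].
        entropy /= math.log(math.factorial(order))
    return entropy


def coarse_grain(signal, scale):
    """Average consecutive non-overlapping blocks of length `scale`."""
    n_blocks = len(signal) // scale
    return [
        sum(signal[i * scale:(i + 1) * scale]) / scale for i in range(n_blocks)
    ]


def multiscale_permutation_entropy(signal, order=3, delay=1, max_scale=3):
    """Permutation entropy of the coarse-grained signal at scales 1..max_scale."""
    return [
        permutation_entropy(coarse_grain(signal, scale), order, delay)
        for scale in range(1, max_scale + 1)
    ]
```

In a pipeline like the one described, the values returned for each chromatogram channel would be collected into a feature vector and passed to a classifier; a strictly increasing signal yields an entropy of zero, while an irregular signal approaches one after normalization.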

A Comparative Analysis of Time-frequency Feature Extraction Techniques for Large Scale Electroencephalogram Data

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2021/031012021 ◽

2021 ◽

Vol 10 (1) ◽

pp. 14-24

Keyword(s):

Feature Extraction ◽

Large Scale ◽

Extraction Methods ◽

Kernel Functions ◽

The Body ◽

Support Vector ◽

Discrete Wavelet ◽

EEG Signals ◽

Time Frequency ◽

The Impact

Recognition of human emotions is a fascinating research field that motivates many researchers to use various approaches, such as facial expression, speech or body gesture. Electroencephalogram (EEG) is another approach to recognizing human emotion through brain signals and has offered promising findings. Although EEG signals provide detailed information on human emotional states, the analysis of their non-linear and chaotic characteristics is a substantial problem. The main challenge remains in analyzing EEG signals to extract relevant features in order to achieve optimum classification performance. Various feature extraction methods have been developed by researchers, which can mainly be categorized as time, frequency or time-frequency based feature extraction methods. Yet, there are numerous settings that could affect the performance of any model. In this paper, we investigated the performance of Discrete Wavelet Transform (DWT) and Discrete Wavelet Packet Transform (DWPT), which are time-frequency domain methods, using Support Vector Machine (SVM) and k-Nearest Neighbor (KNN) classification techniques. Different SVM kernel functions and KNN distance metrics are tested in this study using subject-dependent and subject-independent approaches. The experiment is implemented using the publicly available DEAP dataset. The experimental results show that, in the subject-dependent approach, DWT is best suited to the weighted KNN classifier, while DWPT gave better results with the linear SVM classifier. Consistent results are observed for DWT-KNN in the subject-independent approach; however, SVM works better with quadratic kernel functions. These results indicate that further investigation is needed to examine the impact of different method settings when analyzing large-scale EEG data.

Feature extraction methods for consistent spatio-temporal image sequence classification using hidden Markov models

1997 IEEE International Conference on Acoustics, Speech, and Signal Processing ◽

10.1109/icassp.1997.595394 ◽

2002 ◽

Cited By ~ 5

Author(s):

P. Morguet ◽

M. Lang

Keyword(s):

Feature Extraction ◽

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Image Sequence ◽

Extraction Methods ◽

Sequence Classification ◽

Spatio Temporal

Gold–Oligonucleotide Nanoconstructs Engineered to Detect Conserved Enteroviral Nucleic Acid Sequences

Biosensors ◽

10.3390/bios11070238 ◽

2021 ◽

Vol 11 (7) ◽

pp. 238

Author(s):

Veeren M. Chauhan ◽

Mohamed M. Elsutohy ◽

C. Patrick McClure ◽

William L. Irving ◽

Neil Roddis ◽

...

Keyword(s):

Nucleic Acid ◽

In Silico ◽

Point Of Care ◽

Lateral Flow ◽

Nucleic Acid Sequence ◽

In Silico Screening ◽

Lateral Flow Assays ◽

Life Threatening ◽

Software And Hardware ◽

Nucleic Acid Sequences

Enteroviruses are ubiquitous mammalian pathogens that can produce mild to life-threatening disease. We developed a multimodal, rapid, accurate and economical point-of-care biosensor that can detect nucleic acid sequences conserved amongst 96% of all known enteroviruses. The biosensor harnesses the physicochemical properties of gold nanoparticles and oligonucleotides to provide colourimetric, spectroscopic and lateral flow-based identification of an exclusive enteroviral nucleic acid sequence (23 bases), which was identified through in silico screening. Oligonucleotides were designed to demonstrate specific complementarity towards the target enteroviral nucleic acid to produce aggregated gold–oligonucleotide nanoconstructs. The conserved target enteroviral nucleic acid sequence (≥1 × 10⁻⁷ M, ≥1.4 × 10⁻¹⁴ g/mL) initiates gold–oligonucleotide nanoconstruct disaggregation and a signal transduction mechanism, producing a colourimetric and spectroscopic blueshift from 544 nm (purple) to 524 nm (red). Furthermore, lateral-flow assays that utilise gold–oligonucleotide nanoconstructs were unaffected by contaminating human genomic DNA, demonstrated rapid detection of conserved target enteroviral nucleic acid sequence (<60 s), and could be interpreted with a bespoke software and hardware electronic interface. We anticipate that our methodology will translate in silico screening of nucleic acid databases to a tangible enteroviral desktop detector, which could be readily translated to related organisms. This will pave the way forward in the clinical evaluation of disease and complement existing strategies to overcome antimicrobial resistance.

An alignment method for nucleic acid sequences against annotated genomes

10.1101/200394 ◽

2017 ◽

Cited By ~ 2

Author(s):

Koen Deforche

Keyword(s):

Amino Acid ◽

Nucleic Acid ◽

Reference Genome ◽

Query Sequence ◽

Amino Acid Sequences ◽

Alignment Score ◽

Nucleic Acid Sequence ◽

Coding Sequences ◽

Divergent Sequences ◽

Nucleic Acid Sequences

Motivation: Alignment of biological sequences is fundamental to their further interpretation. Current alignment algorithms typically align either nucleic acid or amino acid sequences. Using only nucleic acid sequence similarity, divergent sequences cannot be aligned reliably because of the limited alphabet and genetic saturation. To align divergent coding nucleic acid sequences, one can align using the translated amino acid sequences. This requires the detection of the correct open reading frame, is prone to eventual frame shift errors, and typically requires the treatment of genes separately. It was our motivation to design a nucleic acid sequence alignment algorithm to align a nucleic acid sequence against a (reference) genome sequence, that works equally well for similar and divergent sequences, and produces an optimal alignment considering simultaneously the alignment of all annotated coding sequences. Results: We define a genome alignment score for evaluating the quality of an alignment of a nucleic acid query sequence against a reference genome sequence, for which coding sequence features have been annotated (for example in a GenBank record). The genome alignment score combines the affine gap score for the nucleic acid sequence with an affine gap score for all amino acid alignments resulting from coding sequences in open reading frames contained within the query sequence. We present a Dynamic Programming algorithm to compute the optimal global or local alignment using this genomic alignment score and provide a formal proof of correctness. This algorithm allows the alignment of nucleic acid sequences from closely related and highly divergent sequences within the same software and using the same parameters, automatically correcting any eventual frame shift errors and producing at the same time the aligned translated amino acid sequences of all relevant coding sequence features. Availability: The software is available as a web application at http://www.genomedetective.com/app/aga and as a command-line application at https://github.com/emweb/aga
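For readers unfamiliar with the affine gap score named in this abstract, the following is a minimal sketch of the nucleotide-only building block: a standard three-state dynamic program (Gotoh-style) for the optimal global alignment score. It is not the genome alignment score or the algorithm from the paper, which additionally scores the translated amino acid alignments. The function name and all scoring parameters are hypothetical values chosen for illustration.

```python
def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Optimal global alignment score of strings a and b with affine gaps.

    A gap of length L costs gap_open + (L - 1) * gap_extend.
    All scoring parameters are illustrative assumptions.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] paired to b[j-1]
    # X: alignment ends with a[i-1] opposite a gap in b
    # Y: alignment ends with b[j-1] opposite a gap in a
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap or extend the one already open.
            X[i][j] = max(
                M[i - 1][j] + gap_open,
                Y[i - 1][j] + gap_open,
                X[i - 1][j] + gap_extend,
            )
            Y[i][j] = max(
                M[i][j - 1] + gap_open,
                X[i][j - 1] + gap_open,
                Y[i][j - 1] + gap_extend,
            )
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping three matrices rather than one is what lets a long gap cost less per position than several short ones, which is the property the affine model is meant to capture.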

Seq2DFunc: 2-dimensional convolutional neural network on graph representation of synthetic sequences from massive-throughput assay

10.1101/2019.12.22.886085 ◽

2019 ◽

Author(s):

Haotian Guo ◽

Xiaohu Song ◽

Ariel B. Lindner

Keyword(s):

Neural Network ◽

Nucleic Acid ◽

Convolutional Neural Network ◽

RNA Structure ◽

Large Scale ◽

Explanatory Power ◽

Graph Representation ◽

Approach Training ◽

Dataset Size ◽

Nucleic Acid Sequences

In recent years, pipelines combining massively parallel reporter assays (MPRA) and next-generation sequencing (NGS) have provided large-scale datasets to investigate biological mechanisms in detail. However, bigger data often leads to greater complexity. As a result, theories derived from low-throughput experiments lose explanatory power, requiring new methods to create predictive models. Here we focus on modeling functions of nucleic acid sequences, as a case study of massive-throughput assays. We report a deep learning approach, training a two-dimensional convolutional neural network (CNN) on an ordered graph representation of nucleic acid sequences to predict their functions (Seq2DFunc). To compare the performance of Seq2DFunc with conventional methods, we obtained a customized database from a CRISPR RNA processing assay. For this specific assay, analyses of sequence and RNA structure determinants failed to explain the results regardless of dataset size. One-dimensional CNNs trained on raw sequences generally failed to converge with fewer than 10,000 sequences. By contrast, Seq2DFunc trained on ∼7,000 sequences still provided 86% accuracy. Given a sufficient dataset (∼120,000 sequences) for training, Seq2DFunc (96% accuracy, 0.93 F1-score) still outperformed the best 1D CNN (92% accuracy, 0.83 F1-score). We anticipate that Seq2DFunc can be a versatile downstream tool for deciphering massive-throughput assays in many fundamental studies. In addition, the ability to use smaller datasets is especially beneficial for reducing the experiment budget or required sequencing depth.
