A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences

2013 ◽  
Vol 7 ◽  
pp. BBI.S10053 ◽  
Author(s):  
Nicolas Carels ◽  
Diego Frías

In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge. Considering the protein sequences of the Protein Data Bank (RCSB PDB or more simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. We first established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of higher eukaryote genes that encode for proteins. 
Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of the genome phenotypes of plant genomes without prior knowledge. Additional care is necessary when addressing the human transcriptome due to the interference caused by large amounts of noisy sequences. UFM can be regarded as a low complexity tool for prior knowledge extraction concerning the coding fraction of the transcriptome of any eukaryote. Due to its low level of complexity, UFM is also very robust to variations of codon usage.
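As an illustration of the first two factors named in the abstract, the sketch below computes the stop codon frequency and the product of purine probabilities over the three triplet positions for one reading frame. The function names and the example sequence are hypothetical, and the abstract does not state how UFM combines its five factors into a score, so no combined score is attempted here.

```python
# Minimal sketch of UFM factors (i) and (ii) as described in the abstract.
# Function names are illustrative assumptions, not the authors' code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
PURINES = {"A", "G"}


def codons(seq, frame):
    """Return the complete triplets of seq starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def stop_codon_frequency(seq, frame):
    """Fraction of triplets in the given frame that are stop codons."""
    triplets = codons(seq, frame)
    if not triplets:
        return 0.0
    return sum(t in STOP_CODONS for t in triplets) / len(triplets)


def purine_product(seq, frame):
    """Product of the purine probabilities at triplet positions 1, 2 and 3."""
    triplets = codons(seq, frame)
    if not triplets:
        return 0.0
    product = 1.0
    for pos in range(3):
        product *= sum(t[pos] in PURINES for t in triplets) / len(triplets)
    return product


if __name__ == "__main__":
    example = "ATGGCAGAATAA"  # made-up sequence, for illustration only
    for frame in range(3):
        print(frame, stop_codon_frequency(example, frame), purine_product(example, frame))
```

Comparing these values across the three frames of a sequence shows how a frame-dependent signal can arise without any training data, which is the property the abstract emphasizes.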

2009 ◽  
Vol 3 ◽  
pp. BBI.S3030 ◽  
Author(s):  
Nicolas Carels ◽  
Diego Frías

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we call the Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method releases coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
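Factor (v) relies on the GC levels at the 2nd and 3rd codon positions (GC2 and GC3). The sketch below shows one way these levels could be computed, together with a vertical residual to a straight line. The abstract gives neither the slope and intercept of the universal correlation nor the exact definition of the distance, so both line parameters are caller-supplied assumptions and the vertical residual is only one possible choice.

```python
# Minimal sketch of GC2/GC3 levels and a residual to a regression line.
# The line parameters are NOT given in the abstract; they must be supplied.


def gc_level(seq, position, frame=0):
    """Fraction of G or C at codon position 1, 2 or 3 in the given frame."""
    seq = seq.upper()
    bases = [seq[i + position - 1] for i in range(frame, len(seq) - 2, 3)]
    if not bases:
        return 0.0
    return sum(b in ("G", "C") for b in bases) / len(bases)


def residual_to_line(gc3, gc2, slope, intercept):
    """Vertical offset of the point (gc3, gc2) from gc2 = slope * gc3 + intercept."""
    return gc2 - (slope * gc3 + intercept)


if __name__ == "__main__":
    example = "ATGGCAGAATAA"  # made-up sequence, for illustration only
    gc2 = gc_level(example, 2)
    gc3 = gc_level(example, 3)
    # slope and intercept below are placeholder values, not published ones
    print(gc2, gc3, residual_to_line(gc3, gc2, slope=0.5, intercept=0.1))
```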


2020 ◽  
Author(s):  
Michael P. Hughes ◽  
Luki Goldschmidt ◽  
David S. Eisenberg

Membraneless Organelles (MLOs) are vital and dynamic reaction centers in cells that organize metabolism in the absence of a membrane. Multivalent interactions between protein Low-Complexity Domains (LCDs) contribute to MLO organization. Our previous work used computational methods to identify structural motifs termed Low-complexity Amyloid-like Reversible Kinked Segments (LARKS) that can phase-transition to form hydrogels and are common in human proteins that participate in MLOs. Here we searched for LARKS in proteomes of six model organisms: Homo sapiens, Drosophila melanogaster, Plasmodium falciparum, Saccharomyces cerevisiae, Mycobacterium tuberculosis, and Escherichia coli. We find LARKS are abundant in M. tuberculosis, D. melanogaster, and H. sapiens, but not in S. cerevisiae or P. falciparum. Abundant LARKS require high glycine content, which enables kinks to form in LARKS as is illustrated in the known LARKS-rich amyloid structures of TDP43, FUS, and hnRNPA2, three proteins that participate in MLOs. These results support the idea of LARKS as an evolved structural motif and we offer the LARKSdb webserver which permits users to search for LARKS in their protein sequences of interest.


2009 ◽  
Vol 3 ◽  
pp. BBI.S2236 ◽  
Author(s):  
Nicolas Carels ◽  
Ramon Vidal ◽  
Diego Frías

In this report, we revisited simple features that allow the classification of coding sequences (CDS) from non-coding DNA. The spectrum of codon usage of our sequence sample is large and suggests that these features are universal. The features that we investigated combine (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine, Guanine, Adenine probabilities in 1st, 2nd, 3rd position of triplets, respectively, (iv) the product of G and C probabilities in 1st and 2nd position of triplets. These features are a natural consequence of the physico-chemical properties of proteins and their combination is successful in classifying CDS and non-coding DNA (introns) with a success rate >95% above 350 bp. The coding strand and coding frame are implicitly deduced when the sequences are classified as coding.
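The last sentence of this abstract says that strand and frame follow from the classification itself. One way to picture this is to evaluate a feature in all six reading frames (three per strand) and see where it peaks. The sketch below does so for feature (iii), the product of the C, G and A probabilities at triplet positions 1, 2 and 3. The function names are hypothetical, and using a single feature is a simplification for illustration, not the authors' combined criterion.

```python
# Minimal six-frame sketch for feature (iii); names are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def position_product(seq, frame, bases=("C", "G", "A")):
    """Product of P(bases[k] at triplet position k + 1) for k = 0, 1, 2."""
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not triplets:
        return 0.0
    product = 1.0
    for pos, base in enumerate(bases):
        product *= sum(t[pos] == base for t in triplets) / len(triplets)
    return product


def six_frame_scores(seq):
    """Map (strand, frame) to the feature (iii) product for all six frames."""
    scores = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            scores[(strand, frame)] = position_product(s, frame)
    return scores


if __name__ == "__main__":
    scores = six_frame_scores("CGACGACGA")  # made-up sequence
    print(max(scores, key=scores.get), scores)
```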


Author(s):  
Yanping Zhang ◽  
Pengcheng Chen ◽  
Ya Gao ◽  
Jianwei Ni ◽  
Xiaosheng Wang

Aim and Objective: Given the rapidly increasing volume of molecular biology data available, computational methods of low complexity are necessary to infer protein structure, function, and evolution. Method: In this work, we propose a novel method, FermatS, which extracts feature information from protein sequences based on global position information from the curve and local position representation from normalized moments of inertia. Furthermore, we use the features generated by FermatS to analyze the similarity/dissimilarity of nine ND5 proteins and to establish a prediction model of DNA-binding proteins based on logistic regression with 5-fold cross-validation. Results: In the similarity/dissimilarity analysis of nine ND5 proteins, the results are consistent with evolutionary theory. Moreover, this method can effectively predict DNA-binding proteins in realistic situations. Conclusion: The findings demonstrate that the proposed method is effective for comparing, recognizing and predicting protein sequences. The main code and datasets can be downloaded from https://github.com/GaoYa1122/FermatS.


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 371
Author(s):  
Yerin Lee ◽  
Soyoung Lim ◽  
Il-Youp Kwak

Acoustic scene classification (ASC) categorizes an audio file based on the environment in which it has been recorded. This has long been studied in the detection and classification of acoustic scenes and events (DCASE). This paper presents the solution to Task 1 of the DCASE 2020 challenge submitted by the Chung-Ang University team. Task 1 addressed two challenges that ASC faces in real-world applications. One is that audio recorded using different recording devices should be classified in general, and the other is that the model used should have low complexity. We proposed two models to overcome the aforementioned problems. First, a more general classification model was proposed by combining the harmonic-percussive source separation (HPSS) and deltas-deltadeltas features with four different models. Second, using the same feature, depthwise separable convolution was applied to the convolutional layer to develop a low-complexity model. Moreover, using gradient-weighted class activation mapping (Grad-CAM), we investigated what part of the feature our model sees and identifies. Our proposed system ranked 9th and 7th in the competition for these two subtasks, respectively.
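To see why depthwise separable convolution lowers model complexity, it helps to count weights. A standard 2D convolution with a k×k kernel has k·k·C_in·C_out weights, while the depthwise separable form has k·k·C_in (depthwise) plus C_in·C_out (pointwise). The sketch below does this arithmetic, ignoring biases. The layer sizes are hypothetical and are not taken from the paper.

```python
# Weight-count comparison; layer sizes are illustrative, not the authors' model.


def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise (kernel x kernel per channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 128  # hypothetical layer
    std = standard_conv_params(k, c_in, c_out)
    sep = separable_conv_params(k, c_in, c_out)
    print(std, sep, sep / std)
```

For this hypothetical layer the separable form needs roughly an eighth of the weights, and the ratio equals 1/C_out + 1/k².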


PLoS ONE ◽  
2018 ◽  
Vol 13 (3) ◽  
pp. e0193757 ◽  
Author(s):  
Inti Anabela Pagnuco ◽  
María Victoria Revuelta ◽  
Hernán Gabriel Bondino ◽  
Marcel Brun ◽  
Arjen ten Have

2013 ◽  
Vol 113 (suppl_1) ◽  
Author(s):  
Lu Xiao ◽  
Haiqing Bai ◽  
James Boyer ◽  
Bo Ye ◽  
Ning Hou ◽  
...  

Lu Xiao, Haiqing Bai, James Boyer, Bo Ye, Ning Hou, Haodong Xu, and Faqian Li. Department of Pathology and Laboratory Medicine and Cardiovascular Research Institute, University of Rochester Medical Center, Rochester, NY, USA. Background: Canonical Wnt signaling appears to have multiphasic and often antagonistic roles in cardiac development. The molecular mechanism for these opposing actions is not clear. We hypothesized that alternative splicing of TCF7L2, a nuclear interaction partner of beta-catenin, is involved in the specificity of canonical Wnt signaling. Methods: RT-PCR was performed on embryonic (E16.5) and neonatal (day 8) hearts with primers spanning the end of the first exon and the beginning of the last exon, and the products were cloned and sequenced. Results: A total of 18 exons have been identified so far in TCF7L2. We sequenced 56 clones, of which 53 (29 from day 8 and 24 from E16.5) contained TCF7L2 sequences. No exon 6 or exon 17 was found in TCF7L2 transcripts of mouse hearts. Most clones (more than 80%) from E16.5 and day 8 hearts excluded exon 4. Both E16.5 and day 8 hearts had one clone with an exon 9 deletion, which does not change the reading frame, and another with alterations in exon 3 that lead to a reading frame shift and a premature stop codon. As reported in other organs, there was extensive alternative splicing in the C-terminal exons 14, 15 and 16. The inclusion of exon 14 was more frequent in day 8 (18 of 29, 62%) than in E16.5 (8 of 24, 33%) hearts. The peptide encoded by exon 14 has a conserved functional motif. Additionally, this alternative exon usage can change the C-terminus of TCF7L2 to include or exclude the so-called E tail with two binding motifs for C-terminal binding protein. Conclusion: The isoform switch of TCF7L2 occurs in neonatal mouse hearts and may have a role in the terminal differentiation of cardiac myocytes during this period.


1988 ◽  
Vol 8 (8) ◽  
pp. 3439-3447 ◽  
Author(s):  
W Bajwa ◽  
T E Torchia ◽  
J E Hopper

GAL3 gene expression is required for rapid GAL4-mediated galactose induction of the galactose-melibiose regulon genes in Saccharomyces cerevisiae. Here we show by Northern (RNA) blot analysis that GAL3 gene expression is itself galactose inducible. Like the GAL1, GAL7, GAL10, and MEL1 genes, the GAL3 gene is severely glucose repressed. Like the MEL1 gene, but in contrast to the GAL1, GAL7, and GAL10 genes, GAL3 is expressed at readily detectable basal levels in cells grown in noninducing, nonrepressing media. We determined the sequence of the S. cerevisiae GAL3 gene and its 5'-noncoding region. Within the 5'-noncoding region of the GAL3 gene, we found two sequences similar to the UASGal elements of the other galactose-melibiose regulon genes. Deletion analysis indicated that only the most ATG proximal of these sequences is required for GAL3 expression. The coding region of GAL3 consists of a 1,275-base-pair open reading frame in the direction of transcription. A comparison of the deduced 425-amino-acid sequence with the protein data bank revealed three regions of striking similarity between the GAL3 protein and the GAL1-specified galactokinase of Saccharomyces carlsbergensis. One of these regions also showed striking similarity to sequences within the galactokinase protein of Escherichia coli. On the basis of these protein sequence similarities, we propose that the GAL3 protein binds a molecule identical to or structurally related to one of the substrates or products of the galactokinase-catalyzed reaction.

