The prevalence and distribution in genomes of low-complexity, amyloid-like, reversible, kinked segment (LARKS), a common structural motif in amyloid-like fibrils

Mapping Intimacies ◽

10.1101/2020.12.08.415679 ◽

2020 ◽

Author(s):

Michael P. Hughes ◽

Luki Goldschmidt ◽

David S. Eisenberg

Keyword(s):

Phase Transition ◽

Homo Sapiens ◽

Protein Sequences ◽

Low Complexity ◽

Reaction Centers ◽

Structural Motif ◽

Model Organisms ◽

Dynamic Reaction ◽

Human Proteins ◽

Glycine Content

Abstract: Membraneless Organelles (MLOs) are vital and dynamic reaction centers in cells that organize metabolism in the absence of a membrane. Multivalent interactions between protein Low-Complexity Domains (LCDs) contribute to MLO organization. Our previous work used computational methods to identify structural motifs termed Low-complexity Amyloid-like Reversible Kinked Segments (LARKS) that can phase-transition to form hydrogels and are common in human proteins that participate in MLOs. Here we searched for LARKS in proteomes of six model organisms: Homo sapiens, Drosophila melanogaster, Plasmodium falciparum, Saccharomyces cerevisiae, Mycobacterium tuberculosis, and Escherichia coli. We find LARKS are abundant in M. tuberculosis, D. melanogaster, and H. sapiens, but not in S. cerevisiae or P. falciparum. Abundant LARKS require high glycine content, which enables kinks to form in LARKS as is illustrated in the known LARKS-rich amyloid structures of TDP43, FUS, and hnRNPA2, three proteins that participate in MLOs. These results support the idea of LARKS as an evolved structural motif, and we offer the LARKSdb webserver, which permits users to search for LARKS in their protein sequences of interest.
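
The abstract above ties LARKS abundance to glycine content. As a rough illustration only, the sketch below measures the glycine fraction of a protein sequence, overall and in sliding windows. It is not the LARKS prediction method and is unrelated to the LARKSdb webserver; the function names, the window size and the toy sequence are assumptions made for this example.

```python
def glycine_fraction(sequence: str) -> float:
    """Return the fraction of residues in a protein sequence that are glycine (G)."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return 0.0
    return residues.count("G") / len(residues)


def windowed_glycine(sequence: str, window: int = 10) -> list:
    """Glycine fraction of every full-length window along the sequence.

    The window size is an arbitrary choice for illustration, not a value
    taken from the abstract.
    """
    seq = sequence.upper()
    return [
        glycine_fraction(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    toy = "GYGGSGGNYGSS"  # hypothetical toy sequence, not a real protein
    print(glycine_fraction(toy))
    print(windowed_glycine(toy, window=6))
```

A per-window profile like this only highlights glycine-rich stretches; deciding whether a segment is a LARKS would need the structural threading described in the authors' earlier work.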

A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences

Bioinformatics and Biology Insights ◽

10.4137/bbi.s10053 ◽

2013 ◽

Vol 7 ◽

pp. BBI.S10053 ◽

Cited By ~ 8

Author(s):

Nicolas Carels ◽

Diego Frías

Keyword(s):

Success Rate ◽

Prior Knowledge ◽

Homo Sapiens ◽

Stop Codon ◽

Protein Sequences ◽

Data Bank ◽

Low Complexity ◽

Statistical Parameters ◽

Reading Frame

In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge. Considering the protein sequences of the Protein Data Bank (RCSB PDB or more simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. We first established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of higher eukaryote genes that encode proteins. Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of the genome phenotypes of plant genomes without prior knowledge. Additional care is necessary when addressing the human transcriptome due to the interference caused by large amounts of noisy sequences. UFM can be regarded as a low complexity tool for prior knowledge extraction concerning the coding fraction of the transcriptome of any eukaryote. Due to its low level of complexity, UFM is also very robust to variations of codon usage.
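
Factors (i) and (ii) listed in the abstract above can be computed directly from a nucleotide sequence. The sketch below is a minimal illustration, assuming the standard stop codons TAA, TAG and TGA and a DNA alphabet. It is not the published UFM score: the abstract does not give the weights, thresholds or the way the five factors are combined, so none of those are reproduced here, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet
PURINES = {"A", "G"}


def codons(seq: str, frame: int) -> list:
    """Split a sequence into complete triplets starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def stop_codon_frequency(seq: str, frame: int) -> float:
    """Factor (i): fraction of triplets in this frame that are stop codons."""
    triplets = codons(seq, frame)
    if not triplets:
        return 0.0
    return sum(t in STOP_CODONS for t in triplets) / len(triplets)


def purine_probabilities(seq: str, frame: int) -> tuple:
    """Probability of a purine (A or G) at codon positions 1, 2 and 3."""
    triplets = codons(seq, frame)
    if not triplets:
        return (0.0, 0.0, 0.0)
    n = len(triplets)
    return tuple(sum(t[pos] in PURINES for t in triplets) / n for pos in range(3))


def purine_product(seq: str, frame: int) -> float:
    """Factor (ii): product of the three per-position purine probabilities."""
    p1, p2, p3 = purine_probabilities(seq, frame)
    return p1 * p2 * p3


if __name__ == "__main__":
    est = "ATGGCATAA"  # hypothetical toy sequence
    for frame in range(3):
        print(frame, stop_codon_frequency(est, frame), purine_product(est, frame))
```

Comparing these values across the three frames (and, when the coding strand is unknown, across the reverse complement as well) gives the raw per-frame statistics that a classifier of this kind would score.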

DBP-PSSM: Combination of evolutionary profiles with the XGBoost algorithm to improve the identification of DNA-binding proteins

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207323999201124203531 ◽

2020 ◽

Vol 23 ◽

Author(s):

Yanping Zhang ◽

Pengcheng Chen ◽

Ya Gao ◽

Jianwei Ni ◽

Xiaosheng Wang

Keyword(s):

Logistic Regression ◽

Protein Structure ◽

Dna Binding ◽

Molecular Biology ◽

Binding Proteins ◽

Protein Sequences ◽

Low Complexity ◽

Dna Binding Proteins ◽

Position Information ◽

Position Representation

Aim and Objective: Given the rapidly increasing volume of molecular biology data available, computational methods of low complexity are necessary to infer protein structure, function, and evolution. Method: In this work, we proposed a novel method, FermatS, which is based on the global position information and local position representation from the curve and normalized moments of inertia, respectively, to extract feature information from protein sequences. Furthermore, we use the features generated by the FermatS method to analyze the similarity/dissimilarity of nine ND5 proteins and to establish a prediction model of DNA-binding proteins based on logistic regression with 5-fold cross-validation. Results: In the similarity/dissimilarity analysis of nine ND5 proteins, the results are consistent with evolutionary theory. Moreover, this method can effectively predict DNA-binding proteins in realistic situations. Conclusion: The findings demonstrate that the proposed method is effective for comparing, recognizing and predicting protein sequences. The main code and datasets can be downloaded from https://github.com/GaoYa1122/FermatS.

Myosinome: A Database of Myosins from Select Eukaryotic Genomes to Facilitate Analysis of Sequence-Structure-Function Relationships

Bioinformatics and Biology Insights ◽

10.4137/bbi.s9902 ◽

2012 ◽

Vol 6 ◽

pp. BBI.S9902 ◽

Cited By ~ 3

Author(s):

Divya P. Syamaladevi ◽

Margaret S Sunitha ◽

S. Kalaimathy ◽

Chandrashekar C. Reddy ◽

Mohammed Iftekhar ◽

...

Keyword(s):

Conformational Changes ◽

Atp Hydrolysis ◽

Homo Sapiens ◽

Relevant Literature ◽

Myosin Ii ◽

Coiled Coil ◽

Structural Features ◽

Model Organisms ◽

Congenital Diseases ◽

C Elegans

Myosins are one of the largest protein superfamilies with 24 classes. They have conserved structural features and catalytic domains yet show huge variation at different domains resulting in a variety of functions. Myosins are molecules driving various kinds of cellular processes and motility up to the level of organisms. These are ATPases that utilize the chemical energy released by ATP hydrolysis to bring about conformational changes leading to a motor function. Myosins are important as they are involved in almost all cellular activities ranging from cell division to transcriptional regulation. They are crucial due to their involvement in many congenital diseases characterized by muscular malfunctions, cardiac diseases, deafness, neural and immunological dysfunction, and so on, many of which lead to death at an early age. We present Myosinome, a database of selected myosin classes (myosin II, V, and VI) from five model organisms. This knowledge base provides the sequences, phylogenetic clustering, domain architectures of myosins and molecular models, structural analyses, and relevant literature of their coiled-coil domains. In the current version of Myosinome, information about 71 myosin sequences belonging to three myosin classes (myosin II, V, and VI) in five model organisms (Homo sapiens, Mus musculus, D. melanogaster, C. elegans and S. cerevisiae) identified using bioinformatics surveys is presented, and several of them are yet to be functionally characterized. As these proteins are involved in congenital diseases, such a database would be useful in short-listing candidates for gene therapy and drug development. The database can be accessed from http://caps.ncbs.res.in/myosinome.

The Occurrence of Genetic Recombination between Viruses and Human, Its Possible Influence on Vaccination

Epidemiology and Vaccinal Prevention ◽

10.31631/2073-3046-2019-18-4-14 ◽

2020 ◽

Vol 18 (6) ◽

pp. 4-14 ◽

Cited By ~ 1

Author(s):

E. P. Kharchenko

Keyword(s):

Genetic Information ◽

Genetic Recombination ◽

Protein Sequences ◽

Virus Protein ◽

Computer Study ◽

Data Bases ◽

Small Genome ◽

After Effects ◽

The Past ◽

Human Proteins

Relevance. Genetic recombination between viruses and humans has long been known. It can be divided into relict and ontogenic types. For the host, recombination may have different consequences, the nature of which has not been fully elucidated. Aim. To analyze, by computer comparison of the primary structures of viral and human proteins, the occurrence of bidirectional recombination of small genome fragments between viruses and humans, and to describe its possible after-effects. Materials and methods. Human and viral protein sequences from publicly available online databases were used for this computer study. Results. Recombination (cryptic and explicit) of small genome fragments between viruses and humans was shown to have occurred many times in the past, and many viruses pathogenic to humans were involved in it. Conclusion. The bioinformatics approach makes it possible to look into the past of viruses and humans and to find traces of genetic information exchange between them that may predetermine the effects of vaccines and diagnostic immune tests.

Identification of high-efficiency 3′GG gRNA motifs in indexed FASTA files with ngg2

PeerJ Computer Science ◽

10.7717/peerj-cs.33 ◽

2015 ◽

Vol 1 ◽

pp. e33 ◽

Cited By ~ 2

Author(s):

Elisha D. Roberson

Keyword(s):

High Efficiency ◽

Homo Sapiens ◽

Model Organisms ◽

Proof Of Concept ◽

Protein Coding ◽

C Elegans ◽

Protein Coding Genes ◽

Starting Point ◽

Command Line Tool ◽

Reference Genomes

CRISPR/Cas9 is emerging as one of the most-used methods of genome modification in organisms ranging from bacteria to human cells. However, the efficiency of editing varies tremendously site-to-site. A recent report identified a novel motif, called the 3′GG motif, which substantially increases the efficiency of editing at all sites tested in C. elegans. Furthermore, they highlighted that previously published gRNAs with high editing efficiency also had this motif. I designed a Python command-line tool, ngg2, to identify 3′GG gRNA sites from indexed FASTA files. As a proof-of-concept, I screened for these motifs in six model genomes: Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Mus musculus, and Homo sapiens. I also scanned the genomes of pig (Sus scrofa) and African elephant (Loxodonta africana) to demonstrate the utility in non-model organisms. I identified more than 60 million single match 3′GG motifs in these genomes. Greater than 61% of all protein coding genes in the reference genomes had at least one unique 3′GG gRNA site overlapping an exon. In particular, more than 96% of mouse and 93% of human protein coding genes have at least one unique, overlapping 3′GG gRNA. These identified sites can be used as a starting point in gRNA selection, and the ngg2 tool provides an important ability to identify 3′GG editing sites in any species with an available genome sequence.
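
The kind of motif scan described above can be sketched with a regular expression. This is an illustration only and does not reproduce ngg2's code, options or output format. It assumes a 20-nt target sequence whose last two bases are GG, followed immediately by an NGG PAM, and it scans both strands of a plain in-memory string rather than an indexed FASTA file. The site length, the function names and the returned tuple layout are assumptions made for this example.

```python
import re

# Assumed site layout: 18 arbitrary bases + GG (the 3'GG motif), then an NGG PAM.
# The lookahead lets overlapping candidate sites all be reported.
_SITE = re.compile(r"(?=([ACGT]{18}GG)([ACGT]GG))")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_3gg_sites(seq: str) -> list:
    """Return (strand, start, target, pam) tuples for every candidate site.

    `start` is 0-based and, for the "-" strand, is given in
    reverse-complement coordinates rather than forward coordinates.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in _SITE.finditer(s):
            hits.append((strand, m.start(), m.group(1), m.group(2)))
    return hits


if __name__ == "__main__":
    toy = "A" * 18 + "GG" + "TGG"  # hypothetical toy sequence
    for hit in find_3gg_sites(toy):
        print(hit)
```

A scan like this only lists candidate sites; checking that a site is unique in the genome, as the abstract describes, would need a separate pass that counts how often each target sequence occurs.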

Hepatitis B virus and Homo sapiens proteomewide analysis: A profusion of viral peptide overlaps in neuron-specific human proteins

Biologics: Targets and Therapy ◽

10.2147/btt.s8890 ◽

2010 ◽

pp. 75 ◽

Cited By ~ 1

Author(s):

Darja Kanduc

Keyword(s):

Hepatitis B Virus ◽

Hepatitis B ◽

Homo Sapiens ◽

Viral Peptide ◽

B Virus ◽

Human Proteins

Mathematical Characterization of Membrane Protein Sequences of Homo-Sapiens

2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence) ◽

10.1109/confluence.2019.8776919 ◽

2019 ◽

Author(s):

Pranav Dutt Upadhayay ◽

Rupaansh Chandra Agarwal ◽

Ranjeet Kumar Rout ◽

Arun Prakash Agrawal

Keyword(s):

Membrane Protein ◽

Homo Sapiens ◽

Protein Sequences

Mycobacterium tuberculosis H37Rv: In Silico Drug Targets Identification by Metabolic Pathways Analysis

International Journal of Evolutionary Biology ◽

10.1155/2014/284170 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8 ◽

Cited By ~ 13

Author(s):

Asad Amir ◽

Khyati Rana ◽

Arvind Arya ◽

Neelesh Kapoor ◽

Hirdesh Kumar ◽

...

Keyword(s):

Mycobacterium Tuberculosis ◽

In Silico ◽

Metabolic Pathways ◽

Drug Targets ◽

Pathogenic Bacteria ◽

Homo Sapiens ◽

Protein Sequences ◽

Mycobacterium Tuberculosis H37rv ◽

Drug Molecules ◽

Pathways Analysis

Mycobacterium tuberculosis (Mtb) is a pathogenic bacterial species in the genus Mycobacterium and the causative agent of most cases of tuberculosis. Tuberculosis (TB) is the leading cause of death in the world from a bacterial infectious disease. The emergence of antibiotic-resistant strains has led to the development of new antibiotics or drug molecules that can kill or suppress the growth of Mycobacterium tuberculosis. We have performed an in silico comparative analysis of metabolic pathways of the host Homo sapiens and the pathogen Mycobacterium tuberculosis (H37Rv). Novel efforts in developing drugs that target the intracellular metabolism of M. tuberculosis often focus on metabolic pathways that are specific to M. tuberculosis. We have identified five unique pathways for Mycobacterium tuberculosis comprising 60 enzymes, which are nonhomologous to Homo sapiens protein sequences, and among them there were 55 enzymes, which are nonhomologous to Homo sapiens protein sequences. These enzymes were also found to be essential for the survival of Mycobacterium tuberculosis according to the DEG database. Further, functional analysis using UniProt showed involvement of all the unique enzymes in different cellular components.

Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software

10.1101/706531 ◽

2019 ◽

Author(s):

Melanie E. F. LaCava ◽

Ellen O. Aikens ◽

Libby C. Megna ◽

Gregg Randolph ◽

Charley Hubbard ◽

...

Keyword(s):

Dna Sequences ◽

De Novo ◽

Homo Sapiens ◽

Simulated Data ◽

Model Organisms ◽

Reduced Representation ◽

Insertion And Deletion ◽

Large Sets ◽

Sequencing Library ◽

Software Programs

Abstract: Advances in DNA sequencing have made it feasible to gather genomic data for non-model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD-HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated datasets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD-HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.

Spectrum of protein localization in proteomes captures evolutionary relation between species

10.1101/845362 ◽

2019 ◽

Author(s):

Valérie Marot-Lassauzaie ◽

Tatyana Goldberg ◽

Burkhard Rost

Keyword(s):

Protein Function ◽

Homo Sapiens ◽

Protein Localization ◽

Model Organisms ◽

Prediction Methods ◽

Systematic Bias ◽

Human Protein Atlas ◽

Species Comparisons ◽

The One ◽

Evolutionary Comparisons

Abstract: The native subcellular localization or cellular compartment of a protein is the one in which it acts most often; it is one aspect of protein function. Do ten eukaryotic model organisms differ in their location spectrum, i.e. the fraction of its proteome in each of its seven major compartments? As experimental annotations of locations remain biased and incomplete, we need prediction methods to answer this question. To gauge the bias of prediction methods, we merged all available experimental annotations for the human proteome. In doing so, we found important values in both Swiss-Prot and the Human Protein Atlas (HPA). After systematic bias corrections, the complete but faulty prediction methods appeared to be more appropriate to compare location spectra between species than the incomplete more accurate experimental data. This work compared the location spectra for ten eukaryotes: Homo sapiens, Gorilla gorilla, Pan troglodytes, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Saccharomyces cerevisiae and Schizosaccharomyces pombe. Overall, the predicted location spectra were similar. However, the detailed differences were significant enough to plot trees and 2D (PCA) maps relating the ten organisms using a simple Euclidean distance in seven states, corresponding to the seven studied localization classes. The relations based on the simple predicted location spectra captured aspects of cross-species comparisons usually revealed only by much more detailed evolutionary comparisons.
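
The comparison described above reduces each proteome to a seven-component vector of compartment fractions and measures Euclidean distance between those vectors. The sketch below shows that calculation under stated assumptions: the function names are hypothetical, and the per-compartment protein counts are invented placeholders, not data from the study.

```python
import math


def location_spectrum(counts: list) -> list:
    """Convert per-compartment protein counts into fractions of the proteome."""
    total = sum(counts)
    if total == 0:
        raise ValueError("proteome has no annotated proteins")
    return [c / total for c in counts]


def euclidean_distance(a: list, b: list) -> float:
    """Euclidean distance between two spectra with the same number of classes."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of classes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(spectra: dict) -> dict:
    """Pairwise distances keyed by (organism, organism)."""
    names = sorted(spectra)
    return {
        (p, q): euclidean_distance(spectra[p], spectra[q])
        for p in names
        for q in names
    }


if __name__ == "__main__":
    # Invented counts for seven localization classes; placeholders only.
    counts = {
        "organism_A": [500, 300, 120, 40, 20, 10, 10],
        "organism_B": [450, 320, 130, 50, 25, 15, 10],
    }
    spectra = {name: location_spectrum(c) for name, c in counts.items()}
    print(distance_matrix(spectra)[("organism_A", "organism_B")])
```

A matrix like this is the usual input for the tree building and 2D projection steps mentioned in the abstract, which are not shown here.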
