Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins

Genome Research ◽

10.1101/gr.183801 ◽

2001 ◽

Vol 11 (10) ◽

pp. 1632-1640

Author(s):

Hedi Hegyi ◽

Mark Gerstein

Keyword(s):

Genome Annotation ◽

Large Scale ◽

Sequence Similarity ◽

Functional Divergence ◽

Open Reading Frames ◽

Single Domain ◽

Functional Conservation ◽

Link Type ◽

Approximate Function ◽

The Relationship

Annotation transfer is a principal process in genome annotation. It involves “transferring” structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using SCOP superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share the exact same combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to > 90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single- and multi-domain proteins, and only 14 of these were associated with the same function in both categories of proteins. We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func.
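
As a rough illustration of the kind of statistic this abstract reports, the fraction of protein pairs that share a function can be tallied per pair category. This is a minimal sketch, not the authors' pipeline: the function name `conservation_by_category`, the category labels, and the toy pairs are all hypothetical, and the numbers it yields are not the paper's.

```python
from collections import defaultdict

def conservation_by_category(pairs):
    """Return {category: fraction of pairs whose two proteins share a function}.

    `pairs` is an iterable of (category, shares_function) tuples.
    """
    totals = defaultdict(int)
    same = defaultdict(int)
    for category, shares_function in pairs:
        totals[category] += 1
        if shares_function:
            same[category] += 1
    return {category: same[category] / totals[category] for category in totals}

# Hypothetical toy data, invented purely for illustration.
toy_pairs = [
    ("single-domain", True),
    ("single-domain", True),
    ("single-domain", False),
    ("multi-domain, one shared superfamily", True),
    ("multi-domain, one shared superfamily", False),
    ("multi-domain, one shared superfamily", False),
]
rates = conservation_by_category(toy_pairs)
for category, rate in rates.items():
    print(f"{category}: {rate:.2f}")
```

In a real survey the boolean flag would presumably come from comparing functional categories assigned to each protein, which this sketch does not attempt.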

Hyphomicrobium nitrativorans sp. nov., isolated from the biofilm of a methanol-fed denitrification system treating seawater at the Montreal Biodome

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.048124-0 ◽

2013 ◽

Vol 63 (Pt_10) ◽

pp. 3777-3781 ◽

Cited By ~ 38

Author(s):

Christine Martineau ◽

Céline Villeneuve ◽

Florian Mauffrey ◽

Richard Villemur

Keyword(s):

16S rRNA ◽

Type Species ◽

Sequence Similarity ◽

Open Reading Frames ◽

Nitrite Accumulation ◽

rRNA Gene ◽

Content Type ◽

Link Type ◽

Serine Pathway ◽

High Level

A budding prosthecate bacterial strain, designated NL23T, was isolated from a methanol-fed denitrification system treating seawater at the Montreal Biodome, Canada. Phylogenetic analysis based on 16S rRNA gene sequences showed that the strain was affiliated with the genus Hyphomicrobium of the Alphaproteobacteria and was most closely related to Hyphomicrobium zavarzinii with 99.4 % sequence similarity. Despite this high level of 16S rRNA gene sequence similarity, DNA–DNA hybridization assays showed that strain NL23T was only distantly related to H. zavarzinii ZV-622T (12 %). Strain NL23T grew aerobically, but also had the capacity to grow under denitrifying conditions in the presence of nitrate without nitrite accumulation. Growth occurred at pH 7.0–9.5, with 0–1 % NaCl and at temperatures of 15–35 °C. Major fatty acids were C18 : 1ω7c or ω6c (84.6 %) and C18 : 0 (8.5 %), and major quinones were Q8 (5 %) and Q9 (95 %). The complete genome of the strain was sequenced and showed a DNA G+C content of 63.8 mol%. Genome analysis predicted open reading frames (ORFs) encoding the key enzymes of the serine pathway as well as enzymes involved in methylotrophy. ORFs encoding a periplasmic nitrate reductase (Nap), a nitrite reductase (Nir), a nitric oxide reductase (Nor) and a nitrous oxide reductase (Nos) were also identified. Our results indicate that strain NL23T represents a novel species within the genus Hyphomicrobium, for which the name Hyphomicrobium nitrativorans sp. nov. is proposed. The type strain is NL23T (= ATCC BAA-2476T = LMG 27277T).

G-OnRamp: Generating genome browsers to facilitate undergraduate-driven collaborative genome annotation

10.1101/781658 ◽

2019 ◽

Author(s):

Luke Sargent ◽

Yating Liu ◽

Wilson Leung ◽

Nathan T. Mortimer ◽

David Lopatto ◽

...

Keyword(s):

Genome Annotation ◽

Gene Annotation ◽

Sequence Similarity ◽

Gene Prediction ◽

Phenotypic Traits ◽

Wasp Species ◽

Major Barrier ◽

Link Type ◽

A Genome ◽

Genome Browsers

Scientists are sequencing new genomes at an increasing rate with the goal of associating genome contents with phenotypic traits. After a new genome is sequenced and assembled, structural gene annotation is often the first step in analysis. Despite advances in computational gene prediction algorithms, most eukaryotic genomes still benefit from manual gene annotation. Undergraduates can become skilled annotators, and in the process learn both about genes/genomes and about how to utilize large datasets. Data visualizations provided by a genome browser are essential for manual gene annotation, enabling annotators to quickly evaluate multiple lines of evidence (e.g., sequence similarity, RNA-Seq, gene predictions, repeats). However, creating genome browsers requires extensive computational skills; lack of the expertise required remains a major barrier for many biomedical researchers and educators. To address these challenges, the Genomics Education Partnership (GEP; https://gep.wustl.edu/) has partnered with the Galaxy Project (https://galaxyproject.org) to develop G-OnRamp (http://g-onramp.org), a web-based platform for creating UCSC Assembly Hubs and JBrowse genome browsers. G-OnRamp can also convert a JBrowse instance into an Apollo instance for collaborative genome annotations in research and educational settings. G-OnRamp enables researchers to easily visualize their experimental results, educators to create Course-based Undergraduate Research Experiences (CUREs) centered on genome annotation, and students to participate in genomics research. Development of G-OnRamp was guided by extensive user feedback from in-person workshops. Sixty-five researchers and educators from over 40 institutions participated in these workshops, which produced over 20 genome browsers now available for research and education. For example, genome browsers for four parasitoid wasp species were used in a CURE engaging 142 students taught by 13 faculty members, producing a total of 192 gene models. G-OnRamp can be deployed on a personal computer or on cloud computing platforms, and the genome browsers produced can be transferred to the CyVerse Data Store for long-term access.

AMAW: automated gene annotation for non-model eukaryotic genomes

10.1101/2021.12.07.471566 ◽

2021 ◽

Author(s):

Loïc Meunier ◽

Denis Baurain ◽

Luc Cornet

Keyword(s):

Genome Annotation ◽

Large Scale ◽

Gene Annotation ◽

Supplementary Information ◽

Supplementary Data ◽

Software Suite ◽

Perl Script ◽

Link Type ◽

Eukaryotic Genomes

Summary: To support small and large-scale genome annotation projects, we present AMAW (Automated MAKER2 Annotation Wrapper), a program devised to annotate non-model unicellular eukaryotic genomes by automating the acquisition of evidence data (transcripts and proteins) and facilitating the use of MAKER2, a widely adopted software suite for the annotation of eukaryotic genomes. Moreover, AMAW exists as a Singularity container recipe that is easy to deploy on a grid computer, thereby overcoming the tricky installation of MAKER2. Availability: AMAW is released both as a Singularity container recipe and a standalone Perl script (https://bitbucket.org/phylogeno/amaw/). Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

Kurthia huakuii sp. nov., isolated from biogas slurry, and emended description of the genus Kurthia

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.056044-0 ◽

2014 ◽

Vol 64 (Pt_2) ◽

pp. 518-521 ◽

Cited By ~ 41

Author(s):

Zhiyong Ruan ◽

Yanwei Wang ◽

Jinlong Song ◽

Shenghua Jiang ◽

Huimin Wang ◽

...

Keyword(s):

Type Species ◽

Large Scale ◽

Sequence Similarity ◽

rRNA Gene ◽

Biogas Slurry ◽

Content Type ◽

Link Type ◽

Emended Description ◽

Method Analysis ◽

The 16S rRNA Gene

A novel facultatively anaerobic bacterium, designated strain LAM0618T, was isolated from biogas slurry samples collected from the large-scale anaerobic digester of Modern Farming Corporation in Hebei Province, China. Cells of strain LAM0618T were Gram-stain-positive, motile, non-spore-forming and short-rod-shaped. The optimal temperature and pH for growth were 30 °C and 7.0, respectively. The strain did not require NaCl for growth but tolerated up to 70 g NaCl l−1. The major fatty acids of strain LAM0618T were iso-C15 : 0, anteiso-C15 : 0, iso-C14 : 0, C16 : 0 and C18 : 0. The predominant menaquinones of strain LAM0618T were menaquinone 7 (MK-7) and menaquinone 6 (MK-6). The main polar lipids of strain LAM0618T were phosphatidylglycerol, diphosphatidylglycerol, phosphatidylethanolamine and six unknown glycolipids. The genomic DNA G+C content was 41 mol% as determined by the Tm method. Analysis of the 16S rRNA gene sequence revealed that strain LAM0618T was a member of the genus Kurthia, and was most closely related to ‘Kurthia massiliensis’ DSM 24639, Kurthia zopfii DSM 20580T, Kurthia gibsonii DSM 20636T and Kurthia sibirica DSM 4747T, with 96.9, 95.7, 95.6 and 94.9 % sequence similarity, respectively. Based on its phenotypic and genotypic properties, strain LAM0618T is suggested to represent a novel species of the genus Kurthia, for which the name Kurthia huakuii sp. nov. is proposed. The type strain is LAM0618T (= ACCC 06121T = JCM 19187T).

Open Reading Frame Phylogenetic Analysis on the Cloud

International Journal of Genomics ◽

10.1155/2013/614923 ◽

2013 ◽

Vol 2013 ◽

pp. 1-9 ◽

Cited By ~ 5

Author(s):

Che-Lun Hung ◽

Chun-Yuan Lin

Keyword(s):

Phylogenetic Analysis ◽

Phylogenetic Trees ◽

Large Scale ◽

Sequence Similarity ◽

Open Reading Frames ◽

Open Reading Frame ◽

Evolutionary Relationships ◽

Reading Frame ◽

Complete Sequences

Phylogenetic analysis has become essential in researching the evolutionary relationships between viruses. These relationships are depicted on phylogenetic trees, in which viruses are grouped based on sequence similarity. Viral evolutionary relationships are identified from open reading frames rather than from complete sequences. Recently, cloud computing has become popular for developing internet-based bioinformatics tools. Biocloud is an efficient, scalable, and robust bioinformatics computing service. In this paper, we propose a cloud-based open reading frame phylogenetic analysis service. The proposed service integrates the Hadoop framework, virtualization technology, and phylogenetic analysis methods to provide a high-availability, large-scale bioservice. In a case study, we analyze the phylogenetic relationships among Norovirus. Evolutionary relationships are elucidated by aligning different open reading frame sequences. The proposed platform correctly identifies the evolutionary relationships between members of Norovirus.

Relationship between total plasma homocysteine and the risk of aneurysms – a meta-analysis

VASA ◽

10.1024/0301-1526/a000891 ◽

2020 ◽

pp. 1-6

Author(s):

Hanji Zhang ◽

Dexin Yin ◽

Yue Zhao ◽

Yezhou Li ◽

Dejiang Yao ◽

...

Keyword(s):

Large Scale ◽

Meta Analysis ◽

Single Species ◽

Total Plasma ◽

Control Groups ◽

Randomized Controlled ◽

Randomized Controlled Studies ◽

The Difference ◽

The Relationship ◽

Healthy Participants

Summary: Our meta-analysis focused on the relationship between homocysteine (Hcy) level and the incidence of aneurysms and looked at the relationship between smoking, hypertension and aneurysms. A systematic literature search of the PubMed, Web of Science, and Embase databases (up to March 31, 2020) identified 19 studies, including 2,629 aneurysm patients and 6,497 healthy participants. Combined analysis of the included studies showed that the numbers of cases of smoking, hypertension and hyperhomocysteinemia (HHcy) among aneurysm patients were higher than those in the control groups, and the total plasma Hcy level in aneurysm patients was also higher. These findings suggest that smoking, hypertension and HHcy may be risk factors for the development and progression of aneurysms. Although the heterogeneity of the meta-analysis was significant, subgroup analysis indicated that it might come from differences in race and disease species. Large-scale randomized controlled studies of single species and single disease species are needed in the future to improve the accuracy of the results.

New "Cold War"

Diplomatic Service ◽

10.33920/vne-01-2001-04 ◽

2020 ◽

pp. 27-34

Author(s):

Vladimir Batiuk

Keyword(s):

Cold War ◽

Nuclear Weapons ◽

Armed Conflict ◽

Large Scale ◽

Arms Race ◽

Ideological Struggle ◽

Nuclear Arms Race ◽

The Cold War ◽

The Relationship ◽

Nuclear Arms

In this article, the "Cold War" is understood as a situation where the relationship between the leading States is determined by ideological confrontation and, at the same time, the presence of nuclear weapons precludes the development of this confrontation into a large-scale armed conflict. Such a situation developed in the years 1945–1989, during the first Cold War. We see that something similar is being repeated in our time, with all the new nuances in the ideological struggle and in the nuclear arms race.

Investigating Diseases and Chemicals in COVID-19 Literature with Text Mining (Preprint)

10.2196/preprints.21503 ◽

2020 ◽

Author(s):

Amir Karami ◽

Brandon Bookstaver ◽

Melissa Nolan

Keyword(s):

Text Mining ◽

Literature Review ◽

Topic Modeling ◽

Large Scale ◽

Clinical Manifestations ◽

International Health ◽

Research Papers ◽

Strategic Plans ◽

Funding Agencies ◽

The Relationship

BACKGROUND The COVID-19 pandemic has impacted nearly all aspects of life and has posed significant threats to international health and the economy. Given the rapidly unfolding nature of the current pandemic, there is an urgent need to streamline literature synthesis of the growing scientific research to elucidate targeted solutions. While traditional systematic literature review studies provide valuable insights, these studies have restrictions, including analyzing a limited number of papers, having various biases, being time-consuming and labor-intensive, focusing on a few topics, being incapable of trend analysis, and lacking data-driven tools. OBJECTIVE This study addresses these restrictions in the literature and practice by analyzing two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, with text mining methods in a corpus containing COVID-19 research papers, and by finding associations between the two biomedical concepts. METHODS This research collected papers representing COVID-19 pre-prints and peer-reviewed research published in 2020. We used frequency analysis to find highly frequent manifestations and therapeutic chemicals, representing the importance of the two biomedical concepts. This study also applied topic modeling to find the relationship between the two biomedical concepts. RESULTS We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestations of disease terminology included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terminology included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some connections of which were novel and not previously identified by the scientific community. CONCLUSIONS Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of literature, to educators for knowing the scope of literature, to journals for exploring most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.
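
The frequency-analysis step described in the METHODS of this abstract can be pictured with a small sketch. It assumes one plausible reading, counting how many documents mention each term from a fixed vocabulary; the abstract does not specify the actual procedure, and the function name `term_frequencies`, the sample sentences, and the vocabulary below are hypothetical.

```python
from collections import Counter
import re

def term_frequencies(documents, vocabulary):
    """Count how many documents mention each vocabulary term (case-insensitive).

    Only single-word terms are handled; multi-word terms would need
    phrase matching, which this sketch does not attempt.
    """
    counts = Counter()
    for text in documents:
        # Tokenize into a set so each document counts a term at most once.
        tokens = set(re.findall(r"[a-z0-9-]+", text.lower()))
        for term in vocabulary:
            if term.lower() in tokens:
                counts[term] += 1
    return counts

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "Patients presented with fever and cough.",
    "Remdesivir was evaluated in patients with pneumonia and fever.",
    "Chloroquine and remdesivir trials.",
]
vocab = ["fever", "cough", "pneumonia", "remdesivir", "chloroquine"]
freq = term_frequencies(docs, vocab)
for term, n in freq.most_common():
    print(term, n)
```

Counting documents rather than raw occurrences keeps a single long paper from dominating the ranking; whether the study made the same choice is not stated.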

Strategic resources and smallholder performance at the bottom of the pyramid

International Food and Agribusiness Management Review ◽

10.22434/ifamr2018.0111 ◽

2019 ◽

Vol 22 (3) ◽

pp. 365-380 ◽

Cited By ~ 1

Author(s):

Matthias Olthaar ◽

Wilfred Dolfsma ◽

Clemens Lutz ◽

Florian Noseleit

Keyword(s):

Large Scale ◽

Business Environment ◽

Agricultural Economics ◽

Local Market ◽

Bottom Of The Pyramid ◽

Fine Grained ◽

Strategic Resources ◽

Relative Contribution ◽

The Relationship ◽

Resource Based Theory

In a competitive business environment at the Bottom of the Pyramid, smallholders supplying global value chains may be thought to be at the whims of downstream large-scale players and local market forces, leaving no room for strategic entrepreneurial behavior. In such a context, we test the relationship between the use of strategic resources and firm performance. We adopt the Resource Based Theory and show that seemingly homogeneous smallholders deploy resources differently and, consequently, some outperform others. We argue that the ‘resource-based theory’ results in a more fine-grained understanding of smallholder performance than approaches generally applied in agricultural economics. We develop a mixed-method approach that allows one to pinpoint relevant, industry-specific resources, and allows for empirical identification of the relative contribution of each resource to competitive advantage. The results show that proper use of quality labor, storage facilities, time of selling, and availability of animals are key capabilities.

Lack of an association between gallstone disease and bilirubin levels with risk of colorectal cancer: a Mendelian randomisation analysis

British Journal of Cancer ◽

10.1038/s41416-020-01211-x ◽

2021 ◽

Author(s):

Richard Culliford ◽

Alex J. Cornish ◽

Philip J. Law ◽

Susan M. Farrington ◽

Kimmo Palin ◽

...

Keyword(s):

Colorectal Cancer ◽

Large Scale ◽

Causal Effect ◽

Gallstone Disease ◽

Epidemiological Studies ◽

Mendelian Randomisation ◽

Linear Relationships ◽

Potential Risk Factors ◽

The Relationship ◽

Circulating Levels

Background: Epidemiological studies of the relationship between gallstone disease and circulating levels of bilirubin with risk of developing colorectal cancer (CRC) have been inconsistent. To address possible confounding and reverse causation, we examine the relationship between these potential risk factors and CRC using Mendelian randomisation (MR). Methods: We used two-sample MR to examine the relationship between genetic liability to gallstone disease and circulating levels of bilirubin with CRC in 26,397 patients and 41,481 controls. We calculated the odds ratio per genetically predicted SD unit increase in log bilirubin levels (ORSD) for CRC and tested for a non-zero causal effect of gallstones on CRC. Sensitivity analysis was applied to identify violations of estimator assumptions. Results: No association between either gallstone disease (P value = 0.60) or circulating levels of bilirubin (ORSD = 1.00, 95% confidence interval (CI) = 0.96–1.03, P value = 0.90) with CRC was shown. Conclusions: Despite the large scale of this study, we found no evidence for a causal relationship between either circulating levels of bilirubin or gallstone disease with risk of developing CRC. While the magnitude of effect suggested by some observational studies can confidently be excluded, we cannot exclude the possibility of smaller effect sizes and non-linear relationships.
