Deep Annotation of Protein Function across Diverse Bacteria from Mutant Phenotypes

Mapping Intimacies ◽

10.1101/072470 ◽

2016 ◽

Cited By ~ 21

Author(s):

Morgan N. Price ◽

Kelly M. Wetmore ◽

R. Jordan Waters ◽

Mark Callaghan ◽

Jayashree Ray ◽

...

Keyword(s):

Protein Function ◽

Large Scale ◽

Hypothetical Proteins ◽

Data Set ◽

Protein Coding ◽

Bacterial Proteins ◽

Genome Wide ◽

Protein Functions ◽

Mutant Phenotypes ◽

Related Proteins

Summary The function of nearly half of all protein-coding genes identified in bacterial genomes remains unknown. To systematically explore the functions of these proteins, we generated saturated transposon mutant libraries from 25 diverse bacteria and we assayed mutant phenotypes across hundreds of distinct conditions. From 3,903 genome-wide mutant fitness assays, we obtained 14.9 million gene phenotype measurements and we identified a mutant phenotype for 8,487 proteins with previously unknown functions. The majority of these hypothetical proteins (57%) had phenotypes that were either specific to a few conditions or were similar to that of another gene, thus enabling us to make informed predictions of protein function. For 1,914 of these hypothetical proteins, the functional associations are conserved across related proteins from different bacteria, which confirms that these associations are genuine. This comprehensive catalogue of experimentally-annotated protein functions also enables the targeted exploration of specific biological processes. For example, sensitivity to a DNA-damaging agent revealed 28 known families of DNA repair proteins and 11 putative novel families. Across all sequenced bacteria, 14% of proteins that lack detailed annotations have an ortholog with a functional association in our data set. Our study demonstrates the utility and scalability of high-throughput genetics for large-scale annotation of bacterial proteins and provides a vast compendium of experimentally-determined protein functions across diverse bacteria.

Genome-wide Phenotypic RNAi Screen in the Drosophila Wing: Global Parameters

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab351 ◽

2021 ◽

Author(s):

Ana López-Varea ◽

Cristina M Ostalé ◽

Patricia Vega-Cuesta ◽

Ana Ruiz-Gómez ◽

María F Organista ◽

...

Keyword(s):

Protein Function ◽

Wing Disc ◽

Functional Categories ◽

Protein Coding ◽

Genome Wide ◽

Adult Wing ◽

Drosophila Protein ◽

Molecular Information ◽

Mutant Phenotypes ◽

Global Parameters

Abstract We have screened a collection of UAS-RNAi lines targeting 10920 Drosophila protein-coding genes for phenotypes in the adult wing. We identified 3653 genes (33%) whose knock-down causes either larval/pupal lethality or a mutant phenotype affecting the formation of a normal wing. The most frequent phenotypes consist of changes in wing size, vein differentiation and patterning, and defects in the wing margin and in the apposition of the dorsal and ventral wing surfaces. We also defined 16 functional categories encompassing the most relevant aspect of each protein function, and assigned each Drosophila gene to one of these functional groups. This allowed us to identify which mutant phenotypes are enriched within each functional group. Finally, we used previously published gene expression datasets to determine which genes are or are not expressed in the wing disc. Integrating expression, phenotypic and molecular information offers considerable precision in identifying the relevant genes affecting wing formation and the biological processes they regulate.

Genome-wide association study of agronomic traits in bread wheat reveals novel putative alleles for future breeding programs

BMC Plant Biology ◽

10.1186/s12870-019-2165-4 ◽

2019 ◽

Vol 19 (1) ◽

Cited By ~ 11

Author(s):

Yousef Rahimi ◽

Mohammad Reza Bihamta ◽

Alireza Taleei ◽

Hadi Alipour ◽

Pär K. Ingvarsson

Keyword(s):

Bread Wheat ◽

Genome Wide Association Study ◽

Agronomic Traits ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Wheat Varieties ◽

Data Set ◽

Protein Coding ◽

Genome Wide

Abstract Background Identification of loci for agronomic traits and characterization of their genetic architecture are crucial in marker-assisted selection (MAS). Genome-wide association studies (GWAS) have increasingly been used as potent tools in identifying marker-trait associations (MTAs). The introduction of new adaptive alleles into diverse genetic backgrounds may help to improve grain yield of old or newly developed varieties of wheat to balance supply and demand throughout the world. Landraces collected from different climate zones can be an invaluable resource for such adaptive alleles. Results GWAS was performed using a collection of 298 Iranian bread wheat varieties and landraces to explore the genetic basis of agronomic traits during 2016–2018 cropping seasons under normal (well-watered) and stressed (rain-fed) conditions. A high-quality genotyping by sequencing (GBS) dataset was obtained using either all original single nucleotide polymorphisms (SNPs; 10938 SNPs) or with additional imputation (46,862 SNPs) based on the W7984 reference genome. The results confirm that the B genome carries the highest number of significant marker pairs in both varieties (49,880, 27.37%) and landraces (55,086, 28.99%). The strongest linkage disequilibrium (LD) between pairs of markers was observed on chromosome 2D (0.296). LD decay was lower in the D genome, compared to the A and B genomes. Association mapping under two tested environments yielded a total of 313 and 394 significant (−log10P >3) MTAs for the original and imputed SNP data sets, respectively. Gene ontology results showed that 27 and 27.5% of MTAs of SNPs in the original set were located in protein-coding regions for well-watered and rain-fed conditions, respectively. For the imputed data set, 22.6 and 16.6% of MTAs were located in protein-coding genes for the well-watered and rain-fed conditions, respectively.
Conclusions Our findings suggest that Iranian bread wheat landraces harbor valuable alleles that are adaptive under drought stress conditions. MTAs located within coding genes can be utilized in genome-based breeding of new wheat varieties. Although imputation of missing data increased the number of MTAs, the fraction of these MTAs located in coding genes decreased across the different sub-genomes.

Genome-wide profiling of transcribed enhancers during macrophage activation

10.1101/163519 ◽

2017 ◽

Author(s):

Elena Denisenko ◽

Reto Guler ◽

Musa Mhlanga ◽

Harukazu Suzuki ◽

Frank Brombacher ◽

...

Keyword(s):

Gene Expression ◽

Transcriptional Activation ◽

Large Scale ◽

Transcriptional Control ◽

Macrophage Activation ◽

Transcriptional Responses ◽

Protein Coding ◽

Genome Wide ◽

IFN-γ ◽

Cap Analysis

Abstract Macrophages are sentinel cells essential for tissue homeostasis and host defence. Owing to their plasticity, macrophages acquire a range of functional phenotypes in response to microenvironmental stimuli, of which M(IFN-γ) and M(IL-4/IL-13) are well-known for their opposing pro- and anti-inflammatory roles. Enhancers have emerged as regulatory DNA elements crucial for transcriptional activation of gene expression. Using cap analysis of gene expression and epigenetic data, we identify transcribed enhancers in mouse macrophages on a large scale, along with their time kinetics and target protein-coding genes. We observe an increase in target gene expression, concomitant with increasing numbers of associated enhancers, and find that genes associated with many enhancers show a shift towards stronger enrichment for macrophage-specific biological processes. We infer enhancers that drive transcriptional responses of genes upon M(IFN-γ) and M(IL-4/IL-13) macrophage activation and demonstrate stimuli-specificity of regulatory associations. Finally, we show that enhancer regions are enriched for binding sites of inflammation-related transcription factors, suggesting a link between stimuli response and enhancer transcriptional control. Our study provides new insights into genome-wide enhancer-mediated transcriptional control of macrophage genes, including those implicated in macrophage activation, and offers a detailed genome-wide catalogue to further elucidate enhancer regulation in macrophages.

Design to Data for mutants of β-glucosidase B from Paenibacillus polymyxa: M319C, T431I, and K337D

10.1101/839027 ◽

2019 ◽

Cited By ~ 1

Author(s):

Peishan Huang ◽

Stephanie C. Contreras ◽

Eliana Bloomfield ◽

Kristine Schmitz ◽

Augustine Arredondo ◽

...

Keyword(s):

Protein Function ◽

Kinetic Data ◽

Prediction Accuracy ◽

Paenibacillus Polymyxa ◽

Computational Algorithms ◽

Enzyme Design ◽

Data Set ◽

Computational Tools ◽

Protein Functions ◽

Novel Protein

ABSTRACT The use of computational tools has become an increasingly popular approach for engineering protein function. While there are numerous examples of computational tools enabling the design of novel protein functions, there remains room for improvement in both prediction accuracy and success. To improve algorithms for functional and stability predictions, we have initiated the development of a data set designed to be used for training new computational algorithms for enzyme design. To date our dataset is composed of over 129 mutants with associated expression levels, kinetic data, and thermal stability for the enzyme β-glucosidase B (BglB) from Paenibacillus polymyxa. In this study, we introduced three new variants (M319C, T431I, and K337D) to our existing dataset with the goal of cultivating a larger dataset to train new design algorithms and more broadly explore structure-function relationships in BglB.

A Method for Identifying Environmental Stimuli and Genes Responsible for Genotype-by-Environment Interactions From a Large-Scale Multi-Environment Data Set

Frontiers in Genetics ◽

10.3389/fgene.2021.803636 ◽

2021 ◽

Vol 12 ◽

Author(s):

Akio Onogi ◽

Daisuke Sekine ◽

Akito Kaga ◽

Satoshi Nakano ◽

Tetsuya Yamada ◽

...

Keyword(s):

Large Scale ◽

Genetic Correlations ◽

Data Driven ◽

Data Sets ◽

Data Set ◽

Environmental Stimuli ◽

Genotype By Environment ◽

Genome Wide ◽

Sowing Dates ◽

Data Driven Approach

It is not yet fully understood which environmental stimuli cause genotype-by-environment (G × E) interactions in real fields, when these interactions occur, and which genes react to the stimuli. Large-scale multi-environment data sets are attractive data sources for these purposes because they potentially cover a wide range of environmental conditions. Here we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for the G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set that consisted of 25,158 records collected at 52 environments. ECGC illustrated which meteorological factors shaped the G × E interactions in six traits including yield, flowering time, and protein content, and when these factors were involved in the interactions. For example, it illustrated the relevance of precipitation around sowing dates and hours of sunshine just before maturity to the interactions observed for yield. Moreover, genome-wide association mapping on the sensitivities to the identified stimuli discovered candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to bring novel insights into the G × E interactions observed in fields.

Increasing the Efficiency of Genome-wide Association Mapping via Hidden Markov Models

10.1101/039099 ◽

2016 ◽

Author(s):

Hong Gao ◽

Hua Tang ◽

Carlos Bustamante

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Large Scale ◽

Hidden Markov ◽

Association Studies ◽

Genome Wide Association ◽

Trend Test ◽

Genome Wide Association Studies ◽

Data Set ◽

Genome Wide

With the rapid production of high dimensional genetic data, one major challenge in genome-wide association studies is to develop effective and efficient statistical tools to resolve the low power problem of detecting causal SNPs with low to moderate susceptibility, whose effects are often obscured by substantial background noise. Here we present a novel method that serves as an optimal technique for reducing background noise and improving detection power in genome-wide association studies. The approach uses a hidden Markov model and its derivative, the Markov hidden Markov model, to estimate the posterior probability of a marker being in an associated state. We conducted extensive simulations based on the human whole genome genotype data from the GlaxoSmithKline-POPRES project to calibrate the sensitivity and specificity of our method and compared it with many popular approaches for detecting positive signals, including the χ^2 test for association and the Cochran-Armitage trend test. Our simulation results suggested that at very low false positive rates (<10^-6), our method reaches a power of 0.9 and is more powerful than any other approach when the allelic effect of the causal variant is non-additive or unknown. Application of our method to the data set generated by the Wellcome Trust Case Control Consortium using 14,000 cases and 3,000 controls confirmed its power and efficiency in the context of large-scale genome-wide association studies.

Computational annotation of miRNA transcription start sites

Briefings in Bioinformatics ◽

10.1093/bib/bbz178 ◽

2020 ◽

Author(s):

Saidi Wang ◽

Amlan Talukder ◽

Mingyu Cha ◽

Xiaoman Li ◽

Haiyan Hu

Keyword(s):

Computational Methods ◽

Large Scale ◽

Transcription Start ◽

Protein Coding ◽

Functional Roles ◽

Transcription Start Sites ◽

Small Noncoding RNAs ◽

miRNA Genes ◽

Genome Wide ◽

Computational Annotation

Abstract Motivation MicroRNAs (miRNAs) are small noncoding RNAs that play important roles in gene regulation and phenotype development. The identification of miRNA transcription start sites (TSSs) is critical to understand the functional roles of miRNA genes and their transcriptional regulation. Unlike protein-coding genes, miRNA TSSs are not directly detectable from conventional RNA-Seq experiments due to the miRNA-specific process of biogenesis. In the past decade, large-scale genome-wide TSS-Seq and transcription activation marker profiling data have become available, and many computational methods have been developed based on them. These methods have greatly advanced genome-wide miRNA TSS annotation. Results In this study, we summarized recent computational methods and their results on miRNA TSS annotation. We collected and performed a comparative analysis of miRNA TSS annotations from 14 representative studies. We further compiled a robust set of miRNA TSSs (RSmirT) that are supported by multiple studies. Integrative genomic and epigenomic data analysis on RSmirT revealed the genomic and epigenomic features of miRNA TSSs as well as their relations to protein-coding and long non-coding genes.

Characterisation of protein structure/function relationship by sequence analysis without previous alignment: Distinction between sub-groups of protein kinases

Bioscience Reports ◽

10.1007/bf01207456 ◽

1995 ◽

Vol 15 (3) ◽

pp. 161-171 ◽

Cited By ~ 2

Author(s):

Marie-Anne Guerrucci ◽

Robert Bellé

Keyword(s):

Structure Function ◽

Protein Kinases ◽

Tyrosine Kinases ◽

Large Scale ◽

Data Bank ◽

Structure Function Relationship ◽

Protein Functions ◽

Function Relationship ◽

Related Proteins ◽

Relationship Of

Using an approach for protein comparison by computer analysis based on signal treatment methods without previous alignment of the sequence, we have analysed the structure/function relationship of related proteins. The aim was to demonstrate that from a few members of related proteins, specific parameters can be obtained and used for the characterisation of newly sequenced proteins obtained by molecular biology techniques. The analysis was performed on protein kinases, which comprise the largest known family of proteins, and therefore allows valid estimations to be made. We show that using only a dozen defined proteins, the specific parameters extracted from their sequences classified the protein kinase family into two sub-groups: the protein serine/threonine kinases (PSKs) and the protein tyrosine kinases (PTKs). The analysis, largely involving computation, appears applicable to large scale data-bank analysis and prediction of protein functions.

Assessment of a complete and classified platelet proteome from genome-wide transcripts of human platelets and megakaryocytes covering platelet functions

Scientific Reports ◽

10.1038/s41598-021-91661-x ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Jingnan Huang ◽

Frauke Swieringa ◽

Fiorella A. Solari ◽

Isabella Provenzale ◽

Luigi Grassi ◽

...

Keyword(s):

Protein Function ◽

Intracellular Localization ◽

Transcript Level ◽

Human Platelets ◽

Secretory Proteins ◽

Protein Coding ◽

Copy Numbers ◽

Genome Wide ◽

Proteome Database ◽

Health And Disease

Abstract Novel platelet and megakaryocyte transcriptome analysis allows prediction of the full or theoretical proteome of a representative human platelet. Here, we integrated the established platelet proteomes from six cohorts of healthy subjects, encompassing 5.2 k proteins, with two novel genome-wide transcriptomes (57.8 k mRNAs). For 14.8 k protein-coding transcripts, we assigned the proteins to 21 UniProt-based classes, based on their preferential intracellular localization and presumed function. This classified transcriptome-proteome profile of platelets revealed: (i) Absence of 37.2 k genome-wide transcripts. (ii) High quantitative similarity of platelet and megakaryocyte transcriptomes (R = 0.75) for 14.8 k protein-coding genes, but not for 3.8 k RNA genes or 1.9 k pseudogenes (R = 0.43–0.54), suggesting redistribution of mRNAs upon platelet shedding from megakaryocytes. (iii) Copy numbers of 3.5 k proteins that were restricted in size by the corresponding transcript levels. (iv) Near complete coverage of identified proteins in the relevant transcriptome (log2fpkm > 0.20) except for plasma-derived secretory proteins, pointing to adhesion and uptake of such proteins. (v) Underrepresentation in the identified proteome of nuclear-related, membrane and signaling proteins, as well as proteins with low-level transcripts. We then constructed a prediction model, based on protein function, transcript level and (peri)nuclear localization, and calculated the achievable proteome at ~ 10 k proteins. Model validation identified 1.0 k additional proteins in the predicted classes. Network and database analysis revealed the presence of 2.4 k proteins with a possible role in thrombosis and hemostasis, and 138 proteins linked to platelet-related disorders. This genome-wide platelet transcriptome and (non)identified proteome database thus provides a scaffold for discovering the roles of unknown platelet proteins in health and disease.

Large Scale Identification of Genes Involved in Cell Surface Biosynthesis and Architecture in Saccharomyces cerevisiae

Genetics ◽

10.1093/genetics/147.2.435 ◽

1997 ◽

Vol 147 (2) ◽

pp. 435-450 ◽

Cited By ~ 42

Author(s):

Marc Lussier ◽

Ann-Marie White ◽

Jane Sheraton ◽

Tiziano di Paolo ◽

Julie Treadwell ◽

...

Keyword(s):

Cell Surface ◽

Large Scale ◽

Cell Function ◽

Eukaryotic Cell ◽

Yeast Genome ◽

Calcofluor White ◽

Genome Database ◽

Cellular Processes ◽

Genome Wide ◽

Mutant Phenotypes

The sequenced yeast genome offers a unique resource for the analysis of eukaryotic cell function and enables genome-wide screens for genes involved in cellular processes. We have identified genes involved in cell surface assembly by screening transposon-mutagenized cells for altered sensitivity to calcofluor white, followed by supplementary screens to further characterize mutant phenotypes. The mutated genes were directly retrieved from genomic DNA and then matched uniquely to a gene in the yeast genome database. Eighty-two genes with apparent perturbation of the cell surface were identified, with mutations in 65 of them displaying at least one further cell surface phenotype in addition to their modified sensitivity to calcofluor. Fifty of these genes were previously known, 17 encoded proteins whose function could be anticipated through sequence homology or previously recognized phenotypes and 15 genes had no previously known phenotype.
