ProtaBank: A repository for protein design and engineering data

2018 ◽  
Author(s):  
Connie Y. Wang ◽  
Paul M. Chang ◽  
Marie L. Ary ◽  
Benjamin D. Allen ◽  
Roberto A. Chica ◽  
...  

Abstract
We present ProtaBank, a repository for storing, querying, analyzing, and sharing protein design and engineering data in an actively maintained and updated database. ProtaBank provides a format to describe and compare all types of protein mutational data, spanning a wide range of properties and techniques. It features a user-friendly web interface and programming layer that streamlines data deposition and allows for batch input and queries. The database schema design incorporates a standard format for reporting protein sequences and experimental data that facilitates comparison of results across different data sets. A suite of analysis and visualization tools is provided to facilitate discovery, to guide future designs, and to benchmark and train new predictive tools and algorithms. ProtaBank will provide a valuable resource to the protein engineering community by storing and safeguarding newly generated data, allowing for fast searching and identification of relevant data from the existing literature, and exploring correlations between disparate data sets. ProtaBank invites researchers to contribute data to the database to make it accessible for search and analysis. ProtaBank is available at https://protabank.org.
Impact
The ProtaBank database provides a central repository for researchers to store, query, analyze, and share all types of protein engineering data. This modern database will serve a pivotal role in organizing protein engineering data and leveraging the increasingly large amounts of mutational data being generated. Together with the analysis tools, it will help scientists gain insights into sequence-function relationships, support the development of new predictive tools and algorithms, and facilitate future protein engineering efforts.
Abbreviations
3D, three-dimensional; API, application programming interface; AWS, Amazon Web Services; BLAST, Basic Local Alignment Search Tool; Cm, concentration of denaturant at midpoint of unfolding transition; CSV, comma-separated values; ΔG, Gibbs free energy of folding/unfolding; Gβ1, β1 domain of streptococcal protein G; GdmCl, guanidinium chloride; kcat, catalytic rate constant; Kd, dissociation constant; MIC, minimum inhibitory concentration; PDB, Protein Data Bank; PE, protein engineering; RDS, Relational Database Services; REST, Representational State Transfer; Tm, melting temperature
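
The standard record format for mutational data described in the abstract can be illustrated with a minimal sketch. The class and field names below are hypothetical, not ProtaBank's actual schema; they only show how a variant measurement might pair a parent sequence, a mutation list, and an assayed property so that results from different studies become comparable:

```python
from dataclasses import dataclass

@dataclass
class MutationRecord:
    """One measurement for a protein variant (illustrative, not ProtaBank's schema)."""
    parent_sequence: str   # wild-type amino acid sequence
    mutations: list        # point mutations, e.g. ["A23G", "L45F"], 1-indexed
    property_name: str     # measured property, e.g. "Tm" or "Kd"
    value: float
    unit: str

def apply_mutations(parent: str, mutations: list) -> str:
    """Return the variant sequence implied by simple point mutations."""
    seq = list(parent)
    for m in mutations:
        wt, pos, new = m[0], int(m[1:-1]), m[-1]
        assert seq[pos - 1] == wt, f"expected {wt} at position {pos}"
        seq[pos - 1] = new
    return "".join(seq)

rec = MutationRecord("MKTAYIAK", ["T3G"], "Tm", 67.5, "degC")
print(apply_mutations(rec.parent_sequence, rec.mutations))  # MKGAYIAK
```

Deriving the full variant sequence from parent-plus-mutations, as here, is one way a schema can make variants from different papers directly comparable by sequence.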

2013 ◽  
Vol 671-674 ◽  
pp. 3208-3211
Author(s):  
Yong Hong Gao

A wide range of important software engineering problems need solutions that involve accurately predicting outcomes, such as the number of defects in a module, the estimated cost of a project, or the best software development process to use. Mathematically, a classifier is a function that maps an N-dimensional attribute space to a discrete set of class labels. We propose an effective classifier for software engineering data sets; the results show that accurate predictions can have a substantial positive impact on mitigating these problems and help ensure project success.
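
The definition above (a map from an N-dimensional attribute space to discrete class labels) can be made concrete with a toy example. The abstract does not specify which classifier was used, so a simple nearest-neighbor rule on invented module metrics stands in here purely as an illustration:

```python
def nearest_neighbor_classify(train, labels, x):
    """Classify x by the label of its nearest training point (1-NN rule)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    i = min(range(len(train)), key=lambda k: dist(train[k], x))
    return labels[i]

# Toy software-engineering attributes: (lines of code, cyclomatic complexity)
train = [(120, 4), (950, 22), (200, 6), (1100, 30)]
labels = ["low", "high", "low", "high"]   # defect-proneness of each module
print(nearest_neighbor_classify(train, labels, (1000, 25)))  # high
```

Any classifier over such attribute vectors (decision trees, Bayesian networks, etc.) fits the same functional signature.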


2019 ◽  
Vol 28 (3) ◽  
pp. 672-672
Author(s):  
Connie Y. Wang ◽  
Paul M. Chang ◽  
Marie L. Ary ◽  
Benjamin D. Allen ◽  
Roberto A. Chica ◽  
...  

Micromachines ◽  
2019 ◽  
Vol 10 (11) ◽  
pp. 734 ◽  
Author(s):  
Lindong Weng ◽  
James E. Spoonamore

Protein engineering—the process of developing useful or valuable proteins—has successfully created a wide range of proteins tailored to specific agricultural, industrial, and biomedical applications. Protein engineering may rely on rational techniques informed by structural models, phylogenetic information, or computational methods, or it may rely upon random techniques such as chemical mutagenesis, DNA shuffling, error-prone polymerase chain reaction (PCR), etc. The increasing capabilities of rational protein design coupled to the rapid production of large variant libraries have seriously challenged the capacity of traditional screening and selection techniques. Similarly, random approaches based on directed evolution, which relies on the Darwinian principles of mutation and selection to steer proteins toward desired traits, also require the screening of very large libraries of mutants to be truly effective. For either rational or random approaches, the highest possible screening throughput facilitates efficient protein engineering strategies. In the last decade, high-throughput screening (HTS) for protein engineering has leveraged the emerging technologies of droplet microfluidics. Droplet microfluidics, featuring controlled formation and manipulation of nano- to femtoliter droplets of one fluid phase in another, has presented a new paradigm for screening, providing increased throughput, reduced reagent volume, and scalability. We review here the recent droplet microfluidics-based HTS systems developed for protein engineering, particularly directed evolution. This review can also serve as a tutorial guide for protein engineers and molecular biologists who need a droplet microfluidics-based HTS system for their specific applications but may not have prior knowledge of microfluidics. Finally, several challenges and opportunities are identified to motivate the continued innovation of microfluidics with implications for protein engineering.


2018 ◽  
Vol 27 (6) ◽  
pp. 1113-1124 ◽  
Author(s):  
Connie Y. Wang ◽  
Paul M. Chang ◽  
Marie L. Ary ◽  
Benjamin D. Allen ◽  
Roberto A. Chica ◽  
...  

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Eleanor F. Miller ◽  
Andrea Manica

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species' demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
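
mtDNAcombine itself is an R package; as a language-neutral illustration of the kind of curation step such a pipeline performs, the sketch below filters a hypothetical batch of downloaded sequences by length and by the fraction of ambiguous (non-ACGT) bases. The accession names and thresholds are invented for the example:

```python
def curate(sequences, min_len=500, max_ambiguous_frac=0.01):
    """Keep sequences that are long enough and contain few ambiguous bases."""
    kept = {}
    for acc, seq in sequences.items():
        seq = seq.upper()
        ambiguous = sum(b not in "ACGT" for b in seq)
        if len(seq) >= min_len and ambiguous / len(seq) <= max_ambiguous_frac:
            kept[acc] = seq
    return kept

# Hypothetical downloaded records: one clean, one too short, one N-heavy
demo = {"AB0001": "ACGT" * 200, "AB0002": "ACGT" * 50, "AB0003": "ACGN" * 200}
print(sorted(curate(demo)))  # ['AB0001']
```

Quality filters like this are one of the "many complexities" the abstract alludes to when compiling multi-study data sets.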


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in the pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. Our criterion for the goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package. The code, under license GPL-3, is available on the github platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
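
MUREN's pairwise step uses robust least trimmed squares regression; the sketch below substitutes a simpler robust analogue, a trimmed mean of log-ratios, to show how a scale factor between a sample and a reference can be estimated while ignoring strongly differential genes (the counts and trim fraction are illustrative, not MUREN's actual procedure):

```python
import math

def pairwise_scale(sample, reference, trim=0.25):
    """Robust scale factor between two count vectors via a trimmed mean of log-ratios."""
    ratios = sorted(math.log(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0)
    k = int(len(ratios) * trim)                      # drop k smallest and k largest
    core = ratios[k:len(ratios) - k] if k else ratios
    return math.exp(sum(core) / len(core))

ref = [10.0, 20.0, 30.0, 40.0, 50.0]
smp = [20.0, 40.0, 60.0, 80.0, 5000.0]   # last gene is truly differential
print(round(pairwise_scale(smp, ref), 2))  # 2.0
```

The trimming plays the role of the statistical counterpart of housekeeping genes: the scale estimate comes from the stably expressed core, so the one strongly differential gene does not distort it.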


2021 ◽  
Vol 22 (3) ◽  
pp. 1157
Author(s):  
Pablo Aza ◽  
Felipe de Salas ◽  
Gonzalo Molpeceres ◽  
David Rodríguez-Escribano ◽  
Iñigo de la Fuente ◽  
...  

Laccases secreted by saprotrophic basidiomycete fungi are versatile biocatalysts able to oxidize a wide range of aromatic compounds using oxygen as the sole requirement. Saccharomyces cerevisiae is a preferred host for engineering fungal laccases. To assist the difficult secretion of active enzymes by yeast, the native signal peptide is usually replaced by the preproleader of the S. cerevisiae alpha mating factor (MFα1). However, in most cases, only basal enzyme levels are obtained. During directed evolution in S. cerevisiae of laccases fused to the α-factor preproleader, we demonstrated that mutations accumulated in the signal peptide notably raised enzyme secretion. Here we describe different protein engineering approaches carried out to enhance the laccase activity detected in the liquid extracts of S. cerevisiae cultures. We demonstrate the improved secretion of native and engineered laccases by using the fittest mutated α-factor preproleader obtained through successive laccase evolution campaigns in our lab. Special attention is also paid to the role of protein N-glycosylation in laccase production and properties, and to the introduction of conserved amino acids through consensus design, which enabled the expression of certain laccases otherwise not produced by the yeast. Finally, we review the contribution of mutations accumulated in the laccase coding sequence (CDS) during previous directed evolution campaigns that facilitate enzyme production.


Sensors ◽  
2021 ◽  
Vol 21 (10) ◽  
pp. 3406
Author(s):  
Jie Jiang ◽  
Yin Zou ◽  
Lidong Chen ◽  
Yujie Fang

Precise localization and pose estimation in indoor environments are commonly employed in a wide range of applications, including robotics, augmented reality, and navigation and positioning services. Such applications can be solved via visual-based localization using a pre-built 3D model. The increase in search space associated with large scenes can be overcome by retrieving images in advance and subsequently estimating the pose. The majority of current deep learning-based image retrieval methods require labeled data, which increases annotation costs and complicates data acquisition. In this paper, we propose an unsupervised hierarchical indoor localization framework that integrates an unsupervised variational autoencoder (VAE) network with a visual-based Structure-from-Motion (SfM) approach in order to extract global and local features. During the localization process, global features are applied for image retrieval at the level of the scene map in order to obtain candidate images, while local features are subsequently used to estimate the pose from 2D-3D matches between query and candidate images. Only RGB images are used as input to the proposed localization system, which is both convenient and challenging. Experimental results reveal that the proposed method can localize images within 0.16 m and 4° in the 7-Scenes data sets and localize 32.8% of images within 5 m and 20° in the Baidu data set. Furthermore, our proposed method achieves higher precision than advanced methods.
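
The retrieval stage described above (ranking database images by global-feature similarity to obtain candidate images) can be sketched as follows. The three-dimensional "latent vectors" and image ids here are toy stand-ins for what the VAE encoder would produce over a real scene map:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, db):
    """Return database image ids ranked by global-feature similarity to the query."""
    return sorted(db, key=lambda k: cosine(query_vec, db[k]), reverse=True)

db = {"img_a": [0.9, 0.1, 0.0], "img_b": [0.1, 0.9, 0.1], "img_c": [0.8, 0.2, 0.1]}
print(retrieve([1.0, 0.1, 0.0], db)[0])  # img_a
```

In the full pipeline, the top-ranked candidates would then feed the local-feature 2D-3D matching stage that produces the final pose.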


2016 ◽  
Vol 2016 ◽  
pp. 1-18 ◽  
Author(s):  
Mustafa Yuksel ◽  
Suat Gonul ◽  
Gokce Banu Laleci Erturkmen ◽  
Ali Anil Sinaci ◽  
Paolo Invernizzi ◽  
...  

Depending mostly on voluntarily sent spontaneous reports, pharmacovigilance studies are hampered by the low quantity and quality of patient data. Our objective is to improve postmarket safety studies by enabling safety analysts to seamlessly access a wide range of EHR sources for collecting deidentified medical data sets of selected patient populations and tracing the reported incidents back to the original EHRs. We have developed an ontological framework where EHR sources and target clinical research systems can continue using their own local data models, interfaces, and terminology systems, while structural and semantic interoperability are handled through rule-based reasoning on formal representations of the different models and terminology systems maintained in the SALUS Semantic Resource Set. The SALUS Common Information Model at the core of this set acts as the common mediator. We demonstrate the capabilities of our framework through one of the SALUS safety analysis tools, namely the Case Series Characterization Tool, which has been deployed on top of the regional EHR data warehouse of the Lombardy Region, containing about 1 billion records from 16 million patients, and validated by several pharmacovigilance researchers with real-life cases. The results confirm significant improvements in signal detection and evaluation compared to traditional methods that lack this background information.
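
The rule-based terminology mediation described above is far richer than a lookup table, but a minimal sketch conveys the core idea: source-specific codes are translated into a common mediator code before analysis, so each EHR source keeps its local terminology. The site names, local codes, and mapping entries below are hypothetical examples, not SALUS artifacts:

```python
# Hypothetical mapping table: (source system, local code) -> common mediator code
LOCAL_TO_COMMON = {
    ("site_a", "GLU-S"): "LOINC:2345-7",        # serum glucose (illustrative)
    ("site_b", "lab_glucose"): "LOINC:2345-7",  # same concept, different local code
}

def to_common(source, local_code):
    """Translate a source-specific code into the shared model, or flag it as unmapped."""
    return LOCAL_TO_COMMON.get((source, local_code), "UNMAPPED")

print(to_common("site_a", "GLU-S"))  # LOINC:2345-7
```

In the actual framework this translation is driven by reasoning over formal representations of the terminology systems rather than a static table.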


Radiocarbon ◽  
2012 ◽  
Vol 54 (3-4) ◽  
pp. 449-474 ◽  
Author(s):  
Sturt W Manning ◽  
Bernd Kromer

The debate over the dating of the Santorini (Thera) volcanic eruption has seen sustained efforts to criticize or challenge the radiocarbon dating of this time horizon. We consider some of the relevant areas of possible movement in the ¹⁴C dating and, in particular, any plausible mechanisms to support as late (most recent) a date as possible. First, we report and analyze data investigating the scale of apparent possible ¹⁴C offsets (growing-season related) in the Aegean-Anatolia-east Mediterranean region (excluding the southern Levant and especially pre-modern, pre-dam Egypt, which is a distinct case), and find no evidence for more than very small possible offsets from several cases. This topic is thus not an explanation for current differences in dating in the Aegean and at best provides only a few years of latitude. Second, we consider some aspects of the accuracy and precision of ¹⁴C dating with respect to the Santorini case. While the existing data appear robust, we nonetheless speculate that examination of the frequency distribution of the ¹⁴C data on short-lived samples from the volcanic destruction level at Akrotiri on Santorini (Thera) may indicate that the average value of the overall data sets is not necessarily the most appropriate ¹⁴C age to use for dating this time horizon. We note the recent paper of Soter (2011), which suggests that in such a volcanic context some (small) age increment may be possible from diffuse CO₂ emissions (the effect is hypothetical at this stage and has not been observed in the field), and that "if short-lived samples from the same stratigraphic horizon yield a wide range of ¹⁴C ages, the lower values may be the least altered by old CO₂."
In this context, it might be argued that a substantive "low" grouping of ¹⁴C ages observable within the overall ¹⁴C data sets on short-lived samples from the Thera volcanic destruction level, centered about 3326–3328 BP, is perhaps more representative of the contemporary atmospheric ¹⁴C age (without any volcanic CO₂ contamination). This is a subjective argument (since, in statistical terms, the existing studies using the weighted average remain valid) that looks to support as late a date as reasonable from the ¹⁴C data. The impact of employing this revised ¹⁴C age is discussed. In general, a late 17th century BC date range is found (to remain) to be most likely even if such a late-dating strategy is followed; a late 17th century BC date range is thus a robust finding from the ¹⁴C evidence even allowing for various possible variation factors. However, the possibility of a mid-16th century BC date (within ∼1593–1530 cal BC) is increased when compared against previous analyses if the Santorini data are considered in isolation.
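
The contrast above between the weighted average and the "low" grouping is statistical. The sketch below computes the standard inverse-variance weighted mean of a set of ¹⁴C determinations, the quantity the existing studies rely on; the ages and errors are synthetic, purely to show the arithmetic, and are not the Akrotiri measurements:

```python
def weighted_mean(ages, errors):
    """Inverse-variance weighted mean of 14C ages (BP) and its standard error."""
    weights = [1.0 / e ** 2 for e in errors]
    mean = sum(w * a for w, a in zip(weights, ages)) / sum(weights)
    err = (1.0 / sum(weights)) ** 0.5
    return mean, err

# Synthetic short-lived-sample determinations (age BP, 1-sigma error)
ages = [3345, 3326, 3328, 3350, 3327]
errors = [12, 10, 10, 15, 10]
m, e = weighted_mean(ages, errors)
print(round(m), round(e, 1))
```

With data like these, the weighted mean sits above a low cluster near 3326-3328 BP, which is exactly the kind of gap the "low grouping" argument turns on: the lower values would be the least affected by any old-CO₂ increment.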

