HMM_RA: An Improved Method for Alpha-Helical Transmembrane Protein Topology Prediction

2008 ◽  
Vol 2 ◽  
pp. BBI.S358 ◽  
Author(s):  
Jing Hu ◽  
Changhui Yan

α-helical transmembrane (TM) proteins play important and diverse functional roles in cells. The ability to predict the topology of these proteins is important for identifying functional sites and inferring the function of membrane proteins. This paper presents a Hidden Markov Model (referred to as HMM_RA) that predicts the topology of α-helical transmembrane proteins with improved performance. HMM_RA adopts the same structure as the HMMTOP method, which has five modules: inside loop, inside helix tail, membrane helix, outside helix tail and outside loop. Each module consists of one or multiple states. HMM_RA allows protein sequences to be encoded with reduced alphabets, so each state of HMM_RA is associated with n emission probabilities, where n is the size of the reduced alphabet set. Direct comparisons using two standard data sets show that HMM_RA consistently outperforms HMMTOP and TMHMM in topology prediction. Specifically, on a high-quality data set of 83 proteins, HMM_RA outperforms HMMTOP by up to 7.6% in topology accuracy and 6.4% in α-helix location accuracy. On the same data set, HMM_RA outperforms TMHMM by up to 6.4% in topology accuracy and 2.9% in location accuracy. Comparison also shows that HMM_RA achieves performance comparable to that of Phobius, a recently published method.
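As a minimal illustration of the reduced-alphabet idea described above (the four-group mapping below is a hypothetical example, not the grouping used by HMM_RA):

# Hypothetical reduced amino-acid alphabet: each residue is mapped to one of four
# symbols, so an HMM state needs only four emission probabilities instead of twenty.
REDUCED_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_TO_GROUP = {aa: group for group, aas in REDUCED_GROUPS.items() for aa in aas}

def encode(sequence: str) -> list[str]:
    """Map each residue of a protein sequence to its reduced-alphabet symbol."""
    return [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]

print(encode("MKTLLVAGD"))
# ['hydrophobic', 'positive', 'polar', 'hydrophobic', 'hydrophobic',
#  'hydrophobic', 'hydrophobic', 'polar', 'negative']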

Author(s):  
Vasile Rus ◽  
Mihai Lintean ◽  
Arthur C. Graesser ◽  
Danielle S. McNamara

Assessing the semantic similarity between two texts is a central task in many applications, including summarization, intelligent tutoring systems, and software testing. Similarity of texts is typically explored at the level of the word, sentence, paragraph, and document. The similarity can be defined quantitatively (e.g. as a normalized value between 0 and 1) or qualitatively, in the form of semantic relations such as elaboration, entailment, or paraphrase. In this chapter, we focus first on quantitatively measuring and then on qualitatively detecting sentence-level text-to-text semantic relations. A generic approach that relies on word-to-word similarity measures is presented, along with experiments and results obtained with various instantiations of the approach. In addition, we provide results of a study on the role of weighting in Latent Semantic Analysis, a statistical technique for assessing the similarity of texts. The results were obtained on two data sets: a standard data set on sentence-level paraphrase detection and a data set from an intelligent tutoring system.
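A minimal sketch of the generic word-to-word approach mentioned above: a sentence-level score is aggregated from the best pairwise word similarities. The word_sim function here is a stand-in (exact string match); actual instantiations would use WordNet- or LSA-based word similarity measures.

def word_sim(w1: str, w2: str) -> float:
    # Stand-in word-to-word similarity: exact string match only.
    return 1.0 if w1.lower() == w2.lower() else 0.0

def sentence_sim(s1: str, s2: str) -> float:
    """Average, over the words of s1, of the best-matching word similarity in s2."""
    t1, t2 = s1.split(), s2.split()
    if not t1 or not t2:
        return 0.0
    return sum(max(word_sim(w1, w2) for w2 in t2) for w1 in t1) / len(t1)

# A symmetric combination of sentence_sim(a, b) and sentence_sim(b, a) can then be
# thresholded to decide a qualitative relation such as paraphrase.
print(sentence_sim("the cat sat on the mat", "the cat slept on the mat"))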


2002 ◽  
Vol 30 (4) ◽  
pp. 3-13
Author(s):  
Christie Wood ◽  
Duane Pennebaker

In order to provide a framework for standardised data reporting in the Australian nongovernment community mental health sector, a Data Dictionary and standard data set were developed. Advisory Committee and key stakeholder consultation, review of local and national minimum data sets and stakeholder validation informed this process. This resulted in a Data Dictionary containing 37 items and a standard data set containing 15 items. These items conform to the Australian Institute of Health & Welfare's (AIHW) standards and address Leginski et al.'s (1989) decision standards.


SPE Journal ◽  
2011 ◽  
Vol 16 (03) ◽  
pp. 698-712 ◽  
Author(s):  
Aysegul Dastan ◽  
Roland N. Horne

Nonlinear regression is a well-established technique in well-test interpretation. However, this widely used technique is vulnerable to issues commonly observed in real data sets—specifically, sensitivity to noise, parameter uncertainty, and dependence on the starting guess. In this paper, we show significant improvements in nonlinear regression by using transformations on the parameter space and the data space. Our techniques improve the accuracy of parameter estimation substantially. The techniques also provide faster convergence, reduced sensitivity to starting guesses, automatic noise reduction, and data compression. In the first part of the paper, we show, for the first time, that Cartesian parameter transformations are necessary for correct statistical representation of physical systems (e.g., the reservoir). Using true Cartesian parameters enables nonlinear regression to search for the optimal solution homogeneously over the entire parameter space, which results in faster convergence and increases the probability of convergence for a random starting guess. Nonlinear regression using Cartesian parameters also reveals inherent ambiguities in a data set, which may be left concealed when using existing techniques, leading to incorrect conclusions. We propose suitable Cartesian transform pairs for common reservoir parameters and use a Monte Carlo technique to verify that the transform pairs generate Cartesian parameters. The second part of the paper discusses nonlinear regression using the wavelet transformation of the data set. The wavelet transformation is a process that can compress and denoise data automatically. We show that only a few wavelet coefficients are sufficient for improved performance and direct control of nonlinear regression. By using regression on a reduced wavelet basis rather than the original pressure data points, we achieve improved performance in terms of likelihood of convergence and narrower confidence intervals. The wavelet components in the reduced basis isolate the key contributors to the response and, hence, use only the relevant elements in the pressure-transient signal. We investigate four different wavelet strategies, which differ in the method of choosing a reduced wavelet basis. Combinations of the techniques discussed in this paper are used to analyze 20 data sets to find the technique or combination of techniques that works best with a particular data set. Using the appropriate combination of our techniques provides robust and novel interpretation techniques, which allow for reliable estimation of reservoir parameters using nonlinear regression.
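A hedged sketch of the data-space idea described above, assuming the PyWavelets (pywt) and NumPy packages: the pressure signal is represented by its largest wavelet coefficients, and regression can then match only that reduced basis. The thresholding rule (keep the k largest coefficients) is an assumption for illustration, not the authors' exact strategy.

import numpy as np
import pywt

def reduced_wavelet_basis(signal: np.ndarray, wavelet: str = "db4", keep: int = 32):
    """Zero all but the `keep` largest wavelet coefficients of a 1-D signal."""
    coeffs = pywt.wavedec(signal, wavelet)
    flat, slices = pywt.coeffs_to_array(coeffs)
    order = np.argsort(np.abs(flat))[::-1]   # coefficient indices sorted by magnitude
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order[:keep]] = True
    return pywt.array_to_coeffs(np.where(mask, flat, 0.0), slices,
                                output_format="wavedec")

def reconstruct(coeffs, wavelet: str = "db4") -> np.ndarray:
    """Inverse transform of the compressed (and thereby denoised) coefficient set."""
    return pywt.waverec(coeffs, wavelet)

# Regression on the reduced basis would then minimise the misfit between the kept
# wavelet coefficients of the observed pressure data and those of the model response,
# instead of matching every raw data point.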


2021 ◽  
Vol 11 (5) ◽  
pp. 2232
Author(s):  
Francesca Noardo ◽  
Ken Arroyo Ohori ◽  
Thomas Krijnen ◽  
Jantien Stoter

Industry Foundation Classes (IFC) is a complete, wide and complex open standard data model for representing Building Information Models. Considerable efforts are being made by the standardization organization buildingSMART, in collaboration with researchers, companies and institutions, to develop and maintain this standard. However, when trying to use IFC models from practice for automatic analysis, issues emerge as a consequence of a misalignment between what is prescribed by, or available in, the standard and the data sets that are produced in practice. In this study, a sample of models produced by practitioners for aims other than their explicit use within automatic processing tools is inspected and analyzed. The aim is to find common patterns in data sets from practice and their possible discrepancies with the standard, in order to find ways to address such discrepancies in a subsequent step. In particular, it is noticeable that the overall quality of the models requires specific additional care by the modellers before relying on them for automatic analysis, and that a high level of variability is present concerning the storage of some relevant information (such as georeferencing).
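As a hedged illustration of the kind of automatic inspection mentioned above (assuming the IfcOpenShell library; the file name is a placeholder), one common check is whether georeferencing is stored on IfcSite:

import ifcopenshell

model = ifcopenshell.open("building.ifc")  # placeholder path to a practitioner model

for site in model.by_type("IfcSite"):
    # RefLatitude/RefLongitude are optional in the schema, which is one source of the
    # variability in georeferencing observed in practice.
    if site.RefLatitude is None or site.RefLongitude is None:
        print(f"{site.Name}: no geographic coordinates stored on IfcSite")
    else:
        print(f"{site.Name}: lat={site.RefLatitude}, lon={site.RefLongitude}, "
              f"elevation={site.RefElevation}")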


2019 ◽  
Vol 115 (3/4) ◽  
Author(s):  
Douw G. Breed ◽  
Tanja Verster

Segmentation of data for the purpose of enhancing predictive modelling is a well-established practice in the banking industry. Unsupervised and supervised approaches are the two main types of segmentation, and examples of improved performance of predictive models exist for both approaches. However, both focus on a single aspect – either target separation or independent variable distribution – and combining them may deliver better results. This combined approach is called semi-supervised segmentation. Our objective was to explore four new semi-supervised segmentation techniques that may offer alternative strengths. We applied these techniques to six data sets from different domains and compared the model performance achieved. The original semi-supervised segmentation technique was the best for two of the data sets (as measured by the improvement in validation-set Gini), but the newer techniques outperformed it on the other four data sets. Significance: We propose four newly developed semi-supervised segmentation techniques that can be used as additional tools for segmenting data before fitting a logistic regression. In all comparisons, using semi-supervised segmentation before fitting a logistic regression improved the modelling performance (as measured by the Gini coefficient on the validation data set) compared with using unsegmented logistic regression.
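A minimal sketch of the evaluation metric referred to above, assuming scikit-learn: the Gini coefficient on a validation set is computed as 2·AUC − 1 from the scores of a fitted logistic regression (the segmentation step itself is abstracted away here, and the variable names are placeholders).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def validation_gini(model, X_val: np.ndarray, y_val: np.ndarray) -> float:
    """Gini coefficient of a binary classifier on held-out data: 2 * AUC - 1."""
    scores = model.predict_proba(X_val)[:, 1]
    return 2.0 * roc_auc_score(y_val, scores) - 1.0

# Unsegmented baseline vs. segmented modelling (one logistic regression per segment,
# with segments produced by any unsupervised, supervised or semi-supervised rule):
# baseline  = validation_gini(LogisticRegression(max_iter=1000).fit(X_tr, y_tr), X_val, y_val)
# segmented = average of validation_gini over the per-segment models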


2019 ◽  
Vol 25 (5) ◽  
pp. 651-674 ◽  
Author(s):  
Katja Zupan ◽  
Nikola Ljubešić ◽  
Tomaž Erjavec

Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.
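As a small illustration of what word normalisation targets in the setup above (the tiny vocabulary and normalisation lexicon below are invented for illustration), the out-of-vocabulary rate of the tagger's lexicon drops once non-standard tokens are normalised:

def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Fraction of tokens not covered by the tagger's training vocabulary."""
    if not tokens:
        return 0.0
    return sum(t.lower() not in vocabulary for t in tokens) / len(tokens)

vocab = {"the", "weather", "is", "nice", "today"}                 # toy tagger lexicon
normalise = {"da": "the", "wether": "weather", "2day": "today"}   # toy normalisation lexicon

raw = "da wether is nice 2day".split()
normalised = [normalise.get(t, t) for t in raw]

print(oov_rate(raw, vocab), oov_rate(normalised, vocab))   # 0.6 -> 0.0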


2017 ◽  
Vol 1 (4) ◽  
pp. 115-116
Author(s):  
Masoud Sotoudehfar ◽  
Zahra Mazloum Khorasani ◽  
Zahra Ebnehoseini ◽  
Kobra Etminani ◽  
Mahmoud Tara ◽  
...  

Introduction: The number of people with diabetes is increasing. More than 220 million people have diabetes, more than 70% of whom live in low- and middle-income countries. Many innovations aimed at improving the managed care of diabetes already exist around the world; diabetes registries are one of them. In Iran, the development and evaluation of diabetes information systems is among the top research priorities, since defining health regulations and evaluating diabetes prevention programmes depend on a powerful information system, yet complete information about the incidence and prevalence of diabetes does not exist in Iran. Determining standard data elements (DEs) and designing a diabetes registry are therefore among the country's most important requirements, and investigating this subject is the main purpose of this study. Methods: This is a descriptive-analytic study. Resources related to diabetes DEs were collected from selected minimum data sets. The diabetes DE set derived from these minimum data sets was then examined in focus group sessions with endocrine specialists and experts in health informatics and health information management. Duplicate DEs were removed and similar DEs were combined. Seven endocrine specialists then evaluated the diabetes DE set, rating the value of each DE using the Delphi technique (scores ranging from 0 to 5). DEs for which more than 75% of the ratings were grade 4 or 5 remained in the study. Based on the expert opinions, the final version of the diabetes DE set was designed. Results: The literature review yielded 455 DEs for inclusion in the study; after the Delphi sessions, 293 data elements remained. The main categories of DEs are: 1) patient demographic characteristics (12 DEs), 2) patient referral (5 DEs), 3) diabetes care follow-up (15 DEs), 4) physical exam, chief complaint and assessment (40 DEs), 5) history (such as individual, developmental, family, drug abuse) (10 DEs), 6) pregnancy management (13 DEs), 7) screening (10 DEs), 8) specialty evaluations (such as cardiovascular (18 DEs), neuropathy (16 DEs), nephropathy (7 DEs), teeth and mouth (3 DEs), eyes (14 DEs), psychological status (2 DEs), sexual function (1 DE)), 9) laboratory exams (33 DEs), 10) drugs (oral antidiabetic drugs (14 DEs), injectable antidiabetics (7 DEs), lipid (11 DEs), hypertension (20 DEs), antiplatelets (2 DEs), cardiac (3 DEs), insulin preparation method (5 DEs)), 11) physical activity (4 DEs), 12) diet (12 DEs), 13) education and self-care (13 DEs). Conclusion: In this study, a diabetes DE set was determined that provides an appropriate basis for gathering data and recording all information required for diabetes care. Because diabetes is a chronic disease and patients live with it for years, implementing the diabetes DE set can improve documentation and diabetes care.
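A minimal sketch of the Delphi retention rule described above (the candidate elements and ratings below are invented for illustration): a data element is kept only if more than 75% of the expert scores are grade 4 or 5.

def retained(ratings: list[int], threshold: float = 0.75) -> bool:
    """ratings: one 0-5 Delphi score per expert for a single data element."""
    high = sum(r >= 4 for r in ratings)
    return high / len(ratings) > threshold

candidate_des = {
    "HbA1c": [5, 5, 4, 4, 5, 4, 5],       # 7/7 ratings are grade 4 or 5 -> retained
    "shoe size": [1, 2, 0, 3, 2, 1, 4],   # 1/7 ratings are grade 4 or 5 -> dropped
}
final_des = [name for name, scores in candidate_des.items() if retained(scores)]
print(final_des)  # ['HbA1c']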


2021 ◽  
Vol 7 (12) ◽  
pp. 254
Author(s):  
Loris Nanni ◽  
Michelangelo Paci ◽  
Sheryl Brahnam ◽  
Alessandra Lumini

Convolutional neural networks (CNNs) have gained prominence in the research literature on image classification over the last decade. One shortcoming of CNNs, however, is their lack of generalizability and tendency to overfit when presented with small training sets. Augmentation directly confronts this problem by generating new data points that provide additional information. In this paper, we investigate the performance of more than ten different sets of data augmentation methods, with two novel approaches proposed here: one based on the discrete wavelet transform and the other on the constant-Q Gabor transform. Pretrained ResNet50 networks are fine-tuned on each augmentation method. Combinations of these networks are evaluated and compared across four benchmark data sets of images representing diverse problems and collected by instruments that capture information at different scales: a virus data set, a bark data set, a portrait data set, and a LIGO glitches data set. Experiments demonstrate the superiority of this approach. The best ensemble proposed in this work achieves state-of-the-art (or comparable) performance across all four data sets. This result shows that varying the data augmentation is a feasible way of building an ensemble of classifiers for image classification.
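A hedged sketch in the spirit of the DWT-based augmentation mentioned above, not the authors' exact method (assumes the PyWavelets and NumPy packages): an image is decomposed, its detail sub-bands are perturbed, and the result is reconstructed as a new training sample.

import numpy as np
import pywt

def dwt_augment(image: np.ndarray, scale: float = 0.9, wavelet: str = "haar") -> np.ndarray:
    """Damp the detail coefficients of a 2-D grayscale image and reconstruct it."""
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    augmented = pywt.idwt2((cA, (cH * scale, cV * scale, cD * scale)), wavelet)
    return augmented[: image.shape[0], : image.shape[1]]  # crop any padding

rng = np.random.default_rng(0)
img = rng.random((64, 64))          # placeholder grayscale image
print(dwt_augment(img).shape)       # (64, 64)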


2004 ◽  
Vol 60 (6) ◽  
pp. 734-738 ◽  
Author(s):  
Lars Eriksson ◽  
Johan Eriksson ◽  
Jiwei Hu

We show that the lack of good quality data, normally essential to successful crystal structure analysis, can in part be compensated for by measuring data from several crystals and merging the resulting data sets. The crystal structure of the flame retardant di-p-bromophenyl ether, C12H8Br2O, a twofold axially symmetric molecule, has been redetermined and refined from such a merged multi-crystal diffraction data set to an acceptable conventional R factor (R1 = 0.06), a result which could not have been obtained from any one of our single-crystal diffraction data sets used alone in the normal manner.
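For reference, the conventional R factor quoted above is the standard crystallographic residual over observed and calculated structure-factor amplitudes (a textbook definition, not taken from the paper):

R_1 = \frac{\sum_{hkl} \bigl| |F_{\mathrm{obs}}| - |F_{\mathrm{calc}}| \bigr|}{\sum_{hkl} |F_{\mathrm{obs}}|}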


2011 ◽  
Vol 4 (5) ◽  
pp. 775-793 ◽  
Author(s):  
S. M. Illingworth ◽  
J. J. Remedios ◽  
H. Boesch ◽  
S.-P. Ho ◽  
D. P. Edwards ◽  
...  

Observations of atmospheric carbon monoxide (CO) can only be made on continental and global scales by remote sensing instruments situated in space. One such instrument is the Infrared Atmospheric Sounding Interferometer (IASI), producing spectrally resolved, top-of-atmosphere radiance measurements from which CO vertical layers and total columns can be retrieved. This paper presents a technique for intercomparisons of satellite data with low vertical resolution. The example in the paper also generates the first intercomparison between an IASI CO data set, in this case that produced by the University of Leicester IASI Retrieval Scheme (ULIRS), and the V3 and V4 operationally retrieved CO products from the Measurements Of Pollution In The Troposphere (MOPITT) instrument. The comparison is performed for a localised region of Africa, primarily for an ocean day-time configuration, in order to develop the technique for instrument intercomparison in a region with well-defined a priori. By comparing both the standard data and a special version of MOPITT data retrieved using the ULIRS a priori for CO, it is shown that standard intercomparisons of CO are strongly affected by the differing a priori data of the retrievals, and by the differing sensitivities of the two instruments. In particular, the differing a priori profiles for MOPITT V3 and V4 data result in systematic retrieved profile changes as expected. An application of averaging kernels is used to derive a difference quantity which is much less affected by smoothing error, and hence more sensitive to systematic error. These conclusions are confirmed by simulations with model profiles for the same region. This technique is used to show that for the data that have been processed the systematic bias between MOPITT V4 and ULIRS IASI data, at MOPITT vertical resolution, is less than 7 % for the comparison data set, and on average appears to be less than 4 %. The results of this study indicate that intercomparisons of satellite data sets with low vertical resolution should ideally be performed with: retrievals using a common a priori appropriate to the geographic region studied; the application of averaging kernels to compute difference quantities with reduced a priori influence; and a comparison with simulated differences using model profiles for the target gas in the region.
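The averaging-kernel step referred to above follows the standard formulation for low-vertical-resolution retrievals (a textbook relation, not quoted from the paper): a retrieved profile relates to the true profile through the a priori x_a and the averaging-kernel matrix A, so applying one instrument's kernel to the other's profile yields, for example, a difference quantity with reduced a priori and smoothing influence.

\hat{x} \simeq x_a + \mathbf{A}\,(x_{\mathrm{true}} - x_a), \qquad
x_{\mathrm{IASI \rightarrow MOPITT}} = x_a + \mathbf{A}_{\mathrm{MOPITT}}\,(x_{\mathrm{IASI}} - x_a)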

