Khmer Treebank Construction via Interactive Tree Visualization

Bonpagna Kann; Thodsaporn Chay-intr; Hour Kaing; Thanaruk Theeramunkong

doi:10.22146/ijitee.48545

IJITEE (International Journal of Information Technology and Electrical Engineering) ◽

10.22146/ijitee.48545 ◽

2019 ◽

Vol 3 (3) ◽

pp. 67

Author(s):

Bonpagna Kann ◽

Thodsaporn Chay-intr ◽

Hour Kaing ◽

Thanaruk Theeramunkong

Keyword(s):

Language Processing ◽

Cross Validation ◽

The Other ◽

Pos Tagging ◽

Grammar Rules ◽

High Level ◽

Leave One Out ◽

Self Consistency ◽

Fold Cross Validation ◽

Tree Visualization

Although a number of studies have addressed the Khmer language in Natural Language Processing, and some resources exist for word segmentation and POS tagging, high-level syntactic resources such as treebanks and grammars are still lacking. This paper presents a semi-automatic framework for constructing a Khmer treebank and for extracting Khmer grammar rules from a set of sentences taken from Khmer grammar books. The sentences are first manually annotated; once the treebank is obtained, they are processed to generate grammar rules with their probabilities. In our experiments, the annotated trees and the extracted grammar rules are analyzed both quantitatively and qualitatively. Finally, the results are evaluated under three evaluation schemes (Self-Consistency, 5-Fold Cross-Validation, and Leave-One-Out Cross-Validation) using three metrics: Precision, Recall, and F1-Measure. Self-Consistency gives the best result, at more than 92%, followed by Leave-One-Out Cross-Validation and 5-Fold Cross-Validation, with averages of 88% and 75% respectively. On the crossing-bracket measure, however, Leave-One-Out Cross-Validation has the highest average at 96%, while the other two reach 85% and 89%, respectively.
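As an illustration of the evaluation setup described in this abstract, the following Python sketch shows how k-fold and leave-one-out splits can be generated and how precision, recall, and F1 can be computed over labeled constituent spans. It is not the authors' code: the span representation (label, start, end), the function names, and the round-robin fold assignment are all assumptions made for the example.

```python
def prf(gold_spans, predicted_spans):
    """Precision, recall and F1 over labeled spans (label, start, end).

    Assumption: a predicted constituent counts as correct only if its
    label and both boundaries match a gold constituent exactly.
    """
    gold, predicted = set(gold_spans), set(predicted_spans)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k folds.

    Items are assigned to folds round-robin; a real experiment would
    usually shuffle the sentences first.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


def leave_one_out_splits(n_items):
    """Leave-one-out is k-fold with one item per fold."""
    return k_fold_splits(n_items, n_items)
```

With 5 folds, each model is trained on roughly 80% of the sentences, whereas leave-one-out trains on all but one sentence.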


Study on Yang-Xu Using Body Constitution Questionnaire and Blood Variables in Healthy Volunteers

Evidence-based Complementary and Alternative Medicine ◽

10.1155/2016/9437382 ◽

2016 ◽

Vol 2016 ◽

pp. 1-7 ◽

Cited By ~ 7

Author(s):

Hong-Jhang Chen ◽

Yii-Jeng Lin ◽

Pei-Chen Wu ◽

Wei-Hsiang Hsu ◽

Wan-Chung Hu ◽

...

Keyword(s):

Healthy Subjects ◽

Logistic Regression Model ◽

Cross Validation ◽

Blood Biomarkers ◽

Metabolic Characteristics ◽

Body Constitution ◽

Leave One Out ◽

The Relationship ◽

Fold Cross Validation ◽

Blood Variables

Traditional Chinese medicine (TCM) formulates treatment according to body constitution (BC) differentiation. Different constitutions have specific metabolic characteristics and different susceptibility to certain diseases. This study aimed to assess the Yang-Xu constitution using a body constitution questionnaire (BCQ) and clinical blood variables. A BCQ was employed to assess the clinical manifestation of Yang-Xu. A logistic regression model was used to explore the relationship between BC scores and biomarkers. Leave-one-out cross-validation (LOOCV) and K-fold cross-validation were performed to evaluate the accuracy of the predictive model in practice. Decision trees (DTs) were used to determine the possible relationships between blood biomarkers and BC scores. According to the BCQ analysis, 49% of participants, those without any BC, were classified as healthy subjects. Among them, 130 samples were selected for further analysis and divided into two groups. One group comprised healthy subjects without any BC (68%), while subjects of the other group, named the sub-healthy group, had three BCs (32%). Six biomarkers, CRE, TSH, HB, MONO, RBC, and LH, were found to have the greatest impact on BCQ outcomes in Yang-Xu subjects. This study indicated significant biochemical differences in Yang-Xu subjects, which may provide a connection between blood variables and the Yang-Xu BC.


IILLS: predicting virus-receptor interactions based on similarity and semi-supervised learning

BMC Bioinformatics ◽

10.1186/s12859-019-3278-3 ◽

2019 ◽

Vol 20 (S23) ◽

Cited By ~ 3

Author(s):

Cheng Yan ◽

Guihua Duan ◽

Fang-Xiang Wu ◽

Jianxin Wang

Keyword(s):

Infectious Diseases ◽

Cross Validation ◽

Sequence Similarity ◽

Least Square ◽

Computational Method ◽

Receptor Interaction ◽

Virus Receptor ◽

Receptor Interactions ◽

Leave One Out ◽

Fold Cross Validation

Background: Viral infectious diseases are a serious threat to human health. Receptor binding is the first step in the viral infection of hosts. To treat human viral infectious diseases more effectively, hidden virus-receptor interactions must be discovered. However, current computational methods for predicting virus-receptor interactions are limited. Results: In this study, we propose a new computational method (IILLS) to predict virus-receptor interactions based on an Initial Interaction scores method via neighbors and the Laplacian regularized Least Square algorithm. IILLS integrates the known virus-receptor interactions and the amino acid sequences of receptors. The similarity of viruses is calculated by the Gaussian Interaction Profile (GIP) kernel. We also compute the receptor GIP similarity and the receptor sequence similarity. Based on the prediction results, the sequence similarity is then used as the final similarity of receptors. 10-fold cross validation (10CV) and leave-one-out cross validation (LOOCV) are used to assess the prediction performance of our method. We also compare our method with three competing methods (BRWH, LapRLS, CMF). Conclusion: The experimental results show that IILLS achieves AUC values of 0.8675 and 0.9061 with 10-fold cross validation and leave-one-out cross validation (LOOCV), respectively, which illustrates that IILLS is superior to the competing methods. In addition, the case studies further indicate that the IILLS method is effective for virus-receptor interaction prediction.
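The Gaussian Interaction Profile (GIP) kernel mentioned in this abstract can be sketched as follows. The abstract gives no formula, so this example assumes the commonly used form K(i, j) = exp(-gamma * ||IP_i - IP_j||^2), with gamma equal to gamma_prime divided by the mean squared norm of the interaction profiles and gamma_prime = 1. The function name and the toy profiles are hypothetical, and the code is not taken from IILLS.

```python
import math


def gip_similarity(interaction_profiles, gamma_prime=1.0):
    """Pairwise GIP kernel similarity between binary interaction profiles.

    interaction_profiles[i] is the 0/1 vector of known interactions for
    entity i (for example, one virus against all receptors).
    Assumption: the bandwidth is normalized by the mean squared profile norm.
    """
    n = len(interaction_profiles)
    mean_sq_norm = sum(
        sum(x * x for x in row) for row in interaction_profiles
    ) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm

    similarity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum(
                (a - b) ** 2
                for a, b in zip(interaction_profiles[i], interaction_profiles[j])
            )
            similarity[i][j] = math.exp(-gamma * sq_dist)
    return similarity
```

Entities with identical interaction profiles get similarity 1, and the similarity decays toward 0 as the profiles diverge.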


RESAMPLING METHODS IN SOFTWARE QUALITY CLASSIFICATION

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194012400037 ◽

2012 ◽

Vol 22 (02) ◽

pp. 203-223 ◽

Cited By ~ 7

Author(s):

WASIF AFZAL ◽

RICHARD TORKAR ◽

ROBERT FELDT

Keyword(s):

Software Engineering ◽

Software Quality ◽

Cross Validation ◽

Predictor Variables ◽

Primary Study ◽

Data Sets ◽

Resampling Methods ◽

Quality Classification ◽

Leave One Out ◽

Fold Cross Validation

Given the number of algorithms available for classification and prediction in software engineering, there is a need for a systematic way of assessing their performance. Performance assessment is typically done by some form of partitioning or resampling of the original data to alleviate biased estimation. For predictive and classification studies in software engineering, there is a lack of definitive advice on the most appropriate resampling method to use. This is seen as one of the contributing factors for not being able to draw general conclusions on which modeling technique or set of predictor variables is the most appropriate. Furthermore, the use of a variety of resampling methods makes it impossible to perform any formal meta-analysis of the primary study results. Therefore, it is desirable to examine the influence of various resampling methods and to quantify possible differences. Objective and method: This study empirically compares five common resampling methods (hold-out validation, repeated random sub-sampling, 10-fold cross-validation, leave-one-out cross-validation and non-parametric bootstrapping) using 8 publicly available data sets with genetic programming (GP) and multiple linear regression (MLR) as software quality classification approaches. The location of (PF, PD) pairs in the ROC (receiver operating characteristics) space and the area under an ROC curve (AUC) are used as accuracy indicators. Results: The results show that, in terms of the location of (PF, PD) pairs in the ROC space, bootstrapping results are in the preferred region for 3 of the 8 data sets for GP and for 4 of the 8 data sets for MLR. Based on the AUC measure, there are no significant differences between the different resampling methods using GP and MLR. Conclusion: Certain data set properties may be responsible for the insignificant differences between the resampling methods based on AUC. These include imbalanced data sets, insignificant predictor variables and high-dimensional data sets. With the current selection of data sets and classification techniques, bootstrapping is a preferred method based on the location of (PF, PD) pair data in the ROC space. Hold-out validation is not a good choice for comparatively smaller data sets, where leave-one-out cross-validation (LOOCV) performs better. For comparatively larger data sets, 10-fold cross-validation performs better than LOOCV.


Penerapan Algoritma C4.5 Untuk Prediksi Churn Rate Pengguna Jasa Telekomunikasi [Application of the C4.5 Algorithm to Predict the Churn Rate of Telecommunication Service Users]

Jurnal Komputasi ◽

10.23960/komputasi.v8i2.2647 ◽

2020 ◽

Vol 8 (2) ◽

Author(s):

Yohana Tri Utami ◽

Dewi Asiah Shofiana ◽

Yunda Heningtyas

Keyword(s):

Cross Validation ◽

Primary Reason ◽

Accuracy Rate ◽

Customer Churn ◽

Performance Quality ◽

Classification Technique ◽

C4.5 Algorithm ◽

High Level ◽

Fold Cross Validation

Telecommunication companies are experiencing substantial problems with customer migration due to the large number of competing companies, dynamic circumstances, and the presence of many innovative and attractive offerings. This situation has resulted in a high level of customer migration, which reduces company revenue. Under these conditions, customer churn prediction is one well-known approach that can help increase a company's revenue and reputation. To predict the reasons behind customer migration, this study proposes a data mining classification technique that applies the C4.5 algorithm. The patterns generated by the model were evaluated using 10-fold cross-validation, resulting in a model with an accuracy of 87%, a precision of 87.5%, and a recall of 97%. Given the good performance of the model, it can be stated that the C4.5 algorithm succeeded in discovering several causes of the migration of telecommunication users, with price holding the top place as the primary reason.


Predictive modeling of religiosity, prosociality, and moralizing in 295,000 individuals from European and non-European populations

Humanities and Social Sciences Communications ◽

10.1057/s41599-020-00691-9 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Pierre O. Jacquet ◽

Farid Pazhoohi ◽

Charles Findling ◽

Hugo Mell ◽

Coralie Chevallier ◽

...

Keyword(s):

Sexual Behaviors ◽

Large Scale ◽

Cross Validation ◽

Psychological Explanation ◽

Collective Actions ◽

Sample Data ◽

Out Of Sample ◽

Prosocial Attitudes ◽

High Level ◽

Fold Cross Validation

Why do moral religions exist? An influential psychological explanation is that religious belief in supernatural punishment is a cultural group adaptation that enhances prosocial attitudes and thereby large-scale cooperation. An alternative explanation is that religiosity is an individual strategy that results from high levels of mistrust and the need for individuals to control others’ behaviors through moralizing. Existing evidence is mixed, and most works are limited by sample size and generalizability issues. The present study overcomes these limitations by applying k-fold cross-validation to multivariate modeling of data from >295,000 individuals in 108 countries of the World Values Surveys and the European Value Study. First, this methodology reveals no evidence that European and non-European religious people invest more in collective actions and are more trustful of unrelated conspecifics. Instead, individuals’ level of religiosity is found to be weakly but positively associated with social mistrust and negatively associated with the production of behaviors that benefit unrelated members of the large-scale community. Second, our models show that individual variation in religiosity is well explained by the interaction of increased levels of social mistrust and increased needs to moralize other people’s sexual behaviors. Finally, stratified k-fold cross-validation demonstrates that the structures of these association patterns are robust to sampling variability and reliable enough to generalize to out-of-sample data.


Rapid Identification of COVID-19 Severity in CT Scans through Classification of Deep Features

10.21203/rs.3.rs-30802/v1 ◽

2020 ◽

Author(s):

Zekuan Yu ◽

Xiaohu Li ◽

Haitao Sun ◽

Jian Wang ◽

Tongtong Zhao ◽

...

Keyword(s):

Cross Validation ◽

Rapid Identification ◽

Ct Scans ◽

Accurate Identification ◽

Linear Discriminant ◽

Holdout Validation ◽

Novel Coronavirus ◽

Leave One Out ◽

Fold Cross Validation

Background: To enable real-time diagnosis of the severity of patients infected with novel coronavirus 2019 (COVID-19) and to guide follow-up therapeutic treatment, we collected chest CT scans of 202 patients diagnosed with COVID-19 from three hospitals in Anhui Province, China. Methods: A total of 729 2D axial plane slices, with 246 severe cases and 483 non-severe cases, were employed in this study. Four pre-trained deep models (Inception-V3, ResNet-50, ResNet-101, DenseNet-201) with multiple classifiers (linear discriminant, linear SVM, cubic SVM, KNN and Adaboost decision tree) were applied to identify the severe and non-severe COVID-19 cases. Three validation strategies (holdout validation, 10-fold cross-validation and leave-one-out) were employed to validate the feasibility of the proposed pipelines. Results and conclusion: The experimental results demonstrate that classifying features from pre-trained deep models is a promising approach to COVID-19 screening, with DenseNet-201 combined with a cubic SVM achieving the best performance. Specifically, it achieved the highest severity classification accuracy of 95.20% and 95.34% for 10-fold cross-validation and leave-one-out, respectively. The established pipeline was able to achieve rapid and accurate identification of the severity of COVID-19. This may help physicians make more efficient and reliable decisions.


Natural Language Processing in Serious Games: A state of the art.

International Journal of Serious Games ◽

10.17083/ijsg.v2i3.87 ◽

2015 ◽

Vol 2 (3) ◽

Cited By ~ 5

Author(s):

Davide Picca ◽

Dominique Jaccard ◽

Gérald Eberlé

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Serious Games ◽

State Of The Art ◽

Serious Game ◽

The Other ◽

Other Hand ◽

The One ◽

High Level

In recent decades, Natural Language Processing (NLP) has achieved a high level of success. Interactions between NLP and Serious Games have begun, and some Serious Games already include NLP techniques. The objectives of this paper are twofold: on the one hand, to provide a simple framework for analyzing potential uses of NLP in Serious Games and, on the other hand, to apply this framework to existing Serious Games and give an overview of the use of NLP in pedagogical Serious Games. In this paper we present 11 serious games exploiting NLP techniques. We present them systematically, according to the following structure: first, we highlight possible uses of NLP techniques in Serious Games; second, we describe the type of NLP implemented in each specific Serious Game; and third, we provide a link to possible purposes of use for the different actors interacting in the Serious Game.


DeepPred-SubMito: A Novel Submitochondrial Localization Predictor Based on Multi-Channel Convolutional Neural Network and Dataset Balancing Treatment

International Journal of Molecular Sciences ◽

10.3390/ijms21165710 ◽

2020 ◽

Vol 21 (16) ◽

pp. 5710

Author(s):

Xiao Wang ◽

Yinping Jin ◽

Qiuwen Zhang

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Cross Validation ◽

Protein Localization ◽

Mitochondrial Protein ◽

Intermembrane Space ◽

The Matrix ◽

Localization Predictor ◽

High Level ◽

Fold Cross Validation

Mitochondrial proteins are physiologically active in different compartments, and their abnormal location will trigger the pathogenesis of human mitochondrial pathologies. Correctly identifying submitochondrial locations can provide information for disease pathogenesis and drug design. A mitochondrion has four submitochondrial compartments, the matrix, the outer membrane, the inner membrane, and the intermembrane space, but various existing studies ignored the intermembrane space. The majority of researchers used traditional machine learning methods for predicting mitochondrial protein localization. Those predictors required expert-level knowledge of biology to be encoded as features rather than allowing the underlying predictor to extract features through a data-driven procedure. Besides, few researchers have considered the imbalance in datasets. In this paper, we propose a novel end-to-end predictor employing deep neural networks, DeepPred-SubMito, for protein submitochondrial location prediction. First, we utilize random over-sampling to decrease the influence caused by unbalanced datasets. Next, we train a multi-channel bilayer convolutional neural network for multiple subsequences to learn high-level features. Third, the prediction result is outputted through the fully connected layer. The performance of the predictor is measured by 10-fold cross-validation and 5-fold cross-validation on the SM424-18 dataset and the SubMitoPred dataset, respectively. Experimental results show that the predictor outperforms state-of-the-art predictors. In addition, the prediction of results in the M983 dataset also confirmed its effectiveness in predicting submitochondrial locations.
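The random over-sampling step named in this abstract can be illustrated with a short Python sketch. The abstract does not say how far the minority classes are over-sampled, so this example assumes that every class is brought up to the size of the largest class by duplicating randomly chosen samples. The function name, the seed parameter, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Balance classes by duplicating randomly chosen minority samples.

    Assumption: each class is over-sampled (with replacement) until it
    matches the size of the largest class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

When combined with cross-validation, over-sampling is normally applied to the training folds only, so that duplicated samples do not leak into the held-out fold.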


QSPR study of supercooled liquid vapour pressures of PBDEs by using molecular distance-edge vector index

Journal of the Serbian Chemical Society ◽

10.2298/jsc140716087j ◽

2015 ◽

Vol 80 (4) ◽

pp. 499-508 ◽

Cited By ~ 1

Author(s):

Long Jiao ◽

Xiaofei Wang ◽

Shan Bing ◽

Zhiwei Xue ◽

Hua Li

Keyword(s):

Cross Validation ◽

Quantitative Relationship ◽

Supercooled Liquid ◽

Structure Property ◽

Ann Model ◽

Linear Network ◽

Prediction Ability ◽

Leave One Out ◽

Fold Cross Validation ◽

Edge Vector

The quantitative structure-property relationship (QSPR) for the supercooled liquid vapour pressures (PL) of PBDEs was investigated. The molecular distance-edge vector (MDEV) index was used as the structural descriptor. The quantitative relationship between the MDEV index and lgPL was modeled using multivariate linear regression (MLR) and an artificial neural network (ANN). Leave-one-out cross validation and k-fold cross validation were carried out to assess the prediction ability of the developed models. For the MLR method, the prediction root mean square relative error (RMSRE) of leave-one-out cross validation and k-fold cross validation is 9.95 and 9.05, respectively. For the ANN method, the prediction RMSRE of leave-one-out cross validation and k-fold cross validation is 8.75 and 8.31, respectively. These results demonstrate that the established models are practicable for predicting the lgPL of PBDEs. The MDEV index is quantitatively related to the lgPL of PBDEs, and MLR and L-ANN are practicable for modeling this relationship. Compared with MLR, ANN shows slightly higher prediction accuracy. Subsequently, an MLR model, whose regression equation is lgPL = 0.2868 M11 - 0.8449 M12 - 0.0605, and an ANN model, which is a two-input linear network, were developed. The two models can be used to predict the lgPL of each PBDE.
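The regression equation quoted in this abstract can be evaluated directly, as in the Python sketch below. The coefficients come from the abstract; the function names are hypothetical, and any descriptor values passed in are the caller's own. The RMSRE function assumes the usual definition, the square root of the mean squared relative error; the abstract does not state whether its reported values are percentages.

```python
import math


def predict_lg_pl(m11, m12):
    """Evaluate lgPL = 0.2868*M11 - 0.8449*M12 - 0.0605.

    m11 and m12 stand for the MDEV descriptors M11 and M12 in the equation.
    """
    return 0.2868 * m11 - 0.8449 * m12 - 0.0605


def rmsre(predicted, observed):
    """Root mean square relative error (assumed definition).

    sqrt(mean(((predicted - observed) / observed) ** 2)); observed values
    must be non-zero.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [((p - o) / o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

In a leave-one-out setting, the model would be refitted once per compound and the held-out predictions collected before computing the RMSRE.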


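Perbandingan Performa Metode Klasifikasi SVM, Neural Network, dan Naive Bayes untuk Mendeteksi Kualitas Pengajuan Kredit di Koperasi Simpan Pinjam [Performance Comparison of SVM, Neural Network, and Naive Bayes Classification Methods for Detecting Credit Application Quality in Savings and Loan Cooperatives]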

Jurnal Teknologi Informasi dan Ilmu Komputer ◽

10.25126/jtiik.2019641352 ◽

2019 ◽

Vol 6 (4) ◽

pp. 444

Author(s):

Iqbal Taufiq Ahmad Nur ◽

Nanang Yudi Setiawan ◽

Fitra Abdurrachman Bachtiar

Keyword(s):

Neural Network ◽

Execution Time ◽

Cross Validation ◽

Naive Bayes ◽

Naïve Bayes ◽

The Other ◽

Credit Quality ◽

The Neural Network ◽

Network Method ◽

Fold Cross Validation

Detecting credit quality at an early stage is an important step that koperasi simpan pinjam (savings and loan cooperatives) must take in order to minimize credit risk. In this research, we use three classification methods, namely SVM, Neural Network, and Naïve Bayes, to find the best-performing and most suitable method for credit quality detection in koperasi simpan pinjam. The pre-processed data are passed to the SVM, Neural Network, and Naïve Bayes algorithms, and the results are evaluated using 5-fold cross validation. The Neural Network method performed best: its average accuracy was 86.81%, average precision 0.8194, average recall 0.8236, and average AUC 0.9158. However, its execution time made it the slowest of the three algorithms. The average execution time of the Neural Network method was 3.058 seconds, much longer than that of the other two algorithms, which ranged only from 0 to 1 second.
