iterb-PPse: Identification of transcriptional terminators in bacterial by incorporating nucleotide properties into PseKNC

2020 ◽  
Author(s):  
Yongxian Fan ◽  
Wanru Wang ◽  
Qingqi Zhu

Abstract: A terminator is a DNA sequence that gives RNA polymerase the transcriptional termination signal. Identifying terminators correctly can improve genome annotation and, more importantly, has considerable application value in disease diagnosis and therapy. However, accurate prediction methods are lacking and urgently needed. Therefore, we propose “iterb-PPse”, a terminator prediction method that incorporates 47 nucleotide properties into PseKNC-I and PseKNC-II and uses Extreme Gradient Boosting to predict terminators in Escherichia coli and Bacillus subtilis. Combining with the preceding methods, we employed three new feature extraction methods (K-pwm, Base-content, and Nucleotidepro) to formulate raw samples. The two-step method was applied to select features. When identifying terminators based on the optimized features, we compared five single models as well as 16 ensemble models. As a result, the accuracy of our method on the benchmark dataset reached 99.88%, higher than that of the existing state-of-the-art predictor iTerm-PseKNC, in 100 runs of five-fold cross-validation. Its prediction accuracy on two independent datasets reached 94.24% and 99.45%, respectively. For the convenience of users, software with the same name was developed on the basis of “iterb-PPse”. The open software and source code of “iterb-PPse” are available at https://github.com/Sarahyouzi/iterb-PPse.
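
The abstract above describes turning raw DNA sequences into numeric features before classification. As a rough illustration only, the sketch below computes normalized k-mer frequencies, the simplest composition-style feature. It is not the authors' PseKNC, K-pwm, Base-content, or Nucleotidepro implementation; the function name `kmer_frequencies` and the example sequence are assumptions made for this sketch.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies of a DNA sequence as a dict.

    Hypothetical helper for illustration; not taken from iterb-PPse.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


# Made-up example sequence: 7 overlapping dinucleotide windows.
features = kmer_frequencies("ACGTACGT", k=2)
```

A vector of this kind (16 values for k = 2) could then be passed to any classifier; pseudo-composition methods extend it with physicochemical correlation terms, which this sketch omits.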

2019 ◽  
Vol 2019 ◽  
pp. 1-13 ◽  
Author(s):  
Yunxin Xie ◽  
Chenyang Zhu ◽  
Yue Lu ◽  
Zhengwei Zhu

Lithology identification is an indispensable part of geological research and petroleum engineering. In recent years, several mathematical approaches have been used to improve the accuracy of lithology classification. Building on our earlier work that assessed machine learning models for formation lithology classification, we optimize boosting approaches to improve the classification ability of our boosting models, using data collected from the Daniudi gas field and the Hangjinqi gas field. Three boosting models, namely, AdaBoost, Gradient Tree Boosting, and eXtreme Gradient Boosting, are evaluated with 5-fold cross-validation. Regularization is applied to Gradient Tree Boosting and eXtreme Gradient Boosting to avoid overfitting. After applying hyperparameter tuning to each boosting model to optimize its parameter set, we use stacking to combine the three optimized models to improve the classification accuracy. Results suggest that the optimized stacked boosting model performs better than the single optimized boosting models on evaluation metrics such as precision, recall, and F1 score. The confusion matrix also shows that the stacked model is better at distinguishing sandstone classes.
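
The abstract above compares models by precision, recall, and F1 score and reads class separation off a confusion matrix. The sketch below shows how those per-class metrics follow from a confusion matrix. The function name `per_class_metrics` and the 3-class matrix are illustrative assumptions, not data from the paper.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 for each class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted_c = sum(confusion[r][c] for r in range(n))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


# Hypothetical 3-class confusion matrix (made-up numbers).
example = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
scores = per_class_metrics(example)
```

Off-diagonal entries show which classes are confused with one another, which is the kind of evidence the abstract cites for the sandstone classes.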


2021 ◽  
Vol 13 (11) ◽  
pp. 2096
Author(s):  
Zhongqi Yu ◽  
Yuanhao Qu ◽  
Yunxin Wang ◽  
Jinghui Ma ◽  
Yu Cao

A visibility forecast model called the boosting-based fusion model (BFM) was established in this study. The model uses a fusion machine learning model based on multisource data, including air pollutants, meteorological observations, moderate resolution imaging spectroradiometer (MODIS) aerosol optical depth (AOD) data, and outputs from an operational regional atmospheric environmental modeling system for eastern China (RAEMS). Extreme gradient boosting (XGBoost), a light gradient boosting machine (LightGBM), and a numerical prediction method, i.e., RAEMS, were fused to establish this prediction model. Three sets of prediction models, that is, BFM, LightGBM based on multisource data (LGBM), and RAEMS, were used to conduct visibility prediction tasks. The training set covered 1 January 2015 to 31 December 2018 and used several data pre-processing methods, including synthetic minority over-sampling technique (SMOTE) data resampling, a loss function adjustment, and 10-fold cross-validation. Moreover, apart from the basic features (variables), more spatial and temporal gradient features were considered. The testing set covered 1 January to 31 December 2019 and was adopted to validate the feasibility of the BFM, LGBM, and RAEMS. Statistical indicators confirmed that the machine learning methods improved the RAEMS forecast significantly and consistently. The root mean square error and correlation coefficient of BFM for the next 24/48 h were 5.01/5.47 km and 0.80/0.77, respectively, which were much higher than those of RAEMS. The statistics and binary score analysis for different areas in Shanghai also proved the reliability and accuracy of using BFM, particularly in low-visibility forecasting. Overall, BFM is a suitable tool for predicting visibility. It provides a more accurate visibility forecast for the next 24 and 48 h in Shanghai than LGBM and RAEMS. The results of this study provide support for real-time operational visibility forecasts.
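
The abstract above reports forecast skill as a root mean square error (RMSE) and a correlation coefficient. The sketch below computes both, taking the correlation to be Pearson's r, which is an assumption since the abstract does not name the variant. The observation and forecast values are invented for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up visibility values in km, not data from the study.
observed_km = [10.0, 5.0, 2.0, 8.0]
forecast_km = [9.0, 6.0, 3.0, 7.0]
error_km = rmse(observed_km, forecast_km)
correlation = pearson_r(observed_km, forecast_km)
```

When reading such results, a lower RMSE indicates a better forecast, whereas a higher correlation coefficient does.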


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Sumayh S. Aljameel ◽  
Irfan Ullah Khan ◽  
Nida Aslam ◽  
Malak Aljabri ◽  
Eman S. Alsulmi

The novel coronavirus (COVID-19) outbreak produced devastating effects on the global economy and the health of entire communities. Although the COVID-19 survival rate is high, the number of severe cases that result in death is increasing daily. Timely prediction of at-risk COVID-19 patients, combined with precautionary measures, is expected to increase the survival rate of patients and reduce the fatality rate. This research provides a prediction method for the early identification of COVID-19 patients’ outcomes based on patients’ characteristics monitored at home while in quarantine. The study was performed using 287 COVID-19 patient samples from the King Fahad University Hospital, Saudi Arabia. The data were analyzed using three classification algorithms, namely, logistic regression (LR), random forest (RF), and extreme gradient boosting (XGB). Initially, the data were preprocessed using several preprocessing techniques. Furthermore, 10-fold cross-validation was applied for data partitioning and SMOTE for alleviating the data imbalance. Experiments were performed using twenty clinical features identified as significant for predicting surviving versus deceased COVID-19 patients. The results showed that RF outperformed the other classifiers with an accuracy of 0.95 and an area under the curve (AUC) of 0.99. The proposed model can assist decision-making and health care professionals through effective early identification of at-risk COVID-19 patients.
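
The abstract above uses SMOTE to alleviate class imbalance. The sketch below illustrates the core idea: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the study's pipeline; `smote_sketch`, its parameters, and the sample points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation.

    minority: list of feature vectors (needs at least two points).
    Simplified sketch of the SMOTE idea, not a reference implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining minority points by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up two-feature minority samples.
minority_points = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_points = smote_sketch(minority_points, n_new=5)
```

When resampling is combined with cross-validation, it is usually applied to the training folds only, so that synthetic points do not leak into the evaluation data.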


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Dan Zhang ◽  
Hua-Dong Chen ◽  
Hasan Zulfiqar ◽  
Shi-Shi Yuan ◽  
Qin-Lai Huang ◽  
...  

Bioluminescent proteins (BLPs) are a class of proteins that are widely distributed in many living organisms, with various mechanisms of light emission including bioluminescence and chemiluminescence from luminous organisms. Bioluminescence has been commonly used in various analytical research methods for cellular processes, such as gene expression analysis, drug discovery, cellular imaging, and toxicity determination. However, the identification of bioluminescent proteins is challenging because they share poor sequence similarity. In this paper, we briefly review the development of the computational identification of BLPs and then propose a novel prediction framework for identifying BLPs based on the eXtreme gradient boosting algorithm (XGBoost) and sequence-derived features. To train the models, we collected BLP data from bacteria, eukaryotes, and archaea. Then, to obtain more effective prediction models, we examined the performance of different feature extraction methods and their combinations, as well as of different classification algorithms. Finally, based on the optimal model, a novel predictor named iBLP was constructed to identify BLPs. The robustness of iBLP was demonstrated by experiments on training and independent datasets. Comparison with other published methods further demonstrated that the proposed method is powerful and could provide good performance for BLP identification. The webserver and software package for BLP identification are freely available at http://lin-group.cn/server/iBLP.


Author(s):  
William Stive Fajardo-Moreno ◽  
Rubén Dario Acosta Velásquez ◽  
Ivan Dario Castaño Pérez ◽  
Leonardo Espinosa-Leal

In this chapter, the results of modeling companies' disappearance from Bogota's market using machine learning methods are presented. The authors use the available information from Bogota's Chamber of Commerce, where companies are registered yearly. The dataset covers the years 2017 to 2020, with almost 3 million registries. In this work, a deep analysis of the different features of the data is presented and explained. Next, four state-of-the-art machine learning models are trained for comparison: logistic regression (LR), extreme learning machine (ELM), random forest (RF), and extreme gradient boosting (XGBoost), all with five-fold cross-validation and 50 steps in the randomized grid search. All methods showed excellent performance, with an average area under the curve (AUC) of 0.895, and the latter algorithm (XGBoost) was the best overall (0.97). These results agree with the state-of-the-art values in the field and will be of paramount importance for assessing companies' stability in Bogota's local economy.
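
The abstract above ranks the models by area under the ROC curve (AUC). The sketch below computes AUC directly from its pairwise interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as one half. The function name `roc_auc` and the labels and scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class, 0 for the negative class.
    Runs in O(P * N) time, which is fine for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores.
example_labels = [1, 0, 1, 0, 1]
example_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(example_labels, example_scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the 0.895 average and 0.97 best value reported above.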


2021 ◽  
Vol 1995 (1) ◽  
pp. 012017
Author(s):  
Yongchang Lao ◽  
Fangzhong Qi ◽  
Jiakai Zhou ◽  
Xiaobao Fang

PLoS ONE ◽  
2021 ◽  
Vol 16 (12) ◽  
pp. e0260612
Author(s):  
Jong-Hwan Jang ◽  
Tae Young Kim ◽  
Hong-Seok Lim ◽  
Dukyong Yoon

Most existing electrocardiogram (ECG) feature extraction methods rely on rule-based approaches. It is difficult to manually define all ECG features. We propose an unsupervised feature learning method using a convolutional variational autoencoder (CVAE) that can extract ECG features with unlabeled data. We used 596,000 ECG samples from 1,278 patients archived in biosignal databases from intensive care units to train the CVAE. Three external datasets were used for feature validation using two approaches. First, we explored the features without an additional training process. Clustering, latent space exploration, and anomaly detection were conducted. We confirmed that CVAE features reflected the various types of ECG rhythms. Second, we applied CVAE features to new tasks as input data and CVAE weights to weight initialization for different models for transfer learning for the classification of 12 types of arrhythmias. The f1-score for arrhythmia classification with extreme gradient boosting was 0.86 using CVAE features only. The f1-score of the model in which weights were initialized with the CVAE encoder was 5% better than that obtained with random initialization. Unsupervised feature learning with CVAE can extract the characteristics of various types of ECGs and can be an alternative to the feature extraction method for ECGs.


2020 ◽  
Author(s):  
Ibrahim Karabayir ◽  
Suguna Pappu ◽  
Samuel Goldman ◽  
Oguz Akbilgic

Abstract. Background: Parkinson’s Disease (PD) is a clinically diagnosed neurodegenerative disorder that affects both motor and non-motor neural circuits. Speech deterioration (hypokinetic dysarthria) is a common symptom, which often presents early in the disease course. Machine learning can help movement disorders specialists improve their diagnostic accuracy using non-invasive and inexpensive voice recordings. Method: We used the “Parkinson Dataset with Replicated Acoustic Features Data Set” from the UCI Machine Learning repository. The dataset included 45 features, comprising sex and 44 speech-test-based acoustic features, from 40 patients with Parkinson’s disease and 40 controls. We analyzed the data using various machine learning algorithms, including tree-based ensemble approaches such as random forest and extreme gradient boosting. We also implemented a variable importance analysis to identify the variables that are important for classifying patients with PD. Results: The cohort included a total of 80 subjects: 40 patients with PD (55% men) and 40 controls (67.5% men). PD patients showed at least two of the three symptoms: resting tremor, bradykinesia, or rigidity. All patients were over 50 years old, and the mean ages of PD subjects and controls were 69.6 (SD 7.8) and 66.4 (SD 8.4), respectively. Our final model provided an AUC of 0.940 with a 95% confidence interval of 0.935-0.945 in 4-fold cross-validation using only six acoustic features: Delta3 (Run 2), Delta0 (Run 3), MFCC4 (Run 2), Delta10 (Run 2/Run 3), MFCC10 (Run 2), and Jitter_Rap (Run 1/Run 2). Conclusions: Machine learning can accurately detect Parkinson’s disease using an inexpensive and non-invasive voice recording. Such technologies can be deployed on smartphones to screen large patient populations for Parkinson’s disease.
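
The abstract above evaluates its model with 4-fold cross-validation on 80 subjects. The sketch below shows how such a split partitions sample indices so that every sample is used for testing exactly once. It uses contiguous, unshuffled folds for clarity; the abstract does not say how the folds were formed, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Fold sizes differ by at most one. No shuffling or stratification
    is done here; that is a simplification for this sketch.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 80 subjects split into 4 folds: 60 for training, 20 for testing each time.
folds = list(k_fold_indices(80, 4))
```

In practice the indices would normally be shuffled, and often stratified by class, before splitting, so that each fold contains both patients and controls.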


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4383 ◽  
Author(s):  
Alqahtani ◽  
Gumaei ◽  
Mathkour ◽  
Maher Ben Ismail

An intrusion detection system is an essential security tool for protecting the services and infrastructure of wireless sensor networks from unseen and unpredictable attacks. Few machine learning works have been proposed for intrusion detection in wireless sensor networks, and these have achieved reasonable results. However, these works still need to be more accurate and efficient in handling imbalanced data in network traffic. In this paper, we propose a new model, called GXGBoost, to detect intrusion attacks based on a genetic algorithm and an extreme gradient boosting (XGBoost) classifier. The latter is a gradient boosting model designed to improve the performance of traditional models in detecting minority classes of attacks in the highly imbalanced data traffic of wireless sensor networks. A set of experiments was conducted on the wireless sensor network-detection system (WSN-DS) dataset using holdout and 10-fold cross-validation techniques. The results of the 10-fold cross-validation tests revealed that the proposed approach outperformed the state-of-the-art approaches and other ensemble learning classifiers, with high detection rates of 98.2%, 92.9%, 98.9%, and 99.5% for flooding, scheduling, grayhole, and blackhole attacks, respectively, in addition to 99.9% for normal traffic.

