scholarly journals rSeqTU – a machine-learning based R package for prediction of bacterial transcription units

2019 ◽  
Author(s):  
Sheng-Yong Niu ◽  
Binqiang Liu ◽  
Qin Ma ◽  
Wen-Chi Chou

AbstractA transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed in mostly prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, e.g., gene intergenic distance, and transcriptomic features including continuous and stable RNA-seq reads count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need to have an improved machine-learning prediction approach and a better comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and provide a more accurate prediction, we develop an R package, named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities including read quality control, read mapping, training set generation, random-forest-based feature selection, TU prediction, and TU visualization.

2020 ◽  
Vol 10 (24) ◽  
pp. 9151
Author(s):  
Yun-Chia Liang ◽  
Yona Maimury ◽  
Angela Hsiang-Ling Chen ◽  
Josue Rodolfo Cuevas Juarez

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan’s Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments, using datasets for three different regions to obtain the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found the stacking ensemble delivers consistently superior performance for R2 and RMSE, while AdaBoost provides best results for MAE.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lei Li ◽  
Desheng Wu

PurposeThe infraction of securities regulations (ISRs) of listed firms in their day-to-day operations and management has become one of common problems. This paper proposed several machine learning approaches to forecast the risk at infractions of listed corporates to solve financial problems that are not effective and precise in supervision.Design/methodology/approachThe overall proposed research framework designed for forecasting the infractions (ISRs) include data collection and cleaning, feature engineering, data split, prediction approach application and model performance evaluation. We select Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machines, Artificial Neural Network and Long Short-Term Memory Networks (LSTMs) as ISRs prediction models.FindingsThe research results show that prediction performance of proposed models with the prior infractions provides a significant improvement of the ISRs than those without prior, especially for large sample set. The results also indicate when judging whether a company has infractions, we should pay attention to novel artificial intelligence methods, previous infractions of the company, and large data sets.Originality/valueThe findings could be utilized to address the problems of identifying listed corporates' ISRs at hand to a certain degree. Overall, results elucidate the value of the prior infraction of securities regulations (ISRs). This shows the importance of including more data sources when constructing distress models and not only focus on building increasingly more complex models on the same data. This is also beneficial to the regulatory authorities.


2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Kerry E. Poppenberg ◽  
Vincent M. Tutino ◽  
Lu Li ◽  
Muhammad Waqas ◽  
Armond June ◽  
...  

Abstract Background Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC) = 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.


2020 ◽  
Vol 7 (2) ◽  
pp. 631-647
Author(s):  
Emrana Kabir Hashi ◽  
Md. Shahid Uz Zaman

Machine learning techniques are widely used in healthcare sectors to predict fatal diseases. The objective of this research was to develop and compare the performance of the traditional system with the proposed system that predicts the heart disease implementing the Logistic regression, K-nearest neighbor, Support vector machine, Decision tree, and Random Forest classification models. The proposed system helped to tune the hyperparameters using the grid search approach to the five mentioned classification algorithms. The performance of the heart disease prediction system is the major research issue. With the hyperparameter tuning model, it can be used to enhance the performance of the prediction models. The achievement of the traditional and proposed system was evaluated and compared in terms of accuracy, precision, recall, and F1 score. As the traditional system achieved accuracies between 81.97% and 90.16%., the proposed hyperparameter tuning model achieved accuracies in the range increased between 85.25% and 91.80%. These evaluations demonstrated that the proposed prediction approach is capable of achieving more accurate results compared with the traditional approach in predicting heart disease with the acquisition of feasible performance.


Life ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 866
Author(s):  
Sony Hartono Wijaya ◽  
Farit Mochamad Afendi ◽  
Irmanida Batubara ◽  
Ming Huang ◽  
Naoaki Ono ◽  
...  

Background: We performed in silico prediction of the interactions between compounds of Jamu herbs and human proteins by utilizing data-intensive science and machine learning methods. Verifying the proteins that are targeted by compounds of natural herbs will be helpful to select natural herb-based drug candidates. Methods: Initially, data related to compounds, target proteins, and interactions between them were collected from open access databases. Compounds are represented by molecular fingerprints, whereas amino acid sequences are represented by numerical protein descriptors. Then, prediction models that predict the interactions between compounds and target proteins were constructed using support vector machine and random forest. Results: A random forest model constructed based on MACCS fingerprint and amino acid composition obtained the highest accuracy. We used the best model to predict target proteins for 94 important Jamu compounds and assessed the results by supporting evidence from published literature and other sources. There are 27 compounds that can be validated by professional doctors, and those compounds belong to seven efficacy groups. Conclusion: By comparing the efficacy of predicted compounds and the relations of the targeted proteins with diseases, we found that some compounds might be considered as drug candidates.


Sensors ◽  
2021 ◽  
Vol 21 (16) ◽  
pp. 5595
Author(s):  
Tim Englert ◽  
Florian Gruber ◽  
Jan Stiedl ◽  
Simon Green ◽  
Timo Jacob ◽  
...  

To correctly assess the cleanliness of technical surfaces in a production process, corresponding online monitoring systems must provide sufficient data. A promising method for fast, large-area, and non-contact monitoring is hyperspectral imaging (HSI), which was used in this paper for the detection and quantification of organic surface contaminations. Depending on the cleaning parameter constellation, different levels of organic residues remained on the surface. Afterwards, the cleanliness was determined by the carbon content in the atom percent on the sample surfaces, characterized by XPS and AES. The HSI data and the XPS measurements were correlated, using machine learning methods, to generate a predictive model for the carbon content of the surface. The regression algorithms elastic net, random forest regression, and support vector machine regression were used. Overall, the developed method was able to quantify organic contaminations on technical surfaces. The best regression model found was a random forest model, which achieved an R2 of 0.7 and an RMSE of 7.65 At.-% C. Due to the easy-to-use measurement and the fast evaluation by machine learning, the method seems suitable for an online monitoring system. However, the results also show that further experiments are necessary to improve the quality of the prediction models.


Author(s):  
Valentina Bellini ◽  
Marina Valente ◽  
Giorgia Bertorelli ◽  
Barbara Pifferi ◽  
Michelangelo Craca ◽  
...  

Abstract Background Risk stratification plays a central role in anesthetic evaluation. The use of Big Data and machine learning (ML) offers considerable advantages for collection and evaluation of large amounts of complex health-care data. We conducted a systematic review to understand the role of ML in the development of predictive post-surgical outcome models and risk stratification. Methods Following the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) guidelines, we selected the period of the research for studies from 1 January 2015 up to 30 March 2021. A systematic search in Scopus, CINAHL, the Cochrane Library, PubMed, and MeSH databases was performed; the strings of research included different combinations of keywords: “risk prediction,” “surgery,” “machine learning,” “intensive care unit (ICU),” and “anesthesia” “perioperative.” We identified 36 eligible studies. This study evaluates the quality of reporting of prediction models using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) checklist. Results The most considered outcomes were mortality risk, systemic complications (pulmonary, cardiovascular, acute kidney injury (AKI), etc.), ICU admission, anesthesiologic risk and prolonged length of hospital stay. Not all the study completely followed the TRIPOD checklist, but the quality was overall acceptable with 75% of studies (Rev #2, comm #minor issue) showing an adherence rate to TRIPOD more than 60%. The most frequently used algorithms were gradient boosting (n = 13), random forest (n = 10), logistic regression (LR; n = 7), artificial neural networks (ANNs; n = 6), and support vector machines (SVM; n = 6). Models with best performance were random forest and gradient boosting, with AUC > 0.90. Conclusions The application of ML in medicine appears to have a great potential. From our analysis, depending on the input features considered and on the specific prediction task, ML algorithms seem effective in outcomes prediction more accurately than validated prognostic scores and traditional statistics. Thus, our review encourages the healthcare domain and artificial intelligence (AI) developers to adopt an interdisciplinary and systemic approach to evaluate the overall impact of AI on perioperative risk assessment and on further health care settings as well.


2020 ◽  
Author(s):  
Kerry E Poppenberg ◽  
Vincent M Tutino ◽  
Lu Li ◽  
Muhammad Waqas ◽  
Armond June ◽  
...  

Abstract Background: Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods. Methods: Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly-selected training cohort (n=94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n=40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction. Results: Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. The testing performance across all methods had an average area under ROC curve (AUC)=0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance. Conclusions: We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and further investigate effect of covariates.


Author(s):  
D. O. Oyewola ◽  
Emmanuel Gbenga Gbenga Dada ◽  
Omole Ezekiel Olaoluwa ◽  
K.A. Al-Mustapha

Models of stock price prediction have customarily utilized technical indicators alone to produce trading signals. In this paper, we construct trading techniques by applying machine-learning methods to technical analysis indicators and stock market returns data. The resulting prediction models can be utilized as an artificial trader used to trade on any given stock trade. Here the issue of stock trading decision prediction is enunciated as a classification problem with two class values representing the buy and sell signals. The stacking technique utilized in this paper is to assist trader with applying the proposed algorithms in their trading using random forest which was staked with different algorithms which incorporates Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM) and Neural Network (NN). The experimental results indicated that Top Layer of Random Forest (TRF) produced the best performance among all the algorithms compared. This is an indication that it is a promising strategy for forecasting Nigerian stock returns.


With the advent of technology medical science is growing very fast . Disease diagonosis using machine learning technique is quite cumbersome. But efforts are made by the innovative minds to develope an optimal and efficient prediction model for the prediction of the disease viz, bone disease . Bone disease prediction is also a broad area of research where machine learning techniques can be used. Better prediction and analysis can be helpful to cure such disease . Osteoporosis is an osteo- metabolic disease characterized by low bone mineral density (BMD) and deterioration of the micro- architecture of the bone tissues, causing an increase in bone fragility and consequently leading to an increased risk of fractures.. Machine learning algorithms play important role for predicting and analyzing such disease using available algorithms and by modifying them . There are many algorithms such as SVM (Support vector machine), Genetic Algorithm, Naive bayes classifier and other tree based classifiers which are proposed in traditional scenario for prediction of expected data from an image. The goal of this paper is to discuss about a proposed hybrid analysis and prediction approach for diagonosing osteoporosis. This paper also discuss about various prediction models viz, svm prediction model as the efficient one while processing the textual data for symptoms analysis. Here symptoms are called predictors . A comparison using the Accuracy and training time is performed. The approach shows the efficiency of proposed model over textual data as well as graphical image data analysis.


Sign in / Sign up

Export Citation Format

Share Document