ChemFLuo: a web-server for structure analysis and identification of fluorescent compounds

Author(s):  
Zi-Yi Yang ◽  
Jie Dong ◽  
Zhi-Jiang Yang ◽  
Mingzhu Yin ◽  
Hong-Li Jiang ◽  
...  

Abstract Background: Fluorescent detection methods are indispensable tools for chemical biology. However, the frequent appearance of fluorescent compounds has greatly interfered with the recognition of compounds with genuine activity. Such fluorescence interference is especially difficult to identify because it is reproducible and concentration-dependent. Therefore, a credible screening tool to detect fluorescent compounds in chemical libraries is urgently needed in the early stages of drug discovery. Results: In this study, we developed a webserver, ChemFLuo, for fluorescent compound detection, based on two large and high-quality training datasets containing 4906 blue and 8632 green fluorescent compounds. These molecules were used to construct a group of prediction models based on the combination of three machine learning algorithms and seven types of molecular representations. The best blue fluorescence prediction model achieved a balanced accuracy (BA) of 0.858 and an area under the receiver operating characteristic curve (AUC) of 0.931 on the validation set, and BA = 0.823 and AUC = 0.903 on the test set. The best green fluorescence prediction model achieved BA = 0.810 and AUC = 0.887 on the validation set, and BA = 0.771 and AUC = 0.852 on the test set. Besides the prediction models, 22 blue and 16 green representative fluorescent substructures were summarized for the screening of potential fluorescent compounds. The comparison with other fluorescence detection tools and the application to external validation sets and large molecule libraries demonstrated the reliability of the prediction models for fluorescent compound detection. Conclusion: ChemFLuo is a public webserver to filter out compounds with undesirable fluorescent properties, which will benefit the design of high-quality chemical libraries for drug discovery. It is freely available at http://admet.scbdd.com/chemfluo/index/.
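For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how balanced accuracy and AUC can be computed for a binary classifier. It is not the ChemFLuo implementation; the function names and the toy labels and scores are illustrative assumptions, and labels are assumed to be coded as 1 (fluorescent) and 0 (non-fluorescent).

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is O(n_pos * n_neg), which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the study).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
hard_predictions = [1, 1, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(labels, hard_predictions)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Balanced accuracy is useful when the two classes differ in size, since a classifier that always predicts the majority class scores only 0.5.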

Author(s):  
Ade Nurhopipah ◽  
Uswatun Hasanah

The performance of classification models in machine learning is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV) and Moralis Lima Martin Validation (MLMV). The comparison is carried out on face classification of CCTV images using a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM), and is applied to two image datasets. The results are reviewed using model accuracy on the training set, validation set and test set, as well as the bias and variance of the model. The experiment shows that the k-FCV technique has more stable performance and provides high accuracy on the training set as well as good generalization on the validation set and test set. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques since it yields lower accuracy. This technique also shows higher bias and variance values and produces overfitted models, especially when evaluated on the validation set.
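Three of the splitting techniques named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the seed handling and the out-of-bag convention for bootstrap validation are assumptions, and MLMV is omitted because the abstract does not describe its procedure.

```python
import random


def random_subsample_indices(n_samples, val_fraction=0.2, seed=0):
    """Random sub-sampling: one shuffled hold-out split (repeat with new seeds)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


def k_fold_indices(n_samples, k, seed=0):
    """k-fold cross validation: every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def bootstrap_indices(n_samples, seed=0):
    """Bootstrap: sample the training set with replacement; validate on the rest."""
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return train, out_of_bag


# Usage: iterate over the folds and fit/evaluate a model on each pair.
for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    pass  # fit on train_idx, score on val_idx
```

Because k-fold validation scores every sample exactly once, its averaged estimate tends to vary less than a single random hold-out, which is consistent with the stability reported in the abstract.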


Cancers ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 913
Author(s):  
Johannes Fahrmann ◽  
Ehsan Irajizad ◽  
Makoto Kobayashi ◽  
Jody Vykoukal ◽  
Jennifer Dennison ◽  
...  

MYC is an oncogenic driver in the pathogenesis of ovarian cancer. We previously demonstrated that MYC regulates polyamine metabolism in triple-negative breast cancer (TNBC) and that a plasma polyamine signature is associated with TNBC development and progression. We hypothesized that a similar plasma polyamine signature may associate with ovarian cancer (OvCa) development. Using mass spectrometry, four polyamines were quantified in plasma from 116 OvCa cases and 143 controls (71 healthy controls + 72 subjects with benign pelvic masses) (Test Set). Findings were validated in an independent plasma set from 61 early-stage OvCa cases and 71 healthy controls (Validation Set). Complementarity of polyamines with CA125 was also evaluated. The receiver operating characteristic area under the curve (AUC) of individual polyamines for distinguishing cases from healthy controls ranged from 0.74 to 0.88. A polyamine signature consisting of diacetylspermine + N-(3-acetamidopropyl)pyrrolidin-2-one in combination with CA125, developed in the Test Set, yielded an improvement in sensitivity at >99% specificity relative to CA125 alone (73.7% vs 62.2%; McNemar exact test, 2-sided P = 0.019) in the Validation Set and captured 30.4% of cases that were missed with CA125 alone. Our findings reveal a MYC-driven plasma polyamine signature associated with OvCa that complemented CA125 in detecting early-stage ovarian cancer.
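The two statistics quoted above, sensitivity at a fixed specificity and the exact McNemar test, can be sketched as follows. This is an illustration only: the function names, the threshold convention (a score strictly above the cutoff is called positive, and the cutoff is the lowest control score that reaches the target specificity) and the toy numbers are assumptions, not the study's analysis code or data.

```python
from math import comb


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.99):
    """Sensitivity at the lowest cutoff whose specificity reaches the target."""
    for cutoff in sorted(set(control_scores)):
        negatives = sum(1 for s in control_scores if s <= cutoff)
        if negatives / len(control_scores) >= specificity:
            break  # the largest control score always yields specificity 1.0
    return sum(1 for s in case_scores if s > cutoff) / len(case_scores)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: pairs detected by marker A only; c: pairs detected by marker B only.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy numbers (hypothetical): 7 cases caught only by the combined panel,
# none caught only by the single marker.
p_value = mcnemar_exact(7, 0)
sens = sensitivity_at_specificity([0.35, 0.5, 0.6], [0.1, 0.2, 0.3, 0.4], 0.75)
```

McNemar's test is the natural choice here because both markers are measured on the same subjects, so only the discordant pairs carry information about which marker detects more cases.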


2021 ◽  
Vol 22 (12) ◽  
pp. 6598
Author(s):  
Cheng Wang ◽  
Jun Zhang ◽  
Peng Chen ◽  
Bing Wang

Background: The prediction of drug–target interactions (DTIs) is of great significance in drug development. Traditional experimental methods are time-consuming and expensive. Machine learning can reduce the cost of prediction, but it is limited by imbalanced datasets and by the problem of selecting essential features. Methods: A prediction method based on the Ensemble model of Multiple Feature Pairs (Ensemble-MFP) is introduced. Firstly, three negative sets are generated according to the Euclidean distance of three feature pairs. Then, the negative samples of the validation set/test set are randomly selected from the union of the three negative sets in the validation set/test set. At the same time, the weighted ensemble model is optimized and applied to the test set. Results: The area under the receiver operating characteristic curve (area under ROC, AUC) in three out of four sub-datasets in the gold standard datasets was more than 94.0% in the prediction of new drugs. The effectiveness of the proposed method is also shown by comparison with state-of-the-art methods and by demonstration of predicted drug–target pairs. Conclusion: Ensemble-MFP can weigh the existing feature pairs and has a good prediction effect for general prediction on new drugs.


Processes ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 1241
Author(s):  
Véronique Gomes ◽  
Marco S. Reis ◽  
Francisco Rovira-Más ◽  
Ana Mendes-Ferreira ◽  
Pedro Melo-Pinto

The high quality of Port wine is the result of a sequence of winemaking operations, such as harvesting, maceration, fermentation, extraction and aging. These stages require proper monitoring and control, in order to consistently achieve the desired wine properties. The present work focuses on the harvesting stage, where the sugar content of grapes plays a key role as one of the critical maturity parameters. Our approach makes use of hyperspectral imaging technology to rapidly extract information from wine grape berries; the collected spectra are fed to machine learning algorithms that produce estimates of the sugar level. A consistent predictive capability is important for establishing the harvest date, as well as to select the best grapes to produce specific high-quality wines. We compared four different machine learning methods (including deep learning), assessing their generalization capacity for different vintages and varieties not included in the training process. Ridge regression, partial least squares, neural networks and convolutional neural networks were the methods considered to conduct this comparison. The results show that the estimated models can successfully predict the sugar content from hyperspectral data, with the convolutional neural network outperforming the other methods.


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Joelle Ngo Hanna ◽  
Boris D. Bekono ◽  
Luc C. O. Owono ◽  
Flavien A. A. Toze ◽  
James A. Mbah ◽  
...  

Abstract In the quest to know why natural products (NPs) have often been considered as privileged scaffolds for drug discovery purposes, many investigations into the differences between NPs and synthetic compounds have been carried out. Several attempts to answer this question have led to the investigation of the atomic composition, scaffolds and functional groups (FGs) of NPs, in comparison with those of synthetic drugs. This chapter briefly describes an atomic enumeration method for chemical libraries that has been applied to the analysis of NP libraries, followed by a description of the main differences between NPs of marine and terrestrial origin in terms of their general physicochemical properties, most common scaffolds and “drug-likeness” properties. The last parts of the work describe an analysis of scaffolds and FGs common in NP libraries, focusing on huge NP databases, e.g. those in the Dictionary of Natural Products (DNP), NPs from cyanobacteria and the largest chemical class of NPs, the terpenoids.


2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A62-A62
Author(s):  
Dattatreya Mellacheruvu ◽  
Rachel Pyke ◽  
Charles Abbott ◽  
Nick Phillips ◽  
Sejal Desai ◽  
...  

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended our work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies. Methods: In-house immunopeptidomic data was generated using stably transfected HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation using W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples utilizing the ImmunoID NeXT Platform. Results: We have generated large-scale and high-quality immunopeptidomics data by using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding pockets while our primary ‘presentation’ model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples. Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance compared to a state-of-the-art public algorithm and furthers this objective.


2021 ◽  
Author(s):  
Xinshi Huang ◽  
Xiaobing Wang ◽  
Dinglai Yu

Abstract Objective: To establish and validate a nomogram for individualized prediction of renal involvement in pSS patients. Methods: A total of 1293 patients with pSS from the First Affiliated Hospital of Wenzhou Medical University between January 2008 and January 2020 were recruited and analyzed retrospectively. The patients were randomly divided into a development set (70%, n = 910) and a validation set (30%, n = 383). Univariable and multivariate logistic regression were performed to analyze the risk factors of renal involvement in pSS. Based on the regression β coefficients derived from the multivariate logistic analysis, an individualized nomogram prediction model was developed. The discrimination and calibration of the prediction model were evaluated with the area under the receiver operating characteristic curve and a calibration plot. Results: Multivariate logistic analysis showed that hypertension, anemia, albumin, uric acid, anti-Ro52, hematuria and Chisholm-Mason grade were independent risk factors of renal involvement in pSS. The area under the receiver operating characteristic curve was 0.797 in the development set and 0.750 in the validation set, indicating that the nomogram had good discrimination capacity. The calibration plot showed strong concordance between the predicted probability and the actual probability. Conclusion: The individualized nomogram could be used by clinicians to predict the risk of renal involvement in pSS patients and to improve early screening and intervention.


BMJ Open ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. e050146
Author(s):  
Jenna M Reps ◽  
Patrick Ryan ◽  
P R Rijnbeek

Objective: The internal validation of prediction models aims to quantify the generalisability of a model. We aim to determine the impact, if any, that the choice of development and internal validation design has on the internal performance bias and model generalisability in big data (n~500 000). Design: Retrospective cohort. Setting: Primary and secondary care; three US claims databases. Participants: 1 200 769 patients pharmaceutically treated for their first occurrence of depression. Methods: We investigated the impact of the development/validation design across 21 real-world prediction questions. Model discrimination and calibration were assessed. We trained LASSO logistic regression models using US claims data and internally validated the models using eight different designs: ‘no test/validation set’, ‘test/validation set’ and cross validation with 3-fold, 5-fold or 10-fold with and without a test set. We then externally validated each model in two new US claims databases. We estimated the internal validation bias per design by empirically comparing the differences between the estimated internal performance and external performance. Results: The differences between the models’ internal estimated performances and external performances were largest for the ‘no test/validation set’ design. This indicates even with large data the ‘no test/validation set’ design causes models to overfit. The seven alternative designs included some validation process to select the hyperparameters and a fair testing process to estimate internal performance. These designs had similar internal performance estimates and performed similarly when externally validated in the two external databases. Conclusions: Even with big data, it is important to use some validation process to select the optimal hyperparameters and fairly assess internal validation using a test set or cross-validation.
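The bias estimate described in the Methods, internal performance minus external performance compared across designs, can be sketched as below. This is an assumption-laden illustration rather than the authors' analysis: the choice of AUC as the metric, the averaging over external databases and prediction questions, the function names and the toy numbers are all hypothetical.

```python
from statistics import mean


def internal_validation_bias(internal_auc, external_aucs):
    """Mean gap between the internal estimate and each external result.

    A positive value means the internal estimate was optimistic.
    """
    return mean(internal_auc - external for external in external_aucs)


def bias_by_design(results):
    """Average the bias over prediction questions for each validation design.

    results maps a design name to a list of (internal_auc, external_aucs)
    tuples, one tuple per prediction question.
    """
    return {
        design: mean(internal_validation_bias(i, e) for i, e in runs)
        for design, runs in results.items()
    }


# Toy numbers (hypothetical, not from the study).
toy_results = {
    "no test/validation set": [(0.80, [0.70, 0.74]), (0.78, [0.70, 0.72])],
    "3-fold CV with test set": [(0.73, [0.71, 0.73]), (0.72, [0.70, 0.72])],
}
biases = bias_by_design(toy_results)
```

A design whose bias is close to zero gives internal estimates that can be trusted as a proxy for performance in new databases.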


Healthcare ◽  
2021 ◽  
Vol 9 (10) ◽  
pp. 1334
Author(s):  
Hasan Symum ◽  
José Zayas-Castro

The timing of 30-day pediatric readmissions is skewed, with approximately 40% of the incidents occurring within the first week of hospital discharge. The skewed readmission time distribution, coupled with delays in health information exchange among healthcare providers, might leave limited time to devise a comprehensive intervention plan. However, pediatric readmission studies have thus far been limited to prediction models developed after hospital discharge. In this study, we proposed a novel pediatric readmission prediction model at the time of hospital admission, which can improve the high-risk patient selection process. We also compared the proposed models with the standard at-discharge readmission prediction model. Using the Hospital Cost and Utilization Project database, this prognostic study included pediatric hospital discharges in Florida from January 2016 through September 2017. Four machine learning algorithms (logistic regression with backward stepwise selection, decision tree, Support Vector Machines (SVM) with the polynomial kernel, and Gradient Boosting) were developed for at-admission and at-discharge models using a recursive feature elimination technique with a repeated cross-validation process. The performance of the at-admission and at-discharge models was measured by the area under the curve. The performance of the at-admission model was comparable with that of the at-discharge model for all four algorithms. SVM with the polynomial kernel outperformed all other algorithms for both at-admission and at-discharge models. Important features associated with increased readmission risk varied widely across the type of prediction model and were mostly related to patients’ demographics, social determinants, clinical factors, and hospital characteristics. The proposed at-admission readmission risk decision support model could give hospitals and providers additional time for intervention planning, particularly for interventions targeting social determinants of children’s overall health.

