Matrix-Structural Learning (MSL) of Cascaded Classifier from Enormous Training Set

STRUCTURAL CONNECTIONIST LEARNING WITH COMPLEMENTARY CODING

International Journal of Neural Systems ◽

10.1142/s0129065792000036 ◽

1992 ◽

Vol 03 (01) ◽

pp. 19-30 ◽

Cited By ~ 10

Author(s):

AKIRA NAMATAME ◽

YOSHIAKI TSUKAMOTO

Keyword(s):

Learning Algorithm ◽

Internal Representation ◽

Threshold Function ◽

Sufficient Condition ◽

Structural Learning ◽

Similarity Matrix ◽

Training Set ◽

Threshold Functions ◽

Connectionist Networks ◽

Hidden Layer

We propose a new learning algorithm, structural learning with complementary coding, for concept learning problems. We introduce a new grouping measure that forms a similarity matrix over the training set and show that this similarity matrix provides a sufficient condition for the linear separability of the set. Using this sufficient condition, one can determine a suitable composition of linearly separable threshold functions that classifies the set of labeled vectors exactly. In the case of nonlinear separability, the internal representation of the connectionist network, that is, the number of hidden units and the value space of these units, is determined before learning from the structure of the similarity matrix. A three-layer neural network is then constructed in which each linearly separable threshold function is computed by a linear-threshold unit whose weights are determined by a one-shot learning algorithm that requires a single presentation of the training set. The structural learning algorithm then sets the connection weights so as to realize the pre-determined internal representation. The pre-structured internal representation, the activation value spaces at the hidden layer, defines intermediate concepts. The target concept is then learned as a combination of those intermediate concepts. The ability to create the pre-structured internal representation from the grouping measure distinguishes structural learning from earlier methods such as backpropagation.
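As a rough illustration of the architecture the abstract describes, the sketch below builds a three-layer network of linear-threshold units for XOR, a target concept that is not linearly separable. The weights are hand-picked for illustration and are not produced by the paper's one-shot or structural learning algorithms; the function names are hypothetical.

```python
def threshold_unit(weights, bias, inputs):
    """Linear-threshold unit: output 1 when the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor_network(x1, x2):
    # Hidden layer: two linearly separable intermediate concepts.
    # Weights are hand-picked assumptions, not learned values.
    h_or = threshold_unit([1, 1], -0.5, [x1, x2])    # fires for x1 OR x2
    h_and = threshold_unit([1, 1], -1.5, [x1, x2])   # fires for x1 AND x2
    # Output layer: the target concept as a combination of the
    # intermediate concepts (OR but not AND).
    return threshold_unit([1, -1], -0.5, [h_or, h_and])


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

Each hidden unit computes a linearly separable function, and only their combination at the output realizes the nonlinear target.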

LEARNING GRAPHS FROM EXAMPLES: AN APPLICATION TO THE PREDICTION OF THE TOXICITY OF CHEMICAL COMPOUNDS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001406005022 ◽

2006 ◽

Vol 20 (06) ◽

pp. 883-896 ◽

Cited By ~ 1

Author(s):

PASQUALE FOGGIA ◽

ALESSANDRO LIMONGIELLO ◽

FRANCESCO TUFANO ◽

MARIO VENTO

Keyword(s):

Pattern Recognition ◽

Chemical Compounds ◽

Structural Learning ◽

Classification Models ◽

Predictive Toxicology ◽

Training Set ◽

Structural Pattern ◽

Discrimination Ability ◽

Structural Pattern Recognition

A common problem in structural pattern recognition is the difficulty of constructing classification models or rules from a set of examples, owing to the complexity of the structures needed to represent the patterns. In this paper, we present an extension of a method for structural learning. The goal of the method is to find descriptions that are general (that is, that can successfully recognize objects different from those in the training set) while preserving their discrimination ability. The method has been applied to predictive toxicology evaluation, that is, the inference of the carcinogenic characteristics of chemical compounds.

Mental Health Interpreter training set

PsycEXTRA Dataset ◽

10.1037/e402172008-012 ◽

2004 ◽

Keyword(s):

Mental Health ◽

Training Set ◽

Interpreter Training

Training Set Size and Response Location Effects on Same/Different Judgments in Humans

PsycEXTRA Dataset ◽

10.1037/e520602012-170 ◽

2011 ◽

Author(s):

Jeffrey S. Katz ◽

John F. Magnotti ◽

Anthony A. Wright

Keyword(s):

Training Set ◽

Response Location ◽

Set Size

Pathology image-based lung cancer subtyping using deep learning features and cell-density maps

Electronic Imaging ◽

10.2352/issn.2470-1173.2020.10.ipas-064 ◽

2020 ◽

Vol 2020 (10) ◽

pp. 64-1-64-5

Author(s):

Mustafa I. Jaber ◽

Christopher W. Szeto ◽

Bing Song ◽

Liudmila Beziaeva ◽

Stephen C. Benz ◽

...

Keyword(s):

Lung Cancer ◽

Cell Density ◽

Majority Voting ◽

Training Set ◽

Density Maps ◽

Color Deconvolution ◽

Map Generation ◽

Density Map ◽

Pathology Image ◽

Whole Slide Images

In this paper, we propose a patch-based system to classify non-small cell lung cancer (NSCLC) diagnostic whole slide images (WSIs) into two major histopathological subtypes: adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC). Classifying patients accurately is important for prognosis and therapy decisions. The proposed system was trained and tested on 876 subtyped NSCLC gigapixel-resolution diagnostic WSIs from 805 patients – 664 in the training set and 141 in the test set. The algorithm has modules for: 1) auto-generated tumor/non-tumor masking using a trained residual neural network (ResNet34), 2) cell-density map generation (based on color deconvolution, local drain segmentation, and watershed transformation), 3) patch-level feature extraction using a pre-trained ResNet34, 4) a tower of linear SVMs for different cell ranges, and 5) a majority voting module for aggregating subtype predictions in unseen testing WSIs. The proposed system was trained and tested on several WSI magnifications ranging from x4 to x40 with a best ROC AUC of 0.95 and an accuracy of 0.86 in test samples. This fully-automated histopathology subtyping method outperforms similar published state-of-the-art methods for diagnostic WSIs.
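The final module in the abstract aggregates patch-level subtype predictions into one slide-level label by majority voting. A minimal sketch of that step is shown here, assuming each patch prediction is simply a label string; the function name and input format are hypothetical, and the paper's handling of ties or confidence scores is not stated in the abstract.

```python
from collections import Counter


def majority_vote(patch_predictions):
    """Return the most frequent label among patch-level predictions.

    Ties are resolved in favor of the label seen first, which is an
    arbitrary choice made for this sketch.
    """
    if not patch_predictions:
        raise ValueError("at least one patch prediction is required")
    counts = Counter(patch_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical patch predictions for one whole slide image.
slide_label = majority_vote(["LUAD", "LUSC", "LUAD", "LUAD", "LUSC"])
```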

Iterative Supervised Principal Component Analysis-Driven Ligand Design for Regioselective Ti-Catalyzed Pyrrole Synthesis

10.26434/chemrxiv.12284378 ◽

2020 ◽

Author(s):

Xin Yi See ◽

Benjamin Reiner ◽

Xuelan Wen ◽

T. Alexander Wheeler ◽

Channing Klein ◽

...

Keyword(s):

Principal Component Analysis ◽

De Novo ◽

Principal Component ◽

Component Analysis ◽

Catalyst Design ◽

Data Driven ◽

Initial Reaction ◽

Training Set ◽

Reaction Conditions ◽

Component Loadings

Herein, we describe the use of iterative supervised principal component analysis (ISPCA) in de novo catalyst design. The regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) via Ti-catalyzed formal [2+2+1] cycloaddition of phenyl propyne and azobenzene was targeted as a proof of principle. The initial reaction conditions led to an unselective mixture of all possible pyrrole regioisomers. ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this PCA space along with k-means clustering were used to inform the design of new test catalysts. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and only optimal candidates were synthesized and tested experimentally. This data-driven predictive-modeling workflow was iterated, and after only three generations the catalytic selectivity was improved from 0.5 (statistical mixture of products) to over 11 (> 90% C) by incorporating 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine as a ligand. The successful development of a highly selective catalyst without resorting to long, stochastic screening processes demonstrates the inherent power of ISPCA in de novo catalyst design and should motivate the general use of ISPCA in reaction development.
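The two selectivity figures can be related to product fractions with a short calculation. The sketch below assumes that selectivity is the ratio of C to all other regioisomers combined; the abstract does not define it, so this is an interpretation, and the function name is hypothetical.

```python
def fraction_from_ratio(ratio):
    """Convert a ratio of C to all other products into the fraction of C.

    Assumes selectivity = [C] / [all other regioisomers], which is an
    interpretation and not a definition taken from the abstract.
    """
    return ratio / (1.0 + ratio)


initial_fraction = fraction_from_ratio(0.5)    # one third of the product is C
improved_fraction = fraction_from_ratio(11.0)  # about 0.917
```

Under this reading, a selectivity of 11 corresponds to roughly 92% C, which is consistent with the "> 90% C" stated above.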

SAMPL6 Challenge Results from pKa Predictions Based on a General Gaussian Process Model

10.26434/chemrxiv.6406505.v2 ◽

2018 ◽

Author(s):

Caitlin C. Bannan ◽

David Mobley ◽

A. Geoff Skillman

Keyword(s):

Gaussian Process ◽

Process Model ◽

Molecular Graph ◽

Gaussian Process Regression ◽

Ionization State ◽

Training Set ◽

Physiochemical Properties ◽

Quantile Plots ◽

Physical And Chemical ◽

Good Agreement

A variety of fields would benefit from accurate pKa predictions, especially drug design, because of the effect a change in ionization state can have on a molecule's physicochemical properties. Participants in the recent SAMPL6 blind challenge were asked to submit predictions for the microscopic and macroscopic pKa values of 24 drug-like small molecules. We recently built a general model for predicting pKa values using Gaussian process regression trained on physical and chemical features of each ionizable group. Our pipeline takes a molecular graph and uses the OpenEye Toolkits to calculate features describing the removal of a proton. These features are fed into a Scikit-learn Gaussian process to predict microscopic pKa values, which are then used to analytically determine macroscopic pKa values. Our Gaussian process is trained on a set of 2,700 macroscopic pKa values from monoprotic and select diprotic molecules. Here, we share our results for microscopic and macroscopic predictions in the SAMPL6 challenge. Overall, we ranked in the middle of the pack compared to other participants, but our fairly good agreement with experiment is still promising considering that the challenge molecules are chemically diverse and often polyprotic while our training set is predominantly monoprotic. Of particular importance to us when building this model was to include an uncertainty estimate, based on the chemistry of the molecule, that would reflect the likely accuracy of our prediction. Our model reports large uncertainties for the molecules that appear to have chemistry outside our domain of applicability, along with good agreement in quantile-quantile plots, indicating that it can predict its own accuracy. The challenge highlighted a variety of ways to improve our model, including adding more polyprotic molecules to our training set and considering more carefully which functional groups we do or do not identify as ionizable.
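The abstract does not spell out how macroscopic values are derived from microscopic ones. One textbook relation, shown here as a sketch, covers the simplest case: when a single fully protonated species can lose a proton from any one of several sites, the first macroscopic dissociation constant is the sum of the microscopic ones. The function name is hypothetical, and this is not necessarily the authors' procedure.

```python
import math


def first_macroscopic_pka(micro_pkas):
    """First macroscopic pKa from the microscopic pKa values of competing sites.

    Uses K_macro = sum(k_micro), the standard relation for the first
    dissociation of one fully protonated microstate. Later dissociation
    steps need a different relation and are not covered here.
    """
    if not micro_pkas:
        raise ValueError("at least one microscopic pKa is required")
    ka_total = sum(10.0 ** (-pka) for pka in micro_pkas)
    return -math.log10(ka_total)


# Two equivalent sites, each with a microscopic pKa of 4.0 (made-up values).
macro_pka = first_macroscopic_pka([4.0, 4.0])
```

With two equivalent sites the macroscopic pKa is lower than the microscopic value by log10(2), a purely statistical effect.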

Minimally Empirical Double Hybrid Functionals Trained Against the GMTKN55 Database: revDSD-PBEP86-D4, revDOD-PBE-D4, and DOD-SCAN-D4

10.26434/chemrxiv.7903388.v2 ◽

2019 ◽

Cited By ~ 1

Author(s):

Golokesh Santra ◽

Nitai Sylvetsky ◽

Gershom Martin

Keyword(s):

Substantial Improvement ◽

Viable Alternative ◽

Mean Absolute Deviation ◽

Dispersion Correction ◽

Training Set ◽

Weighted Mean ◽

Absolute Deviation ◽

Hybrid Functionals ◽

Scaling Algorithms

We present a family of minimally empirical double-hybrid DFT functionals parametrized against the very large and diverse GMTKN55 benchmark. The very recently proposed wB97M(2) empirical double hybrid (with 16 empirical parameters) has the lowest WTMAD2 (weighted mean absolute deviation over GMTKN55) ever reported at 2.19 kcal/mol. However, our xrevDSD-PBEP86-D4 functional reaches a statistically equivalent WTMAD2=2.22 kcal/mol, using just a handful of empirical parameters, and the xrevDOD-PBEP86-D4 functional reaches 2.25 kcal/mol with just opposite-spin MP2 correlation, making it amenable to reduced-scaling algorithms. In general, the D4 empirical dispersion correction is clearly superior to D3BJ. If one eschews dispersion corrections of any kind, noDispSD-SCAN offers a viable alternative. Parametrization over the entire GMTKN55 dataset yields substantial improvement over the small training set previously employed in the DSD papers.
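For readers unfamiliar with WTMAD2, the sketch below shows a weighted mean absolute deviation of that kind: each subset's mean absolute deviation is scaled by the subset size and by the inverse of its mean absolute reference energy. The default constant of 56.84 kcal/mol follows the GMTKN55 definition as understood here and is left as a parameter; the subset values in the example are invented for illustration and are not taken from the paper.

```python
def wtmad2(subsets, mean_abs_energy_all=56.84):
    """Weighted total mean absolute deviation over benchmark subsets.

    `subsets` is a list of (n_systems, mean_abs_reference_energy, mad)
    tuples, with energies in kcal/mol. `mean_abs_energy_all` is the
    average of the subsets' mean absolute reference energies; the default
    is an assumption based on the GMTKN55 definition.
    """
    total_systems = sum(n for n, _, _ in subsets)
    weighted_sum = sum(
        n * (mean_abs_energy_all / mean_abs_energy) * mad
        for n, mean_abs_energy, mad in subsets
    )
    return weighted_sum / total_systems


# Hypothetical two-subset benchmark, for illustration only.
example = wtmad2([(10, 56.84, 1.0), (30, 28.42, 0.5)])
```

The scaling means that errors on subsets with small reference energies, such as noncovalent interactions, count for more than the same absolute error on subsets with large reference energies.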

Minimally Empirical Double Hybrid Functionals Trained Against the GMTKN55 Database: revDSD-PBEP86-D4, revDOD-PBE-D4, and DOD-SCAN-D4

10.26434/chemrxiv.7903388.v1 ◽

2019 ◽

Author(s):

Golokesh Santra ◽

Nitai Sylvetsky ◽

Gershom Martin

Keyword(s):

Substantial Improvement ◽

Viable Alternative ◽

Mean Absolute Deviation ◽

Dispersion Correction ◽

Training Set ◽

Weighted Mean ◽

Absolute Deviation ◽

Hybrid Functionals ◽

Scaling Algorithms

We present a family of minimally empirical double-hybrid DFT functionals parametrized against the very large and diverse GMTKN55 benchmark. The very recently proposed wB97M(2) empirical double hybrid (with 16 empirical parameters) has the lowest WTMAD2 (weighted mean absolute deviation over GMTKN55) ever reported at 2.19 kcal/mol. However, our xrevDSD-PBEP86-D4 functional reaches a statistically equivalent WTMAD2=2.22 kcal/mol, using just a handful of empirical parameters, and the xrevDOD-PBEP86-D4 functional reaches 2.25 kcal/mol with just opposite-spin MP2 correlation, making it amenable to reduced-scaling algorithms. In general, the D4 empirical dispersion correction is clearly superior to D3BJ. If one eschews dispersion corrections of any kind, noDispSD-SCAN offers a viable alternative. Parametrization over the entire GMTKN55 dataset yields substantial improvement over the small training set previously employed in the DSD papers.

PREDICTION OF CILIWUNG RIVER WATER QUALITY USING THE DECISION TREE ALGORITHM

Jurnal Air Indonesia ◽

10.29122/jai.v12i2.4364 ◽

2021 ◽

Vol 12 (2) ◽

Author(s):

Mohammad Haekal ◽

Henki Bayu Seta ◽

Mayanda Mega Santoni

Keyword(s):

Data Mining ◽

Decision Tree ◽

Cross Validation ◽

Online Monitoring ◽

Training Set ◽

Microsoft Excel ◽

Test Set

To predict the water quality of the Ciliwung River, data from online monitoring were processed using a data mining method. In this method, the monitoring data were first tabulated in Microsoft Excel and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The decision tree method was chosen because it is simpler, easy to understand, and has a very high level of accuracy. A total of 5,476 Ciliwung River water quality monitoring records were processed. Classification with the decision tree showed that, of these 5,476 records, 1,059 (19.3242%) indicated that the Ciliwung River was Not Polluted and 4,417 (80.6758%) indicated that it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed a very high level of accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001. The results also show that using the WEKA application with the Decision Tree algorithm to process the monitoring data with three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.
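Two of the evaluation options named above, a 66% percentage split and 10-fold cross-validation, differ only in how records are divided into training and test sets. The sketch below shows that index bookkeeping for 5,476 records. It is a simplified illustration with hypothetical function names and does not reproduce WEKA's own shuffling or stratification.

```python
import random


def percentage_split(n_records, train_fraction=0.66, seed=0):
    """Shuffle record indices and split them into training and test parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = round(n_records * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists; every record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train_idx, test_idx = percentage_split(5476)
splits = list(k_fold_indices(5476, k=10))
```

The Use Training Set option, by contrast, evaluates on the same records used for training, so its accuracy is usually an optimistic estimate.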
