Beyond the Hubble sequence – exploring galaxy morphology with unsupervised machine learning

Ting-Yun Cheng; Marc Huertas-Company; Christopher J Conselice; Alfonso Aragón-Salamanca; Brant E Robertson; Nesar Ramachandra

doi:10.1093/mnras/stab734

Beyond the Hubble sequence – exploring galaxy morphology with unsupervised machine learning

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stab734 ◽

2021 ◽

Vol 503 (3) ◽

pp. 4446-4465

Author(s):

Ting-Yun Cheng ◽

Marc Huertas-Company ◽

Christopher J Conselice ◽

Alfonso Aragón-Salamanca ◽

Brant E Robertson ◽

...

Keyword(s):

Machine Learning ◽

Physical Properties ◽

Binary Classification ◽

Imbalanced Data ◽

Classification Systems ◽

Data Set ◽

Unsupervised Machine Learning ◽

Hubble Sequence ◽

Visually Based ◽

Galaxy Morphology

ABSTRACT We explore unsupervised machine learning for galaxy morphology analyses using a combination of feature extraction with a vector-quantized variational autoencoder (VQ-VAE) and hierarchical clustering (HC). We propose a new methodology that includes: (1) consideration of the clustering performance simultaneously when learning features from images; (2) allowing for various distance thresholds within the HC algorithm; (3) using the galaxy orientation to determine the number of clusters. This set-up provides 27 clusters created with this unsupervised learning that we show are well separated based on galaxy shape and structure (e.g. Sérsic index, concentration, asymmetry, Gini coefficient). These resulting clusters also correlate well with physical properties such as the colour–magnitude diagram, and span the range of scaling relations such as mass versus size amongst the different machine-defined clusters. When we merge these multiple clusters into two large preliminary clusters to provide a binary classification, an accuracy of ∼87 per cent is reached using an imbalanced data set, matching real galaxy distributions, which includes 22.7 per cent early-type galaxies and 77.3 per cent late-type galaxies. Comparing the given clusters with classic Hubble types (ellipticals, lenticulars, early spirals, late spirals, and irregulars), we show that there is an intrinsic vagueness in visual classification systems, in particular galaxies with transitional features such as lenticulars and early spirals. Based on this, the main result in this work is not how well our unsupervised method matches visual classifications and physical properties, but that the method provides an independent classification that may be more physically meaningful than any visually based ones.
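
How a distance threshold controls the number of clusters in hierarchical clustering can be illustrated with a minimal sketch. This is not the authors' pipeline: the 2D points stand in for learned image features, average linkage and a single global threshold are assumptions, and the function name `agglomerative` is hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerative(points, threshold):
    """Average-linkage agglomerative clustering (illustrative only).

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`, so the threshold decides how many clusters remain.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical feature vectors: two tight groups and one outlier.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]

tight = agglomerative(features, threshold=0.5)
loose = agglomerative(features, threshold=7.0)
print(len(tight), len(loose))
```

With the small threshold the outlier stays in its own cluster; with the larger one it is absorbed into its nearest group, which is why the choice of threshold matters for the final cluster count.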

Comparison of Sampling Methods Using Machine Learning and Deep Learning Algorithms with an Imbalanced Data Set for the Prevention of Violence Against Physicians

10.1109/uyms54260.2021.9659758 ◽

2021 ◽

Author(s):

Hilal Cakir ◽

Nilgun Incereis ◽

Bekir Tevfik Akgun ◽

Av. S. Yazgulu Tastemir

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Sampling Methods ◽

Learning Algorithms ◽

Imbalanced Data ◽

Data Set ◽

Prevention Of Violence

Analysis of Cancer Data Set with Statistical and Unsupervised Machine Learning Methods

Smart Intelligent Computing and Applications - Smart Innovation, Systems and Technologies ◽

10.1007/978-981-13-1921-1_27 ◽

2018 ◽

pp. 267-276

Author(s):

T. Panduranga Vital ◽

K. Dileep Kumar ◽

H. V. Bhagya Sri ◽

M. Murali Krishna

Keyword(s):

Machine Learning ◽

Learning Methods ◽

Data Set ◽

Unsupervised Machine Learning ◽

Cancer Data ◽

Machine Learning Methods

Optical Mapping-Validated Machine Learning Improves Atrial Fibrillation Driver Detection by Multi-Electrode Mapping

Circulation Arrhythmia and Electrophysiology ◽

10.1161/circep.119.008249 ◽

2020 ◽

Vol 13 (10) ◽

Cited By ~ 1

Author(s):

Alexander M. Zolotarev ◽

Brian J. Hansen ◽

Ekaterina A. Ivanova ◽

Katelynn M. Helfrich ◽

Ning Li ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Near Infrared ◽

Fourier Spectrum ◽

Optical Mapping ◽

Binary Classification ◽

Future Application ◽

Data Set ◽

Frequency Spectra

Background: Atrial fibrillation (AF) can be maintained by localized intramural reentrant drivers. However, AF driver detection by clinical surface-only multielectrode mapping (MEM) has relied on subjective interpretation of activation maps. We hypothesized that application of machine learning to electrogram frequency spectra may accurately automate driver detection by MEM and add some objectivity to the interpretation of MEM findings. Methods: Temporally and spatially stable single AF drivers were mapped simultaneously in explanted human atria (n=11) by subsurface near-infrared optical mapping (NIOM; 0.3 mm² resolution) and 64-electrode MEM (higher density or lower density with 3 and 9 mm² resolution, respectively). Unipolar MEM and NIOM recordings were processed by Fourier transform analysis into 28 407 total Fourier spectra. Thirty-five features for machine learning were extracted from each Fourier spectrum. Results: Targeted driver ablation and NIOM activation maps efficiently defined the center and periphery of AF driver preferential tracks and provided validated annotations for driver versus nondriver electrodes in MEM arrays. Compared with analysis of single electrogram frequency features, averaging the features from each of the 8 neighboring electrodes significantly improved classification of AF driver electrograms. The classification metrics increased when a less strict annotation, including driver periphery electrodes, was added to the driver center annotation. Notably, the F1-score for binary classification of the higher-density catheter data set was significantly higher than that of the lower-density catheter (0.81±0.02 versus 0.66±0.04, P<0.05). The trained algorithm correctly highlighted 86% of driver regions with higher-density but only 80% with lower-density MEM arrays (81% for lower-density+higher-density arrays together). Conclusions: The machine learning model pretrained on Fourier spectrum features allows efficient classification of electrogram recordings as AF driver or nondriver compared with the NIOM gold standard. Future application of the NIOM-validated machine learning approach may improve the accuracy of AF driver detection for targeted ablation treatment in patients.
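
The neighbor-averaging step mentioned in the results can be sketched as follows. The abstract does not give the electrode layout or edge handling, so this assumes a rectangular grid (for example 8×8 for 64 electrodes), excludes the center electrode from its own average, and simply uses fewer neighbors at the borders; the function name `neighbor_average` is hypothetical.

```python
def neighbor_average(grid):
    """Replace each cell with the mean of its surrounding cells.

    `grid` is a list of rows holding one scalar feature per electrode.
    Interior cells have 8 neighbors; border cells have fewer.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            values = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)  # skip the electrode itself
            ]
            out[r][c] = sum(values) / len(values)
    return out


# Hypothetical 3x3 patch with one high-valued electrode in the middle.
patch = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
smoothed = neighbor_average(patch)
print(smoothed[0][0], smoothed[1][1])
```

In practice the same averaging would be applied to each of the extracted spectral features separately before classification.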

Precision-Recall versus Accuracy and the Role of Large Data Sets

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014039 ◽

2019 ◽

Vol 33 ◽

pp. 4039-4048 ◽

Cited By ~ 8

Author(s):

Brendan Juba ◽

Hai S. Le

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Large Data ◽

Constant Factor ◽

Data Sets ◽

Data Set ◽

Small Constant ◽

Classifier Performance ◽

Necessary And Sufficient

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider the measures of classifier performance in terms of precision and recall, a measure that is widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.
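
The gap between accuracy and precision/recall under class imbalance can be made concrete with a small sketch. The confusion-matrix counts below are invented for illustration and do not come from the paper; the helper name `scores` is hypothetical.

```python
def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Define precision and recall as 0 when their denominators are empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical data set: 10 positives among 1000 examples.
always_negative = scores(tp=0, fp=0, fn=10, tn=990)
weak_detector = scores(tp=5, fp=5, fn=5, tn=985)

print(always_negative)  # (0.99, 0.0, 0.0)
print(weak_detector)    # (0.99, 0.5, 0.5)
```

Both classifiers reach 99 per cent accuracy, yet the first never finds a positive example; only precision and recall expose the difference.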

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

Abstract We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
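
Multi-class information gain for a single feature can be sketched in a few lines. This is the textbook definition (entropy of the labels minus the expected entropy after splitting on the feature), not necessarily the exact variant used in the paper; the toy ratings and word indicators are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on `feature`."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


# Hypothetical three-class ratings and two binary word features.
ratings = [1, 1, 2, 2, 3, 3]
has_delicious = [0, 0, 0, 0, 1, 1]  # present only in top-rated reviews
has_the = [1, 0, 1, 0, 1, 0]        # spread evenly over all classes

ig_delicious = information_gain(has_delicious, ratings)
ig_the = information_gain(has_the, ratings)
print(round(ig_delicious, 3), round(ig_the, 3))
```

A selection step would rank all n-gram features by this score and keep the top few; the evenly spread feature scores essentially zero and would be discarded.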

A Novel Transcriptomic Classifier for AML Is Highly Associated with Drug Sensitivity

Blood ◽

10.1182/blood-2021-153395 ◽

2021 ◽

Vol 138 (Supplement 1) ◽

pp. 2372-2372

Author(s):

Habib Hamidi ◽

Christopher R Bolen ◽

Elisabeth A Lasater ◽

Diana Dunshee ◽

Elizabeth A Punnoose ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Drug Sensitivity ◽

Molecular Subtypes ◽

Ex Vivo ◽

Response To Therapy ◽

Publicly Traded ◽

Data Set ◽

Unsupervised Machine Learning ◽

Rnaseq Data

Abstract Introduction: AML is a heterogeneous disease with a wide array of common genetic aberrations. Traditional classification of AML leverages both classical cytogenetics and mutational profiling to stratify patients into four distinct risk groups (ELN). However, tumor gene expression profiles can play an important role in response to therapy, and are potentially useful for unravelling the heterogeneity of AML. In this study, we hypothesized that clinical outcomes and variable responses to therapeutic modalities in AML may be driven by patterns of gene expression, and sought to identify clinically actionable molecular subtypes using the available RNAseq data from the BEAT AML functional genomics study. Methods: An unsupervised machine learning approach based on consensus non-negative matrix factorization (cNMF) was applied to VOOM-normalized BEAT AML RNAseq data from patient samples with ≥50% blasts (N=389) to identify transcriptomic-based molecular subtypes. The subtypes were then compared to the genomic-based subtypes for their association with clinical outcome (log-rank test) and ex vivo drug sensitivity (Kruskal-Wallis test). Subtypes were also biologically characterized by gene signature scoring using well-curated pathway signatures (GSVA analysis using Hallmark pathways), cell type enrichment (xCell enrichment) and AML differentiation state (scRNAseq signature based on Van Galen et al.). Finally, a random forest classifier was defined based on samples from BEAT AML to predict the NMF subtypes in an independent data set (TCGA AML cohort). Results: Our cNMF-based analysis identified six clusters of patients based on the 5,060 (top 10%) most variable genes. These novel subtypes were strongly prognostic (Figure 1A, log-rank p=2.79e-08), and were independent of ELN genomic-based subtypes (ANOVA p=4.45e-07). Comparison to other genomic-based classifications is ongoing. The prognostic value of the transcriptomic subtypes was further validated by predicting the subtypes in an independent cohort (TCGA LAML, N=200). We observed a significant association with outcome (Figure 1B, p=0.00013), with clusters 5 and 1 showing markedly better prognosis, similar to BEAT AML. These subtypes also displayed unique biological profiles, including significant association with scRNAseq-derived AML differentiation state cell types, Hallmark pathways and cellularity signatures. Notably, clusters 1 and 3 showed a mature phenotype, while clusters 2, 4, and 5 were more progenitor-like (Table 1). Importantly, the transcriptomic subtypes were highly predictive of ex vivo drug sensitivity, with sensitivity to 70 compounds significantly associated with cNMF subtype (Kruskal-Wallis p>0.01), compared with 4 in the ELN subtypes. Of the tested molecules, single-agent venetoclax was the most strongly associated with subtype (p=1.7e-13); two subtypes were strongly resistant (median IC50 of 10 µM) and four were sensitive, with IC50s in the sub-micromolar range (Table 1). No association was seen between the ELN subtypes and venetoclax sensitivity (p=0.35). Conclusions: Unsupervised machine learning-based clustering analysis of transcriptomic data identified six novel subtypes that are similarly prognostic to the ELN genomic-based subtypes and provide a novel avenue for identifying clinically actionable subsets of AML. Disclosures Hamidi: Genentech: Current Employment, Current equity holder in publicly-traded company. Bolen: Genentech: Current Employment; F. Hoffmann-La Roche: Current equity holder in publicly-traded company. Lasater: Genentech: Current Employment, Current equity holder in publicly-traded company. Dunshee: Genentech/Roche: Current Employment, Current equity holder in publicly-traded company. Punnoose: Genentech: Current Employment, Current equity holder in publicly-traded company. Dail: Genentech/Roche: Current Employment, Current equity holder in publicly-traded company.

Automated Detection and Classification of Ki-67 Stained Nuclear Section Using Machine Learning Based on Texture of Nucleus to Measure Proliferation Score for Prognostic Evaluation of Breast Carcinoma

10.21203/rs.2.21063/v1 ◽

2020 ◽

Author(s):

Anil Kumar ◽

Manish Prateek

Keyword(s):

Machine Learning ◽

Breast Carcinoma ◽

Automatic Segmentation ◽

Blue Color ◽

Ki 67 ◽

Data Set ◽

Unsupervised Machine Learning ◽

Histopathological Images ◽

Hematoxylin And Eosin

Abstract Background: This study assessed the significance of Ki-67 labels and calculated the proliferation score by counting immunopositive and immunonegative nuclear sections with the help of machine learning, in order to predict the intensity of breast carcinoma. Methods: The BreCaHAD (Breast Cancer Histopathological Annotation and Diagnosis) dataset includes various malignant cases from different patients in routine diagnosis. It contains H&E-stained microscopic histopathological images at 40x magnification, stored in .tiff format using RGB bands. In this study, the method starts with preprocessing that focuses on resizing, smoothing and enhancement. After preprocessing, each RGB sample is decomposed into HSI values. The BreCaHAD data set is hematoxylin and eosin (H&E) stained, where the brown and blue color levels play a major role in differentiating immunopositive and immunonegative nuclear sections. Blue color in RGB and hue in HSI are the intrinsic characteristics of H&E Ki-67. The shape parameters are calculated after segmentation, which is preceded by Otsu thresholding and unsupervised machine learning. Morphological operators help to resolve overlapping nucleus sections in the sample images so that the count is correct and the accuracy of automatic segmentation increases. Result: Using nine morphological features, supported by an unsupervised machine learning technique on the BreCaHAD dataset, the label of breast carcinoma is predicted. The proposed methodology obtains performance measures of precision: 95.7%, recall: 93.8%, f-score: 94.74%, accuracy: 0.9088, specificity: 0.6803, BCR: 0.7975 and MCC: 0.5855, which is better than existing techniques. Conclusion: This study developed an efficient automated nuclear section segmentation model, implemented on the BreCaHAD dataset, which contains H&E-stained microscopic biopsy images. Potentially, this model will assist the pathologist in fast, effective, efficient and accurate computation of the Ki-67 proliferation score on breast IHC carcinoma images.
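
As a quick consistency check on the reported metrics, the f-score can be recomputed from the stated precision and recall, assuming it is the usual F1 (harmonic mean of the two); the abstract does not say which F-measure was used.

```python
# Precision and recall as reported in the abstract.
precision, recall = 0.957, 0.938

# F1 is the harmonic mean of precision and recall (assumed definition).
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 4))  # 0.9474
```

The result agrees with the reported f-score of 94.74%.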

Exploration of Obesity Status of Indonesia Basic Health Research 2013 With Synthetic Minority Over-Sampling Techniques

Indonesian Journal of Statistics and Its Applications ◽

10.29244/ijsa.v5i1p75-91 ◽

2021 ◽

Vol 5 (1) ◽

pp. 75-91

Author(s):

Sri Astuti Thamrin ◽

Dian Sidik ◽

Hedi Kuswanto ◽

Armin Lawi ◽

Ansariadi Ansariadi

Keyword(s):

Machine Learning ◽

Health Research ◽

Class Imbalance ◽

Imbalanced Data ◽

Sampling Technique ◽

Data Sets ◽

Data Set ◽

Basic Health ◽

Obesity Status ◽

Obese Class

The accuracy of the class labels is very important in classification with a machine learning approach. The more accurate the existing data sets and classes, the better the output generated by machine learning. In practice, classification often faces class imbalance, in which the classes do not make up equal portions of the data set. Such imbalance affects classification accuracy. One of the easiest ways to correct imbalanced classes is to balance them. This study aims to explore the problem of class imbalance in a medium-sized data set and to address that imbalance. The Synthetic Minority Over-Sampling Technique (SMOTE) is used to overcome the problem of class imbalance in obesity status in the Indonesia 2013 Basic Health Research (RISKESDAS). The results show that the obese class makes up 13.9% of the data and the non-obese class 84.6%. This means that the class imbalance is moderate. Moreover, SMOTE with 600% over-sampling can raise the share of the minority class (obesity). As a consequence, the classes of obesity status become balanced. Therefore, the SMOTE technique performed better than no SMOTE in exploring the obesity status of Indonesia RISKESDAS 2013.
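
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors, can be sketched as below. This is a simplified illustration, not the study's implementation: the function name `smote`, the toy points, `k=2` and the fixed seed are assumptions, and "600%" is read in the sense of the original SMOTE formulation, that is, six synthetic samples per minority sample.

```python
import random
from math import dist


def smote(minority, n_per_sample, k=2, seed=0):
    """Generate synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # k nearest minority neighbors of x (excluding x itself).
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: dist(x, p))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbors)
            u = rng.random()  # position along the segment x -> nb
            synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical minority-class points in a 2D feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote(minority, n_per_sample=6)  # "600%" over-sampling
print(len(minority) + len(synthetic))  # 28
```

Under that reading, a minority share of 13.9% grows sevenfold to about 97.3 in the same units as the 84.6 for the majority class, which is consistent with the abstract's statement that the classes end up roughly balanced.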

Detection of Epileptic Seizure Using Pretrained Deep Convolutional Neural Network and Transfer Learning

European Neurology ◽

10.1159/000512985 ◽

2020 ◽

Vol 83 (6) ◽

pp. 602-614

Author(s):

Hidir Selcuk Nogay ◽

Hojjat Adeli

Keyword(s):

Neural Network ◽

Machine Learning ◽

Convolutional Neural Network ◽

Transfer Learning ◽

Epileptic Seizure ◽

Binary Classification ◽

Classification Model ◽

Automatic Identification ◽

Eeg Signal ◽

Data Set

Introduction: The diagnosis of epilepsy takes a certain process, depending entirely on the attending physician. However, the human factor may cause erroneous diagnosis in the analysis of the EEG signal. In the past 2 decades, many advanced signal processing and machine learning methods have been developed for the detection of epileptic seizures. However, many of these methods require large data sets and complex operations. Methods: In this study, an end-to-end machine learning model is presented for detection of epileptic seizure using the pretrained deep two-dimensional convolutional neural network (CNN) and the concept of transfer learning. The EEG signal is converted directly into visual data with a spectrogram and used directly as input data. Results: The authors analyzed the results of the training of the proposed pretrained AlexNet CNN model. Both binary and ternary classifications were performed without any extra procedure such as feature extraction. By performing data set creation from short-term spectrogram graphic images, the authors were able to achieve 100% accuracy for binary classification for epileptic seizure detection and 100% for ternary classification. Discussion/Conclusion: The proposed automatic identification and classification model can help in the early diagnosis of epilepsy, thus providing the opportunity for effective early treatment.

Identifying strong lenses with unsupervised machine learning using convolutional autoencoder

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa1015 ◽

2020 ◽

Vol 494 (3) ◽

pp. 3750-3765 ◽

Cited By ~ 4

Author(s):

Ting-Yun Cheng ◽

Nan Li ◽

Christopher J Conselice ◽

Alfonso Aragón-Salamanca ◽

Simon Dye ◽

...

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Binary Classification ◽

Gaussian Mixture ◽

Gravitational Lenses ◽

Imaging Data ◽

Training Set ◽

Unsupervised Machine Learning ◽

Starting Point ◽

Convolutional Autoencoder

ABSTRACT In this paper, we develop a new unsupervised machine learning technique comprised of a feature extractor, a convolutional autoencoder, and a clustering algorithm consisting of a Bayesian Gaussian mixture model. We apply this technique to visual band space-based simulated imaging data from the Euclid Space Telescope using data from the strong gravitational lenses finding challenge. Our technique promisingly captures a variety of lensing features such as Einstein rings with different radii, distorted arc structures, etc., without using predefined labels. After the clustering process, we obtain several classification clusters separated by different visual features which are seen in the images. Our method successfully picks up ∼63 per cent of lensing images from all lenses in the training set. With the assumed probability proposed in this study, this technique reaches an accuracy of 77.25 ± 0.48 per cent in binary classification using the training set. Additionally, our unsupervised clustering process can be used as the preliminary classification for future surveys of lenses to efficiently select targets and to speed up the labelling process. As the starting point of the astronomical application using this technique, we not only explore the application to gravitationally lensed systems, but also discuss the limitations and potential future uses of this technique.