Robust Machine Learning Applied to Astronomical Data Sets. I. Star‐Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees

2006 ◽  
Vol 650 (1) ◽  
pp. 497-509 ◽  
Author(s):  
Nicholas M. Ball ◽  
Robert J. Brunner ◽  
Adam D. Myers ◽  
David Tcheng
2012 ◽  
Vol 10 (H16) ◽  
pp. 1-15
Author(s):  
Karen L. Masters

Abstract We live in a universe filled with galaxies with an amazing variety of sizes and shapes. One of the biggest challenges for astronomers working in this field is to understand how all these types relate to each other against the background of an expanding universe. Modern astronomical surveys (like the Sloan Digital Sky Survey) have revolutionised this field of astronomy by providing vast numbers of galaxies to study. The sheer size of these databases made traditional visual classification of galaxy types impossible and in 2007 inspired the Galaxy Zoo project (www.galaxyzoo.org), which started the largest ever scientific collaboration by asking members of the public to help classify galaxies by type and shape. Galaxy Zoo has since shown itself, in a series of now more than 30 scientific papers, to be a fantastic database for the study of galaxy evolution. In this Invited Discourse I spoke a little about the historical background of our understanding of what galaxies are, about galaxy classification, and about our modern view of galaxies in the era of large surveys. I finished by showcasing some of the contributions that galaxy classifications from the Galaxy Zoo project are making to our understanding of galaxy evolution.


Intrusion is a major threat: an attacker gains unauthorized access to data or to a legitimate network by using a legitimate user's identity or by exploiting back doors and vulnerabilities in the network. Intrusion Detection System (IDS) mechanisms are developed to detect intrusions at various levels. The objective of this research is to improve IDS performance by applying decision-tree-based machine learning techniques to the detection and classification of attacks. The adopted methodology processes the data sets in three stages. The experiments are conducted on the KDDCUP99 data sets, based on the number of features. The three Bayesian modes are analyzed for data sets of different sizes, based on the total number of attacks. The time the classifier takes to build the model is analyzed, and its accuracy is evaluated.


2020 ◽  
Vol 497 (4) ◽  
pp. 4077-4090 ◽  
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey

ABSTRACT A non-zero mutual information between the morphology of a galaxy and its large-scale environment is known to exist in the Sloan Digital Sky Survey (SDSS) up to a few tens of Mpc. It is important to test the statistical significance of any such mutual information. We propose three different methods to test the statistical significance of this non-zero mutual information and apply them to SDSS and the Millennium run simulation. We randomize the morphological information of SDSS galaxies without affecting their spatial distribution and compare the mutual information in the original and randomized data sets. We also divide the galaxy distribution into smaller subcubes and randomly shuffle them many times, keeping the morphological information of galaxies intact. We compare the mutual information in the original SDSS data and its shuffled realizations for different shuffling lengths. Using a t-test, we find that a small but statistically significant (at 99.9 per cent confidence level) mutual information between morphology and environment exists up to the entire length-scale probed. We also conduct another experiment using mock data sets from a semi-analytic galaxy catalogue where we assign morphology to galaxies in a controlled manner based on the density at their locations. The experiment clearly demonstrates that mutual information can effectively capture the physical correlations between morphology and environment. Our analysis suggests that physical association between morphology and environment may extend to much larger length-scales than currently believed, and the information theoretic framework presented here can serve as a sensitive and useful probe of the assembly bias and large-scale environmental dependence of galaxy properties.
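The randomization idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes morphology and environment have already been reduced to discrete labels per galaxy, and the function names `mutual_information` and `shuffled_mi` are hypothetical. It computes the mutual information of the paired labels, then recomputes it after shuffling one label list, which breaks any physical association while preserving the marginal distributions.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def shuffled_mi(xs, ys, n_shuffles=200, seed=0):
    """Mutual information values after randomly permuting ys (illustrative null sample)."""
    rng = random.Random(seed)
    ys = list(ys)                  # copy so the caller's data is untouched
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(ys)
        values.append(mutual_information(xs, ys))
    return values
```

Comparing the value from the original pairing with the spread of the shuffled values gives a rough sense of whether the measured mutual information exceeds what chance alone produces; the abstract's own analysis uses a t-test and spatial shuffling of subcubes, which this sketch does not reproduce.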


2020 ◽  
Vol 499 (1) ◽  
pp. L31-L35
Author(s):  
Biswajit Pandey

ABSTRACT Red and blue galaxies are traditionally classified using some specific cuts in colour or other galaxy properties, which are supported by empirical arguments. The vagueness associated with such cuts is likely to introduce a significant contamination in these samples. Fuzzy sets are vague boundary sets that can efficiently capture the classification uncertainty in the absence of any precise boundary. We propose a method for classification of galaxies according to their colours using fuzzy set theory. We use data from the Sloan Digital Sky Survey (SDSS) to construct a fuzzy set for red galaxies with its members having different degrees of ‘redness’. We show that the fuzzy sets for the blue and green galaxies can be obtained from it using different fuzzy operations. We also explore the possibility of using fuzzy relation to study the relationship between different galaxy properties and discuss its strengths and limitations.
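As a rough illustration of how a fuzzy set for 'redness' could yield blue and green sets through fuzzy operations, the sketch below uses the standard complement (1 - μ) and intersection (min) operators. The sigmoid membership function, its `midpoint` and `steepness` values, and the rescaling of the green set are all assumptions made for this example and are not taken from the paper.

```python
import math


def red_membership(colour, midpoint=2.2, steepness=5.0):
    """Degree of 'redness' in [0, 1]; sigmoid shape and parameters are illustrative."""
    return 1.0 / (1.0 + math.exp(-steepness * (colour - midpoint)))


def fuzzy_not(mu):
    """Standard fuzzy complement."""
    return 1.0 - mu


def fuzzy_and(a, b):
    """Standard fuzzy intersection (minimum)."""
    return min(a, b)


def colour_memberships(colour):
    """Memberships in the red, blue and green fuzzy sets for one colour value."""
    red = red_membership(colour)
    blue = fuzzy_not(red)                  # blue as the complement of red
    # Green as 'red AND not red', rescaled so it peaks at 1 where red == blue == 0.5.
    green = 2.0 * fuzzy_and(red, blue)
    return {"red": red, "blue": blue, "green": green}
```

Under these assumptions a galaxy at the midpoint colour is fully 'green' and only half 'red', while galaxies far from the midpoint belong almost entirely to the red or the blue set, so no hard cut is needed.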


Author(s):  
Charles X. Ling ◽  
John J. Parry ◽  
Handong Wang

Nearest Neighbour (NN) learning algorithms utilize a distance function to determine the classification of testing examples. The attribute weights in the distance function should be set appropriately. We study situations where a simple approach of setting attribute weights using decision trees does not work well, and design three improvements. We test these new methods thoroughly using artificially generated datasets and datasets from the machine learning repository.
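The role of attribute weights in the distance function can be shown with a small sketch. It assumes a weighted Euclidean distance and takes the weights as given; how the paper derives them from decision trees is not described in the abstract, so the functions `weighted_distance` and `nn_classify` and the hard-coded weights in the usage note are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def nn_classify(query, examples, weights):
    """Label of the training example nearest to query.

    examples is a list of (features, label) pairs.
    """
    _, label = min(examples, key=lambda ex: weighted_distance(query, ex[0], weights))
    return label
```

For example, with `examples = [((0.0, 10.0), "a"), ((1.0, 0.0), "b")]` and the query `(0.1, 0.0)`, equal weights `(1.0, 1.0)` select label "b", whereas weights `(1.0, 0.0)`, which ignore the second attribute, select label "a". This is why poorly chosen weights can change the classification.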


2020 ◽  
Vol 493 (3) ◽  
pp. 4209-4228 ◽  
Author(s):  
Ting-Yun Cheng ◽  
Christopher J Conselice ◽  
Alfonso Aragón-Salamanca ◽  
Nan Li ◽  
Asa F L Bluck ◽  
...  

ABSTRACT There are several supervised machine learning methods used for the application of automated morphological classification of galaxies; however, there has not yet been a clear comparison of these different methods using imaging data, or an investigation for maximizing their effectiveness. We carry out a comparison between several common machine learning methods for galaxy classification [Convolutional Neural Network (CNN), K-nearest neighbour, logistic regression, Support Vector Machine, Random Forest, and Neural Networks] by using Dark Energy Survey (DES) data combined with visual classifications from the Galaxy Zoo 1 project (GZ1). Our goal is to determine the optimal machine learning methods when using imaging data for galaxy classification. We show that CNN is the most successful method of these ten methods in our study. Using a sample of ∼2800 galaxies with visual classification from GZ1, we reach an accuracy of ∼0.99 for the morphological classification of ellipticals and spirals. Further investigation of the galaxies that have different ML and visual classifications but high predicted probabilities in our CNN usually reveals that the classification provided by GZ1 is incorrect. We further find that the galaxies having a low probability of being either spirals or ellipticals are visually lenticulars (S0), demonstrating that supervised learning is able to rediscover that this class of galaxy is distinct from both ellipticals and spirals. We confirm that ∼2.5 per cent of galaxies are misclassified by GZ1 in our study. After correcting these galaxies’ labels, we improve our CNN performance to an average accuracy of over 0.99 (accuracy of 0.994 is our best result).


2020 ◽  
Vol 493 (3) ◽  
pp. 4078-4093 ◽  
Author(s):  
Samuel R Hinton ◽  
Cullan Howlett ◽  
Tamara M Davis

ABSTRACT We compare the performance of four state-of-the-art models for extracting isotropic measurements of the baryon acoustic oscillation (BAO) scale. To do this, we created a new, public, modular code barry, which contains data sets, model fitting tools, and model implementations incorporating different descriptions of non-linear physics and algorithms for isolating the BAO feature. These are then evaluated for bias, correlation, and fitting strength using mock power spectra and correlation functions developed for the Sloan Digital Sky Survey Data Release 12. Our main findings are as follows: (1) all of the models can recover unbiased constraints when fit to the pre- and post-reconstruction simulations. (2) Models that provide physical descriptions of the damping of the BAO feature (using e.g. standard perturbation or effective-field theory arguments) report smaller errors on average, although the distribution of mock χ² values indicates these are underestimated. (3) Allowing the BAO damping scale to vary can provide tighter constraints for some mocks, but is an artificial improvement that only arises when noise randomly sharpens the BAO peak. (4) Unlike recent claims in the literature when utilizing a BAO Extractor technique, we find no improvement in the accuracy of the recovered BAO scale. (5) We implement a procedure for combining all models into a single consensus result that improves over the standard method without obviously underestimating the uncertainties. Overall, barry provides a framework for performing the cosmological analyses for upcoming surveys, and for rapidly testing and validating new models.

