Robust Machine Learning Applied to Astronomical Data Sets. I. Star‐Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees

ABSTRACT We live in a universe filled with galaxies of an amazing variety of sizes and shapes. One of the biggest challenges for astronomers working in this field is to understand how all these types relate to each other against the background of an expanding universe. Modern astronomical surveys (like the Sloan Digital Sky Survey) have revolutionised this field of astronomy by providing vast numbers of galaxies to study. The sheer size of these databases made traditional visual classification of the types of galaxies impossible and in 2007 inspired the Galaxy Zoo project (www.galaxyzoo.org), starting the largest ever scientific collaboration by asking members of the public to help classify galaxies by type and shape. Galaxy Zoo has since shown itself, in a series of now more than 30 scientific papers, to be a fantastic database for the study of galaxy evolution. In this Invited Discourse I spoke a little about the historical background of our understanding of what galaxies are, about galaxy classification, and about our modern view of galaxies in the era of large surveys. I finished by showcasing some of the contributions that galaxy classifications from the Galaxy Zoo project are making to our understanding of galaxy evolution.

Decision Tree: A Machine Learning for Intrusion Detection

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.f1234.0486s419 ◽ 2019 ◽ Vol 8 (6S4) ◽ pp. 1126-1130

Keyword(s): Machine Learning ◽ Intrusion Detection ◽ Detection System ◽ Research Work ◽ Machine Learning Techniques ◽ Data Sets ◽ Legitimate User ◽ Learning Techniques ◽ Three Stages

Intrusion is a major threat: unauthorized access to data or to a legitimate network, using a legitimate user's identity or any of the back doors and vulnerabilities in the network. Intrusion Detection System (IDS) mechanisms are developed to detect intrusions at various levels. The objective of this research work is to improve IDS performance by applying machine learning techniques based on decision trees to the detection and classification of attacks. The methodology adopted processes the data sets in three stages. The experiments are conducted on the KDDCUP99 data sets, based on the number of features. The Bayesian three modes are analyzed for data sets of different sizes, based upon the total number of attacks. The time taken by the classifier to build the model is analyzed and its accuracy is evaluated.
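
As a rough illustration of the decision-tree idea mentioned in this abstract, the sketch below picks the single threshold split with the highest information gain on a tiny invented data set. It is not the paper's method or its three-stage pipeline: the feature columns (`duration`, `failed_logins`), the toy rows and the function names are all assumptions made for this example, and no KDDCUP99 data is used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting on rows[i][feature] <= threshold."""
    left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Return (gain, feature index, threshold) of the best single split."""
    best = (0.0, None, None)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            gain = information_gain(rows, labels, feature, threshold)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best


# Invented toy connections: (duration, failed_logins). Not real KDDCUP99 records.
rows = [(1, 0), (2, 0), (30, 1), (3, 5), (40, 6), (2, 7)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

gain, feature, threshold = best_split(rows, labels)
print(f"split on feature {feature} at <= {threshold}, gain = {gain:.3f} bits")
```

A full decision tree would apply `best_split` recursively to each side of the split until the subsets are pure or a depth limit is reached.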

A study on the statistical significance of mutual information between morphology of a galaxy and its large-scale environment

Monthly Notices of the Royal Astronomical Society ◽ 10.1093/mnras/staa2236 ◽ 2020 ◽ Vol 497 (4) ◽ pp. 4077-4090 ◽ Cited By ~ 1

Author(s): Suman Sarkar ◽ Biswajit Pandey

Keyword(s): Mutual Information ◽ Large Scale ◽ Statistical Significance ◽ Sloan Digital Sky Survey ◽ Data Sets ◽ Information Theoretic ◽ Galaxy Distribution ◽ Sky Survey ◽ The Galaxy ◽ Physical Correlations

ABSTRACT A non-zero mutual information between the morphology of a galaxy and its large-scale environment is known to exist in the Sloan Digital Sky Survey (SDSS) up to a few tens of Mpc. It is important to test the statistical significance of this mutual information, if any. We propose three different methods to test the statistical significance of this non-zero mutual information and apply them to SDSS and the Millennium Run simulation. We randomize the morphological information of SDSS galaxies without affecting their spatial distribution and compare the mutual information in the original and randomized data sets. We also divide the galaxy distribution into smaller subcubes and randomly shuffle them many times, keeping the morphological information of galaxies intact. We compare the mutual information in the original SDSS data and its shuffled realizations for different shuffling lengths. Using a t-test, we find that a small but statistically significant (at 99.9 per cent confidence level) mutual information between morphology and environment exists up to the entire length-scale probed. We also conduct another experiment using mock data sets from a semi-analytic galaxy catalogue where we assign morphology to galaxies in a controlled manner based on the density at their locations. The experiment clearly demonstrates that mutual information can effectively capture the physical correlations between morphology and environment. Our analysis suggests that physical association between morphology and environment may extend to much larger length-scales than currently believed, and the information theoretic framework presented here can serve as a sensitive and useful probe of the assembly bias and large-scale environmental dependence of galaxy properties.
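
The randomization test described in this abstract can be illustrated with a minimal sketch: compute the mutual information between two discrete label sequences, then compare it with the values obtained after shuffling one sequence. This is a simplified stand-in and not the authors' code. The toy labels, the function names and the use of a plain label shuffle are assumptions; the length-scale dependence, the subcube shuffling and the t-test are not reproduced.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length label sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def shuffled_mi(xs, ys, n_shuffles=200, seed=0):
    """Mutual information after randomly permuting xs, repeated n_shuffles times."""
    rng = random.Random(seed)
    shuffled = list(xs)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any pairing between xs and ys
        values.append(mutual_information(shuffled, ys))
    return values


# Invented toy catalogue: ellipticals ("E") favour the dense environment bin.
morphology = ["E"] * 40 + ["S"] * 10 + ["E"] * 10 + ["S"] * 40
environment = ["dense"] * 50 + ["sparse"] * 50

observed = mutual_information(morphology, environment)
null = shuffled_mi(morphology, environment)
print(f"observed = {observed:.4f} bits, largest shuffled value = {max(null):.4f} bits")
```

If the observed value sits well above every shuffled value, the pairing between the two labels carries information that random reassignment destroys, which is the intuition behind the randomization step.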

Learning to see the wood for the trees: machine learning, decision trees, and the classification of isolated theropod teeth

Palaeontology ◽ 10.1111/pala.12512 ◽ 2020

Author(s): Simon Wills ◽ Charlie J. Underwood ◽ Paul M. Barrett

Keyword(s): Machine Learning ◽ Decision Trees

A method for classification of red, blue, and green galaxies using fuzzy set theory

Monthly Notices of the Royal Astronomical Society Letters ◽ 10.1093/mnrasl/slaa152 ◽ 2020 ◽ Vol 499 (1) ◽ pp. L31-L35

Author(s): Biswajit Pandey

Keyword(s): Fuzzy Sets ◽ Set Theory ◽ Fuzzy Set ◽ Fuzzy Set Theory ◽ Sloan Digital Sky Survey ◽ Sky Survey ◽ Blue Galaxies ◽ Boundary Sets ◽ Red Galaxies

ABSTRACT Red and blue galaxies are traditionally classified using some specific cuts in colour or other galaxy properties, which are supported by empirical arguments. The vagueness associated with such cuts is likely to introduce a significant contamination in these samples. Fuzzy sets are vague boundary sets that can efficiently capture the classification uncertainty in the absence of any precise boundary. We propose a method for classification of galaxies according to their colours using fuzzy set theory. We use data from the Sloan Digital Sky Survey (SDSS) to construct a fuzzy set for red galaxies with its members having different degrees of ‘redness’. We show that the fuzzy sets for the blue and green galaxies can be obtained from it using different fuzzy operations. We also explore the possibility of using fuzzy relations to study the relationship between different galaxy properties and discuss their strengths and limitations.
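
To make the idea of graded membership concrete, here is a minimal sketch using the standard fuzzy complement (1 - μ) and intersection (min). It is only an illustration under stated assumptions, not the construction used in the paper: the sigmoid membership function, its `midpoint` and `steepness` values, and the rescaled intersection used for the "green" set are all hypothetical choices made for this example.

```python
import math


def red_membership(colour, midpoint=2.2, steepness=5.0):
    """Degree of 'redness' in [0, 1] as a sigmoid of a colour index.

    The midpoint and steepness are hypothetical illustration values.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (colour - midpoint)))


def fuzzy_complement(mu):
    """Standard fuzzy complement."""
    return 1.0 - mu


def fuzzy_intersection(mu_a, mu_b):
    """Standard fuzzy intersection (minimum)."""
    return min(mu_a, mu_b)


def memberships(colour):
    """Membership degrees in the red, blue and green fuzzy sets."""
    red = red_membership(colour)
    blue = fuzzy_complement(red)
    # Assumption: 'green' is the overlap of red and blue, rescaled so that it
    # peaks at 1 where a galaxy is equally red and blue.
    green = 2.0 * fuzzy_intersection(red, blue)
    return {"red": red, "blue": blue, "green": green}


for colour in (1.0, 2.2, 3.0):
    m = memberships(colour)
    print(colour, {name: round(value, 3) for name, value in m.items()})
```

Unlike a hard colour cut, every galaxy here belongs to all three sets to some degree, so objects near the boundary are not forced into a single class.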

Setting Attribute Weights for Nearest Neighbor Learning Algorithms Using C4.5

International Journal of Pattern Recognition and Artificial Intelligence ◽ 10.1142/s0218001497000184 ◽ 1997 ◽ Vol 11 (03) ◽ pp. 405-415 ◽ Cited By ~ 3

Author(s): Charles X. Ling ◽ John J. Parry ◽ Handong Wang

Keyword(s): Machine Learning ◽ Decision Trees ◽ Distance Function ◽ Nearest Neighbor ◽ Learning Algorithms ◽ Simple Approach ◽ Nearest Neighbour ◽ Attribute Weights ◽ New Methods

Nearest Neighbour (NN) learning algorithms utilize a distance function to determine the classification of testing examples. The attribute weights in the distance function should be set appropriately. We study situations where a simple approach of setting attribute weights using decision trees does not work well, and design three improvements. We test these new methods thoroughly using artificially generated datasets and datasets from the machine learning repository.
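
The role of attribute weights in the distance function can be shown with a small sketch: a 1-nearest-neighbour classifier whose Euclidean distance is weighted per attribute. The weights below are set by hand purely for illustration. Deriving them from a C4.5 decision tree, as the paper studies, is not reproduced here, and the function names and toy points are assumptions.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance with a non-negative weight on each attribute."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def nn_classify(query, examples, weights):
    """Label of the training example closest to query.

    examples is a list of (feature_vector, label) pairs.
    """
    _, label = min(examples, key=lambda ex: weighted_distance(query, ex[0], weights))
    return label


# Invented toy training set with two attributes.
train = [((0.0, 0.0), "A"), ((1.0, 5.0), "B")]
query = (0.9, 0.5)

# Equal weights: the second attribute dominates and the query looks like "A".
print(nn_classify(query, train, (1.0, 1.0)))
# Zero weight on the second attribute: only the first counts, giving "B".
print(nn_classify(query, train, (1.0, 0.0)))
```

The same query receives different labels under the two weightings, which is why the abstract stresses that the weights must be set appropriately.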

Spectral classification of emission-line galaxies from the Sloan Digital Sky Survey

Astronomy and Astrophysics ◽ 10.1051/0004-6361/201016143 ◽ 2011 ◽ Vol 531 ◽ pp. A71 ◽ Cited By ~ 11

Author(s): J. Marocco ◽ E. Hache ◽ F. Lamareille

Keyword(s): Emission Line ◽ Sloan Digital Sky Survey ◽ Spectral Classification ◽ Sky Survey ◽ Emission Line Galaxies

Photometric Classification of Stars, Galaxies and Quasars in the Sloan Digital Sky Survey DR6 Using Support Vector Machines

10.1063/1.3059095 ◽ 2008 ◽ Cited By ~ 3

Author(s): C. Elting ◽ C. A. L. Bailer-Jones ◽ K. W. Smith ◽ Coryn A.L. Bailer-Jones

Keyword(s): Support Vector Machines ◽ Sloan Digital Sky Survey ◽ Support Vector ◽ Sky Survey ◽ Vector Machines

Optimizing automatic morphological classification of galaxies with machine learning and deep learning using Dark Energy Survey imaging

Monthly Notices of the Royal Astronomical Society ◽ 10.1093/mnras/staa501 ◽ 2020 ◽ Vol 493 (3) ◽ pp. 4209-4228 ◽ Cited By ~ 6

Author(s): Ting-Yun Cheng ◽ Christopher J Conselice ◽ Alfonso Aragón-Salamanca ◽ Nan Li ◽ Asa F L Bluck ◽ ...

Keyword(s): Machine Learning ◽ Imaging Data ◽ Morphological Classification ◽ Learning Methods ◽ Visual Classification ◽ Machine Learning Methods ◽ Dark Energy Survey ◽ Galaxy Classification ◽ Energy Survey

ABSTRACT There are several supervised machine learning methods used for the automated morphological classification of galaxies; however, there has not yet been a clear comparison of these different methods using imaging data, or an investigation into maximizing their effectiveness. We carry out a comparison between several common machine learning methods for galaxy classification [Convolutional Neural Network (CNN), K-nearest neighbour, logistic regression, Support Vector Machine, Random Forest, and Neural Networks] by using Dark Energy Survey (DES) data combined with visual classifications from the Galaxy Zoo 1 project (GZ1). Our goal is to determine the optimal machine learning methods when using imaging data for galaxy classification. We show that CNN is the most successful method of these ten methods in our study. Using a sample of ∼2800 galaxies with visual classification from GZ1, we reach an accuracy of ∼0.99 for the morphological classification of ellipticals and spirals. Further investigation of the galaxies whose ML and visual classifications differ, but which have high predicted probabilities in our CNN, usually reveals that the classification provided by GZ1 is incorrect. We further find that the galaxies having a low probability of being either spirals or ellipticals are visually lenticulars (S0), demonstrating that supervised learning is able to rediscover that this class of galaxy is distinct from both ellipticals and spirals. We confirm that ∼2.5 per cent of galaxies are misclassified by GZ1 in our study. After correcting these galaxies’ labels, we improve our CNN performance to an average accuracy of over 0.99 (an accuracy of 0.994 is our best result).

barry and the BAO model comparison

Monthly Notices of the Royal Astronomical Society ◽ 10.1093/mnras/staa361 ◽ 2020 ◽ Vol 493 (3) ◽ pp. 4078-4093 ◽ Cited By ~ 5

Author(s): Samuel R Hinton ◽ Cullan Howlett ◽ Tamara M Davis

Keyword(s): Model Comparison ◽ State Of The Art ◽ Model Fitting ◽ Power Spectra ◽ Acoustic Oscillation ◽ Effective Field ◽ Sloan Digital Sky Survey ◽ Data Sets ◽ New Public ◽ Sky Survey

ABSTRACT We compare the performance of four state-of-the-art models for extracting isotropic measurements of the baryon acoustic oscillation (BAO) scale. To do this, we created a new, public, modular code barry, which contains data sets, model fitting tools, and model implementations incorporating different descriptions of non-linear physics and algorithms for isolating the BAO feature. These are then evaluated for bias, correlation, and fitting strength using mock power spectra and correlation functions developed for the Sloan Digital Sky Survey Data Release 12. Our main findings are as follows: (1) All of the models can recover unbiased constraints when fit to the pre- and post-reconstruction simulations. (2) Models that provide physical descriptions of the damping of the BAO feature (using e.g. standard perturbation or effective-field theory arguments) report smaller errors on average, although the distribution of mock χ² values indicates these are underestimated. (3) Allowing the BAO damping scale to vary can provide tighter constraints for some mocks, but is an artificial improvement that only arises when noise randomly sharpens the BAO peak. (4) Unlike recent claims in the literature when utilizing a BAO Extractor technique, we find no improvement in the accuracy of the recovered BAO scale. (5) We implement a procedure for combining all models into a single consensus result that improves over the standard method without obviously underestimating the uncertainties. Overall, barry provides a framework for performing the cosmological analyses for upcoming surveys, and for rapidly testing and validating new models.
