The effect of cluster location and dataset size on 2-stage k-means algorithm

Author(s):  
Raied Salman ◽  
Vojislav Kecman
2021 ◽  
Vol 11 (2) ◽  
pp. 796
Author(s):  
Alhanoof Althnian ◽  
Duaa AlSaeed ◽  
Heyam Al-Baity ◽  
Amani Samha ◽  
Alanoud Bin Dris ◽  
...  

Dataset size is a major concern in the medical domain, where lack of data is a common occurrence. This study investigates the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely used models in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model that is robust to limited data does not necessarily provide the best performance compared to other models.
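One plausible dataset size reduction scenario like those described above is stratified subsampling, which shrinks a dataset while preserving its class distribution. The sketch below is illustrative (function name and toy data are not from the paper):

```python
import random
from collections import defaultdict

def stratified_subsample(samples, fraction, seed=0):
    """Draw a class-stratified subsample so the reduced dataset
    keeps (approximately) the original label distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    reduced = []
    for label, group in by_label.items():
        # keep the same fraction of every class, at least one sample
        k = max(1, round(len(group) * fraction))
        reduced.extend(rng.sample(group, k))
    rng.shuffle(reduced)
    return reduced

# Toy imbalanced dataset: 80 negatives, 20 positives
data = [([i], 0) for i in range(80)] + [([i], 1) for i in range(20)]
half = stratified_subsample(data, 0.5)
print(len(half))  # 50 samples, still an 80/20 class ratio
```

A reduction scheme that instead sampled uniformly would let the class ratio drift, confounding size effects with distribution effects, which is why stratification is the natural control here.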


2021 ◽  
Vol 13 (4) ◽  
pp. 596
Author(s):  
David Vint ◽  
Matthew Anderson ◽  
Yuhao Yang ◽  
Christos Ilioudis ◽  
Gaetano Di Caterina ◽  
...  

In recent years, the technological advances leading to the production of high-resolution Synthetic Aperture Radar (SAR) images have enabled increasingly effective target recognition capabilities. However, high spatial resolution is not always achievable and, for some particular sensing modes, such as foliage penetrating radars, low-resolution imaging is often the only option. In this paper, the problem of automatic target recognition in low-resolution Foliage Penetrating (FOPEN) SAR is addressed through the use of Convolutional Neural Networks (CNNs) able to extract both low- and high-level features of the imaged targets. Additionally, to address the issue of limited dataset size, Generative Adversarial Networks (GANs) are used to enlarge the training set. Finally, a Receiver Operating Characteristic (ROC)-based post-classification decision approach is used to reduce classification errors and measure the capability of the classifier to provide a reliable output. The effectiveness of the proposed framework is demonstrated on real FOPEN SAR data.
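An ROC-based post-classification decision of the kind mentioned above can be sketched as choosing a score threshold that caps the false-positive rate, with outputs below the threshold rejected as unreliable. This is a hedged reconstruction of the general technique, not the authors' exact procedure:

```python
def operating_threshold(scores, labels, max_fpr=0.1):
    """Pick the lowest score threshold whose false-positive rate stays
    within max_fpr; classifications scoring below it are rejected.
    labels: 1 = target class, 0 = non-target (assumes both present)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    neg = sum(1 for _, l in pairs if l == 0)
    fp = 0
    threshold = pairs[0][0]
    for score, label in pairs:
        if label == 0:
            fp += 1
            if fp / neg > max_fpr:  # accepting this score would exceed the FPR budget
                break
        threshold = score
    return threshold

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(operating_threshold(scores, labels, max_fpr=0.4))  # 0.8
```

Walking the ROC curve from the highest score down and stopping at the FPR budget trades recall for reliability, which matches the paper's stated goal of reducing classification errors.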


1987 ◽  
Vol 67 (1) ◽  
pp. 331-335
Author(s):  
HAK-YOON JU ◽  
W. JOHN MULLIN

The ascorbic acid (vitamin C) content of fresh imported field tomatoes and Nova Scotia greenhouse and field tomatoes was determined on a bi-weekly basis during the period of availability of each type of tomato to the Nova Scotia consumer in 1984. The average ascorbic acid contents of imported, Nova Scotia field, and greenhouse tomatoes were 13.3, 16.7 and 17.7 mg 100 g−1 fresh weight, respectively. A study of nine recommended or promising field tomato cultivars for the Atlantic region showed significant differences in ascorbic acid content among the cultivars. The cultivar Quick Pick had the highest ascorbic acid content, 22.5 ± 1.5 mg 100 g−1; the cultivar Campbell 18 had the lowest, 12.0 ± 2.9 mg 100 g−1. In Dombito greenhouse tomatoes, the effects of maturity stage and cluster location on ascorbic acid content were tested. The lowest ascorbic acid content, 9.1 ± 1.0 mg 100 g−1, was found in small green tomatoes, while others from mature green to overripe contained 14.0–16.7 mg 100 g−1. Tomatoes from different cluster locations showed no significant difference in ascorbic acid content.
Key words: Vitamin C, L-ascorbic acid, tomatoes


2021 ◽  
pp. gr.275777.121
Author(s):  
George W Armstrong ◽  
Kalen Cantrell ◽  
Shi Huang ◽  
Daniel McDonald ◽  
Niina Haiminen ◽  
...  

The number of publicly available microbiome samples is continually growing. As dataset size increases, bottlenecks arise in standard analytical pipelines. Faith's phylogenetic diversity is a highly utilized phylogenetic alpha diversity metric that has thus far failed to effectively scale to trees with millions of vertices. Stacked Faith's Phylogenetic Diversity (SFPhD) enables calculation of this widely adopted diversity metric at a much larger scale by implementing a computationally efficient algorithm. The algorithm reduces the amount of computational resources required, resulting in more accessible software with a reduced carbon footprint, as compared to previous approaches. The new algorithm produces identical results to the previous method. We further demonstrate that the phylogenetic aspect of Faith's PD provides increased power in detecting diversity differences between younger and older populations in the FINRISK study's metagenomic data.
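For reference, Faith's PD itself is simply the total branch length of the subtree connecting the observed taxa to the root. A naive (non-stacked) sketch over a parent-pointer tree makes the metric concrete; the tree encoding here is illustrative, not the paper's data structure:

```python
def faiths_pd(parents, branch_len, observed):
    """Faith's PD: total branch length of the subtree spanning the
    observed taxa and the root.  parents maps node -> parent;
    branch_len maps node -> length of the branch above that node."""
    counted = set()
    total = 0.0
    for node in observed:
        # walk to the root, adding each branch at most once
        while node in parents and node not in counted:
            counted.add(node)
            total += branch_len[node]
            node = parents[node]
    return total

# Tiny tree: root -> A (1.0), root -> B (2.0);
#            A -> t1 (0.5), A -> t2 (0.5); B -> t3 (1.5)
parents = {"A": "root", "B": "root", "t1": "A", "t2": "A", "t3": "B"}
blen = {"A": 1.0, "B": 2.0, "t1": 0.5, "t2": 0.5, "t3": 1.5}
print(faiths_pd(parents, blen, ["t1", "t3"]))  # 0.5 + 1.0 + 1.5 + 2.0 = 5.0
```

This per-sample root walk is what becomes a bottleneck on trees with millions of vertices, motivating the stacked, cache-friendly reformulation the abstract describes.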


Author(s):  
Erhan Sezerer ◽  
Samet Tenekeci ◽  
Ali Acar ◽  
Bora Baloğlu ◽  
Selma Tekir

In the field of software engineering, practitioners' contribution to the constructed body of knowledge should not be underestimated, and it is mostly in the form of grey literature (GL). GL is a valuable resource, though it is subjective and lacks an objective quality-assurance methodology. In this paper, a quality assessment scheme is proposed for question and answer (Q&A) sites. In particular, we target the Stack Overflow (SO) and Stack Exchange (SE) sites. We model the problem of author reputation measurement as a classification task on the author-provided answers. The authors' mean, median, and total answer scores are used as inputs for class labeling. State-of-the-art language models (BERT and DistilBERT) with a softmax layer on top are utilized as classifiers and compared to SVM and random baselines. Our best model achieves [Formula: see text] accuracy in binary classification on the SO design patterns tag and [Formula: see text] accuracy on the SE software engineering category. The superior performance on SE software engineering can be explained by its larger dataset size. In addition to quantitative evaluation, we provide qualitative evidence supporting that the system's predicted reputation labels match the quality of the provided answers.
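The class-labeling step, deriving binary labels from answer scores, can be sketched with the median variant mentioned above. The helper name, threshold rule, and toy data are illustrative assumptions, not the paper's exact scheme:

```python
from statistics import median

def label_answers(answers):
    """Assign a binary quality label to each (answer_text, score) pair:
    answers at or above the median score are 'high', the rest 'low'.
    A simple stand-in for score-based class labeling."""
    m = median(score for _, score in answers)
    return [(text, "high" if score >= m else "low") for text, score in answers]

toy = [("use a factory here", 7), ("just hardcode it", 1),
       ("apply dependency injection", 5), ("copy-paste the class", 3)]
print(label_answers(toy))  # 7 and 5 labeled "high"; 1 and 3 labeled "low"
```

A median split yields balanced classes by construction, which is convenient for training the BERT-style classifiers the abstract mentions; mean- or total-score thresholds would shift the class balance.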


2021 ◽  
Vol 309 ◽  
pp. 01111
Author(s):  
Mohammed Junaid Ahmed ◽  
Padmalaya Nayak

Leukemia detection and diagnosis by inspecting blood cell images is an intriguing and active research area in both the artificial intelligence and medical research fields. Numerous procedures have been developed to examine blood samples for leukemia; these fall into traditional methods and deep learning (DL) methods. This survey paper reviews the traditional, machine learning (ML), and deep learning techniques that have been used in leukemia diagnosis based on blood cell images, and compares the two methodologies in terms of quality of assessment, accuracy, cost, and speed. The article covers 11 research papers: 9 of these studies used traditional methods based on image processing and ML algorithms, such as K-nearest neighbor (KNN), K-means, SVM, and naïve Bayes, and 2 used advanced deep learning techniques, particularly Convolutional Neural Networks (CNNs), which are the most widely used in leukemia detection because they are highly accurate, fast, and low cost. The paper also analyzes a number of recent works in the field, including dataset sizes, the techniques employed, the results obtained, and so on. Finally, based on the survey conducted, it can be concluded that CNN-based systems have achieved great success in the field, whether in feature extraction, classification time, accuracy, or low detection cost.


2021 ◽  
Vol 7 ◽  
pp. e571
Author(s):  
Nurdan Ayse Saran ◽  
Murat Saran ◽  
Fatih Nar

In the last decade, deep learning has been applied to a wide range of problems with tremendous success. This success mainly comes from large data availability, increased computational power, and theoretical improvements in the training phase. As the dataset grows, the real world is better represented, making it possible to develop a model that can generalize. However, creating a labeled dataset is expensive and time-consuming, and in some domains difficult if not infeasible. Therefore, researchers have proposed data augmentation methods that increase dataset size and variety by creating variations of the existing data. For image data, variations can be obtained by applying color or spatial transformations, either individually or in combination. Color transformations perform linear or nonlinear operations on the entire image or on patches to create variations of the original image. Current color-based augmentation methods are usually based on image processing operations such as equalizing, solarizing, and posterizing. Nevertheless, these color-based data augmentation methods do not guarantee plausible variations of the image. This paper proposes a novel distribution-preserving data augmentation method that creates plausible image variations by shifting pixel colors to another point in the image color distribution. We achieve this by defining a regularized density-decreasing direction to create paths from the original pixels' colors to the distribution tails. The proposed method provides superior performance compared to existing data augmentation methods, as shown in a transfer learning scenario on the UC Merced Land Use, Intel Image Classification, and Oxford-IIIT Pet datasets for classification and segmentation tasks.
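As a much-simplified one-dimensional analogue of the idea (not the authors' regularized method), one can move each grayscale value one step toward the less dense neighbouring histogram bin, i.e. in a density-decreasing direction of the image's own intensity distribution:

```python
from collections import Counter

def shift_toward_tail(pixels, step=1, levels=256):
    """Toy density-decreasing augmentation for a 1-D grayscale image:
    each pixel value moves one step toward whichever neighbouring
    intensity bin is less populated, drifting values toward the
    tails of the image's own histogram."""
    hist = Counter(pixels)
    out = []
    for v in pixels:
        lo = hist[max(v - step, 0)]            # density one step down
        hi = hist[min(v + step, levels - 1)]   # density one step up
        if hi < lo:
            out.append(min(v + step, levels - 1))
        elif lo < hi:
            out.append(max(v - step, 0))
        else:
            out.append(v)  # no density gradient: leave unchanged
    return out

# A mode at 10 with lighter tails at 9 and 11
print(shift_toward_tail([10, 10, 10, 10, 10, 11, 9, 9, 9]))
```

Unlike solarizing or posterizing, which apply a fixed transform regardless of the image, the shift here is driven by the image's own distribution, which is the intuition the paper builds on (in full color space, with a regularized direction rather than this greedy step).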


2016 ◽  
Author(s):  
Kenta Suzuki ◽  
Katsuhiko Yoshida ◽  
Yumiko Nakanishi ◽  
Shinji Fukuda

Mapping the network of ecological interactions is key to understanding the composition, stability, function and dynamics of microbial communities. In recent years, various approaches have been used to reveal microbial interaction networks from metagenomic sequencing data, such as time-series analysis, machine learning and statistical techniques. Despite these efforts, it is still not possible to capture details of the ecological interactions behind complex microbial dynamics.

We developed the sparse S-map method (SSM), which generates a sparse interaction network from a multivariate ecological time-series without presuming any mathematical formulation for the underlying microbial processes. The advantage of the SSM over alternative methodologies is that it fully utilizes the observed data within a framework of empirical dynamic modelling. This makes the SSM robust to non-equilibrium dynamics and to underlying complexity (nonlinearity) in microbial processes.

We showed that an increase in dataset size or a decrease in observational error improved the accuracy of the SSM, whereas the accuracy of a comparative equation-based method was almost unchanged in both cases and equivalent to the SSM at best. Hence, the SSM outperformed the equation-based method when datasets were large and the magnitude of observational errors was small. The results were robust to the magnitude of process noise and to the functional forms of inter-specific interactions that we tested. We applied the method to microbiome data from six mice and found different microbial interaction regimes between young-to-middle-age (4-40 week-old) and middle-to-old-age (36-72 week-old) mice.

The complexity of microbial relationships impedes detailed equation-based modeling. Our method provides a powerful alternative framework for inferring ecological interaction networks of microbial communities in various environments, and will be improved by further developments in metagenomic sequencing technologies leading to increased dataset size and improved accuracy and precision.

