The effect of cluster location and dataset size on 2-stage k-means algorithm

Author(s):  
Raied Salman ◽  
Vojislav Kecman
2021 ◽  
Vol 11 (2) ◽  
pp. 796
Author(s):  
Alhanoof Althnian ◽  
Duaa AlSaeed ◽  
Heyam Al-Baity ◽  
Amani Samha ◽  
Alanoud Bin Dris ◽  
...  

Dataset size is a major concern in the medical domain, where lack of data is a common occurrence. This study investigates the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely used models in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model that is robust to limited data does not necessarily provide the best performance compared to other models.
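One plausible dataset size reduction scenario like those described above is stratified subsampling, which shrinks a dataset while preserving its class distribution. The sketch below is illustrative (function name and toy data are not from the paper):

```python
import random
from collections import defaultdict

def stratified_subsample(samples, fraction, seed=0):
    """Draw a class-stratified subsample so the reduced dataset
    keeps (approximately) the original label distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    reduced = []
    for label, group in by_label.items():
        # keep the same fraction of every class, at least one sample
        k = max(1, round(len(group) * fraction))
        reduced.extend(rng.sample(group, k))
    rng.shuffle(reduced)
    return reduced

# Toy imbalanced dataset: 80 negatives, 20 positives
data = [([i], 0) for i in range(80)] + [([i], 1) for i in range(20)]
half = stratified_subsample(data, 0.5)
print(len(half))  # 50 samples, still an 80/20 class ratio
```

A reduction scheme that instead sampled uniformly would let the class ratio drift, confounding size effects with distribution effects, which is why stratification is the natural control here.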


2021 ◽  
Vol 13 (4) ◽  
pp. 596
Author(s):  
David Vint ◽  
Matthew Anderson ◽  
Yuhao Yang ◽  
Christos Ilioudis ◽  
Gaetano Di Caterina ◽  
...  

In recent years, the technological advances leading to the production of high-resolution Synthetic Aperture Radar (SAR) images have enabled increasingly effective target recognition capabilities. However, high spatial resolution is not always achievable and, for some particular sensing modes, such as foliage penetrating radars, low-resolution imaging is often the only option. In this paper, the problem of automatic target recognition in low-resolution Foliage Penetrating (FOPEN) SAR is addressed through the use of Convolutional Neural Networks (CNNs) able to extract both low- and high-level features of the imaged targets. Additionally, to address the issue of limited dataset size, Generative Adversarial Networks (GANs) are used to enlarge the training set. Finally, a Receiver Operating Characteristic (ROC)-based post-classification decision approach is used to reduce classification errors and measure the capability of the classifier to provide a reliable output. The effectiveness of the proposed framework is demonstrated on real FOPEN SAR data.
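An ROC-based post-classification decision of the kind mentioned above can be sketched as choosing a score threshold that caps the false-positive rate, with outputs below the threshold rejected as unreliable. This is a hedged reconstruction of the general technique, not the authors' exact procedure:

```python
def operating_threshold(scores, labels, max_fpr=0.1):
    """Pick the lowest score threshold whose false-positive rate stays
    within max_fpr; classifications scoring below it are rejected.
    labels: 1 = target class, 0 = non-target (assumes both present)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    neg = sum(1 for _, l in pairs if l == 0)
    fp = 0
    threshold = pairs[0][0]
    for score, label in pairs:
        if label == 0:
            fp += 1
            if fp / neg > max_fpr:  # accepting this score would exceed the FPR budget
                break
        threshold = score
    return threshold

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(operating_threshold(scores, labels, max_fpr=0.4))  # 0.8
```

Walking the ROC curve from the highest score down and stopping at the FPR budget trades recall for reliability, which matches the paper's stated goal of reducing classification errors.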


1987 ◽  
Vol 67 (1) ◽  
pp. 331-335
Author(s):  
HAK-YOON JU ◽  
W. JOHN MULLIN

The ascorbic acid (vitamin C) content of fresh imported field tomatoes and Nova Scotia greenhouse and field tomatoes was determined on a bi-weekly basis during the period of availability of each type of tomato to the Nova Scotia consumer in 1984. The average ascorbic acid contents of imported, Nova Scotia field, and greenhouse tomatoes were 13.3, 16.7 and 17.7 mg 100 g−1 fresh weight, respectively. A study of nine recommended or promising field tomato cultivars for the Atlantic region showed significant differences in ascorbic acid content among the cultivars. The cultivar Quick Pick had the highest ascorbic acid content, 22.5 ± 1.5 mg 100 g−1; the cultivar Campbell 18 had the lowest, 12.0 ± 2.9 mg 100 g−1. In Dombito greenhouse tomatoes, the effects of maturity stage and cluster location on ascorbic acid content were tested. The lowest ascorbic acid content, 9.1 ± 1.0 mg 100 g−1, was found in small green tomatoes, while others from mature green to overripe contained 14.0–16.7 mg 100 g−1. Tomatoes from different cluster locations showed no significant difference in ascorbic acid content.
Key words: Vitamin C, L-ascorbic acid, tomatoes


2021 ◽  
pp. gr.275777.121
Author(s):  
George W Armstrong ◽  
Kalen Cantrell ◽  
Shi Huang ◽  
Daniel McDonald ◽  
Niina Haiminen ◽  
...  

The number of publicly available microbiome samples is continually growing. As dataset size increases, bottlenecks arise in standard analytical pipelines. Faith's phylogenetic diversity is a highly utilized phylogenetic alpha diversity metric that has thus far failed to effectively scale to trees with millions of vertices. Stacked Faith's Phylogenetic Diversity (SFPhD) enables calculation of this widely adopted diversity metric at a much larger scale by implementing a computationally efficient algorithm. The algorithm reduces the amount of computational resources required, resulting in more accessible software with a reduced carbon footprint, as compared to previous approaches. The new algorithm produces identical results to the previous method. We further demonstrate that the phylogenetic aspect of Faith's PD provides increased power in detecting diversity differences between younger and older populations in the FINRISK study's metagenomic data.
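For reference, Faith's PD itself is simply the total branch length of the subtree connecting the observed taxa to the root. A naive (non-stacked) sketch over a parent-pointer tree makes the metric concrete; the tree encoding here is illustrative, not the paper's data structure:

```python
def faiths_pd(parents, branch_len, observed):
    """Faith's PD: total branch length of the subtree spanning the
    observed taxa and the root.  parents maps node -> parent;
    branch_len maps node -> length of the branch above that node."""
    counted = set()
    total = 0.0
    for node in observed:
        # walk to the root, adding each branch at most once
        while node in parents and node not in counted:
            counted.add(node)
            total += branch_len[node]
            node = parents[node]
    return total

# Tiny tree: root -> A (1.0), root -> B (2.0);
#            A -> t1 (0.5), A -> t2 (0.5); B -> t3 (1.5)
parents = {"A": "root", "B": "root", "t1": "A", "t2": "A", "t3": "B"}
blen = {"A": 1.0, "B": 2.0, "t1": 0.5, "t2": 0.5, "t3": 1.5}
print(faiths_pd(parents, blen, ["t1", "t3"]))  # 0.5 + 1.0 + 1.5 + 2.0 = 5.0
```

This per-sample root walk is what becomes a bottleneck on trees with millions of vertices, motivating the stacked, cache-friendly reformulation the abstract describes.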


Author(s):  
Erhan Sezerer ◽  
Samet Tenekeci ◽  
Ali Acar ◽  
Bora Baloğlu ◽  
Selma Tekir

In the field of software engineering, practitioners' contribution to the constructed body of knowledge should not be underestimated, and it is mostly in the form of grey literature (GL). GL is a valuable resource, though it is subjective and lacks an objective quality-assurance methodology. In this paper, a quality assessment scheme is proposed for question and answer (Q&A) sites. In particular, we target the Stack Overflow (SO) and Stack Exchange (SE) sites. We model the problem of author reputation measurement as a classification task on the author-provided answers. The authors' mean, median, and total answer scores are used as inputs for class labeling. State-of-the-art language models (BERT and DistilBERT) with a softmax layer on top are utilized as classifiers and compared to SVM and random baselines. Our best model achieves [Formula: see text] accuracy in binary classification on the SO design patterns tag and [Formula: see text] accuracy on the SE software engineering category. The superior performance on SE software engineering can be explained by its larger dataset size. In addition to quantitative evaluation, we provide qualitative evidence supporting that the system's predicted reputation labels match the quality of the provided answers.
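The class-labeling step, deriving binary labels from answer scores, can be sketched with the median variant mentioned above. The helper name, threshold rule, and toy data are illustrative assumptions, not the paper's exact scheme:

```python
from statistics import median

def label_answers(answers):
    """Assign a binary quality label to each (answer_text, score) pair:
    answers at or above the median score are 'high', the rest 'low'.
    A simple stand-in for score-based class labeling."""
    m = median(score for _, score in answers)
    return [(text, "high" if score >= m else "low") for text, score in answers]

toy = [("use a factory here", 7), ("just hardcode it", 1),
       ("apply dependency injection", 5), ("copy-paste the class", 3)]
print(label_answers(toy))  # 7 and 5 labeled "high"; 1 and 3 labeled "low"
```

A median split yields balanced classes by construction, which is convenient for training the BERT-style classifiers the abstract mentions; mean- or total-score thresholds would shift the class balance.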


2021 ◽  
Vol 309 ◽  
pp. 01111
Author(s):  
Mohammed Junaid Ahmed ◽  
Padmalaya Nayak

Leukemia detection and diagnosis by inspecting blood cell images is an intriguing and active research area in both the artificial intelligence and medical research fields. Numerous procedures have been developed to examine blood samples for leukemia; these fall into traditional methods and deep learning (DL) methods. This survey paper reviews the traditional, machine learning (ML), and deep learning techniques that have been used in leukemia diagnosis based on blood cell images, and compares the two methodologies in terms of quality of assessment, accuracy, cost, and speed. The article covers 11 research papers: 9 of these studies used traditional methods based on image processing and ML algorithms, such as K-nearest neighbor (KNN), K-means, SVM, and naïve Bayes, and 2 used advanced deep learning techniques, particularly Convolutional Neural Networks (CNNs), which are the most widely used in leukemia detection because they are highly accurate, fast, and low cost. The paper also analyzes a number of recent works in the field, including dataset sizes, the techniques employed, the results obtained, and so on. Finally, based on the survey conducted, it can be concluded that CNN-based systems have achieved great success in the field, whether in feature extraction, classification time, accuracy, or low detection cost.


2021 ◽  
Vol 7 ◽  
pp. e571
Author(s):  
Nurdan Ayse Saran ◽  
Murat Saran ◽  
Fatih Nar

In the last decade, deep learning has been applied to a wide range of problems with tremendous success. This success mainly comes from large data availability, increased computational power, and theoretical improvements in the training phase. As the dataset grows, the real world is better represented, making it possible to develop a model that can generalize. However, creating a labeled dataset is expensive and time-consuming, and in some domains difficult if not infeasible. Therefore, researchers have proposed data augmentation methods that increase dataset size and variety by creating variations of the existing data. For image data, variations can be obtained by applying color or spatial transformations, either individually or in combination. Color transformations perform linear or nonlinear operations on the entire image or on patches to create variations of the original image. Current color-based augmentation methods are usually based on image processing operations such as equalizing, solarizing, and posterizing. Nevertheless, these color-based data augmentation methods do not guarantee plausible variations of the image. This paper proposes a novel distribution-preserving data augmentation method that creates plausible image variations by shifting pixel colors to another point in the image color distribution. We achieve this by defining a regularized density-decreasing direction to create paths from the original pixels' colors to the distribution tails. The proposed method provides superior performance compared to existing data augmentation methods, as shown in a transfer learning scenario on the UC Merced Land Use, Intel Image Classification, and Oxford-IIIT Pet datasets for classification and segmentation tasks.
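As a much-simplified one-dimensional analogue of the idea (not the authors' regularized method), one can move each grayscale value one step toward the less dense neighbouring histogram bin, i.e. in a density-decreasing direction of the image's own intensity distribution:

```python
from collections import Counter

def shift_toward_tail(pixels, step=1, levels=256):
    """Toy density-decreasing augmentation for a 1-D grayscale image:
    each pixel value moves one step toward whichever neighbouring
    intensity bin is less populated, drifting values toward the
    tails of the image's own histogram."""
    hist = Counter(pixels)
    out = []
    for v in pixels:
        lo = hist[max(v - step, 0)]            # density one step down
        hi = hist[min(v + step, levels - 1)]   # density one step up
        if hi < lo:
            out.append(min(v + step, levels - 1))
        elif lo < hi:
            out.append(max(v - step, 0))
        else:
            out.append(v)  # no density gradient: leave unchanged
    return out

# A mode at 10 with lighter tails at 9 and 11
print(shift_toward_tail([10, 10, 10, 10, 10, 11, 9, 9, 9]))
```

Unlike solarizing or posterizing, which apply a fixed transform regardless of the image, the shift here is driven by the image's own distribution, which is the intuition the paper builds on (in full color space, with a regularized direction rather than this greedy step).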


2016 ◽  
Author(s):  
Kenta Suzuki ◽  
Katsuhiko Yoshida ◽  
Yumiko Nakanishi ◽  
Shinji Fukuda

Mapping the network of ecological interactions is key to understanding the composition, stability, function and dynamics of microbial communities. In recent years, various approaches have been used to reveal microbial interaction networks from metagenomic sequencing data, such as time-series analysis, machine learning and statistical techniques. Despite these efforts, it is still not possible to capture details of the ecological interactions behind complex microbial dynamics.

We developed the sparse S-map method (SSM), which generates a sparse interaction network from a multivariate ecological time-series without presuming any mathematical formulation for the underlying microbial processes. The advantage of the SSM over alternative methodologies is that it fully utilizes the observed data within a framework of empirical dynamic modelling. This makes the SSM robust to non-equilibrium dynamics and to underlying complexity (nonlinearity) in microbial processes.

We showed that an increase in dataset size or a decrease in observational error improved the accuracy of the SSM, whereas the accuracy of a comparative equation-based method was almost unchanged in both cases and equivalent to the SSM at best. Hence, the SSM outperformed the equation-based method when datasets were large and the magnitude of observational errors was small. The results were robust to the magnitude of process noise and to the functional forms of inter-specific interactions that we tested. We applied the method to microbiome data from six mice and found different microbial interaction regimes between young-to-middle-age (4-40 week-old) and middle-to-old-age (36-72 week-old) mice.

The complexity of microbial relationships impedes detailed equation-based modeling. Our method provides a powerful alternative framework for inferring ecological interaction networks of microbial communities in various environments, and will be improved by further developments in metagenomic sequencing technologies leading to increased dataset size and improved accuracy and precision.

