Silhouette width using generalized mean – a flexible method for assessing clustering efficiency

Mapping Intimacies ◽

10.1101/434100 ◽

2018 ◽

Author(s):

Attila Lengyel ◽

Zoltán Botta-Dukát

Keyword(s):

Vital Role ◽

List Type ◽

Single Linkage ◽

Data Set ◽

Silhouette Width ◽

Complete Linkage ◽

Promising Tool ◽

Clustering Quality ◽

Generalized Mean ◽

Index Value

Cluster analysis plays a vital role in pattern recognition in several fields of science. Silhouette width is a widely used measure for assessing the fit of individual objects in the classification, as well as the quality of clusters and the entire classification. This index uses two clustering criteria, compactness (average within-cluster distances) and separation (average between-cluster distances), which implies that spherical cluster shapes are preferred over others. This property can be seen as a disadvantage in the presence of clusters with high internal heterogeneity, which is common in real situations. We suggest a generalization of the silhouette width using the generalized mean. By changing the p parameter of the generalized mean between −∞ and +∞, several specific summary statistics, including the minimum, the maximum, and the arithmetic, harmonic, and geometric means, can be reproduced. Implementing the generalized mean in the calculation of silhouette width allows for changing the sensitivity of the index to compactness vs. connectedness. With higher sensitivity to connectedness instead of compactness, the preference of silhouette width towards spherical clusters is expected to decrease. We test the performance of the generalized silhouette width on artificial data sets and on the Iris data set. We examine how classifications with different numbers of clusters prepared by single linkage, group average, and complete linkage algorithms are evaluated when p is set to different values. When p was negative, well-separated clusters achieved high silhouette widths despite their elongated or circular shapes. Positive values of p increased the importance of compactness, hence the preference towards spherical clusters became even more detectable. With low p, single linkage clustering was deemed the most efficient clustering method, while with higher parameter values the performance of group average and complete linkage seemed better. The generalized silhouette width is a promising tool for assessing clustering quality. It allows for adjusting the contribution of compactness and connectedness criteria to the index value, thus avoiding underestimation of clustering efficiency in the presence of clusters with high internal heterogeneity.
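The idea in this abstract can be illustrated with a minimal sketch. It assumes the standard silhouette formula s(i) = (b(i) − a(i)) / max(a(i), b(i)) and simply swaps the arithmetic mean of distances for a generalized (power) mean with parameter p. The function names, the zero width for singleton clusters, and the requirement of strictly positive distances are assumptions made for illustration, not details taken from the paper.

```python
import math


def generalized_mean(values, p):
    """Power mean of strictly positive values.

    p = -inf gives the minimum, p = -1 the harmonic mean, p = 0 the
    geometric mean, p = 1 the arithmetic mean, p = +inf the maximum.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if p == float("-inf"):
        return min(values)
    if p == float("inf"):
        return max(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def generalized_silhouette(points, labels, p=1.0):
    """Per-object silhouette widths with distances summarized by a power mean.

    Assumes all points are distinct (distances > 0). Objects in singleton
    clusters, or in a one-cluster partition, get width 0 (an assumed convention).
    """
    widths = []
    for i, (xi, li) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (xj, lj) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lj, []).append(math.dist(xi, xj))
        own = by_cluster.pop(li, None)
        if own is None or not by_cluster:
            widths.append(0.0)
            continue
        a = generalized_mean(own, p)  # within-cluster summary
        b = min(generalized_mean(d, p) for d in by_cluster.values())  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical toy data: two elongated, parallel chains of points.
chain_points = [(x, 0) for x in range(5)] + [(x, 3) for x in range(5)]
chain_labels = [0] * 5 + [1] * 5
widths_min = generalized_silhouette(chain_points, chain_labels, p=float("-inf"))
widths_mean = generalized_silhouette(chain_points, chain_labels, p=1)
```

On this toy data, p = −∞ reduces a(i) to the nearest-neighbour distance (1) and b(i) to the gap between the chains (3), so every object gets a width of 2/3. With p = 1 the long within-chain distances pull the widths down, which matches the abstract's point that low p favours connectedness over compactness.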

Comparison of Single Linkage, Complete Linkage, and Average Linkage Methods on Community Welfare Analysis in Cities and Regencies in East Java

Jurnal Matematika Statistika dan Komputasi ◽

10.20956/j.v18i1.14228 ◽

2021 ◽

Vol 18 (1) ◽

pp. 130-140

Author(s):

Yanuwar Reinaldi ◽

Nurissaidah Ulinnuha ◽

Moh. Hafiyusholeh

Keyword(s):

Hierarchical Clustering ◽

National Development ◽

Clustering Methods ◽

Single Linkage ◽

Complete Linkage ◽

Average Linkage ◽

Linkage Methods ◽

Silhouette Index ◽

Linkage Method ◽

Index Value

Community welfare is important for every region and is central to national development. Welfare in Indonesia is fairly unequal, especially in East Java. Clustering is one way to map the regions of East Java by the welfare of their people. Hierarchical clustering is one family of methods for grouping data; this study compares the single linkage, complete linkage, and average linkage methods to determine which performs best. The results show that average linkage with three clusters performs best, with a silhouette index value of 0.6054. The first cluster contains 23 regions (cities/districts with the highest community welfare), the second contains 11 regions (cities/districts with moderate welfare), and the third contains 4 regions (cities/districts with the lowest welfare).
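The comparison described above can be sketched as follows: run agglomerative clustering with each linkage rule, cut at a fixed number of clusters, and score each partition with the mean silhouette index. This is a naive illustration written for readability rather than speed; the function names and the toy coordinates are hypothetical and are not the East Java welfare data.

```python
import math
from itertools import combinations


def agglomerate(points, k, linkage="average"):
    """Naive agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(dists)
        if linkage == "complete":
            return max(dists)
        return sum(dists) / len(dists)  # average linkage

    while len(clusters) > k:
        # Merge the closest pair of clusters under the chosen linkage rule.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]  # safe because a < b

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette index; singleton-cluster objects contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if own and dists:
            a = sum(own) / len(own)
            b = min(sum(d) / len(d) for d in dists.values())
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three well-separated groups of three points.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
scores = {m: mean_silhouette(toy, agglomerate(toy, 3, m)) for m in ("single", "complete", "average")}
```

On well-separated data like this, all three linkage rules recover the same partition and score equally; differences of the kind the study reports only appear when clusters overlap or are elongated.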

Machine Learning Verdict of EEG Signals in Brain Computer Interface

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit1838114 ◽

2018 ◽

pp. 429-441

Author(s):

M. Jeyanthi ◽

C. Velayutham

Keyword(s):

Nearest Neighbor ◽

Technology Development ◽

Vital Role ◽

Svm Classifier ◽

K Nearest Neighbor ◽

Data Mining Technique ◽

Data Set ◽

Eeg Data ◽

Irrelevant Attributes

Brain-computer interfaces (BCI) play a vital role in research within science and technology development. Classification is a data mining technique used to predict group membership for data instances. Analyses of BCI data are challenging because feature extraction and classification of these data are more difficult as compared with those applied to raw data. In this paper, we extract statistical Haralick features from the raw EEG data. The features are then normalized, and binning is used to improve the accuracy of the predictive models by reducing noise and eliminating some irrelevant attributes. Classification is then performed on the BCI data set using several techniques: Naïve Bayes, k-nearest neighbor, and SVM classifiers. Finally, we propose the SVM classification algorithm for the BCI data set.

Comparison of Spectroscopic Techniques Combined with Chemometrics for Cocaine Powder Analysis

Journal of Analytical Toxicology ◽

10.1093/jat/bkaa101 ◽

2020 ◽

Vol 44 (8) ◽

pp. 851-860

Author(s):

Joy Eliaerts ◽

Natalie Meert ◽

Pierre Dardenne ◽

Vincent Baeten ◽

Juan-Antonio Fernandez Pierna ◽

...

Keyword(s):

Gas Chromatography ◽

Near Infrared ◽

Evaluation Criteria ◽

Classification Model ◽

Support Vector ◽

Spectroscopic Techniques ◽

Data Set ◽

Promising Tool ◽

Powder Analysis ◽

Mir Spectra

Spectroscopic techniques combined with chemometrics are a promising tool for analysis of seized drug powders. In this study, the performance of three spectroscopic techniques [Mid-InfraRed (MIR), Raman and Near-InfraRed (NIR)] was compared. In total, 364 seized powders were analyzed and consisted of 276 cocaine powders (with concentrations ranging from 4 to 99 w%) and 88 powders without cocaine. A classification model (using Support Vector Machines [SVM] discriminant analysis) and a quantification model (using SVM regression) were constructed with each spectral dataset in order to discriminate cocaine powders from other powders and quantify cocaine in powders classified as cocaine positive. The performances of the models were compared with gas chromatography coupled with mass spectrometry (GC–MS) and gas chromatography with flame-ionization detection (GC–FID). Different evaluation criteria were used: number of false negatives (FNs), number of false positives (FPs), accuracy, root mean square error of cross-validation (RMSECV) and determination coefficients (R²). Ten colored powders were excluded from the classification data set due to fluorescence background observed in Raman spectra. For the classification, the best accuracy (99.7%) was obtained with MIR spectra. With Raman and NIR spectra, the accuracy was 99.5% and 98.9%, respectively. For the quantification, the best results were obtained with NIR spectra. The cocaine content was determined with an RMSECV of 3.79% and an R² of 0.97. The performance of MIR and Raman to predict cocaine concentrations was lower than NIR, with RMSECV of 6.76% and 6.79%, respectively, and both with an R² of 0.90. The three spectroscopic techniques can be applied for both classification and quantification of cocaine, but some differences in performance were detected. The best classification was obtained with MIR spectra. For quantification, however, the RMSECV of MIR and Raman was twice as high in comparison with NIR. Spectroscopic techniques combined with chemometrics can reduce the workload for confirmation analysis (e.g., chromatography based) and therefore save time and resources.
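The two quantification criteria named above can be sketched as plain functions. This assumes the cross-validated predictions are already available: RMSECV is then the root mean square error over those held-out predictions, and R² is computed here as 1 − SS_res/SS_tot, which is one common definition and may differ from the one used in the study. The function names are hypothetical.

```python
import math


def rmse(reference, predicted):
    """Root mean square error; called RMSECV when `predicted` holds
    cross-validated (held-out) predictions."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(reference, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(reference, predicted))
    ss_tot = sum((y - mean_ref) ** 2 for y in reference)
    return 1.0 - ss_res / ss_tot


# Ratio of the RMSECV values reported in the abstract (MIR vs. NIR).
mir_to_nir = 6.76 / 3.79
```

Using the reported figures, 6.76 / 3.79 is about 1.78, so "twice as high" in the abstract is an approximation.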

Machine Learning Approach for Fatigue Estimation in Sit-to-Stand Exercise

Sensors ◽

10.3390/s21155006 ◽

2021 ◽

Vol 21 (15) ◽

pp. 5006

Author(s):

Andrés Aguirre ◽

Maria J. Pinto ◽

Carlos A. Cifuentes ◽

Oscar Perdomo ◽

Camilo A. R. Díaz ◽

...

Keyword(s):

Heart Rate ◽

Body Part ◽

Physical Rehabilitation ◽

Upper Body ◽

Moderate Intensity ◽

Data Set ◽

Essential Information ◽

Promising Tool ◽

Sit To Stand ◽

Fatigue Estimation

Physical exercise (PE) has become an essential tool for different rehabilitation programs. High-intensity exercises (HIEs) have been demonstrated to provide better results in general health conditions, compared with low- and moderate-intensity exercises. In this context, monitoring a patient's condition is essential to avoid extreme fatigue conditions, which may cause physical and physiological complications. Different methods have been proposed for fatigue estimation, such as monitoring the subject's physiological parameters and using subjective scales. However, there is still a need for practical procedures that provide an objective estimation, especially for HIEs. In this work, considering that the sit-to-stand (STS) exercise is one of the most implemented in physical rehabilitation, a computational model for estimating fatigue during this exercise is proposed. A study with 60 healthy volunteers was carried out to obtain a data set to develop and evaluate the proposed model. According to the literature, this model estimates three fatigue conditions (low, moderate, and high) by monitoring 32 STS kinematic features and the heart rate from a set of ambulatory sensors (Kinect and Zephyr sensors). Results show that a random forest model composed of 60 sub-classifiers achieved an accuracy of 82.5% in the classification task. Moreover, results suggest that the movement of the upper body part is the most relevant feature for fatigue estimation. Movements of the lower body and the heart rate also contribute essential information for identifying the fatigue condition. This work presents a promising tool for physical rehabilitation.

945 Characterization of the immune landscape of malignant pleural effusion composition from patients with metastatic breast carcinoma: a pilot study

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2021-sitc2021.945 ◽

2021 ◽

Vol 9 (Suppl 3) ◽

pp. A993-A994

Author(s):

Caddie Dy Laberiano ◽

Edwin Parra ◽

Qiong Gan ◽

Heladio Ibarguen ◽

Shanyu Zang ◽

...

Keyword(s):

Breast Cancer ◽

Breast Carcinoma ◽

Ductal Carcinoma ◽

Cytotoxic T Cells ◽

Metastatic Breast ◽

Absolute Number ◽

List Type ◽

Clinicopathologic Features ◽

Metastatic Breast Carcinoma ◽

Promising Tool

Background: Breast cancer (BC) is the second most common cause of malignant pleural effusions (MPEs) after lung cancer, accounting for approximately one third of all MPEs. Although MPEs are relatively easy to collect, their cellular composition is still not well characterized. This opens new avenues to characterize the cellular milieu comprising the MPE, as it has the potential to be highly informative about mutational markers and immune response, ultimately guiding targeted therapy and predicting therapeutic outcomes. The proposed study will characterize the immune landscape of the cellular composition of MPE from patients with metastatic breast carcinoma and its relationship with clinicopathologic features in these patients.

Methods: Five-micron-thick paraffin cell pellet blocks from six randomly selected cases of breast carcinoma MPE were stained using a quantitative multiplex immunofluorescence (mIF) panel containing 8 markers against pancytokeratin (CK), PD-L1, PD-1, CD3, CD8, Foxp3, CD68, Ki67, and DAPI (figure 1). Representative regions of interest were scanned using a multispectral scanner (Vectra Polaris) at high magnification (20x) to capture different cell populations. Marker co-expression was processed and analyzed using quantitative image analysis software (InForm). The final results were obtained as the absolute number of cells of each phenotype and were characterized with clinicopathologic features.

Results: We stained and analyzed six breast cancer MPE cases with a previously optimized and validated mIF panel for formalin-fixed, paraffin-embedded (FFPE) tumor tissues against CK, CD3, CD68, CD8, Foxp3, Ki67, PD1 and PD-L1 (figure 2). The median cellular density was 5870.53 cells. Median values for each marker: CK+ cells accounted for 75.9% (malignant cells and reactive mesothelial cells combined); among these cells, Ki67 expression was 8% and PD-L1+ was present in 0.2%. CD3+ cells accounted for 0.72%; cytotoxic T cells (CD3+CD8+) made up 12.13% of these cells, and CD3+PD1+ expression was 1.14%, without concomitant expression of PD-L1. The median for CD68+ macrophages was 8.1% of the total cells (table 2).

Conclusions: mIF is a promising tool to study diverse body effusions of different origins. Although more studies are needed, this new perspective can help us to resolve some clues and possible prognosis in advanced stages of BC.

Abstract 945 Figure 1: Comparison between the cell block in H-E and mIF expression of CK, CD68 and CD3.

Abstract 945 Figure 2: Composite mIF image expressing 8 markers. At higher magnification, the co-expression of CK+Ki67+, CK PDL1, CD3+Foxp3+ and CD3+CD8+ can be observed.

Abstract 945 Table 1: Results: cell phenotypes in percentage in the six cases analyzed.

Abstract 945 Table 2: Clinical data of the six patients. L: left; R: right; BR: breast cancer; CRC: colorectal cancer; NE: not evaluable; IDC: invasive ductal carcinoma; CT: chemotherapy; BT: biotherapy. * Last appointment of the patient.

Reference: Nicholas D T, Matthew A. S. Diagnosis and Management of Pleural Metastases and Malignant Effusion in Breast Cancer. In: Kirby I B, Edward M C, V. Suzanne K, William J. G. The Breast (Fifth Edition): Elsevier; 2018. P 934.

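Comparison of the Single Linkage, Complete Linkage, and Average Linkage Methods in Clustering Sub-districts Based on Livestock Type Variables in Sidoarjo Regency (Perbandingan Metode Single Linkage, Complete Linkage Dan Average Linkage dalam Pengelompokan Kecamatan Berdasarkan Variabel Jenis Ternak Kabupaten Sidoarjo)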

Jurnal INFORM ◽

10.25139/inform.v4i2.1696 ◽

2019 ◽

Vol 4 (2) ◽

Author(s):

Sulthan Fikri Mu'afa ◽

Nurissaidah Ulinnuha

Keyword(s):

Farm Animals ◽

Single Linkage ◽

Daily Lives ◽

Food Ingredients ◽

Complete Linkage ◽

Material Sources ◽

Average Linkage ◽

Linkage Method ◽

Livestock Products ◽

Labor Resources

Livestock products are widely used by the community in daily life, for example as food ingredients, industrial material sources, labor resources, fertilizer sources and energy sources. This study clusters livestock potential using 2017 livestock population data for Sidoarjo Regency with the single linkage, complete linkage, and average linkage methods, and compares the performance of the methods. The data are grouped into 3 clusters. Single linkage and average linkage both placed sixteen sub-districts in the first cluster, with low livestock potential, and one sub-district each in the second and third clusters. Complete linkage placed fifteen sub-districts in the first cluster with high livestock potential, two sub-districts in the second cluster with medium livestock potential, and one sub-district in the third cluster with high livestock potential. In the comparison of standard deviation ratio values, complete linkage obtained the smallest value, 0.222, which indicates that complete linkage performs better than single linkage and average linkage for grouping sub-districts by livestock type in Sidoarjo Regency.

The Evolution of Consumerism in the Marketing Education

Advances in Marketing, Customer Relationship Management, and E-Services - Handbook of Research on Consumerism in Business and Marketing ◽

10.4018/978-1-4666-5880-6.ch003 ◽

2014 ◽

pp. 45-77

Author(s):

George S. Spais

Keyword(s):

Critical Reflection ◽

Regression Analyses ◽

Single Linkage ◽

Data Set ◽

Universities And Colleges ◽

Business Courses ◽

Marketing Education ◽

Key Concepts ◽

Primary Key ◽

Business And Management

The chapter examines how consumerism, one of the primary key themes in marketing and business courses, has evolved over the last decade and envisages the shape of this set of courses in the future. Among the 1,935 words for 20 key concepts counted in 141 online English-language course descriptions from the last 10 periods, delivered by business and management schools or business/marketing academic departments of 88 universities and colleges, “marketing,” “business,” “ethics” and “social responsibility” appeared in 100% of the course descriptions analyzed, indicating their coverage by all courses. To investigate the five research objectives, hierarchical cluster analysis (HCA) was adopted for an exploratory analysis, using the single-linkage clustering method to reveal natural groupings of the key concepts within a data set of word counts that were not otherwise apparent; multiple linear regression analyses were then conducted. The trend analyses indicated prospects for an increasing focus on specific topics. The interpretation of the research results, based on the assumptions of Mezirow's critical reflection, provided very strong recommendations.

Predicting Students Grades Using Artificial Neural Networks and Support Vector Machine

Advanced Methodologies and Technologies in Modern Education Delivery - Advances in Educational Technologies and Instructional Design ◽

10.4018/978-1-5225-7365-4.ch059 ◽

2019 ◽

pp. 751-766

Author(s):

Sajid Umair ◽

Muhammad Majid Sharif

Keyword(s):

Neural Networks ◽

Support Vector Machine ◽

Artificial Neural Networks ◽

Student Performance ◽

Semantic Analysis ◽

Vital Role ◽

Support Vector ◽

Data Set ◽

Important Research Topic ◽

Artificial Neural

Prediction of student performance on the basis of habits has been a very important research topic in academics. Studies show that selection of the correct data set also plays a vital role in these predictions. In this chapter, the authors took data from different schools that contains student habits and their comments, analyzed it using latent semantic analysis to get semantics, and then used support vector machine to classify the data into two classes, important for prediction and not important. Finally, they used artificial neural networks to predict the grades of students. Regression was also used to predict data coming from support vector machine, while giving only the important data for prediction.

DNA barcoding a nightmare taxon: assessing barcode index numbers and barcode gaps for sweat bees

Genome ◽

10.1139/gen-2017-0096 ◽

2018 ◽

Vol 61 (1) ◽

pp. 21-31 ◽

Cited By ~ 15

Author(s):

Jason Gibbs

Keyword(s):

Dna Barcoding ◽

Dna Barcode ◽

Error Rates ◽

Nucleotide Substitutions ◽

Single Linkage ◽

Data Set ◽

Index Numbers ◽

Method Error ◽

Specimen Identification ◽

Barcode Gap

There is an ongoing campaign to DNA barcode the world’s >20 000 bee species. Recent revisions of Lasioglossum (Dialictus) (Hymenoptera: Halictidae) for Canada and the eastern United States were completed using integrative taxonomy. DNA barcode data from 110 species of L. (Dialictus) are examined for their value in identification and discovering additional taxonomic diversity. Specimen identification success was estimated using the best close match method. Error rates were 20% relative to current taxonomic understanding. Barcode Index Numbers (BINs) assigned using Refined Single Linkage Analysis (RESL) and barcode gaps using the Automatic Barcode Gap Discovery (ABGD) method were also assessed. RESL was incongruent for 44.5% of species, although some cryptic diversity may exist. Forty-three of 110 species were part of merged BINs with multiple species. The barcode gap is non-existent for the data set as a whole and ABGD showed levels of discordance similar to the RESL. The viridatum species-group is particularly problematic, so that DNA barcodes alone would be misleading for species delimitation and specimen identification. Character-based methods using fixed nucleotide substitutions could improve specimen identification success in some cases. The use of DNA barcoding for species discovery for standard taxonomic practice in the absence of a well-defined barcode gap is discussed.

Comparative Study of Single Linkage, Complete Linkage, and Ward Method of Agglomerative Clustering

2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) ◽

10.1109/comitcon.2019.8862232 ◽

2019 ◽

Cited By ~ 1

Author(s):

Vijaya ◽

Shweta Sharma ◽

Neha Batra

Keyword(s):

Comparative Study ◽

Agglomerative Clustering ◽

Single Linkage ◽

Complete Linkage ◽

Ward Method
