Silhouette width using generalized mean – a flexible method for assessing clustering efficiency

2018
Author(s): Attila Lengyel, Zoltán Botta-Dukát

Abstract: Cluster analysis plays a vital role in pattern recognition in several fields of science. Silhouette width is a widely used measure for assessing the fit of individual objects in a classification, as well as the quality of clusters and of the entire classification. This index uses two clustering criteria, compactness (average within-cluster distances) and separation (average between-cluster distances), which implies that spherical cluster shapes are preferred over others. This property can be a disadvantage in the presence of clusters with high internal heterogeneity, which is common in real situations.

We suggest a generalization of the silhouette width using the generalized mean. By changing the p parameter of the generalized mean between −∞ and +∞, several specific summary statistics, including the minimum, the maximum, and the arithmetic, harmonic, and geometric means, can be reproduced. Implementing the generalized mean in the calculation of silhouette width allows the sensitivity of the index to compactness versus connectedness to be adjusted. With higher sensitivity to connectedness instead of compactness, the preference of silhouette width for spherical clusters is expected to weaken. We test the performance of the generalized silhouette width on artificial data sets and on the Iris data set. We examine how classifications with different numbers of clusters, prepared by single linkage, group average, and complete linkage algorithms, are evaluated when p is set to different values.

When p was negative, well-separated clusters achieved high silhouette widths despite their elongated or circular shapes. Positive values of p increased the importance of compactness, hence the preference for spherical clusters became even more pronounced. With low p, single linkage was deemed the most efficient clustering method, while with higher parameter values the performance of group average and complete linkage seemed better.

The generalized silhouette width is a promising tool for assessing clustering quality. It allows the contribution of the compactness and connectedness criteria to the index value to be adjusted, thus avoiding underestimation of clustering efficiency in the presence of clusters with high internal heterogeneity.
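The idea in this abstract can be illustrated with a minimal Python sketch. It is one plausible reading of the abstract, not the authors' implementation: the function names `generalized_mean` and `generalized_silhouette` are hypothetical, and the sketch assumes that the arithmetic means in the usual silhouette formula are simply replaced by the power mean with exponent p, using Euclidean distances that are strictly positive.

```python
import math

def generalized_mean(values, p):
    """Power mean of strictly positive values.

    p = 1 is the arithmetic mean, p = 0 the geometric mean, p = -1 the
    harmonic mean; p = -inf and p = +inf give the minimum and maximum.
    """
    values = list(values)
    if p == math.inf:
        return max(values)
    if p == -math.inf:
        return min(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def generalized_silhouette(points, labels, p=1.0):
    """Per-object silhouette widths with the power mean replacing the average.

    Assumption (not confirmed by the abstract): a(i) is the power mean of the
    distances to the other members of i's own cluster, and b(i) is the smallest
    power mean of the distances to the members of any other cluster.
    """
    widths = []
    for i, (pt, lab) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (other, other_lab) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(other_lab, []).append(math.dist(pt, other))
        own = by_cluster.pop(lab, None)
        if not own or not by_cluster:
            widths.append(0.0)  # singleton cluster, or only one cluster present
            continue
        a = generalized_mean(own, p)
        b = min(generalized_mean(d, p) for d in by_cluster.values())
        widths.append((b - a) / max(a, b))
    return widths

# Toy data (made up for illustration): two elongated, parallel clusters.
points = [(float(x), 0.0) for x in range(5)] + [(float(x), 3.0) for x in range(5)]
labels = [0] * 5 + [1] * 5
for p in (-math.inf, -1, 1, math.inf):
    widths = generalized_silhouette(points, labels, p)
    print(p, sum(widths) / len(widths))
```

With p = 1 this reduces to the ordinary silhouette width. On the toy data, lowering p should raise the mean width for the elongated clusters, which is consistent with the behaviour the abstract reports for negative p.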

2021, Vol 18 (1), pp. 130-140
Author(s): Yanuwar Reinaldi, Nurissaidah Ulinnuha, Moh. Hafiyusholeh

Community welfare is an important concern for any region and is the essence of national development. Welfare in Indonesia is fairly unequal, especially in East Java. One way to map the areas of East Java by the welfare of their people is clustering. Hierarchical clustering is one family of methods for grouping data; this study compares three of its variants, single linkage, complete linkage, and average linkage, to determine the best one. The results show that average linkage with three clusters performs best, with a silhouette index of 0.6054. The first cluster contains 23 regions (cities/districts with the highest community welfare), the second contains 11 regions (cities/districts with moderate welfare), and the third contains 4 regions (cities/districts with the lowest welfare).
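The comparison described in this abstract (several linkage rules, each scored by the silhouette index) can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names `agglomerative` and `mean_silhouette` are hypothetical, the data are made-up toy points rather than the East Java welfare indicators, and the naive merging loop is meant for small inputs.

```python
import math

def agglomerative(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    def cluster_distance(a, b):
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # closest pair between the two clusters
            return min(dists)
        if linkage == "complete":    # farthest pair between the two clusters
            return max(dists)
        if linkage == "average":     # mean over all between-cluster pairs
            return sum(dists) / len(dists)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Merge the pair of clusters that is closest under the chosen linkage.
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def mean_silhouette(points, clusters):
    """Average silhouette width; assumes at least two clusters.

    Objects in singleton clusters contribute a width of 0.
    """
    total = 0.0
    for c_idx, cluster in enumerate(clusters):
        for i in cluster:
            if len(cluster) == 1:
                continue
            a = sum(math.dist(points[i], points[j]) for j in cluster if j != i) / (len(cluster) - 1)
            b = min(
                sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for o_idx, other in enumerate(clusters) if o_idx != c_idx
            )
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (made up): three compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
for linkage in ("single", "complete", "average"):
    clusters = agglomerative(points, 3, linkage)
    print(linkage, sorted(sorted(c) for c in clusters), mean_silhouette(points, clusters))
```

On data as cleanly separated as this, all three linkage rules should recover the same groups; differences in silhouette index between methods, as reported in the abstract, only appear on less tidy data.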


Author(s): M. Jeyanthi, C. Velayutham

Brain-computer interfaces (BCI) play a vital role in research in science and technology development. Classification is a data mining technique used to predict group membership for data instances. Analyses of BCI data are challenging because feature extraction and classification of these data are more difficult than those applied to raw data. In this paper, we extract statistical Haralick features from the raw EEG data. The features are then normalized, and binning is used to improve the accuracy of the predictive models by reducing noise and eliminating some irrelevant attributes. Classification is then performed on the BCI data set using different techniques: Naïve Bayes, the k-nearest neighbor classifier, and the SVM classifier. Finally, we propose the SVM classification algorithm for the BCI data set.


2020, Vol 44 (8), pp. 851-860
Author(s): Joy Eliaerts, Natalie Meert, Pierre Dardenne, Vincent Baeten, Juan-Antonio Fernandez Pierna, ...

Abstract: Spectroscopic techniques combined with chemometrics are a promising tool for the analysis of seized drug powders. In this study, the performance of three spectroscopic techniques [Mid-InfraRed (MIR), Raman and Near-InfraRed (NIR)] was compared. In total, 364 seized powders were analyzed: 276 cocaine powders (with concentrations ranging from 4 to 99 w%) and 88 powders without cocaine. A classification model (using Support Vector Machines [SVM] discriminant analysis) and a quantification model (using SVM regression) were constructed with each spectral dataset in order to discriminate cocaine powders from other powders and to quantify cocaine in powders classified as cocaine positive. The performance of the models was compared with gas chromatography coupled with mass spectrometry (GC–MS) and gas chromatography with flame-ionization detection (GC–FID). Different evaluation criteria were used: number of false negatives (FNs), number of false positives (FPs), accuracy, root mean square error of cross-validation (RMSECV) and determination coefficients (R²). Ten colored powders were excluded from the classification data set due to the fluorescence background observed in their Raman spectra. For the classification, the best accuracy (99.7%) was obtained with MIR spectra. With Raman and NIR spectra, the accuracy was 99.5% and 98.9%, respectively. For the quantification, the best results were obtained with NIR spectra: the cocaine content was determined with an RMSECV of 3.79% and an R² of 0.97. MIR and Raman predicted cocaine concentrations less well than NIR, with RMSECVs of 6.76% and 6.79%, respectively, and an R² of 0.90 for both. The three spectroscopic techniques can be applied for both classification and quantification of cocaine, but some differences in performance were detected. The best classification was obtained with MIR spectra. For quantification, however, the RMSECV of MIR and Raman was about twice as high as that of NIR. Spectroscopic techniques combined with chemometrics can reduce the workload for confirmation analysis (e.g., chromatography based) and therefore save time and resources.
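The evaluation criteria named in this abstract follow standard definitions, which a short Python sketch can make concrete. The function names are hypothetical and all numbers below are made-up toy values, not data from the study. The sketch assumes that R² is computed as 1 − SS_res/SS_tot (some chemometrics software reports the squared correlation instead) and that RMSECV is the ordinary root mean square error evaluated on predictions obtained during cross-validation.

```python
import math

def accuracy(n_samples, false_negatives, false_positives):
    """Fraction of samples classified correctly."""
    return (n_samples - false_negatives - false_positives) / n_samples

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values.

    When `predicted` holds cross-validation predictions, this is the RMSECV.
    """
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(reference, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

# Toy example: reference concentrations (w%) and cross-validated predictions.
reference = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 30.0]
print(accuracy(200, 1, 1))
print(rmse(reference, predicted))
print(r_squared(reference, predicted))
```

Because RMSECV is expressed in the units of the measured quantity, it can be read directly against the concentration range, whereas R² is unitless and depends on the spread of the reference values.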


Sensors, 2021, Vol 21 (15), pp. 5006
Author(s): Andrés Aguirre, Maria J. Pinto, Carlos A. Cifuentes, Oscar Perdomo, Camilo A. R. Díaz, ...

Physical exercise (PE) has become an essential tool in different rehabilitation programs. High-intensity exercises (HIEs) have been demonstrated to provide better results for general health conditions than low- and moderate-intensity exercises. In this context, monitoring a patient’s condition is essential to avoid extreme fatigue, which may cause physical and physiological complications. Different methods have been proposed for fatigue estimation, such as monitoring the subject’s physiological parameters and using subjective scales. However, there is still a need for practical procedures that provide an objective estimation, especially for HIEs. In this work, considering that the sit-to-stand (STS) exercise is one of the most widely implemented in physical rehabilitation, a computational model for estimating fatigue during this exercise is proposed. A study with 60 healthy volunteers was carried out to obtain a data set to develop and evaluate the proposed model. In line with the literature, this model estimates three fatigue conditions (low, moderate, and high) by monitoring 32 STS kinematic features and the heart rate from a set of ambulatory sensors (Kinect and Zephyr sensors). Results show that a random forest model composed of 60 sub-classifiers achieved an accuracy of 82.5% in the classification task. Moreover, results suggest that the movement of the upper body is the most relevant feature for fatigue estimation. Movements of the lower body and the heart rate also contribute essential information for identifying the fatigue condition. This work presents a promising tool for physical rehabilitation.


2021, Vol 9 (Suppl 3), pp. A993-A994
Author(s): Caddie Dy Laberiano, Edwin Parra, Qiong Gan, Heladio Ibarguen, Shanyu Zang, ...

Background: Breast cancer (BC) is the second most common cause of malignant pleural effusions (MPEs) after lung cancer, accounting for approximately one third of all MPEs. Although MPEs are relatively easy to collect, their cellular composition is still not well characterized. This opens new avenues to characterize the cellular milieu of the MPE, which has the potential to be highly informative about mutational markers and immune response, ultimately guiding targeted therapy and predicting therapeutic outcomes. The proposed study will characterize the immune landscape of the cellular composition of MPEs from patients with metastatic breast carcinoma and its relationship with clinicopathologic features in these patients.

Abstract 945 Figure 1: Comparison between the cell block in H-E and mIF expression of CK, CD68 and CD3.
Abstract 945 Figure 2: Composite mIF image expressing 8 markers. At higher magnification it is possible to observe the co-expression of CK+Ki67+, CK PDL1, CD3+Foxp3+ and CD3+CD8+.
Abstract 945 Table 1: Results: cell phenotypes, in percentages, in the six cases analyzed.
Abstract 945 Table 2: Clinical data of the six patients. L: left; R: right; BR: breast cancer; CRC: colorectal cancer; NE: not evaluable; IDC: invasive ductal carcinoma; CT: chemotherapy; BT: biotherapy. * Last appointment of the patient.

Methods: Five-micron-thick paraffin cell pellet blocks from six randomly selected cases of breast carcinoma MPE were stained using a quantitative multiplex immunofluorescence (mIF) panel containing 8 markers (pancytokeratin (CK), PD-L1, PD-1, CD3, CD8, Foxp3, CD68 and Ki67) and DAPI (figure 1). Representative regions of interest were scanned using a multispectral scanner (Vectra Polaris) at high magnification (20x) to capture different cell populations. Marker co-expression was processed and analyzed using quantitative image analysis software (InForm). The final results were obtained as the absolute number of cells of each phenotype and were characterized alongside clinicopathologic features.

Results: We stained and analyzed six breast cancer MPE cases with an mIF panel previously optimized and validated for formalin-fixed, paraffin-embedded (FFPE) tumor tissues, against CK, CD3, CD68, CD8, Foxp3, Ki67, PD1 and PD-L1 (figure 2). The median cellular density was 5870.53 cells. The medians for each marker were as follows. CK+ cells (malignant cells and reactive mesothelial cells) represented 75.9%; among these cells, Ki67 expression was 8% and PD-L1+ was present in 0.2%. CD3+ cells represented 0.72%; cytotoxic T cells (CD3+CD8+) made up 12.13% of these cells, and CD3+PD1+ expression was 1.14%, without concomitant expression of PD-L1. The median for CD68+ macrophages was 8.1% of the total cells (table 2).

Conclusions: mIF is a promising tool for studying body effusions of different origins. Although more studies are needed, this new perspective can help to provide clues about possible prognosis in advanced stages of BC.


Jurnal INFORM, 2019, Vol 4 (2)
Author(s): Sulthan Fikri Mu'afa, Nurissaidah Ulinnuha

Livestock products are widely used by the community in daily life, for example as food ingredients, industrial materials, labor resources, fertilizer and energy sources. This study clusters livestock potential using 2017 livestock population data from Sidoarjo Regency with the single linkage, complete linkage and average linkage methods, and compares the performance of the three methods. The data are grouped into 3 clusters. With single linkage and average linkage, sixteen sub-districts fell in the first cluster (low livestock potential) and one sub-district each in the second and third clusters. Complete linkage placed fifteen sub-districts in the first cluster (high livestock potential), two sub-districts in the second cluster (medium livestock potential) and one sub-district in the third cluster (high livestock potential). In the comparison of standard deviation ratios, complete linkage obtained the smallest value, 0.222, which shows that complete linkage is better than single linkage and average linkage for grouping Sidoarjo Regency sub-districts by livestock type.


Author(s): George S. Spais

The chapter examines how consumerism, one of the primary key themes in marketing and business courses, has evolved over the last decade and envisages the shape of this set of courses in the future. Of the 1,935 words for 20 key concepts counted in 141 online course descriptions in English from the last 10 periods, delivered by Business and Management Schools or Business/Marketing academic departments of 88 universities and colleges, “marketing,” “business,” “ethics” and “social responsibility” were included in 100% of the course descriptions analyzed, indicating their coverage by all courses. To investigate the five research objectives, hierarchical cluster analysis (HCA) based on the single-linkage clustering method was adopted as an exploratory analysis to reveal natural groupings of the key concepts, not otherwise apparent, within a data set of word counts; multiple linear regression analyses were then conducted. The trend analyses indicated prospects for an increasing focus on specific topics. The interpretation of the research results, based on the assumptions of Mezirow's critical reflection, provided very strong recommendations.


Author(s): Sajid Umair, Muhammad Majid Sharif

Prediction of student performance on the basis of habits has been a very important research topic in academia. Studies show that selecting the correct data set also plays a vital role in these predictions. In this chapter, the authors took data from different schools containing student habits and comments, analyzed it using latent semantic analysis to extract semantics, and then used a support vector machine to classify the data into two classes: important for prediction and not important. Finally, they used artificial neural networks to predict the grades of students. Regression was also applied to the output of the support vector machine, using only the data classified as important for prediction.


Genome, 2018, Vol 61 (1), pp. 21-31
Author(s): Jason Gibbs

There is an ongoing campaign to DNA barcode the world’s >20 000 bee species. Recent revisions of Lasioglossum (Dialictus) (Hymenoptera: Halictidae) for Canada and the eastern United States were completed using integrative taxonomy. DNA barcode data from 110 species of L. (Dialictus) are examined for their value in identification and discovering additional taxonomic diversity. Specimen identification success was estimated using the best close match method. Error rates were 20% relative to current taxonomic understanding. Barcode Index Numbers (BINs) assigned using Refined Single Linkage Analysis (RESL) and barcode gaps using the Automatic Barcode Gap Discovery (ABGD) method were also assessed. RESL was incongruent for 44.5% of species, although some cryptic diversity may exist. Forty-three of 110 species were part of merged BINs with multiple species. The barcode gap is non-existent for the data set as a whole and ABGD showed levels of discordance similar to the RESL. The viridatum species-group is particularly problematic, so that DNA barcodes alone would be misleading for species delimitation and specimen identification. Character-based methods using fixed nucleotide substitutions could improve specimen identification success in some cases. The use of DNA barcoding for species discovery for standard taxonomic practice in the absence of a well-defined barcode gap is discussed.

