Generalization of the minimum covariance determinant algorithm for categorical and mixed data types

2018
Author(s):
Derek Beaton
Kelly M. Sunderland
Brian Levine
Jennifer Mandzia
Mario Masellis
...  

Abstract: The minimum covariance determinant (MCD) algorithm is one of the most common techniques for detecting anomalous or outlying observations. The MCD algorithm depends on two features of multivariate data: the determinant of a matrix (i.e., the product of its eigenvalues) and Mahalanobis distances (MD). While the MCD algorithm is commonly used and has many extensions, it is limited to analyses of quantitative data, and more specifically data assumed to be continuous. One reason the MCD does not extend to other data types, such as categorical or ordinal data, is that there is no well-defined MD for data types other than continuous ones. To address the lack of MCD-like techniques for categorical or mixed data, we present a generalization of the MCD. To do so, we rely on a multivariate technique called correspondence analysis (CA). Through CA we can define MD via singular vectors and also compute the determinant from CA's eigenvalues. Here we define and illustrate a generalized MCD on categorical data and then show how our generalized MCD extends beyond categorical data to accommodate mixed data types (e.g., categorical, ordinal, and continuous). We illustrate this generalized MCD on data from two large-scale projects: the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and the Alzheimer's Disease Neuroimaging Initiative (ADNI), with genetics (categorical), clinical instruments and surveys (categorical or ordinal), and neuroimaging (continuous) data. We also make R code and toy data available to illustrate our generalized MCD.
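
As a point of reference for the generalization described above, the classical MCD ties together exactly the two quantities the abstract names: a covariance determinant (the product of the eigenvalues) and robust Mahalanobis distances. The sketch below uses scikit-learn's MinCovDet on toy continuous data; it is not the authors' R code or their CA-based method, and the chi-squared cutoff is a common convention rather than the paper's.

```python
# A hedged sketch of the classical, continuous-data MCD that this paper
# generalizes, using scikit-learn's MinCovDet on toy data (not the authors'
# R code or CA-based method). The chi-squared cutoff is a common convention.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # toy continuous data
X[:5] += 6                               # plant a few outliers

mcd = MinCovDet(random_state=0).fit(X)
md2 = mcd.mahalanobis(X)                 # squared robust Mahalanobis distances

# The determinant equals the product of the covariance matrix's eigenvalues;
# its p-th root is their geometric mean.
eigvals = np.linalg.eigvalsh(mcd.covariance_)
assert np.isclose(np.prod(eigvals), np.linalg.det(mcd.covariance_))

cutoff = chi2.ppf(0.975, df=X.shape[1])  # conventional 97.5% threshold
print("flagged observations:", np.where(md2 > cutoff)[0])
```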

2021
Author(s):
Petros Barmpas
Sotiris Tasoulis
Aristidis G. Vrahatis
Panagiotis Anagnostou
Spiros Georgakopoulos
...  

Abstract: Recent technological advancements in various domains, such as biomedicine and health care, offer a plethora of big data for analysis. Part of this data pool consists of experimental studies that record numerous features for each instance, creating datasets of very high dimensionality and mixed data types, with both numerical and categorical variables. Unsupervised learning, in turn, has been shown to assist with high-dimensional data, allowing the discovery of unknown patterns through clustering, visualization, dimensionality reduction, and, in some cases, their combination. This work highlights unsupervised learning methodologies for large-scale, high-dimensional data, pointing toward a unified framework that combines the knowledge retrieved from clustering and visualization. The main purpose is to uncover hidden patterns in a high-dimensional mixed dataset, which we achieve through an application to a complex, real-world dataset. The experimental analysis indicates the existence of notable information, demonstrating the usefulness of the methodological framework for similar high-dimensional, mixed, real-world applications.
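
An illustrative sketch of the kind of pipeline the abstract describes follows: mixed-type columns are encoded (one-hot for categorical, standardization for numerical), reduced to a low-dimensional embedding, and clustered. The column names, cluster count, and the PCA/k-means choices are assumptions, not the authors' framework.

```python
# A minimal mixed-data pipeline sketch: encode, reduce, cluster.
# Requires scikit-learn >= 1.2 for OneHotEncoder's sparse_output argument.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({                       # toy mixed-type data
    "age": [34, 51, 29, 62, 45],          # numerical
    "bmi": [22.1, 27.5, 24.3, 30.2, 26.0],
    "sex": ["F", "M", "F", "M", "F"],     # categorical
})

encode = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi"]),
    ("cat", OneHotEncoder(sparse_output=False), ["sex"]),
])
pipe = Pipeline([
    ("encode", encode),
    ("reduce", PCA(n_components=2)),      # 2-D embedding, also usable for plots
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
print(pipe.fit_predict(df))               # cluster label per row
```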


Author(s):  
Zhanyou Xu
Steven B. Cannon
William D. Beavis

Abstract: Models have been developed to account for heterogeneous spatial variation in field trials. These spatial models have been shown to successfully increase the quality of phenotypic data, resulting in improved effectiveness of selection by plant breeders. The models were developed for continuous data types such as grain yield and plant height, but data for most traits, such as iron deficiency chlorosis (IDC), are recorded on ordinal scales. Is it reasonable to make spatial adjustments to ordinal data by simply applying methods developed for continuous data? The objective of the research described herein is to evaluate methods for spatial adjustment of ordinal data, using soybean IDC as an example. Spatial adjustment models are classified into three groups: group I, moving average grid adjustment; group II, spatial autoregressive regression (SAR) models; and group III, tensor product penalized P-splines. Comparisons of eight models sampled from these three classes demonstrate that spatial adjustments depend on the severity of field heterogeneity, the irregularity of the spatial patterns, and the model used. SAR models generally produce better performance metrics than the other classes of models. However, none of the eight evaluated models fully removed spatial patterns, indicating a need either to adjust existing models or to develop novel models for spatial adjustment of ordinal data collected in fields exhibiting discontinuous transitions between heterogeneous patches.
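
To make the group-I idea concrete, the sketch below implements a simple moving-average grid adjustment: each plot's score is adjusted by the mean of its neighbors within a fixed window. It treats ordinal IDC scores as numeric, which is precisely the assumption the paper interrogates; the window size and adjustment rule are illustrative, not the authors' exact model.

```python
# Illustrative moving-average grid adjustment (group-I style), not the
# authors' exact model: subtract the local neighborhood mean, then re-center
# on the grand mean. Treats ordinal scores as numeric by assumption.
import numpy as np

def moving_average_adjust(scores: np.ndarray, window: int = 1) -> np.ndarray:
    """scores: 2-D grid of field-plot values; window: neighborhood radius."""
    rows, cols = scores.shape
    adjusted = np.empty_like(scores, dtype=float)
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(r - window, 0), min(r + window + 1, rows)
            c0, c1 = max(c - window, 0), min(c + window + 1, cols)
            block = scores[r0:r1, c0:c1].astype(float)
            # local mean excluding the focal plot itself
            local = (block.sum() - scores[r, c]) / (block.size - 1)
            adjusted[r, c] = scores[r, c] - local + scores.mean()
    return adjusted

field = np.random.default_rng(1).integers(1, 6, size=(8, 10))  # IDC scores 1-5
print(moving_average_adjust(field).round(2))
```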


Author(s):  
Jianchao Han
Mohsen Beheshti

Clustering is a technique for grouping a set of unsupervised data based on the conceptual clustering principle: maximizing the intra-class similarity and minimizing the inter-class similarity. Existing clustering approaches concentrate on different data types and assume that all features play the same role in the algorithms. However, some features may be more significant than others in forming clusters. In this paper, we consider feature significance and incorporate it into clustering algorithms. An iterative approach to fuzzy clustering based on feature significance is presented and applied to the k-means algorithm for numerical data, the k-modes algorithm for categorical data, and the k-prototypes algorithm for mixed data.
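
The sketch below illustrates the core idea for the numerical (k-means) case: a feature-significance weight vector enters the distance computation, so more significant features dominate cluster assignment. The weights and the update scheme are illustrative stand-ins, not the paper's iterative fuzzy algorithm.

```python
# Weighted k-means sketch: per-feature significance weights scale the
# squared Euclidean distance. Weights here are fixed by assumption; the
# paper's method derives and iterates them.
import numpy as np

def weighted_kmeans(X, k, w, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # weighted squared distance of every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2 * w).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(4, 1, (50, 2))])
w = np.array([1.0, 0.2])   # feature 0 deemed more significant than feature 1
labels, _ = weighted_kmeans(X, k=2, w=w)
print(np.bincount(labels)) # cluster sizes
```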


2020
Author(s):  
Zhanyou Xu
Andreomar Kurek
Steven B. Cannon
William D. Beavis

Abstract: Selection of markers linked to alleles at quantitative trait loci (QTL) for tolerance to iron deficiency chlorosis (IDC) has not been successful. Genomic selection has been advocated for continuous numeric traits such as yield and plant height. For ordinal data types such as IDC, genomic prediction models have not been systematically compared. The objectives of the research reported in this manuscript were to evaluate the most commonly used genomic prediction method, ridge regression, and its equivalent logistic ridge regression method, against algorithmic modeling methods including random forest, gradient boosting, support vector machine, k-nearest neighbors, naïve Bayes, and artificial neural network, using the usual comparator metric of prediction accuracy. In addition, we compared the methods using metrics of greater importance for decisions about selecting and culling lines for use in variety development and genetic improvement projects. These metrics include specificity, sensitivity, precision, decision accuracy, and area under the receiver operating characteristic curve. We found that support vector machine provided the best specificity for culling IDC-susceptible lines, while random forest genomic prediction models provided the best combined set of decision metrics for retaining IDC-tolerant and culling IDC-susceptible lines.
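
The sketch below reproduces the shape of such a comparison on synthetic binary data (standing in for tolerant vs. susceptible lines), scoring several scikit-learn classifiers on sensitivity, specificity, precision, and ROC AUC rather than accuracy alone. The data, model choices, and settings are illustrative, not those of the study.

```python
# Comparing classifiers on decision-oriented metrics; synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic ridge": LogisticRegression(penalty="l2", max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    pred = m.predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print(f"{name}: sensitivity={tp / (tp + fn):.2f} "
          f"specificity={tn / (tn + fp):.2f} "
          f"precision={precision_score(y_te, pred):.2f} "
          f"AUC={roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]):.2f}")
```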


Stroke
2021
Author(s):  
Shenpeng R. Zhang
Hyun Ah Kim
Hannah X. Chu
Seyoung Lee
Megan A. Evans
...  

Background and Purpose: Preclinical stroke studies endeavor to model the pathophysiology of clinical stroke, assessing a range of parameters of injury and impairment. However, poststroke pathology is complex and variable, and associations between diverse parameters may be difficult to identify within the usual small study designs that focus on infarct size. Methods: We performed a retrospective large-scale analysis of records from 631 C57BL/6 mice of either sex in which the middle cerebral artery was occluded by 1 of 5 surgeons, either transiently for 1 hour followed by 23 hours of reperfusion (transient middle cerebral artery occlusion [MCAO]; n=435) or permanently for 24 hours without reperfusion (permanent MCAO; n=196). Analyses included a multivariate linear mixed model with a random intercept for surgeon as a random effect, to reduce type I and type II errors, and a generalized ordinal regression model for ordinal data when random effects are low. Results: Analyses indicated that brain edema volume was associated with infarct volume at 24 hours (β, 0.52 [95% CI, 0.45–0.59]) and was higher after permanent MCAO than after transient MCAO (P<0.05). A more severe clinical score was associated with a greater infarct volume but not with the animal's age or edema volume. Further, a more severe clinical score was observed for a given brain infarct volume after transient MCAO versus permanent MCAO. Remarkably, the animal's age, which corresponded to the period of young adulthood (6–40 weeks; equivalent to ≈18–35 years in humans), was positively associated with severity of lung infection (β, 0.65 [95% CI, 0.42–0.88]) and negatively with spleen weight (β, −0.36 [95% CI, −0.63 to −0.09]). Conclusions: Large-scale analysis of preclinical stroke data can provide researchers in our field with insight into relationships between variables that is not possible when individual studies are analyzed in isolation, and it has identified hypotheses for future study.
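
The sketch below shows the first modeling idea in miniature: a linear mixed model with a random intercept per surgeon, fit with statsmodels on synthetic data. All variable names and effect sizes are invented for illustration; this is neither the authors' dataset nor their full multivariate model.

```python
# Random-intercept mixed model sketch on synthetic data: between-surgeon
# variation is absorbed by the group-level intercept, as in the abstract.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
surgeon = rng.integers(0, 5, n)                        # 5 surgeons (groups)
infarct = rng.normal(20, 5, n)                         # infarct volume
edema = 0.5 * infarct + surgeon + rng.normal(0, 2, n)  # surgeon-level offset

df = pd.DataFrame({"edema": edema, "infarct": infarct, "surgeon": surgeon})

# random intercept for surgeon; fixed effect of infarct volume on edema
model = smf.mixedlm("edema ~ infarct", df, groups=df["surgeon"]).fit()
print(model.summary())
```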


Author(s):  
Éric Piel
Alberto González
Hans-Gerhard Gross

Publish/subscribe systems are event-based systems separated into several components that publish and subscribe to events corresponding to data types. Testing each component individually is not sufficient for testing the whole system; the integration of the components must also be tested. In this chapter, we first identify the specificities and difficulties of integration testing for publish/subscribe systems. Two different and complementary techniques for testing the integration are then presented. The first is based on randomly generating a large number of event sequences and on generic oracles, in order to find a malfunctioning state of the system. The second uses a limited number of predefined data flows that must respect a precise behaviour, implementable with the same mechanisms as unit testing. As event-based systems are well suited to runtime modification, the particularities of runtime testing are also introduced, and their usage in the context of integration testing is detailed. A case study presents an example of integration testing on a small system inspired by those used in the maritime safety and security domain.
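
The sketch below illustrates the first of the two techniques on a toy publish/subscribe bus: a long random event sequence is published while a generic oracle checks a system invariant after every event. The bus, the components, and the invariant are minimal stand-ins, not the chapter's framework.

```python
# Random event-sequence integration testing with a generic oracle, on a
# toy in-process pub/sub bus. Everything here is an illustrative stand-in.
import random

class Bus:
    def __init__(self):
        self.subscribers = {}          # event type -> list of callbacks
    def subscribe(self, etype, cb):
        self.subscribers.setdefault(etype, []).append(cb)
    def publish(self, etype, payload):
        for cb in self.subscribers.get(etype, []):
            cb(payload)

# two tiny components integrated through the bus
state = {"open": 0}
bus = Bus()
bus.subscribe("open", lambda _: state.update(open=state["open"] + 1))
bus.subscribe("close", lambda _: state.update(open=state["open"] - 1))

def generic_oracle():
    # invariant: the open-session count can never go negative
    assert state["open"] >= 0, "malfunctioning state reached"

random.seed(0)
opens = 0
for _ in range(1000):
    # generator respects the precondition: never close with nothing open
    etype = random.choice(["open", "close"]) if opens else "open"
    opens += 1 if etype == "open" else -1
    bus.publish(etype, None)
    generic_oracle()                   # check the invariant after each event
print("1000 random events passed the oracle")
```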


2017
pp. 83-99
Author(s):  
Sivamathi Chokkalingam
Vijayarani S.

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: the volume, velocity, and variety of the data. Big Data analytics is the process of analyzing large data sets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. Since Big Data is a new, emerging field, there is a need to develop new technologies and algorithms for handling it. The main objective of this paper is to provide knowledge about the various research challenges of Big Data analytics. A brief overview of the various types of Big Data analytics is given; for each type, the paper describes its process steps and tools and presents a banking application. Some of the research challenges of Big Data analytics, and possible solutions to them, are also discussed.

