Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces

Big Data ◽  
2015 ◽  
pp. 76-107

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Michele Allegra ◽  
Elena Facco ◽  
Francesco Denti ◽  
Alessandro Laio ◽  
Antonietta Mira

Abstract: One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
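
To make the notion of intrinsic dimension concrete, the sketch below implements the TWO-NN estimator, which infers the ID from the ratio of each point's second-nearest to first-nearest neighbor distance. It is not the segmentation method the abstract describes: it returns one global ID for the whole data set. The function name `two_nn_id` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def two_nn_id(points):
    """Global intrinsic-dimension estimate (TWO-NN, maximum-likelihood form).

    Illustrative sketch only: brute-force O(N^2) distances, small N.
    """
    X = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances via explicit differences.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    # Two smallest distances per row: first and second nearest neighbor.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                           # skip exact duplicates
    mu = r2[keep] / r1[keep]
    # The ratios follow a Pareto law with exponent equal to the ID,
    # so the maximum-likelihood estimate is N / sum(log(mu)).
    return keep.sum() / np.log(mu).sum()

# Hypothetical data: a 2-D sheet embedded in 5-D by padding with zeros.
rng = np.random.default_rng(0)
sheet = np.hstack([rng.uniform(size=(600, 2)), np.zeros((600, 3))])
estimate = two_nn_id(sheet)
print(f"estimated ID: {estimate:.2f} (the data were built as a 2-D sheet)")
```

A single global estimate like this cannot reveal regions of different dimension inside one data set, which is the case the abstract addresses.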


Plant Disease ◽  
2002 ◽  
Vol 86 (12) ◽  
pp. 1396-1401 ◽  
Author(s):  
Weikai Yan ◽  
Duane E. Falk

Effective breeding for disease resistance relies on a thorough understanding of host-by-pathogen relations. Achieving such understanding can be difficult and challenging, particularly for large data sets with complex host genotype-by-pathogen strain interactions. This paper presents a biplot approach that facilitates visual analysis of host-by-pathogen data. A biplot displays both host genotypes and pathogen isolates in a single scatter plot; each genotype or isolate is displayed as a point defined by its scores on the first two principal components derived from subjecting genotype- or strain-centered data to singular value decomposition. From a biplot, clusters of host genotypes and clusters of pathogen strains can be simultaneously visualized. Moreover, the basis for genotype and strain classifications, i.e., interactions between individual genotypes and strains, can be visualized at the same time. A biplot based on genotype-centered data and that based on strain-centered data are appropriate for visual evaluation of susceptibility/resistance of genotypes and virulence/avirulence of strains, respectively. Biplot analysis of genotype-by-strain is illustrated with published response scores of 13 barley line groups to 8 net blotch isolate groups.
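
The biplot construction described above can be sketched in a few lines: center the two-way table, take its singular value decomposition, and keep the first two components as plotting coordinates for rows and columns. This is a minimal sketch rather than the authors' procedure. The function name `biplot_scores`, the symmetric split of singular values between rows and columns, and the random 13-by-8 table are assumptions, and the abstract does not say which centering direction it calls "genotype-centered", so both are exposed.

```python
import numpy as np

def biplot_scores(table, center="rows"):
    """2-D biplot coordinates for the rows and columns of a two-way table.

    center="rows":    subtract each row's own mean.
    center="columns": subtract each column's own mean.
    """
    X = np.asarray(table, dtype=float)
    if center == "rows":
        X = X - X.mean(axis=1, keepdims=True)
    elif center == "columns":
        X = X - X.mean(axis=0, keepdims=True)
    else:
        raise ValueError("center must be 'rows' or 'columns'")
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Assumed scaling: share the singular values equally, so that
    # row_scores @ col_scores.T is the best rank-2 fit to the centered table.
    root = np.sqrt(s[:2])
    row_scores = U[:, :2] * root
    col_scores = Vt[:2, :].T * root
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()
    return row_scores, col_scores, explained

# Hypothetical response scores: 13 host groups (rows) by 8 isolate groups
# (columns). These are random numbers, not the published data.
rng = np.random.default_rng(0)
scores = rng.integers(1, 10, size=(13, 8))
hosts, isolates, explained = biplot_scores(scores, center="columns")
print(hosts.shape, isolates.shape, round(float(explained), 3))
```

Plotting `hosts` and `isolates` on the same axes gives the single scatter plot the abstract refers to; `explained` reports how much of the centered variation the two displayed components capture.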


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Yuri Amorim Coutinho ◽  
Nico Vervliet ◽  
Lieven De Lathauwer ◽  
Nele Moelans

Abstract: Multicomponent alloys show intricate microstructure evolution, providing materials engineers with a nearly inexhaustible variety of solutions to enhance material properties. Multicomponent microstructure evolution simulations are indispensable to exploit these opportunities. These simulations, however, require the handling of high-dimensional and prohibitively large data sets of thermodynamic quantities, of which the size grows exponentially with the number of elements in the alloy, making it virtually impossible to handle the effects of four or more elements. In this paper, we introduce the use of tensor completion for high-dimensional data sets in materials science as a general and elegant solution to this problem. We show that we can obtain an accurate representation of the composition dependence of high-dimensional thermodynamic quantities, and that the decomposed tensor representation can be evaluated very efficiently in microstructure simulations. This realization enables true multicomponent thermodynamic and microstructure modeling for alloy design.
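
To illustrate why a decomposed representation is cheap to store and evaluate, the sketch below assumes a canonical polyadic (CP) format, in which a tensor is a sum of a few rank-one terms. The abstract does not state which decomposition is used, so the CP choice, the names `cp_entry` and `cp_dense`, and the random factor matrices are all assumptions. Only evaluation is shown, not the completion step that would fit the factors from sampled entries.

```python
import numpy as np

def cp_entry(factors, index):
    """Evaluate one tensor entry from CP factor matrices.

    factors[n] has shape (I_n, R); the entry at (i_1, ..., i_N) is
    sum over r of the product over n of factors[n][i_n, r].
    """
    rows = [F[i] for F, i in zip(factors, index)]   # N vectors of length R
    return float(np.prod(rows, axis=0).sum())

def cp_dense(factors):
    """Build the full tensor. Only sensible for small checks."""
    full = np.zeros([F.shape[0] for F in factors])
    for r in range(factors[0].shape[1]):
        term = factors[0][:, r]
        for F in factors[1:]:
            term = np.multiply.outer(term, F[:, r])
        full += term
    return full

# Hypothetical setup: 4 composition axes, 20 grid points each, rank 3.
rng = np.random.default_rng(0)
factors = [rng.normal(size=(20, 3)) for _ in range(4)]
stored = sum(F.size for F in factors)            # numbers kept in CP form
dense_size = int(np.prod([F.shape[0] for F in factors]))
value = cp_entry(factors, (3, 7, 0, 19))
print(stored, dense_size, value)
```

The stored size grows linearly with the number of axes, whereas the dense table grows exponentially, which is the scaling problem the abstract describes for alloys with many elements.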

