Theoretical properties of nearest-neighbor distance distributions and novel metrics for high dimensional bioinformatics data

2019 ◽  
Author(s):  
Bryan A. Dawkins ◽  
Trang T. Le ◽  
Brett A. McKinney

Abstract: The performance of nearest-neighbor feature selection and prediction methods depends on the metric for computing neighborhoods and the distribution properties of the underlying data. The effects of the distribution and metric, as well as the presence of correlation and interactions, are reflected in the expected moments of the distribution of pairwise distances. We derive general analytical expressions for the mean and variance of pairwise distances for Lq metrics for normal and uniform random data with p attributes and m instances. We use extreme value theory to derive results for metrics that are normalized by the range of each attribute (max – min). In addition to these expressions for continuous data, we derive similar analytical formulas for a new metric for genetic variants (categorical data) in genome-wide association studies (GWAS). The genetic distance distributions account for minor allele frequency and transition/transversion ratio. We introduce a new metric for resting-state functional MRI data (rs-fMRI) and derive its distance properties. This metric is applicable to correlation-based predictors derived from time series data. Derivations assume independent data, but empirically we also consider the effect of correlation. These analytical results and new metrics can be used to inform the optimization of nearest neighbor methods for a broad range of studies including gene expression, GWAS, and fMRI data. The summary of distribution moments and detailed derivations provide a resource for understanding the distance properties for various metrics and data types.

PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246761
Author(s):  
Bryan A. Dawkins ◽  
Trang T. Le ◽  
Brett A. McKinney

The performance of nearest-neighbor feature selection and prediction methods depends on the metric for computing neighborhoods and the distribution properties of the underlying data. Recent work to improve nearest-neighbor feature selection algorithms has focused on new neighborhood estimation methods and distance metrics. However, little attention has been given to the distributional properties of pairwise distances as a function of the metric or data type. Thus, we derive general analytical expressions for the mean and variance of pairwise distances for Lq metrics for normal and uniform random data with p attributes and m instances. The distribution moment formulas and detailed derivations provide a resource for understanding the distance properties for metrics and data types commonly used with nearest-neighbor methods, and the derivations provide the starting point for the following novel results. We use extreme value theory to derive the mean and variance for metrics that are normalized by the range of each attribute (difference of max and min). We derive analytical formulas for a new metric for genetic variants, which are categorical variables that occur in genome-wide association studies (GWAS). The genetic distance distributions account for minor allele frequency and the transition/transversion ratio. We introduce a new metric for resting-state functional MRI data (rs-fMRI) and derive its distance distribution properties. This metric is applicable to correlation-based predictors derived from time-series data. The analytical means and variances are in strong agreement with simulation results. We also use simulations to explore the sensitivity of the expected means and variances in the presence of correlation and interactions in the data. These analytical results and new metrics can be used to inform the optimization of nearest neighbor methods for a broad range of studies, including gene expression, GWAS, and fMRI data.
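As a rough illustration of the kind of moment formula described above, the following sketch compares simulated pairwise Manhattan (L1) distances on independent standard normal data with the values implied by the half-normal distribution. It is an assumption-laden example, not the authors' code: the function names are hypothetical, and the closed forms used here (mean 2p/sqrt(pi), variance p(2 - 4/pi)) follow from the fact that the difference of two independent standard normal values is normal with variance 2, so its absolute value is half-normal.

```python
import itertools
import math
import random


def manhattan_moments(m, p, seed=0):
    """Empirical mean and variance of all pairwise L1 distances.

    Simulates m instances with p independent standard normal attributes.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
    dists = [
        sum(abs(a - b) for a, b in zip(x, y))
        for x, y in itertools.combinations(data, 2)
    ]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return mean, var


def theoretical_manhattan_moments(p):
    """Half-normal based moments for the L1 distance with p attributes.

    Per attribute, |X - Y| with X, Y ~ N(0, 1) independent has
    mean 2/sqrt(pi) and variance 2 - 4/pi; independent attributes add.
    """
    mean = 2.0 * p / math.sqrt(math.pi)
    var = p * (2.0 - 4.0 / math.pi)
    return mean, var


if __name__ == "__main__":
    emp_mean, emp_var = manhattan_moments(m=200, p=50, seed=1)
    th_mean, th_var = theoretical_manhattan_moments(p=50)
    print(f"mean: simulated {emp_mean:.2f}, theoretical {th_mean:.2f}")
    print(f"variance: simulated {emp_var:.2f}, theoretical {th_var:.2f}")
```

Pairwise distances that share an instance are not independent of one another, so the simulated variance should be expected to match the theoretical value only approximately for finite m.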


2001 ◽  
Vol 27 (2) ◽  
pp. 272-274
Author(s):  
Yoshitomo Hanakuma ◽  
Junzou Yamamoto

2020 ◽  
Author(s):  
Mike A. Nalls ◽  
Cornelis Blauwendraat ◽  
Lana Sargent ◽  
Dan Vitale ◽  
Hampton Leonard ◽  
...  

SUMMARY Background: Previous research using genome-wide association studies (GWAS) has identified variants that may contribute to lifetime risk of multiple neurodegenerative diseases. However, whether there are common mechanisms that link neurodegenerative diseases is uncertain. Here, we focus on one gene, GRN, encoding progranulin, and the potential mechanistic interplay between genetic risk, gene expression in the brain and inflammation across multiple common neurodegenerative diseases. Methods: We utilized GWAS, expression quantitative trait locus (eQTL) mapping and Bayesian colocalization analyses to evaluate potential causal and mechanistic inferences. We integrate various molecular data types from public resources to infer disease connectivity and shared mechanisms using a data-driven process. Findings: eQTL analyses combined with GWAS identified significant functional associations between increasing genetic risk in the GRN region and decreased expression of the gene in Parkinson’s, Alzheimer’s and amyotrophic lateral sclerosis. Additionally, colocalization analyses show a connection between blood-based inflammatory biomarkers relating to platelets and GRN expression in the frontal cortex. Interpretation: GRN expression mediates neuroinflammation function related to general neurodegeneration. This analysis suggests shared mechanisms for Parkinson’s, Alzheimer’s and amyotrophic lateral sclerosis. Funding: National Institute on Aging, National Institute of Neurological Disorders and Stroke, and the Michael J. Fox Foundation.


2021 ◽  
Author(s):  
Valentin Buck ◽  
Flemming Stäbler ◽  
Everardo Gonzalez ◽  
Jens Greinert

The study of the earth’s systems depends on a large amount of observations from homogeneous sources, which are usually scattered across time and space and are tightly intercorrelated. The understanding of said systems depends on the ability to access diverse data types and contextualize them in a global setting suitable for their exploration. While the collection of environmental data has seen an enormous increase over the last couple of decades, the development of software solutions necessary to integrate observations across disciplines seems to be lagging behind. To deal with this issue, we developed the Digital Earth Viewer: a new program to access, combine, and display geospatial data from multiple sources over time. Choosing a new approach, the software displays space in true 3D and treats time and time ranges as true dimensions. This allows users to navigate observations across spatio-temporal scales and combine data sources with each other as well as with meta-properties such as quality flags. In this way, the Digital Earth Viewer supports the generation of insight from data and the identification of observational gaps across compartments. Developed as a hybrid application, it may be used both in-situ as a local installation to explore and contextualize new data, as well as in a hosted context to present curated data to a wider audience. In this work, we present this software to the community, show its strengths and weaknesses, give insight into the development process and talk about extending and adapting the software to custom use cases.


2011 ◽  
Vol 10 (3) ◽  
pp. 162-181 ◽  
Author(s):  
Chris North ◽  
Purvi Saraiya ◽  
Karen Duca

This study compares two different empirical research methods for evaluating information visualizations: the traditional benchmark-task method and the insight method. The methods are compared using criteria such as the conclusions about the visualization designs provided by each method, the time participants spent during the study, the time and effort required to analyse the resulting empirical data, and the effect of individual differences between participants on the results. The study compares three graph visualization alternatives that associate bioinformatics microarray time series data to pathway graph vertices in order to investigate the effect of different visual grouping structures in visualization designs that integrate multiple data types. It is confirmed that visual grouping should match task structure, but interactive grouping proves to be a well-rounded alternative. Overall, the results validate the insight method’s ability to confirm results of the task method, but also show advantages of the insight method to illuminate additional types of tasks. Efficiency and insight frequently correlate, but important distinctions are found. Categories: H.5.2 [Information Interfaces and Presentation]: User Interfaces – evaluation/methodology.


INFORMASI ◽  
2022 ◽  
Vol 51 (2) ◽  
pp. 195-226
Author(s):  
Panqiang Niu ◽  
Anang Masduki ◽  
Xigen Li ◽  
Filosa Gita Sukmono

This paper constructs a network economics model to study the effect of different levels of network convergence on the digital culture industry. It then uses regression models and mediating effect models to test the mechanism by which network convergence affects the digital culture industry of China. The empirical study uses quarterly panel data covering 19 quarters, from the first quarter of 2009 to the third quarter of 2013. The three data types in econometrics are time series data, cross-sectional data, and panel data. The main conclusions are as follows. Network convergence brings positive policy effects and adverse capital effects. The impact of network convergence on firm performance of the digital culture industry is not statistically significant, and the mediating effect test also finds no indirect effect on firm performance. However, network convergence indirectly leads to the reduction of operating costs of the digital culture industry. This indirect effect arises from the chain mediating effect of the policy effect and the capital effect. The study could provide a reference for other countries and regions. It can also be used to analyze the impact of different forms of media convergence on digital industries.


2021 ◽  
Vol 10 (3) ◽  
pp. 159-167
Author(s):  
Neli Aida ◽  
Ukhti Ciptawaty ◽  
Toto Gunarto ◽  
Syarifah Aini

This study examines the influence of the influx of foreign investment and Chinese foreign workers on the Indonesian economy, where cooperation between the two countries uses a turnkey project scheme. The study uses secondary time-series data sourced from the Central Statistics Agency, the Investment Coordinating Board, and the Ministry of Manpower for the 2010-2019 period. The method is quantitative and descriptive-statistical, using multiple linear regression estimated by ordinary least squares (OLS). The results show that both Chinese foreign investment and Chinese foreign workers have a positive influence on the Indonesian economy. Although both effects are below 1 percent, the influence of Chinese foreign workers on the Indonesian economy is greater than that of Chinese foreign investment.


2017 ◽  
Author(s):  
Lina Bouayad ◽  
Anna Ialynytchev ◽  
Balaji Padmanabhan

BACKGROUND A new generation of user-centric information systems is emerging in health care as patient health record (PHR) systems. These systems create a platform supporting the new vision of health services that empowers patients and enables patient-provider communication, with the goal of improving health outcomes and reducing costs. This evolution has generated new sets of data and capabilities, providing opportunities and challenges at the user, system, and industry levels. OBJECTIVE The objective of our study was to assess PHR data types and functionalities through a review of the literature to inform the health care informatics community, and to provide recommendations for PHR design, research, and practice. METHODS We conducted a review of the literature to assess PHR data types and functionalities. We searched PubMed, Embase, and MEDLINE databases from 1966 to 2015 for studies of PHRs, resulting in 1822 articles, from which we selected a total of 106 articles for a detailed review of PHR data content. RESULTS We present several key findings related to the scope and functionalities in PHR systems. We also present a functional taxonomy and chronological analysis of PHR data types and functionalities, to improve understanding and provide insights for future directions. Functional taxonomy analysis of the extracted data revealed the presence of new PHR data sources such as tracking devices and data types such as time-series data. Chronological data analysis showed an evolution of PHR system functionalities over time, from simple data access to data modification and, more recently, automated assessment, prediction, and recommendation. 
CONCLUSIONS Efforts are needed to improve (1) PHR data quality through patient-centered user interface design and standardized patient-generated data guidelines, (2) data integrity through consolidation of various types and sources, (3) PHR functionality through application of new data analytics methods, and (4) metrics to evaluate clinical outcomes associated with automated PHR system use, and costs associated with PHR data storage and analytics.


2015 ◽  
Vol 2 (4) ◽  
pp. 1301-1315
Author(s):  
E. Lynch ◽  
D. Kaufman ◽  
A. S. Sharma ◽  
E. Kalnay ◽  
K. Ide

Abstract. Bred vectors characterize the nonlinear instability of dynamical systems and so far have been computed only for systems with known evolution equations. In this article, bred vectors are computed from a single time series using time-delay embedding, with a new technique, nearest-neighbor breeding. Since the dynamical properties of standard and nearest-neighbor breeding are shown to be similar, this provides a novel way to model and predict sudden transitions in systems represented by time series data alone.
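The time-delay embedding step mentioned above can be sketched as follows. This is a minimal illustration and not the nearest-neighbor breeding algorithm itself: the function names, the embedding dimension `dim`, and the delay `tau` are hypothetical choices, and the neighbor search is a plain Euclidean scan over the embedded vectors.

```python
import math


def delay_embed(series, dim, tau):
    """Build delay vectors (x[i], x[i + tau], ..., x[i + (dim - 1) * tau])."""
    n = len(series) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this dim and tau")
    return [tuple(series[i + k * tau] for k in range(dim)) for i in range(n)]


def nearest_neighbor(vectors, index):
    """Return the index of the embedded vector closest to vectors[index].

    Uses Euclidean distance and excludes the query vector itself.
    """
    best_index, best_dist = None, math.inf
    for j, candidate in enumerate(vectors):
        if j == index:
            continue
        dist = math.dist(vectors[index], candidate)
        if dist < best_dist:
            best_index, best_dist = j, dist
    return best_index


if __name__ == "__main__":
    # Toy series; any scalar time series could be used here.
    series = [math.sin(0.3 * t) for t in range(200)]
    embedded = delay_embed(series, dim=3, tau=5)
    print(len(embedded), nearest_neighbor(embedded, 0))
```

In a breeding-style analysis, the separation between a state and its nearest neighbor would then be tracked forward in time; that step is omitted here because the abstract does not specify it.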

