DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data

Mapping Intimacies ◽

10.1101/831537 ◽

2019 ◽

Author(s):

Yi Jiang ◽

Gina Giase ◽

Kay Grennan ◽

Annie W. Shieh ◽

Yan Xia ◽

...

Keyword(s):

Statistical Power ◽

Genetic Relatedness ◽

Majority Vote ◽

Simulated Data ◽

Real Data ◽

Sorting Algorithm ◽

Omics Data ◽

Sample Collection ◽

Complex Disorders ◽

Integrative Analyses

Abstract Studies of complex disorders benefit from integrative analyses of multiple omics data. Yet sample mix-ups frequently occur in multi-omics studies, weakening statistical power and risking false findings. Accurately aligning sample information, genotype, and corresponding omics data is critical for integrative analyses. We developed DRAMS (https://github.com/Yi-Jiang/DRAMS) to Detect and Re-Align Mixed-up Samples. It uses a logistic regression model followed by a modified topological sorting algorithm to identify the potential true IDs based on the relationships among multi-omics data. In tests using simulated data, DRAMS performed better when more types of omics data were used or when the proportion of mix-ups was smaller. Applying DRAMS to real data from the PsychENCODE BrainGVEX project, we detected and corrected 201 mix-ups (12.5% of the total data generated). Of the 21 mix-ups involving errors of racial identity, DRAMS re-assigned all samples to the correct racial cluster in the 1000 Genomes project. After correction, quantitative trait loci (QTL) detected at FDR<0.01 increased by an average of 1.62-fold. Using DRAMS in multi-omics studies will strengthen statistical power and improve the quality of the results. Although few studies currently have multi-omics data in place, we expect such data, and with them the need for DRAMS, to grow quickly.

Author summary Sample mix-ups inevitably happen during sample collection, processing, and data management. They reduce statistical power and sometimes lead to false findings, so it is important to correct mixed-up samples before conducting any downstream analyses. We developed DRAMS to detect and re-align mixed-up samples in multi-omics studies. The basic idea of DRAMS is to align the data and labels for each sample by leveraging the genetic information in multi-omics data. DRAMS corrects sample IDs following a two-step strategy. First, it estimates pairwise genetic relatedness among all the data generated from all the individuals. Because the different data generated from the same individual should share the same genetics, we can cluster all the highly related data and consider that the data in one cluster have only one potential ID. Then, we use a “majority vote” strategy to infer the potential ID for individuals in each cluster. Other information, such as the match between genetics-based and reported sexes and omics priority, is also used to guide identification of the potential IDs. DRAMS performed well on both simulated data and the PsychENCODE BrainGVEX multi-omics data.
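The two-step idea in the author summary (cluster highly related data, then take a majority vote over the reported IDs) can be sketched as follows. This is only an illustration under our own assumptions, not the DRAMS implementation: the abstract states that DRAMS itself uses logistic regression and a modified topological sort, and the function names, the union-find clustering, and the toy labels here are all hypothetical.

```python
from collections import Counter, defaultdict

def cluster_by_relatedness(n_items, related_pairs):
    """Group data items into clusters with union-find over highly related pairs."""
    parent = list(range(n_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in related_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in range(n_items):
        clusters[find(i)].append(i)
    return list(clusters.values())

def majority_vote(labels, clusters):
    """Assign each item the most common reported ID in its cluster (None on a tie)."""
    inferred = [None] * len(labels)
    for members in clusters:
        counts = Counter(labels[i] for i in members).most_common()
        tie = len(counts) > 1 and counts[0][1] == counts[1][1]
        winner = None if tie else counts[0][0]
        for i in members:
            inferred[i] = winner
    return inferred

# Hypothetical toy data: items 0-3 come from one individual, items 4-5 from another.
# Item 2 carries the wrong reported ID.
labels = ["S1", "S1", "S2", "S1", "S2", "S2"]
related_pairs = [(0, 1), (1, 2), (2, 3), (4, 5)]  # pairs judged "highly related"

clusters = cluster_by_relatedness(len(labels), related_pairs)
inferred = majority_vote(labels, clusters)
suspected_mixups = [i for i in range(len(labels)) if labels[i] != inferred[i]]
```

A tie returns no ID rather than an arbitrary one; that is the kind of case where the abstract says additional evidence, such as sex checks or omics priority, is brought in.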


Mapping Quantitative Trait Loci in F2 Incorporating Phenotypes of F3 Progeny

Genetics ◽

10.1093/genetics/166.4.1981 ◽

2004 ◽

Vol 166 (4) ◽

pp. 1981-1993 ◽

Cited By ~ 1

Author(s):

Yuan-Ming Zhang ◽

Shizhong Xu

Keyword(s):

Qtl Mapping ◽

Statistical Method ◽

Mixture Model ◽

Statistical Power ◽

Plant Traits ◽

Simulated Data ◽

Real Data ◽

Fundamental Principle ◽

Laboratory Animals ◽

Daughter Design

Abstract In plants and laboratory animals, QTL mapping is commonly performed using F2 or BC individuals derived from the cross of two inbred lines. Typical QTL mapping statistics assume that each F2 individual is genotyped for the markers and phenotyped for the trait. For plant traits with low heritability, it has been suggested to use the average phenotypic values of F3 progeny derived from selfing F2 plants in place of the F2 phenotype itself. All F3 progeny derived from the same F2 plant belong to the same F2:3 family, denoted by F2:3. If the size of each F2:3 family (the number of F3 progeny) is sufficiently large, the average value of the family will represent the genotypic value of the F2 plant, and thus the power of QTL mapping may be significantly increased. The strategy of using F2 marker genotypes and F3 average phenotypes for QTL mapping in plants is quite similar to the daughter design of QTL mapping in dairy cattle. We study the fundamental principle of the plant version of the daughter design and develop a new statistical method to map QTL under this F2:3 strategy. We also propose to combine both the F2 phenotypes and the F2:3 average phenotypes to further increase the power of QTL mapping. The statistical method developed in this study differs from published ones in that the new method fully takes advantage of the mixture distribution for F2:3 families of heterozygous F2 plants. Incorporation of this new information has significantly increased the statistical power of QTL detection relative to the classical F2 design, even if only a single F3 progeny is collected from each F2:3 family. The mixture model is developed on the basis of a single-QTL model and implemented via the EM algorithm. Substantial computer simulation was conducted to demonstrate the improved efficiency of the mixture model. Extension of the mixture model to multiple QTL analysis is developed using a Bayesian approach. The computer program performing the Bayesian analysis of the simulated data is available to users for real data analysis.
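To make the "mixture distribution for F2:3 families of heterozygous F2 plants" concrete, here is a minimal sketch of the phenotype density of a single F3 progeny given its F2 parent's QTL genotype. The parameterization is an assumption on our part (the abstract gives no formulas): population mean `mu`, additive effect `a`, dominance effect `d`, and residual standard deviation `sd`. The only genetics used is the 1:2:1 segregation that follows from selfing a heterozygote.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def f3_density(y, f2_genotype, mu, a, d, sd):
    """Phenotype density of one F3 progeny, conditional on the F2 parent's genotype.

    Assumed genotypic values: QQ -> mu + a, Qq -> mu + d, qq -> mu - a.
    """
    means = {"QQ": mu + a, "Qq": mu + d, "qq": mu - a}
    if f2_genotype in ("QQ", "qq"):
        # Selfing a homozygote gives progeny of the same genotype: a single normal.
        return normal_pdf(y, means[f2_genotype], sd)
    # Selfing a heterozygote gives QQ : Qq : qq in 1 : 2 : 1, hence a 3-component mixture.
    weights = {"QQ": 0.25, "Qq": 0.5, "qq": 0.25}
    return sum(w * normal_pdf(y, means[g], sd) for g, w in weights.items())
```

In an EM fit, densities like this would be weighted by the conditional genotype probabilities given the markers. That step, and the distribution of a family average rather than a single progeny, are omitted here.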


Separation of Chromatographic Co-Eluted Compounds by Clustering and by Functional Data Analysis

Metabolites ◽

10.3390/metabo11040214 ◽

2021 ◽

Vol 11 (4) ◽

pp. 214

Author(s):

Aneta Sawikowska ◽

Anna Piasecka ◽

Piotr Kachlicki ◽

Paweł Krajewski

Keyword(s):

Simulated Data ◽

Principal Component ◽

Real Data ◽

Functional Principal Component Analysis ◽

Additional Advantage ◽

Time Alignment ◽

Peak Separation ◽

Biological Mixtures ◽

Overlapping Peaks ◽

Retention Time Alignment

Peak overlapping is a common problem in chromatography, mainly in the case of complex biological mixtures, i.e., metabolites. Due to the existence of the phenomenon of co-elution of different compounds with similar chromatographic properties, peak separation becomes challenging. In this paper, two computational methods of separating peaks, applied, for the first time, to large chromatographic datasets, are described, compared, and experimentally validated. The methods lead from raw observations to data that can form inputs for statistical analysis. First, in both methods, data are normalized by the mass of sample, the baseline is removed, retention time alignment is conducted, and detection of peaks is performed. Then, in the first method, clustering is used to separate overlapping peaks, whereas in the second method, functional principal component analysis (FPCA) is applied for the same purpose. Simulated data and experimental results are used as examples to present both methods and to compare them. Real data were obtained in a study of metabolomic changes in barley (Hordeum vulgare) leaves under drought stress. The results suggest that both methods are suitable for separation of overlapping peaks, but the additional advantage of the FPCA is the possibility to assess the variability of individual compounds present within the same peaks of different chromatograms.


A Closed-Form Solution to Planar Feature-Based Registration of LiDAR Point Clouds

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10070435 ◽

2021 ◽

Vol 10 (7) ◽

pp. 435

Author(s):

Yongbo Wang ◽

Nanshan Zheng ◽

Zhengfu Bian

Keyword(s):

Closed Form ◽

Closed Form Solution ◽

Simulated Data ◽

Real Data ◽

Point Clouds ◽

Form Solution ◽

Spatial Transformation ◽

Dual Quaternions ◽

Feature Based ◽

Planar Feature

Since pairwise registration is a necessary step for the seamless fusion of point clouds from neighboring stations, a closed-form solution to planar feature-based registration of LiDAR (Light Detection and Ranging) point clouds is proposed in this paper. Based on the Plücker coordinate-based representation of linear features in three-dimensional space, a quad tuple-based representation of planar features is introduced, which makes it possible to directly determine the difference between any two planar features. Dual quaternions are employed to represent spatial transformation and operations between dual quaternions and the quad tuple-based representation of planar features are given, with which an error norm is constructed. Based on L2-norm-minimization, detailed derivations of the proposed solution are explained step by step. Two experiments were designed in which simulated data and real data were both used to verify the correctness and the feasibility of the proposed solution. With the simulated data, the calculated registration results were consistent with the pre-established parameters, which verifies the correctness of the presented solution. With the real data, the calculated registration results were consistent with the results calculated by iterative methods. Conclusions can be drawn from the two experiments: (1) The proposed solution does not require any initial estimates of the unknown parameters in advance, which assures the stability and robustness of the solution; (2) Using dual quaternions to represent spatial transformation greatly reduces the additional constraints in the estimation process.


Penalized partial least squares for pleiotropy

BMC Bioinformatics ◽

10.1186/s12859-021-03968-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Camilo Broc ◽

Therese Truong ◽

Benoit Liquet

Keyword(s):

Least Squares ◽

Partial Least Squares ◽

Association Studies ◽

A Priori ◽

Simulated Data ◽

Real Data ◽

Genome Wide Association Studies ◽

Genetic Associations ◽

Multiple Traits ◽

Application Fields

Abstract Background: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of the sparse group Partial Least Squares (sgPLS) to take into account groups of variables, and a Lasso penalization that links all independent data sets. This method, called joint-sgPLS, is able to convincingly detect signal at the variable level and at the group level. Results: Our method has the advantage of providing a globally readable model while coping with the architecture of the data. It can outperform traditional methods and provides wider insight in terms of a priori information. We compared the performance of the proposed method to other benchmark methods on simulated data and gave an example of application on real data with the aim of highlighting common susceptibility variants to breast and thyroid cancers. Conclusion: The joint-sgPLS shows interesting properties for detecting a signal. As an extension of the PLS, the method is suited for data with a large number of variables. The choice of Lasso penalization copes with architectures of groups of variables and observation sets. Furthermore, although the method has been applied to a genetic study, its formulation is adapted to any data with a high number of variables and an exposed a priori architecture in other application fields.


Calibration of Camera and Flash LiDAR System with a Triangular Pyramid Target

Applied Sciences ◽

10.3390/app11020582 ◽

2021 ◽

Vol 11 (2) ◽

pp. 582

Author(s):

Zean Bu ◽

Changku Sun ◽

Peng Wang ◽

Hang Dong

Keyword(s):

Simulated Data ◽

Real Data ◽

Calibration Method ◽

Multiple Sensors ◽

Triangular Pyramid ◽

World Coordinate System ◽

Flash Lidar ◽

Novel Method ◽

3D Information ◽

Incremental Validation

Calibration between multiple sensors is a fundamental procedure for data fusion. To address the problems of large errors and tedious operation, we present a novel method to conduct the calibration between light detection and ranging (LiDAR) and camera. We invent a calibration target, which is an arbitrary triangular pyramid with three chessboard patterns on its three planes. The target contains both 3D information and 2D information, which can be utilized to obtain intrinsic parameters of the camera and extrinsic parameters of the system. In the proposed method, the world coordinate system is established through the triangular pyramid. We extract the equations of triangular pyramid planes to find the relative transformation between two sensors. One capture of camera and LiDAR is sufficient for calibration, and errors are reduced by minimizing the distance between points and planes. Furthermore, the accuracy can be increased by more captures. We carried out experiments on simulated data with varying degrees of noise and numbers of frames. Finally, the calibration results were verified by real data through incremental validation and analyzing the root mean square error (RMSE), demonstrating that our calibration method is robust and provides state-of-the-art performance.
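The abstract says errors are reduced "by minimizing the distance between points and planes" and that results are checked with the RMSE. The sketch below shows only those two quantities for a plane written as ax + by + cz + d = 0. The function names and the plane representation are our assumptions, and the paper's actual cost function and optimizer are not given in the abstract.

```python
import math

def point_to_plane_distance(point, plane):
    """Signed distance from point (x, y, z) to the plane a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    x, y, z = point
    return (a * x + b * y + c * z + d) / math.sqrt(a * a + b * b + c * c)

def rmse_to_plane(points, plane):
    """Root mean square of the point-to-plane distances for a set of 3D points."""
    squared = [point_to_plane_distance(p, plane) ** 2 for p in points]
    return math.sqrt(sum(squared) / len(squared))
```

In a calibration loop, `points` would be LiDAR returns transformed by the current extrinsic estimate and `plane` one face of the pyramid target; the estimate that lowers this residual across all three faces is preferred.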


Prediction of Fuel Poverty Potential Risk Index Using Six Regression Algorithms: A Case-Study of Chilean Social Dwellings

Sustainability ◽

10.3390/su13052426 ◽

2021 ◽

Vol 13 (5) ◽

pp. 2426

Author(s):

David Bienvenido-Huertas ◽

Jesús A. Pulido-Arcas ◽

Carlos Rubio-Bellido ◽

Alexis Pérez-Fargallo

Keyword(s):

Low Income ◽

Potential Risk ◽

Energy Use ◽

Risk Index ◽

Computing Time ◽

Simulated Data ◽

Real Data ◽

Support Vector ◽

Energy Poverty ◽

Regression Algorithms

In recent times, studies about the accuracy of algorithms to predict different aspects of energy use in the building sector have flourished, with energy poverty being one of the issues that has received considerable critical attention. Previous studies in this field have characterized it using different indicators, but they have failed to develop instruments to predict the risk of low-income households falling into energy poverty. This research explores the way in which six regression algorithms can accurately forecast the risk of energy poverty by means of the fuel poverty potential risk index. Using data from the national survey of socioeconomic conditions of Chilean households and generating data for different typologies of social dwellings (e.g., form ratio or roof surface area), this study simulated 38,880 cases and compared the accuracy of six algorithms. Multilayer perceptron, M5P and support vector regression delivered the best accuracy, with correlation coefficients over 99.5%. In terms of computing time, M5P outperforms the rest. Although these results suggest that energy poverty can be accurately predicted using simulated data, it remains necessary to test the algorithms against real data. These results can be useful in devising policies to tackle energy poverty in advance.


Constructing Large-Scale Genetic Maps Using an Evolutionary Strategy Algorithm

Genetics ◽

10.1093/genetics/165.4.2269 ◽

2003 ◽

Vol 165 (4) ◽

pp. 2269-2282

Author(s):

D Mester ◽

Y Ronin ◽

D Minkov ◽

E Nevo ◽

A Korol

Keyword(s):

Discrete Optimization ◽

High Performance ◽

Large Scale ◽

Simulated Data ◽

Real Data ◽

Genetic Maps ◽

Chromosome 1 ◽

Evolutionary Strategy ◽

Group A ◽

The One

Abstract This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
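The ordering criterion described above (the true order minimizes the total length of the linkage group, over n!/2 distinct orders) can be illustrated with an exhaustive search on a tiny example. This is a hedged sketch, not the evolution-strategy algorithm of the article: brute force is only feasible for a handful of markers, and the function names and the toy distance matrix are our own.

```python
from itertools import permutations
from math import factorial

def map_length(order, dist):
    """Total map length: sum of pairwise distances between adjacent markers."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def n_distinct_orders(n):
    """Number of marker orders when an order and its mirror image are the same map."""
    return factorial(n) // 2

def best_order_bruteforce(dist):
    """Return (length, order) minimizing map length over all non-mirrored orders."""
    best = None
    for order in permutations(range(len(dist))):
        if order[0] > order[-1]:
            continue  # skip the mirror image of an order already considered
        length = map_length(order, dist)
        if best is None or length < best[0]:
            best = (length, order)
    return best

# Hypothetical toy data: four markers at known positions on a line, with labels shuffled.
positions = [0.25, 0.0, 0.3, 0.1]
dist = [[abs(p - q) for q in positions] for p in positions]
length, order = best_order_bruteforce(dist)
```

Even at 20 markers there are 20!/2 (more than 10^18) candidate orders, which is why the article turns to a TSP-style heuristic rather than enumeration.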


Improving the performance of a radio-frequency localization system in adverse outdoor applications

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-021-02001-6 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Marcelo N. de Sousa ◽

Ricardo Sant’Ana ◽

Rigel P. Fernandes ◽

Julio Cesar Duarte ◽

José A. Apolinário ◽

...

Keyword(s):

Random Forest ◽

Ray Tracing ◽

Real World ◽

Practical Implication ◽

Real Life ◽

Simulated Data ◽

Real Data ◽

Gradient Boosting ◽

Real World Data ◽

Localization Accuracy

Abstract In outdoor RF localization systems, particularly where line of sight cannot be guaranteed or where multipath effects are severe, information about the terrain may improve the position estimate’s performance. Given the difficulties in obtaining real data, a ray-tracing fingerprint is a viable option. Nevertheless, despite good simulation results, systems trained with simulated features only suffer performance degradation when employed to process real-life data. This work intends to improve the localization accuracy when using ray-tracing fingerprints and a few field data obtained from an adverse environment where a large number of measurements is not an option. We employ machine learning (ML) algorithms to explore the multipath information. We selected the random forest and gradient boosting algorithms, both considered efficient tools in the literature. In a strict simulation scenario (simulated data for training, validating, and testing), we obtained the same good results found in the literature (error around 2 m). In a real-world system (simulated data for training, real data for validating and testing), both ML algorithms resulted in a mean positioning error around 100 m. We have also obtained experimental results for noisy (artificially added Gaussian noise) and mismatched (with a null subset of) features. From the simulations carried out in this work, our study revealed that enhancing the ML model with a few real-world data improves localization’s overall performance. Among the ML algorithms employed herein, we also observed that, under noisy conditions, the random forest algorithm achieved a slightly better result than the gradient boosting algorithm. However, they achieved similar results in a mismatch experiment. This work’s practical implication is that multipath information, once rejected in old localization techniques, now represents a significant source of information whenever we have prior knowledge to train the ML algorithm.


Hyperbolastic Models from a Stochastic Differential Equation Point of View

Mathematics ◽

10.3390/math9161835 ◽

2021 ◽

Vol 9 (16) ◽

pp. 1835

Author(s):

Antonio Barrera ◽

Patricia Román-Román ◽

Francisco Torres-Ruiz

Keyword(s):

Differential Equation ◽

Simulated Data ◽

Real Data ◽

Point Of View ◽

Likelihood Method ◽

Numerical Resolution ◽

The Family ◽

The Common ◽

Linear Differential ◽

Mean Function

A joint and unified vision of stochastic diffusion models associated with the family of hyperbolastic curves is presented. The motivation behind this approach stems from the fact that all hyperbolastic curves verify a linear differential equation of the Malthusian type. By virtue of this, and by adding a multiplicative noise to said ordinary differential equation, a diffusion process may be associated with each curve whose mean function is said curve. The inference in the resulting processes is presented jointly, as well as the strategies developed to obtain the initial solutions necessary for the numerical resolution of the system of equations resulting from the application of the maximum likelihood method. The common perspective presented is especially useful for the implementation of the necessary procedures for fitting the models to real data. Some examples based on simulated data support the suitability of the development described in the present paper.
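The construction described in the abstract can be written out in a few lines. The notation below is our own and not taken from the paper: h(t) stands for the curve-specific rate function, σ for the noise intensity, and W(t) for a standard Wiener process. The point of the sketch is that the mean of the resulting diffusion satisfies the original deterministic equation, so the mean function is the curve itself.

```latex
\begin{align}
  \frac{dx(t)}{dt} &= h(t)\,x(t), & x(t_0) &= x_0
    && \text{(Malthusian-type linear ODE)} \\
  dX(t) &= h(t)\,X(t)\,dt + \sigma\,X(t)\,dW(t), & X(t_0) &= x_0
    && \text{(multiplicative noise added)} \\
  \mathbb{E}[X(t)] &= x_0 \exp\!\left(\int_{t_0}^{t} h(s)\,ds\right) = x(t)
    && \text{(mean function equals the curve)}
\end{align}
```

The last line follows because the stochastic integral has zero expectation, so m(t) = E[X(t)] obeys m'(t) = h(t) m(t) with m(t_0) = x_0.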


Robust Weighted l1,2 Norm Filtering in Passive Radar Systems

Sensors ◽

10.3390/s20113270 ◽

2020 ◽

Vol 20 (11) ◽

pp. 3270 ◽

Cited By ~ 1

Author(s):

Baris Satar ◽

Gokhan Soysal ◽

Xue Jiang ◽

Murat Efe ◽

Thiagalingam Kirubarajan

Keyword(s):

Target Detection ◽

Impulsive Noise ◽

Simulated Data ◽

Real Data ◽

Detection Performance ◽

Passive Radar ◽

Mobile Telecommunication ◽

Telecommunication System ◽

Conventional Methods ◽

Non Gaussian

Conventional methods such as matched filtering and the fractional lower order statistics cross ambiguity function, and recent methods such as compressed sensing and track-before-detect, are used for target detection by passive radars. Target detection using these algorithms usually assumes that the background noise is Gaussian. However, non-Gaussian impulsive noise is inherent in real-world radar problems. In this paper, a new optimization-based algorithm that uses weighted l1 and l2 norms is proposed as an alternative to the existing algorithms, whose performance degrades in the presence of impulsive noise. To determine the weights of these norms, the parameter that quantifies the impulsiveness level of the noise is estimated. The aim of the proposed algorithm is to increase the target detection performance of universal mobile telecommunication system (UMTS) based passive radars by facilitating higher resolution with better suppression of the sidelobes in both range and Doppler. The results obtained from both simulated data with an α-stable distribution and real data recorded by a UMTS-based passive radar platform are presented to demonstrate the superiority of the proposed algorithm. The results show that the proposed algorithm provides more robust and accurate detection performance for noise models with different impulsiveness levels compared to the conventional methods.
