Rampant false detection of adaptive phenotypic optimization by ParTI-based Pareto front inference

Molecular Biology and Evolution ◽

10.1093/molbev/msaa330 ◽

2020 ◽

Author(s):

Mengyi Sun ◽

Jianzhi Zhang

Keyword(s):

Pareto Front ◽

Simulated Data ◽

High Dimensional ◽

Phenotypic Data ◽

Cast Doubt ◽

Population Structures ◽

Molecular Phenotypes ◽

Almost All ◽

Pareto Fronts ◽

Gene Expression Levels

Abstract Organisms face tradeoffs in performing multiple tasks. Identifying the optimal phenotypes maximizing the organismal fitness (or Pareto front) and inferring the relevant tasks allow testing phenotypic adaptations and help delineate evolutionary constraints, tradeoffs, and critical fitness components, so are of broad interest. It has been proposed that Pareto fronts can be identified from high-dimensional phenotypic data, including molecular phenotypes such as gene expression levels, by fitting polytopes (lines, triangles, tetrahedrons, etc.), and a program named ParTI was recently introduced for this purpose. ParTI has identified Pareto fronts and inferred phenotypes best for individual tasks (or archetypes) from numerous datasets such as the beak morphologies of Darwin’s finches and mRNA concentrations in human tumors, implying evolutionary optimizations of the involved traits. Nevertheless, the reliabilities of these findings are unknown. Using real and simulated data that lack evolutionary optimization, we here report extremely high false positive rates of ParTI. The errors arise from phylogenetic relationships or population structures of the organisms analyzed and the flexibility of data analysis in ParTI that is equivalent to p-hacking. Because these problems are virtually universal, our findings cast doubt on almost all ParTI-based results and suggest that reliably identifying Pareto fronts and archetypes from high-dimensional phenotypic data is currently generally difficult.
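To make the term concrete, a Pareto front can be sketched as the set of points that no other point dominates. The snippet below is only an illustration of that definition, not ParTI's polytope-fitting procedure. The function name `pareto_front` and the toy performance values are hypothetical, and it assumes each coordinate is a task performance to be maximized.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b in every task and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that are not dominated by any other point (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical performances at two tasks for five phenotypes.
phenotypes = [(1, 5), (3, 3), (5, 1), (2, 2), (1, 1)]
front = pareto_front(phenotypes)
```

Here `(2, 2)` and `(1, 1)` are dropped because `(3, 3)` is better at both tasks, while the three remaining points each trade one task against the other.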


Iterative Variable Selection for High-Dimensional Data: Prediction of Pathological Response in Triple-Negative Breast Cancer

Mathematics ◽

10.3390/math9030222 ◽

2021 ◽

Vol 9 (3) ◽

pp. 222

Author(s):

Juan C. Laria ◽

M. Carmen Aguilera-Morillo ◽

Enrique Álvarez ◽

Rosa E. Lillo ◽

Sara López-Taruella ◽

...

Keyword(s):

Breast Cancer ◽

Variable Selection ◽

Triple Negative Breast Cancer ◽

Triple Negative ◽

A Priori ◽

Simulated Data ◽

Point Of View ◽

High Dimensional ◽

Whole Genome ◽

Genome Context

Over the last decade, regularized regression methods have offered alternatives for performing multi-marker analysis and feature selection in a whole genome context. The process of defining a list of genes that will characterize an expression profile remains unclear. It currently relies upon advanced statistics and can use an agnostic point of view or include some a priori knowledge, but overfitting remains a problem. This paper introduces a methodology to deal with the variable selection and model estimation problems in the high-dimensional set-up, which can be particularly useful in the whole genome context. Results are validated using simulated data and a real dataset from a triple-negative breast cancer study.


Multiple-Locus Sequence Typing Analysis of Bacillus cereus and Bacillus thuringiensis Reveals Separate Clustering and a Distinct Population Structure of Psychrotrophic Strains

Applied and Environmental Microbiology ◽

10.1128/aem.72.2.1569-1578.2006 ◽

2006 ◽

Vol 72 (2) ◽

pp. 1569-1578 ◽

Cited By ~ 96

Author(s):

Alexei Sorokin ◽

Benjamin Candelon ◽

Kévin Guilloux ◽

Nathalie Galleron ◽

Natalia Wackerow-Kouzova ◽

...

Keyword(s):

Bacillus Thuringiensis ◽

Bacillus Cereus ◽

Individual Gene ◽

Mild Winter ◽

Systematic Testing ◽

Bacillus Weihenstephanensis ◽

Paris Area ◽

Population Structures ◽

Multiple Strains ◽

Almost All

ABSTRACT We used multilocus sequence typing (MLST) to characterize phylogenetic relationships for a collection of Bacillus cereus group strains isolated from forest soil in the Paris area during a mild winter. This collection contains multiple strains isolated from the same soil sample and strains isolated from samples from different sites. We characterized 115 strains of this collection and 19 other strains based on the sequences of the clpC, dinB, gdpD, panC, purF, and yhfL loci. The number of alleles ranged from 36 to 53, and a total of 93 allelic profiles or sequence types were distinguished. We identified three major strain clusters—C, T, and W—based on the comparison of individual gene sequences or concatenated sequences. Some less representative clusters and subclusters were also distinguished. Analysis of the MLST data using the concept of clonal complexes led to the identification of two, five, and three such groups in clusters C, T, and W, respectively. Some of the forest isolates were closely related to independently isolated psychrotrophic strains. Systematic testing of the strains of this collection showed that almost all the strains that were able to grow at a low temperature (6°C) belonged to cluster W. Most of these strains, including three independently isolated strains, belong to two clonal complexes and are therefore very closely related genetically. These clonal complexes represent strains corresponding to the previously identified species Bacillus weihenstephanensis. Most of the other strains of our collection, including some from the W cluster, are not psychrotrophic. B. weihenstephanensis (cluster W) strains appear to comprise an effectively sexual population, whereas Bacillus thuringiensis (cluster T) and B. cereus (cluster C) have clonal population structures.


Cancer classification and biomarker selection via a penalized logsum network-based logistic regression model

Technology and Health Care ◽

10.3233/thc-218026 ◽

2021 ◽

Vol 29 ◽

pp. 287-295

Author(s):

Zhiming Zhou ◽

Haihui Huang ◽

Yong Liang

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Logistic Regression Model ◽

Gene Selection ◽

Simulated Data ◽

Biological Data ◽

Cancer Classification ◽

High Dimensional ◽

Data Set ◽

Biomarker Selection

BACKGROUND: In genome research, it is particularly important to identify molecular biomarkers or signaling pathways related to phenotypes. The logistic regression model is a powerful discrimination method that offers a clear statistical interpretation and yields classification probabilities for the class labels. However, it cannot by itself perform biomarker selection. OBJECTIVE: The aim of this paper is to give the model efficient gene selection capability. METHODS: In this paper, we propose a new penalized logsum network-based regularization logistic regression model for gene selection and cancer classification. RESULTS: Experimental results on simulated data sets show that our method is effective in the analysis of high-dimensional data. For a large data set, the proposed method has achieved 89.66% (training) and 90.02% (testing) AUC performances, which are, on average, 5.17% (training) and 4.49% (testing) better than mainstream methods. CONCLUSIONS: The proposed method can be considered a promising tool for gene selection and cancer classification of high-dimensional biological data.


Intermittency and related issues in 16O-Ag/Br collision at 200A GeV/c

Canadian Journal of Physics ◽

10.1139/p10-038 ◽

2010 ◽

Vol 88 (8) ◽

pp. 575-584 ◽

Cited By ~ 4

Author(s):

M. K. Ghosh ◽

P. K. Haldar ◽

S. K. Manna ◽

A. Mukhopadhyay ◽

G. Singh

Keyword(s):

Particle Density ◽

Simulated Data ◽

Space Distribution ◽

Monte Carlo Code ◽

Final State ◽

Data Set ◽

Fractal Properties ◽

Self Similar ◽

Singly Charged ◽

Almost All

In this paper we present some results on the nonstatistical fluctuation in the 1-dimensional (1-d) density distribution of singly charged produced particles in the framework of the intermittency phenomenon. A set of nuclear emulsion data on 16O-Ag/Br interactions at an incident momentum of 200A GeV/c was analyzed in terms of different statistical methods that are related to the self-similar fractal properties of the particle density function. A comparison of the present experiment with a similar experiment induced by the 32S nuclei and also with a set of results simulated by the Lund Monte Carlo code FRITIOF is presented. A similar comparison between this experiment and a simulated data set generated from pseudo-random numbers is also made. The analysis reveals the presence of a weak intermittency in the 1-d phase space distribution of the produced particles. The results also indicate the occurrence of a nonthermal phase transition during emission of final-state hadrons. Our results on factorial correlators suggest that short-range correlations are present in the angular distribution of charged hadrons, whereas those on oscillatory moments show that such correlations are not restricted only to a few particles. In almost all cases, the simulated results fail to replicate their experimental counterparts.


The dilemma between eliminating dominance-resistant solutions and preserving boundary solutions of extremely convex Pareto fronts

Complex & Intelligent Systems ◽

10.1007/s40747-021-00543-2 ◽

2021 ◽

Author(s):

Zhenkun Wang ◽

Qingyan Li ◽

Qite Yang ◽

Hisao Ishibuchi

Keyword(s):

Evolutionary Algorithms ◽

Coping Strategies ◽

Test Problem ◽

Pareto Front ◽

Optimization Problems ◽

Feasible Region ◽

Multi Objective Optimization ◽

Multi Objective ◽

Pareto Fronts ◽

Boundary Solutions

Abstract It has been acknowledged that dominance-resistant solutions (DRSs) extensively exist in the feasible region of multi-objective optimization problems. Recent studies show that DRSs can cause serious performance degradation of many multi-objective evolutionary algorithms (MOEAs). Thereafter, various strategies (e.g., the ε-dominance and the modified objective calculation) to eliminate DRSs have been proposed. However, these strategies may in turn cause algorithm inefficiency in other aspects. We argue that these coping strategies prevent the algorithm from obtaining some boundary solutions of an extremely convex Pareto front (ECPF). That is, there is a dilemma between eliminating DRSs and preserving boundary solutions of the ECPF. To illustrate such a dilemma, we propose a new multi-objective optimization test problem with the ECPF as well as DRSs. Using this test problem, we investigate the performance of six representative MOEAs in terms of boundary solutions preservation and DRS elimination. The results reveal that it is quite challenging to distinguish between DRSs and boundary solutions of the ECPF.
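The dilemma can be illustrated with a small sketch. It assumes minimization and uses the additive form of ε-dominance, which is one common definition and may differ from the variant studied in the paper. The function names and the two toy objective vectors are hypothetical.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Plain Pareto dominance (minimization): a is no worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def eps_dominates(a: Sequence[float], b: Sequence[float], eps: float) -> bool:
    """Additive epsilon-dominance (assumed form): a, relaxed by eps, is no worse than b."""
    return all(x - eps <= y for x, y in zip(a, b))


# Hypothetical objective vectors: `extreme` is marginally better in the first
# objective and far worse in the second, the pattern typical of a DRS.
extreme = (0.0, 100.0)
balanced = (0.01, 1.0)

plain = dominates(balanced, extreme)             # False: `extreme` survives
relaxed = eps_dominates(balanced, extreme, 0.05)  # True: `extreme` is discarded
```

Under plain dominance `extreme` is kept because it is slightly better in one objective. With ε = 0.05 it is discarded, which is the intended effect on a DRS, but the same rule would also discard a genuine boundary solution that has exactly this shape.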


The theory on and software simulating large-scale genomic data for genotype-by-environment interactions

BMC Genomics ◽

10.1186/s12864-021-08191-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xiujin Li ◽

Hailiang Song ◽

Zhe Zhang ◽

Yunmao Huang ◽

Qin Zhang ◽

...

Keyword(s):

Large Scale ◽

Simulated Data ◽

Genomic Data ◽

Efficient Tool ◽

Phenotypic Data ◽

Genotype By Environment Interactions ◽

Genotype By Environment ◽

Threshold Trait ◽

Genome Wide ◽

Increasing Demand

Abstract Background: With the emphasis on analysing genotype-by-environment interactions within the framework of genomic selection and genome-wide association analysis, there is an increasing demand for reliable tools that can be used to simulate large-scale genomic data in order to assess related approaches. Results: We proposed a theory to simulate large-scale genomic data on genotype-by-environment interactions and added this new function to our developed tool GPOPSIM. Additionally, a simulated threshold trait with large-scale genomic data was also added. The validation of the simulated data indicated that GPOPSIM2.0 is an efficient tool for mimicking the phenotypic data of quantitative traits, threshold traits, and genetically correlated traits with large-scale genomic data while taking genotype-by-environment interactions into account. Conclusions: This tool is useful for assessing methods for genotype-by-environment interactions and threshold traits.


Intra- and inter-specific variations of gene expression levels in yeast are largely neutral

10.1101/089995 ◽

2016 ◽

Cited By ~ 1

Author(s):

Jian-Rong Yang ◽

Calum Maclean ◽

Chungoo Park ◽

Huabin Zhao ◽

Jianzhi Zhang

Keyword(s):

Gene Expression ◽

Genome Sequence ◽

Large Fraction ◽

General Role ◽

Expression Levels ◽

Genome Wide ◽

Sequence Variations ◽

Expression Evolution ◽

Molecular Phenotypes ◽

Gene Expression Levels

ABSTRACT It is commonly, although not universally, accepted that most intra- and inter-specific genome sequence variations are more or less neutral, whereas a large fraction of organism-level phenotypic variations are adaptive. Gene expression levels are molecular phenotypes that bridge the gap between genotypes and corresponding organism-level phenotypes. Yet, it is unknown whether natural variations in gene expression levels are mostly neutral or adaptive. Here we address this fundamental question by genome-wide profiling and comparison of gene expression levels in nine yeast strains belonging to three closely related Saccharomyces species and originating from five different ecological environments. We find that the transcriptome-based clustering of the nine strains approximates the genome sequence-based phylogeny irrespective of their ecological environments. Remarkably, only ∼0.5% of genes exhibit similar expression levels among strains from a common ecological environment, no greater than that among strains with comparable phylogenetic relationships but different environments. These and other observations strongly suggest that most intra- and inter-specific variations in yeast gene expression levels result from the accumulation of random mutations rather than environmental adaptations. This finding has profound implications for understanding the driving force of gene expression evolution, genetic basis of phenotypic adaptation, and general role of stochasticity in evolution.


Efficient Gradient-Based Algorithms for the Construction of Pareto Fronts

Volume 7: Turbomachinery, Parts A, B, and C ◽

10.1115/gt2011-45069 ◽

2011 ◽

Cited By ~ 4

Author(s):

Sriram Shankaran ◽

Brian Barr

Keyword(s):

Pareto Front ◽

Computational Cost ◽

High Fidelity ◽

Simulation Tools ◽

Pareto Points ◽

Uniform Spacing ◽

Design Variables ◽

Gradient Based ◽

Pareto Fronts ◽

Sampling Approach

The objective of this study is to develop and assess a gradient-based algorithm that efficiently traverses the Pareto front for multi-objective problems. We use high-fidelity, computationally intensive simulation tools (e.g., Computational Fluid Dynamics (CFD) and Finite Element (FE) structural analysis) for function and gradient evaluations. The use of evolutionary algorithms with these high-fidelity simulation tools results in prohibitive computational costs. Hence, in this study we use an alternative gradient-based approach. We first outline an algorithm that can be proven to recover Pareto fronts. The performance of this algorithm is then tested on three academic problems: a convex front with uniform spacing of Pareto points, a convex front with non-uniform spacing, and a concave front. The algorithm is shown to be able to retrieve the Pareto front in all three cases, hence overcoming a common deficiency in gradient-based methods that use the idea of scalarization. Then the algorithm is applied to a practical problem in concurrent design for aerodynamic and structural performance of an axial turbine blade. For this problem, with 5 design variables, and for 10 points to approximate the front, the computational cost of the gradient-based method was roughly the same as that of a method that builds the front from a sampling approach. However, as the sampling approach involves building a surrogate model to identify the Pareto front, there is the possibility that validation of this predicted front with CFD and FE analysis results in a different location of the “Pareto” points. This can be avoided with the gradient-based method. Additionally, as the number of design variables increases and/or the number of required points on the Pareto front is reduced, the computational cost favors the gradient-based approach.
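The scalarization deficiency mentioned above can be seen in a toy setting. The sketch below assumes the weighted sum as the scalarization and a quarter-circle front as the concave case. Neither is taken from the paper, and the function names are hypothetical. It does not reproduce the authors' algorithm.

```python
import math
from typing import List, Tuple


def concave_front(n: int = 101) -> List[Tuple[float, float]]:
    """Sample a concave two-objective front (minimization): f1^2 + f2^2 = 1."""
    points = []
    for i in range(n):
        f1 = i / (n - 1)
        f2 = math.sqrt(max(0.0, 1.0 - f1 * f1))  # clamp guards against rounding
        points.append((f1, f2))
    return points


def weighted_sum_argmin(points: List[Tuple[float, float]], w: float) -> int:
    """Index of the sampled point that minimizes w*f1 + (1 - w)*f2."""
    return min(
        range(len(points)),
        key=lambda i: w * points[i][0] + (1.0 - w) * points[i][1],
    )


front = concave_front()
# Sweep the weight and record which sampled points are ever selected.
selected = {weighted_sum_argmin(front, w / 10) for w in range(1, 10)}
```

Whatever weight is chosen, the minimizer is one of the two endpoints of the front, so the interior of a concave front is never reached by sweeping the weights alone.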


Genetic diversity among Italian melon inodorus (Cucumis melo L.) germplasm revealed by ISSR analysis and agronomic traits

Plant Genetic Resources ◽

10.1017/s1479262111000335 ◽

2011 ◽

Vol 9 (2) ◽

pp. 214-217 ◽

Cited By ~ 9

Author(s):

S. Sestili ◽

A. Giardini ◽

N. Ficcadenti

Keyword(s):

Agronomic Traits ◽

Genetic Relationships ◽

Molecular Data ◽

Inter Simple Sequence Repeat ◽

Phenotypic Traits ◽

Phenotypic Data ◽

Morphological And Molecular Data ◽

And Cluster Analysis ◽

Almost All ◽

Issr Primers

The genetic relationships among 13 melon inodorus populations that were collected in southern Italy were assessed using 100 inter-simple-sequence repeat (ISSR) primers and 15 morphological traits. The dihaploid line Nad-1 and the cultivar Charentais-T, both of which belong to the botanical variety cantalupensis, were used as reference accessions in the molecular analysis. A total of 358 polymorphic bands were obtained from 39 of the 100 ISSR primers used, and 15 phenotypic traits were scored and used for genetic-similarity calculations and cluster analysis. The resulting dendrograms based on the ISSR and phenotypic data allowed almost all of the melon genotypes to be distinguished on the basis of the skin colour of the fruits. Mantel's test revealed a good correlation between the morphological and molecular data in their ability to detect genetic relationships among melon ecotypes (r = 0.50, P = 0.99). The data obtained confirm the effectiveness of this approach, and open new perspectives to reveal possible molecular associations with the phenotypic traits analysed.


What Weights Work for You? Adapting Weights for Any Pareto Front Shape in Decomposition-Based Evolutionary Multiobjective Optimisation

Evolutionary Computation ◽

10.1162/evco_a_00269 ◽

2020 ◽

Vol 28 (2) ◽

pp. 227-253 ◽

Cited By ~ 2

Author(s):

Miqing Li ◽

Xin Yao

Keyword(s):

Pareto Front ◽

Evolutionary Process ◽

Highly Nonlinear ◽

Distributed Solutions ◽

Front Shape ◽

Different Shapes ◽

Pareto Fronts ◽

The Given ◽

Update Frequency

The quality of solution sets generated by decomposition-based evolutionary multi-objective optimisation (EMO) algorithms depends heavily on the consistency between a given problem's Pareto front shape and the specified weights' distribution. A set of weights distributed uniformly in a simplex often leads to a set of well-distributed solutions on a Pareto front with a simplex-like shape, but may fail on other Pareto front shapes. It is an open problem on how to specify a set of appropriate weights without the information of the problem's Pareto front beforehand. In this article, we propose an approach to adapt weights during the evolutionary process (called AdaW). AdaW progressively seeks a suitable distribution of weights for the given problem by elaborating several key parts in weight adaptation—weight generation, weight addition, weight deletion, and weight update frequency. Experimental results have shown the effectiveness of the proposed approach. AdaW works well for Pareto fronts with very different shapes: 1) the simplex-like, 2) the inverted simplex-like, 3) the highly nonlinear, 4) the disconnect, 5) the degenerate, 6) the scaled, and 7) the high-dimensional.
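For reference, a set of weights distributed uniformly in a simplex can be generated as follows. This sketch assumes the simplex-lattice construction, in which every weight is a multiple of 1/h. The abstract does not say which construction is used, and the function name and parameters `m` and `h` are hypothetical. It shows only the fixed starting weights, not AdaW's adaptation steps.

```python
from itertools import combinations
from typing import List, Tuple


def simplex_lattice_weights(m: int, h: int) -> List[Tuple[float, ...]]:
    """All m-dimensional weight vectors whose entries are multiples of 1/h and sum to 1."""
    weights = []
    slots = h + m - 1
    # Stars and bars: pick positions for m - 1 bars; the gaps between bars
    # give m non-negative integers that sum to h.
    for bars in combinations(range(slots), m - 1):
        counts = []
        previous = -1
        for bar in bars:
            counts.append(bar - previous - 1)
            previous = bar
        counts.append(slots - previous - 1)
        weights.append(tuple(c / h for c in counts))
    return weights


# Three objectives, two divisions per axis.
weights = simplex_lattice_weights(3, 2)
```

With `m = 3` and `h = 2` this yields six weight vectors: the three corners of the simplex and the three edge midpoints. Such a fixed set suits simplex-like fronts, which is the case the abstract contrasts with other front shapes.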
