An Ellipsoidal Bounding Scheme for the Quasi-Clique Number of a Graph

Zhuqi Miao; Balabhaskar Balasundaram

doi:10.1287/ijoc.2019.0922

INFORMS Journal on Computing ◽

10.1287/ijoc.2019.0922 ◽

2020 ◽

Vol 32 (3) ◽

pp. 763-778

Author(s):

Zhuqi Miao ◽

Balabhaskar Balasundaram

Keyword(s):

Large Scale ◽

Clique Number ◽

Mixed Integer ◽

Edge Density ◽

Data Sets ◽

Quadratic Constraint ◽

Lagrangian Dual ◽

Benchmark Instances ◽

Scale Error ◽

The Given

A γ-quasi-clique in a simple undirected graph is a subset of vertices that induces a subgraph with edge density at least γ. When γ equals one, this definition corresponds to a classical clique. When γ is less than one, it relaxes the clique definition's requirement that all possible edges be present. Quasi-clique detection has been used in graph-based data mining to find dense clusters, especially in large-scale, error-prone data sets in which the clique model can be overly restrictive. The maximum γ-quasi-clique problem, which seeks a γ-quasi-clique of maximum cardinality in the given graph, can be formulated as an optimization problem with a linear objective function and a single quadratic constraint in binary variables. This article investigates the Lagrangian dual of this formulation and develops an upper-bounding technique that uses the geometry of ellipsoids to bound the Lagrangian dual. The tightness of the upper bound is compared with that of bounds obtained from multiple mixed-integer programming formulations of the problem via experiments on benchmark instances.
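
The definition above can be illustrated with a small sketch. It is not the article's bounding scheme: it only checks the edge-density condition and finds a maximum γ-quasi-clique by exhaustive search, which is practical only for tiny graphs. The function names, the toy graph, and the convention that sets with fewer than two vertices have density 1 are assumptions made for illustration. One common way to write the quadratic constraint is that the number of induced edges must be at least γ times k(k-1)/2, where k is the number of selected vertices.

```python
from itertools import combinations


def edge_density(vertices, edges):
    """Edge density of the subgraph induced by `vertices`.

    `edges` is assumed to list each undirected edge once, with no self-loops.
    """
    selected = set(vertices)
    k = len(selected)
    if k < 2:
        return 1.0  # assumed convention: trivial sets count as fully dense
    induced = sum(1 for u, v in edges if u in selected and v in selected)
    return induced / (k * (k - 1) / 2)


def is_quasi_clique(vertices, edges, gamma):
    """True if the induced subgraph has edge density at least gamma."""
    return edge_density(vertices, edges) >= gamma


def max_quasi_clique_bruteforce(nodes, edges, gamma):
    """Exhaustive search, largest subsets first (exponential time)."""
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if is_quasi_clique(subset, edges, gamma):
                return set(subset)
    return set()


# Hypothetical toy graph: 4-cycle 0-1-2-3 with chord 0-2, plus pendant vertex 4.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (3, 4)]

clique = max_quasi_clique_bruteforce(nodes, edges, gamma=1.0)
quasi = max_quasi_clique_bruteforce(nodes, edges, gamma=0.8)
print(len(clique), len(quasi))
```

With γ = 1 the search returns a classical clique; lowering γ admits larger vertex sets that are missing a few edges.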

Identifying significantly impacted pathways: a comprehensive review and assessment

Genome Biology ◽

10.1186/s13059-019-1790-4 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 26

Author(s):

Tuan-Minh Nguyen ◽

Adib Shafi ◽

Tin Nguyen ◽

Sorin Draghici

Keyword(s):

Pathway Analysis ◽

Null Hypothesis ◽

Large Scale ◽

Data Sets ◽

Actual Performance ◽

Large Scale Assessment ◽

Analysis Methods ◽

Biological Phenomena ◽

High Throughput Experiments ◽

The Given

Background: Many high-throughput experiments compare two phenotypes, such as disease vs. healthy, with the goal of understanding the underlying biological phenomena characterizing the given phenotype. Because of the importance of this type of analysis, more than 70 pathway analysis methods have been proposed so far. These can be grouped into two main categories: non-topology-based (non-TB) and topology-based (TB). Although some review papers discuss this topic from different aspects, there is no systematic, large-scale assessment of such methods. Furthermore, the majority of pathway analysis approaches rely on the assumption that p values are uniformly distributed under the null hypothesis, which is often not true. Results: This article presents the most comprehensive comparative study on pathway analysis methods available to date. We compare the actual performance of 13 widely used pathway analysis methods in over 1085 analyses. These comparisons were performed using 2601 samples from 75 human disease data sets and 121 samples from 11 knockout mouse data sets. In addition, we investigate the extent to which each method is biased under the null hypothesis. Together, these data and results constitute a reliable benchmark against which future pathway analysis methods could and should be tested. Conclusion: Overall, the results show that no method is perfect. In general, TB methods appear to perform better than non-TB methods. This is somewhat expected, since the TB methods take into consideration the structure of the pathway, which is meant to describe the underlying phenomena. We also discover that most, if not all, listed approaches are biased and can produce skewed results under the null.

OPTIMAL METHODS FOR RE-ORDERING DATA MATRICES IN SYSTEMS BIOLOGY AND DRUG DISCOVERY APPLICATIONS

Biophysical Reviews and Letters ◽

10.1142/s1793048008000605 ◽

2008 ◽

Vol 03 (01n02) ◽

pp. 19-42

Author(s):

PETER A. DIMAGGIO ◽

SCOTT R. MCALLISTER ◽

CHRISTODOULOS A. FLOUDAS ◽

XIAO-JIANG FENG ◽

JOSHUA D. RABINOWITZ ◽

...

Keyword(s):

Systems Biology ◽

Drug Discovery ◽

Large Scale ◽

Mixed Integer ◽

Data Sets ◽

Clustering Methods ◽

Clustering Techniques ◽

Large Scale Data ◽

Heuristic Strategies ◽

Data Matrices

The analysis of large-scale data sets via clustering techniques is utilized in a number of applications. Many of the methods developed employ local search or heuristic strategies for identifying the "best" arrangement of features according to some metric. In this article, we present rigorous clustering methods based on the optimal re-ordering of data matrices. Distinct mixed-integer linear programming (MILP) models are utilized for the clustering of (a) dense data matrices, such as gene expression data, and (b) sparse data matrices, which are commonly encountered in the field of drug discovery. Both methods can be used in an iterative framework to bicluster data and assist in the synthesis of drug compounds, respectively. We demonstrate the capability of the proposed optimal re-ordering methods on several data sets from both systems biology and molecular discovery studies and compare our results to other clustering techniques when applicable.

Optimal Test Assembly in Practice

Zeitschrift für Psychologie ◽

10.1027/2151-2604/a000146 ◽

2013 ◽

Vol 221 (3) ◽

pp. 190-200 ◽

Cited By ~ 9

Author(s):

Jörg-Tobias Kuhn ◽

Thomas Kiefer

Keyword(s):

Particle Physics ◽

Large Scale ◽

Block Design ◽

Linear Optimization ◽

Mixed Integer ◽

Optimal Test ◽

Automated Test Assembly ◽

Test Assembly ◽

Block Level ◽

Mixed Integer Linear Optimization

Several techniques have been developed in recent years to generate optimal large-scale assessments (LSAs) of student achievement. These techniques often represent a blend of procedures from such diverse fields as experimental design, combinatorial optimization, particle physics, or neural networks. However, despite the theoretical advances in the field, there still exists a surprising scarcity of well-documented test designs in which all factors that have guided design decisions are explicitly and clearly communicated. This paper therefore has two goals. First, a brief summary of relevant key terms, as well as experimental designs and automated test assembly routines in LSA, is given. Second, conceptual and methodological steps in designing the assessment of the Austrian educational standards in mathematics are described in detail. The test design was generated using a two-step procedure, starting at the item block level and continuing at the item level. Initially, a partially balanced incomplete item block design was generated using simulated annealing, whereas in a second step, items were assigned to the item blocks using mixed-integer linear optimization in combination with a shadow-test approach.

Faculty Opinions recommendation of Comparative assessment of large-scale data sets of protein-protein interactions.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1006598.82257 ◽

2002 ◽

Author(s):

Rob Russell

Keyword(s):

Protein Interactions ◽

Large Scale ◽

Comparative Assessment ◽

Data Sets ◽

Protein Protein Interactions ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets

A novel optimal multi-pattern matching method with wildcards for DNA sequence

Technology and Health Care ◽

10.3233/thc-218012 ◽

2021 ◽

Vol 29 ◽

pp. 115-124

Author(s):

Xinlu Wang ◽

Ahmed A.F. Saif ◽

Dayou Liu ◽

Yungang Zhu ◽

Jon Atli Benediktsson

Keyword(s):

DNA Sequence ◽

Pattern Matching ◽

Health Informatics ◽

State Of The Art ◽

Machine Language ◽

Data Sets ◽

Fundamental Issue ◽

Matching Method ◽

DNA Sequence Alignment ◽

The Given

BACKGROUND: DNA sequence alignment is one of the most fundamental and important operations for identifying which gene family may contain a given sequence, and pattern matching for DNA sequences has been a fundamental issue in biomedical engineering, biotechnology, and health informatics. OBJECTIVE: To solve this problem, this study proposes an optimal multi-pattern matching method with wildcards for DNA sequences. METHODS: The proposed method packs the patterns and a sliding window of the text; the window slides along the given packed text and is matched against the stored packed patterns. RESULTS: Three data sets were used to test the performance of the proposed algorithm, and the algorithm was found to be more efficient than its competitors because its operations are close to machine language. CONCLUSIONS: Theoretical analysis and experimental results both demonstrate that the proposed method outperforms the state-of-the-art methods and is especially effective for DNA sequences.
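
The packing idea described under METHODS can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the 2-bit encoding, the use of `N` as the wildcard symbol, and all function names are assumptions. Each base occupies two bits, a pattern is stored as a packed value plus a mask whose bits are zero at wildcard positions, and a packed window of the text slides one base at a time and is compared with an XOR and an AND.

```python
# Assumed 2-bit encoding of the four bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_pattern(pattern, wildcard="N"):
    """Return (bits, mask); wildcard positions contribute 00 to both."""
    bits = mask = 0
    for ch in pattern:
        bits <<= 2
        mask <<= 2
        if ch != wildcard:
            bits |= CODE[ch]
            mask |= 0b11
    return bits, mask


def find_matches(text, patterns, wildcard="N"):
    """Map each pattern to the start offsets where it matches `text`.

    `text` is assumed to contain only A, C, G, T.
    """
    hits = {p: [] for p in patterns}
    by_length = {}
    for p in hits:  # iterate unique patterns, grouped by length
        by_length.setdefault(len(p), []).append((p, *pack_pattern(p, wildcard)))
    for m, packed in by_length.items():
        if m == 0 or m > len(text):
            continue
        full = (1 << (2 * m)) - 1  # keeps only the last m bases in the window
        window = 0
        for i, ch in enumerate(text):
            window = ((window << 2) | CODE[ch]) & full
            if i >= m - 1:
                for p, bits, mask in packed:
                    # Differences outside wildcard positions must be zero.
                    if (window ^ bits) & mask == 0:
                        hits[p].append(i - m + 1)
    return hits


matches = find_matches("ACGTACGA", ["ACG", "GNA", "TNNG"])
print(matches)
```

Each comparison costs a constant number of word operations per pattern, which is the sense in which such packed methods work close to the machine level.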

Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to interact gravitationally. Here, a data set of $\sim 8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim 2.8\sigma$ and $\sim 7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.
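
To make the comparison of the two reported axes concrete, the angle between them can be computed with the standard spherical law of cosines. This is a small arithmetic sketch, not part of the paper's analysis; the function name is an assumption, and the coordinates are the right ascension and declination values quoted above.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle angle in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_d = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))


hst_axis = (78.0, 47.0)   # (alpha, delta) quoted for the HST galaxies
sdss_axis = (71.0, 61.0)  # (alpha, delta) quoted for the SDSS galaxies, z > 0.15

separation = angular_separation_deg(*hst_axis, *sdss_axis)
print(f"{separation:.1f} degrees")
```

The result is a separation on the sky only; whether it falls within a given error range depends on the uncertainties reported in the paper, which the abstract does not list.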

Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis

Algorithms ◽

10.3390/a14050154 ◽

2021 ◽

Vol 14 (5) ◽

pp. 154

Author(s):

Marcus Walldén ◽

Masao Okita ◽

Fumihiko Ino ◽

Dimitris Drikakis ◽

Ioannis Kokkinakis

Keyword(s):

Large Scale ◽

Data Driven ◽

Data Sets ◽

Output Constraints ◽

Data Driven Approach ◽

Scientific Simulations ◽

Multiple Metrics ◽

In Transit ◽

Multiple Compression ◽

Large Scale Simulations

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.
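
The general idea of choosing a compression method per region from an importance score can be sketched as below. Everything specific here is an assumption made for illustration and is not taken from the paper: the importance metric (value range within a region), the threshold, and the pairing of a fast `zlib` setting with important regions and a slower `bz2` setting with the rest. Both paths are lossless.

```python
import bz2
import struct
import zlib


def importance(region):
    """Hypothetical importance metric: spread of values in the region."""
    return max(region) - min(region)


def compress_region(region, threshold):
    """Pick a compressor per region based on its importance score."""
    raw = struct.pack(f"{len(region)}d", *region)  # 8-byte floats
    if importance(region) >= threshold:
        # Assumed policy: important regions use a fast method.
        return "zlib", zlib.compress(raw, 1)
    # Assumed policy: unimportant regions use a slower, stronger method.
    return "bz2", bz2.compress(raw, 9)


def decompress_region(method, blob):
    """Invert compress_region; the method tag travels with the data."""
    raw = zlib.decompress(blob) if method == "zlib" else bz2.decompress(blob)
    return list(struct.unpack(f"{len(raw) // 8}d", raw))


# Hypothetical data: one flat region and one region with varying values.
regions = [[0.0] * 64, [float(i) for i in range(64)]]
packed = [compress_region(r, threshold=1.0) for r in regions]
restored = [decompress_region(method, blob) for method, blob in packed]
print([method for method, _ in packed])
```

Storing the method tag next to each compressed region lets the receiving side decompress every region without knowing the importance scores.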

Large-Scale Bi-Level Strain Design Approaches and Mixed-Integer Programming Solution Techniques

PLoS ONE ◽

10.1371/journal.pone.0024162 ◽

2011 ◽

Vol 6 (9) ◽

pp. e24162 ◽

Cited By ~ 55

Author(s):

Joonhoon Kim ◽

Jennifer L. Reed ◽

Christos T. Maravelias

Keyword(s):

Integer Programming ◽

Mixed Integer Programming ◽

Large Scale ◽

Mixed Integer ◽

Strain Design ◽

Programming Solution ◽

Solution Techniques

Compartment and hub definitions tune metabolic networks for metabolomic interpretations

GigaScience ◽

10.1093/gigascience/giz137 ◽

2020 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

T Cameron Waller ◽

Jordan A Berg ◽

Alexander Lex ◽

Brian E Chapman ◽

Jared Rutter

Keyword(s):

Large Scale ◽

Metabolic Networks ◽

Shortest Paths ◽

Large Data ◽

Differential Regulation ◽

Large Data Sets ◽

Data Sets ◽

Human Metabolism ◽

Experimental Conditions ◽

Systemic Model

Background: Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments. Results: We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications. Conclusions: Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.
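
The effect of a hub on density and shortest paths can be seen on a toy example. This sketch is not the authors' tooling: it treats the network as a simple undirected graph, and the node names, the toy adjacency, and the helper functions are hypothetical. It computes graph density and a breadth-first shortest path with and without a single hub node.

```python
from collections import deque


def density(adj):
    """Density of a simple undirected graph given as {node: set(neighbors)}."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def shortest_path(adj, source, target):
    """Breadth-first search; returns one shortest path or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in sorted(adj[node]):  # sorted for a deterministic result
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


def without_node(adj, removed):
    """Copy of the graph with one node and its edges removed."""
    return {u: nbrs - {removed} for u, nbrs in adj.items() if u != removed}


# Hypothetical toy network: chain a-b-c-d, with every node linked to hub "atp".
adj = {
    "a": {"b", "atp"},
    "b": {"a", "c", "atp"},
    "c": {"b", "d", "atp"},
    "d": {"c", "atp"},
    "atp": {"a", "b", "c", "d"},
}
no_hub = without_node(adj, "atp")

print(density(adj), shortest_path(adj, "a", "d"))
print(density(no_hub), shortest_path(no_hub, "a", "d"))
```

In this toy case the hub raises the density and the shortest path from `a` to `d` runs through it; once the hub is excluded, the path follows the chain of the remaining nodes.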

Solution of the mixed integer large scale unit commitment problem by means of a continuous Stochastic linear programming model

Energy Systems ◽

10.1007/s12667-013-0107-z ◽

2013 ◽

Vol 5 (2) ◽

pp. 269-284 ◽

Cited By ~ 10

Author(s):

D. Siface ◽

M. T. Vespucci ◽

A. Gelmini

Keyword(s):

Linear Programming ◽

Large Scale ◽

Unit Commitment ◽

Programming Model ◽

Linear Programming Model ◽

Mixed Integer ◽

Stochastic Linear Programming ◽

Unit Commitment Problem ◽

Scale Unit ◽

Commitment Problem
