Minimum Common String Partition Problem: Hardness and Approximations

Avraham Goldstein; Petr Kolman; Jie Zheng

doi:10.37236/1947

The Electronic Journal of Combinatorics ◽

10.37236/1947 ◽

2005 ◽

Vol 12 (1) ◽

Cited By ~ 12

Author(s):

Avraham Goldstein ◽

Petr Kolman ◽

Jie Zheng

Keyword(s):

Genome Rearrangement ◽

Linear Time ◽

Fundamental Problem ◽

Text Processing ◽

Partition Problem ◽

Sorting By Reversals ◽

String Comparison ◽

Minimum Number ◽

Tight Connection ◽

Minimum Common String Partition

String comparison is a fundamental problem in computer science, with applications in areas such as computational biology, text processing and compression. In this paper we address the minimum common string partition problem, a string comparison problem with a tight connection to the problem of sorting by reversals with duplicates, a key problem in genome rearrangement. A partition of a string $A$ is a sequence ${\cal P} = (P_1,P_2,\dots,P_m)$ of strings, called the blocks, whose concatenation is equal to $A$. Given a partition ${\cal P}$ of a string $A$ and a partition ${\cal Q}$ of a string $B$, we say that the pair $\langle{{\cal P},{\cal Q}}\rangle$ is a common partition of $A$ and $B$ if ${\cal Q}$ is a permutation of ${\cal P}$. The minimum common string partition problem (MCSP) is to find a common partition of two strings $A$ and $B$ with the minimum number of blocks. The restricted version of MCSP in which each letter occurs at most $k$ times in each input string is denoted by $k$-MCSP. In this paper, we show that $2$-MCSP (and therefore MCSP) is NP-hard and, moreover, even APX-hard. We describe a $1.1037$-approximation for $2$-MCSP and a linear time $4$-approximation algorithm for $3$-MCSP. We are not aware of any better approximations.
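The definitions above can be made concrete with a small sketch. It is not the approximation algorithm from the paper: it only checks the common-partition condition and finds the optimum by exhaustive search, which is exponential and meant for tiny non-empty inputs. The function names are hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def is_common_partition(a, b, p, q):
    """True if p partitions a, q partitions b, and q is a permutation of p."""
    return "".join(p) == a and "".join(q) == b and Counter(p) == Counter(q)


def partitions_with_blocks(s, m):
    """Yield every partition of s into exactly m non-empty blocks."""
    for cuts in combinations(range(1, len(s)), m - 1):
        bounds = (0, *cuts, len(s))
        yield [s[i:j] for i, j in zip(bounds, bounds[1:])]


def min_common_partition_size(a, b):
    """Fewest blocks in a common partition of a and b (brute force).

    Returns None when the strings do not contain the same letters with the
    same multiplicities, because then no common partition exists.
    """
    if Counter(a) != Counter(b):
        return None
    for m in range(1, len(a) + 1):
        # Multisets of blocks reachable by cutting a into m blocks.
        seen = {frozenset(Counter(p).items()) for p in partitions_with_blocks(a, m)}
        for q in partitions_with_blocks(b, m):
            if frozenset(Counter(q).items()) in seen:
                return m
    return None


print(is_common_partition("abcab", "ababc", ["abc", "ab"], ["ab", "abc"]))
print(min_common_partition_size("abcab", "ababc"))
```

In the example, each letter occurs at most twice in each string, so the pair is an instance of $2$-MCSP.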

Reversal Distance for Strings with Duplicates: Linear Time Approximation using Hitting Set

The Electronic Journal of Combinatorics ◽

10.37236/968 ◽

2007 ◽

Vol 14 (1) ◽

Cited By ~ 7

Author(s):

Petr Kolman ◽

Tomasz Waleń

Keyword(s):

Genome Rearrangement ◽

Linear Time ◽

Time Algorithm ◽

Minimum Size ◽

String Comparison ◽

Minimum Number ◽

Tree Data ◽

Tree Data Structure ◽

Disjoint Set Union ◽

Minimum Common String Partition

In the last decade there has been an ongoing interest in string comparison problems; to a large extent, the interest was stimulated by genome rearrangement problems in computational biology, but related problems appear in many other areas of computer science. Particular attention has been given to the problem of sorting by reversals (SBR): given two strings, $A$ and $B$, find the minimum number of reversals that transform the string $A$ into the string $B$ (a reversal $\rho(i,j)$, $i < j$, transforms a string $A=a_1\ldots a_n$ into a string $A'=a_1\ldots a_{i-1} a_{j} a_{j-1} \ldots a_{i} a_{j+1} \ldots a_n$). Closely related is the minimum common string partition problem (MCSP): given two strings, $A$ and $B$, find a minimum size partition of $A$ into substrings $P_1,\ldots,P_l$ (i.e., $A=P_1\ldots P_l$) and a partition of $B$ into substrings $Q_1,\ldots,Q_l$ such that $(Q_1,\ldots,Q_l)$ is a permutation of $(P_1,\ldots,P_l)$. The SBR problem has been studied primarily for strings in which every symbol appears exactly once (that is, for permutations), and only recently has attention been given to the general case where duplicates of the symbols are allowed. In this paper we consider the problem $k$-SBR, a version of SBR in which each symbol is allowed to appear up to $k$ times in each string, for some $k\geq 1$. The main result of the paper is a $\Theta(k)$-approximation algorithm for $k$-SBR running in time $O(n)$; compared to the previously known algorithm for $k$-SBR, this is an improvement by a factor of $\Theta(k)$ in the approximation ratio, and by a factor of $\Theta(k)$ in the running time. We approach $k$-SBR by finding an approximation for $k$-MCSP first and then turning it into a solution for $k$-SBR. Crucial ingredients of our algorithm are the suffix tree data structure and a linear time algorithm for a special case of a disjoint set union problem.
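As an illustration of the reversal $\rho(i,j)$ defined above, the following sketch applies a single reversal with 1-based indices and computes the reversal distance of two short strings by breadth-first search. The search is exponential and stands in for the definition only; it is not the linear-time approximation described in the abstract, and the function names are hypothetical.

```python
from collections import deque


def reversal(a, i, j):
    """Apply rho(i, j), 1-based with i < j: reverse the substring a_i ... a_j."""
    if not 1 <= i < j <= len(a):
        raise ValueError("need 1 <= i < j <= len(a)")
    return a[:i - 1] + a[i - 1:j][::-1] + a[j:]


def reversal_distance(a, b):
    """Minimum number of reversals turning a into b, or None if impossible.

    Plain breadth-first search over all strings reachable from a; only
    practical for very short inputs.
    """
    if sorted(a) != sorted(b):
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        s, dist = queue.popleft()
        if s == b:
            return dist
        for i in range(1, len(s)):
            for j in range(i + 1, len(s) + 1):
                t = reversal(s, i, j)
                if t not in seen:
                    seen.add(t)
                    queue.append((t, dist + 1))
    return None


print(reversal("abcde", 2, 4))
print(reversal_distance("abc", "bca"))
```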

Prefix Block-Interchanges on Binary and Ternary Strings

10.1101/659664 ◽

2019 ◽

Author(s):

Md. Khaledur Rahman ◽

M. Sohel Rahman

Keyword(s):

Upper Bound ◽

Genome Rearrangement ◽

Linear Time ◽

Polynomial Time Algorithm ◽

Time Algorithm ◽

Upper Bounds ◽

Linear Time Algorithm ◽

Minimum Number ◽

Binary Strings ◽

Better Than

The genome rearrangement problem computes the minimum number of operations that are required to sort all elements of a permutation. A block-interchange operation exchanges two blocks of a permutation which are not necessarily adjacent, and in a prefix block-interchange, one block is always the prefix of that permutation. In this paper, we focus on applying prefix block-interchanges on binary and ternary strings. We present upper bounds to group and sort a given binary/ternary string. We also provide upper bounds for a different version of the block-interchange operation, which we refer to as the ‘restricted prefix block-interchange’. We observe that the upper bound we obtain for restricted prefix block-interchange operations on binary strings is better than that of other genome rearrangement operations to group fully normalized binary strings. Consequently, we provide a linear-time algorithm to solve the problem of grouping binary normalized strings by restricted prefix block-interchanges. We also provide a polynomial time algorithm to group normalized ternary strings by prefix block-interchange operations. Finally, we provide a classification for ternary strings based on the required number of prefix block-interchange operations.
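A single prefix block-interchange, as described above, can be sketched as follows. The index convention (0-based, half-open slices) and the function names are assumptions made for illustration, and `is_grouped` reflects one plausible reading of "grouping", namely that every symbol occupies a single contiguous run.

```python
def prefix_block_interchange(s, i, j, k):
    """Exchange the prefix s[0:i] with the block s[j:k], where 0 < i <= j < k <= len(s).

    The part between the two blocks, s[i:j], and the tail s[k:] stay in place.
    """
    if not 0 < i <= j < k <= len(s):
        raise ValueError("need 0 < i <= j < k <= len(s)")
    return s[j:k] + s[i:j] + s[:i] + s[k:]


def is_grouped(s):
    """True if each distinct symbol of s forms exactly one contiguous run."""
    runs = sum(1 for pos, ch in enumerate(s) if pos == 0 or s[pos - 1] != ch)
    return runs == len(set(s))


# One operation groups the binary string 0110: swap prefix "0" with block "11".
result = prefix_block_interchange("0110", 1, 1, 3)
print(result, is_grouped(result))
```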

Algorithms for Sorting by Reversals or Transpositions, with Application to Genome Rearrangement

10.5753/ctd.2016.9145 ◽

2020 ◽

Author(s):

Gustavo Rodrigues Galvão ◽

Zanoni Dias

Keyword(s):

Comparative Genomics ◽

Genome Rearrangement ◽

Heuristic Algorithms ◽

Combinatorial Problem ◽

Sorting By Reversals ◽

Sorting Problem ◽

Minimum Number ◽

Phd Thesis

The problem of finding the minimum sequence of rearrangements that transforms one genome into another is a well-studied problem that finds application in comparative genomics. Representing genomes as permutations, in which genes appear as elements, the problem can be reduced to the combinatorial problem of sorting a permutation using a minimum number of rearrangements. This combinatorial problem varies according to the types of rearrangements considered. The PhD thesis summarized in this paper presents exact, approximation, and heuristic algorithms for solving variants of the permutation sorting problem involving two types of rearrangements: reversals and transpositions.
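The two rearrangement types named above can be illustrated on permutations stored as Python lists. This is a minimal sketch of the operations themselves under the usual unsigned definitions, not of the thesis's sorting algorithms; the 0-based, half-open index convention and the function names are assumptions.

```python
def reversal(perm, i, j):
    """Reverse the segment perm[i:j] (0-based, half-open)."""
    if not 0 <= i < j <= len(perm):
        raise ValueError("need 0 <= i < j <= len(perm)")
    return perm[:i] + perm[i:j][::-1] + perm[j:]


def transposition(perm, i, j, k):
    """Exchange the adjacent segments perm[i:j] and perm[j:k]."""
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("need 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


# Each of these permutations is sorted by a single operation.
print(reversal([1, 4, 3, 2, 5], 1, 4))
print(transposition([3, 1, 2], 0, 1, 3))
```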

Computational performance evaluation of two integer linear programming models for the minimum common string partition problem

Optimization Letters ◽

10.1007/s11590-015-0921-4 ◽

2015 ◽

Vol 10 (1) ◽

pp. 189-205 ◽

Cited By ~ 3

Author(s):

Christian Blum ◽

Günther R. Raidl

Keyword(s):

Linear Programming ◽

Performance Evaluation ◽

Integer Linear Programming ◽

Programming Models ◽

Partition Problem ◽

Computational Performance ◽

Minimum Common String Partition

A Contraction-based Ratio-cut Partitioning Algorithm

VLSI Design ◽

10.1080/1065514021000012093 ◽

2002 ◽

Vol 15 (2) ◽

pp. 485-489

Author(s):

Youssef Saab

Keyword(s):

Linear Time ◽

Fundamental Problem ◽

Cluster Formation ◽

Vlsi Circuits ◽

Iterative Improvement ◽

Partitioning Algorithm ◽

Partitioning Algorithms ◽

Simple Ratio ◽

Iterative Partitioning

Partitioning is a fundamental problem in the design of VLSI circuits. In recent years, ratio-cut partitioning has received attention due to its tendency to partition circuits into their natural clusters. Node contraction has also been shown to enhance the performance of iterative partitioning algorithms. This paper describes a new simple ratio-cut partitioning algorithm using node contraction. This new algorithm combines iterative improvement with progressive cluster formation. Under suitably mild assumptions, the new algorithm runs in linear time. It is also shown that the new algorithm compares favorably with previous approaches.
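The abstract does not state the objective explicitly. Assuming the standard ratio-cut measure for a bipartition $(A, B)$ of an unweighted graph, namely the number of cut edges divided by $|A| \cdot |B|$, a small sketch of the objective looks like this. It evaluates a given partition only; the contraction-based algorithm itself is not reproduced, and the function name is hypothetical.

```python
from fractions import Fraction


def ratio_cut(edges, part_a, nodes):
    """Ratio cut of the bipartition (part_a, nodes - part_a).

    Assumed definition: cut(A, B) / (|A| * |B|) with unit edge weights.
    """
    a = set(part_a)
    b = set(nodes) - a
    if not a or not b:
        raise ValueError("both sides of the partition must be non-empty")
    cut = sum(1 for u, v in edges if (u in a) != (v in a))
    return Fraction(cut, len(a) * len(b))


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(ratio_cut(edges, {0, 1, 2}, range(6)))  # split along the natural clusters
print(ratio_cut(edges, {0}, range(6)))        # unbalanced split
```

Under this measure the split along the two triangles scores lower than the unbalanced one, which matches the abstract's remark that ratio-cut tends to recover natural clusters.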

Ranking top-k trees in tree-based phylogenetic networks

10.21203/rs.2.15349/v1 ◽

2019 ◽

Author(s):

Momoko Hayamizu ◽

Kazuhisa Makino

Keyword(s):

Optimal Algorithm ◽

Linear Time ◽

Fundamental Problem ◽

Phylogenetic Network ◽

Reticulate Evolution ◽

Interesting Property ◽

Biological Data ◽

Phylogenetic Networks ◽

Linear Delay ◽

Algorithmic Problems

'Tree-based' phylogenetic networks provide a mathematically-tractable model for representing reticulate evolution in biology. Such networks consist of an underlying 'support tree' together with arcs between the edges of this tree. However, a tree-based network can have several such support trees, and this leads to a variety of algorithmic problems that are relevant to the analysis of biological data. Recently, Hayamizu (arXiv:1811.05849 [math.CO]) proved a structure theorem for tree-based phylogenetic networks and obtained linear-time and linear-delay algorithms for many basic problems on support trees, such as counting, optimisation, and enumeration. In the present paper, we consider the following fundamental problem in statistical data analysis: given a tree-based phylogenetic network $N$ whose arcs are associated with probabilities, rank the top-$k$ support trees of $N$ by their likelihood values. We provide a linear-delay (and hence optimal) algorithm for the problem and thus reveal the interesting property of tree-based phylogenetic networks that ranking top-$k$ support trees is as computationally easy as picking $k$ arbitrary support trees.

Minimum Cell Connection in Line Segment Arrangements

International Journal of Computational Geometry & Applications ◽

10.1142/s0218195917500017 ◽

2017 ◽

Vol 27 (03) ◽

pp. 159-176

Author(s):

Helmut Alt ◽

Sergio Cabello ◽

Panos Giannopoulos ◽

Christian Knauer

Keyword(s):

Linear Time ◽

Optimal Solution ◽

Time Algorithm ◽

Linear Time Algorithm ◽

Constant Number ◽

Line Segments ◽

Straight Line ◽

Minimum Number ◽

Number Of Segments ◽

Connection Problems

We study the complexity of the following cell connection problems in segment arrangements. Given a set of straight-line segments in the plane and two points in different cells of the induced arrangement: (i) compute the minimum number of segments one needs to remove so that there is a path connecting the two points that does not intersect any of the remaining segments; (ii) compute the minimum number of segments one needs to remove so that the arrangement induced by the remaining segments has a single cell. We show that problems (i) and (ii) are NP-hard and discuss some special, tractable cases. Most notably, we provide a near-linear-time algorithm for a variant of problem (i) where the path connecting the two points must stay inside a given polygon with a constant number of holes, the segments are contained in the polygon, and the endpoints of the segments are on the boundary of the polygon. The approach for this latter result uses homotopy of paths to group the segments into clusters with the property that either all segments in a cluster or none participate in an optimal solution.

Efficient Web Mining for Traversal Path Patterns

Web Mining ◽

10.4018/978-1-59140-414-9.ch015 ◽

2011 ◽

pp. 322-338 ◽

Cited By ~ 1

Author(s):

Zhixiang Chen ◽

Richard H. Fowler ◽

Ada Wai-Chee Fu ◽

Chunyue Wang

Keyword(s):

Web Mining ◽

Linear Time ◽

Fundamental Problem ◽

A Priori ◽

Web Pages ◽

Suffix Trees ◽

Web Logs ◽

Large Alphabet ◽

Optimal Linear ◽

Linear Time Algorithms

A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting any previously visited page in the sequence. Efficient mining of frequent traversal path patterns, that is, large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of shallow generalized suffix trees over a very large alphabet. These two algorithms have provable linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with the Apriori-like algorithms and the Ukkonen algorithm. It is shown that these two new algorithms are substantially more efficient than the Apriori-like algorithms and the Ukkonen algorithm.
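To make the notion of a maximal forward reference concrete, here is a minimal sketch of one common way to extract them from a single session. It assumes the backtracking convention often used in the Web-mining literature: revisiting a page already on the current path is a backward move, which closes the current forward path and truncates it to the revisited page. This is a swapped-in illustration, not necessarily either of the chapter's two algorithms, and the function name is hypothetical.

```python
def maximal_forward_references(session):
    """Return the maximal forward references of one session (a list of page ids)."""
    path = []          # current forward path
    position = {}      # page -> index of that page in path
    extending = False  # True if the last move was a forward move
    result = []
    for page in session:
        if page in position:
            # Backward move: emit the path only if it was just extended.
            if extending:
                result.append(list(path))
            keep = position[page] + 1
            for dropped in path[keep:]:
                del position[dropped]
            del path[keep:]
            extending = False
        else:
            # Forward move: extend the current path.
            position[page] = len(path)
            path.append(page)
            extending = True
    if extending:
        result.append(list(path))
    return result


session = list("ABCDCBEGHGWAOUOV")
for ref in maximal_forward_references(session):
    print("".join(ref))
```

Each page is appended to and removed from the path at most once per visit, so the work beyond writing the output is proportional to the session length.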