Table of Contents
Cover
Dedication
Title
Copyright
Preface
Acknowledgments
List of Acronyms
1 Introduction
1.1. Complete methods for combinatorial optimization
1.2. Approximate methods: metaheuristics
1.3. Outline of the book
2 Minimum Common String Partition Problem
2.1. The MCSP problem
2.2. An ILP model for the UMCSP problem
2.3. Greedy approach
2.4. Construct, merge, solve and adapt
2.5. Experimental evaluation
2.6. Future work
3 Longest Common Subsequence Problems
3.1. Introduction
3.2. Algorithms for the LCS problem
3.3. Algorithms for the RFLCS problem
3.4. Future work
4 The Most Strings With Few Bad Columns Problem
4.1. The MSFBC problem
4.2. An ILP model for the MSFBC problem
4.3. Heuristic approaches
4.4. ILP-based large neighborhood search
4.5. Experimental evaluation
4.6. Future work
5 Consensus String Problems
5.1. Introduction
5.2. Organization of this chapter
5.3. The closest string problem and the close to most string problem
5.4. The farthest string problem and the far from most string problem
5.5. An ILP-based heuristic
5.6. Future work
6 Alignment Problems
6.1. Introduction
6.2. The pairwise alignment problem
6.3. The multiple alignment problem
6.4. Conclusion and future work
7 Conclusions
7.1. DNA sequencing
7.2. Founder sequence reconstruction
7.3. Final remarks
Bibliography
Index
End User License Agreement
List of Tables
2 Minimum Common String Partition Problem
Table 2.1. Results of tuning CMSA for the UMCSP problem using irace
Table 2.2. Results for instances where |Σ| = 4
Table 2.3. Results for instances where |Σ| = 12
3 Longest Common Subsequence Problems
Table 3.1. Setting of κpbs, κrb, κbs and ρ depending on the convergence factor cf and the Boolean control variable bs_update
Table 3.2. Results of tuning Beam–ACO with irace
Table 3.3. Results for instances where |Σ| = 4
Table 3.4. Results for instances where |Σ| = 12
Table 3.5. Results for instances where |Σ| = 20
Table 3.6. Results of Beam–ACO tuning for the RFLCS problem with irace
Table 3.7. Results of CMSA tuning for the RFLCS problem with irace
Table 3.8. Experimental results concerning the RFLCS problem
4 The Most Strings With Few Bad Columns Problem
Table 4.1. Parameter settings produced by irace for LNS where |Σ| = 4
Table 4.2. Parameter settings produced by irace for LNS where |Σ| = 12
Table 4.3. Parameter settings produced by irace for LNS where |Σ| = 20
Table 4.4. Experimental results for instances where |Σ| = 4
Table 4.5. Experimental results for instances where |Σ| = 12
Table 4.6. Experimental results for instances where |Σ| = 20
5 Consensus String Problems
Table 5.1. Setting of κib and κrb depending on the convergence factor cf
Table 5.2. Numerical results
Table 5.3. Numerical results for the CSP
Table 5.4. Numerical results for the FSP
Table 5.5. Numerical results for the CTMS problem
Table 5.6. Results for the FFMS problem
7 Conclusions
Table 7.1. Heuristic and metaheuristic approaches for the layout phase in DNA fragment assembly
Table 7.2. Heuristic and metaheuristic approaches to the SBH problem
List of Illustrations
1 Introduction
Figure 1.1. DNA double helix (image courtesy of Wikipedia)
Figure 1.2. Graphical representation of the feasible region X of a generic ILP problem
Figure 1.3. Feasible set of the problem described in Example 1.1. Either xI = (0, 2) or xI = (1, 2)
Figure 1.4. Feasible set of the problem described in point (6) below Example 1.1: X = Ø
Figure 1.5. Branch and bound branching tree
Figure 1.6. B&B scenario
Figure 1.7. B&B partitioning (branching)
Figure 1.8. Branching
Algorithm 1.1. General Cutting Plane Technique
Algorithm 1.2. Local search
Algorithm 1.3. Ant colony optimization (ACO)
Algorithm 1.4. Evolutionary algorithm (EA)
Algorithm 1.5. Greedy randomized adaptive search procedure (GRASP)
Algorithm 1.6. Iterated local search (ILS)
Algorithm 1.7. Simulated annealing (SA)
Algorithm 1.8. Large neighborhood search (LNS)
Algorithm 1.9. Construct, merge, solve & adapt (CMSA)
2 Minimum Common String Partition Problem
Figure 2.1. Improvement of CMSA over CPLEX (in percent). Note that when boxes are missing, CPLEX was not able to provide feasible solutions within the computation time allowed
Figure 2.2. Improvement of CMSA over GREEDY (in percent)
Figure 2.3. Graphical presentation of the sizes of the sub-instances as a percentage of the size of the original problems
Algorithm 2.1. GREEDY
Algorithm 2.2. CMSA for the UMCSP problem
Algorithm 2.3. ProbabilisticSolutionGeneration(B) function
3 Longest Common Subsequence Problems
Figure 3.1. Consider example I := (S = {s1, s2, s3}, Σ = {a, b, c, d}), where s1 = dbcadcc, s2 = acabadd, and s3 = bacdcdd. Let t = ba be the current partial solution. Figures a), b), and c) show the separation of si into xi and yi. In addition, position pointers pi and the next positions of the four letters in yi are indicated. If, for some i, yi does not contain a specific letter, the corresponding pointer is set to ∞. This happens, for example, in the case of letters a and b
Figure 3.2. Improvement of Beam–ACO over Best-Next. Each box shows the differences between the objective function value of the solution produced by Beam–ACO and the one of the solution produced by Best-Next for the 10 instances of the same type
Figure 3.3. Improvement of Beam–ACO over BS-L. Each box shows the differences between the objective function value of the solution produced by Beam–ACO and the one of the solution produced by BS-L for the 10 instances of the same type
Figure 3.4. Improvement of Beam–ACO over BS-H. Each box shows the differences between the objective function value of the solution produced by Beam–ACO and the one of the solution produced by BS-H for the 10 instances of the same type
Figure 3.5. Improvement of CMSA over Beam–ACO. Each box shows the differences between the objective function value of the solution produced by CMSA and the one of the solution produced by Beam–ACO for 10 instances of the same type
Figure 3.6. Improvement of CMSA over CPLEX. Each box shows the differences between the objective function value of the solution produced by CMSA and that of the solution produced by CPLEX for 10 instances of the same type. Missing boxes indicate that CPLEX was not able to produce any valid solution
Algorithm 3.1. BEST-NEXT heuristic for the LCS problem
Algorithm 3.2. BS for the LCS problem
Algorithm 3.3. Beam–ACO for the LCS problem
Algorithm 3.4. CMSA for the RFLCS problem
4 The Most Strings With Few Bad Columns Problem
Figure 4.1. Improvement of LNS over PILOT-TR (in absolute terms). Each box shows these differences for the corresponding 10 instances. Note that negative values indicate that PILOT-TR obtained a better result than LNS
Figure 4.2. Improvement of LNS over CPLEX (in absolute terms). Each box shows these differences for the corresponding 10 instances. Note that negative values indicate that CPLEX obtained a better result than LNS
Figure 4.3. Average time (in seconds) used by CPLEX within LNS for each application at each iteration. Two examples considering destruction between 10 and 90 percent
Algorithm 4.1. MakeSolutionComplete(Sp) procedure
Algorithm 4.2. Truncated pilot method
Algorithm 4.3. ILP-based LNS for the MSFBC problem
5 Consensus String Problems
Figure 5.1. Time-to-target distributions (in seconds) comparing grasp and grasp-h-ev for a random instance where n = 100, m = 600, t = 480, and a target value of 0.68 × n
Figure 5.2. Time-to-target distributions (in seconds) comparing grasp and grasp-h-ev for a random instance where n = 200, m = 300, t = 255, and a target value of 0.025 × n
Figure 5.3. Time-to-target distributions (in seconds) comparing grasp and grasp-h-ev for a real instance where n = 300, m = 1200, t = 1020, and a target value of 0.22 × n
Figure 5.4. Time-to-target distributions (in seconds) comparing grasp-h-ev, grasp-h-ev_b and grasp-h-ev_ev_pr on the random instance where n = 100, m = 300, t = 240, and a target value of 0.73 × n
Figure 5.5. Time-to-target distributions (in seconds) comparing grasp-h-ev, grasp-h-ev_b and grasp-h-ev_ev_pr on the random instance where n = 200, m = 300, t = 240, and a target value of 0.445 × n
Figure 5.6. Justification for the choice of parameter t for the CTMSP and the FFMSP
Figure 5.7. Graphical representation of the comparison between HEURISTIC-3600 and the current FFMSP state-of-the-art methods (GRASP+PR, ACO, and ACO+CPLEX) for t = 0.8 m and t = 0.85 m
Algorithm 5.1. GRASP for the FFMSP
Algorithm 5.2. Function GRRAND of Algorithm 5.1
Algorithm 5.3. Path-relinking for the FFMSP
Algorithm 5.4. Mixed-Path-relinking for the FFMSP [RIB 07]
Algorithm 5.5. (GRAPR) Greedy Randomized Adaptive Path-relinking for FFMSP [FAR 05]
Algorithm 5.6. Evolutionary path-relinking for FFMSP [RES 10b, RES 04]
Algorithm 5.7. GRASP+PR for the FFMSP
Algorithm 5.8. CHOOSE-PR-STRATEGY – Function for selection of path-relinking strategy for FFMSP
Algorithm 5.9. (Hybrid) ACO for the FFMSP
Algorithm 5.10. RunACO(sbs ) function of Algorithm 5.9
6 Alignment Problems
Figure 6.1. An example of alignment of two sequences
Figure 6.2. All possible alignments of two sequences without gaps
Figure 6.3. Two possible alignments of two sequences: a) not inserting gaps with 6 pairings; b) inserting gaps with 7 pairings
Figure 6.4. Dot matrix representing the alignment in Figure 6.1
Figure 6.5. Matrix H generated by the application of Smith and Waterman’s algorithm to strings s = AAUGCCAUUGACGG and t = CAGCCUCGCUUAG
Figure 6.6. The alignment obtained for the simple example in Figure 6.5
7 Conclusions
Figure 7.1. a) The completely connected directed graph G with the spectrum S = {ACT, TGA, GAC, CTC, TAA} as vertex set. The edge weights (i.e. overlaps) are not indicated for readability reasons. For example, the weight on the edge from TGA to GAC is 2, because GA is the longest DNA strand that is a suffix of TGA and a prefix of GAC. An optimal solution is p∗ = (ACT, TGA, GAC, CTC). b) How to retrieve the DNA sequence that is encoded by p∗. Note that c(p∗) = 8, which is equal to the length of the encoded DNA sequence
Figure 7.2. a) A set of six recombinants in matrix form. Assuming that the number of founders (kf) is fixed to three, b) shows a valid solution, that is, a matrix of three founders. Denoting the first founder by a, the second founder by b, and the third one by c, c) shows a decomposition of the recombinant matrix into fragments taken from the founders. Note that breakpoints are marked by vertical lines. This is a decomposition with 8 breakpoints, which is the minimum value for this instance
This book is dedicated to my parents Maria and Dieter, who are currently passing through a difficult period of their lives.
(Christian Blum)
This book is dedicated to my daughters Iara, Mara, and Nara, the most beautiful gift that life could give me.
(Paola Festa)
Metaheuristics Set
coordinated by Nicolas Monmarché and Patrick Siarry
Metaheuristics for String Problems in Bio-informatics
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2016
The rights of Christian Blum and Paola Festa to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2016945029
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-812-3
DNA (deoxyribonucleic acid) acts as the information archive of most living beings. Because a strand of DNA can be expressed as a character string over a four-letter alphabet, so-called string problems have become abundant in bioinformatics and computational biology. Each year, new optimization problems dealing with DNA (or protein) sequences are formulated that require efficient optimization techniques. From this perspective, bioinformatics is a burgeoning field for optimization experts and computer scientists in general. In this book, we focus on a mixture of well-known and recent string optimization problems from the bioinformatics field, all of which are combinatorial in nature.
One of the obstacles for optimization practitioners is the atypical nature of these problems: although combinatorial, they are rather different from the classical traveling salesman problem or the quadratic assignment problem, so a different type of expertise is required to solve many of them efficiently. One of the main goals of this book is therefore to pass this expertise and experience on to newcomers to the field. The book provides several examples of very successful (hybrid) metaheuristics for solving specific string problems. One such example concerns the use of beam search (an incomplete branch and bound method) for solving longest common subsequence problems; the application of this algorithm in 2009 marked a breakthrough in the solution of this type of problem.
Finally, we would like to address a few words to the interested readers, especially biologists. We apologize for any imprecision in the description of biological processes, which we have tried to keep to a minimum. Keep in mind that, after all, we are only computer scientists and mathematicians.
Christian BLUM Paola FESTA June 2016
This work was supported by grant TIN2012-37930-C02-02 from the Spanish Government. Support from CSIC (Spanish National Research Council) and IKERBASQUE (Basque Foundation for Science) is also acknowledged. We thank RDlab1, a BarcelonaTech facility, for allowing us to perform the experiments partly in its high-performance computing environment.
1 http://rdlab.lsi.upc.edu.
ACO: Ant Colony Optimization
B&B: Branch & Bound
CMSA: Construct, Merge, Solve & Adapt
CO: Combinatorial Optimization
DNA: Deoxyribonucleic Acid
DP: Dynamic Programming
EA: Evolutionary Algorithm
GA: Genetic Algorithm
ILP: Integer Linear Programming
ILS: Iterated Local Search
IP: Integer Programming
LNS: Large Neighborhood Search
MCSP: Minimum Common String Partition
MSFBC: Most Strings With Few Bad Columns
RNA: Ribonucleic Acid
SA: Simulated Annealing
TS: Tabu Search
TSP: Traveling Salesman Problem
UMCSP: Unbalanced Minimum Common String Partition