Table of Contents
Cover
Dedication
Title
Copyright
Preface
Acknowledgments
List of Acronyms
1 Introduction
1.1. Complete methods for combinatorial optimization
1.2. Approximate methods: metaheuristics
1.3. Outline of the book
2 Minimum Common String Partition Problem
2.1. The MCSP problem
2.2. An ILP model for the UMCSP problem
2.3. Greedy approach
2.4. Construct, merge, solve and adapt
2.5. Experimental evaluation
2.6. Future work
3 Longest Common Subsequence Problems
3.1. Introduction
3.2. Algorithms for the LCS problem
3.3. Algorithms for the RFLCS problem
3.4. Future work
4 The Most Strings With Few Bad Columns Problem
4.1. The MSFBC problem
4.2. An ILP model for the MSFBC problem
4.3. Heuristic approaches
4.4. ILP-based large neighborhood search
4.5. Experimental evaluation
4.6. Future work
5 Consensus String Problems
5.1. Introduction
5.2. Organization of this chapter
5.3. The closest string problem and the close to most string problem
5.4. The farthest string problem and the far from most string problem
5.5. An ILP-based heuristic
5.6. Future work
6 Alignment Problems
6.1. Introduction
6.2. The pairwise alignment problem
6.3. The multiple alignment problem
6.4. Conclusion and future work
7 Conclusions
7.1. DNA sequencing
7.2. Founder sequence reconstruction
7.3. Final remarks
Bibliography
Index
End User License Agreement
List of Tables
2 Minimum Common String Partition Problem
Table 2.1. Results of tuning CMSA for the UMCSP problem using irace
Table 2.2. Results for instances where |Σ| = 4
Table 2.3. Results for instances where |Σ| = 12
3 Longest Common Subsequence Problems
Table 3.1. Setting of κpbs, κrb, κbs and ρ depending on the convergence factor cf and the Boolean control variable bs_update
Table 3.2. Results of tuning Beam–ACO with irace
Table 3.3. Results for instances where |Σ| = 4
Table 3.4. Results for instances where |Σ| = 12
Table 3.5. Results for instances where |Σ| = 20
Table 3.6. Results of Beam–ACO tuning for the RFLCS problem with irace
Table 3.7. Results of CMSA tuning for the RFLCS problem with irace
Table 3.8. Experimental results concerning the RFLCS problem
4 The Most Strings With Few Bad Columns Problem
Table 4.1. Parameter settings produced by irace for LNS where |Σ| = 4
Table 4.2. Parameter settings produced by irace for LNS where |Σ| = 12
Table 4.3. Parameter settings produced by irace for LNS where |Σ| = 20
Table 4.4. Experimental results for instances where |Σ| = 4
Table 4.5. Experimental results for instances where |Σ| = 12
Table 4.6. Experimental results for instances where |Σ| = 20
5 Consensus String Problems
Table 5.1. Setting of κib and κrb depending on the convergence factor cf
Table 5.2. Numerical results
Table 5.3. Numerical results for the CSP
Table 5.4. Numerical results for the FSP
Table 5.5. Numerical results for the CTMS problem
Table 5.6. Results for the FFMS problem
7 Conclusions
Table 7.1. Heuristic and metaheuristic approaches for the layout phase in DNA fragment assembly
Table 7.2. Heuristic and metaheuristic approaches to the SBH problem
List of Illustrations
1 Introduction
Figure 1.1. DNA double helix (image courtesy of Wikipedia)
Figure 1.2. Graphical representation of the feasible region X of a generic ILP problem
Figure 1.3. Feasible set of the problem described in Example 1.1. Either xI = (0, 2) or xI = (1, 2)
Figure 1.4. Feasible set of the problem described in point (6) below Example 1.1: X = Ø
Figure 1.5. Branch and bound branching tree
Figure 1.6. B&B scenario
Figure 1.7. B&B partitioning (branching)
Figure 1.8. Branching
Algorithm 1.1. General Cutting Plane Technique
Algorithm 1.2. Local search
Algorithm 1.3. Ant colony optimization (ACO)
Algorithm 1.4. Evolutionary algorithm (EA)
Algorithm 1.5. Greedy randomized adaptive search procedure (GRASP)
Algorithm 1.6. Iterated local search (ILS)
Algorithm 1.7. Simulated annealing (SA)
Algorithm 1.8. Large neighborhood search (LNS)
Algorithm 1.9. Construct, merge, solve & adapt (CMSA)
2 Minimum Common String Partition Problem
Figure 2.1. Improvement of CMSA over CPLEX (in percent). Note that when boxes are missing, CPLEX was not able to provide feasible solutions within the computation time allowed
Figure 2.2. Improvement of CMSA over GREEDY (in percent)
Figure 2.3. Graphical presentation of the sizes of the sub-instances as a percentage of the size of the original problems
Algorithm 2.1. GREEDY
Algorithm 2.2. CMSA for the UMCSP problem
Algorithm 2.3. ProbabilisticSolutionGeneration(B) function
3 Longest Common Subsequence Problems
Figure 3.1. Consider example I := (S = {s1, s2, s3}, Σ = {a, b, c, d}), where s1 = dbcadcc, s2 = acabadd, and s3 = bacdcdd. Let t = ba be the current partial solution. Figures a), b), and c) show the separation of si into xi and yi. In addition, position pointers pi and the next positions of the four letters in yi are indicated. If, for some i, yi does not contain a specific letter, the corresponding pointer is set to ∞. This happens, for example, in the case of letters a and b
Figure 3.2. Improvement of Beam–ACO over Best-Next. Each box shows the differences between the objective function value of the solution produced by Beam–ACO and the one of the solution produced by Best-Next for the 10 instances of the same type
Figure 3.3. Improvement of Beam–ACO over BS-L. Each box shows the differences between the objective function value of the solution produced by Beam–ACO and the one of the solution produced by BS-L for the 10 instances of the same type
Figure 3.4. Improvement of Beam–ACO over BS-H. Each box shows the differences between the objective function value of the solution produced by Beam–ACO and the one of the solution produced by BS-H for the 10 instances of the same type
Figure 3.5. Improvement of CMSA over Beam–ACO. Each box shows the differences between the objective function value of the solution produced by CMSA and the one of the solution produced by Beam–ACO for 10 instances of the same type
Figure 3.6. Improvement of CMSA over CPLEX. Each box shows the differences between the objective function value of the solution produced by CMSA and that of the solution produced by CPLEX for 10 instances of the same type. Missing boxes indicate that CPLEX was not able to produce any valid solution
Algorithm 3.1. BEST-NEXT heuristic for the LCS problem
Algorithm 3.2. BS for the LCS problem
Algorithm 3.3. Beam–ACO for the LCS problem
Algorithm 3.4. CMSA for the RFLCS problem
4 The Most Strings With Few Bad Columns Problem
Figure 4.1. Improvement of LNS over PILOT-TR (in absolute terms). Each box shows these differences for the corresponding 10 instances. Note that negative values indicate that PILOT-TR obtained a better result than LNS
Figure 4.2. Improvement of LNS over CPLEX (in absolute terms). Each box shows these differences for the corresponding 10 instances. Note that negative values indicate that CPLEX obtained a better result than LNS
Figure 4.3. Average time (in seconds) used by CPLEX within LNS for each application at each iteration. Two examples considering destruction between 10 and 90 percent
Algorithm 4.1. MakeSolutionComplete(Sp) procedure
Algorithm 4.2. Truncated pilot method
Algorithm 4.3. ILP-based LNS for the MSFBC problem
5 Consensus String Problems
Figure 5.1. Time-to-target distributions (in seconds) comparing grasp and grasp-h-ev for a random instance where n = 100, m = 600, t = 480, and a target value of 0.68 × n
Figure 5.2. Time-to-target distributions (in seconds) comparing grasp and grasp-h-ev for a random instance where n = 200, m = 300, t = 255, and a target value of 0.025 × n
Figure 5.3. Time-to-target distributions (in seconds) comparing grasp and grasp-h-ev for a real instance where n = 300, m = 1200, t = 1020, and a target value of 0.22 × n
Figure 5.4. Time-to-target distributions (in seconds) comparing grasp-h-ev, grasp-h-ev_b and grasp-h-ev_ev_pr on the random instance where n = 100, m = 300, t = 240, and a target value of 0.73 × n
Figure 5.5. Time-to-target distributions (in seconds) comparing grasp-h-ev, grasp-h-ev_b and grasp-h-ev_ev_pr on the random instance where n = 200, m = 300, t = 240, and a target value of 0.445 × n
Figure 5.6. Justification for the choice of parameter t for the CTMSP and the FFMSP
Figure 5.7. Graphical representation of the comparison between HEURISTIC-3600 and the current FFMSP state-of-the-art methods (GRASP+PR, ACO, and ACO+CPLEX) for t = 0.8 m and t = 0.85 m
Algorithm 5.1. GRASP for the FFMSP
Algorithm 5.2. Function GRRAND of Algorithm 5.1
Algorithm 5.3. Path-relinking for the FFMSP
Algorithm 5.4. Mixed-Path-relinking for the FFMSP [RIB 07]
Algorithm 5.5. (GRAPR) Greedy Randomized Adaptive Path-relinking for FFMSP [FAR 05]
Algorithm 5.6. Evolutionary path-relinking for FFMSP [RES 10b, RES 04]
Algorithm 5.7. GRASP+PR for the FFMSP
Algorithm 5.8. CHOOSE-PR-STRATEGY – Function for selection of path-relinking strategy for FFMSP
Algorithm 5.9. (Hybrid) ACO for the FFMSP
Algorithm 5.10. RunACO(sbs ) function of Algorithm 5.9
6 Alignment Problems
Figure 6.1. An example of alignment of two sequences
Figure 6.2. All possible alignments of two sequences without gaps
Figure 6.3. Two possible alignments of two sequences: a) not inserting gaps with 6 pairings; b) inserting gaps with 7 pairings
Figure 6.4. Dot matrix representing the alignment in Figure 6.1
Figure 6.5. Matrix H generated by the application of Smith and Waterman’s algorithm to strings s = AAUGCCAUUGACGG and t = CAGCCUCGCUUAG
Figure 6.6. The alignment obtained for the simple example in Figure 6.5
7 Conclusions
Figure 7.1. a) The completely connected directed graph G with the spectrum S = {ACT, TGA, GAC, CTC, TAA} as vertex set. The edge weights (i.e. overlaps) are not indicated for readability reasons. For example, the weight on the edge from TGA to GAC is 2, because GA is the longest DNA strand that is a suffix of TGA and a prefix of GAC. An optimal solution is p∗ = (ACT, TGA, GAC, CTC). b) How to retrieve the DNA sequence that is encoded by p∗. Note that c(p∗) = 8, which is equal to the length of the encoded DNA sequence
Figure 7.2. a) A set of six recombinants in matrix form. Assuming that the number of founders (kf) is fixed to three, b) shows a valid solution, that is, a matrix of three founders. Denoting the first founder by a, the second founder by b, and the third one by c, c) shows a decomposition of the recombinant matrix into fragments taken from the founders. Note that breakpoints are marked by vertical lines. This is a decomposition with 8 breakpoints, which is the minimum value for this instance
This book is dedicated to my parents Maria and Dieter, who are currently passing through a difficult period of their lives.
(Christian Blum)
This book is dedicated to my daughters Iara, Mara, and Nara, the most beautiful gift that life could give me.
(Paola Festa)
Metaheuristics Set
coordinated by Nicolas Monmarché and Patrick Siarry
Metaheuristics for String Problems in Bio-informatics
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2016
The rights of Christian Blum and Paola Festa to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2016945029
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-812-3
DNA (deoxyribonucleic acid) acts as the information archive of most living beings. Because a strand of DNA can be expressed as a character string over a four-letter alphabet, so-called string problems have become abundant in bioinformatics and computational biology. Each year, new optimization problems dealing with DNA (or protein) sequences are formulated that require efficient optimization techniques. From this perspective, bioinformatics is a burgeoning field for optimization experts and computer scientists in general. In this book, we focus on a mixture of well-known and recent string optimization problems from the bioinformatics field, all of which are combinatorial in nature.
One of the obstacles for optimization practitioners is the atypical nature of these problems: although combinatorial, they are rather different from the classical traveling salesman problem or the quadratic assignment problem, so a different type of expertise is required to solve many of them efficiently. One of the main goals of this book is therefore to pass this expertise and experience on to newcomers to the field. The book provides several examples of very successful (hybrid) metaheuristics for solving specific string problems. One such example concerns the use of beam search (an incomplete branch and bound method) for solving longest common subsequence problems; the application of this algorithm in 2009 marked a breakthrough in the solution of this type of problem.
Finally, we would like to address a few words to the interested readers, especially biologists. We apologize for any imprecision in the description of biological processes, which we have tried to keep to a minimum. Keep in mind that, after all, we are only computer scientists and mathematicians.
Christian BLUM Paola FESTA June 2016
This work was supported by grant TIN2012-37930-C02-02 from the Spanish Government. Support from CSIC (Spanish National Research Council) and IKERBASQUE (Basque Foundation for Science) is also acknowledged. We thank RDlab1, a BarcelonaTech facility, for allowing us to perform the experiments partly in its high-performance computing environment.
1 http://rdlab.lsi.upc.edu.
ACO: Ant Colony Optimization
B&B: Branch & Bound
CMSA: Construct, Merge, Solve & Adapt
CO: Combinatorial Optimization
DNA: Deoxyribonucleic Acid
DP: Dynamic Programming
EA: Evolutionary Algorithm
GA: Genetic Algorithm
ILP: Integer Linear Programming
ILS: Iterated Local Search
IP: Integer Programming
LNS: Large Neighborhood Search
MCSP: Minimum Common String Partition
MSFBC: Most Strings With Few Bad Columns
RNA: Ribonucleic Acid
SA: Simulated Annealing
TS: Tabu Search
TSP: Traveling Salesman Problem
UMCSP: Unbalanced Minimum Common String Partition