Registered office: John Wiley & Sons, Ltd., The Atrium, Southern Gate, Chichester, West Sussex,
PO19 8SQ, UK

Editorial offices:   9600 Garsington Road, Oxford, OX4 2DQ, UK
                        The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
                        111 River Street, Hoboken, NJ 07030-5774, USA

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell.

The right of the author to be identified as the author of this work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author(s) have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Names: Attwood, Teresa K., author. | Pettifer, Stephen R. (Stephen Robert),
1970- author. | Thorne, David, 1981-, author.
Title: Bioinformatics challenges at the interface of biology and computer
science : mind the gap / Teresa K. Attwood, Stephen R. Pettifer and David
Thorne.
Description: Oxford : John Wiley & Sons Ltd., 2016. | Includes
bibliographical references and index.
Identifiers: LCCN 2016015332| ISBN 9780470035504 (cloth) | ISBN 9780470035481
(pbk.)
Subjects: LCSH: Bioinformatics.
Classification: LCC QH324.2 .A87 2016 | DDC 570.285–dc23 LC record available at https://lccn.loc.gov/2016015332

A catalogue record for this book is available from the British Library.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Cover images: Courtesy of the authors, apart from the DNA strand image: Getty/doguhakan.

Preface

0.1 Who this book is for, and why

As you pick up this book and flick through its pages, we'd like to interrupt you for a moment to tell you who we think this book is for, and why. Let's start with the why. One main thought prompted us to tackle this project. There are now many degree courses in bioinformatics¹ throughout the world, and many excellent accompanying textbooks. In general, these texts tend to focus either on how to use standard bioinformatics tools, or on how to implement the algorithms and get to grips with the programming languages that underpin them. The ‘how to use' approach is appropriate for students who want to become familiar with the tools of the trade quickly so that they can apply them to their own interests; the ‘what's behind them' approach might appeal more to students who want to go on to develop their own software and databases. Books like this are extremely useful, but there's often something missing – bioinformatics isn't just about writing faster programs or creating new databases; it's also about coupling solid computer science with an appreciation of how computers can (or can't) solve biological problems.

During the years we've been working together, we realised that this issue is seldom, if ever, tackled head-on – it's an issue rooted in the nature of bioinformatics itself. Bioinformatics is often described as the interface where computer science and biology meet and overlap, as shown in Figure 0.1.

However, the reality is that modern bioinformatics has become a discipline in its own right – bigger, broader and more complex than this seemingly trivial overlap might suggest, and bringing with it many new challenges of its own. The reality, then, tends to look more like Figure 0.2.

A common difficulty for teachers of bioinformatics courses is that they often have to cater for mixed audiences – for students with diverse backgrounds in biology and computer science. As a result, there generally isn't enough time to cover all of the basic biology and all of the basic bioinformatics, and even less time to deal, at an appropriate level, with some of the emerging issues in computer science and how these relate to bioinformatics. It's hard enough to do justice to each of these disciplines on its own, let alone to tackle the problems that emerge when they're brought together.

For teachers and students, this book is therefore an attempt to bridge the interdisciplinary gap, to address some of the challenges for bioinformatics at the interface of biology and computer science. Compared to other textbooks, it offers rather different perspectives on the challenges, and on the relationship between these disciplines necessary for the success of bioinformatics in the future. It will appeal, we hope, to undergraduate and postgraduate students who want to look beyond ‘how to use this tool' or ‘how to write that program'; it will speak to students who want to discover why the things we often want to do in bioinformatics are not as straightforward as they should be; it will resonate with those who want to understand how we can use computer science to make such tasks more straightforward than they currently are; and it will challenge the thoughtful to reflect on where the limits are in terms of what we want computers to be able to do and what's actually doable.

0.2 Who has written this book

The authors of the book have backgrounds in biophysics and computer science. We've worked together successfully for many years, but early in our relationship we often found it harder to communicate than we imagined it either could or should be. It wasn't just about meeting a different discipline head on and having to grapple with new terminology – the real problem was that we often thought we understood each other, only to find later that we actually meant quite different things, despite having used what appeared to be the same vocabulary. Learning new terminology is relatively easy; learning to spot that the thing you mean and the thing you think he means are different is much harder! It's a bit like trying to read the bottom line of an optician's test chart – the scale of the problem only comes into focus when viewed from another perspective: the optician provides the correct lens, and clarity dawns.

The scale of our communication problem was thrown into focus when we began developing software together. Initially, there was the usual learning curve to negotiate, as we each had to understand the jargon of the other's discipline. When our new software eventually emerged, we were invited to give a talk about it. It was only when we tried to write the talk, and stood back and looked at the work in a different way, that a kind of fog lifted – thinking we'd been speaking the same language, we'd actually reached this point by understanding different things by using the same words. This revelation provided a catalyst for the book: if we could so easily have reached apparent understanding through misunderstanding of basic concepts, could bioinformatics be a similar house of cards, waiting to topple? We reasoned that our different scientific perspectives, focused on some straightforward bioinformatics applications, might help to shed a little light on this question. So, let's now briefly introduce our different backgrounds – this should help to explain why the book is written in the way that it is, and why it contains what it does.

0.2.1 The biophysicist

The biophysicist isn't really a biologist and certainly isn't a computer scientist. She was engaged in her early postdoctoral work on protein sequence and structure analysis when the field of bioinformatics was just emerging; some of it rubbed off. The introductory chapters of the book therefore derive mostly from those early experiences, when the term bioinformatics was more or less synonymous with sequence analysis. Thus, sequence analysis provides an important backdrop for the book. Of course, it isn't the only theme, either of the book or of bioinformatics in general – it's just a convenient place to start, a place that's both historically and conceptually easy to build from, and one that's arguably even more relevant today than it was when the discipline began. To maintain an appropriate focus, then, we use sequence analysis as our springboard and, in deference to their expertise, we leave other aspects of bioinformatics to the relevant specialists in those particular fields.

0.2.2 The computer scientists

The computer scientists aren't really bioinformaticians and certainly aren't biologists. They began developing software tools in collaboration with bioinformaticians about ten years ago; in the process, a bit of bioinformatics rubbed off on them too. They come at the subject from various perspectives, including those of distributed systems, computer graphics, human–computer interaction and scientific visualisation. Especially important for this book, as you'll see if you view the online supplementary materials (which we encourage you to do), is their interest in the design of collaborative and semantically rich software systems for the biosciences, with a particular focus on improving access to scholarly literature and its underlying biochemical/biomedical data.

0.3 What's in this book

0.3.1 The scope

Rather than offering an authoritative exposition of current hot topics in bioinformatics, this book provides a framework for thinking critically about fundamental concepts, giving new perspectives on, and hence trying to bridge the gap between, where traditional bioinformatics is now and where computer science is preparing to take it in the future. In essence, it's an exploration of the philosophical divide between what we want computers to do for us in the life sciences and what it's actually possible to do, today, with current computer technology. You might wonder why this is interesting – don't all the other books out there deal with this sort of thing? And the answer is, no – this book is different. It isn't about how to do bioinformatics, computer science or biology; it's about what happens when all of these disciplines collide, about how to pick up the pieces afterwards, and about what's likely to happen if we can't put all the bits back together again in the right places.

The problem is, when bioinformatics began to emerge as a new discipline in the 1980s, it was, in a sense, a fix for what the biological sciences needed at the time and what computer science could provide at that time (mostly database technology and fast database-search algorithms). The discipline evolved in a very pragmatic way, addressing local problems with convenient, more-or-less readily available solutions. That, of course, is the very essence of evolution: there is no grand plan, no sweeping vision of the future – just a problem that needs to be solved, now, in the most efficient, economical way possible.

The ramifications for systems that evolve are profound: a system that hasn't been designed for the future, but just ‘does the job', will almost certainly be an excellent tool for precisely that job, at that time; however, it's unlikely to stand up to uses for which it wasn't originally devised. To be fit for new uses, the system will probably need to be modified in ad hoc ways, at each stage of the process gaining further complexity. Eventually, such an evolving system is likely to be confronted by new developments, developments that were not foreseen when the system began its evolutionary journey. At this juncture, we arrive at an all-too-familiar turning point: to move forward, either we must continue to ‘make do and mend' the system, to patch it up so that it will continue to ‘do the job', or we must take a deep breath and simply throw it away, and start again on a completely new path.

Many systems in bioinformatics have arrived at this point; many more will arrive there soon. So this is a good time to stop and reflect on how best to move forward. But it would be foolish to try to do this without first considering how bioinformatics reached this point in the first place. For this reason, the book begins, unashamedly, by rehearsing a little of the history of where bioinformatics came from; to balance things up, it includes a potted history of computer science too. It then looks at some of the things we want to be able to do now, things that weren't even thought of back then, and discusses why some of these things are so difficult (or impossible) to do with current systems. The book then crosses over to the other side, to look at the issues from a computer science perspective, and to explain what needs to happen in order to make some of the things we want to be able to do doable. Finally, it moves deeper into the emerging technologies, to consider some of the new problems that remain for the future.

0.3.2 The content

As a scientific discipline rooted in the (sequence) data-storage and data-analysis problems of the 1980s and 1990s, bioinformatics evolved hand-in-hand with the burgeoning of high-throughput biology, and now underpins many aspects of genomics, transcriptomics, proteomics, metabolomics and many other (some rather eccentrically named) ‘omics’. Bioinformaticians today must, therefore, not only understand the strengths and limitations of computer technologies, but must also appreciate the basic biology beneath the new high-throughput sciences.

Accordingly, the book is divided into two parts: the first reviews the fundamental concepts, tools and resources at the heart of bioinformatics; the second examines the relationship between these and some of the ‘big questions' in current computer science research. To give an idea: biological data may be of many different types, stored in many different formats, using a variety of different technologies, ranging from flat-files, to relational and object-oriented databases, to data warehouses. If we're to be able to use these repositories to help us address biological questions, we need to ask to what extent they realistically encapsulate the biological concepts in which we're interested. How accurately do they reflect the dynamic and complex nature of biological systems? To what extent do they allow us to perform seemingly simple tasks, such as comparing biological sequences and their 3-dimensional (3D) structures? To what extent do they allow us to reason with the data and to gain confidence that what we're looking at is, in some sense, true or meaningful? In exploring these issues, we examine the concepts involved in database design and organisation, the protocols and mechanisms by which data can be accessed in the modern distributed and networked environment, and we look at the higher-level issues of data integrity and reusability.

In a nutshell, Chapter 1 introduces the main themes of the book (bioinformatics and computer science) and generally sets the scene; Chapter 2 reviews biology's fundamental building-blocks, both to introduce some basic biological terms and concepts, and to provide sufficient context to start thinking about collecting and archiving biological data; Chapter 3 discusses the emergence of high-throughput sequencing techniques, the consequent data explosion and the spread of biological databases; and Chapter 4 explores the fundamental methods that underpin sequence analysis, with a perspective focused by the need to annotate and add meaning to raw sequence data. At this point, we cross over into Chapter 5 and step into The Gap, where some of the challenges at the interface of computer science, bioinformatics and biology arise. Here, we begin to reflect in earnest on how we humans understand the meaning of data, and how much harder it is to teach computers to achieve the same level of comprehension. From here, we move into the second, more technical part of the book: Chapter 6 delves deep into the way in which computers process data, exploring the twin themes of algorithms and complexity, and how understanding these has profound implications for building efficient bioinformatics tools; Chapter 7 considers the thorny issues of data representation and meaning (touching on aspects of semantic integration, ontologies, provenance, etc.), and examines the implications for building bioinformatics databases and services; Chapter 8 exposes the complexity of linking scientific facts hidden in databases with knowledge embodied in the scientific literature, and includes a case study that draws on many of the concepts, and highlights many of the challenges, outlined in the preceding chapters. We end with an Afterword, in which we try to reflect on how far we've succeeded in making many of our bioinformatics dreams a reality, and what challenges remain for the future.
Finally, a comprehensive glossary serves as a reference for some of the biology, bioinformatics and computer science jargon that peppers this text.

Figure 0.3 gives an overview of the book, with suggested routes through it for different types of reader: e.g., after the Introduction, a bioinformatician might want to skip directly to ‘Biological sequence analysis’, moving through ‘The Gap’ and on to Part 2, referring back to the biological material as appropriate; a computer scientist might want to study Part 1 and ‘The Gap’ chapters, before moving to the more esoteric discussions in ‘Linking data and scientific literature’, referring back to the other computer science chapters as necessary; and a biologist might wish to skip the ‘Biological context’, but otherwise take the chapters as they come.

0.4 What else is new in this book

Most books today come with a variety of online extras, such as image galleries, student tests, instructor websites and so on, to complement the printed work. Alongside this book, one of the chapters is provided online in a form suitable for being read using Utopia Documents, a software suite that confers on the digital text all kinds of interactive functionality (i.e., integrated with the chapter directly, not optional extras on a separate website).

To read the chapter online, we recommend that you download the Utopia Documents PDF reader. With this software, as you journey through the chapter, you'll experience new dimensions that aren't possible or accessible from the print version. Of course, you can read the chapter without installing the software; but, for those of you reading the online version, additional functionality will be revealed through the animating lens of Utopia Documents, and the overall experience will be more interesting. You can install the software from getutopia.com. Installation is easy: simply follow the link to the software download, and the guidance notes will talk you through the process for your platform of choice.

Once you've successfully downloaded Utopia Documents, you'll be ready for the new experience. As you read on, the chapter's interactive features will gradually unfold and will grow in complexity. We invite you to explore this functionality at your leisure (for the more adventurous, full documentation is available from the installation site).

0.5 What's not in this book, and why

Having outlined what you can expect to find in this book, we should emphasise again what you shouldn't expect to find, and why. The book doesn't provide a manual on how to use your favourite software or database; it doesn't offer ‘teach yourself’ Perl, Java, machine learning, probabilistic algorithms, etc.
