COMPUTATIONAL

BIOLOGY

A HYPERTEXTBOOK

 

SCOTT T. KELLEY

Department of Biology

San Diego State University
San Diego, California

AND

DENNIS DIDULO

Becton, Dickinson and Company

San Diego, California

Washington, DC

 

Copyright © 2018 American Society for Microbiology. All rights reserved. No part of this publication may be reproduced or transmitted in whole or in part or reused in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Disclaimer: To the best of the publisher’s knowledge, this publication provides information concerning the subject matter covered that is accurate as of the date of publication. The publisher is not providing legal, medical, or other professional services. Any reference herein to any specific commercial products, procedures, or services by trade name, trademark, manufacturer, or otherwise does not constitute or imply endorsement, recommendation, or favored status by the American Society for Microbiology (ASM). The views and opinions of the author(s) expressed in this publication do not necessarily state or reflect those of ASM, and they shall not be used to advertise or endorse any product.

Library of Congress Cataloging-in-Publication Data

Names: Kelley, Scott T. (Scott Theodore), author. | Didulo, Dennis, author.

Title: Computational biology : a hypertextbook / Scott T. Kelley, Department of Biology, San Diego State University, San Diego, California, and Dennis Didulo, Becton, Dickinson and Company, San Diego, California.

Description: Washington, DC : ASM Press, [2018] | Includes index.

Identifiers: LCCN 2017051454 (print) | LCCN 2017052307 (ebook) | ISBN 9781683673033 (ebook) | ISBN 9781683670025 (pbk.)

Subjects: LCSH: Computational biology.

Classification: LCC QH324.2 (ebook) | LCC QH324.2 .K45 2018 (print) | DDC 570.285--dc23

LC record available at https://lccn.loc.gov/2017051454

All Rights Reserved

Printed in the United States of America

Address editorial correspondence to

ASM Press, 1752 N St., N.W.,

Washington, DC 20036-2904, USA

Send orders to ASM Press, P.O. Box 605, Herndon, VA 20172, USA

Phone: 800-546-2416; 703-661-1593

Fax: 703-661-1501

E-mail: books@asmusa.org

Online: http://www.asmscience.org

 

To Kina and Aidan, my wonderful and supportive family.

And to my brother Brian, who selflessly donated his kidney, without which I would not have had the energy to write this book.

 

CONTENTS

Preface

For the Instructor

For the Student

Acknowledgments

About the Authors

CHAPTER –1    Getting Started

CHAPTER 00    Introduction

Activity 0.1: Biological Databases and Data Storage

CHAPTER 01    BLAST

Activity 1.1: BLAST Algorithm

CHAPTER 02    Protein Analysis

Activity 2.1: Hydrophobicity Plotting

Activity 2.2: Protein Secondary Structure Prediction

CHAPTER 03    Sequence Alignment

Activity 3.1: Dynamic Programming

CHAPTER 04    Patterns in the Data

Activity 4.1: Protein Sequence Motifs

Activity 4.2: Position-Specific Weight Matrices

CHAPTER 05    RNA Structure Prediction

Activity 5.1: RNA Structure Prediction

CHAPTER 06    Phylogenetics

Activity 6.1: Phylogenetic Analysis

CHAPTER 07    Probability: All Mutations are not Equal (-ly Probable)

Activity 7.1: Generating PAM and BLOSUM Substitution Matrices

CHAPTER 08    Bioinformatics Programming: A Primer

Index

 

PREFACE

This textbook is a hypertextbook. Half of the textbook material lies between the pages of this book and the other half on the Internet. It seems natural that a hypertextbook, which combines print and online apps for mobile technology, would be a great way to learn the basics of bioinformatics, which uses informatics (computational) theory to study biological data.

This book was born out of a mix of necessity and inspiration.[1] The necessity came from the dearth of bioinformatics instructional materials appropriate for my combination of biology students, with little or no computer background,[2] and computer science students, who were interested in the field but had little understanding of biology. The need became acute when I learned that my favorite bioinformatics lab manual, Bioinformatics for Dummies (BFD[3]), would no longer be updated. BFD was a great lab manual for learning how to perform basic bioinformatics data analysis. It did not explain the principles behind the algorithms, but I could cover those during lectures. BFD was clear and fun to read and provided practical skills for biologists and others looking to analyze data. Unfortunately, the most recent version was printed in 2007!

I kept using the old edition of BFD for some time, but eventually the tutorials became obsolete and the students took longer and longer to complete the exercises. In fact, several passages of BFD were obsolete a few months after the book was printed. Bioinformatics websites are constantly changing, including their designs and the URL links, and sometimes the pages themselves disappear altogether. Since I began writing this book, two of the websites I teach in the book and online materials changed significantly, and one disappeared altogether.

This led to my original inspiration for the hyper- part of this hypertextbook. What if I made my own bioinformatics tutorials and sample test data for commonly used analysis tools online in easily updated files? That way, when a link changed or the programmers moved a radio button around, I could easily alter the tutorial to reflect these changes in real time. Students would not have to wait for a new version of a book to have an accurate tutorial.

The next inspiration arose from my use of paper-based puzzles and problems to teach the bioinformatics algorithms. The problems I taught in class, combined with the anticipatory conceptual exercises and lecture material, were very successful for teaching how the methods worked. Unfortunately, paper-based problems also had significant drawbacks: the students were given only one practice problem per algorithm, and they received very little feedback as a result. Typically, I would (1) teach the method, (2) do an example with the students in class, (3) assign it for homework, (4) get it back a week later, and (5) return it with feedback a week after that. And that was it.

Fortunately, I realized that the structure of the algorithm puzzles I taught would be perfect for touchscreen devices and laptops. Most of them involved either sliding letters around or filling in boxes with numbers, both easily done with a finger or a mouse. With the collaboration of my bioinformatics website designer and coauthor Dennis Didulo, I created interactive learning tools that provide limitless practice and instant feedback for students. When we combined the bioinformatics software tutorials and test data into one site, we had a comprehensive learning paradigm for introductory bioinformatics. (See “For the Student” section below for an outline of the website features.)

In my class, I noticed an immediate increase in algorithm comprehension and problem-solving ability. Students gained much more practice, received more feedback, and performed much better on tests. Because new problems could easily be generated at random, each student had a personal data set. Best of all, I could now quickly generate new exam questions and answers with the click of a button!

And what did my students think? These quotes speak for themselves.

“It is a wonderful learning tool. The online programs made learning the algorithms almost easy.”—Ruby, undergraduate student

“I didn’t want to tell you how much I liked the website because I didn’t want your ego to get too big.”—Emily, undergraduate student

“It was much better than that bioinformatics cat video.”—Pedro, graduate student

“You can learn bioinformatics while waiting in line at the DMV or sitting on your couch eating cheese puffs!”—Anonymous

Notes

  1. Much like the invention of the salad spinner.

  2. Many biology students tell me flatly that they are “bad with computers” or even state that “computers hate [them].” For the record, computers really don’t care about you at all. Which is why we should never give them weapons (see the film “Terminator”).

  3. BFD, the budding bioinformatician’s BFF.

 

FOR THE INSTRUCTOR

This hypertextbook can be used in a number of ways: in a lecture or online course, as an outline for a course, or by drawing on just the sections of interest. Because this is a hypertextbook, the web components are not supplemental; they are crucial to understanding the content presented in the physical book. In my classes, I use the interactives in class, and the students also use them outside of class to help them solve algorithm problems or prepare for exams. Generally, I use the following approach:

  1. Teach the biological relevance and background of the method.
  2. Have students solve the conceptual (anticipatory) exercise in class, sharing answers with one another.
  3. Lecture on the basics of the algorithm.
  4. Have students bring out their mobile devices (laptops, smartphones, and tablets) and solve the interactive problems.
  5. Have students share their answers with neighbors in class and with the instructor.

Then, to make sure the students practice at home, I assign the paper-based practice problems. Finally, in the computer lab, or for homework, I assign the lab exercises with the software based on the algorithms.

So far, the approach has been a great success in my classes. The online tools increase comprehension and improve exam results, and the easily updated tutorials for bioinformatics analysis software and biological databases reduce a lot of student frustration. I hope it proves as successful in your class as it has in mine.

FOR THE STUDENT

This textbook is really a hypertextbook, meaning that much of the most exciting learning happens online. Close to half of the book materials are online, and in each chapter you will be directed to the online tools associated with the text. The idea is to leverage the uniquely powerful aspects of the Internet to help you learn about bioinformatics. The puzzle-like nature of bioinformatics algorithms makes them especially suited to interactivity and “gamification” (making difficult things into games with points and scores). The interactive nature of mobile devices and their connection to online bioinformatics software make them useful learning tools for understanding the theory behind bioinformatics methods (the algorithms) and for gaining practical experience with their implementation (software analysis and databases). In order to enhance learning of the principles behind bioinformatics algorithms and make them more engaging, the online resources have been designed to

 

ACKNOWLEDGMENTS

I wish to thank the leadership of the California State University Program in Education, Research, and Biotechnology (CSUPERB) and the grant reviewers who approved my proposal on mobile app education technology that provided the seed money for developing the interactive technology and web resources. I thank Greg Payne at ASM Press for listening to my ideas and taking them seriously and for his support during the writing and publishing process, and I thank my colleague at SDSU, David Lipson, for telling Greg about my project. I thank the hundreds of bioinformatics students who took my course at SDSU, who helped me refine my algorithm teaching methods from their sub-alpha development pencil-and-paper stages all the way through to the interactive app stage. You are the reason I do all this in the first place. I give special thanks to my spouse, Kina Thackray, for her advice during the long process of developing the bioinformatics learning algorithms, for encouraging me to submit grant proposals, and for her very helpful comments on multiple drafts of the book. Finally, I need to thank my former biometry professor, Dr. Michael Grant, who taught me statistics and introduced me to programming (SAS), and Dr. Gary Stormo, who graciously allowed me to pursue bioinformatics as a postdoctoral researcher in his lab.

 

ABOUT THE AUTHORS

Scott T. Kelley is a Professor of Biology at San Diego State University. He has a Ph.D. from the University of Colorado and a B.A. from Cornell University. His lab at San Diego State University combines phylogenetic methods and culture-independent molecular tools to study environmental microbiology. Dr. Kelley has published extensively on the human microbiome, the built environment, and many natural environments. He has published many papers on bioinformatics and has helped develop some widely used tools for analyzing next-generation sequence data sets for microbial communities. He has received research grants from the National Institutes of Health, the National Science Foundation, the Alexander von Humboldt Foundation, and the Alfred P. Sloan Foundation, among others. He has served on the scientific advisory board of the Clorox Company, and his work has been featured by The New York Times, NPR, CBC (Canada), Time Magazine, and Der Spiegel, among numerous others. He is a massive fan of the FC St. Pauli and Everton FC football clubs; loves punk rock, jazz, and classical music; speaks German for fun; and makes a mean apple pie. You can follow Scott on Twitter @kelleybioinfo.

Dennis Didulo has been a Data Analytics/Software Engineer at CareFusion since 2014 and a Software Test Engineer at Becton, Dickinson and Company since 2016. He also teaches online database and programming courses for the University of Maryland University College. He received his master’s degree in information technology at De La Salle University and his master’s degree in bioinformatics at San Diego State University. Dennis has professional development expertise in more than a dozen computer languages, as well as expertise in database management, algorithm design, and systems engineering. Dennis is a proud father of five grown children, whom he surprised by flying back unannounced to the Philippines for a visit.

CHAPTER

-1

GETTING STARTED

Using the Website

Direct your browser on your phone, computer, or tablet to the following website:

http://www.kelleybioinfo.org

There you will see the homepage, as shown at right.

Touching or clicking an icon (e.g., “Alignment”) will take you to a new page that has tools related to the icon topic. The Alignment, Motifs, and Phylogeny buttons teach algorithms and tools for many types of sequence analysis with DNA, RNA, and proteins. The Protein and RNA buttons focus on algorithms for predicting structural features of the functional macromolecules, while the Probability button teaches how to generate substitution matrices.

Example: The Alignment Page

Clicking or touching the Alignment button will take you to the following page, which begins with the BLAST algorithm interactive tool. All the pages use this basic design.

General features

Information on the interactive learning tool

Tutorials and test data for online bioinformatics software

While most of the pages look like the Alignment page, the Basics page is organized differently and mostly contains information and tutorials.

How To Use This Book

I will assume you are familiar with how to read/use a book, but remember that the physical book is meant to be used in conjunction with the online component. Throughout the text you will be directed to online modules via URLs and QR codes. The online material is not supplemental, but is a critical portion of this hypertextbook.

CHAPTER

00

INTRODUCTION

The word bioinformatics refers to the computational analysis of complex biological data. The “bio-” prefix indicates biology, of course, while “informatics” is the science of data processing, storage, and retrieval (a.k.a. information science) that first developed in the 1960s. Bioinformatics itself dates back to the early 1970s, when computers were first used to analyze molecular sequences. While our knowledge of biological processes, the amount of molecular data, and the speed and throughput of computation have all expanded dramatically, the field of bioinformatics still primarily focuses on the analysis of three critical biological molecules: DNA, RNA, and protein.

These molecules are critical to the cellular processes of all living organisms, and the analysis of the composition and patterns of these molecules should in theory reveal all the secrets to life. (Or, as Dr. Frankenstein would say, “It’s alive! Bwahaha!”) In fact, because DNA encodes the information for all the RNA and protein in every cell, analysis of DNA sequence patterns comprises the majority of bioinformatics. RNA and protein sequences are also analyzed using specific bioinformatics algorithms, but the sequences of these molecules are often computationally determined from the DNA sequence in one way or another (see below).
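To make "computationally determined from the DNA sequence" a little more concrete, here is a minimal Python sketch of the idea. It assumes the standard genetic code, reads a coding-strand DNA sequence from its first letter, and stops at the first stop codon. The function names and the example sequence are made up for illustration; real software also has to deal with reading frames, introns, and alternative genetic codes.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G
# at each of the three codon positions ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reading starts at the first letter and ends at the first stop codon,
    or when fewer than three letters remain. Codons containing letters
    other than A, C, G, and T are reported as "X" (unknown).
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


example = "ATGAAATGA"  # hypothetical sequence, for illustration only
print(transcribe(example))  # AUGAAAUGA
print(translate(example))   # MK
```

The point of the sketch is only that, once DNA is stored as letters, the corresponding RNA and protein letters follow from simple string operations.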

The purpose of this chapter is to explain the general properties of these biological molecules and how they are represented and stored in the computer. It is critical to understand the connection between the data you observe in computer files and the biological molecules. Otherwise the data analysis and databases that store these data will make little sense. We also briefly discuss what is known as the central dogma of molecular biology, how the DNA inside cells is “read” by the cellular machinery, and the general structure of the gene. The introductions to each chapter provide additional background information about the structure and function of DNA, RNA, and proteins and how bioinformatics can be used to analyze different aspects of these molecules.

Why Bioinformatics?

When nonscientists ask me what I do for a living, I tell them I’m a computational biologist. This oxymoron usually elicits a confused expression (“You study the biology of computers? Say what?”). I quickly follow this by asking them if they have heard of DNA and the human genome, which most people have by now. Then I tell them that the DNA that makes up the human genome is really just 3 BILLION LETTERS in a computer. Here is a little snippet of DNA information from the human genome:

After examining this DNA sequence, try answering the following questions:

  1. Is this actually human DNA? If not, what organism is it from?
  2. What is its biological function?
  3. Is it going to kill you? (Hey, it could be poliovirus DNA. How would you know?)

My guess is that without a computer and bioinformatics, you don’t stand a chance of answering these questions correctly. The above DNA sequence information codes for a small fragment of human DNA on chromosome 12. (Or it could be from a vampire bat. Keep reading to find out!) In fact, the entire human genome contains 1,000 times this much information. (BTW, this information is called sequence information because it is linked together as a sequence of letters.) And look how boring it is! The same 4 letters—A, G, C, and T—over and over again in different combinations. RNA and protein information looks pretty similar in the computer, except that one can tell RNA sequence data apart because it contains U instead of T. Protein sequence data are also easy to differentiate because up to 21 different letters representing the various amino acids are used in the sequences.
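The letter-based differences described above are enough for a computer to make a rough guess at what kind of molecule a sequence represents. The following Python sketch is one hypothetical way to do it; the function name and the alphabets are my own assumptions, with the protein alphabet taken as the 20 standard amino acid letters plus U for selenocysteine.

```python
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
# Assumed protein alphabet: 20 standard amino acids plus U (selenocysteine).
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYU")


def guess_sequence_type(sequence):
    """Guess whether a string of letters is DNA, RNA, or protein.

    This is a heuristic only: a sequence made solely of A, C, and G is
    labeled DNA even though it could be RNA, and a short protein built
    only from the amino acids A, C, G, and T would be mislabeled as DNA.
    """
    letters = set(sequence.upper())
    if not letters:
        return "empty"
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


# Made-up example sequences, for illustration only.
print(guess_sequence_type("GATTACA"))     # DNA
print(guess_sequence_type("GAUUACA"))     # RNA
print(guess_sequence_type("MKWVTFISLL"))  # protein
```

Checking set membership rather than counting letters keeps the logic short, at the cost of the ambiguities noted in the docstring.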

Here is some RNA sequence information:

And here is some protein sequence information:

Granted, the protein sequence is a little more interesting, but it is still pretty mind-numbing to stare at all day. However, mind-numbing tasks are exactly what computers were built for: determining the positions of all the known stars in our galaxy, calculating compound interest for billions of bank accounts, and searching through all the house cat video URLs on the Internet, among other things.

In fact, the amount of molecular sequence data has grown so vast, and the technologies for generating DNA sequence information from organisms have become so efficient, that computer processor and hard drive technologies are starting to fall behind the rate of biological information generation. The graph in Fig. 0.1 shows the exponential growth in the databanks from 1982 to 2008.

FIGURE 0.1. Exponential growth of GenBank database from 1982 to 2008. Courtesy of National Library of Medicine.

In 2008, the amount of information depicted in Fig. 0.1 was considered extreme. Now a single researcher can, in a single day, generate DNA sequence information equivalent to the total sequence information available in 2008.

So, what can one do with all these data? That question is the principal subject of this book, namely, how bioinformatics algorithms and banks of fancy computers can make sense of this growing mountain of molecular sequence data. In the next few sections you will learn a little about these critical biological molecules and how letters of the alphabet can be used to represent and store them in computers. Then, having provided a basic understanding of molecular biology and its computational representation, the rest of the book will focus on teaching you about the algorithms used to analyze these data.

DNA in the Computer

Deoxyribonucleic acid, otherwise (thankfully) known as DNA, is life’s single most important molecule. DNA underpins virtually all of biology. With the exception of a few viruses,[1] life encodes itself using the chemical nucleotides of the DNA double helix. Every living cell contains its molecular information in the form of DNA, including the tens of trillions of cells in the human body and every animal, plant, fungal, and bacterial cell on the planet. Placed end to end, the DNA from a single human cell could stretch an astonishing 3 meters.[2] The amount of raw information contained in this 3-meter length of DNA is similarly remarkable. The unit of molecular information in DNA is the nucleotide, and the human genome is roughly 3 billion nucleotides long. Thus, the human genome, the complete set of all the DNA information contained in each cell, contains approximately 3 billion pieces of information.
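How much computer storage do 3 billion pieces of information need? The back-of-the-envelope Python sketch below compares two assumed encodings: plain text, where each letter occupies one 8-bit character, and a packed form that spends only 2 bits per letter, which is enough to distinguish four nucleotides. The function name is hypothetical, and real file formats add headers, line breaks, and extra symbols for ambiguous bases, so treat the results as rough estimates.

```python
GENOME_LENGTH = 3_000_000_000  # approximate nucleotide count given in the text


def storage_bytes(n_letters, bits_per_letter):
    """Return the number of bytes needed to store n_letters, rounded up."""
    total_bits = n_letters * bits_per_letter
    return (total_bits + 7) // 8  # integer ceiling of total_bits / 8


as_text = storage_bytes(GENOME_LENGTH, 8)  # one character per letter
packed = storage_bytes(GENOME_LENGTH, 2)   # four letters per byte

print(f"Plain text: {as_text / 1e9:.2f} GB")  # Plain text: 3.00 GB
print(f"2-bit code: {packed / 1e9:.2f} GB")   # 2-bit code: 0.75 GB
```

Either way, one genome's worth of letters fits comfortably on an ordinary laptop.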

Theoretically, the ability to read and interpret this chemical code should allow us to learn a great deal about how cells and organisms function and interact. The chapters of this book show ways in which the combination of experimentation and DNA sequence analysis[3] can reveal powerful new insights into molecular processes, cellular mechanisms, disease, and biodiversity. However, before we can analyze DNA sequences in the computer, we must first store the DNA information in a computer. How do we represent the complex biochemical structure of DNA in a computer?

Figure 0.2