Contents

Title Page

Dedication

Contents

Preface

For the Instructor

For the Student

Acknowledgments

About the Authors

Chapter –1. Getting Started

Chapter 00. Introduction

Activity 0.1: Biological Databases and Data Storage

Chapter 01. BLAST

Activity 1.1: BLAST Algorithm

Chapter 02. Protein Analysis

Activity 2.1: Hydrophobicity Plotting
Activity 2.2: Protein Secondary Structure Prediction

Chapter 03. Sequence Alignment

Activity 3.1: Dynamic Programming

Chapter 04. Patterns in the Data

Activity 4.1: Protein Sequence Motifs
Activity 4.2: Position-Specific Weight Matrices

Chapter 05. RNA Structure Prediction

Activity 5.1: RNA Structure Prediction

Chapter 06. Phylogenetics

Activity 6.1: Phylogenetic Analysis

Chapter 07. Probability: All Mutations are not Equal (-ly Probable)

Activity 7.1: Generating PAM and BLOSUM Substitution Matrices

Chapter 08. Bioinformatics Programming: A Primer

Index

CHAPTER

-1

GETTING STARTED

Using the Website

Direct your browser on your phone, computer, or tablet to the following website:

http://www.kelleybioinfo.org

There you will see the homepage, as shown at right.

Touching or clicking an icon (e.g., “Alignment”) will take you to a new page that has tools related to the icon topic. The Alignment, Motifs, and Phylogeny buttons teach algorithms and tools for many types of sequence analysis with DNA, RNA, and proteins. The Protein and RNA buttons focus on algorithms for predicting structural features of the functional macromolecules, while the Probability button teaches how to generate substitution matrices.

Example: The Alignment Page

Clicking or touching the Alignment button will take you to the following page, which begins with the BLAST algorithm interactive tool. All the pages use this basic design.

General features

Information on the interactive learning tool

Tutorials and test data for online bioinformatics software

While most of the pages look like the Alignment page, the Basics page is organized differently and mostly contains information and tutorials.

How To Use This Book

I will assume you are familiar with how to read/use a book, but remember that the physical book is meant to be used in conjunction with the online component. Throughout the text you will be directed to online modules via URLs and QR codes. The online material is not supplemental, but is a critical portion of this hypertextbook.

CHAPTER

00

INTRODUCTION

The word bioinformatics refers to the computational analysis of complex biological data. The “bio-” prefix indicates biology, of course, while “informatics” is the science of data processing, storage, and retrieval (a.k.a. information science) that first developed in the 1960s. Bioinformatics itself dates back to the early 1970s, when computers were first used to analyze molecular sequences. While our knowledge of biological processes, the amount of molecular data, and the speed and throughput of computation have all expanded dramatically, the field of bioinformatics still primarily focuses on the analysis of three critical biological molecules: DNA, RNA, and protein.

These molecules are critical to the cellular processes of all living organisms, and the analysis of the composition and patterns of these molecules should in theory reveal all the secrets of life. (Or, as Dr. Frankenstein would say, “It’s alive! Bwahaha!”) In fact, because DNA encodes the information for all the RNA and protein in every cell, analysis of DNA sequence patterns makes up the majority of bioinformatics. RNA and protein sequences are also analyzed using specific bioinformatics algorithms, but the sequences of these molecules are often computationally determined from the DNA sequence in one way or another (see below).

The purpose of this chapter is to explain the general properties of these biological molecules and how they are represented and stored in the computer. It is critical to understand the connection between the data you observe in computer files and the biological molecules. Otherwise, the analyses of these data and the databases that store them will make little sense. We also briefly discuss what is known as the central dogma of molecular biology, how the DNA inside cells is “read” by the cellular machinery, and the general structure of the gene. The introductions to each chapter provide additional background information about the structure and function of DNA, RNA, and proteins and how bioinformatics can be used to analyze different aspects of these molecules.

Why Bioinformatics?

When nonscientists ask me what I do for a living, I tell them I’m a computational biologist. This oxymoron usually elicits a confused expression (“You study the biology of computers? Say what?”). I quickly follow this by asking them if they have heard of DNA and the human genome, which most people have by now. Then I tell them that the DNA that makes up the human genome is really just 3 BILLION LETTERS in a computer. Here is a little snippet of DNA information from the human genome:

After examining this DNA sequence, try answering the following questions:

Is this actually human DNA? If not, what organism is it from?
What is its biological function?
Is it going to kill you? (Hey, it could be poliovirus DNA. How would you know?)

My guess is that without a computer and bioinformatics, you don’t stand a chance of answering these questions correctly. The above DNA sequence information represents a small fragment of human DNA on chromosome 12. (Or it could be from a vampire bat. Keep reading to find out!) In fact, the entire human genome contains 1,000 times this much information. (BTW, this information is called sequence information because it is linked together as a sequence of letters.) And look how boring it is! The same 4 letters (A, G, C, and T) over and over again in different combinations. RNA and protein information looks pretty similar in the computer, except that one can tell RNA sequence data apart because it contains U instead of T. Protein sequence data are also easy to differentiate because up to 21 different letters representing the various amino acids are used in the sequences.
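To make the letter-based distinction concrete, here is a minimal Python sketch that guesses the molecule type from the letters in a sequence. The function name `guess_molecule` and the example sequences are made up for illustration; they are not part of the book's website or tools. The rule is only a heuristic: a short protein that happens to use only A, C, G, and T would be reported as DNA, and a sequence with no T or U at all is reported as DNA by default.

```python
# Hypothetical helper: guess the molecule type from the alphabet alone.
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")


def guess_molecule(sequence: str) -> str:
    """Return "DNA", "RNA", or "protein" based only on the letters present."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    if letters <= DNA_LETTERS:   # only A, C, G, T (ambiguous cases land here)
        return "DNA"
    if letters <= RNA_LETTERS:   # contains U but no T
        return "RNA"
    return "protein"             # anything else is assumed to be amino acids


# Made-up example sequences, not taken from any real genome.
print(guess_molecule("GATTACA"))     # DNA
print(guess_molecule("GAUUACA"))     # RNA
print(guess_molecule("MKTAYIAKQR"))  # protein
```

Real sequence files can also contain ambiguity codes such as N, which this sketch would classify as protein, so a production tool would need a more careful rule.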

Here is some RNA sequence information:

And here is some protein sequence information:

Granted, the protein sequence is a little more interesting, but it is still pretty mind-numbing to stare at all day. However, mind-numbing tasks are exactly what computers were built for: determining the positions of all the known stars in our galaxy, calculating compound interest for billions of bank accounts, and searching through all the house cat video URLs on the Internet, among other things.

In fact, the amount of molecular sequence data has grown so vast, and the technologies for generating DNA sequence information from organisms have become so efficient, that computer processor and hard drive technologies are starting to fall behind the rate of biological information generation. The graph in Fig. 0.1 shows the exponential growth in the databanks from 1982 to 2008.

FIGURE 0.1. Exponential growth of GenBank database from 1982 to 2008. Courtesy of National Library of Medicine.

In 2008, the quantity of data depicted in Fig. 0.1 was considered extreme. Now a single researcher can, in a single day, generate DNA sequence information equivalent to all the sequence information available in 2008.

So, what can one do with all these data? That question is the principal subject of this book, namely, how bioinformatics algorithms and banks of fancy computers can make sense of this growing mountain of molecular sequence data. In the next few sections you will learn a little about these critical biological molecules and how letters of the alphabet can be used to represent and store them in computers. Then, having provided a basic understanding of molecular biology and its computational representation, the rest of the book will focus on teaching you about the algorithms used to analyze these data.

DNA in the Computer

Deoxyribonucleic acid, otherwise (thankfully) known as DNA, is life’s single most important molecule. DNA underpins virtually all of biology. With the exception of a few viruses,1 life encodes itself using the chemical nucleotides of the DNA double helix. Every living cell contains its molecular information in the form of DNA, including the trillions of cells in the human body and every animal, plant, fungal, and bacterial cell on the planet. Placed end to end, the DNA from a single human cell could stretch an astonishing 3 meters.2 The amount of raw information contained in this 3-meter length of DNA is similarly remarkable. The unit of molecular information in DNA is the nucleotide. Because the human genome, the complete set of all the DNA information contained in each cell, is approximately 3 billion nucleotides long, it contains approximately 3 billion pieces of information.
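As a rough sense of scale, the short Python sketch below estimates how much computer storage 3 billion nucleotides would need. It rests on two assumptions that are not from the text: that a plain-text file uses one 8-bit character per nucleotide, and that a packed format could use 2 bits per nucleotide (enough to distinguish four letters). The function name `storage_bytes` is hypothetical.

```python
# Approximate nucleotide count for the human genome, as given in the text.
GENOME_LENGTH = 3_000_000_000


def storage_bytes(n_nucleotides: int, bits_per_nucleotide: int) -> int:
    """Bytes needed to hold n nucleotides, rounded up to a whole byte."""
    total_bits = n_nucleotides * bits_per_nucleotide
    return (total_bits + 7) // 8


# Assumption: one 8-bit character per nucleotide in a plain-text file.
as_text = storage_bytes(GENOME_LENGTH, 8)
# Assumption: 2 bits per nucleotide, since 2 bits can encode 4 letters.
packed = storage_bytes(GENOME_LENGTH, 2)

print(f"text: {as_text / 1e9:.2f} GB, packed: {packed / 1e9:.2f} GB")
```

Under these assumptions, one copy of the genome takes about 3 GB as plain text and about 0.75 GB when packed, before any annotation or compression.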

Theoretically, the ability to read and interpret this chemical code should allow us to learn a great deal about how cells and organisms function and interact. The chapters of this book show ways in which the combination of experimentation and DNA sequence analysis3 can reveal powerful new insights into molecular processes, cellular mechanisms, disease, and biodiversity. However, before we can analyze DNA sequences in the computer, we must first store the DNA information in a computer. How do we represent the complex biochemical structure of DNA in a computer?

Figure 0.2