Contents
Preface
Part One Introduction to Systems Biology
1 Introduction
1.1 Biology in Time and Space
1.2 Models and Modeling
1.3 Basic Notions for Computational Models
1.4 Data Integration
1.5 Standards
References
2 Modeling of Biochemical Systems
2.1 Kinetic Modeling of Enzymatic Reactions
2.2 Structural Analysis of Biochemical Systems
2.3 Kinetic Models of Biochemical Systems
2.4 Tools and Data Formats for Modeling
References
3 Specific Biochemical Systems
3.1 Metabolic Systems
3.2 Signaling Pathways
3.3 The Cell Cycle
3.4 Spatial Models
3.5 Apoptosis
References
4 Model Fitting
4.1 Data for Small Metabolic and Signaling Systems
4.2 Parameter Estimation
4.3 Reduction and Coupling of Models
4.4 Model Selection
References
5 Analysis of High-Throughput Data
5.1 High-Throughput Experiments
5.2 Analysis of Gene Expression Data
References
6 Gene Expression Models
6.1 Mechanisms of Gene Expression Regulation
6.2 Gene Regulation Functions
6.3 Dynamic Models of Gene Regulation
References
7 Stochastic Systems and Variability
7.1 Stochastic Modeling of Biochemical Reactions
7.2 Fluctuations in Gene Expression
7.3 Variability and Uncertainty
7.4 Robustness
References
8 Network Structures, Dynamics, and Function
8.1 Structure of Biochemical Networks
8.2 Network Motifs
8.3 Modularity
References
9 Optimality and Evolution
9.1 Optimality and Constraint-Based Models
9.2 Optimal Enzyme Concentrations
9.3 Evolutionary Game Theory
References
10 Cell Biology
10.1 Introduction
10.2 The Origin of Life
10.3 Molecular Biology of the Cell
10.4 Structural Cell Biology
10.5 Expression of Genes
References
11 Experimental Techniques in Molecular Biology
11.1 Introduction
11.2 Restriction Enzymes and Gel Electrophoresis
11.3 Cloning Vectors and DNA Libraries
11.4 1D and 2D Protein Gels
11.5 Hybridization and Blotting Techniques
11.6 Further Protein Separation Techniques
11.7 DNA and Protein Chips
11.8 Yeast Two-Hybrid System
11.9 Mass Spectrometry
11.10 Transgenic Animals
11.11 RNA Interference
11.12 ChIP on Chip and ChIP-PET
11.13 Surface Plasmon Resonance
11.14 Population Heterogeneity and Single Entity Experiments
References
12 Mathematics
12.1 Linear Modeling
12.2 Ordinary Differential Equations
12.3 Difference Equations
12.4 Graph and Network Theory
References
13 Statistics
13.1 Basic Concepts of Probability Theory
13.2 Descriptive Statistics
13.3 Testing Statistical Hypotheses
13.4 Linear Models
13.5 Principal Component Analysis
References
14 Stochastic Processes
14.1 Basic Notions for Random Processes
14.2 Markov Processes
14.3 Jump Processes in Continuous Time: The Master Equation
14.4 Continuous Random Processes
References
15 Control of Linear Systems
15.1 Linear Dynamical Systems
15.2 System Response
15.3 The Gramian Matrices
16 Databases
16.1 Databases of the National Center for Biotechnology
16.2 Databases of the European Bioinformatics Institute
16.3 Swiss-Prot, TrEMBL, and UniProt
16.4 Protein Databank
16.5 BioNumbers
16.6 Gene Ontology
16.7 Pathway Databases
References
17 Modeling Tools
17.1 Introduction
17.2 Mathematica and Matlab
17.3 Dizzy
17.4 Systems Biology Workbench
17.5 Tools Compendium
References
Index
The Authors
Prof. Edda Klipp
Humboldt-Universität Berlin
Institut für Biologie
Theoretische Biophysik
Invalidenstr. 42
10115 Berlin
Dr. Wolfram Liebermeister
Humboldt-Universität Berlin
Institut für Biologie
Theoretische Biophysik
Invalidenstr. 42
10115 Berlin
Dr. Christoph Wierling
MPI für Molekulare Genetik
Ihnestr. 73
14195 Berlin
Germany
Dr. Axel Kowald
Protagen AG
Otto-Hahn-Str. 15
44227 Dortmund
Prof. Hans Lehrach
MPI für Molekulare Genetik
Ihnestr. 73
14195 Berlin
Germany
Prof. Ralf Herwig
MPI für Molekulare Genetik
Ihnestr. 73
14195 Berlin
Germany
Cover
The cover pictures were provided with kind permission by Santiago Ortiz and Dr. Michael Erlowitz
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.: applied for
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Typesetting Thomson Digital, Noida, India
Printing Strauss GmbH, Mörlenbach
Binding Litges & Dopf GmbH, Heppenheim
Cover Design Adam-Design, Weinheim
Printed in the Federal Republic of Germany
Printed on acid-free paper
ISBN: 978-3-527-31874-2
Preface
Life is probably the most complex phenomenon in the universe. We see kids growing, people aging, plants blooming, and microbes degrading their remains. We use yeast for brewing and baking, and doctors prescribe drugs to cure diseases. But can we understand how life works? Since the 19th century, the processes of life have no longer been explained by special “living forces,” but by the laws of physics and chemistry. By studying the structure and physiology of living systems in ever greater detail, researchers from different disciplines have revealed how the mystery of life arises from the structural and functional organization of cells and from the continuous refinement by mutation and selection.
In recent years, new imaging techniques have opened a completely new perception of the cellular microcosm. If we zoom into the cell, we can observe how structures are built, maintained, and reproduced while various sensing and regulation systems help the cell to respond appropriately to environmental changes. But along with all these fascinating observations, many open questions remain. Why do we age? How does a cell know when to divide? How can severe diseases such as cancer or genetic disorders be cured? How can we convince – i.e., manipulate – microbes to produce a desirable substance? How can the life sciences contribute to environmental safety and sustainable technologies?
This book provides you with a number of tools and approaches that can help you to think in more detail about such questions from a theoretical point of view. A key to tackling such questions is to combine biological experiments with computational modeling in an approach called systems biology: it is the combined study of biological systems through (i) investigating the components of cellular networks and their interactions, (ii) applying experimental high-throughput and whole-genome techniques, and (iii) integrating computational methods with experimental efforts.
The systemic approach in biology is not new, but it recently gained new thrust due to the emergence of powerful experimental and computational methods. It is based on the accumulation of an increasingly detailed biological knowledge, on the emergence of new experimental techniques in genomics and proteomics, on a tradition of mathematical modeling of biological processes, on the exponentially growing computer power (as prerequisite for databases and the calculation of large systems), and on the Internet as the central medium for a quick and comprehensive exchange of information.
Systems Biology has influenced modern biology in two major ways: on the one hand, it offers computational tools for analyzing, integrating and interpreting biological data and hypotheses. On the other hand, it has induced the formulation of new theoretical concepts and the application of existing ones to new questions. Such concepts are, for example, the theory of dynamical systems, control theory, the analysis of molecular noise, robustness and fragility of dynamic systems, and statistical network analysis. As systems biology is still evolving as a scientific field, a central issue is the standardization of experiments, of data exchange, and of mathematical models.
In this book, we attempt to give a survey of this rapidly developing field. We will show you how to formulate your own model of biological processes, how to analyze such models, how to use data and other available information for making your model more precise – and how to interpret the results. This book is designed as an introductory course for students of biology, biophysics and bioinformatics, and for senior scientists approaching Systems Biology from a different discipline. Its nine chapters contain material for about 30 lectures and are organized as follows.
Chapter 1 – Introduction (E. Klipp, W. Liebermeister, A. Kowald, 1 lecture)
Introduction to the subject. Elementary concepts and definitions are presented. Read this if you want to start right from the beginning.
Chapter 2 – Modeling of Biochemical Systems (E. Klipp, C. Wierling, 4 lectures)
This chapter describes kinetic models for biochemical reaction networks, the most common computational technique in Systems Biology. It includes kinetic laws, stoichiometric analysis, elementary flux modes, and metabolic control analysis. It also introduces tools and data formats necessary for modeling.
Chapter 3 – Specific Biochemical Systems (E. Klipp, C. Wierling, W. Liebermeister, 5 lectures)
Using specific examples from metabolism, signaling, and cell cycle, a number of popular modeling techniques are discussed. The aim of this chapter is to make the reader familiar with both modeling techniques and biological phenomena.
Chapter 4 – Model Fitting (W. Liebermeister, A. Kowald, 4 lectures)
Models in systems biology usually contain a large number of parameters. Assigning appropriate numerical values to these parameters is an important step in the creation of a quantitative model. This chapter shows how numerical values can be obtained from the literature or by fitting a model to experimental data. It also discusses how model structures can be simplified and how they can be chosen if several different models can potentially describe the experimental observations.
Chapter 5 – Analysis of High-Throughput Data (R. Herwig, 2 lectures)
Several techniques that have been developed in recent years produce large quantities of data (e.g., DNA and protein chips, yeast two-hybrid, mass spectrometry). But such large quantities often go together with a reduced quality of the individual measurement. This chapter describes techniques that can be used to handle this type of data appropriately.
Chapter 6 – Gene Expression Models (R. Herwig, W. Liebermeister, E. Klipp, 3 lectures)
Thousands of gene products are necessary to create a living cell, and the regulation of gene expression is a very complex and important task to keep a cell alive. This chapter discusses how the regulation of gene expression can be modeled, how different input signals can be integrated, and how the structure of gene networks can be inferred from experimental data.
Chapter 7 – Stochastic Systems and Variability (W. Liebermeister, 4 lectures)
Random fluctuations in transcription, translation, and metabolic reactions make the mathematics complicated, computation costly, and the interpretation of results less than straightforward. But since experimentalists find intriguing examples of macroscopic consequences of random fluctuations at the molecular level, the incorporation of these effects into simulations becomes more and more important. This chapter gives an overview of where and how stochasticity enters cellular life.
Chapter 8 – Network Structures, Dynamics and Function (W. Liebermeister, 3 lectures)
Many complex systems in biology can be represented as networks (reaction networks, interaction networks, regulatory networks). Studying the structure, dynamics, and function of such networks helps to understand design principles of living cells. In this chapter, important network structures such as motifs and modules as well as the dynamics resulting from them are discussed.
Chapter 9 – Optimality and Evolution (W. Liebermeister, E. Klipp, 3 lectures)
Theoretical research suggests that constraints of the evolutionary process should have left their marks in the construction and regulation of genes and metabolic pathways. In some cases, the function of biological systems can be well understood by models based on an optimality hypothesis. This chapter discusses the merits and limitations of such optimality approaches.
Various aspects of systems biology – the biological systems themselves, types of mathematical models to describe them, and practical techniques – reappear in different contexts in various parts of the book. The following diagram, which shows the contents of the book sorted by a number of different aspects, may serve as an orientation.
At the end of the regular course material, you will find a number of additional chapters that summarize important biological and mathematical methods. The first chapters deal with cell biology (chapter 10, C. Wierling) and molecular biological methods (chapter 11, A. Kowald). For looking up mathematical and statistical definitions and methods, turn to chapters 12 and 13 (R. Herwig, A. Kowald). Chapters 14 and 15 (W. Liebermeister) concentrate on random processes and control theory. The final chapters provide an overview of useful databases (chapter 16, C. Wierling) as well as an extensive list of available software tools, including a short description of their purposes (chapter 17, A. Kowald).
Further material is available on an accompanying website (www.wiley-vch.de/home/systemsbiology).
Besides additional and more specialized topics, the website also contains solutions to the exercises and problems presented in the book.
We give our thanks to a number of people who helped us in finishing this book. We are especially grateful to Dr. Ulrich Liebermeister, Prof. Dr. Hans Meinhardt, Dr. Timo Reis, Dr. Ulrike Baur, Clemens Kreutz, Dr. Jose Egea, Dr. Maria Rodriguez-Fernandez, Dr. Wilhelm Huisinga, Sabine Hummert, Guy Shinar, Nadav Kashtan, Dr. Ron Milo, Adrian Jinich, Elad Noor, Niv Antonovsky, Bente Kofahl, Dr. Simon Borger, Martina Fröhlich, Christian Waltermann, Susanne Gerber, Thomas Spießer, Szymon Stoma, Christian Diener, Axel Rasche, Hendrik Hache, Dr. Michal Ruth Schweiger, and Elisabeth Maschke-Dutz for reading and commenting on the manuscript.
We thank the Max Planck Society for support and encouragement. We are grateful to the European Commission for funding via different European projects (MEST-CT2004-514169, LSHG-CT-2005-518254, LSHG-CT-2005-018942, LSHG-CT-2006-037469, LSHG-CT-2006-035995-2 NEST-2005-Path2-043310, HEALTH-F4-2007-200767, and LSHB-CT-2006-037712). Further funding was obtained from the Sysmo project “Translucent” and from the German Research Foundation (IRTG 1360). E.K. thanks with love her sons Moritz and Richard for patience and incentive and the Systems Biology community for motivation. W.L. wishes to thank his daughters Hannah and Marlene for various insights and inspiration. A.K. would like to thank Prof. Dr. H.E. Meyer for support and hospitality. This book is dedicated to our teacher Prof. Dr. Reinhart Heinrich (1946–2006), whose works on metabolic control theory in the 1970s paved the way to systems biology and who greatly inspired our minds.
Part One
Introduction to Systems Biology
1
Introduction
1.1 Biology in Time and Space
Biological systems like organisms, cells, or biomolecules are highly organized in their structure and function. They have developed during evolution and can only be fully understood in this context. To study them and to apply mathematical, computational, or theoretical concepts, we have to be aware of the following circumstances.
The continuous reproduction of cell compounds necessary for living and the respective flow of information is captured by the central dogma of molecular biology, which can be summarized as follows: genes code for mRNA, mRNA serves as template for proteins, and proteins perform cellular work. Although information is stored in the genes in form of DNA sequence, it is made available only through the cellular machinery that can decode this sequence and can translate it into structure and function. In this book, this will be explained from various perspectives.
A description of biological entities and their properties encompasses different levels of organization and different time scales. We can study biological phenomena at the level of populations, individuals, tissues, organs, cells, and compartments down to molecules and atoms. Length scales range from the order of meters (e.g., the size of a whale or a human) to micrometers for many cell types, down to picometers for atomic sizes. Time scales include millions of years for evolutionary processes, annual and daily cycles, seconds for many biochemical reactions, and femtoseconds for molecular vibrations. Figure 1.1 gives an overview of these scales.
In a unified view of cellular networks, each action of a cell involves different levels of cellular organization, including genes, proteins, metabolism, or signaling pathways. Therefore, the current description of the individual networks must be integrated into a larger framework.
Many current approaches pay tribute to the fact that biological items are subject to evolution. The structure and organization of organisms and their cellular machinery has developed during evolution to fulfill major functions such as growth, proliferation, and survival under changing conditions. If parts of the organism or of the cell fail to perform their function, the individual might become unable to survive or replicate.
One consequence of evolution is the similarity of biological organisms from different species. This similarity allows for the use of model organisms and for the critical transfer of insights gained from one cell type to other cell types. Applications include, e.g., prediction of protein function from similarity, prediction of network properties from optimality principles, reconstruction of phylogenetic trees, or the identification of regulatory DNA sequences through cross-species comparisons. But the evolutionary process also leads to genetic variations within species. Therefore, personalized medicine and personalized research are an important new challenge for biomedical research.
1.2 Models and Modeling
If we observe biological processes, we are confronted with various complex processes that cannot be explained from first principles and whose outcome cannot reliably be foreseen from intuition. Even if general biochemical principles are well established (e.g., the central dogma of transcription and translation, the biochemistry of enzyme-catalyzed reactions), the biochemistry of individual molecules and systems is often unknown and can vary considerably between species. Experiments lead to biological hypotheses about individual processes, but it often remains unclear if these hypotheses can be combined into a larger coherent picture because it is often difficult to foresee the global behavior of a complex system from knowledge of its parts. Mathematical modeling and computer simulations can help us to understand the internal nature and dynamics of these processes and to arrive at predictions about their future development and the effect of interactions with the environment.
1.2.1 What is a Model?
The answer to this question will differ among communities of researchers. In a broad sense, a model is an abstract representation of objects or processes that explains features of these objects or processes (Figure 1.2). A biochemical reaction network can be represented by a graphical sketch showing dots for metabolites and arrows for reactions; the same network could also be described by a system of differential equations, which allows simulating and predicting the dynamic behavior of that network. If a model is used for simulations, it needs to be ensured that it faithfully predicts the system’s behavior – at least those aspects that are supposed to be covered by the model. Systems biology models are often based on well-established physical laws that justify their general form, for instance, the thermodynamics of chemical reactions; besides this, a computational model needs to make specific statements about a system of interest – which are partially justified by experiments and biochemical knowledge, and partially by mere extrapolation from other systems. Such a model can summarize established knowledge about a system in a coherent mathematical formulation. In experimental biology, the term “model” is also used to denote a species that is especially suitable for experiments; for example, a genetically modified mouse may serve as a model for human genetic disorders.
1.2.2 Purpose and Adequateness of Models
Modeling is a subjective and selective procedure. A model represents only specific aspects of reality but, if done properly, this is sufficient since the intention of modeling is to answer particular questions. If the only aim is to predict system outputs from given input signals, a model should display the correct input–output relation, while its interior can be regarded as a black box. But if instead a detailed biological mechanism has to be elucidated, then the system’s structure and the relations between its parts must be described realistically. Some models are meant to be generally applicable to many similar objects (e.g., Michaelis–Menten kinetics holds for many enzymes, the promoter–operator concept is applicable to many genes, and gene regulatory motifs are common), while others are specifically tailored to one particular object (e.g., the 3D structure of a protein, the sequence of a gene, or a model of deteriorating mitochondria during aging). The mathematical part can be kept as simple as possible to allow for easy implementation and comprehensible results, or it can be made very realistic and correspondingly more complicated. None of the characteristics mentioned above makes a model wrong or right, but they determine whether a model is appropriate to the problem to be solved. The phrase “essentially, all models are wrong, but some are useful” coined by the statistician George Box is indeed an appropriate guideline for model building.
1.2.3 Advantages of Computational Modeling
Models gain their reference to reality from comparison with experiments, and their benefits therefore depend on the quality of the experiments used. Nevertheless, modeling combined with experimentation has a lot of advantages compared to purely experimental studies:
The attempt to formulate current knowledge and open problems in mathematical terms often uncovers a lack of knowledge and requirements for clarification. Furthermore, computational models can be used to test whether proposed explanations of biological phenomena are feasible. Computational models serve as repositories of current knowledge, both established and hypothetical, about how systems might operate. At the same time, they provide researchers with quantitative descriptions of this knowledge and allow them to simulate the biological process, which serves as a rigorous consistency test.
1.3 Basic Notions for Computational Models
1.3.1 Model Scope
Systems biology models consist of mathematical elements that describe properties of a biological system, for instance, mathematical variables describing the concentrations of metabolites. As a model can only describe certain aspects of the system, all other properties of the system (e.g., concentrations of other substances or the environment of a cell) are neglected or simplified. It is important – and to some extent, an art – to construct models in such ways that the disregarded properties do not compromise the basic results of the model.
1.3.2 Model Statements
Besides the model elements, a model can contain various kinds of statements and equations describing facts about the model elements, most notably, their temporal behavior. In kinetic models, the basic modeling paradigm considered in this book, the dynamics is determined by a set of ordinary differential equations describing the substance balances. Statements in other model types may have the form of equality or inequality constraints (e.g., in flux balance analysis), maximality postulates, stochastic processes, or probabilistic statements about quantities that vary in time or between cells.
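To make the notion of substance balances concrete, the following minimal sketch sets up a kinetic model for a hypothetical two-metabolite chain (→ S1 → S2 →) and integrates it numerically. The pathway, the mass-action rate laws, the parameter values, and the function names `rates` and `simulate` are all assumptions chosen for illustration; the explicit Euler scheme is used only because it is short, not because it is the recommended integrator.

```python
def rates(state, params):
    """Substance balances dS1/dt and dS2/dt for the chain -> S1 -> S2 ->."""
    s1, s2 = state
    v_in = params["v_in"]        # constant influx (assumed)
    v1 = params["k1"] * s1       # conversion S1 -> S2, mass action (assumed)
    v2 = params["k2"] * s2       # removal of S2, mass action (assumed)
    return (v_in - v1, v1 - v2)


def simulate(state, params, dt=0.001, t_end=20.0):
    """Integrate the balance equations with the explicit Euler method."""
    steps = int(round(t_end / dt))
    for _ in range(steps):
        ds1, ds2 = rates(state, params)
        state = (state[0] + dt * ds1, state[1] + dt * ds2)
    return state


# Illustrative parameter values, not taken from any measured system.
params = {"v_in": 1.0, "k1": 2.0, "k2": 0.5}
final_state = simulate((0.0, 0.0), params)
print(final_state)
```

Each balance equation is simply the sum of the rates producing a substance minus the sum of the rates consuming it. For these assumed rate laws, the concentrations approach v_in/k1 and v_in/k2 after a sufficiently long time.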
1.3.3 System State
In dynamical systems theory, a system is characterized by its state, a snapshot of the system at a given time. The state of the system is described by the set of variables that must be kept track of in a model: in deterministic models, it needs to contain enough information to predict the behavior of the system for all future times. Each modeling framework defines what is meant by the state of the system. In kinetic rate equation models, for example, the state is a list of substance concentrations. In the corresponding stochastic model, it is a probability distribution or a list of the current number of molecules of a species. In a Boolean model of gene regulation, the state is a string of bits indicating for each gene whether it is expressed (“1”) or not expressed (“0”). The temporal behavior can also be described in fundamentally different ways. In a dynamical system, the future states are determined by the current state, while in a stochastic process, the future states are not precisely predetermined. Instead, each possible future history has a certain probability of occurring.
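As an illustration of a Boolean state, consider a hypothetical two-gene circuit in which each gene represses the other. The state is a pair of bits, and a synchronous update rule maps the current state to the next one. The circuit, the update rule, and the function names are assumptions made for this sketch only.

```python
def update(state):
    """Synchronous update: gene A is on if B is off, and vice versa."""
    a, b = state
    return (int(not b), int(not a))


def trajectory(state, steps):
    """Return the list of states visited, starting from the given state."""
    path = [state]
    for _ in range(steps):
        state = update(state)
        path.append(state)
    return path


print(trajectory((1, 0), 2))   # stays in the same state
print(trajectory((0, 0), 3))   # alternates between two states
```

Because the update rule is deterministic, the current bit string fully determines all future states: (1, 0) is a fixed point, while (0, 0) and (1, 1) alternate under this synchronous rule.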
1.3.4 Variables, Parameters, and Constants
The quantities in a model can be classified as variables, parameters, and constants. A constant is a quantity with a fixed value, such as Euler’s number e or Avogadro’s number (number of molecules per mole). Parameters are quantities that have a given value, such as the Km value of an enzyme in a reaction. This value depends on the method used and on the experimental conditions and may change. Variables are quantities with a changeable value for which the model establishes relations. A subset of variables, the state variables, describes the system behavior completely. They can assume independent values and each of them is necessary to define the system state. Their number is equal to the dimension of the system. For example, the diameter d and volume V of a sphere obey the relation V = πd³/6, where π and 6 are constants, V and d are variables, but only one of them is a state variable since the relation between them uniquely determines the other one.
Whether a quantity is a variable or a parameter depends on the model. In reaction kinetics, the enzyme concentration appears as a parameter. However, the enzyme concentration itself may change due to gene expression or protein degradation and in an extended model, it may be described by a variable.
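The distinction can be sketched with a Michaelis–Menten rate law of the assumed form v = kcat·E·S/(Km + S). In the basic model below, the enzyme concentration E is a fixed parameter; in the extended model, E is a second state variable with its own balance of synthesis and degradation. All rate laws, parameter values, and function names are hypothetical and serve only to illustrate the change of role.

```python
def mm_rate(s, e, kcat=10.0, km=0.5):
    """Michaelis-Menten rate v = kcat * E * S / (Km + S)."""
    return kcat * e * s / (km + s)


def simulate_basic(s0, e_fixed, dt=0.001, t_end=5.0):
    """Basic model: S is the only state variable, E is a fixed parameter."""
    s = s0
    for _ in range(int(round(t_end / dt))):
        s += dt * (-mm_rate(s, e_fixed))
    return s


def simulate_extended(s0, e0, k_syn=0.02, k_deg=0.1, dt=0.001, t_end=5.0):
    """Extended model: E is a state variable with synthesis and degradation."""
    s, e = s0, e0
    for _ in range(int(round(t_end / dt))):
        ds = -mm_rate(s, e)
        de = k_syn - k_deg * e   # constant synthesis minus first-order decay
        s, e = s + dt * ds, e + dt * de
    return s, e


s_basic = simulate_basic(s0=1.0, e_fixed=0.2)
s_ext, e_ext = simulate_extended(s0=1.0, e0=0.2)   # E starts at k_syn / k_deg
s_low, e_low = simulate_extended(s0=1.0, e0=0.0)   # E starts at zero
print(s_basic, s_ext, s_low)
```

When the enzyme starts at its own stationary level k_syn/k_deg, the extended model reproduces the basic one; when it starts elsewhere, the substrate dynamics differ, which is exactly the information the basic model cannot represent.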
1.3.5 Model Behavior
Two fundamental factors that determine the behavior of a system are (i) influences from the environment (input) and (ii) processes within the system. The system structure, that is, the relation among variables, parameters, and constants, determines how endogenous and exogenous forces are processed. However, different system structures may still produce similar system behavior (output); therefore, measurements of the system output often do not suffice to choose between alternative models and to determine the system’s internal organization.
1.3.6 Model Classification
For modeling, processes are classified with respect to a set of criteria.
1.3.7 Steady States
The concept of stationary states is important for the modeling of dynamical systems. Stationary states (other terms are steady states or fixed points) are determined by the fact that the values of all state variables remain constant in time. The asymptotic behavior of dynamic systems, that is, the behavior after a sufficiently long time, is often stationary. Other types of asymptotic behavior are oscillatory or chaotic regimes.
The consideration of steady states is actually an abstraction that is based on a separation of time scales. In nature, everything flows. Fast and slow processes – ranging from formation and breakage of chemical bonds within nanoseconds to growth of individuals within years – are coupled in the biological world. While fast processes often reach a quasi-steady state after a short transition period, the change of the value of slow variables is often negligible in the time window of consideration. Thus, each steady state can be regarded as a quasi-steady state of a system that is embedded in a larger nonstationary environment. Despite this idealization, the concept of stationary states is important in kinetic modeling because it points to typical behavioral modes of the system under study and it often simplifies the mathematical problems.
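A steady state can be computed directly by setting the time derivative to zero. The sketch below does this for a single substance S with an assumed constant influx and an assumed Michaelis–Menten consumption, using bisection on the net rate. The rate law, the parameter values, and the function names are illustrative assumptions.

```python
def net_rate(s, v_in=1.0, vmax=2.0, km=0.5):
    """dS/dt for constant influx and Michaelis-Menten consumption."""
    return v_in - vmax * s / (km + s)


def steady_state(f, lo=0.0, hi=100.0, tol=1e-10):
    """Find s with f(s) = 0 by bisection; assumes f(lo) > 0 > f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid    # net production: the root lies at higher s
        else:
            hi = mid    # net consumption: the root lies at lower s
    return 0.5 * (lo + hi)


s_star = steady_state(net_rate)
analytic = 1.0 * 0.5 / (2.0 - 1.0)   # S* = v_in * Km / (Vmax - v_in)
print(s_star, analytic)
```

For this rate law, a steady state exists only if the influx is smaller than Vmax; otherwise the consumption can never balance the production and S grows without bound.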
Other theoretical concepts in systems biology are only rough representations of their biological counterparts. For example, the representation of gene regulatory networks by Boolean networks, the description of complex enzyme kinetics by simple mass action laws, or the representation of multifarious reaction schemes by black boxes have proved to be helpful simplifications. Although they are simplifications, these models elucidate possible network properties and help to check the reliability of basic assumptions and to discover possible design principles in nature. Simplified models can be used to test mathematically formulated hypotheses about system dynamics, and such models are easier to understand and to apply to different questions.
1.3.8 Model Assignment is not Unique
Biological phenomena can be described in mathematical terms. Models developed during the last decades range from the description of glycolytic oscillations with ordinary differential equations to population dynamics models with difference equations, stochastic equations for signaling pathways, and Boolean networks for gene expression. But it is important to realize that a certain process can be described in more than one way: a biological object can be investigated with different experimental methods and each biological process can be described with different (mathematical) models. Sometimes, a modeling framework represents a simplified limiting case (e.g., kinetic models as limiting case of stochastic models). On the other hand, the same mathematical formalism may be applied to various biological instances: statistical network analysis, for example, can be applied to cellular-transcription networks, the circuitry of nerve cells, or food webs.
The choice of a mathematical model or an algorithm to describe a biological object depends on the problem, the purpose, and the intention of the investigator. Modeling has to reflect essential properties of the system and different models may highlight different aspects of the same system. This ambiguity has the advantage that different ways of studying a problem also provide different insights into the system. However, the diversity of modeling approaches makes it still very difficult to merge established models (e.g., for individual metabolic pathways) into larger supermodels (e.g., models of complete cell metabolism).
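The remark that kinetic models can be regarded as a limiting case of stochastic models can be illustrated with a single species X that is produced at a constant rate and degraded in a first-order reaction. The sketch below simulates the stochastic version with a Gillespie-type algorithm and compares the time-averaged molecule number with the steady state of the corresponding rate equation. The reaction scheme, the parameter values, the random seed, and the function name are assumptions for illustration.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, rng):
    """Simulate 0 -> X (rate k_prod) and X -> 0 (rate k_deg * n).

    Returns the time average of the molecule number n over [0, t_end].
    """
    t, n = 0.0, n0
    weighted_sum = 0.0                    # integral of n(t) dt
    while True:
        a_prod = k_prod                   # propensity of production
        a_deg = k_deg * n                 # propensity of degradation
        a_total = a_prod + a_deg
        tau = rng.expovariate(a_total)    # waiting time to the next event
        if t + tau >= t_end:
            weighted_sum += n * (t_end - t)
            break
        weighted_sum += n * tau
        t += tau
        if rng.random() * a_total < a_prod:
            n += 1                        # production event
        else:
            n -= 1                        # degradation event
    return weighted_sum / t_end


rng = random.Random(1)                    # fixed seed for reproducibility
time_avg = gillespie_birth_death(k_prod=100.0, k_deg=1.0, n0=100,
                                 t_end=200.0, rng=rng)
deterministic = 100.0 / 1.0               # steady state of dn/dt = k_prod - k_deg * n
print(time_avg, deterministic)
```

The individual stochastic trajectory fluctuates around the deterministic value. For large molecule numbers, the relative size of these fluctuations becomes small, which is why the rate equation is often an adequate description.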
1.4 Data Integration
Systems biology has evolved rapidly in recent years, driven by new high-throughput technologies. The most important impulse was given by the large sequencing projects such as the human genome project, which resulted in the full sequence of the human and other genomes [1, 2]. Proteomics technologies have been used to identify the translation status of complete cells (2D-gels, mass spectrometry) and to elucidate protein–protein interaction networks involving thousands of components [3]. However, to validate such diverse high-throughput data, one needs to correlate and integrate such information. Thus, an important part of systems biology is data integration.
On the lowest level of complexity, data integration implies common schemes for data storage, data representation, and data transfer. For particular experimental techniques, this has already been established, for example, in the field of transcriptomics with the Minimum Information About a Microarray Experiment (MIAME) standard [4], and in proteomics with proteomics experiment data repositories [5] and the work of the Human Proteome Organization (HUPO) consortium [6]. On a more complex level, schemes have been defined for biological models and pathways such as the Systems Biology Markup Language (SBML) [7] and CellML [8], both of which are XML-based languages.
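To give an impression of what an XML-based model description looks like, the following sketch reads a small model fragment with Python's standard XML parser. The fragment is only loosely modeled on SBML: the model, its identifiers (`toy_model`, `S1`, `S2`, `v1`), and the helper function `summarize` are hypothetical, and the fragment has not been validated against the SBML specification. For real models, a dedicated library such as libSBML would be the more appropriate choice.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the style of SBML (hypothetical toy model,
# not validated against the specification).
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="cell"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="S1" compartment="cell" initialConcentration="1.0"/>
      <species id="S2" compartment="cell" initialConcentration="0.0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="v1">
        <listOfReactants><speciesReference species="S1"/></listOfReactants>
        <listOfProducts><speciesReference species="S2"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

# Namespace prefix used in the search paths below.
NS = {"sbml": "http://www.sbml.org/sbml/level2"}


def summarize(text):
    """Return the species ids and a reaction -> (reactants, products) map."""
    root = ET.fromstring(text)
    model = root.find("sbml:model", NS)
    species = [
        s.get("id")
        for s in model.findall("sbml:listOfSpecies/sbml:species", NS)
    ]
    reactions = {}
    for r in model.findall("sbml:listOfReactions/sbml:reaction", NS):
        reactants = [
            ref.get("species")
            for ref in r.findall("sbml:listOfReactants/sbml:speciesReference", NS)
        ]
        products = [
            ref.get("species")
            for ref in r.findall("sbml:listOfProducts/sbml:speciesReference", NS)
        ]
        reactions[r.get("id")] = (reactants, products)
    return species, reactions
```

Because the structure is explicit and machine-readable, any tool that understands the format can recover the species and the reaction stoichiometry without knowing how the model was originally built.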
Data integration on the next level of complexity consists of data correlation. This is a growing research field as researchers combine information from multiple diverse data sets to learn about and explain natural processes [9, 10]. For example, methods have been developed to integrate the results of transcriptome or proteome experiments with genome sequence annotations. In the case of complex disease conditions, it is clear that only integrated approaches can link clinical, genetic, behavioral, and environmental data with diverse types of molecular phenotype information and identify correlative associations. Such correlations, if found, are the key to identifying biomarkers and processes that are either causative or indicative of the disease. Importantly, the identification of biomarkers (e.g., proteins, metabolites) associated with the disease opens up the possibility of generating and testing hypotheses on the biological processes and genes involved in this condition. The evaluation of disease-relevant data is a multistep procedure involving a complex pipeline of analysis and data handling tools such as data normalization, quality control, multivariate statistics, correlation analysis, visualization techniques, and intelligent database systems [11]. Several pioneering approaches have indicated the power of integrating data sets from different levels: for example, the correlation of gene membership in expression clusters with promoter sequence motifs [12]; the combination of transcriptome and quantitative proteomics data in order to construct models of cellular pathways [10]; and the identification of novel metabolite-transcript correlations [13]. Finally, data can be used to build and refine dynamical models, which represent an even higher level of data integration.
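The basic idea of data correlation can be sketched in a few lines. The example assumes two data sets, transcript and protein levels measured under the same four conditions, that are matched through shared gene identifiers and then compared by Pearson correlation. All identifiers and numbers are made up for illustration and do not stem from any experiment; a real analysis would include normalization, quality control, and a statistical assessment of the correlations.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlate_shared(transcript, protein):
    """Correlate the two profiles for every gene present in both data sets."""
    shared = sorted(set(transcript) & set(protein))
    return {gene: pearson(transcript[gene], protein[gene]) for gene in shared}


# Hypothetical, made-up measurements over four conditions.
transcript_levels = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.0, 2.0, 3.0, 4.0],
    "geneC": [2.0, 1.0, 2.0, 1.0],  # no protein data available
}
protein_levels = {
    "geneA": [2.0, 4.0, 6.0, 8.0],  # follows the transcript profile
    "geneB": [8.0, 6.0, 4.0, 2.0],  # opposite trend
}

correlations = correlate_shared(transcript_levels, protein_levels)
```

The matching step via shared identifiers is where common data formats and naming schemes matter: genes that cannot be mapped between the two data sets, such as the hypothetical `geneC`, drop out of the analysis.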
1.5 Standards
As experimental techniques generate rapidly growing amounts of data and large models need to be developed and exchanged, standards for both experimental procedures and modeling are a central practical issue in systems biology. Information exchange necessitates a common language about biological aspects. One seminal example is the Gene Ontology, which provides a controlled vocabulary that can be applied to all organisms, even as the knowledge about genes and proteins continues to accumulate. SBML [7] has been established as an exchange language for mathematical models of biochemical reaction networks. A series of “minimum-information-about” statements based on community agreement defines standards for certain types of experiments. Minimum information requested in the annotation of biochemical models (MIRIAM) [14] describes standards for this specific type of systems biology model.
References
1 Lander, E.S. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921
2 Venter, J.C. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351
3 von Mering, C. et al. (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417, 399–403
4 Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nature Genetics, 29, 365–371
5 Taylor, C.F. et al. (2003) A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nature Biotechnology, 21, 247–254
6 Hermjakob, H. et al. (2004) The HUPO PSI’s molecular interaction format – a community standard for the representation of protein interaction data. Nature Biotechnology, 22, 177–183
7 Hucka, M. et al. (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19, 524–531
8 Lloyd, C.M. et al. (2004) CellML: its future, present and past. Progress in Biophysics and Molecular Biology, 85, 433–450
9 Gitton, Y. et al. (2002) A gene expression map of human chromosome 21 orthologues in the mouse. Nature, 420, 586–590
10 Ideker, T. et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292, 929–934
11 Kanehisa, M. and Bork, P. (2003) Bioinformatics in the post-sequence era. Nature Genetics, 33 (Suppl), 305–310
12 Tavazoie, S. et al. (1999) Systematic determination of genetic network architecture. Nature Genetics, 22, 281–285
13 Urbanczyk-Wochniak, E. et al. (2003) Parallel analysis of transcript and metabolic profiles: a new approach in systems biology. EMBO Reports, 4, 989–993
14 Le Novère, N. et al. (2005) Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23, 1509–1515