Cover Page

Contents

Cover

Half Title page

Title page

Copyright page

Foreword

Preface

Chapter 1: Introduction

1.1 What is system administration?

1.2 What is a system?

1.3 What is administration?

1.4 Studying systems

1.5 What’s in a theory?

1.6 How to use the text

1.7 Some notation used

Chapter 2: Science and its methods

2.1 The aim of science

2.2 Causality, superposition and dependency

2.3 Controversies and philosophies of science

2.4 Technology

2.5 Hypotheses

2.6 The science of technology

2.7 Evaluating a system—dependencies

2.8 Abuses of science

Chapter 3: Experiment and observation

3.1 Data plots and time series

3.2 Constancy of environment during measurement

3.3 Experimental design

3.4 Stochastic (random) variables

3.5 Actual values or characteristic values

3.6 Observational errors

3.7 The mean and standard deviation

3.8 Probability distributions and measurement

3.9 Uncertainty in general formulae

3.10 Fourier analysis and periodic behaviour

3.11 Local averaging procedures

3.12 Reminder

Chapter 4: Simple systems

4.1 The concept of a system

4.2 Data structures and processes

4.3 Representation of variables

4.4 The simplest dynamical systems

4.5 More complex systems

4.6 Freedoms and constraints

4.7 Symmetries

4.8 Algorithms, protocols and standard ‘methods’

4.9 Currencies and value systems

4.10 Open and closed systems: the environment

4.11 Reliable and unreliable systems

Chapter 5: Sets, states and logic

5.1 Sets

5.2 A system as a set of sets

5.3 Addresses and mappings

5.4 Chains and states

5.5 Configurations and macrostates

5.6 Continuum approximation

5.7 Theory of computation and machine language

5.8 A policy-defined state

Chapter 6: Diagrammatical representations

6.1 Diagrams as systems

6.2 The concept of a graph

6.3 Connectivity

6.4 Centrality: maxima and minima in graphs

6.5 Ranking in directed graphs

6.6 Applied diagrammatical methods

Chapter 7: System variables

7.1 Information systems

7.2 Addresses, labels, keys and other resource locators

7.3 Continuous relationships

7.4 Digital comparison

Chapter 8: Change in systems

8.1 Renditions of change

8.2 Determinism and predictability

8.3 Oscillations and fluctuations

8.4 Rate of change

8.5 Applications of the continuum approximation

8.6 Uncertainty in the continuum approximation

Chapter 9: Information

9.1 What is information?

9.2 Transmission

9.3 Information and control

9.4 Classification and resolution

9.5 Statistical uncertainty and entropy

9.6 Properties of the entropy

9.7 Uncertainty in communication

9.8 A geometrical interpretation of information

9.9 Compressibility and size of information

9.10 Information and state

9.11 Maximum entropy principle

9.12 Fluctuation spectra

Chapter 10: Stability

10.1 Basic notions

10.2 Types of stability

10.3 Constancy

10.4 Convergence of behaviour

10.5 Maxima and minima

10.6 Regions of stability in a graph

10.7 Graph stability under random node removal

10.8 Dynamical equilibria: compromise

10.9 Statistical stability

10.10 Scaling stability

10.11 Maximum entropy distributions

10.12 Eigenstates

10.13 Fixed points of maps

10.14 Metastable alternatives and adaptability

10.15 Final remarks

Chapter 11: Resource networks

11.1 What is a system resource?

11.2 Representation of resources

11.3 Resource currency relationships

11.4 Resource allocation, consumption and conservation

11.5 Where to attach resources?

11.6 Access to resources

11.7 Methods of resource allocation

11.8 Directed resources: flow asymmetries

Chapter 12: Task management and services

12.1 Task list scheduling

12.2 Deterministic and non-deterministic schedules

12.3 Human–computer scheduling

12.4 Service provision and policy

12.5 Queue processing

12.6 Models

12.7 The prototype queue M/M/1

12.8 Queue relationships or basic ‘laws’

12.9 Expediting tasks with multiple servers M/M/k

12.10 Maximum entropy input events in periodic systems

12.11 Miscellaneous issues in scheduling

Chapter 13: System architectures

13.1 Policy for organization

13.2 Informative and procedural flows

13.3 Structured systems and ad hoc systems

13.4 Dependence policy

13.5 System design strategy

13.6 Event-driven systems and functional systems

13.7 The organization of human resources

13.8 Principle of minimal dependency

13.9 Decision-making within a system

13.10 Prediction, verification and their limitations

13.11 Graphical methods

Chapter 14: System normalization

14.1 Dependency

14.2 The database model

14.3 Normalized forms

Chapter 15: System integrity

15.1 System administration as communication?

15.2 Extensive or strategic instruction

15.3 Stochastic semi-groups and martingales

15.4 Characterizing probable or average error

15.5 Correcting errors of propagation

15.6 Gaussian continuum approximation formula

Chapter 16: Policy and maintenance

16.1 What is maintenance?

16.2 Average changes in configuration

16.3 The reason for random fluctuations

16.4 Huge fluctuations

16.5 Equivalent configurations and policy

16.6 Policy

16.7 Convergent maintenance

16.8 The maintenance theorem

16.9 Theory of back-up and error correction

Chapter 17: Knowledge, learning and training

17.1 Information and knowledge

17.2 Knowledge as classification

17.3 Bayes’ theorem

17.4 Belief versus truth

17.5 Decisions based on expert knowledge

17.6 Knowledge out of date

17.7 Convergence of the learning process

Chapter 18: Policy transgressions and fault modelling

18.1 Faults and failures

18.2 Deterministic system approximation

18.3 Stochastic system models

18.4 Approximate information flow reliability

18.5 Fault correction by monitoring and instruction

18.6 Policy maintenance architectures

18.7 Diagnostic cause trees

18.8 Probabilistic fault trees

Chapter 19: Decision and strategy

19.1 Causal analysis

19.2 Decision-making

19.3 Game theory

19.4 The strategic form of a game

19.5 The extensive form of a game

19.6 Solving zero-sum games

19.7 Dominated strategies

19.8 Nash equilibria

19.9 A security game

19.10 The garbage collection game

19.11 A social engineering game

19.12 Human elements of policy decision

19.13 Coda: extensive versus strategic configuration management

Chapter 20: Conclusions

Appendix A: Some Boolean formulae

A.1 Conditional probability

A.2 Boolean algebra and logic

Appendix B: Statistical and scaling properties of time-series data

B.1 Local averaging procedure

B.2 Scaling and self-similarity

B.3 Scaling of continuous functions

Appendix C: Percolation conditions

C.1 Random graph condition

C.2 Bi-partite form

C.3 Small-graph corrections

Bibliography

Index

Analytical Network and System Administration


Foreword

It is my great honor to introduce a landmark book in the field of network and system administration. For the first time, in one place, one can study the components of network and system administration as an evolving and emerging discipline and science, rather than as a set of recipes, practices or principles. This book represents the step from ‘mastery of the practice’ to ‘scientific understanding’, a step very similar to that between historical alchemy and chemistry.

As recently as ten years ago, many people considered ‘network and system administration’ to comprise remembering and following complex recipes for building and maintaining systems and networks. The complexity of many of these recipes—and the difficulty of explaining them to non-practitioners in simple and understandable terms—encouraged practitioners to treat system administration as an ‘art’ or ‘guild craft’ into which practitioners are initiated through apprenticeship.

Current master practitioners of network and system administration are perhaps best compared with historical master alchemists at the dawn of chemistry as a science. In contrast to the distorted popular image of alchemy as seeking riches through transmutation of base metals, historical research portrays alchemists as master practitioners of the subtle art of combining chemicals towards particular results or ends. Practitioners of alchemy often possessed both precise technique and highly developed observational skills. Likewise, current master practitioners of network and system administration craft highly reliable networks from a mix of precise practice, observational skills and the intuition that comes from careful observation of network behaviour over long time periods. But both alchemists and master practitioners lack the common language that makes it easy to exchange valuable information with others: the language of science.

Alas, the alchemy by which we have so far managed our networks is no longer sufficient. When networks were simple in structure, it was possible to maintain them through the use of relatively straightforward recipes, procedures and practices. In the post-Internet world, the administrator is now faced with managing and controlling networks that can dynamically adapt to changing conditions and requirements quickly and, perhaps, even unpredictably. These adaptive networks can exhibit ‘emergent properties’ that are not predictable in advance. In concert with adapting networks to serve human needs, future administrators must adapt themselves to the task of management by developing an ongoing, perpetually evolving, and shared understanding.

In the past, it was reasonable to consider a computer network as a collection of cooperating machines functioning in isolation. Adaptive networks cannot be analysed in this fashion; their human components must also be considered. Modern networks are not communities of machines, but rather communities of humans inextricably linked by machines; what the author calls ‘cooperating ecologies’ of users and machines. The behaviour of humans must be considered along with the behaviour of the network when drawing conclusions about network performance and suitability.

These pressures force me to an inescapable conclusion. System administrators cannot continue to be alchemist-practitioners. They must instead develop the language of science and evolve from members of a profession to researchers within a shared scientific discipline. This book shows the way.

Though we live thousands of miles apart, the author and I are ‘kindred spirits’—forged by many of the same experiences, challenges and insights. In the late 1980s and early 1990s, both of us were faculty, managing our own computer networks for teaching and research. Neither of us had access to the contemporary guilds of system administration (or each other), and had to learn how to administer networks the hard way—by reading the documentation and creating our own recipes for success. Both of us realized (completely independently) that there were simple concepts behind the recipes that, once discovered, make the recipes easy to remember, reconstruct and understand. Concurrently and independently, both of us set out to create software tools that would avoid repeated manual configuration.

Although we were trained in radically differing academic traditions (the author from physics and myself from mathematics and computer science), our administrative tools, developed completely in isolation from one another, had very similar capabilities and even accomplished tasks using the same methods. The most striking similarity was that both tools were based upon the same ‘principles’. For the first time, it very much looked like we had found an invariant principle in the art of system and network administration: the ‘principle of convergence’. As people would say in the North Carolina backwoods near where I grew up, ‘if it ain’t broke, don’t fix it’.
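The ‘principle of convergence’ mentioned here can be sketched in a few lines. The following is a hypothetical illustration, not the author’s actual tool: a convergent operation compares observed state with desired state and acts only on the difference, so that applying it repeatedly changes nothing once the system complies; in other words, ‘if it ain’t broke, don’t fix it’.

```python
# Hypothetical sketch of a convergent maintenance operation:
# inspect the actual state and repair only where it deviates
# from the desired state, so repeated application is idempotent.

def converge(actual: dict, desired: dict) -> list:
    """Bring `actual` into compliance with `desired`;
    return the list of repairs actually performed."""
    repairs = []
    for key, want in desired.items():
        if actual.get(key) != want:   # broken? then fix it
            actual[key] = want
            repairs.append(key)
    return repairs                    # empty once converged

# Illustrative state and policy (names are invented for the example)
state = {"dns": "10.0.0.1", "mta": "off"}
policy = {"dns": "10.0.0.53", "mta": "off", "ntp": "on"}

print(converge(state, policy))  # first pass repairs the deviations
print(converge(state, policy))  # second pass finds nothing to do
```

Repeated application is harmless by construction; this idempotence is what distinguishes a convergent operation from a blind sequence of configuration commands.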

The road from alchemy to discipline has many steps. In the author’s previous book, Principles of Network and System Administration, he takes the first step from practice (‘what to do’) to principles (‘why to do it’). Recipes are not created equal; some are better than others. Many times the difference between good and poor recipes can be expressed in terms of easily understood principles. Good recipes can then be constructed top-down, starting at the principles. Practitioners have approached the same problem bottom-up, working to turn their tested and proven recipes into sets of ‘best practices’ that are guaranteed to work well for a particular site or application. Recently, many practitioners have begun to outline the ‘principles’ underlying their practices. There is remarkable similarity between the results of these two seemingly opposing processes, and the author’s ‘principles’ and the practitioners’ ‘best practices’ are now quickly meeting on a common middle ground of principles.

In this book, for the first time, the author identifies principles of scientific practice and observation that anyone can use to become a proficient ‘analyst’ of network and system administration practices. This will not make one a better practitioner, but rather will allow one to discuss and evaluate the practice with others in a clear and concise manner. The reader will not find any recipes in this book. The reader will not find principles of practice. Rather, the book explains the principles behind the science and chemistry of cooking, so that one can derive one’s own efficient and effective recipes for future networks. Proficient system administrators have always been capable of this kind of alchemy, but have found it challenging to teach the skill to others. This book unlocks the full power of the scientific method to allow sharing of analyses, so that future administrators can look beyond recipe, to shared understanding and discipline. In this way, now-isolated practitioners can form a shared scientific community and discipline whose knowledge is greater than the sum of its parts.

Looking at the table of contents, one will be very surprised to note that the traditional disciplines of ‘computer science’ and ‘computer engineering’—long considered the inseparable partners of system administration—are not the basis of the new science. Rather, experimental physics has proven to be the Rosetta Stone that unlocks the mysteries of complex systems. To understand why, we must examine the fundamental differences in economics between the disciplines of computer science and engineering and the disciplines of network and system administration.

Traditional computer science and engineering (and, particularly, the sciences involved in building the systems that system administrators manage) are based upon either an operational or axiomatic semantic model of computing. Both models express ‘what a program does’ in an ideal computing environment. Software developers build complex systems in layers, where each subsequent layer presumes the correct function of layers upon which it is built. Program correctness at a given layer is a mathematical property based upon axioms that describe the behaviour of underlying layers. Fully understanding a very complex system requires understanding of each layer and its interdependencies and assumptions in dealing with other layers.

System administrators have a differing view of the systems they manage compared to that of the developers who designed the systems. It is not economically feasible to teach the deep knowledge and mathematical understanding necessary to craft and debug software and systems to large populations of human system administrators. System administrators must instead base their actions upon a high-level set of initial experimental hypotheses called the ‘system documentation’. The documentation consists of hypotheses to be tested, not axioms to be trusted. As administrators learn how to manage a system, they refine their understanding top-down, by direct observation and ongoing evaluation of hypotheses.

Turning system and network administration into a discipline requires one to learn some skills previously considered far removed from the practice. Evaluating hypotheses requires a rudimentary knowledge of statistics and the experimental method. These hypotheses are built not upon operational or axiomatic semantic models of computing, but upon specialized high-level mathematical models that describe the behaviour of a complex system. With this machinery in hand, several advanced methods of analysis—prevalent in experimental physics and other scientific disciplines—are applied to the problem of understanding the management of complex systems.

Proficient system administrators are already skilled experimental scientists; they just do not acknowledge this fact and cannot effectively communicate their findings to others. This book takes a major step towards understanding the profession of system and network administration as a science rather than as an art. While this step is difficult to take, it is both rewarding and necessary for those pioneers who will manage the next generation of networks and services. Please read on, and seek to understand the true nature of networking—as a process that involves connecting humans, not just computers.

Alva Couch
Tufts University, USA

Preface

This is a research document and a textbook for graduate students and researchers in the field of networking and system administration. It offers a theoretical perspective on human–computer systems and their administration. The book assumes a basic competence in mathematical methods, common to undergraduate courses. Readers looking for a less theoretical introduction to the subject may wish to consult Burgess (2000b).

I have striven to write a short book, treating topics briefly rather than succumbing to the temptation to write an encyclopædia that few will read or be able to lift. I have not attempted to survey the literature or provide any historical context to the development of these ideas (see Anderson et al. (2001)). I hope this makes the book accessible to the intelligent lay reader who does not possess an extensive literacy in the field and would be confused by such distractions. The more advanced reader should find sufficient threads to follow to add depth to the material. In my experience, too much attention to detail merely results in one forgetting why one is studying something at all. In this case, we are trying to formulate a descriptive language for systems.

A theoretical synthesis of system administration plays two roles: it provides a descriptive framework for systems that should be available to other areas of computer science and proffers an analytical framework for dealing with the complexities of interacting components. The field of system administration meets an unusual challenge in computer science: that of approximation. Modern computing systems are too complicated to be understood in exact terms.

In the flagship theory of physics, quantum electrodynamics, one builds everything out of two simple principles:

1. Different things can exist at different places and times.
2. For every effect, there must be a cause.

The beauty of this construction is its lack of assumptions and the richness of the results. In this text, I have tried to synthesize something like this for human–computer systems. In order to finish the book, and keep it short and readable, I have had to compromise on many things. I hope that the result nevertheless contributes in some way to a broader scientific understanding of the field and will inspire students to further serious study of this important subject.

Some of this work is based on research performed with my collaborators Geoff Canright, Frode Sandnes and Trond Reitan. I have benefited greatly from discussions with them and others. I am especially grateful for the interest and support of other researchers, most notably Alva Couch for understanding my own contributions when no one else did. Finally, I would like to thank several people for reading the draft versions of the manuscript and commenting: Paul Anderson, Lars Kristiansen, Tore Jonassen, Anil Somayaji and Jan Bergstra.

Mark Burgess

Chapter 1

Introduction

Technology: the science of the mechanical and industrial arts.
[Gk. tekhne art and logos speech].

—Odhams dictionary of the English language

1.1 What is system administration?

System administration is about the design, running and maintenance of human–computer systems. Human–computer systems are ‘communities’ of people and machines that collaborate actively to execute a common task. Examples of human–computer systems include business enterprises, service institutions and any extensive machinery that is operated by, or interacts with, human beings. The human players in a human–computer system are often called the users and the machines are referred to as hosts, but this suggests an asymmetry of roles, which is not always the case.

System administration is primarily about the technological side of a system: the architecture, construction and optimization of the collaborating parts, but it also occasionally touches on softer factors such as user assistance (help desks), ethical considerations in deploying a system, and the larger implications of its design for others who come into contact with it. System administration deals first and foremost with the system as a whole, treating the individual components as black boxes, to be opened only when it is possible or practical to do so. It does not conventionally consider the design of user-tools such as third-party computer programs, nor does it attempt to design enhancements to the available software, though it does often discuss meta tools and improvised software systems that can be used to monitor, adjust or even govern the system. This omission is mainly because user-software is acquired beyond the control of a system administrator; it is written by third parties, and is not open to local modification. Thus, users’ tools and software are treated as ‘given quantities’ or ‘boundary conditions’.

For historical reasons, the study of system administration has fallen into two camps: those who speak of network management and discuss its problems in terms of software design for the management of black box devices by humans (e.g. using SNMP), and those who speak of system administration and concern themselves with practical strategies of machine and software configuration at all levels, including automation, human–computer issues and ethical considerations. These two viewpoints are complementary, but too often ignore one another. This book considers human–computer systems in general, and refers to specific technologies only by example. It is therefore as much about purely human administrative systems as it is about computers.

1.2 What is a system?

A system is most often an organized effort to fulfil a goal, or at least carry out some predictable behaviour. The concept is of the broadest possible generality. A system could be a mechanical device, a computer, an office of workers, a network of humans and machines, a series of forms and procedures (a bureaucracy) etc. Systems involve themes, such as collaboration and communication between different actors, the use of structure to represent information or to promote efficiency, and the laws of cause and effect. Within any mechanism, specialization of the parts is required to build significant innovation; it is only through a strategy of divide and conquer that significant problems can be solved. This implies that each division requires a special solution.

A computer system is usually understood to mean a system composed primarily of computers, using computers or supporting computers. A human–computer system includes the role of humans, such as in a business enterprise where computers are widely used. The principles and theories concerning systems come from a wide range of fields of study. They are synthesized here in a form and language that is suitable for scholars of science and engineering.

1.3 What is administration?

The word administration covers a variety of meanings in common parlance. The American Administration is the government of the United States, that is, a political leadership. A university administration is a bureaucracy and economic resource department that works on behalf of a board of governors to implement the university’s policy and to manage its resources. The administrative department of a company is generally the part that handles economic procedures and payment transactions. In human–computer system administration, the definition is broadened to include all of the organizational aspects and also engineering issues, such as system fault diagnosis. In this regard, it is like the medical profession, which combines checking, management and repair of bodily functions. The main issues are the following:

In order to achieve these goals, it requires

Administration comprises two aspects: technical solutions and arbitrary policies. A technical solution is required to achieve goals and sub-goals, so that a problem can be broken down into manageable pieces. Policy is required to make the system, as far as possible, predictable: it pre-decides the answers to questions on issues that cannot be derived from within the system itself. Policy is therefore an arbitrary choice, perhaps guided by a goal or a principle.

The arbitrary aspect of policy cannot be disregarded in the administration of a system, since it sets the boundary conditions under which the system will operate, and supplies answers to questions that cannot be determined purely on the grounds of efficiency. This is especially important where humans are involved: human welfare, permissions, responsibilities and ethical issues are all parts of policy. Modelling these intangible qualities formally presents some challenges and requires the creative use of abstraction.

The administration of a system is an administration of temporal and resource development. The administration of a network of localized systems (a so-called distributed system) contains all of the above, and, additionally, the administration of the location of and communication between the system’s parts. Administration is thus a flow of activity, information about resources, policy making, record keeping, diagnosis and repair.

1.4 Studying systems

There are many issues to be studied in system administration. Some issues are of a technical nature, while others are of a human nature. System administration confronts the human–machine interaction as few other branches of computer science do. Here are some examples:

Usually, system administrators do not decide the purpose of a system; they are regarded as supporting personnel. As we shall see, this view is, however, somewhat flawed from the viewpoint of system design. It does not always make sense to separate the human and computer components in a system; as we move farther into the information age, the fates of both become more deeply intertwined.

To date, little theory has been applied to the problems of system administration. In a subject as complex as system administration, it is easy to fall back on qualitative claims. This is dangerous, however, since one is easily fooled by qualitative descriptions. Analysis proceeds as a dialogue between theory and experiment. We need theory to interpret results of observations and we need observations to back up theory. Any conclusions must be a consistent mixture of the two. At the same time, one must not believe that it is sensible to demand hard-nosed Popper-like falsification of claims in such a complex environment. Any numbers that we can measure, and any models we can make, must be considered valuable, provided they actually have a sensible interpretation.

Human–computer interaction

The established field of human–computer interaction (HCI) has grown, in computer science, around the need for reliable interfaces in critical software scenarios (see for instance Sheridan (1996); Zadeh (1973)). For example, in the military, real danger could come of an ill-designed user interface on a nuclear submarine; or in a power plant, a poorly designed system could set off an explosion or result in blackouts.

One can extend the notion of the HCI to think less as a programmer and more as a physicist. The task of physics is to understand and describe what happens when different parts of nature interact. The interaction between fickle humans and rigid machinery leads to many unexpected phenomena, some of which might be predicted by a more detailed functional understanding of this interaction. This does not merely involve human attitudes and habits; it is a problem of systemic complexity—something that physics has its own methods to describe. Many of the problems surrounding computer security enter into the equation through the HCI. Of all the parts of a system, humans bend most easily: they are often both the weakest link and the most adaptable tools in a solution, but there is more to the HCI than psychology and button pushing. The issue reaches out to the very principles of science: what are the relevant timescales for the interactions and for the effects to manifest? What are the sources of predictability and unpredictability? Where is the system immune to this interaction, and where is the interaction very strong? These are not questions that a computer science analysis alone can answer; there are physics questions behind these issues. Thus, in reading this book, you should not be misled into thinking that physics is merely about electrons, heat and motion: it is a broad methodology for ‘understanding phenomena’, no matter where they occur, or how they are described. What computer science lacks from its attachment to technology, it must regain by appealing to the physics of systems.

Policy

The idea of policy plays a central role in the administration of systems, whether they are dominated by human or technological concerns.

Definition 1 (Policy—heuristic) A policy is a description of what is intended and desirable about a system. It includes a set of ad hoc choices, goals, compromises, schedules, definitions and limitations about the system. Where humans are involved, compromises often include psychological considerations and welfare issues.

A policy provides a frame of reference in which a system is understood to operate. It injects a relativistic aspect into the science of systems: we cannot expect to find absolute answers, when different systems play by different rules and have different expectations. A theory of systems must therefore take into account policy as a basic axiom. Much effort is expended in the chapters that follow to find a tenable definition of policy.
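This relativistic aspect can be given a concrete, if simplified, rendering. The following is a hypothetical sketch, not a notation used later in the book: a policy as a set of named predicates over the observable state of a system, where compliance is judged relative to the policy in force.

```python
# Hypothetical sketch: a policy as named predicates over
# system state. Compliance is relative: the same state may
# satisfy one policy and violate another.

system_state = {"open_ports": {22, 80}, "backup_age_hours": 20}

policy_a = {
    "ssh only":     lambda s: s["open_ports"] <= {22},
    "fresh backup": lambda s: s["backup_age_hours"] < 24,
}
policy_b = {
    "web allowed":  lambda s: s["open_ports"] <= {22, 80},
    "fresh backup": lambda s: s["backup_age_hours"] < 24,
}

def violations(state: dict, policy: dict) -> list:
    """Return the names of policy clauses the state violates."""
    return [name for name, ok in policy.items() if not ok(state)]

print(violations(system_state, policy_a))  # ['ssh only']
print(violations(system_state, policy_b))  # []
```

One state, two verdicts: there is no absolute answer to whether the system is ‘correct’, only an answer relative to a declared policy, which is the sense in which policy must enter the theory as an axiom.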

Stability and instability

It is in the nature of almost all systems to change with time. The human and machine parts of a system change, both in response to one another, and in response to a larger environment. The system is usually a predictable, known quantity; the environment is, by definition, an unknown quantity. Such changes tend to move the system in one of two directions: either the system falls into disarray or it stagnates. The meaning of these provocative terms is different for the human and the machine parts:

Ideally, a machine will perform, repetitively, the same job over and over again, because that is the function of mechanisms: stagnation is good for machines. For humans, on the other hand, this is usually regarded as a bad thing, since humans are valued for their creativity and adaptability. For a system mechanism to fall into disarray is a bad thing.

The relationship between a system and its environment is often crucial in determining which of the above is the case. The inclusion of human behaviour in systems must be modelled carefully, since humans are not deterministic in the same way that machines (automata) can be. Humans must therefore be considered as being part system and part environment. Finally, policy itself must be our guide as to what is desirable change.

Security

Security is a property of systems that has come to the forefront of our attention in recent times. How shall we include it in a theory of system administration?

Definition 2 (Security) Security concerns the possible ways in which a system’s integrity might be compromised, causing it to fail in its intended purpose. In other words, a breach of security is a failure of a system to meet its specifications.

Security refers to ‘intended purpose’, so it is immediately clear that it relates directly to policy and that it is a property of the entire system in general. Note also that, while we associate security with ‘attacks’ or ‘criminal activity’, natural disasters and other occurrences can equally be to blame for the external perturbations that break systems.

A loss of integrity can come from a variety of sources, for example, an internal fault, an accident or a malicious attack on the system. Security is a property that requires the analysis of assumptions that underpin the system, since it is these areas that one tends to disregard and that can be exploited by attackers, or fail for diverse reasons. The system depends on its components in order to function. Security is thus about an analysis of dependencies. We can sum this up in a second definition:

Definition 3 (Secure system) A secure system is one in which every possible threat has been analysed and where all the risks have been assessed and accepted as a matter of policy.
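The idea of security as dependency analysis can be made concrete in a toy sketch (all component names here are invented for illustration): a system's dependencies form a directed graph, and the components whose failure could compromise a given service are exactly those reachable from it.

```python
# A minimal sketch of dependency analysis; the dependency map and
# component names are hypothetical.
deps = {
    "webshop": ["webserver", "database"],
    "webserver": ["dns", "power"],
    "database": ["disk", "power"],
    "dns": ["power"],
    "disk": [],
    "power": [],
}

def failure_points(service, deps):
    """Return every component the service transitively depends on;
    a fault in any one of them can break the service."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        for d in deps.get(node, []):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

print(sorted(failure_points("webshop", deps)))
```

Enumerating the transitive dependencies of a service in this way is a first step towards the analysis of assumptions that Definition 3 asks a policy to accept explicitly.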

1.5 What’s in a theory?

This book is not a finished theory, like the theory of relativity, or the theory of genetic replication. It is not the end of a story, but a beginning. System administration is at the start of its scientific journey, not at its end.

Dramatis personae

The players in system administration are the following:

We seek a clear and flexible language (rooted in mathematics) in which to write their script. It will deal with basic themes of

It must answer questions that are of interest to the management of systems. We can use two strategies:

A snapshot of reality

The system administrator rises and heads for the computer, grabs coffee or cola and proceeds to catch up on e-mail. There are questions, bug reports, automatic replies from scripted programs, spam and lengthy discussions from mailing lists.

The day proceeds to planning, fault finding, installing software, modifying system parameters to implement (often ad hoc) policy that enables the system to solve a problem for a user, or which makes the running smoother (more predictable)—see fig. 1.1. On top of all of this, the administrator must be thinking about what users are doing. After all, they are the ones who need the system and the ones who most often break it. How does ‘the system’ cope with them and their activities as they feed off it and feed back on it? They are, in every sense, a part of the system. How can their habits and skills be changed to make it all work more smoothly? This will require an appreciation of the social interactions of the system and how they, in turn, affect the structures of the logical networks and demands placed on the machines.

Figure 1.1: The floating islands of system administration move around on a daily basis and touch each other in different ways. In what framework shall we place these? How can we break them down into simpler problems that can be ‘solved’? In courier font, we find some primitive concepts that help to describe the broader ideas. These will be our starting points.

There are decisions to be made, but many of them seem too uncertain to be able to make a reliable judgement on the available evidence. Experimentation is required, and searching for advice from others. Unfortunately, you never know how reliable others’ opinions and assertions will be. It would be cool if there were a method for turning the creative energy into the optimal answer. There is ample opportunity and a wealth of tools to collect information, but how should that information be organized and interpreted? What is lacking is not software, but theoretical tools.

What view or philosophy could unify the different facets of system administration: design, economics, efficiency, verification, fault-finding, maintenance, security and so on? Each of these issues is based on something more primitive or fundamental. Our task is therefore to use the power of abstraction to break down the familiar problems into simpler units that we can master and then reassemble into an approximation of reality. There is no unique point of view here (see next chapter).

Theory might lead to better tools and also to better procedures. If it is to be of any use, it must have predictive power as well as descriptive power. We have to end up with formulae and procedures that make criticism and re-evaluation easier and more effective. We must be able to summarize simple ‘laws’ about system management (rules of thumb) that are not based only on vague experience, but have a theoretical explanation based on reasonable cause and effect.

How could such a thing be done? For instance, how might we measure how much work will be involved in a task?

By starting down the road of analysis, we gain many small insights that can be assembled into a deeper understanding. That is what this book attempts to do.

The system administrator wonders if he or she will ever become redundant, but there is no sign of that happening. The external conditions and requirements of users are changing too quickly for a system to adapt automatically, and policy has to be adjusted to new goals and crises. Humans are the only technology on the planet that can address that problem for the foreseeable future. Besides, the pursuit of pleasure is a human condition, and part of the enjoyment of the job is that creative and analytical pursuit.

The purpose of this book is to offer a framework in which to analyse and understand the phenomenon of human–computer management. It is only with the help of theoretical models that we can truly obtain a deeper understanding of system behaviour.

Studies

The forthcoming chapters describe a variety of languages for discussing systems, and present some methods and issues that are the basis of the author’s own work. Analysis is the scientific method in action, so this book is about analysis. It has many themes:

1. Observe—we must establish a factual basis for discussing systems.
2. Deduce cause—we establish probable causes of observed phenomena.
3. Establish goals—what do we want from this information?
4. Diagnose ‘faults’—what is a fault? It implies a value judgement, based on policy.
5. Correct faults—devise and apply strategies.

Again, these concepts are intimately connected with ‘policy’, that is, a specification of right and wrong. In some sense, we need to know the ‘distance’ between what we would like to see and what we actually see.
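One toy illustration of such a ‘distance’ (the state variables here are invented) represents both policy and observation as key–value maps and counts the disagreements, a Hamming-like measure:

```python
# Hypothetical desired state (policy) and actual state (observation).
desired = {"ssh_root_login": "no", "ntp": "on", "backup": "daily", "firewall": "on"}
actual = {"ssh_root_login": "yes", "ntp": "on", "backup": "weekly", "firewall": "on"}

def policy_distance(desired, actual):
    """Count the state variables whose actual value differs from policy;
    a variable missing from either side counts as a disagreement."""
    keys = set(desired) | set(actual)
    return sum(desired.get(k) != actual.get(k) for k in keys)

print(policy_distance(desired, actual))  # two items disagree here
```

A distance of zero then means compliance with policy, and maintenance can be pictured as the activity of driving this distance back toward zero.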

This is all very abstract. In the day-to-day running of systems, few administrators think in such generalized, abstract terms—yet this is what this book asks you to do.

Example 1 (A backup method) A basic duty of system administrators is to perform a backup of data and procedures: to ensure the integrity of the system under natural or unnatural threats. How shall we abstract this and turn it into a scientific enquiry?

We might begin by examining how data can be copied from one place to another. This adds a chain of questions: (i) how can the copying be made efficient? (ii) what does efficient mean? (iii) how often do the data change, and in what way? What is the best strategy for making a copy: immediately after every change, once per day, once per hour? We can introduce a model for the change, for example, a mass of data that is more or less constant, with small random fluctuating changes to some files, driven by random user activity. This gives us something to test against reality. Now we need to know how users behave, and what they are likely to do. We then ask: what do these fluctuations look like over time? Can they be characterized, so that we can tune a copying algorithm to fit them? What is the best strategy for copying the files?
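A toy simulation gives such a model something concrete to compute with. In the sketch below the parameters (file count, change probability) are invented: each file changes independently with a small probability per hour, and we compare the copy cost of hourly incremental backups against a single full daily copy.

```python
import random

random.seed(1)

N_FILES = 10_000   # hypothetical number of files
P_CHANGE = 0.001   # chance that a given file changes in one hour
HOURS = 24

def simulate_day():
    """Return (files copied by hourly incrementals, files copied by a daily full)."""
    incremental = 0
    changed_today = set()
    for _ in range(HOURS):
        changed = {i for i in range(N_FILES) if random.random() < P_CHANGE}
        incremental += len(changed)   # incremental copies only each hour's changes
        changed_today |= changed
    full = N_FILES                    # a full backup copies everything once
    return incremental, full

inc, full = simulate_day()
print(inc, full)
```

Even this crude model exposes the trade-off: the incremental strategy copies far fewer files, but its worth depends entirely on how the fluctuations are actually distributed, which is an empirical question.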

The chain of questions never stops: analysis is a process, not an answer.

Example 2 (Resource management) Planning a system’s resources, and deploying them so that the system functions optimally, is another task for a system administrator. How can we measure, or even discuss, the operation of a system to see how well it is performing? Can centrally important places be identified in the system, where extra resources are needed, or where the system might be vulnerable to failure? How shall we model demand and load? Is the arrival of load (traffic) predictable or stochastic? How does this affect our ability to handle it? If one part of the system depends on another, what does this mean for efficiency or reliability? How do we even start asking these questions analytically?
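The difference between predictable and stochastic load can already be seen in a minimal queue sketch (the rates are invented for illustration): both arrival sequences below have the same average rate, below the server's capacity, yet only the random one builds up a backlog.

```python
import random

random.seed(0)

TICKS = 10_000
SERVICE_PER_TICK = 1   # the server can handle one job per tick
# Both arrival patterns average 0.8 jobs per tick, below capacity.

def max_backlog(arrivals):
    """Feed an arrival sequence through a single server;
    return the largest queue length observed."""
    queue = worst = 0
    for a in arrivals:
        queue = max(0, queue + a - SERVICE_PER_TICK)
        worst = max(worst, queue)
    return worst

# Deterministic load: the average rate spread perfectly evenly.
det = [1 if (i % 5) != 4 else 0 for i in range(TICKS)]
# Stochastic load: bursts of 0-2 jobs with the same mean of 0.8.
sto = [random.choice([0, 0, 1, 1, 2]) for _ in range(TICKS)]

print(max_backlog(det), max_backlog(sto))
```

The deterministic stream never queues at all, while the bursty stream does, even though their averages agree; this is why the stochastic character of demand, and not just its mean, matters for capacity planning.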

Example 3 (Pattern detection) Patterns of activity manifest themselves over time in systems. How do we measure the change, and what is the uncertainty in our measurement? What are their causes? How can they be described and modelled? If a system changes its pattern of behaviour, what does this mean? Is it a fault or a feature?

In computer security, intrusion detection systems often make use of this kind of idea, but how can the idea be described, quantified and generalized, hence evaluated?
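A minimal quantification of the idea, in the spirit of the local averaging procedures of chapter 3, flags a measurement that strays too far from the mean of a sliding window. The traffic series in this sketch is fabricated, with one injected burst:

```python
import statistics

def anomalies(series, window=20, k=3.0):
    """Flag indices whose value lies more than k standard deviations
    from the mean of the preceding window of measurements."""
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.mean(past)
        sigma = statistics.pstdev(past)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# A fabricated 'normal' periodic pattern with one burst at index 50.
traffic = [10 + (i % 5) for i in range(100)]
traffic[50] = 60
print(anomalies(traffic))
```

Even this crude detector raises the questions in the example: the window size and threshold k encode assumptions about what counts as normal variation, and those assumptions are themselves a kind of policy.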

Example 4 (Configuration management) The initial construction and implementation of a system, in terms of its basic building blocks, is referred to as its configuration. It is a measure of the system’s state or condition. How should we measure this state? Is it a fixed pattern, or a statistical phenomenon? How quickly should it change? What might cause it to change unexpectedly? How big a change can occur before the system is damaged? Is it possible to guarantee that every configuration will be stable, perform its intended function, and be implementable according to the constraints of a policy?
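One concrete reading treats the configuration as a collection of state values and the policy as a set of testable constraints over them. In the sketch below both the configuration items and the constraints are invented; verifying the configuration then means evaluating every constraint and reporting violations:

```python
# Hypothetical configuration state of a host.
config = {
    "open_ports": [22, 80, 8080],
    "disk_use": 0.93,
    "services": ["sshd", "httpd"],
}

# A policy expressed as a list of (name, predicate) pairs over the state.
policy = [
    ("no unexpected ports", lambda c: set(c["open_ports"]) <= {22, 80, 443}),
    ("disk below 90%", lambda c: c["disk_use"] < 0.90),
    ("sshd running", lambda c: "sshd" in c["services"]),
]

def violations(config, policy):
    """Return the names of the policy constraints the configuration breaks."""
    return [name for name, ok in policy if not ok(config)]

print(violations(config, policy))
```

Whether every configuration satisfying such constraints is also stable and implementable is, as the example asks, a separate and harder question.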

In each of the examples above, an apparently straightforward issue generates a stream of questions that we would like to answer. Asking these questions is what science is about; answering them involves the language of mathematics and logic in concert with a scientific inquiry: science is about extracting the essential features from complex observable phenomena and modelling them in order to make predictions. It is based on observation and approximate verification. There is no ‘exact science’ as we sometimes hear about in connection with physics or chemistry; it is always about suitably idealized approximations to the truth, or ‘uncertainty management’. Mathematics, on the other hand, is not to be confused with science—it is about rewriting assumptions in different ways: that is, if one begins with a statement that is assumed true (an axiom) and manipulates it according to the rules of mathematics, the resulting statement is also true by the same axiom. It contains no more information than the assumptions on which it rests. Clearly, mathematics is an important language for expressing science.

1.6 How to use the text

Readers should not expect to understand or appreciate everything in this book in the short term. Many subtle and deep-lying connections are sewn in these pages that will take even the most experienced reader some time to unravel. It is my hope that there are issues sketched out here that will provide fodder for research for at least a decade, probably several. Many ideas about the administration of systems are general and have been discussed many times in different contexts, but not in the manner or context of system administration.

The text can be read in several ways. To gain a software-engineering perspective, one can replace ‘the system’ with ‘the software’. To gain a business management perspective, replace ‘the system’ with ‘the business’, or ‘the organization’. For human–computer administration, read ‘the system’ as ‘the network of computers and its users’.

The first part of the book is about observing and recording observations about systems, since we aim to take a scientific approach to systems. Part 2 concerns abstracting and naming the concepts of a system’s operation and administration in order to place them into a formal framework. In the final part of the book, we discuss the physics of information systems, that is, the problem of how to model the time-development of all the resources in order to determine the effect of policy. This reflects the cycle of development of a system:

1.7 Some notation used

A few generic symbols and notations are used frequently in this book and might be unfamiliar.

The function q(t) is always used to represent a ‘signal’ or quantity that varies in the system, that is, a scalar function describing any value that changes in time. I have found it more useful to call all such quantities by the same symbol, since they all have the same status.

q(x, t) is a function of time and a label x that normally represents a spatial position, such as a memory location. In structured memory, composed of multiple objects with finite size, the addresses are multi-dimensional and we write q(x, t), where x = (x1, …, xn) is an n-dimensional vector that specifies location within a structured system, for example, (6, 3, 8) meaning perhaps bit 6 of component 3 in object 8.

In describing averages, the angle-bracket notation ⟨·⟩ is used for mean and expectation values; for example, ⟨X⟩ would mean an average over values of X. In statistics literature, this is often written as E(X).
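As a concrete reading of this notation, the following sketch samples a fabricated signal q(t), a constant level plus a periodic fluctuation, and estimates its mean ⟨q⟩ (the E(q) of statistics texts) as a sample average over whole periods:

```python
import math
import statistics

# A fabricated signal q(t): a constant level plus a periodic fluctuation.
def q(t):
    return 5.0 + math.sin(2 * math.pi * t / 24)

samples = [q(t) for t in range(240)]   # ten full 24-step periods
mean_q = statistics.mean(samples)      # estimate of <q>, i.e. E(q)
print(round(mean_q, 6))
```

Because the fluctuation averages to zero over whole periods, the sample mean recovers the constant level, illustrating how ⟨q⟩ separates a steady value from the fluctuations about it.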