Table of Contents

Table of Contents

This edition first published 2011

© 2011 John Wiley & Sons, Ltd

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at .

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

Library of Congress Cataloging-in-Publication Data

Gittins, John C., 1938-

Multi-armed bandit allocation indices / John Gittins, Richard Weber, Kevin Glazebrook.—2nd ed.

p. cm.

Includes bibliographical references and index.

ISBN 978-0-470-67002-6(cloth)

1. Resource allocation—Mathematical models. 2. Mathematical optimization. 3. Programming (Mathematics) I. Weber, Richard, 1953- II. Glazebrook, Kevin D., 1950- III. Title.

QA279.G55 2011

519.5—dc22

2010044409

A catalogue record for this book is available from the British Library.

Print ISBN: 978-0-470-67002-6

ePDF ISBN: 978-0-470-98004-0

oBook ISBN: 978-0-470-98003-3

ePub ISBN: 978-1-119-99021-5

Foreword

John Gittins' first edition of this book marked the end of an era, the era in which a succession of investigators struggled for an understanding and ‘solution’ of the multi-armed bandit problem. My foreword to that edition celebrated the gaining of this understanding, and so it seems fitting that this should be retained.

The opening of a new era was like the stirring of an ant-heap, with the sudden emergence of an avid multitude and a rush of scurrying activity. The first phase was one of exploitation, in which each worker tried to apply his special expertise in this new context. This yielded, among other things, a remarkable array of proofs of optimality of the Gittins index policy. The most elegant and insightful was certainly Richard Weber's ‘prevailing charge’ proof (see section 2.5), expressible in a single paragraph of verbal reasoning. I must confess, however, to a lingering attachment to the dynamic programming proof (see section 4.3) which provided also the value function of the Gittins policy and a treatment immediately generalizable to the case of superprocesses.

The phase of real interest was the subsequent one, of exploration. To what range of models can the Gittins technique be extended? Here the simile latent in the terms ‘bandit’ and ‘arm’ (with their gambling machine origins) begins to become quite strained. I myself spoke rather of a population of ‘projects’, only one of which could be engaged at any given time. One wishes to operate only the high-value projects, but can determine which these are only by trying them all—it is to this that the phrase ‘exploitation and exploration’ first referred. The process is then one of allocation and learning. Any project does not itself change, but its ‘informational state’ does—one's knowledge of it as expressed by a Bayesian updating.

However, situations occur for which the projects do also have a physical state, which may change by stochastic rules, or for which the population of projects changes by immigration or emigration. These are cases one would wish to cover. It is then more natural to think of the projects as ‘activities’, having their own dynamic structure, and whose performance one would wish to optimize by the appropriate allocation of given resources over activities. This is the classical economic activity analysis in a dynamic setting. However, we are now effectively adding to this the feature that the current physical states of the activities are incompletely known, and must be inferred from observation. ‘Observation time’ is then an additional resource to be allocated. Section 6.8 gives the clearest indication of such an ambition.

Extension to such cases requires new tools, and Chapters 5 and 6 consider two such: study of the achievable region and Lagrangian relaxation of the optimization. These central chapters present very clearly the current state of theory in these areas, a considerable enhancement of understanding and many examples of cases of practical interest for which an optimal policy—of index character—is determined. It must be confessed, however, that the general problems of indexability and optimality remain beached on the research frontier, although one senses a rising tide which will lift them.

Explicitness is also a feature of Chapters 7 and 8, which go into hard detail on the determination of indices under different statistical assumptions.

This scholarly and modern work gives a satisfyingly complete, rigorous and illuminating account of the current state of the subject. It also conveys a historical perspective, from the work of Klimov and von Olivier (which appeared in print before appreciation of John Gittins' advance had become general) to the present day. All three authors of the text have made key and recent contributions to the subject, and are in the best position to lay its immediate course.

Peter Whittle

Foreword to the First Edition

The term ‘Gittins index’ now has firm currency in the literature, denoting the concept which first proved so crucial in the solution of the long-standing multi-armed bandit problem and since then has provided a guide for the deeper understanding of all such problems. The author is, nevertheless, too modest to use the term so I regard it as my sole role to reassure the potential reader that the author is indeed the Gittins of the index, and that this book sets forth his pioneering work on the solution of the multi-armed bandit problem and his subsequent investigation of a wide class of sequential allocation problems.

Such allocation problems are concerned with the optimal division of resources between projects which need resources in order to develop and which yield benefit at a rate depending upon their degree of development. They embody in essential form a conflict evident in all human action. This is the conflict between taking those actions which yield immediate reward and those (such as acquiring information or skill, or preparing the ground) whose benefit will come only later.

The multi-armed bandit is a prototype of this class of problems, propounded during the Second World War, and soon recognized as so difficult that it quickly became a classic, and a byword for intransigence. In fact, John Gittins had solved the problem by the late sixties, although the fact that he had done so was not generally recognized until the early eighties. I can illustrate the mode of propagation of this news, when it began to propagate, by telling of an American friend of mine, a colleague of high repute, who asked an equally well-known colleague ‘What would you say if you were told that the multi-armed bandit problem had been solved?’ The reply was somewhat in the Johnsonian form: ‘Sir, the multi-armed bandit problem is not of such a nature that it can be solved’. My friend then undertook to convince the doubter in a quarter of an hour. This is indeed a feature of John's solution: that, once explained, it carries conviction even before it is proved.

John Gittins gives here an account which unifies his original pioneering contributions with the considerable development achieved by both by himself and other workers subsequently. I recommend the book as the authentic and authoritative source-work.

Peter Whittle

Preface

The first edition of this book was published in 1989. A shortened version of the preface to that edition follows this preface. The uninitiated reader might find it helpful to read it at this point.

Since 1989 the field has developed apace. There has been a remarkable flowering of different proofs of the main index theorem, each giving its own particular insight as to why the result is true. Major bodies of related theory have also emerged, notably the achievable region and restless bandit methodologies, and the discussion in economics of the appropriate balance between exploration and exploitation. These have led, for example, to improved algorithms for calculating indices and to improved bounds on the performance of index strategies when they are not necessarily optimal. These developments form the case for a new edition, plus the fact that there is an ongoing interest in the book, which is now difficult to buy.

There are now three authors rather than just John Gittins. Kevin Glazebrook and Richard Weber bring familiarity with more recent developments, to which they have made important contributions. Their expertise has allowed us to include new chapters on achievable regions and on restless bandits.

We have also taken the opportunity to substantially revise the core earlier chapters. Our aim has been to provide an accessible introduction to the main ideas, taking advantage of more recent work, before proceeding to the more challenging material in Chapters 4, 5 and 6. Overall we have tried to produce an expository account rather than just a research monograph. The exercises are designed as an aid to a deeper understanding of the material.

The Gittins index, as it is commonly known, for a project competing for investment with other projects allows both for the immediate expected reward, and for the value of the information gained from the initial investment, which may be useful in securing later rewards. These factors are often called exploitation and exploration, respectively.

In this spirit Chapter 1, which is a taster for the rest of the book, is subtitled ‘Exploration’. The mainstream of the book then continues through Chapters 2 to 4. It breaks into independent substreams represented by Chapter 5, Chapter 6, Chapters 7 and 8, and Chapter 9, which looks briefly at five further application areas.

Readers should have some prior knowledge of university-level probability theory, including Markov processes, and for a full understanding of Chapters 5 and 6 will need to know the basic ideas of linear programming. We hope the book will be of interest to researchers in chemometrics, combinatorics, economics, management science, numerical analysis, operational research, probability theory and statistics. Few readers will want to read from cover to cover. For example, both chemometricians and numerical analysts are likely to be interested mainly in Chapter 8 and the tables which it describes: chemometricians as potential users of the tables, and numerical analysts for the sake of the methods of calculation. As another example, Chapters 1, 2 and 3 could form the basis for a specialized lecture course in applied probability for final-year undergraduates.

In addition to those colleagues acknowledged in the preface to the first edition we are very grateful to Anne-Marie Oreskovich and Morton Sorenson for further help in the calculation of index values.

John Gittins

Department of Statistics, University of Oxford

Kevin Glazebrook

Department of Management Science, Lancaster University

Richard Weber

Statistical Laboratory, University of Cambridge

August 2010

Preface to the First Edition

A prospector looking for gold in the Australian outback has to decide where to look, and if he prospects for any length of time must repeatedly take further such decisions in the light of his success to date. His decision problem is how to allocate his time, in a sequential manner, so as to maximize his likely reward. A similar problem faces a student about to graduate from university who is looking for employment, or the manager of a commercial laboratory with several research projects competing for the attention of the scientists at his or her disposal.

It is to problems like these that the indices described in this book may be applied. The problems are characterized by alternative independent ways in which time or effort may be consumed, for each of which the outcome is uncertain and may be determined only in the course of applying the effort. The choice at each stage is therefore determined partly on the basis of maximizing the expected immediate rate of return, and partly by the need to reduce uncertainty, and thereby provide a basis for better choices later on. It is the tension between these two requirements that makes the decision problem both interesting and difficult.

The classic problem of this type is the multi-armed bandit problem, in which several Bernoulli processes with different unknown success probabilities are available, each success yielding the same reward. There are already two good books on this subject: by Presman and Sonin (1982), and by Berry and Fristedt (1985). Both books go well beyond the simple version of the problem just stated: most notably to include, respectively, an exponentially distributed rather than constant interval between trials, and a study of the consequences of discounting rewards in different ways.

The aims of this book are to expound the theory of dynamic allocation indices, and to explore the class of problems for which they define optimal policies. These include sampling problems, like the multi-armed bandit problem, and stochastic scheduling problems, such as the research manager's problem. Tables of index values are given for a number of sampling problems, including the classical Bayesian multi-armed bandit problem. For the most part, and except where otherwise indicated, the book is an account of original work, though much of it has appeared before in the publications bearing the author's name which are listed in the references.

It is a pleasure to acknowledge the stimulus and encouragement which this work has received over the years by conversations, and in some cases collaboration, with Joao Amaral, Tony Baker, John Bather, Sten Bergman, Owen Davies, Kevin Glazebrook, David Jones, Frank Kelly, Aamer Khan, Dennis Lindley, Peter Nash, David Roberts and Peter Whittle.

I should particularly like to thank Joao Amaral, Frank Geisler and David Jones for the substantial effort involved in calculating index values, and Brenda Willoughby for her painstaking work in typing the manuscript. The book is dedicated to Hugh and Anna, who have grown up during its preparation.

John Gittins

Mathematical Institute, Oxford University

March 1989