
CONTENTS

PREFACE

CONTRIBUTORS

CHAPTER 1: WHAT ARE OUR MODELS REALLY TELLING US? A PRACTICAL TUTORIAL ON AVOIDING COMMON MISTAKES WHEN BUILDING PREDICTIVE MODELS

1.1 INTRODUCTION

1.2 PRELIMINARIES

1.3 DATASETS

1.4 BUILDING PREDICTIVE MODELS

1.5 EVALUATING THE PERFORMANCE OF PREDICTIVE MODELS

1.6 MOLECULAR DESCRIPTORS

1.7 BUILDING AND TESTING A RANDOM FOREST MODEL

1.8 EXPERIMENTAL ERROR AND MODEL PERFORMANCE

1.9 MODEL APPLICABILITY

1.10 COMPARING PREDICTIVE MODELS

1.11 CONCLUSION

REFERENCES

SOURCE CODE LISTINGS

CHAPTER 2: THE CHALLENGE OF CREATIVITY IN DRUG DESIGN

2.1 DRUG DESIGN HISTORY: INCREMENTALISM AND SERENDIPITY

2.2 PHYSICAL REALITY AND COMPUTATIONAL METHODS

2.3 SUMMARY

REFERENCES

CHAPTER 3: A ROUGH SET THEORY APPROACH TO THE ANALYSIS OF GENE EXPRESSION PROFILES

3.1 INTRODUCTION

3.2 METHODOLOGY

3.3 DRUG-INDUCED GENE EXPRESSION AND PHOSPHOLIPIDOSIS IN HUMAN HEPATOMA HEPG2 CELLS

3.4 DISCUSSION

3.5 SUMMARY AND CONCLUSIONS

NOTES

REFERENCES

CHAPTER 4: BIMODAL PARTIAL LEAST-SQUARES APPROACH AND ITS APPLICATION TO CHEMOGENOMICS STUDIES FOR MOLECULAR DESIGN

4.1 INTRODUCTION

4.2 MATERIAL AND METHODS

4.3 RESULTS AND DISCUSSION

4.4 CONCLUSION

4.5 ACKNOWLEDGMENTS

REFERENCES

CHAPTER 5: STABILITY IN MOLECULAR FINGERPRINT COMPARISON

5.1 INTRODUCTION

5.2 METHODS

5.3 RESULTS

5.4 CONCLUSIONS AND DIRECTIONS

REFERENCES

CHAPTER 6: CRITICAL ASSESSMENT OF VIRTUAL SCREENING FOR HIT IDENTIFICATION

6.1 INTRODUCTION

6.2 FACTORS AFFECTING THE OUTCOME AND EVALUATION OF VIRTUAL SCREENING CAMPAIGNS

6.3 HOW TO EVALUATE VIRTUAL SCREENING PERFORMANCE?

6.4 VIRTUAL VERSUS HIGH-THROUGHPUT SCREENING

6.5 STRUCTURAL NOVELTY REVISITED: EXEMPLARY CASES

6.6 EXPECTATIONS AND SELECTED APPLICATIONS

6.7 CONCLUSIONS: WHAT IS POSSIBLE? WHAT IS NOT?

REFERENCES

CHAPTER 7: CHEMOMETRIC APPLICATIONS OF NAÏVE BAYESIAN MODELS IN DRUG DISCOVERY: BEYOND COMPOUND RANKING

7.1 INTRODUCTION

7.2 VIRTUAL SCREENING USING BAYESIAN MODELS

7.3 DATA TYPES AND DATA QUALITY REQUIREMENTS

7.4 TARGET AND PHENOTYPE COMPARISON IN CHEMICAL AND BIOLOGICAL ACTIVITY SPACE

7.5 MINING FOR ENRICHED FEATURES AND INTERPRETING THEM

7.6 SHORTCOMINGS OF NBM

7.7 SUMMARY

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 8: CHEMOINFORMATICS IN LEAD OPTIMIZATION

8.1 HISTORICAL INTRODUCTION

8.2 LEAD OPTIMIZATION IS A LARGE, COMPLEX, MULTIOBJECTIVE PROCESS

8.3 CHEMOINFORMATICS METHODS FOR MULTIOBJECTIVE OPTIMIZATION

8.4 CASE STUDIES

8.5 CONCLUSION

REFERENCES

CHAPTER 9: USING CHEMOINFORMATICS TOOLS TO ANALYZE CHEMICAL ARRAYS IN LEAD OPTIMIZATION

9.1 INTRODUCTION

9.2 LEAD OPTIMIZATION PROJECTS

9.3 COVERAGE OF CHEMISTRY AND PROPERTY SPACE (ΔS–ΔA PLOTS)

9.4 TEMPORAL ANALYSIS OF LEAD OPTIMIZATION

9.5 MODELING LEAD OPTIMIZATION AS A SELF-AVOIDING RANDOM WALK

9.6 INSIGHTS FROM THE DATA ANALYSIS

9.7 EXTRACTING INFORMATION ON ARRAYS FROM THE ARCHIVE

9.8 CONCLUSIONS

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 10: EXPLORATION OF STRUCTURE–ACTIVITY RELATIONSHIPS (SARs) AND TRANSFER OF KEY ELEMENTS IN LEAD OPTIMIZATION

10.1 INTRODUCTION

10.2 METHODS FOR SAR ANALYSIS

10.3 SAR TRANSFER IN RESCAFFOLDING

10.4 ADDRESSING ANTITARGET ACTIVITY

10.5 CONCLUSION

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 11: DEVELOPMENT AND APPLICATIONS OF GLOBAL ADMET MODELS: IN SILICO PREDICTION OF HUMAN MICROSOMAL LABILITY

11.1 INTRODUCTION

11.2 CASE STUDY ON METABOLIC LABILITY

11.3 CONCLUSION

REFERENCES

CHAPTER 12: CHEMOINFORMATICS AND BEYOND: MOVING FROM SIMPLE MODELS TO COMPLEX RELATIONSHIPS IN PHARMACEUTICAL COMPUTATIONAL TOXICOLOGY

12.1 INTRODUCTION

12.2 DATA-DRIVEN MODELING

12.3 DELIVERING IMPACT: BRINGING IT TO THE CUSTOMER

12.4 SUMMARY AND OUTLOOK

ACKNOWLEDGMENTS

REFERENCES

CHAPTER 13: APPLICATIONS OF CHEMINFORMATICS IN PHARMACEUTICAL RESEARCH: EXPERIENCES AT BOEHRINGER INGELHEIM IN GERMANY

13.1 INTRODUCTION

13.2 INFRASTRUCTURE AND SYSTEMS

13.3 METHODS AND APPLICATIONS

13.4 DISCUSSION

REFERENCES

CHAPTER 14: LESSONS LEARNED FROM 30 YEARS OF DEVELOPING SUCCESSFUL INTEGRATED CHEMINFORMATIC SYSTEMS

14.1 INTRODUCTION

14.2 HISTORY

14.3 KEYS TO THE SUCCESS OF MOBIUS: A TECHNICAL PERSPECTIVE

14.4 LESSONS LEARNED: THE BOTTOM LINE

14.5 BUILD VERSUS BUY VERSUS OPEN SOURCE

14.6 CONCLUSIONS AND SUMMARY

REFERENCES

CHAPTER 15: MOLECULAR SIMILARITY ANALYSIS

15.1 INTRODUCTION

15.2 A BRIEF HISTORY OF MOLECULAR SIMILARITY ANALYSIS

15.3 COGNITIVE ASPECTS OF SIMILARITY

15.4 MOLECULAR SIMILARITY MEASURES

15.5 SOME ISSUES IN MOLECULAR SIMILARITY ANALYSIS

15.6 AN APPLICATION OF MOLECULAR SIMILARITY ANALYSIS TO CHEMICAL SPACE AND ACTIVITY LANDSCAPES

15.7 FINAL THOUGHTS

ACKNOWLEDGMENTS

NOTES

REFERENCES

SUPPLEMENTAL IMAGES

INDEX


PREFACE

Chemoinformatics: From methods and models to pharmaceutical applications

Chem(o)informatics is a relatively young and still evolving discipline, although some of its scientific origins can be traced back at least five decades. It continues to be challenging to clearly define chemoinformatics as a scientific field. Essentially, chemoinformatics uses algorithms and computational methods, often adapted from computer science, to organize and process chemical data, analyze and predict structure–property relationships of small molecules, and design compounds. Although chemoinformatics is not confined to questions and tasks that are relevant for pharmaceutical research, this field has firm roots in drug discovery. In fact, when the term chemoinformatics was first introduced in the literature in 1998 (Brown FK. Chemoinformatics: What is it and how does it impact drug discovery. Ann. Rep. Med. Chem. 1998;33:375–384), there was a strong focus on drug discovery research—and this has been a characteristic of this field ever since. Accordingly, the study of biological activities of chemical compounds and analysis of their structure–activity relationships (SARs) are hallmarks of chemoinformatics as we understand it today. As a consequence, methodological boundaries between chemoinformatics, computational chemistry, and drug design are rather fluid. In more specific terms, chemoinformatics has been defined to cover a wide range of scientific topics, from chemical data collection, management, and analysis to the exploration of SARs and prediction of compound activity or in vivo properties (Bajorath J. Understanding chemoinformatics: A unifying approach. Drug Discov. Today 2004;9:13–14). The scientific diversity of the field is high (Warr WA. Some trends in chem(o)informatics. Meth. Mol. Biol. 2011;672:1–37) and likely to increase even further, given the advent of research disciplines such as chemical biology or nanoscience, for which concepts from chemoinformatics are also relevant.
Despite the presence of fluid scientific boundaries, characteristic features of chemoinformatics include its large-scale character (i.e., very large numbers of compounds and activity data are processed and analyzed) and its dual purpose of generating computational infrastructures and predictive models or data mining methods. Given its roots, another characteristic feature of chemoinformatics is that many important developments have originated from pharmaceutical environments, in addition to research carried out in academia. It is evident that the pharmaceutical industry is the place where the need for chemoinformatics technologies and experts has been and continues to be the greatest. One should also note that the chemoinformatics literature is dominated by reports of computational methods and benchmark investigations, rather than practical applications. This is not very surprising, given that the majority of pharmaceutical applications are a part of drug discovery campaigns and hence proprietary (at least for the duration of a discovery project). However, there clearly is a need to evaluate and better understand what chemoinformatics can actually accomplish in practical drug discovery situations. This need is not sufficiently met by the current scientific literature.

Having briefly introduced chemoinformatics as a scientific discipline, I should address the question of why this book was originally planned and ultimately written. What was the prime motivation? Unlike other currently available textbooks on chemoinformatics (there are not many), this book was envisioned to focus mostly (but not exclusively) on practical applications of chemoinformatics approaches in pharmaceutical research, hence addressing the need referred to earlier. It was intended to bring together leading experts from the pharmaceutical industry and selected academic institutions to describe the practice of chemoinformatics, illustrate the interplay between academic and pharmaceutical research, and showcase collaborations. Among others, key questions for authors included: What does chemoinformatics mean to you? How is it applied in your specific research environment? How does chemoinformatics contribute to pharmaceutical research? What works? What does not? Hence, special emphasis was put on expert views and practical experience that might reflect the “true” impact of chemoinformatics approaches in drug discovery. In addition, a few selected methodological concepts were considered to further expand the spectrum of the presentations.

The 15 chapters presented herein include contributions from major pharmaceutical companies, a leading software firm, and several academic groups. They also cover collaborative efforts between academia and pharma. The chapters are arranged to follow a conceptual path from the description of methods and models to drug discovery applications and the design of chemoinformatics infrastructures. Hence, they span a wide range of topics.

Chapter 1 by W. Patrick Walters from Vertex presents a practical guide to the generation and evaluation of predictive models. It emphasizes common pitfalls in model building and assessment and shows how to avoid them. Many practical examples are provided, including source code, which results in an instructive and much-needed contribution. In Chapter 2 by Ajay Jain of the University of California at San Francisco, computational methods and models are considered from a principled point of view. The argument is made—and well supported—that the success of computational models often depends on the incorporation of sound physical principles (termed physical reality), although their consideration inevitably also introduces approximations. A number of well-selected methodological examples are presented.

Chapter 3 by Gerald M. Maggiora of the University of Arizona and collaborators of the Mayo Clinic and the Torrey Pines Institute for Molecular Studies reports the adaptation of a new approach for chemoinformatics, that is, rough set theory, and discusses opportunities of this approach for drug discovery applications. In Chapter 4, Kiyoshi Hasegawa of the Chugai Pharmaceutical Company and Kimito Funatsu of the University of Tokyo also introduce new methodology. Their collaborative effort describes the application of the bimodal partial least-squares regression technique to analyze compound activity data by taking both ligand and target representations into account. Furthermore, in Chapter 5, Anthony Nicholls and Brian Kelley of OpenEye Scientific Software investigate search characteristics of different types of two-dimensional fingerprints, which are among the most popular molecular representations for chemical similarity searching and ligand-based virtual screening. Nicholls and Kelley pay particular attention to the way molecular similarity relationships are accounted for by different fingerprint representations and analyze how similarity assessment might be biased by fingerprints having high or low chemical resolution. On the basis of their findings, differences in search characteristics between fingerprints of alternative design can be rationalized. Practical implications of these results and possible methodological extensions are also discussed. Chapter 6, a contribution from our research group, further expands on ligand-based virtual screening, puts the approach into scientific context, and presents a critical assessment of practical virtual screening applications. Then, Meir Glick and colleagues of the Novartis Institutes for Biomedical Research, the authors of Chapter 7, describe a variety of applications of Bayesian modeling methods in drug discovery.
Bayesian methods currently are among the most popular chemoinformatics approaches for compound classification, activity prediction, and target assignment. The topics discussed in this contribution include the analysis of phenotypic screening data and the prediction of off-target effects of drugs.

The contributions described thus far largely focus on approaches for the identification and characterization of active compounds. Once new active molecules have been identified, early-phase drug discovery projects transition into the hit-to-lead and lead optimization phases. Chapter 8 by Darren Green of GlaxoSmithKline and Matthew Segall of Optibrium Ltd. presents a thoughtful account of the evolution of lead optimization strategies and illustrates how different chemoinformatics concepts are adapted to aid in the optimization process. This contribution is very well complemented by Chapter 9, which reports on lead optimization collaborations between academia and the pharmaceutical industry. This work involved Valerie Gillet and Peter Willett of the University of Sheffield and George Papadatos et al. of GlaxoSmithKline. Here, the use of compound arrays for lead optimization is the major topic. A variety of chemoinformatics approaches have been employed to aid in the design of compound arrays and analyze progress made over time in lead optimization projects. This contribution also illustrates practical constraints involved in data assembly that affect medicinal chemistry projects and often work against a systematic and timely application of computational methods during lead optimization. In Chapter 10, Hans Matter and colleagues of Sanofi-Aventis further extend the lead optimization theme. They present a thorough and extensively referenced review of chemoinformatics methodologies for the analysis and prediction of SARs and demonstrate how such approaches have specifically been adapted for in-house applications. The chapter also contains a discussion of methods to transfer SARs from one chemical series to another, which is a topic of high interest in medicinal chemistry.

The optimization of leads and generation of clinical candidates is a complex multiparametric process in which in vivo compound characteristics such as absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are as important as compound potency and specificity. The following two contributions address these issues. In Chapter 11, Karl-Heinz Baringhaus et al., also of Sanofi-Aventis, discuss how different types of computational ADMET models are generated and present a case study in which a model of human liver microsomal lability (a measure of metabolic instability of compounds) was derived for in-house use. Then, in Chapter 12, Scott Boyer and colleagues of AstraZeneca further expand the discussion of ADMET models with a focus on toxicology assessment. Their contribution also highlights the critically important role primary in vivo data play for predictive model building, given their sparseness and expected error margins. Both contributions cover a wide range of chemoinformatics methodologies for the derivation of ADMET models. With a concluding discussion of data delivery and communication issues, Chapter 12 also represents a transition point to another important thematic section of the book.

The contributions described thus far introduce scientific concepts, derive increasingly complex prediction models, and illustrate how such models are practically applied in drug discovery. As such, they represent a major category of chemoinformatics approaches in pharmaceutical research, that is, modeling and prediction of various compound properties. Another major category includes the design and implementation of computational infrastructures and information systems, which is as important for drug discovery as data mining and predictive modeling. In fact, pharmaceutical research environments heavily rely on the availability of specialized database structures and information systems to enable data warehousing with consistent deposition, distribution, access, and use across an organization. For large pharmaceutical companies, these requirements represent challenging tasks. The last two contributions in this book address these challenges. In Chapter 13, Nils Weskamp et al. describe how comprehensive chemoinformatics and database structures have been designed and implemented at Boehringer Ingelheim. Here, it becomes clear that data archiving and handling is only a part of the equation—it is equally important to provide general access to modeling tools to, for example, analyze high-throughput screening data or characterize SARs. This presents considerable challenges for chemoinformaticians because such computational tools must not only be generated or adopted but also be made accessible to nonexpert users in the form of automated and easy-to-use workflows. In addition, results must be communicated in an intuitive and interpretable manner. Furthermore, in Chapter 14, Michael S. Lajiness and Thomas R. Hagadone of Eli Lilly and Company discuss lessons learned from over three decades of design and implementation of different generations of chemoinformatics systems for pharmaceutical research.
These investigators are among the pioneers in building and maintaining such computational infrastructures in different company-specific environments. Their contribution illustrates how such systems have evolved, and continue to evolve, as computational resources and requirements rapidly change and data volumes and drug discovery demands further increase. On the basis of their long experience, Lajiness and Hagadone comment on a number of practical aspects associated with system design that should be taken into consideration to ensure quality, accessibility, and utility of chemoinformatics infrastructures in drug discovery settings.

The book begins with chemoinformatics methodology, and so it ends. To close the circle, in the final chapter (Chapter 15), José L. Medina-Franco of the Torrey Pines Institute for Molecular Studies and Gerald M. Maggiora of the University of Arizona describe foundations of molecular similarity analysis, one of the central themes in chemoinformatics. The evaluation and quantification of molecular similarity as an indicator of activity similarity is at the core of many chemoinformatics methods and remains an intensely investigated research topic to this day, conceptually linked to the design and navigation of chemical feature spaces.

Taken together, the contributions in this book highlight—from different points of view—key issues for the practice of chemoinformatics. The initial goals of this book project were quite ambitious and potential complications were expected. On the one hand, it was anticipated that it might be difficult for researchers in academia to present studies that are of high practical relevance for drug discovery; on the other hand, that it might be even more difficult for many investigators in the pharmaceutical industry to elaborate on details of their chemoinformatics work, given the proprietary nature of most of their projects. However, the chapters in this book have clearly exceeded initial expectations. Hence, I am very grateful to all authors who have spent their time and efforts to put together these excellent contributions! Without their early commitment and dedication, this project would not have been possible.

The contents of the book should be of interest to experts and practitioners in this field as well as to newcomers; there will be interesting materials for individuals with different motivations and levels of experience. Many of the questions that were initially asked have been answered in different ways and from different perspectives, which is highly desirable—after all, authors should have the last word.

Last but not least, given the critical expert views presented in this book and its practical drug discovery orientation, it is hoped that this publication will represent another important step forward in further defining and supporting chem(o)informatics as a scientific discipline at the interface between chemistry, computer science, and drug discovery.

JÜRGEN BAJORATH

CONTRIBUTORS

ERNST AHLBERG, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

SAMUEL ANDERSSON, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

JÜRGEN BAJORATH, Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany

KARL-HEINZ BARINGHAUS, R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany

BERND BECK, Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

MICHAEL BIELER, Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

SCOTT BOYER, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

LARS CARLSSON, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

KIMITO FUNATSU, Department of Chemical System Engineering, University of Tokyo, Tokyo, Japan

VALERIE J. GILLET, Information School, University of Sheffield, Sheffield, UK

MEIR GLICK, Novartis Institutes for BioMedical Research, Cambridge, MA, USA

DARREN V. S. GREEN, GlaxoSmithKline Medicines Research Centre, Stevenage, Herts, UK

STEFAN GÜSSREGEN, R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany

PETER HAEBEL, Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

THOMAS R. HAGADONE, Eli Lilly and Company, Indianapolis, IN, USA

KIYOSHI HASEGAWA, Chugai Pharmaceutical Company, Kamakura Research Laboratories, Kamakura, Kanagawa, Japan

CATRIN HASSELGREN, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

GERHARD HESSLER, R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany

AJAY N. JAIN, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA

BRIAN KELLEY, OpenEye Scientific Software, Inc., Santa Fe, NM, USA

PETER S. KUTCHUKIAN, Novartis Institutes for BioMedical Research, Cambridge, MA, USA

MICHAEL S. LAJINESS, Eli Lilly and Company, Indianapolis, IN, USA

EUGEN LOUNKINE, Novartis Institutes for BioMedical Research, Cambridge, MA, USA

CHRISTOPHER N. LUSCOMBE, GlaxoSmithKline, Medicines Research Centre, Stevenage, UK

GERALD M. MAGGIORA, College of Pharmacy and BIO5 Institute, University of Arizona, Tucson, AZ, USA; Translational Genomics Research Institute, Phoenix, AZ, USA

HANS MATTER, R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany

IAIN M. MCLAY, The Open University, Cardiff, UK

JOSÉ LUIS MEDINA-FRANCO, Torrey Pines Institute for Molecular Studies, Port St. Lucie, FL, USA

NATHALIE MEURICE, Mayo Clinic, Scottsdale, AZ, USA

DANIEL MUTHAS, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

THORSTEN NAUMANN, R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany

ANTHONY NICHOLLS, OpenEye Scientific Software, Inc., Santa Fe, NM, USA

TOBIAS NOESKE, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

GEORGE PAPADATOS, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, UK

JOACHIM PETIT, Mayo Clinic, Scottsdale, AZ, USA

STEPHEN D. PICKETT, GlaxoSmithKline, Medicines Research Centre, Stevenage, UK

FRIEDEMANN SCHMIDT, R&D, LGCR, Structure, Design and Informatics, Sanofi-Aventis Deutschland GmbH, Frankfurt am Main, Germany

MATTHEW SEGALL, Optibrium Ltd., Cambridge, UK

JONNA STÅLRING, Global Safety Assessment, AstraZeneca R&D Mölndal, Mölndal, Sweden

DAGMAR STUMPFE, Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany

ANDREAS TECKENTRUP, Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

W. PATRICK WALTERS, Vertex Pharmaceuticals Incorporated, Cambridge, MA, USA

ALEXANDER WEBER, Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

NILS WESKAMP, Department of Lead Identification and Optimization Support, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riss, Germany

PETER WILLETT, Information School, University of Sheffield, Sheffield, UK

CHAPTER 1

WHAT ARE OUR MODELS REALLY TELLING US? A PRACTICAL TUTORIAL ON AVOIDING COMMON MISTAKES WHEN BUILDING PREDICTIVE MODELS

W. PATRICK WALTERS

1.1 INTRODUCTION

Predictive models have become a common part of modern-day drug discovery [1]. Models are used to predict a range of key parameters including:

When building these models, it is essential that the cheminformatics practitioner be aware of factors that could potentially mislead and confuse those using the models. In this chapter, we will focus on some common traps and pitfalls and discuss strategies for realistic evaluation of models.

We will consider a few important, and often overlooked, issues in the model-building process.

The chapter will take a tutorial format. We will analyze some commonly used datasets and use this analysis to make a few points about the process of building and evaluating predictive models. One of the most important aspects of scientific investigation is reproducibility. As such, all of the analyses discussed in this chapter were performed using readily available, open source software. This makes it possible for the reader to follow along, carry out the analyses, and experiment with the datasets. All of the code used to perform the analyses is available in the listings section at the end of the chapter. The datasets and scripts used in this chapter can also be downloaded from the author’s website https://github.com/PatWalters/cheminformaticsbook. It is hoped that these scripts will kindle an appreciation for aspects of the model-building process and will provide the basis for further exploration.

The software tools required for the analyses are

 

The Python programming language – http://www.python.org

The RDKit cheminformatics programming library – http://www.rdkit.org

The R statistics program – http://www.r-project.org

Python scripts can be run by executing the command

python script_name.py (Unix and OS-X)
python.exe script_name.py (Windows)

where script_name.py is the name of the script to run.

R scripts can be run by executing the following two commands within the R console.

setwd("directory_path")
source("script.R")

In the aforementioned commands, “directory_path” is the full path to the directory (folder) containing the scripts and data, and “script.R” is the name of the script to execute. The R scripts used in this chapter utilize a number of libraries that are not included as part of the base R distribution. These libraries can be easily installed by typing the command

source("install_libraries.R")

in the R console. Since these libraries are being downloaded from the Internet, it is necessary for your computer to be connected to the Internet when executing the aforementioned command.

Those unfamiliar with Python or R are urged to consult references associated with those languages [9–12]. We now live in a data-rich world where every cheminformatics practitioner should possess at least rudimentary programming skills.

1.2 PRELIMINARIES

In order to better understand some of the nuances associated with the construction and evaluation of predictive models, it is useful to consider actual examples. In this chapter, we will examine a number of datasets containing measured values for aqueous solubility and use these datasets to build and evaluate predictive models. Solubility in water or buffer is an important parameter in drug discovery [13]. Poorly soluble compounds tend to have poor pharmacokinetics and can precipitate or cause other problems in assays. As such, the prediction of aqueous solubility has been an area of high interest in the pharmaceutical industry. Over the last 15 years, numerous papers have been published on methods for predicting aqueous solubility [2, 3, 14]. Despite this extensive literature and the release of commercial prediction software, reliable solubility prediction remains a challenge.

The challenges in developing models for predicting solubility can arise from a number of experimental factors. The aqueous solubility of a compound can vary depending on a number of factors including:

In addition to confounding experimental factors, a number of published solubility models are somewhat misleading due to a lack of proper computational controls. While we sometimes have limited control over the experimental data used to build models, we have complete control over the way models are evaluated and should always evaluate them by appropriate means. In subsequent sections, we will use solubility datasets to examine some of these control strategies.

1.3 DATASETS

In this chapter, we will consider three different, publicly available, solubility datasets.

The Huuskonen Dataset This set of 1274 experimental solubility values (Log S) was one of the first large solubility datasets published [15, 16] and has subsequently been used in a number of other publications [14, 17]. The data in this set was extracted from the AQUASOL [18, 19] database, compiled by the Yalkowsky group at the University of Arizona, and the PHYSPROP [20] database, compiled by the Syracuse Research Corporation.

The JCIM Dataset This is a set of 94 experimental solubility values that were released as the training set for a “blind challenge” published in 2008 [21]. All of the solubility values reported in this paper were measured by a single group under a consistent set of conditions. The objective of this challenge was for groups to use a consistently measured set of solubility values to build a model that could subsequently be used to predict the solubility of a set of test compounds. Results of the challenge were reported in a subsequent paper in 2009 [22].

The PubChem Dataset A randomly selected subset of 1000 measured solubility values drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND) by the Sanford-Burnham Medical Research Institute and deposited in the PubChem database (AID 1996) [23]. This dataset is composed primarily of screening compounds from the NIH Molecular Libraries initiative and can be considered representative of the types of compounds typically found in early stage drug discovery programs. Values in this dataset were reported with a qualifier “<”, “=”, or “>” to indicate whether the values were below, within, or above the limit of detection for the assay. Only values within the limit of detection (designated by “=”) were selected for the subset used in this analysis.
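The qualifier-based filtering described above can be sketched in a few lines of Python. The record layout below is hypothetical and is meant only to illustrate the idea, not the actual PubChem export format.

```python
# Hypothetical records mimicking the qualifier/value pairs in the
# PubChem solubility deposition; only "=" entries fall within the
# assay's limits of detection.
records = [
    {"qualifier": "=", "solubility_ug_ml": 12.5},
    {"qualifier": "<", "solubility_ug_ml": 0.1},    # below the limit of detection
    {"qualifier": "=", "solubility_ug_ml": 45.0},
    {"qualifier": ">", "solubility_ug_ml": 250.0},  # above the limit of detection
]

# Keep only values measured within the limit of detection
quantified = [r for r in records if r["qualifier"] == "="]
print(len(quantified))
```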

In order to compare predictions across these three datasets, we first need to put the data in a consistent format. We begin by expressing all of the data as Log S, the log of the molar solubility of the compounds. Data in the PubChem and JCIM datasets were originally reported in µg/ml, so they were transformed to Log S using the formula

LogS = log10((solubility in µg/ml)/(1000.0 * MW))

Where log10 is the base 10 logarithm and MW is the molecular weight.
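This conversion can be expressed as a small helper function; a minimal sketch, assuming the solubility is supplied in µg/ml and the molecular weight in g/mol (the caffeine figures in the comment are illustrative, not taken from the datasets above):

```python
import math

def log_s(solubility_ug_per_ml: float, mw: float) -> float:
    """Convert a solubility in ug/ml to Log S (log10 of molar solubility).

    1 ug/ml equals 1 mg/l, so dividing by 1000 * MW (g/mol) yields mol/l.
    """
    return math.log10(solubility_ug_per_ml / (1000.0 * mw))

# Illustrative example: caffeine (MW ~ 194.19 g/mol) at ~21,600 ug/ml
# gives log_s(21600, 194.19), roughly -0.95.
```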

1.3.1 Exploring Datasets

One of the first things to consider in evaluating a new dataset is the range and distribution of the values reported. An excellent tool for visualizing data distributions is the boxplot [24, 25]. The “box” at the center of the boxplot shows the range covered by the middle 50% of the data, while the “whiskers” extend to the most extreme values that are not considered outliers (conventionally, the most extreme points within 1.5 times the interquartile range of the box). Outliers in the boxplot are drawn as circles. More information on boxplots can be found on the Wikipedia page [26] and the references therein. The anatomy of a boxplot is detailed in Figure 1.1.

Figure 1.2 shows a boxplot of the data distributions for the three solubility datasets mentioned earlier. Numeric summaries of the same data are shown in Table 1.1. Listing 1 provides the R code for loading and annotating the data, as well as generating Figure 1.2 and Table 1.1. The lower whisker and lower hinge in Table 1.1 define the lower extents of the boxplot, while the upper hinge and upper whisker define the upper extents. The interquartile range (IQR) defines the distance between the upper and lower hinges, while the range defines the distance between the upper and lower whiskers.
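The quantities in Table 1.1 can also be computed directly. The sketch below follows Tukey's 1.5 × IQR whisker convention; note that R's boxplot.stats uses hinges that can differ slightly from the simple interpolated quartiles used here:

```python
import numpy as np

def boxplot_stats(values):
    """Boxplot summary: hinges, whiskers, IQR, and outliers.

    Whiskers extend to the most extreme points within 1.5 * IQR of the
    hinges; anything beyond the whiskers is flagged as an outlier.
    """
    v = np.asarray(values, dtype=float)
    lower_hinge, median, upper_hinge = np.percentile(v, [25, 50, 75])
    iqr = upper_hinge - lower_hinge
    in_fence = v[(v >= lower_hinge - 1.5 * iqr) & (v <= upper_hinge + 1.5 * iqr)]
    lower_whisker, upper_whisker = in_fence.min(), in_fence.max()
    outliers = v[(v < lower_whisker) | (v > upper_whisker)]
    return {"lower_whisker": lower_whisker, "lower_hinge": lower_hinge,
            "median": median, "upper_hinge": upper_hinge,
            "upper_whisker": upper_whisker, "iqr": iqr, "outliers": outliers}
```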

FIGURE 1.1 The anatomy of a boxplot.

image

FIGURE 1.2 A boxplot comparison of Log S for the three datasets studied in this chapter.

image

In examining the datasets, we can see that the Huuskonen dataset spans more than 9 logs, while the JCIM dataset spans 5 logs, and the PubChem dataset spans a much smaller 2.4 logs. Note that the IQR for the PubChem dataset is only about one log. Measured solubilities in drug discovery programs typically range between 1 and 100 μM (Log S −6 to −4), so the PubChem dataset can be considered more representative of data that is commonly encountered in drug discovery than the other two datasets. As we will see, the range of data covered by a dataset can have a significant impact on the perceived performance of a model.

Table 1.1 Boxplot Statistics for the Three Datasets Studied in This Chapter

images

1.4 BUILDING PREDICTIVE MODELS

In order to build a predictive model, we need three things:

1. A dataset of experimental measurements for the property we wish to predict
2. A set of molecular descriptors that encode the chemical structures
3. A machine-learning method that relates the descriptors to the experimental values

Models can take a variety of forms, but are typically divided into two categories: classification models, which predict a categorical value (e.g. “soluble” or “insoluble”), and regression models, which predict a continuous numeric value (e.g. Log S).

1.5 EVALUATING THE PERFORMANCE OF PREDICTIVE MODELS

1.5.1 Pearson’s r

By far, the most common method for evaluating regression models in the cheminformatics literature is Pearson’s product–moment correlation [30], more commonly referred to as Pearson’s r, or its square, r². Pearson’s r can be calculated in a number of ways; one of the most straightforward is shown next.

If we have paired values X and Y (e.g. predicted and corresponding experimental values), then we can calculate Pearson’s r as:

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $\bar{x}$ and $\bar{y}$ are the means for X and Y, $s_x$ and $s_y$ are the corresponding standard deviations, and n is the number of data points.

Values of r can vary between −1 and 1, with 1 indicating a perfect linear correlation, −1 being a perfect inverse linear correlation, and 0 indicating an absence of correlation. The definition of what constitutes a “good” value for r is somewhat subjective and situation dependent, and could easily fill an entire chapter or even a book. As we will see in subsequent sections, the dynamic range of the data being considered can have a dramatic effect on Pearson’s r. We will also see that when comparing values of Pearson’s r for different models, we must consider the confidence intervals around r.
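The formula above can be sketched directly in code; a minimal implementation using sample standard deviations (in practice one might instead call a library routine such as scipy.stats.pearsonr):

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation from the sum formula,
    using sample (n - 1) standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return sum((xi - mean_x) * (yi - mean_y)
               for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# A perfect linear relationship gives r = 1; a perfect inverse
# linear relationship gives r = -1.
```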

1.5.2 Kendall’s Tau

One of the drawbacks of Pearson’s r is that it is sensitive to outliers and to the distribution of the underlying data [31]. More recently, many in the cheminformatics community have begun to follow an example set many years ago by statisticians and to report nonparametric measures of correlation such as Kendall’s tau [32]. Because these nonparametric methods employ the rank orders of values rather than the values themselves, they are less sensitive to the data distribution and to outliers. If we have a paired set of values X and Y, we can define Kendall’s tau by counting the number of concordant and discordant pairs in the data. Pairs are considered concordant if their rank orders agree

$$(x_i - x_j)(y_i - y_j) > 0$$

Pairs are considered discordant if their rank orders disagree

$$(x_i - x_j)(y_i - y_j) < 0$$

Kendall’s tau is then evaluated by considering all pairs,

$$\tau = \frac{n_c - n_d}{n(n-1)/2}$$

where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs and n is the number of data points, so that n(n − 1)/2 is the total number of pairs.

As we will see in subsequent code listings, Kendall’s tau can be easily calculated by using the “Kendall” function in the “Kendall” library in R.
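For illustration, the pair-counting definition above translates directly into a brute-force Python sketch (this is tau-a, with no correction for ties; scipy.stats.kendalltau offers a tie-corrected version):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau by brute-force pair counting (tau-a, ignoring ties)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:        # rank orders agree
            concordant += 1
        elif s < 0:      # rank orders disagree
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Like Pearson’s r, tau runs from −1 (perfectly reversed ranking) to 1 (identical ranking).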

1.5.3 Root-Mean-Square Deviation (RMSD)

In addition to defining the correlation between predicted and experimental values, we need a means of measuring the magnitude of the error in the prediction. The most commonly used error measure in the cheminformatics literature is the root-mean-square deviation (RMSD), which is also known as the RMS error [33]. If we consider paired values X and Y, RMSD can be calculated using the following equation

$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$$

where xi and yi are paired values (predicted and experimental) and n is the number of data points.
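In code, the RMSD calculation is a one-liner:

```python
import math

def rmsd(predicted, experimental):
    """Root-mean-square deviation between paired values."""
    n = len(predicted)
    return math.sqrt(sum((p - e) ** 2
                         for p, e in zip(predicted, experimental)) / n)

# A single error of 2 log units among three otherwise perfect
# predictions gives rmsd([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]) = sqrt(4/3).
```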

1.6 MOLECULAR DESCRIPTORS

Over the last 20 years, a vast array of molecular descriptors and machine-learning methods have been applied to the problem of building predictive models [27, 28]. A complete treatment of molecular descriptors and machine-learning methods is beyond the scope of this chapter. Rather than explore the pros and cons of various descriptors and machine-learning methods, we will focus on model-building approaches that can be applied with any descriptors or machine-learning method. In the interest of reproducibility, we will employ a set of molecular descriptors and a machine-learning method that have been widely used and are readily available. The molecular descriptors we will use are the default set calculated by the RDKit cheminformatics programming library [34, 35]. This descriptor set contains a variety of topological indices and encodings for atom environments. As mentioned earlier, our focus here is on the analysis of the results rather than the specifics of the descriptors. The RDKit user manual contains a complete listing of the descriptors as well as the corresponding literature references. Listing 2 provides the Python code for the descriptor calculations used in this chapter.

1.7 BUILDING AND TESTING A RANDOM FOREST MODEL

A variety of machine-learning approaches have been applied to develop models relating chemical structure to physical properties and biological activity. These methods range from relatively simple approaches like linear regression to more sophisticated methods like support-vector machines. In this example, we will utilize the random forest machine-learning method originally published by Breiman [36]. The random forest method works by constructing an ensemble of decision trees; in practice, hundreds of trees are typically built, and the consensus of the trees is used to generate a prediction. The random forest method as implemented in the “randomForest” library for the R statistical software can be used to perform either classification or regression [37, 38]. The choice between classification and regression is made based on the type of the variable being predicted: if a categorical variable (e.g. “soluble” or “insoluble”) is being predicted, the method performs classification; if a real-valued variable is being predicted, the method performs regression. In this example, we will generate a regression model. Listing 3 provides R code that demonstrates the construction and testing of a regression model using the random forest method. The steps involved in training and testing the model are as follows.

1. Integrate the experimental data and molecular descriptors
2. Divide the data into training and test sets
3. Build a model from the training set
4. Use this model to predict the test set
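The four steps above can be sketched in Python using scikit-learn’s RandomForestRegressor, an analogue of the R “randomForest” package used in Listing 3. The descriptor matrix and Log S values below are random placeholders standing in for real data, so this shows only the shape of the workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))                        # placeholder descriptors
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=300)   # placeholder Log S values

# Steps 1-2: integrate data and randomly split into 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 3: build the model from the training set
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# Step 4: use the model to predict the test set
predicted = model.predict(X_test)
```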

Note that training and test sets are produced by randomly sampling the dataset. In practice, we would perform this sampling multiple times (typically 50–100) to get a better idea of the overall performance of the model. Some authors have proposed other strategies for constructing training and test sets. There are publications and commercial software packages that advocate clustering the data and selecting one training and one test compound from each cluster [39, 40]. In the opinion of this chapter’s author, this is a terrible idea. By sampling training and test sets in this fashion, one artificially inflates the apparent performance of the method and creates unrealistic expectations among those using the model.

Table 1.2 Statistics for Models Built from the Full Huuskonen Dataset and the Subset with Log S Between −6 and −3

                     Full Huuskonen Set    Realistic Subset
Training set size    891                   303
Test set size        383                   130
Pearson r            0.95                  0.75
Kendall tau          0.81                  0.56
RMS error            0.64                  0.52

FIGURE 1.3 A plot of experimental versus predicted Log S for the Huuskonen test set.

image

Now that we have built the random forest model, we can evaluate its performance on training and test sets generated from the Huuskonen dataset. Statistics for the test set performance are shown in the first column of Table 1.2. Figure 1.3 shows a plot of the predicted versus experimental Log S for the Huuskonen test set. The plot shows the performance of a model generated using the code in Listing 3. The model was trained using 70% of the compounds in the Huuskonen dataset and tested on the remaining 30%. At first glance, the performance appears quite good: the r for the test set is 0.95 and the RMS error is 0.64. These results are similar to the performance of literature methods on this dataset.

As an aside to those following along with the code in the listings, please note that your results will not exactly match those listed here. There is a stochastic component to the selection of training and test sets, as well as the construction of the random forest models. While your results probably won’t match exactly what is reported here, they should be similar.

As we mentioned earlier, the Huuskonen dataset spans a wide range of solubility values. The range observed in this dataset is much larger than what is typically seen in drug discovery projects. Let’s take a look at what would happen if we reduced this dataset to only those compounds with Log S between −6 and −3, a more typical range. Listing 4 demonstrates the selection of only those values within this range, and the subsequent construction of a model using a procedure identical to that described earlier for the full Huuskonen dataset. Statistics for model performance are shown in the second column of Table 1.2. In this case, the RMS error of the prediction for the subset, at 0.52, is lower than that obtained with the larger set. However, the r for the subset is now 0.75, lower than the value for the full Huuskonen set, which spans a larger dynamic range.

As we can see, the dynamic range of a dataset can have a large impact on the apparent correlation between experimental and predicted activity. The literature is replete with examples of apparently impressive correlations obtained on datasets that span an unrealistically wide range. Authors who generate predictive models for protein–ligand binding affinity often use datasets that span up to 12 orders of magnitude [8]. In reality, the binding affinities of compounds typically encountered in drug discovery programs may span only 5–6 orders of magnitude. When data within this typical range is considered, these apparent correlations decrease dramatically. In practice, the utility of models for predicting binding affinity is extremely limited.
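This effect is easy to demonstrate with a small simulation using made-up numbers: predictions with a fixed RMS error of 0.6 log units look far more impressive when the data spans 9 logs than when the same predictions are restricted to a realistic 3-log window:

```python
import numpy as np

rng = np.random.default_rng(0)
experimental = rng.uniform(-10, -1, size=2000)               # wide, 9-log range
predicted = experimental + rng.normal(scale=0.6, size=2000)  # fixed RMS error

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

full_r = r(experimental, predicted)  # close to 1 on the wide range

# Restrict to the realistic 3-log window discussed above
mask = (experimental >= -6) & (experimental <= -3)
subset_r = r(experimental[mask], predicted[mask])
# subset_r is substantially lower than full_r, even though the
# accuracy of the individual predictions has not changed at all.
```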

Unfortunately, many cheminformatics practitioners have become enamored with r values and linear plots of model performance. This has caused them, consciously or unconsciously, to choose datasets that provide an unrealistic view of model performance. When building a predictive model, one should consider the dynamic range of the data being used to build the model and how this range compares with the range of the data to be predicted.

1.8 EXPERIMENTAL ERROR AND MODEL PERFORMANCE

S