<p>Foreword xxi</p> <p>List of Contributors xxv</p> <p><b>1 Introduction 1</b><i><br />Thomas Engel and Johann Gasteiger</i></p> <p>1.1 The Rationale for the Books 1</p> <p>1.2 The Objectives of Chemoinformatics 2</p> <p>1.3 Learning in Chemoinformatics 4</p> <p>1.4 Outline of the Book 5</p> <p>1.5 The Scope of the Book 7</p> <p>1.6 Teaching Chemoinformatics 8</p> <p>References 8</p> <p><b>2 Principles of Molecular Representations 9<br /></b><i>Thomas Engel</i></p> <p>2.1 Introduction 9</p> <p>2.2 Chemical Nomenclature 11</p> <p>2.2.1 Non-systematic Nomenclature (Trivial Names) 11</p> <p>2.2.2 Systematic Nomenclature of Chemical Compounds 12</p> <p>2.2.3 Drawbacks of Chemical Nomenclature for Data Processing 12</p> <p>2.3 Chemical Notations 12</p> <p>2.3.1 Empirical Formulas of Inorganic and Organic Compounds 12</p> <p>2.3.2 Line Notations 14</p> <p>2.4 Mathematical Notations 14</p> <p>2.4.1 Introduction into Graph Theory 15</p> <p>2.4.2 Matrix Representations 18</p> <p>2.4.2.1 Adjacency Matrix 18</p> <p>2.4.2.2 Incidence Matrix 19</p> <p>2.4.2.3 Distance Matrix 20</p> <p>2.4.2.4 Bond Matrix 21</p> <p>2.4.2.5 Bond–Electron Matrix 21</p> <p>2.4.2.6 Summary on Matrix Representations 23</p> <p>2.4.3 Connection Table 23</p> <p>2.5 Specific Types of Chemical Structures 25</p> <p>2.5.1 General Concepts of Isomerism 25</p> <p>2.5.2 Tautomerism 26</p> <p>2.5.3 Markush Structures 27</p> <p>2.5.4 Beyond a Connection Table Representation 28</p> <p>2.5.4.1 Representation of Molecular Structures by Electron Systems 28</p> <p>2.6 Spatial Representation of Structures 31</p> <p>2.6.1 Representation of Configurational Isomers 32</p> <p>2.6.2 Chirality 33</p> <p>2.6.3 3D Coordinate Systems 36</p> <p>2.7 Molecular Surfaces 37</p> <p>Selected Reading 38</p> <p>References 39</p> <p><b>3 Computer Processing of Chemical Structure Information 43<br /></b><i>Thomas Engel</i></p> <p>3.1 Introduction 43</p> <p>3.2 Standard File Formats for Chemical Structure Information 44</p> <p>3.2.1 
SMILES 44</p> <p>3.2.1.1 Stereochemistry in SMILES 47</p> <p>3.2.1.2 Summary on SMILES 47</p> <p>3.2.2 SMARTS 47</p> <p>3.2.3 SYBYL Line Notation 48</p> <p>3.2.4 The International Chemical Identifier (InChI) and InChIKey 48</p> <p>3.2.5 XYZ Format 50</p> <p>3.2.6 Z-Matrix 51</p> <p>3.2.7 The Molfile Format Family 52</p> <p>3.2.7.1 Structure of a Molfile 53</p> <p>3.2.7.2 Stereochemistry in the Molfile 57</p> <p>3.2.7.3 Structure of an SDfile 57</p> <p>3.2.8 The PDB File Format 58</p> <p>3.2.8.1 Introduction/History 58</p> <p>3.2.8.2 General Description 58</p> <p>3.2.8.3 Analysis of a Sample PDB File 60</p> <p>3.2.9 Metadata Formats 65</p> <p>3.2.9.1 STAR-Based File Formats and Dictionaries 65</p> <p>3.2.9.2 CIF File Format 66</p> <p>3.2.9.3 mmCIF File Format 67</p> <p>3.2.9.4 CML 68</p> <p>3.2.9.5 CSRML 68</p> <p>3.2.10 Libraries for Handling Information in Structure File Formats 69</p> <p>3.3 Input and Output of Chemical Structures 70</p> <p>3.3.1 Molecule Editors 72</p> <p>3.3.2 Molecule Viewers 73</p> <p>3.4 Processing Constitutional Information 73</p> <p>3.4.1 Structure Isomers and Isomorphism 73</p> <p>3.4.2 Tautomerism 74</p> <p>3.4.3 Unambiguous and Biunique Representation by Canonicalization 76</p> <p>3.4.3.1 The Morgan Algorithm 77</p> <p>3.4.4 Ring Perception 79</p> <p>3.4.4.1 Introduction 79</p> <p>3.4.4.2 Graph Terminology 80</p> <p>3.4.4.3 Ring Perception Strategies 81</p> <p>3.5 Processing 3D Structure Information 86</p> <p>3.5.1 Detection and Specification of Chirality 86</p> <p>3.5.1.1 Detection of Chirality 87</p> <p>3.5.1.2 Specification of Chirality 87</p> <p>3.5.2 Automatic Generation of 3D Structures 90</p> <p>3.5.3 Automatic Generation of Ensemble of Conformations 94</p> <p>3.6 Visualization of Molecular Models 100</p> <p>3.6.1 Introduction 100</p> <p>3.6.2 Models of the 3D Structure 101</p> <p>3.6.2.1 Wire Frame and Capped Sticks Model 101</p> <p>3.6.2.2 Ball-and-Stick Model 101</p> <p>3.6.2.3 Space-Filling Model 102</p> <p>3.6.2.4 
Crystallographic Models 102</p> <p>3.6.3 Models of Biological Macromolecules 102</p> <p>3.6.4 Virtual Reality 103</p> <p>3.6.5 3D Printing 103</p> <p>3.7 Calculation of Molecular Surfaces 103</p> <p>3.7.1 Van der Waals Surface 104</p> <p>3.7.2 Connolly Surface 104</p> <p>3.7.3 Solvent-Accessible Surface 105</p> <p>3.7.4 Enzyme Cavity Surface (Union Surface) 106</p> <p>3.7.5 Isovalue-Based Electron Density Surface 106</p> <p>3.7.6 Experimentally Determined Surfaces 106</p> <p>3.7.7 Visualization of Molecular Surface Properties 107</p> <p>3.7.8 Property-based Isosurfaces 107</p> <p>3.7.8.1 Electrostatic Potentials 108</p> <p>3.7.8.2 Hydrogen Bonding Potential 108</p> <p>3.7.8.3 Polarizability and Hydrophobicity Potential 108</p> <p>3.7.8.4 Spin Density 108</p> <p>3.7.8.5 Vector Fields 108</p> <p>3.7.8.6 Volumetric Properties 108</p> <p>3.8 Chemoinformatic Toolkits and Workflow Environments 109</p> <p>Selected Reading 111</p> <p>References 111</p> <p><b>4 Representation of Chemical Reactions 121<br /></b><i>Oliver Sacher and Johann Gasteiger</i></p> <p>4.1 Introduction 121</p> <p>4.2 Reaction Equation 122</p> <p>4.3 Reaction Types 123</p> <p>4.4 Reaction Center and Reaction Mechanisms 125</p> <p>4.5 Chemical Reactivity 126</p> <p>4.5.1 Physicochemical Effects 126</p> <p>4.5.1.1 Charge Distribution 126</p> <p>4.5.1.2 Inductive Effect 127</p> <p>4.5.1.3 Resonance Effect 127</p> <p>4.5.1.4 Polarizability Effect 128</p> <p>4.5.1.5 Steric Effect 128</p> <p>4.5.1.6 Stereoelectronic Effects 128</p> <p>4.5.2 Simple Methods for Quantifying Chemical Reactivity 128</p> <p>4.5.2.1 Frontier Molecular Orbital Theory 128</p> <p>4.5.2.2 Linear Free Energy Relationships 130</p> <p>4.6 Learning from Reaction Information 132</p> <p>4.7 Building of Reaction Databases 133</p> <p>4.7.1 Contents 133</p> <p>4.7.2 Reaction Data Exchange Formats 134</p> <p>4.7.2.1 RXN/RDF format by MDL/Symyx 134</p> <p>4.7.2.2 Reaction SMILES/SMIRKS by Daylight Chemical Information Systems 134</p> <p>4.7.2.3 
Chemical Markup Language 135</p> <p>4.7.2.4 International Chemical Identifier for Reactions (RinChI) 135</p> <p>4.7.3 Input and Output of Reactions 135</p> <p>4.8 Reaction Center Perception 138</p> <p>4.9 Reaction Classification 139</p> <p>4.9.1 Model-Driven Approaches 139</p> <p>4.9.1.1 Ugi’s Scheme and Some Follow-Ups 140</p> <p>4.9.1.2 InfoChem’s Reaction Classification 143</p> <p>4.9.2 Data-Driven Approaches 145</p> <p>4.9.2.1 HORACE 145</p> <p>4.9.2.2 Reaction Landscapes 146</p> <p>4.10 Stereochemistry of Reactions 148</p> <p>4.11 Reaction Networks 149</p> <p>Selected Reading 151</p> <p>References 152</p> <p><b>5 The Data 155</b></p> <p>5.1 Introduction 155</p> <p>5.2 Data Types 156</p> <p>5.2.1 Numerical Data 157</p> <p>5.2.2 Molecular Structures 159</p> <p>5.2.3 Bit Vectors 160</p> <p>5.2.3.1 Hash Codes 160</p> <p>5.2.3.2 Structural Keys 162</p> <p>5.2.3.3 Fingerprints 163</p> <p>5.2.4 Chemical Reactions 164</p> <p>5.2.5 Molecular Spectra 165</p> <p>5.3 Storage and Manipulation of Data 169</p> <p>5.3.1 Experimental Data 169</p> <p>5.3.1.1 Types of Data on Properties 170</p> <p>5.3.1.2 Accuracy of the Data 170</p> <p>5.3.2 Data Storage and Exchange 171</p> <p>5.3.2.1 DAT File 171</p> <p>5.3.2.2 JCAMP-DX 171</p> <p>5.3.2.3 Predictive Model Markup Language (PMML) 172</p> <p>5.3.3 Real-World Data 173</p> <p>5.3.3.1 Data Complexity 173</p> <p>5.3.3.2 Outliers and Redundant Objects 174</p> <p>5.3.4 Data Transformation 175</p> <p>5.3.4.1 Fast Fourier Transformation 175</p> <p>5.3.4.2 Wavelet Transformation 175</p> <p>5.3.5 Preparation of Datasets for Building of Models and Validations of Their Quality 176</p> <p>5.4 Conclusions 177</p> <p>Selected Reading 178</p> <p>References 179</p> <p><b>6 Databases and Data Sources in Chemistry 185</b><i><br />Engelbert Zass and Thomas Engel</i></p> <p>6.1 Introduction 185</p> <p>6.2 Chemical Literature and Databases 186</p> <p>6.2.1 Classification of Chemical Literature 186</p> <p>6.2.2 The Origin of Chemical Databases 187</p> 
<p>6.2.3 Evolution of Database Systems and User Interfaces 187</p> <p>6.3 Major Chemical Database Systems 188</p> <p>6.3.1 SciFinder 188</p> <p>6.3.2 Reaxys 189</p> <p>6.3.3 SciFinder versus Reaxys 190</p> <p>6.4 Compound Databases 191</p> <p>6.4.1 2D Structures 191</p> <p>6.4.1.1 Searching Organic Compounds 192</p> <p>6.4.1.2 Searching Inorganic and Coordination Compounds 194</p> <p>6.4.2 Sequences of Biopolymers 195</p> <p>6.4.3 3D Structures 198</p> <p>6.4.4 Catalog Databases 200</p> <p>6.5 Databases with Properties of Compounds 200</p> <p>6.5.1 Physical Properties 201</p> <p>6.5.2 Thermodynamic and Thermochemical Data 202</p> <p>6.5.3 Spectra 204</p> <p>6.5.3.1 Spectroscopic Databases 205</p> <p>6.5.3.2 Compound Databases with Spectroscopic Information 205</p> <p>6.5.4 Biological, Environmental, and Safety Information Sources 206</p> <p>6.5.4.1 Biological Information 207</p> <p>6.5.4.2 Pharmaceutical and Medical Information 208</p> <p>6.5.4.3 Toxicity, Environmental, and Safety Information 209</p> <p>6.6 Reaction Databases 210</p> <p>6.6.1 Comprehensive Reaction Databases 210</p> <p>6.6.2 Synthetic Methodology Databases 212</p> <p>6.7 Bibliographic and Citation Databases 212</p> <p>6.7.1 Bibliographic Databases 213</p> <p>6.7.1.1 Special Bibliographic Databases 213</p> <p>6.7.1.2 Patent Bibliographic Databases 214</p> <p>6.7.1.3 Searching Bibliographic Databases 216</p> <p>6.7.1.4 Linking to Full Text 216</p> <p>6.7.2 Citation Databases 217</p> <p>6.7.2.1 General Citation Databases 218</p> <p>6.7.2.2 Patent Citation Databases 219</p> <p>6.8 Full-Text Databases 219</p> <p>6.8.1 Electronic Journals 219</p> <p>6.8.2 Patents 220</p> <p>6.8.3 Lexika and Encyclopedias 221</p> <p>6.9 Architecture of a Structure-Searchable Database 222</p> <p>Selected Reading 224</p> <p>References 224</p> <p><b>7 Searching Chemical Structures 231</b><i><br />Nikolay Kochev, Valentin Monev, and Ivan Bangov</i></p> <p>7.1 Introduction 231</p> <p>7.2 Full Structure Search 232</p> <p>7.3 
Substructure Search 235</p> <p>7.3.1 Basic Concepts 235</p> <p>7.3.2 Backtracking Algorithm 236</p> <p>7.3.3 Optimization of the Backtracking Algorithm 238</p> <p>7.3.4 Screening 239</p> <p>7.3.5 Superstructure Searching 241</p> <p>7.3.6 Automorphism Searching 241</p> <p>7.3.7 Maximum Common Substructure Searching 242</p> <p>7.3.8 Specific Line Notations for Substructure Searching 243</p> <p>7.3.9 Chemotypes for Database Searching 244</p> <p>7.4 Similarity Search 245</p> <p>7.4.1 Similarity Basics 245</p> <p>7.4.2 Similarity Measures 247</p> <p>7.4.3 Descriptor Selection and Coding 249</p> <p>7.4.4 Similarity Measures Based on Maximum Common Substructure 250</p> <p>7.5 Three-Dimensional Structure Search Methods 250</p> <p>7.5.1 Pharmacophore Searching 251</p> <p>7.5.2 3D Similarity Searching 252</p> <p>7.6 Sequence Searching in Protein and Nucleic Acid Databases 254</p> <p>7.6.1 Sequence Similarity Definition 255</p> <p>7.6.2 Dynamic Programming Algorithm 256</p> <p>7.6.3 Fast Sequence Searching in Large Databases 258</p> <p>7.7 Summary 259</p> <p>Selected Reading 261</p> <p>References 262</p> <p><b>8 Computational Chemistry 267</b></p> <p><b>8.1 Empirical Approaches to the Calculation of Properties 269</b><i><br />Johann Gasteiger</i></p> <p>8.1.1 Introduction 269</p> <p>8.1.2 Additivity of Atomic Contributions 269</p> <p>8.1.3 Attenuation Models 271</p> <p>8.1.3.1 Calculation of Charge Distribution 271</p> <p>8.1.3.2 Polarizability Effect 275</p> <p>Selected Reading 277</p> <p>References 277</p> <p><b>8.2 Molecular Mechanics 279<br /></b><i>Harald Lanig</i></p> <p>8.2.1 Introduction 279</p> <p>8.2.2 No Force Field Calculation without Atom Types 280</p> <p>8.2.3 The Functional Form of Common Force Fields 281</p> <p>8.2.3.1 Bond Stretching 282</p> <p>8.2.3.2 Angle Bending 283</p> <p>8.2.3.3 Torsional Terms 284</p> <p>8.2.3.4 Out-of-Plane Bending 285</p> <p>8.2.3.5 Electrostatic Interactions 286</p> <p>8.2.3.6 Van der Waals Interactions 287</p> <p>8.2.3.7 Cross 
Terms 289</p> <p>8.2.3.8 Advanced Interatomic Potentials and Future Development 290</p> <p>8.2.4 Available Force Fields 291</p> <p>8.2.4.1 Force Fields for Small Molecules 292</p> <p>8.2.4.2 Force Fields for Biomolecules 293</p> <p>Selected Readings 296</p> <p>References 296</p> <p><b>8.3 Molecular Dynamics 301<br /></b><i>Harald Lanig</i></p> <p>8.3.1 Introduction 301</p> <p>8.3.2 The Continuous Movement of Molecules 302</p> <p>8.3.3 Methods 302</p> <p>8.3.3.1 Algorithms 303</p> <p>8.3.3.2 Ways for Speeding up the Calculations 304</p> <p>8.3.3.3 Solvent Effects 305</p> <p>8.3.3.4 Periodic Boundary Conditions 308</p> <p>8.3.4 Constant Energy, Temperature, or Pressure? 308</p> <p>8.3.5 Long-Range Forces 310</p> <p>8.3.6 Application of Molecular Dynamics Techniques 311</p> <p>8.3.7 Future Perspectives 315</p> <p>Selected Readings 317</p> <p>References 317</p> <p><b>8.4 Quantum Mechanics 320</b><i><br />Tim Clark</i></p> <p>8.4.1 Hückel Molecular Orbital Theory 320</p> <p>8.4.2 Semiempirical MO Theory 324</p> <p>8.4.3 Ab Initio Molecular Orbital Theory 327</p> <p>8.4.4 Density Functional Theory 332</p> <p>8.4.5 Properties from Quantum Mechanical Calculations 334</p> <p>8.4.5.1 Net Atomic Charges 334</p> <p>8.4.5.2 Dipole and Higher Multipole Moments 335</p> <p>8.4.5.3 Polarizabilities 335</p> <p>8.4.5.4 Orbital Energies 336</p> <p>8.4.5.5 Surface Descriptors 336</p> <p>8.4.5.6 Local Ionization Potential 336</p> <p>8.4.6 Quantum Mechanical Techniques for Very Large Molecules 337</p> <p>8.4.6.1 Linear Scaling Methods 337</p> <p>8.4.6.2 Hybrid QM/MM Calculations 338</p> <p>8.4.7 The Future of Quantum Mechanical Methods in Chemoinformatics 338</p> <p>Selected Reading 340</p> <p>References 341</p> <p><b>9 Modeling and Prediction of Properties (QSPR/QSAR) 345<br /></b><i>Johann Gasteiger</i></p> <p><b>10 Calculation of Structure Descriptors 349<br /></b><i>Lothar Terfloth and Johann Gasteiger</i></p> <p>10.1 Introduction 349</p> <p>10.1.1 QSPR/QSAR Modeling 349</p> 
<p>10.1.2 Overview 349</p> <p>10.1.3 Classification of Compounds and Similarity Searching 350</p> <p>10.1.4 Definition of the Terms “Structure Descriptor” and “Molecular Descriptor” 351</p> <p>10.1.5 Classification of Structure Descriptors 351</p> <p>10.1.6 Structure Descriptors with a Fixed Length 351</p> <p>10.2 Structure Descriptors for Classification and Similarity Searching 352</p> <p>10.2.1 2D Structure Descriptors (Topological Descriptors) 352</p> <p>10.2.1.1 Structural Keys 352</p> <p>10.2.1.2 Fingerprints 353</p> <p>10.2.1.3 Distance and Similarity Measures 354</p> <p>10.2.1.4 Chemotypes: Data Mining for Compounds with Structural Features 356</p> <p>10.2.1.5 Multilevel Neighborhoods of Atoms 358</p> <p>10.2.1.6 Descriptors from Shannon Entropy Calculations 359</p> <p>10.2.1.7 Chemically Advanced Template Search (CATS2D) Descriptors 360</p> <p>10.2.1.8 Descriptors from Chemical Bond Information 360</p> <p>10.2.2 3D Descriptors 361</p> <p>10.2.2.1 Geometric Atom Pair Descriptors 361</p> <p>10.2.2.2 CATS3D and CHARGE3D 361</p> <p>10.2.2.3 Pharmacophores 362</p> <p>10.2.3 Field-Based Molecular Similarity 362</p> <p>10.2.3.1 Electron Density 362</p> <p>10.2.3.2 General Field-Based Similarity Indices 363</p> <p>10.3 Structure Descriptors for Quantitative Modeling 363</p> <p>10.3.1 0-D Molecular Descriptors 363</p> <p>10.3.2 1D Molecular Descriptors 363</p> <p>10.3.3 2D Molecular Descriptors (Topological Descriptors) 365</p> <p>10.3.3.1 Single-Valued Descriptors 365</p> <p>10.3.3.2 Topological Descriptors as Vectors 366</p> <p>10.3.4 3D Descriptors 369</p> <p>10.3.4.1 3D Structure Generation 369</p> <p>10.3.4.2 3D Autocorrelation Vector 370</p> <p>10.3.4.3 3D Molecule Representation of Structures Based on Electron Diffraction Code (3D MoRSE Code) 370</p> <p>10.3.4.4 Radial Distribution Function Code 371</p> <p>10.3.4.5 Other 3D Descriptors 375</p> <p>10.3.5 Chirality Descriptors 375</p> <p>10.3.5.1 Chirality Codes 376</p> <p>10.3.5.2 Conformation-Independent 
Chirality Code (CICC) 376</p> <p>10.3.5.3 Conformation-Dependent Chirality Code (CDCC) 377</p> <p>10.3.5.4 Descriptors of Molecular Shape and Molecular Surfaces 377</p> <p>10.3.5.5 Global Shape Descriptors 378</p> <p>10.3.5.6 Autocorrelation of Molecular Surface Properties 378</p> <p>10.3.5.7 2D Maps of Molecular Surfaces 379</p> <p>10.3.5.8 Charged Partial Surface Area 382</p> <p>10.3.6 Field-Based Methods 383</p> <p>10.3.6.1 Comparative Molecular Field Analysis (CoMFA) 383</p> <p>10.3.6.2 Comparative Molecular Similarity Analysis (CoMSIA) 384</p> <p>10.3.6.3 3D Molecular Interaction Fields 384</p> <p>10.3.7 Descriptors for an Ensemble of Conformations (4D Descriptors) 384</p> <p>10.3.7.1 4D-QSAR 384</p> <p>10.3.8 Quantum Chemical Descriptors 385</p> <p>10.4 Descriptors That Are Not Calculated from the Chemical Structure 385</p> <p>10.5 Summary and Outlook 387</p> <p>Selected Reading 390</p> <p>References 390</p> <p><b>11 Data Analysis and Data Handling (QSPR/QSAR) 397</b></p> <p><b>11.1 Methods for Multivariate Data Analysis 399<br /></b><i>Kurt Varmuza</i></p> <p>11.1.1 Introduction into Multivariate Data Analysis 399</p> <p>11.1.1.1 Aims 399</p> <p>11.1.1.2 Notation and Symbols 400</p> <p>11.1.2 Basics of Statistical Data Evaluation 401</p> <p>11.1.2.1 Data Distribution, Central Value, and Spread 401</p> <p>11.1.2.2 Correlation 404</p> <p>11.1.2.3 Discrimination 405</p> <p>11.1.3 Multivariate Data 406</p> <p>11.1.3.1 Overview 406</p> <p>11.1.3.2 Preprocessing 407</p> <p>11.1.3.3 Distances and Similarities 408</p> <p>11.1.3.4 Linear Latent Variables 410</p> <p>11.1.4 Evaluation of Empirical Models 412</p> <p>11.1.4.1 Overview 412</p> <p>11.1.4.2 Optimum Model Complexity 412</p> <p>11.1.4.3 Performance Criteria for Calibration Models 413</p> <p>11.1.4.4 Performance Criteria for Classification Models 414</p> <p>11.1.4.5 Cross-Validation 415</p> <p>11.1.4.6 Bootstrap 416</p> <p>11.1.5 Exploration: Analyzing the Independent Variables 417</p> <p>11.1.5.1 Overview 
417</p> <p>11.1.5.2 Principal Component Analysis (PCA) 417</p> <p>11.1.5.3 Nonlinear Mapping 419</p> <p>11.1.5.4 Cluster Analysis 419</p> <p>11.1.5.5 Example: Exploratory Data Analysis of Mass Spectra from Meteorite Samples 421</p> <p>11.1.6 Calibration: Building a Quantitative Model 423</p> <p>11.1.6.1 Overview 423</p> <p>11.1.6.2 Ordinary Least Squares (OLS) Regression 424</p> <p>11.1.6.3 Principal Component Regression (PCR) 424</p> <p>11.1.6.4 Partial Least Squares (PLS) Regression 425</p> <p>11.1.6.5 Variable Selection 426</p> <p>11.1.6.6 Example: Prediction of Gas Chromatographic Retention Indices for Polycyclic Aromatic Hydrocarbons 427</p> <p>11.1.7 Classification: Discriminating Samples 428</p> <p>11.1.7.1 Overview 428</p> <p>11.1.7.2 Linear Discriminant Analysis (LDA) 430</p> <p>11.1.7.3 Discriminant Partial Least Squares (D-PLS) Analysis 430</p> <p>11.1.7.4 k-Nearest Neighbor (KNN) Classification 430</p> <p>11.1.7.5 Support Vector Machine (SVM) 431</p> <p>11.1.7.6 Classification Trees (CART) 432</p> <p>11.1.7.7 Example: Classification of Meteorite Samples Using Mass Spectral Data 432</p> <p>Acknowledgements 434</p> <p>Selected Reading 435</p> <p>References 435</p> <p><b>11.2 Artificial Neural Networks (ANNs) 438<br /></b><i>Jure Zupan</i></p> <p>11.2.1 How to Learn a New Method? 438</p> <p>11.2.2 Multivariate Representation of Data 439</p> <p>11.2.3 Overview of Artificial Neural Networks (ANNs) 442</p> <p>11.2.4 Error Back-Propagation ANNs 443</p> <p>11.2.5 Kohonen and Counter-Propagation ANN 445</p> <p>11.2.6 Training of the ANN: Adapting the Weights 448</p> <p>11.2.7 Controlling Model Complexity and Optimizing Predictivity 450</p> <p>11.2.8 Few General Remarks about ANNs 450</p> <p>Selected Reading 451</p> <p>References 451</p> <p><b>11.3 Deep and Shallow Neural Networks 453<br /></b><i>David A. 
Winkler</i></p> <p>11.3.1 Drug Design in the Era of Big Data and Artificial Intelligence (AI) 453</p> <p>11.3.2 Deep Learning 454</p> <p>11.3.3 Controlling Model Complexity and Optimizing Predictivity Using Regularization 455</p> <p>11.3.4 Universal Approximation Theorem 458</p> <p>11.3.5 Do QSAR Models Generated by Neural Networks Meet the Requirements of the Universal Approximation Theorem? 458</p> <p>11.3.6 Comparison of the Performance of Deep and Shallow Regularized Neural Networks on Drug Datasets 459</p> <p>11.3.7 A Few General Remarks about Neural Networks for Drug Discovery 460</p> <p>Selected Reading 462</p> <p>References 462</p> <p><b>12 QSAR/QSPR Revisited 465<br /></b><i>Alexander Golbraikh and Alexander Tropsha</i></p> <p>12.1 Best Practices of QSAR Modeling 466</p> <p>12.1.1 Introduction 466</p> <p>12.1.2 Key Concepts 467</p> <p>12.1.3 Predictive QSAR Modeling Workflow 468</p> <p>12.1.4 Dataset Curation 469</p> <p>12.1.5 Modelability Studies 470</p> <p>12.1.6 Development of QSAR Models: Internal and External Validation 471</p> <p>12.1.7 Prediction Accuracy Criteria for QSAR Models for a Continuous Response Variable 472</p> <p>12.1.8 Prediction Accuracy Criteria for Category QSAR Models 473</p> <p>12.1.9 Time-Split Validation 475</p> <p>12.1.10 Validation by Y-Randomization 475</p> <p>12.1.11 Applicability Domain of QSAR Models 475</p> <p>12.1.11.1 Leverage AD for Regression QSAR Models 476</p> <p>12.1.11.2 Residual Standard Deviation (RSD) as AD 476</p> <p>12.1.11.3 Other Widely Used ADs 476</p> <p>12.1.12 Ensemble Modeling 478</p> <p>12.1.13 Model Interpretation: Structural Alerts 478</p> <p>12.1.14 Virtual Screening 479</p> <p>12.1.15 Conclusions 480</p> <p>12.2 The Data Science of QSAR Modeling 480</p> <p>12.2.1 Introduction 480</p> <p>12.2.2 Data Curation: Trust but Verify! 
482</p> <p>12.2.3 Models as Decision Support Tools 487</p> <p>12.2.4 Conclusions 487</p> <p>Selected Reading 489</p> <p>References 489</p> <p><b>13 Bioinformatics 497<br /></b><i>Heinrich Sticht</i></p> <p>13.1 Introduction 497</p> <p>13.2 Sequence Databases 499</p> <p>13.2.1 GenBank 499</p> <p>13.2.2 UniProt 501</p> <p>13.3 Searching Sequence Databases 502</p> <p>13.3.1 Tools for Sequence Database Searches 503</p> <p>13.3.2 Scoring Matrices 503</p> <p>13.3.3 Interpretation of the Results of a Database Search 507</p> <p>13.4 Characterization of Protein Families 509</p> <p>13.4.1 Multiple Sequence Alignment 509</p> <p>13.4.2 Sequence Signatures 512</p> <p>13.5 Homology Modeling 515</p> <p>Selected Reading 520</p> <p>References 520</p> <p><b>14 Future Directions 525<br /></b><i>Johann Gasteiger</i></p> <p>14.1 Access to Chemical Information 525</p> <p>14.2 Representation of Chemical Compounds 527</p> <p>14.3 Representation of Chemical Reactions 527</p> <p>14.4 Learning from Chemical Information 528</p> <p>14.5 Training in Chemoinformatics 529</p> <p>Answers Section 531</p> <p>Index 555</p>