Table of Contents

Cover

Title Page

Preface to Second Edition

Preface to First Edition

Acknowledgements

About the Companion Website

Chapter 1: Introduction

1.1 Historical Parentage
1.2 Developments since the 1970s
1.3 Software and Calculations
1.4 Further Reading
References

Chapter 2: Experimental Design

2.1 Introduction
2.2 Basic Principles
2.3 Factorial Designs
2.4 Central Composite or Response Surface Designs
2.5 Mixture Designs
2.6 Simplex Optimisation
Problems

Chapter 3: Signal Processing

3.1 Introduction
3.2 Basics
3.3 Linear Filters
3.4 Correlograms and Time Series Analysis
3.5 Fourier Transform Techniques
3.6 Additional Methods
Problems

Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition

4.1 Introduction
4.2 The Concept and Need for Principal Components Analysis
4.3 Principal Components Analysis: The Method
4.4 Factor Analysis
4.5 Graphical Representation of Scores and Loadings
4.6 Pre-processing
4.7 Comparing Multivariate Patterns
4.8 Unsupervised Pattern Recognition: Cluster Analysis
4.9 Multi-way Pattern Recognition
Problems

Chapter 5: Classification and Supervised Pattern Recognition

5.1 Introduction
5.2 Two-Class Classifiers
5.3 One-Class Classifiers
5.4 Multi-Class Classifiers
5.5 Optimisation and Validation
5.6 Significant Variables
Problems

Chapter 6: Calibration

6.1 Introduction
6.2 Univariate Calibration
6.3 Multiple Linear Regression
6.4 Principal Components Regression
6.5 Partial Least Squares Regression
6.6 Model Validation and Optimisation
Problems

Chapter 7: Evolutionary Multivariate Signals

7.1 Introduction
7.2 Exploratory Data Analysis and Pre-processing
7.3 Determining Composition
7.4 Resolution
Problems

Appendix

A.1 Vectors and Matrices
A.2 Algorithms
A.3 Basic Statistical Concepts
A.4 Excel for Chemometrics
A.5 Matlab for Chemometrics

Answers to the Multiple Choice Questions

Index

End User License Agreement

List of Illustrations

Chapter 2: Experimental Design

Figure 2.1 Yield of a reaction as a function of pH and catalyst concentration.
Figure 2.2 Cross-section through surface in Figure 2.1 at 2 mM catalyst concentration.
Figure 2.3 Cross-section through surface in Figure 2.1 at pH 3.4.
Figure 2.4 Choice of nine molecules based on two properties.
Figure 2.5 Graph of spectroscopic peak height against concentration at five concentrations.
Figure 2.6 Experiment with high instrumental errors.
Figure 2.7 Experiment with low instrumental errors.
Figure 2.8 Degree-of-freedom tree.
Figure 2.9 Graph of peak height against concentration for ANOVA example, data set A.
Figure 2.10 Graph of peak height against concentration for ANOVA example, data set B.
Figure 2.11 Design matrix.
Figure 2.12 Relationship between response, design matrix and coefficients.
Figure 2.13 Graph of estimated response versus pH at the central temperature and concentration of the design in Table 2.6.
Figure 2.14 Seven lines, equally spaced in area, dividing the normal distribution into eight regions, including six central regions and two extreme regions whose summed area equals those of the central regions.
Figure 2.15 Normal probability plot for data in Table 2.13 with significant factors marked.
Figure 2.16 Method of calculating equation for leverage term for the coefficient of $c02-math-011$: sum the shaded areas.
Figure 2.17 Graph of leverage for designs in Table 2.15, from top to bottom, designs A, B and C.
Figure 2.18 Two-factor design consisting of five experiments.
Figure 2.19 Two experimental arrangements together with the corresponding leverage for a linear model.
Figure 2.20 Graph of levels of one term against another in the design in Table 2.18.
Figure 2.21 Three- and four-level full factorial designs.
Figure 2.22 Representation of a three-factor, two-level design.
Figure 2.23 Fractional factorial design.
Figure 2.24 Poorly designed calibration experiment.
Figure 2.25 Well-designed calibration experiment.
Figure 2.26 Cyclic permuter.
Figure 2.27 Graph of factor levels for design in Table 2.29: top factors 1 versus 2, bottom factors 1 versus 7.
Figure 2.28 Elements of a central composite design: each axis represents a factor.
Figure 2.29 Degrees of freedom for central composite design.
Figure 2.30 Three-component mixture space.
Figure 2.31 Simplex in one, two and three dimensions.
Figure 2.32 Three-component simplex centroid design.
Figure 2.33 Three-component simplex lattice design.
Figure 2.34 Four situations encountered in constrained mixture designs. (a) Lower bounds defined, (b) upper bounds defined, (c) upper and lower bounds defined, fourth factor as filler and (d) upper and lower bounds defined.
Figure 2.35 Mixture design with process variables.
Figure 2.36 Initial experiments (a, b and c) on the edge of a simplex: two factors and the new conditions if experiment results in the worst response.
Figure 2.37 Progress of a fixed sized simplex.
Figure 2.38 Modified simplex; the original simplex is indicated in bold, with the responses ordered from 1 (worst) to 3 (best). The test conditions are indicated.

Chapter 3: Signal Processing

Figure 3.1 Main parameters that characterise a symmetric peak.
Figure 3.2 Main parameters that characterise an asymmetric peak.
Figure 3.3 Gaussian and Lorentzian peak shapes of equal half heights.
Figure 3.4 Asymmetric peak shapes often described by a Gaussian/Lorentzian model. (a) Tailing: left is Gaussian and right is Lorentzian. (b) Fronting: left is Lorentzian and right is Gaussian.
Figure 3.5 Three peaks forming a cluster.
Figure 3.6 Influence on the appearance of a peak as digital resolution is reduced corresponding to Table 3.1.
Figure 3.7 Examples of noise. From top to bottom: underlying signal, homoscedastic and heteroscedastic.
Figure 3.8 Selection of points to be used in a three-point moving average filter.
Figure 3.9 Filtering of data. (a) Raw data, (b) moving average filters, (c) quadratic/cubic Savitzky–Golay filters.
Figure 3.10 Comparison of moving average and running median smoothing.
Figure 3.11 A Gaussian together with its first and second derivative.
Figure 3.12 Two closely overlapping peaks together with their first and second derivatives.
Figure 3.13 From top to bottom, a three-point moving average, a Hanning window and a five-point Savitzky–Golay quadratic second-derivative window convolution functions.
Figure 3.14 A time series.
Figure 3.15 Auto-correlogram of the data in Figure 3.14.
Figure 3.16 Two time series (a,b) and their corresponding cross-correlogram (c).
Figure 3.17 Fourier transformation from a time domain to a frequency domain.
Figure 3.18 Typical time series consisting of several components.
Figure 3.19 Transformation of a real time series to real and imaginary pairs.
Figure 3.20 Fourier transform of a spike.
Figure 3.21 Absorption and dispersion line shapes.
Figure 3.22 Illustration of phase errors (time series (a–d) and real transform (e–h)).
Figure 3.23 A sparsely sampled time series sampled at the Nyquist frequency. Blue: underlying time series; red: observed time series if sparsely sampled.
Figure 3.24 Fourier transformation of a rapidly decaying time series.
Figure 3.25 Result of multiplying the time series in Figure 3.24 by a positive exponential, the transform of the original time series being represented by a dotted blue line.
Figure 3.26 Result of multiplying a noisy time series by a positive exponential and transforming the new signal.
Figure 3.27 Multiplying the data in Figure 3.25 by a double exponential.
Figure 3.28 Use of a double exponential filter.
Figure 3.29 Fourier self-deconvolution of a peak cluster.
Figure 3.30 Progress of the Kalman filter, showing the fitted and raw data.
Figure 3.31 Change in the three coefficients predicted by the Kalman filter with time.
Figure 3.32 Raw data and wavelet filtered data in Table 3.11.
Figure 3.33 Haar wavelets of levels 1 and 2 corresponding to data in Table 3.11.
Figure 3.34 Frequency distribution for the toss of a die.
Figure 3.35 Another, but less likely, frequency distribution for toss of a die.

Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition

Figure 4.1 Factor analysis in psychology.
Figure 4.2 Matrix representation of results from a metabolomics experiment.
Figure 4.3 Case study 1: chromatographic peak profiles, involving summing intensities of the data from Table 4.1 over all wavelengths.
Figure 4.4 Case study 2: superimposed NIR spectra corresponding to the data in Table 4.2.
Figure 4.5 Principal components analysis.
Figure 4.6 PCA as a form of variable reduction.
Figure 4.7 Graph of log of PRESS (top) and RSS (bottom) for data set in Table 4.8.
Figure 4.8 Relationship between PCA and factor analysis in coupled chromatography.
Figure 4.9 Plot of scores of PC2 versus PC1 for case study 2.
Figure 4.10 3D plot for the scores of case study 2.
Figure 4.11 1D plot of the scores of PCs 1 and 2 for case study 1.
Figure 4.12 Scores of PC2 (vertical axis) versus PC1 (horizontal axis) for case study 1.
Figure 4.13 Scores of the first two PCs of case study 1 versus sample number.
Figure 4.14 Scores of principal component 2 (vertical axis) versus principal component 1 (horizontal axis) for the standardised data of case study 3.
Figure 4.15 Loadings plot of PC2 (vertical axis) against PC1 (horizontal axis) for case study 1, with wavelengths indicated in nanometre.
Figure 4.16 Pure spectra of compounds in case study 1.
Figure 4.17 Loadings of PC2 versus PC1 for case study 2.
Figure 4.18 Loadings of the first two PCs against wavelength for case study 2.
Figure 4.19 Loadings of principal component 2 versus principal component 1 for the standardised data of case study 3.
Figure 4.20 Scores of the first two PCs of the data in Table 4.11, (a) raw data (b) log scaled data.
Figure 4.21 Scores of the first two PCs of the data in Table 4.12 (a) raw data (b) row scaled data to constant total.
Figure 4.22 PC scores plot of PC2 versus PC1 for raw data of Table 4.12.
Figure 4.23 PC scores plot of PC2 versus PC1 for data after centring of Table 4.12.
Figure 4.24 Scores plot of PC2 versus PC1 for case study 1 after centring.
Figure 4.25 Plot of the scores of the first two PCs of the standardised data in Table 4.15.
Figure 4.26 Biplot of scores of the first two PCs of case study 3.
Figure 4.27 (a) Euclidean and (b) Manhattan distances.
Figure 4.28 Dendrogram for cluster analysis example.
Figure 4.29 Two-way and three-way data.
Figure 4.30 Possible method of arranging environmental sampling data.
Figure 4.31 Tucker3 decomposition.
Figure 4.32 Parallel factor analysis (PARAFAC).
Figure 4.33 Unfolding.

Chapter 5: Classification and Supervised Pattern Recognition

Figure 5.1 Data set in Table 5.1.
Figure 5.2 Two-class classifiers. (a) Linearly separable classes. (b) Linearly inseparable classes.
Figure 5.3 A class distance plot. Top illustrates two classes, with their centroids marked by crosses. A sample is indicated, with its distances to the centroids of the blue and red classes. Bottom projects onto a class distance plot, with the specific sample noted.
Figure 5.4 Boundaries between groups A and B in Table 5.1, (a) EDC, (b) LDA and (c) QDA together with equidistant contours from the centroids for each criterion.
Figure 5.5 kNN boundaries for data set in Table 5.1; (a) k = 3 and (b) k = 5.
Figure 5.6 Appearance of kNN boundaries if the distance of a sample to itself is excluded for k = 3 and data in Table 5.1; sample 4 and its three neighbours marked.
Figure 5.7 One-class classifiers: (a) separable class; (b) classes with ambiguous and outlying samples.
Figure 5.8 Typical Gaussian density estimator for a data set characterised by two variables, with contour lines at different levels of certainty indicated.
Figure 5.9 QDA one-class boundaries using 90% confidence (p = 0.1) for data in Table 5.1.
Figure 5.10 Principles of disjoint PC models.
Figure 5.11 Class A disjoint model for PC1 for data set in Table 5.1, centred according to class A.
Figure 5.12 Multi-class classifiers.
Figure 5.13 PLS1 multi-class models.
Figure 5.14 Division of data into training and test sets.
Figure 5.15 Two different seating plans.
Figure 5.16 Dividing the data in Figure 5.15 into training and test sets.
Figure 5.17 Monte Carlo methods: bars represent frequency of results of several permutations for the %CC, whereas the red line represents the unpermuted data.
Figure 5.18 Typical bootstrap sampling.
Figure 5.19 Division of data into test set and bootstrap test set and a typical iterative approach.
Figure 5.20 Distribution of Heads if an unbiased coin is tossed 10 times.
Figure 5.21 PLS-DA scores and loadings of component 1 for the standardised data in Table 5.13.
Figure 5.22 Values of t for the 10 variables in Table 5.13.

Chapter 6: Calibration

Figure 6.1 Different notations for calibration and experimental design as used in this book.
Figure 6.2 Absorbance at 335 nm for the PAH case study plotted against concentration of pyrene.
Figure 6.3 Spectra of pure standards, digitised at 5 nm intervals, pyrene indicated in bold.
Figure 6.4 Difference between errors in (a) classical and (b) inverse calibration.
Figure 6.5 Best-fit straight lines for classical and inverse calibration, data for pyrene at 335 nm, no intercept, forcing the model through the origin.
Figure 6.6 Best-fit straight line using inverse calibration and an intercept term.
Figure 6.7 Predicted (vertical) versus known (horizontal) concentrations using methods of Section 6.2.3.
Figure 6.8 Absorbances of Pyr, Fluor, Benz and Ace between 330 and 345 nm.
Figure 6.9 Predicted versus known concentration of pyrene, using a four-component model and the wavelengths 330, 335, 340 and 345 nm (uncentred).
Figure 6.10 Spectra of the 10 PAHs estimated by MLR, with pyrene indicated in bold.
Figure 6.11 Root mean square errors of estimation of pyrene using uncentred PCR between 1 and 15 PCs.
Figure 6.12 Principles of PLS1.
Figure 6.13 Root mean square errors in x and c blocks, PLS1 centred and pyrene using between 1 and 15 PCs.
Figure 6.14 Residual errors in x and c blocks, PLS1 centred and acenaphthene.
Figure 6.15 Principles of PLS2.
Figure 6.16 Unfolding a data matrix.
Figure 6.17 Representation of tri-linear PLS1.
Figure 6.18 Matricisation in three-way calibration (x block only illustrated).
Figure 6.19 RMSEC auto-predictive errors for acenaphthylene using PLS1.
Figure 6.20 RMSECV for acenaphthylene using PLS1.
Figure 6.21 RMSEP using data in Table 6.1 as a training set and data in Table 6.20 as a test set, PLS1 (centred) and acenaphthylene.
Figure 6.22 RMSEP using data in Table 6.20 as a training set and data in Table 6.1 as a test set, PLS1 (centred) and acenaphthylene.

Chapter 7: Evolutionary Multivariate Signals

Figure 7.1 Sequential multivariate data matrix.
Figure 7.2 Three possible sequential patterns that would be treated identically using standard multivariate techniques.
Figure 7.3 Dividing data into regions before baseline correction.
Figure 7.4 Profile of data from data set A.
Figure 7.5 Scores and loadings plots of raw data from data set A for PC2 versus PC1.
Figure 7.6 Profile of data set B.
Figure 7.7 Scores and loadings plots of data set B from Table 7.2 for PC2 versus PC1.
Figure 7.8 Three-dimensional projections of (a) scores (top) and (b) loadings (bottom) for data set A.
Figure 7.9 Three-dimensional projections of (a) scores (top) and (b) loadings (bottom) for data set B.
Figure 7.10 Scores plots of data set A with each PC normalised.
Figure 7.11 Scores plots of data set A, each row summed to a constant total. (a) Entire data set, (b) expansion of region data points 5–19 and (c) performing the scaling and then PCA exclusively over points 5–19.
Figure 7.12 Scores plot of data set B with rows summed to a constant total between data points 5 and 20 and three main directions indicated. (a) Two PCs and (b) three PCs.
Figure 7.13 Scores and loadings after data set B has been standardised.
Figure 7.14 Intensity profile and unscaled scores and loadings from data set C in Table 7.3.
Figure 7.15 Scores and loadings after the data set C in Table 7.3 has been standardised.
Figure 7.16 Scores and loadings of the ranked data in Table 7.4.
Figure 7.17 Optimum size for variable reduction.
Figure 7.18 Different types of problems in chromatography.
Figure 7.19 Ratios of peak intensities for the case studies (a)–(d) assuming ideal peak shapes and peaks detectable over an indefinite region.
Figure 7.20 Regions of chromatogram (a) in Figure 7.18. Region a is where the ratio of the two components is between 50:1 and 1:50 and region b where the overall intensity is more than 1% of the maximum intensity.
Figure 7.21 Ratio of intensity of measurements D to F for data set A. (a) Raw information, (b) logarithmic scale between points 5 and 18 and (c) the minimum of the ratio of intensity D:F and F:D between points 5 and 18.
Figure 7.22 Intensities for wavelengths C and G using data of data set A summing the measurements at each successive point to constant total of 1.
Figure 7.23 Graph of correlation between successive points in the data of data set A.
Figure 7.24 Correlation between point 15 and the data of data set A.
Figure 7.25 Graph corresponding to that of Figure 7.23 for data set B in Table 7.2.
Figure 7.26 Forward and backward EFA plots of the first three eigenvalues from data set A.
Figure 7.27 Three-point FSW graph for data set A.
Figure 7.28 Derivative purity plot for data set A with purest points indicated.
Figure 7.29 Composition of regions in chromatogram deriving from data set A.
Figure 7.30 Profiles of variables C and F in Table 7.3.
Figure 7.31 Reconstructed profiles for data set A using MLR.
Figure 7.32 Profiles obtained as described in Section 7.4.1.3.
Figure 7.33 Profiles of three peaks obtained as in Section 7.4.2.

Appendix

Figure A.1 Changing to numeric cell addresses.
Figure A.2 The range A2:C3.
Figure A.3 The operation =AVERAGE(A1:B5,C8,B9:D11).
Figure A.4 Dragging a cell so that the reference is invariant.
Figure A.5 Naming a range.
Figure A.6 Matrix multiplication in Excel.
Figure A.7 Matrix transpose in Excel.
Figure A.8 Matrix inverse in Excel.
Figure A.9 Pseudo-inverse of a matrix.
Figure A.10 Correlation between two ranges.
Figure A.11 Finding the slope and intercept when fitting a linear model to two ranges.
Figure A.12 Use of IF in Excel.
Figure A.13 Finding the Analysis Toolpak.
Figure A.14 Data Analysis Add-in dialog box.
Figure A.15 Linear regression using the Excel Data Analysis Add-in.
Figure A.16 Generating random numbers in Excel.
Figure A.17 Adding an extra series in Excel.
Figure A.18 Finalised chart from Excel.
Figure A.19 Labelling a graph in Excel.
Figure A.20 Setup screen for the Excel chemometrics add-in.
Figure A.21 Selecting the Multivariate Analysis Add-in.
Figure A.22 Multivariate analysis dialog box.
Figure A.23 PCA dialog box.
Figure A.24 PCR dialog box.
Figure A.25 PLS dialog box.
Figure A.26 MLR dialog box.
Figure A.27 Default Matlab window.
Figure A.28 File and array listing in Matlab.
Figure A.29 Running an m file script in Matlab.
Figure A.30 Running an m file function in Matlab.
Figure A.31 Obtaining vectors from matrices.
Figure A.32 Simple matrix operations in Matlab.
Figure A.33 Calculating a pseudo-inverse in Matlab.
Figure A.34 Mean function in Matlab.
Figure A.35 Calculating standard deviations in Matlab: the second calculation is preferred for most chemometric calculations where the aim is to scale a matrix.
Figure A.36 Mean centring a matrix in Matlab.
Figure A.37 Importing from Excel to Matlab.
Figure A.38 A simple loop used for mean centring.
Figure A.39 Blank Figure window.
Figure A.40 Use of hold on.
Figure A.41 Use of multiple plot facility.
Figure A.42 Use of specifiers to change the properties of a graph in Matlab.
Figure A.43 Use of axis square statement to view correct angles between vectors.
Figure A.44 Matlab Property Editor.
Figure A.45 Use of text command in Matlab.
Figure A.46 A 3D scores plot.
Figure A.47 Using the rotation icon to obtain a better view.
Figure A.48 Changing the appearance of the 3D plot.
Figure A.49 Loadings plot with identical orientation to the scores plot, labelled and copied into Word.

List of Tables

Chapter 2: Experimental Design

Table 2.1 Three experimental designs
Table 2.2 Numerical information for data sets A and B
Table 2.3 Calculation of errors for data set A, model including intercept
Table 2.4 Error analysis for data sets A and B
Table 2.5 ANOVA table: two-parameter model, data set B
Table 2.6 Typical experimental design
Table 2.7 Design matrix for the experiment in Table 2.6 using the model discussed in Section 2.2.3.1
Table 2.8 The vectors b and ŷ for data in Table 2.6
Table 2.9 Coding of data
Table 2.10 Coded design matrix together with estimated values of coded coefficients
Table 2.11 Calculation of t-statistic
Table 2.12 F-ratio for experiment with low experimental error
Table 2.13 Normal probability calculation
Table 2.14 Leverage values for a two-factor design and a model of the form
Table 2.15 Leverage for three possible single-variable designs using a two-parameter linear model
Table 2.16 Coding of a simple two factor, two level design and corresponding responses
Table 2.17 Design matrix
Table 2.18 Four-factor, two-level full factorial design
Table 2.19 Correlated factors
Table 2.20 Full factorial designs corresponding to Figure 2.21
Table 2.21 Full factorial design for three factors together with the design matrix
Table 2.22 Fractional factorial design
Table 2.23 Confounding factor 5 with the product of factors 1–4
Table 2.24 Confounding interaction terms in design in Table 2.23
Table 2.25 Quarter factorial design
Table 2.26 A Plackett–Burman design for 11 factors, generator outlined by a box
Table 2.27 Generators for Plackett–Burman design, first row is at − level
Table 2.28 Equivalence of Plackett–Burman and fractional factorial designs for seven factors, the arrows showing how the rows are related
Table 2.30 Parameters for construction of a multi-level calibration design
Table 2.29 Development of a multi-level partial factorial design
Table 2.31 Construction of a central composite design
Table 2.32 Three possible two-factor central composite designs
Table 2.33 Position of the axial points for rotatability and orthogonality for central composite designs with varying number of replicates (one less than the number of central points)
Table 2.34 Three-component simplex centroid mixture design
Table 2.36 A {5,2} simplex centroid design
Table 2.37 Two-component simplex lattice design
Table 2.38 Number of experiments required for various simplex lattice designs, with different numbers of components and interactions
Table 2.39 Constrained mixture design with three lower bounds
Table 2.41 Example of simultaneous constraints in mixture designs
Table 2.42 Constrained mixture design where both upper and lower limits are known in advance

Chapter 3: Signal Processing

Table 3.1 Reducing digital resolution
Table 3.2 Stationary and moving average noise
Table 3.3 Savitzky–Golay coefficients c_i+j for smoothing
Table 3.4 Results of various filters on a data set
Table 3.5 A sequential process: illustration of moving average and median smoothing
Table 3.6 Savitzky–Golay coefficients for derivatives
Table 3.7 Data in Figure 3.14 together with the data lagged by five points in time
Table 3.8 Two time series, for which the cross-correlogram is presented in Figure 3.16
Table 3.9 Equivalence between parameters in the time domain and frequency domain
Table 3.10 Kalman filter calculation
Table 3.11 Numerical example for wavelet transform: left raw data, centre transformed data after level 1 wavelet and right after level 2 wavelet, without scaling
Table 3.12 Maximum entropy calculation for unbiased die, logarithms to the base 10
Table 3.13 Maximum entropy calculation for biased die

Chapter 4: Principal Component Analysis and Unsupervised Pattern Recognition

Table 4.1 Case study 1: a chromatogram recorded at 30 points in time and 28 wavelengths
Table 4.2 Case study 2: NIR spectra of 72 oils in AU recorded at 32 wavelengths, consisting of four groups A: corn oil, B: olive oil, C: safflower oil, D: corn margarine, after baseline correction and suitable pre-processing
Table 4.3 Case study 3: properties of some elements
Table 4.4 Scores and loadings for case study 1
Table 4.5 Eigenvalues for case study 1 (raw data)
Table 4.6 Eigenvalues for case study 3 (standardised data)
Table 4.7 Size of eigenvalues for case study 1 after column centring
Table 4.8 Cross-validation example
Table 4.9 Calculation of cross-validated error for sample 1
Table 4.10 Calculation of RSS and PRESS
Table 4.11 Example for logarithmic scaling; the first five samples belong to one group and the last five to a separate group
Table 4.12 Example for row scaling
Table 4.13 How the data in Table 4.12 were simulated as discussed in the text
Table 4.14 Example for mean centring
Table 4.15 Standardising the data of Table 4.11
Table 4.16 Example for cluster analysis
Table 4.17 Correlation matrix
Table 4.18 Euclidean distance matrix
Table 4.19 Manhattan distance matrix
Table 4.20 Nearest neighbour cluster analysis, using correlation coefficients for similarity measures, and data in Table 4.16

Chapter 5: Classification and Supervised Pattern Recognition

Table 5.1 Case study in Section 5.1.2: the data involve 20 samples in two classes (first 10 = class A, second 10 = class B) recorded using two variables
Table 5.2 Class distances for the data in Table 5.1 using EDC, LDA and QDA together with the predicted class memberships
Table 5.3 PLS-DA components of data in Table 5.1
Table 5.4 PLS-DA predictions of c for one-component and two-component models for the centred data in Table 5.1
Table 5.5 kNN for data in Table 5.1; the five nearest neighbours are listed and the assignments using k = 3 and k = 5
Table 5.6 QDA Mahalanobis distance to classes A and B for data in Table 5.1 together with the classification at a confidence limit of 90% (cut-off 2.146); shaded cells are outside the limits
Table 5.7 Class A model using one PC (centred) for SIMCA and data in Table 5.1
Table 5.8 Q and D one-PC class A models for data in Table 5.1
Table 5.9 Division into training and test set
Table 5.10 EDC model of data in Table 5.1 divided into training and test sets
Table 5.11 A simple contingency table
Table 5.12 A 2 × 2 contingency table
Table 5.13 Data set mentioned in Section 5.6

Chapter 6: Calibration

Table 6.1 Case study consisting of 25 spectra recorded at 27 wavelengths in nanometre, absorbances in AU
Table 6.2 Concentrations of the 10 PAHs in the data in Table 6.1
Table 6.3 Concentration of pyrene, absorbance at 335 nm and predictions of absorbance, using single-parameter classical calibration using method of Section 6.2.1
Table 6.4 Concentration of pyrene, absorbance at 335 nm and predictions of absorbance, using single-parameter inverse calibration using method of Section 6.2.2
Table 6.5 Matrices for four components
Table 6.6 Matrix B for Section 6.3.2
Table 6.7 Estimated concentration for four components as described in Section 6.3.2
Table 6.8 Estimated concentrations for the case study using uncentred MLR and all wavelengths
Table 6.9 Estimates for three PAHs using the full data set and MLR but including only three compounds in the model
Table 6.10 Scores of the first 10 PCs for PAH case study
Table 6.11 Vector r for pyrene
Table 6.12 Concentration estimates of the PAHs using PCR and 10 components (uncentred)
Table 6.13 Calculation of concentration estimates for pyrene using two PLS components
Table 6.14 Magnitudes of first 15 PLS1 components (centred data) for pyrene
Table 6.15 Concentration estimates of the PAHs using PLS1 and 10 components (centred)
Table 6.16 Concentration estimates of the PAHs using PLS2 and 10 components (centred)
Table 6.17 Three-way calibration data set
Table 6.18 Four methods of mean centring the data in Table 6.17, illustrated by the variable x_i,1,1 as discussed in Section 6.5.3.1
Table 6.19 Calculation of three tri-linear PLS1 components for the data in Table 6.17 and residuals for sample 1
Table 6.20 Independent test set

Chapter 7: Evolutionary Multivariate Signals

Table 7.1 Data set A
Table 7.2 Data set B
Table 7.3 Data set C
Table 7.4 Method for ranking variables using data set C
Table 7.5 Correlation coefficients for data set A between successive points (left-hand column) and between point 15 (right-hand column)
Table 7.6 Results of forward and backward EFA for the data set A
Table 7.7 Fixed sized window factor analysis applied to data set A using a three-point window
Table 7.8 Derivative calculation for determining purity of regions in data set A
Table 7.9 Estimated spectra obtained from the composition 1 regions in data set A
Table 7.10 Estimation of profiles using PCA for the data in Table 7.9
Table 7.11 Key steps in the calculation of rotation matrix for data set A using scores in composition 1 regions
Table 7.12 Determining spectrum and elution profiles of an embedded peak

Appendix

Table A.1 Cumulative standardised normal distribution
Table A.2 Critical values of χ²
Table A.3 Critical values of two-tailed t-distribution
Table A.4 One-tailed critical values of the F-distribution at 1% level
Table A.5 One-tailed critical values of the F-distribution at 5% level

The right of Richard G. Brereton to be identified as the author of this work has been asserted in accordance with law.

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Description: Second edition. | Hoboken, NJ : John Wiley & Sons, 2018. | Originally published in 2003 as: Chemometrics : data analysis for the laboratory and chemical plant. |

Identifiers: LCCN 2017054468 (print) | LCCN 2017059486 (ebook) | ISBN 9781118904688 (epub) | ISBN 9781118904671 (pdf) | ISBN 9781118904664 (pbk.)

Subjects: LCSH: Chemometrics-Data processing. | Chemical processes-Statistical methods-Data processing.

Classification: LCC QD75.4.C45 (ebook) | LCC QD75.4.C45 B74 2018 (print) | DDC 543.01/5195-dc23

Preface to Second Edition

The first edition of this book, with its special emphasis on numerical illustration of a wide range of chemometric methods, has been well received. Of particular importance were the problems at the end of each chapter that readers could work through in their own favourite environment, such as Excel or Matlab, but also R or Python or Fortran or any number of languages or computational packages if desired. I have performed calculations in both Matlab and Excel, but readers should not feel restricted if they have an alternative.

The reader of this book is likely to be an applied scientist or statistician who wishes to understand the basis and motivation of many of the main methods used in chemometrics.

Since the first edition, chemometrics has become much more widespread, including outside mainstream chemistry. In the early 2000s, the major applications were quantitative laboratory analytical science and chemical engineering including process control. Over the past few years, application areas have broadened, as large analytical laboratory-generated data sets become more widely available, for example, in metabolomics, heritage science and food science. This is reflected in a larger emphasis on pattern recognition in the second edition, including some practical case studies from metabolomics in the form of worked problem sets.

Despite this, many of the original building blocks of the subject remain unchanged. A factorial design and a principal component are still the same, so parts of the text involve only small changes from the first edition. Nevertheless, feedback from students and co-workers of mine, and also comments via the Internet, have provided valuable guidance as to what changes are desirable for a second edition. Important structural changes, such as multiple choice questions throughout the book and colour printing, update the original edition as a modern-day textbook.

Some major updates are as follows.

• Short multiple choice questions at the end of every section of the main text.
• Colour printing involving redrawing many figures.
• New chapter on supervised pattern recognition (classification) involving enhanced discussions of SIMCA, PLS-DA, LDA, QDA, EDC, kNN as well as validation.
• New case studies on NIR for distinguishing edible oils, and properties of elements, to illustrate unsupervised pattern recognition methods.
• New case studies in metabolomics, including Arabidopsis genotyping by MS, Raman of cancerous lymph nodes and NMR for diagnosing diabetes, as new problem sets.
• Additional description of MCR and ITTFA.
• New and expanded discussions of wavelets and of Bayesian methods in signal analysis.
• Updated description of Matlab R2016a under Windows 10, and Excel 2016 under Windows 10, in the context of the needs of the chemometrician.
• Enhanced discussion of the main statistical distributions.
• Enhanced discussions on validation and optimisation, including description of the bootstrap and of performance indicators.

To supplement this book, all data sets in this book, both from the main text and the problems at the end of each chapter, are downloadable. In addition, there is a downloadable Excel add-in to perform most of the common multivariate methods and a macro for labelling graphs. Matlab routines corresponding to many of the main methods are also available. The answers to the problems at the end of each chapter can also be found. These are available on the Wiley website associated with this book.

It is hoped that this text will be useful for students wishing to obtain a fundamental understanding of many chemometric methods. It will also be useful for any practicing chemometrician who needs to work through methods they may have only recently encountered, using numerical examples: as a researcher, when I encounter an unfamiliar approach, I usually like to reproduce numerical data from published case studies to check how it works before I am confident to use the method. For people encountering chemometrics for the first time, for example, in metabolomics and heritage science, this book presents many of the most widespread methods and so will serve as a good reference. And as a refresher, the multiple choice questions test the basic understanding. The worked case studies can be collected together and are helpful for courses.

Finally, I thank the publishers who have encouraged the development of this rather complex project, especially Jenny Cossham, through many stages and also colleagues who have provided data as listed in the acknowledgements.

Bristol, May 2017

Richard G. Brereton

Preface to First Edition

This book is the product of several years of my own activities. First and foremost, the task of educating graduate students in my research group from a large variety of backgrounds over the past 10 years has been a significant formative experience, and this has allowed me to develop a large series of problems which we set every 3 weeks and for which we present answers in seminars. From my experience, this is the best way to learn chemometrics! In addition, I have had the privilege to organise international quality courses mainly for industrialists with the participation of many representatives as tutors of the best organisations and institutes around the world, and I have learnt from them. Different approaches are normally taken while teaching industrialists, who may be encountering chemometrics for the first time in mid-career and have a limited period of a few days to attend a condensed course, and university students, who have several months or even years to practice and improve. However, it is hoped that this book represents a symbiosis of both needs.

In addition, it has been a great inspiration for me to write a regular fortnightly column for Chemweb (available to all registered users on www.chemweb.com) and some of the material in this book is based on articles first available in this format. Chemweb brings a large reader base to chemometrics, and feedback via e-mails or even travels around the world have helped me formulate my ideas. There is a very wide interest in this subject, but it is somewhat fragmented. For example, there is a strong group of Near Infrared Spectroscopists, primarily in the USA, who have led the application of advanced ideas in process monitoring and who see chemometrics as a quite technical, industrially oriented subject. There are other groups of mainstream chemists who see chemometrics as applicable to almost all branches of research, ranging from kinetics to titrations to synthesis optimisation. Satisfying all these diverse people is not an easy task.

This book relies mainly on numerical examples: many in the body of the text come from my favourite research interests, which are primarily in analytical chromatography and spectroscopy. To expand the text further would produce a huge book of twice the size, so I ask the indulgence of readers whose area of application differs. Certain chapters such as those on calibration could be approached from widely different viewpoints, but the methodological principles are the most important, and if you understand how the ideas can be applied in one area, you will be able to translate to your own favourite application. In the problems at the end of each chapter, I cover a wider range of applications to illustrate the broad basis of these methods. The emphasis of this book is on understanding ideas, which can then be applied to a wide variety of problems in chemistry, chemical engineering and allied disciplines.

It is difficult to select what material to include in this book without making it too long. Every expert I have shown this book to has made suggestions for new material. Some I have taken into account and I am most grateful for every proposal, and others I have mentioned briefly or not at all, mainly for the reason of length and also to ensure that this book sees the light of day rather than constantly expands without an end. There are many outstanding specialist books for the enthusiast. It is my experience, though, that if you understand the main principles (which are quite a few in number), and constantly apply them to a variety of problems, you will soon pick up the more advanced techniques, so it is the building blocks that are most important.

In a book of this nature, it is very difficult to decide on what detail is required for the various algorithms: some readers will have no real interest in the algorithms, whereas others will feel the text is incomplete without comprehensive descriptions. The main algorithms for common chemometric methods are presented in Appendix A.2. Step by step descriptions of methods, rather than algorithms, are presented in the text. A few approaches that will interest some readers, such as cross-validation in PLS, are described in the problems at the end of appropriate chapters, which supplement the text. It is expected that readers will approach this book with different levels of knowledge and expectations, so it is possible to gain a great deal without having an in-depth appreciation of computational algorithms, but for interested readers, the information is nevertheless available. People rarely read texts in a linear fashion; they often dip in and out of parts of it according to their background and aspirations, and chemometrics is a subject which people approach with very different previous knowledge and skills, so it is possible to gain from this book without covering every topic in full. Many readers will simply use add-ins or Matlab commands and be able to produce all the results in this text.

Chemometrics uses a very large variety of software. In this book, we recommend two main environments, Excel and Matlab. The examples have been tried using both environments, and you should be able to get the same answers in both cases. Users of this book will vary from people who simply want to plug the data into existing packages to those who are curious and want to reproduce the methods in their own favourite language such as Matlab, VBA or even C. In some cases, instructors may use the information available with this book to tailor examples for problem classes. Extra software supplements are available via the publishers' website www.SpectroscopyNOW.com, together with all the data sets in this book.

The problems at the end of each chapter form an important part of the text, the examples being a mixture of simulations (which have an important role in chemometrics) and real case studies from a wide variety of sources. For each problem, the relevant sections of the text that provide further information are referenced. However, a few problems build on the existing material and take the reader further: a good chemometrician should be able to use the basic building blocks to understand and use new methods. The problems are of various types; thus, not every reader will want to solve all the problems. In addition, instructors can use the data sets to construct workshops or course material that goes further than the book.

I am very grateful for the tremendous support I have had from many people when asking for information and help with data sets and permission where required. I thank Chemweb for agreement to present material modified from articles originally published in their e-zine, The Alchemist, and the RSC for permission to base the text of Chapter 5 on material originally published in the Analyst (125, 2125–2154 (2000)). A full list of acknowledgements for the data sets used in this text is presented after this foreword.

I thank Tom Thurston and Les Erskine for a superb job on the Excel add-in, and Hailin Shen for outstanding help in Matlab. Numerous people have tested the answers to the problems. Special mention should be given to Christian Airiau, Kostas Zissis, Tom Thurston, Conrad Bessant and Cevdet Demir for access to a comprehensive set of answers on disc for a large number of exercises so that I could check mine. In addition, several people have read chapters and made detailed comments, particularly checking numerical examples; in particular, I thank Hailin Shen for suggestions about improving Chapter 6 and Mohammed Wasim for careful checking for errors. In some ways, the best critics are the students and postdocs working with me because they are the people who have to read and understand a book of this nature, and it gives me great confidence that my co-workers in Bristol have found this approach useful and have been able to learn from the examples.

Finally, I thank the publishers for taking a germ of an idea and making valuable suggestions as to how this could be expanded and improved to produce what I hope is a successful textbook, and for having faith and patience over a protracted period.

Bristol, February 2002

Richard G. Brereton

Data Driven Extraction for Science