Table of Contents
Cover
Title Page
About the Authors
Preface
Why Did We Write This Book?
Structure and Contents
Acknowledgements
List of Abbreviations
Part I: Fundamentals and Basic Elements
1 From Signal Processing to Machine Learning
1.1 A New Science is Born: Signal Processing
1.2 From Analog to Digital Signal Processing
1.3 Digital Signal Processing Meets Machine Learning
1.4 Recent Machine Learning in Digital Signal Processing
2 Introduction to Digital Signal Processing
2.1 Outline of the Signal Processing Field
2.2 From Time–Frequency to Compressed Sensing
2.3 Multidimensional Signals and Systems
2.4 Spectral Analysis on Manifolds
2.5 Tutorials and Application Examples
2.6 Questions and Problems
3 Signal Processing Models
3.1 Introduction
3.2 Vector Spaces, Basis, and Signal Models
3.3 Digital Signal Processing Models
3.4 Tutorials and Application Examples
3.5 Questions and Problems
3.A MATLAB simpleInterp Toolbox Structure
4 Kernel Functions and Reproducing Kernel Hilbert Spaces
4.1 Introduction
4.2 Kernel Functions and Mappings
4.3 Kernel Properties
4.4 Constructing Kernel Functions
4.5 Complex Reproducing Kernel in Hilbert Spaces
4.6 Support Vector Machine Elements for Regression and Estimation
4.7 Tutorials and Application Examples
4.8 Concluding Remarks
4.9 Questions and Problems
Part II: Function Approximation and Adaptive Filtering
5 A Support Vector Machine Signal Estimation Framework
5.1 Introduction
5.2 A Framework for Support Vector Machine Signal Estimation
5.3 Primal Signal Models for Support Vector Machine Signal Processing
5.4 Tutorials and Application Examples
5.5 Questions and Problems
6 Reproducing Kernel Hilbert Space Models for Signal Processing
6.1 Introduction
6.2 Reproducing Kernel Hilbert Space Signal Models
6.3 Tutorials and Application Examples
6.4 Questions and Problems
7 Dual Signal Models for Signal Processing
7.1 Introduction
7.2 Dual Signal Model Elements
7.3 Dual Signal Model Instantiations
7.4 Tutorials and Application Examples
7.5 Questions and Problems
8 Advances in Kernel Regression and Function Approximation
8.1 Introduction
8.2 Kernel‐Based Regression Methods
8.3 Bayesian Nonparametric Kernel Regression Models
8.4 Tutorials and Application Examples
8.5 Concluding Remarks
8.6 Questions and Problems
9 Adaptive Kernel Learning for Signal Processing
9.1 Introduction
9.2 Linear Adaptive Filtering
9.3 Kernel Adaptive Filtering
9.4 Kernel Least Mean Squares
9.5 Kernel Recursive Least Squares
9.6 Explicit Recursivity for Adaptive Kernel Models
9.7 Online Sparsification with Kernels
9.8 Probabilistic Approaches to Kernel Adaptive Filtering
9.9 Further Reading
9.10 Tutorials and Application Examples
9.11 Questions and Problems
Part III: Classification, Detection, and Feature Extraction
10 Support Vector Machine and Kernel Classification Algorithms
10.1 Introduction
10.2 Support Vector Machine and Kernel Classifiers
10.3 Advances in Kernel‐Based Classification
10.4 Large‐Scale Support Vector Machines
10.5 Tutorials and Application Examples
10.6 Concluding Remarks
10.7 Questions and Problems
11 Clustering and Anomaly Detection with Kernels
11.1 Introduction
11.2 Kernel Clustering
11.3 Domain Description Via Support Vectors
11.4 Kernel Matched Subspace Detectors
11.5 Kernel Anomaly Change Detection
11.6 Hypothesis Testing with Kernels
11.7 Tutorials and Application Examples
11.8 Concluding Remarks
11.9 Questions and Problems
12 Kernel Feature Extraction in Signal Processing
12.1 Introduction
12.2 Multivariate Analysis in Reproducing Kernel Hilbert Spaces
12.3 Feature Extraction with Kernel Dependence Estimates
12.4 Extensions for Large‐Scale and Semi‐supervised Problems
12.5 Domain Adaptation with Kernels
12.6 Concluding Remarks
12.7 Questions and Problems
References
Index
End User License Agreement
List of Tables
Acknowledgements
Table 1 List of coauthors in specific chapters.
Chapter 3
Table 3.1 Common cost functions and their assumed noise models.
Table 3.2 Combination of cost functions and regularizers result in different forms of statistical inference models.
Table 3.3 S/E ratios for Gaussian noise.
Table 3.4 S/E ratios (mean ± std) for Gaussian.
Chapter 4
Table 4.1 Main kernel functions used in the literature.
Table 4.2 Results in the test set in the two datasets.
Chapter 5
Table 5.1 Scheme of the DSP‐SVM framework (I): equations of the time‐series models for signal estimation.
Table 5.2 Scheme of the DSP‐SVM framework (II): equations for signal models in antenna array processing.
Table 5.3 SUI‐3 channel model parameters for multipath fading.
Chapter 6
Table 6.1 ME, MSE, MAE, RMSE, and ρ of models in the test set.
Table 6.2 The nMSE for the kernel γ‐filters in nonlinear feedback system identification (NLSYS), Mackey–Glass time series prediction with Δ = 17 and Δ = 30, and EEG prediction. Bold and italics respectively indicate the best and second best results for each problem. The left side of the table includes the results of Casdagli and Eubank (1992) for comparison.
Table 6.3 Qualitative and quantitative prices (in pesetas) of coffee brands considered in the study.
Table 6.4 Significance for the inclusion of metric variables.
Chapter 7
Table 7.1 S/E ratios (mean ± std) for Gaussian and sinc kernels.
Table 7.2 S/E ratios (mean ± std) for Gaussian.
Table 7.3 List of algorithms benchmarked in the experiments.
Table 7.4 Mean S/E and std (in parentheses) with SNR for a band‐pass signal interpolation, T = 0.5 s, L = 32, and u = T/10.
Table 7.5 Mean S/E and std (in parentheses) with the number of samples L, for a band‐pass signal interpolation, and u = T/10.
Table 7.6 Parameter values of color‐Doppler transmitral flow model (a_i in cm/s, t_i in s).
Table 7.7 Synthetic image. MSE in CDMMI approximation considering different kernels.
Table 7.8 Synthetic image. MSE in CDMMI approximation considering different sample selection criteria.
Table 7.9 Real image. MSE in CDMMI approximation considering different sample selection criteria.
Table 7.10 Average and standard deviation MSE between the electrical feature provided by the CNS and the interpolated feature with LI, TPS, and ν‐SVR methods using 10 bipolar voltage EAMs and 10 activation time EAMs.
Chapter 8
Table 8.1 Predictive distribution for the low‐rank approximation methods described in Section 8.3.1, including the computational complexity for training, predictive mean, and predictive variance. N is the number of samples, M is the number of latent inducing variables, and B = N/M is the number of blocks for methods that use them. We denote Q_{a,b} ≡ K_{a,u}K_{u,b}.
Table 8.2 ME, MSE, and correlation coefficient ρ of models in the validation set. Recognition rates (RRs) with admissible error of 0.5 and 0.25 g/dL are also given.
Table 8.3 Results for the estimation of the daily solar irradiation with linear and nonlinear regression models. Subscript Method_t indicates that the method includes time as an input variable. Best results are highlighted in bold, and the second best in italics.
Table 8.4 Bias (ME), accuracy (RMSE, MAE), and fitness (Pearson's ρ) for different rates of training samples using both raw and empirically transformed observation variables.
Table 8.5 Results for the MG time series prediction problem.
Chapter 9
Table 9.1 Parameters used for predicting the respiratory motion trace, MSE result for three‐step ahead prediction, and measured training time.
Table 9.2 Optimal hyperparameters found for the KIN40K regression problem.
Table 9.3 Additional parameters used in the KIN40K regression experiment, final dictionary size, and measured training time.
Table 9.4 Parameters used in the Mackey–Glass time‐series prediction. A Gaussian kernel with σ = 1 was used.
Table 9.5 Computational complexity at iteration t.
Chapter 10
Table 10.1 Code for a three‐class problem (Dietterich and Bakiri, 1995).
Table 10.2 Pruned code for a three‐class problem.
Table 10.3 Mean and standard deviation of estimated κ statistic, precision, recall, F‐measure, and rate of support vectors for the 10 realizations. Best results are in bold.
Table 10.4 Computational and memory costs for different approximate kernel methods in problems with d dimensions, D features, and N samples.
Table 10.5 Classification error rate for the BCI dataset for different methods (see details in Flamary et al. (2012)) and filter length τ. Results are given for the linear model (top) and for the nonlinear model (bottom). The three best methods are highlighted in bold.
Chapter 11
Table 11.1 A family of ACD algorithms (HACD: hyperbolic ACD).
Table 11.2 The coefficient of accuracy κ and the overall accuracy in parentheses (percent) obtained for the synthetic problems.
Chapter 12
Table 12.1 Summary of linear and KMVA methods. For all methods treated in this section, we state the objective to maximize, the constraints for the optimization, the maximum number of features n_p that can be extracted, and how the data are projected onto the extracted features.
Table 12.2 Relation between dimensionality reduction methods and HSIC, HSIC := Tr(HK_x HK_y), where all techniques use the orthogonality constraint Y^T Y = I.
Table 12.3 Main properties of (K)MVA methods. Computational complexity and implementation issues are categorized for the considered dense and sparse methods in terms of the free parameters, number of kernel evaluations (KEs) during training, and storage requirements. Notation: N refers to the size of the labeled dataset, while d and m are the dimensions of the input and output spaces respectively.
Table 12.4 Properties of manifold alignment and domain adaptation methods.
Table 12.5 Accuracy in the visual object recognition study (C: Caltech, A: Amazon, D: DSLR, W: Webcam). 1‐NN classification testing on all samples from the target domain.
List of Illustrations
Chapter 2
Figure 2.1 Icosahedron mesh (a) and its vertex and face matrices (b).
Figure 2.2 Angles in the cotangent weights.
Figure 2.3 Example of representation of real signals. The first panel shows a Hanning pulse. The second panel shows the discrete‐time FT of the pulse. In the third panel, a binary modulation is shown, consisting of a train of pulses multiplied by + 1 or − 1 to represent bits 1 and 0. The fourth panel shows the real (left) and imaginary (right) parts of the FT of the signal in the third panel. Since this signal is real, its transform is Hermitian.
Figure 2.4 Example of modulation and demodulation of the signal of Figure 2.3, panel 3. The first panel shows the modulation of the signal using a sinusoid function. The corresponding FT can be seen in the second panel. Since the signal is real, its FT is Hermitian. In the third panel, the negative frequency components of the FT are removed, resulting in a complex version of the signal in time. The pulse train is recovered by simply shifting the signal in frequency toward the origin, which produces a Hermitian FT, and hence a real signal in time (fourth and fifth panels).
Figure 2.5 Example of single side‐band modulation and demodulation of the signal of Figure 2.3, third panel. The original signal is in the first panel. Subsequent panels show the FT of the signal with the negative part removed. This results in a complex signal whose real part is identical to the original one (third panel), but which has an imaginary part (shown in fourth panel).
Figure 2.6 Example of modulation and demodulation of the signal of Figure 2.3, panel 3. The first panel shows the modulation, which contains a shifted version of one side of the base‐band frequency components of the pulse train (second panel). The demodulation consists of discarding the negative frequency components (third panel) plus a frequency shifting back to the origin (fourth panel). The real part of the signal is identical to the original pulse train.
Figure 2.7 Convolution of signal x[n] with system h[n]. The impulse response of the system h[k] is shown in the upper panel. The middle panel shows x[n − k] for different values of n. The lower panel shows the result of the convolution. The circles show the values of the convolution for the values of n shown in the middle panel.
Figure 2.8 FT of a pulse with a length of 256 samples and a width of five samples.
Figure 2.9 Examples for filtering (I). (a) Demo for Fourier series on a cardiac and near‐to‐periodic signal. (b–d) Demo for taking the infinite limit on a periodic signal and its effect on the Fourier coefficients, leading to the FT. (e,f) Simple examples for low‐pass and high‐pass filtering on a pulse signal.
Figure 2.10 Examples for filtering (II). (a) Simple example for band‐pass filtering on a pulse signal. (b,c) Examples for filtering intracardiac recorded signals, cardiac signal, and its FT. (d–f) Filtered signal, frequency response of the filter, and resulting spectrum for cutoff frequency f_c = 300 Hz. (g–l) Same as previous for f_c = 100 Hz (g–i) and f_c = 75 Hz (j–l).
Figure 2.11 Welch spectrum of a signal containing components at frequencies 0.2π, 0.7π, and 0.8π radians, using windows of 16, 32, and 64 samples.
Figure 2.12 MVDR spectrum of a signal containing components at frequencies 0.2π, 0.7π, and 0.8π radians, using windows of 16, 32, and 64 samples.
Figure 2.13 MUSIC spectrum of a signal containing components at frequencies 0.2π, 0.7π, and 0.8π radians, using windows of 16, 32, and 64 samples.
Figure 2.14 Time representation of the analyzed signal.
Figure 2.15 Upper panel: spectrogram of the signal with a rectangular windowing. Lower panel: spectrogram with a triangular window.
Figure 2.16 Upper panel: spectrogram of the signal with a Hamming windowing. Lower panel: spectrogram with a Blackman window.
Figure 2.17 Upper left panel: sources of the ICA experiment. Upper right panel: observations at the four sensors. Lower left panel: result of the principal component analysis (PCA) procedure. Lower right panel: result of the source separation using ICA.
Figure 2.18 Full, low‐only and high‐only frequency reconstructions of a low‐frequency sinusoid signal contaminated with high‐frequency noise using DWT and inverse DWT.
Figure 2.19 Illustration of the use of wavelet‐based image decomposition (left) and the reconstruction error when the horizontal component is removed (right).
Figure 2.20 Illustration of the use of wavelet decomposition for image fusion.
Figure 2.21 Eigenvectors (basis) of the tear‐shaped mesh graph Laplacian projected on the surface of this mesh.
Figure 2.22 Reconstruction of the tear‐shaped mesh from the eigendecomposition of the graph Laplacian.
Figure 2.23 The green mesh is the original tear‐shaped mesh, and the red mesh is the one modified by increasing (a) or decreasing (b) the fourth spectral component.
Figure 2.24 Left atrial mesh and its noisy version.
Figure 2.25 Reconstruction process of the noisy left atrial mesh.
Figure 2.26 MAE for the mesh reconstruction using the graph Laplacian and the geometric Laplacian.
Chapter 3
Figure 3.1 Examples of pdfs for widely used noise models.
Figure 3.2 Standard cost functions (losses) in signal processing and in machine learning.
Figure 3.3 Geometrical interpretation of the most commonly used regularizers. The isocurves for different values of q = {4, 2, 1, 0.5, 0.1} in the 2D space. Note that for the ℓ0‐norm, the respective values cover the two axes with the exception of the point (0, 0). For the ℓ1‐norm the isovalue curve is a rhombus, and for the ℓ2 (Euclidean) norm it is a circle.
Figure 3.4 The γ structure. The γ filter can be regarded as a cascade of IIR filters where loops are kept local and loop gain is constant.
Figure 3.5 Array of K antennas and L transmitters.
Figure 3.6 Illustration of different noise sources in a common standard image: Gaussian white noise, JPEG and JPEG2000 quantization noise, vertical striping acquisition noise, a simulated infrared imaging system (IRIS) noise as a mixture of uncorrelated Gaussian noise with paired salt‐and‐pepper noise and lines interleaving, and WiFi fast‐fading transmission noise.
Figure 3.7 Sinc pulse with a period of 1 ms.
Figure 3.8 Upper panel: sequence of binary symbols. Lower panel: train of sinc pulses modulated by the symbol sequence of the upper panel.
Figure 3.9 Detail of a train of pulses modulated by five binary symbols. The signals represented with thin lines in the lower panel are each one of the pulses whose linear combination is the modulated signal, represented in the upper panel with a thick line. Each circle corresponds to the amplitude of the signal at sampling instants nT.
Figure 3.10 Eye diagram of a pulse train modulated by a binary sequence of symbols.
Figure 3.11 Representation of the in‐phase or quadrature‐phase eye diagram of a 64 QAM. Upper panel shows the noise‐free diagram, where it can be seen that all symbols cross a specific level during the sampling instant. Lower panel shows the same signal, but corrupted by a white noise of power W.
Figure 3.12 Constellation of a 64 QAM. Upper panel shows the noise‐free constellation, where it can be seen that all symbols can be detected without error. Lower panel shows the constellation of the QAM, corrupted by a white noise of power W. Here, the signal detection is subject to an error rate that increases with the noise power.
Figure 3.13 Different kinds of noise in ECG recorded during Holter monitoring, from top to bottom: baseline wander; probable lack of electrode contact; a step in the middle; some other unknown source of distortion, affecting the ECG morphology; strong quantization noise; and weak quantization noise. Time in seconds from short segments of long‐term monitoring recordings.
Figure 3.14 Muscle noise and its overlapping with the ECG spectrum. (a) Clean ECG, muscle noise, and noisy ECG. (b) Spectrum for each of the aforementioned signals.
Figure 3.15 Muscle noise and its overlapping with the ECG spectrum. ECG noise filtering (upper panel) and the residuals after filtering (lower panel), for (a) f_s = 80 Hz and (b) f_s = 50 Hz.
Figure 3.16 Comparison between the original and the estimated systems in terms of the impulse response (top), the magnitude response (middle), and the phase response (bottom).
Figure 3.17 Comparison between the original and the estimated systems in terms of the impulse response (top), the magnitude response (middle), and the phase response (bottom).
Figure 3.18 Amplitude‐modulated sinusoid processed by a linear system and by a Volterra plant identification system. The first panel shows the system output (dots) and the output simulated through linear identification (continuous line), which differs markedly from the original due to its inability to identify the nonlinearities. The next panels show the approximation with orders P = 3, 5, and 7.
Figure 3.19 Examples for LS solution of the sinusoidal signal model: (a) training set of samples and solution for regularized inversion matrix; (b) estimated amplitudes and phases; (c,d) the same for a more reduced set of frequency hypothesis.
Figure 3.20 Examples for LS solution of the HRV estimation with the sinusoidal signal model. (a) Original signal in terms of tachogram, of instantaneous cycle, and without ectopic beats. (b) Original and LS reconstructed. (c) Estimated spectrum with FT, and estimated amplitudes with LS.
Figure 3.21 Examples of interpolation in the time domain. The second row shows the coefficients obtained to construct the kernel of each interpolation algorithm. The third and fourth rows display the kernel of each algorithm in the time and frequency domains respectively for Gaussian noise (SNR = 20 dB, L = 32 samples).
Figure 3.22 An example of sparse deconvolution results versus data with SNR = 10 dB for the ℓ1‐penalized algorithm (a) and the GM algorithm (b), with time in samples.
Figure 3.23 Experimental setup for the simulation and resulting array beams (training with 100 and 1000 samples).
Chapter 4
Figure 4.1 Different machine learning algorithms are available to tackle particular machine learning and signal‐processing problems.
Figure 4.2 Illustration of the kernel feature mapping, ϕ, and the kernel trick, by which we can measure similarities in the feature space using a reproducing kernel that works solely with input examples from the input space.
Figure 4.3 Nonlinear problem solved using two 6th‐order polynomials. Circles represent points generated by the function (solid line) plus i.i.d. Gaussian noise. The dashed line corresponds to a regularized solution, while the dotted line corresponds to an unregularized solution (λ = 0).
Figure 4.4 SVR signal model and cost functions. Samples in the input space are mapped onto an RKHS, and a linear regression is performed therein. All samples outside a fixed tube of size ε are penalized, and they are the support vectors (double circles). Penalization is given by (a) the Vapnik ε‐insensitive or (b) the ε‐Huber cost functions.
Figure 4.5 Example of kernel matrix: (a) the three sets; (b) generated linear kernel.
Figure 4.6 Illustration of the operation of alignment for the selection of hyperparameters.
Figure 4.7 Estimated symbols from a burst of 1000 training data. Upper panel: the linear equalizer estimation; lower panel: the complex kernel equalizer estimation.
Figure 4.8 Histograms of residuals obtained with the ε‐Huber‐SVR model for (a) MERIS and (b) SeaBAM datasets.
Figure 4.9 RMSE in the test SeaBAM dataset as a function of the number of training samples.
Chapter 5
Figure 5.1 A schematic overview of the SVM framework for signal estimation.
Figure 5.2 Time‐transversal vectors used for PSM in the statement of linear SVM‐DSP algorithms.
Figure 5.3 Signal model for SVM‐DSP sparse deconvolution. (a) Block diagram, elements, and signals in the PSM. (b) Convolutional model for sinc interpolation using the DSM problem statement, where the impulse response of the convolutional model is a valid Mercer kernel. (c) Convolutional model for sparse deconvolution, for an arbitrary impulse response, where the Mercer kernel is the autocorrelation of the impulse response signal.
Figure 5.4 Performance for the system identification by the LS γ‐filter and the SVM γ‐filter for different orders P. The optimal μ parameter is indicated (∘, □, △).
Figure 5.5 Evolution of IMSE for different power of outliers for (a) an AR(3) process and (b) an ARMA(4,4) process.
Figure 5.6 BER performance of the SVM (continuous) and regular LS (dashed) beamformers.
Figure 5.7 Performance of OFDM‐SVM versus LS. BER and MSE performance as a function of SIR.
Figure 5.8 Performance of OFDM‐SVM with free parameters, compared with LS in terms of Δ log(BER), for SIR versus C (up), and for SIR versus ε (down).
Chapter 6
Figure 6.1 In SVM estimation problems, a nonlinear relationship between data points in the input space is transformed into a linear relationship between the mapped data in the (higher dimensional) RKHS. Different signal model equations can be used in the RKHS, as long as the problem statement is expressed in terms of dot products.
Figure 6.2 System that generates the input–output signals to be modeled in the SVM‐DSP nonlinear system identification example.
Figure 6.3 Machine complexity (SVs (%)), tap delays (P), memory requirements (μ), and memory depth (M = P/μ) for all kernel methods and problems.
Figure 6.4 Example of double path calculation in the telecontrol network model with a previously developed Bayesian network.
Figure 6.5 Surrogate model for link failure, and conditional components of the event model used.
Figure 6.6 Simulations with surrogate data including null events into the training and test set (a, b) and without including them (c, d), after 10 independent realizations. (a, c) Box plot of MAE_n (observed error in n years). (b, d) Box plot of MAE (asymptotic error, computed using Equation 6.91).
Figure 6.7 Histograms of the log(MAE_2) in the validation sets for each SVM algorithm in the real case study.
Figure 6.8 Quality score ρ for SVM‐SR and for NW‐SR in the simulation example as a function of the number of input features M.
Figure 6.9 Examples of results for deal effect curve analysis, showing model fitting (a), CI for parametric variables and intercept (b), and self‐effect of discounts (c) for brands #1 (upper row) and #2 (middle row). Cross‐item effects are shown for brand #1 on the brand #3 model (a) and for brand #5 on the brand #4 model. Confidence bands for averaged output are shown for brand #2 (detail).
Figure 6.10 BER performance as a function of RBF kernel parameter δ, of the TR (squares) and the SR (circles), in an array of seven (top) and five (bottom) elements and with three interfering signals. The continuous line corresponds to the performance of the linear algorithms.
Figure 6.11 BER performance as a function of thermal noise power for linear algorithms, SVM SVM‐TR, SVM‐SR, Kernel‐TR, and Kernel‐SR.
Chapter 7
Figure 7.1 CESNI for the SVM interpolation algorithm. The interpolated signal ẑ(t) is built by filtering the continuous‐time sampled version of the Lagrange coefficients β(t) with the SVM kernel K(t).
Figure 7.2 Illustration of the spectral adaptation of the kernels to the observations for a band‐pass signal.
Figure 7.3 MSE of the AKSM algorithm as a function of free parameters γ and C in the ε‐Huber cost function, with SNR = 4 dB.
Figure 7.4 An example of spurious peak detection with SNR = 9 dB for (a) the ℓ1‐penalized algorithm, (b) the GM algorithm, and (c) the AKSM algorithm.
Figure 7.5 Example of real data application: sparse deconvolution of the impulse response in an ultrasound transducer for the analysis of a layered composite material.
Figure 7.6 Example of the spectra of the original and reconstructed signals in the time (top) and frequency (bottom) domains for the MSSF function. T = 0.5 s, L = 32, SNR = 10 dB, and u = T/10. The middle panels show the reconstruction error.
Figure 7.7 Spectrally adapted Mercer kernels. (a) Spectrum for the MSSF, DMGF, and modulated sinc kernel. (b) S/E ratio for SVM‐CorrId (solid lines) and SVM‐ModSinc (dashed lines) algorithms when interpolating MSSF (circles) and DMGF (triangles). T = 0.5 s, L = 32, u = T/10.
Figure 7.8 Autocorrelation kernels in time and frequency for the HRV segments of patients D and H.
Figure 7.9 Examples of HRV segments of patients D (a) and H (b). Two details of the HRV spectrum for patient H are shown in the lower plots.
Figure 7.10 Synthetic (up) and real (down) CDMM images used in this example.
Figure 7.11 Color Doppler M‐mode transmitral flow velocity recording for a healthy volunteer (a) and residuals using different sample selection criteria: (b) random; (c) based on edges; (d) based on amplitudes; (e) based on gradients; and (f) based on second derivative.
Figure 7.12 Conceptual representation of the function v = f(r) for three APs.
Figure 7.13 Estimation of the location of a user walking from the office to the elevator and back, obtained with the SVM‐CC algorithm and with the k‐NN algorithm.
Figure 7.14 Bipolar (left) and unipolar (right) voltage EAM of a left ventricle generated by the Carto 3® system (Biosense Webster®).
Figure 7.15 (a) Example of the function f(x, y) = sin(x) + sin(y) + 2 used to evaluate the analyzed interpolation methods (i.e., LI, TPS, and ν‐SVR). The vertices (samples) used to interpolate are marked with blue points. (b) LI of the function in (a) using a Delaunay triangulation and a set of 12 vertices (blue points). First, the triangulation is computed from the vertices, and then the value of a new vertex (red asterisk) is obtained by using the triangle that encloses this new vertex.
Figure 7.16 TPS interpolation of the function f defined in Figure 7.15a using λ = 0, 1, 10, 100, 1000, and 10 000.
Figure 7.17 ν‐SVR interpolation of the function f defined in Figure 7.15a using different values of C, ν, and σ: (a) C = [1, 10, 100, 1000], ν = 0.5, and σ = 2; (b) C = 100, ν = [0.05, 0.3, 0.55, 0.8], and σ = 2; (c) C = 100, ν = 0.5, and σ = [0.1, 0.5, 1, 2, 3, 4].
Figure 7.18 Original and interpolated bipolar voltage (mV) EAMs using LI, TPS, and ν‐SVR (from left to right). The black points show the positions of the known features.
Figure 7.19 Original and interpolated activation time (ms) EAMs using LI, TPS, and ν‐SVR (from left to right). The black points show the positions of the known features.
Chapter 8
Figure 8.1 In the toy graph (left), edge thickness represents the similarity between samples. Graph methods group the unlabeled samples according to the weighted distance, not just the shortest path lengths. The two clusters are intuitively correct, even though they are connected by a thin, weak edge. In the toy hypergraph (right), the same vertices and three sets contain different, potentially interconnected subgraphs.
Figure 8.2 Solutions of the different regression models reviewed in this chapter. Pearson's correlation coefficient and sparsity level are shown in parentheses.
Figure 8.3 The ε‐insensitive region for the standard SVR (a) and for the PD‐SVR with an exponential memory decay profile (b). Only points outside the shaded regions contribute to the cost function. The square symbol indicates the actual sample to predict using the past samples. In the SVR, all samples outside a fixed tube of size ε are penalized, independently of their time index. In the PD‐SVR, only the recent past samples are considered relevant to the solution: only the points next to the actual sample are penalized and, consequently, become support vectors. Since recent samples contain, in principle, more information about the future, penalizing only these yields solutions with a lower fraction of support vectors.
Figure 8.4 BER with MSVR, SVR, and RBF network (RBFN) as a function of the length of the training sequence used during the training phase, generated with SNR = 4 dB and 10 dB (a), and with SNR = 2 dB and 6 dB (b). The nonlinearity occurs in the transmitter, thus affecting only the input signals. The improvement of the SVR‐based methods is evident.
Figure 8.5 Averaged test RMSE for the approximation of a sinc function with different noise sources and levels.
Figure 8.6 RMSE in the test set as a function of the rate of training samples for LAI estimation from CHRIS/PROBA data.
Figure 8.7 Example of a linear basis regression model using nonlocalized polynomial basis functions (see Listing 8.6). (a) Posterior mean (solid line) and uncertainty region (colored area), with three random functions drawn according to the posterior distribution (dashed line), as shown in Listing 8.6. (b) Posterior variance as a function of x. We can observe that, with nonlocalized basis functions, the variance is smaller close to the observed data and higher far away, which is the expected behavior.
Figure 8.8 Example of a Gaussian process. Up: some functions drawn at random from the GP prior. Down: some random functions drawn from the posterior; that is, the prior conditioned on six noise‐free observations indicated by big black dots. The shaded area represents the pointwise mean plus and minus two times the standard deviation for each input value (95% confidence region). Note the larger CIs for regions far from the observations.
Figure 8.9 Base kernels (top) and two random draws from a GP with each respective kernel and combination of kernels (bottom).
Figure 8.10 Example of ensemble GPR considering R = 5 partial GPs. (a) Ensemble solution and R = 5 partial solutions. (b) Ensemble posterior mean and the corresponding posterior mean obtained by a standard complete GP considering all the data jointly.
Figure 8.11 Learned warping with WGPR for different rates of training samples.
Figure 8.12 Example of an RVM model. (a) Using localized Gaussian basis functions with three random functions drawn according to the posterior distribution (dashed line); see Listing 8.10. (b) GPR solution given the same data points. (c) Comparison of the two predictive and posterior variances. Note that the variance of the RVM with localized basis functions is higher when closer to the data points, which is an undesirable effect. Big dots show the positions of the data (x_i, i = 1, …, N).
Chapter 9
Figure 9.1 A linear adaptive filter for system identification.
Figure 9.2 A kernel adaptive filter for nonlinear system identification.
Figure 9.3 KLMS predictions on the Mackey–Glass time series. Left: learning curve over 500 training iterations. Right: test samples of the Mackey–Glass time‐series and the predictions provided by KLMS.
Figure 9.4 KRLS predictions on the Mackey–Glass time series. Left: learning curve over 500 training iterations, including comparison with KLMS. Right: test samples and KRLS predictions.
Figure 9.5 Different forms of updating the kernel matrix during online learning. In KRLS‐type algorithms the update involves calculating the inverse of each kernel matrix, given the inverse of the previous matrix. Left: growing kernel matrix, as constructed in KRLS (omitting sparsification for simplicity). Right: sliding‐window kernel matrix of a fixed size, as constructed in SW‐KRLS.
Figure 9.6 MSE learning curves of different kernel adaptive filters on a communications channel that shows an abrupt change at iteration 500.
Figure 9.7 Dictionary construction processes for different sparsification approaches. Each horizontal line marks the presence of a center in the dictionary. Top left: the ever‐growing dictionary construction, in which the dictionary contains n elements at iteration n. Top right: online sparsification by slowing down the dictionary growth, as obtained by the coherence and ALD criteria. Bottom left: sliding‐window approach, displayed with 10 elements in the dictionary. Bottom right: fixed‐budget approach, in which the pruning criterion discards one element per iteration, displayed with dictionary size 10.
Figure 9.8 Functions drawn from a GP with a squared exponential covariance K(x, x′) = exp(−∥x − x′∥²/(2σ²)). The 95% confidence interval is plotted as the shaded area. Up: draws from the prior function distribution. Down: draws from the posterior function distribution, which is obtained after five data points (blue crosses) are observed. The predictive mean is displayed in black.
Figure 9.9 Forgetting operation of KRLS‐T. The original predictive mean and variance are indicated as the black line and shaded gray area, as in Figure 9.8. After one forgetting step, the mean becomes the dashed red curve, and the new 95% CI is indicated in blue.
Figure 9.10 A snapshot of the respiratory motion trace.
Figure 9.11 The respiratory motion trace and the three‐step ahead prediction of a KAF algorithm.
Figure 9.12 Sketch of a five‐link all‐revolute robot arm. The data used in the KIN40K experiment were generated by simulating an eight‐link extension of this arm.
Figure 9.13 Learning curves of different algorithms on the KIN40K data.
Figure 9.14 MSE versus complexity trade‐off comparisons for prediction of the Mackey–Glass time series. Up: maximum number of floating point operations (FLOPS) per iteration as a function of the steady‐state MSE. Down: maximum number of bytes per iteration as a function of the steady‐state MSE. Each marker represents a single run of one of the algorithms with a single set of parameters. The start of each parameter sweep is indicated by a black dot.
Figure 9.15 Predictions in two different EEG segments for the recursive K_c, the standard KRR K_2, and the lagged K_l kernels.
Figure 9.16 BER as a function of time (training frames) for the desired user.
Chapter 10
Figure 10.1 Example of margin in the kernel space. Sample ϕ(x_i) is wrongly classified, so its associated slack variable ξ_i is higher than 1. Sample ϕ(x_j) is correctly classified, but lies inside the margin, so its associated slack variable ξ_j is positive and less than 1.
Figure 10.2 Number of classifiers as a function of the number of classes for the various multiclass classification strategies. In an L‐class classification problem, the OAA and the single machine approaches have L − 1 outputs, while the OAO uses (1/2)L(L − 1) outputs. The ECC approach has a maximum number of outputs equal to 2^(L−1) − 1.
Figure 10.3 Comparison between two kernel classifiers. (a) SVM: linear decision hyperplanes in a nonlinearly transformed feature space, where slack variables ξ_i are included to deal with errors. (b) KFD: separates the classes by projecting them onto a hyperplane where the difference of the projected means (μ_+, μ_−) is large, and the variance around the means σ_+ and σ_− is small.
Figure 10.4 Illustration of the sorted density of Lagrange multipliers α_i for the best SVM and KFD classifiers. The concept of sparsity in SVMs and the nonsparse solution offered by KFD are evident.
Figure 10.5 Left: classifier obtained using labeled data (grey and black circles denote different classes). Right: classifier obtained using labeled data plus unlabeled data distribution (black points denote unlabeled data).
Figure 10.6 Results for the urban classification. First: overall accuracy OA [%]; second: κ statistic over the validation set as a function of the rate of labeled training samples used to build the models. Third: κ statistic surface over the validation set for the best RBF‐LapSVM classifier as a function of the rate of both labeled and unlabeled training samples.
Figure 10.7 General idea behind MKLg. Given different sources of registered data, a linear combination of the different similarity matrices (the kernels) is found.
Figure 10.8 (a) Accuracy rates for different training set sizes. (b) Weights of the four kernels obtained for the four groups of features.
Figure 10.9 Toy example of hierarchical structure: the classes of interest are the three leaves in the tree, which can be semantically grouped in superclasses.
Figure 10.10 (a) Large‐margin heuristics for a three‐class toy example. The color intensity represents the distance from the hyperplane: (b) MS heuristic; (c) multiclass level uncertainty (MCLU) heuristic; darker areas are the ones with maximal uncertainty, minimizing Equation 10.100 or 10.102 respectively. (d)–(f) Absolute values of per‐class distances.
Figure 10.11 Illustration of the effect of randomly sampling D bases from the Fourier domain on the kernel matrix. With sufficiently large D, the kernel matrix generated by an RKS approximates that of the RBF kernel, at a fraction of the time.
Figure 10.12 Cost as a function of the number of training and test samples (10^4, 10^5, 10^6), detailed in "synchronization," "communication," and "computation," for a number of processors between 1 and 500.
Figure 10.13 Speedup for different data sizes using the PSVM.
Figure 10.14 The decreasing line corresponds to the training error average (estimation of the empirical risk) over 1000 realizations, as a function of parameter C. The soft grey line corresponds to the test error average (estimation of the actual risk), and the black line is the difference between them (estimation of the structural risk).
Figure 10.15 Training and test errors as a function of the number of training data for the example in Listing 10.4. It can be seen that, since the SVM is consistent, both errors converge to the same value.
Figure 10.16 Results of the XOR example. The first and second panels show solutions where the SVM has been set to give too much importance to the training samples (C = 100 and 10), whereas the third panel shows a solution with C = 0.1.
Figure 10.17 Result of the Fibonacci example for three different values of the SVM parameter C, from a high one (first) to a low one (second). Increasing the value of C increases the complexity of the solution, making it prone to overfitting, as is observed in the first panel.
Figure 10.18 Laplacian SVM with RBF kernels for various values of γ_M = {10^−5, 0.07, 1} and fixed γ_L = 10^−5. Labeled points are shown as circles, while unlabeled points are given as black dots.
Chapter 11
Figure 11.1 Illustration of the one‐class kernel classifiers. (a) In the SVDD, the hypersphere containing the target data is described by center a and radius R, and samples on the boundary and outside the ball are unbounded and bounded support vectors, respectively. (b) In the case of the OC‐SVM, all samples from the target class are mapped with maximum distance to the origin.
Figure 11.2 Boundary comparison between SVDD (sphere centered at a) and KDE (sphere centered at μ_ϕ).
Figure 11.3 The standard (linear) OSP method first performs a linear transformation that looks for projections that identify the subspace spanned by the background, and then the background suppression is carried out by projecting the data onto the subspace orthogonal to the one spanned by the background components. The kernel version of this algorithm consists of the same procedure yet performed in the kernel feature space.
Figure 11.4 Distance values computed in the input space against those computed in the Hilbert space (through reproducing kernels). This figure is generated with Listing 11.3.
Figure 11.5 Results obtained by k‐means (left) and kernel k‐means (right) in the concentric ring problem.
Figure 11.6 Plot of the decision boundaries obtained with GDD, MoGDD, KnnDD, and SVDD for the two Gaussians problem (top) and the mixed Gaussian–logarithm problem (bottom). Note the semilogarithmic scale in this latter case.
Figure 11.7 (a) ROC curves and (b) AUC as a function of the kernel length‐scale parameter.
Figure 11.8 Detection results for different algorithms (accuracy, κ statistic).
Figure 11.9 Hyperspectral image (left panel) captured with AVIRIS, and four illustrative chips of simulated changes (right panel). The original (leftmost) image is used to simulate an anomalous change image (rightmost) by adding Gaussian noise and randomly scrambling 1% of the pixels.
Figure 11.10 ROC curves and AUC obtained for the hyperspectral experiments by linear and kernel detectors for 100 (up) and 500 (down) training examples.
Figure 11.11 The two‐sample problem reduces to detecting whether two distributions ℙ_x and ℙ_y are different or not. Up, left: whenever we have two Gaussians with different means, one can assess statistical differences by means of a standard t‐test of different means (the distance between the empirical means is d_xy := ∥μ_x − μ_y∥ = 3.91). Up, right: when we have two Gaussians with the same means but different variances (d_xy = 0.02), the idea may be to look (down, left) at the difference in means of transformed random variables (for the Gaussian case, second‐order features of the form x² suffice for discrimination, d_xy = 13.26). Down, right: when distributions come from different pdfs (in this case, Gaussian and Laplace distributions) with the same mean and variance, one has to map the RVs to higher order features for discrimination (d_xy = 0.06 in the original domain and d_xy = 75.80 for a fifth‐order mapping).
Chapter 12
Figure 12.1 Toy example of MVA methods in a three‐class problem. Top: features extracted using the training set. The figure shows the variance (var), the MSE when the projection is used to approximate y, and the largest covariance (cov) and correlation (corr) achievable using a linear projection of the target data. The first thing we see is that linear methods (top row) perform a rotation on the original data, while the features extracted by the nonlinear methods (second row) are based on more complicated transforms. There is a clear difference between the unsupervised methods (PCA and KPCA) and the supervised methods, which try to pull apart the data coming from different classes. Bottom: classification of test data using a linear classifier trained in the transformed domain; numerical performance is given by the overall accuracy (OA). Since in this case there is no dimensionality reduction, all the linear methods obtain the same accuracy. However, the nonlinear methods obtain very different results. In general, nonlinear methods obtain better results than the linear ones. A special case is KPCA, which obtains very poor results. KPCA is unsupervised, and in this case looking for high‐variance dimensions is not useful for discerning between the classes. KPLS obtains lower accuracy than KOPLS and KCCA, which in turn obtain a similar result since their formulations are almost equivalent (only an extra normalization step is performed by KCCA).
Figure 12.2 Average classification accuracy (percent) for linear and KMVA methods as a function of the number of extracted features, along with some classification maps for the case of n_f = 10 extracted features.
Figure 12.3 Features extracted from the original image by PCA, MNF/SNR, KPCA, standard (implicit) KSNR, and explicit KSNR in the kernel space for the first n_f = 18 components, plotted as an RGB composite of the three top components ordered in descending importance (images converted to grey scale for the book).
Figure 12.4 Dependence estimates for three examples revealing (left) high correlation (and hence high dependence), (middle) high dependence but null correlation, and (right) zero correlation and dependence. The Pearson correlation coefficient R and linear HSIC only capture second‐order statistics (linear correlation), while the rest in general capture higher order dependences. Note that, for MMD, higher values indicate more divergent (independent) distributions, while KGV upper bounds kMI and mutual information.
Figure 12.5 Projections obtained by different linear projection methods on the Wine data available at the UCI repository (http://archive.ics.uci.edu/ml/). Black and two different grey intensities represent three different classes.
Figure 12.6 Demixing example using KICA. From left to right we show the original source signals, the original (linearly) mixed signals, and the results obtained by KICA and ICA. We indicate the kurtosis in every dimension.
Figure 12.7 Density estimation for the ring dataset by KECA and OKECA using different numbers of extracted features n and estimates of the kernel length‐scale parameter σ (ML or squared distance). Black color represents low pdf values and yellow color high pdf values.
Figure 12.8 Example of differences between the local properties of the RBF, Jensen–Shannon kernel, Fisher's kernel, and the PCK. We indicate in parentheses the Frobenius norm of the residuals with respect to the ideal kernel.
Figure 12.9 Illustration of linear and KEMA on the toy experiments. Left to right: data in the original domains (X1: •; X2: •) and per class (•, •, and •), data projected with the linear and the RBF kernels, and error rates as a function of the extracted features when predicting data for the first (left inset) or the second (right inset) domain (KEMA_Lin, KEMA_RBF, SSMA, Baseline).