This edition first published 2017
© 2017 by John Wiley & Sons, Ltd
Registered office:
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
Library of Congress Cataloging-in-Publication Data
Names: Mowlaee, Pejman, 1983- author. | Kulmer, Josef, author. | Stahl, Johannes, 1989- author. | Mayer, Florian, 1986- author.
Title: Single channel phase-aware signal processing in speech communication : theory and practice / [compiled and written by] Pejman Mowlaee, Josef Kulmer, Johannes Stahl, Florian Mayer.
Description: Chichester, UK ; Hoboken, NJ : John Wiley & Sons, Inc., 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016024931 (print) | LCCN 2016033469 (ebook) | ISBN 9781119238812 (cloth) | ISBN 9781119238829 (pdf) | ISBN 9781119238836 (epub)
Subjects: LCSH: Speech processing systems. | Signal processing. | Oral communication. | Phase modulation.
Classification: LCC TK7882.S65 S575 2016 (print) | LCC TK7882.S65 (ebook) | DDC 006.4/54-dc23
LC record available at https://lccn.loc.gov/2016024931
ISBN: 9781119238812
A catalogue record for this book is available from the British Library.
Cover Image: Gettyimages/lestyan4
Pejman Mowlaee was born in Anzali, Iran. He received his BSc and MSc degrees in telecommunication engineering in Iran in 2005 and 2007. He received his PhD degree at Aalborg University, Denmark in 2010. From January 2011 to September 2012 he was a Marie Curie post-doctoral fellow for digital signal processing in audiology at Ruhr University Bochum, Germany. He is currently an assistant professor at the Speech Communication and Signal Processing (SPSC) Laboratory, Graz University of Technology, Austria.
Dr. Mowlaee has received several awards, including the young researcher's award for MSc study in 2005 and 2006 and a best MSc thesis award. His PhD work was supported by the Marie Curie EST-SIGNAL Fellowship during 2009–2010. He is a senior member of IEEE. He was an organizer of a special session and a tutorial session in 2014 and 2015. He was the editor for a special issue of the Elsevier journal Speech Communication, and is a project leader for the Austrian Science Fund.
Josef Kulmer was born in Birkfeld, Austria, in 1985. He received the MSc degree from Graz University of Technology, Austria, in 2014. In 2014 he joined the Signal Processing and Speech Communication Laboratory at Graz University of Technology, where he is currently pursuing his PhD thesis in the field of signal processing.
Johannes Stahl was born in Graz, Austria, in 1989. In 2009, he started studying electrical engineering and audio engineering at Graz University of Technology. In 2015, he received his Dipl.-Ing. (MSc) degree with distinction. In 2015 he joined the Signal Processing and Speech Communication Laboratory at Graz University of Technology, where he is currently pursuing his PhD thesis in the field of speech processing.
Florian Mayer was born in Dobl, Austria, in 1986. In 2006, he started studying electrical engineering and audio engineering at Graz University of Technology, and received his Dipl.-Ing. (MSc) in 2015.
Speech communication technology has been intensively studied for more than a century since the invention of the telephone in 1876. Today's main target applications are acoustic human–machine communication, digital telephony, and digital hearing aids. Some detailed applications for speech communication, to name a few, are artificial bandwidth extension, speech enhancement, source separation, echo cancellation, speech synthesis, speaker recognition, automatic speech recognition, and speech coding. The signal processing methods used in the aforementioned applications are mostly focused on the short-time Fourier transform. While the Fourier transform spectrum contains both amplitude and phase parts, the phase spectrum has often been neglected or regarded as unimportant. Since the spectral phase is typically wrapped due to its periodic nature, the main difficulty in phase processing is associated with extracting a continuous phase representation. In addition, modeling the spectral phase across frames is a more difficult task than modeling the spectral amplitude.
This book is, in part, an outgrowth of five years of research conducted by the first author, which started with the publication of the first paper on “Phase Estimation for Signal Reconstruction in Single-Channel Source Separation” back in 2012. It is also a product of the research actively conducted in this area by all the authors at the PhaseLab research group. The fact that there is no textbook on phase-aware signal processing for speech communication made it paramount to explain its fundamental principles. The need for such a book was even more pronounced as a follow-up to the success of a series of events organized or co-organized by the first author, amongst them: a special session on “Phase Importance in Speech Processing Applications” at the International Conference on Spoken Language Processing (INTERSPEECH) 2014, a tutorial session on “Phase Estimation from Theory to Practice” at the International Conference on Spoken Language Processing (INTERSPEECH) 2015, and an editorial for a special issue on “phase-aware signal processing in speech communication” in Speech Communication (Elsevier, 2016), all receiving considerable attention from researchers from diverse speech processing fields. The intention of this book is to unify the recent individual advances made by researchers toward incorporating phase-aware signal processing methods into speech communication applications.
This book develops the tools and methodologies necessary to deal with phase-based signal processing and its application, in particular in single-channel speech processing. It is intended to provide its readers with solid fundamental tools and a detailed overview of the controversial insights regarding the importance and unimportance of phase in speech communication. Phase wrapping, exposed as the main difficulty for analyzing the spectral phase, will be presented in detail, with solutions provided. Several useful representations derived from the phase spectrum will be presented. An in-depth analysis for the estimation of a signal's phase observed in noise together with an overview of existing methods will be given. The positive impact of phase-aware processing is demonstrated for three selected applications: speech enhancement, source separation, and speech quality estimation. Through several proof-of-concept examples and computer simulations, we demonstrate the importance and potential of phase processing in each application. Our hope is to provide a sufficient basis for researchers aiming at starting their research projects in different applications in speech communication with a special focus on phase processing.
The book is divided into two parts and consists of seven chapters and an appendix. Part I (Chapters 1–3) gives an introduction to phase-based signal processing, providing the fundamentals and key concepts. Chapter 1 gives an overview of the history of phase processing and reviews the arguments for the importance and unimportance of phase. Chapter 2 provides the required definitions and tools for phase-based signal processing, such as phase unwrapping and the many representations of the spectral phase that make the phase spectrum more accessible. Chapter 3 presents the fundamentals, limits, and potential of phase estimation, and its application to speech signals.
Part II (Chapters 4–7) deals with three applications to demonstrate the benefit of phase processing in single-channel speech enhancement (Chapter 4), single-channel source separation (Chapter 5), and speech quality estimation (Chapter 6). Chapter 7 concludes the book and provides several future prospects to pursue. The appendix is dedicated to the MATLAB® implementations collected as the PhaseLab toolbox, and describes most of the implementations that reproduce the experiments included in the book.
The book is mainly targeted at researchers and graduate students with some background in signal processing theory and applications focused on speech signal processing. Although it is not primarily intended as a textbook, the chapters may be used as supplementary material for a special-topics course at second-year graduate level. As an academic instrument, the book could be used to strengthen the understanding of the often mystical field of phase-aware signal processing and provides several interesting applications where phase knowledge is successfully incorporated. To get the maximal benefit from this book, the reader is expected to have a fundamental knowledge of digital signal processing, signals and systems, and statistical signal processing. For the sake of completeness, a summary of phase-based signal processing is provided in Chapter 2.
The book contains a detailed overview of phase processing and a collection of phase estimation methods. We hope that these provide a set of useful tools that will help new researchers entering the field of phase-aware signal processing and inspire them to solve problems related to phase processing. As the theory and practice are linked in speech communication applications, the book is supplemented by various examples and contains a number of MATLAB® experiments. The reader will find the MATLAB® implementations for the simulations presented in the book, with some audio samples, online at the following website: https://www.spsc.tugraz.at/PhaseLab
These implementations are provided in a toolbox called PhaseLab which is explained in the appendix. The authors believe that each chapter of the book itself serves as a valuable resource and reference for researchers and students. The topics covered within the seven chapters cross-link with each other and contribute to the progress of the field of phase-aware signal processing for speech communication.
The intense collaboration in the year of working on this book project together with the three contributors, Josef Kulmer, Johannes Stahl, and Florian Mayer, was a unique experience and I would like to express my deepest gratitude for all their individual efforts. Apart from the very careful and insightful proofreads, their endless helpful discussions in improving the contents of the chapters and in our regular meetings led to a successful outcome that was only possible within such a great team. In particular, I would like to thank Johannes Stahl and Josef Kulmer for their full contribution in preparing Chapters 3 and 4. I would like to thank Florian Mayer for his valuable contribution in Chapter 5 and his endless efforts in preparing all the figures in the book.
Last, but not least, a number of people contributed in various ways and I would like to thank them: Prof. Gernot Kubin, Prof. Rainer Martin, Prof. Peter Vary, Prof. Bastian Kleijn, Prof. Tim Fingscheidt, and Dr. Christiane Antweiler for their enlightening discussions, for providing several helpful hints, and for sharing their experience with the first author. I would like to thank Dr. Thomas Drugman, Dr. Gilles Degottex, and Dr. Rahim Saeidi for their support regarding the experiments in Chapter 2. Special thanks go to Andreas Gaich for his support in preparing the results in Chapter 6. I am also thankful to several of my former Masters students who graduated at PhaseLab at TU Graz, Carlos Chacón, Anna Maly, and Mario Watanabe, for their valuable insights and outstanding support. I am grateful to Nasrin Ordoubazari, Fereydoun, Kamran, Solmaz, Hana, and Fatemeh Mowlaee, and the Almirdamad family who provided support and encouragement during this book project.
I would also like to thank the editorial team at John Wiley & Sons for their friendly assistance. Finally, I acknowledge the financial support from the Austrian Science Fund (FWF) project number P28070-N33.
P. Mowlaee
Graz, Austria
April 4, 2016
absolute value | |
angle | |
clean speech phase spectrum | |
tuning parameter for modified smoothed group delay | |
mean value of the von Mises distribution | |
perturbed clean speech phase | |
clean speech amplitude spectrum | |
amplitude of harmonic h | |
scale factor in the z-transform X(z) | |
clean speech amplitude spectrum estimate | |
coefficients in the numerator polynomial of X(z) | |
continuous phase function | |
principal value of phase | |
coefficients in the denominator polynomial of X(z) | |
basis matrix for the qth source in NMF | |
smoothing parameter for decision-directed a priori SNR estimation | |
smoothing parameter for the uncertainty in unvoiced speech | |
compression parameter of the parametric speech spectrum estimators | |
coherent gain of a window function | |
compression function | |
baseband phase difference (BPD) | |
−3 dB bandwidth of the window mainlobe | |
distance metric used in geometry-based phase estimator | |
GDD-based distance metric used in geometry-based phase estimator | |
parabolic cylinder function | |
additive noise signal in time domain | |
additive noise along time with applied window function | |
divergence measure | |
DFT coefficient for noise | |
DTFT of additive noise | |
DTFT of windowed noise frame | |
distance measure as squared error between two spectra | |
mask approximation objective measure | |
signal approximation objective measure | |
change in inconsistency | |
group delay deviation | |
phase deviation between the observation and the noisy signal | |
cyclic mean phase error | |
remixing error in MISI for the ith iteration | |
expected value operator | |
conditional expected value operator | |
relative change of inconsistency | |
sampling frequency in Hz | |
fundamental frequency in Hz | |
fundamental frequency of qth source in mixture | |
phase deviation | |
instantaneous phase from STFT | |
relative phase shift | |
confluent hypergeometric function | |
gain function of a speech spectrum estimation scheme | |
STFT(iSTFT(·)) | |
tuning parameter for modified smoothed group delay | |
key adjustment parameter in CWF | |
magnitude-squared coherence (MSC) | |
Gamma function | |
phase-sensitive filter | |
complex mask filter | |
complex ratio mask filter | |
harmonic index | |
desired harmonic | |
number of harmonics | |
hypothesis of no harmonic structure in the phase | |
hypothesis of harmonic structure in the phase | |
iteration index | |
maximum number of iterations | |
modified Bessel function of the first kind and order ν | |
inconsistency operator | |
discretized IF | |
confidence domain for the qth source in PPR approach | |
ideal binary mask | |
ideal ratio mask | |
instantaneous frequency deviation | |
imaginary unit | |
frequency index | |
von Mises distribution concentration parameter | |
frame index | |
integer-valued function used in time series phase unwrapping | |
number of frames | |
local criterion used in IBM | |
phase spectrum compensation function | |
number of periods per window length | |
integer value as phase wrapping number | |
number of atoms used in NMF | |
number of zeros inside of the unit circle | |
number of zeros outside of the unit circle | |
shape parameter of the parametric speech amplitude distribution | |
circular mean parameter for the hth harmonic | |
circular mean parameter of the von Mises distribution | |
mean of the Gaussian distribution fitted to the qth source fundamental frequency | |
standard deviation of Gaussian distribution fitted to the qth source fundamental frequency | |
sample index | |
instantaneous attack time | |
length of a window function | |
length of a frame | |
number of DFT points | |
normalized mean square error | |
normalized angular frequency | |
fundamental radian frequency | |
instantaneous frequency (IF) | |
closest sinusoid to bin k in STFTPI | |
tuning factor to scale mask in IRM | |
phase change in Nashi's phase unwrapping method | |
phase increment in Nashi's phase unwrapping method | |
voicing probability | |
linear phase along time | |
frequency derivative of phase | |
phase value of harmonic h | |
estimated phase value of harmonic h | |
phase distortion | |
probability density function | |
phase spectrum of the analysis window | |
source index in a mixture | |
number of audio sources in a mixture | |
radial step size | |
Pearson's correlation coefficient | |
constant threshold used in ISSIR | |
phase randomization index | |
absolute value of noisy speech signal STFT | |
relative phase shift | |
set of frames for von Mises parameter estimation | |
frame shift, hop size in samples | |
speech variance | |
speech intelligibility | |
signal-to-signal ratio (SSR) | |
SNR amplitude | |
SNR phase | |
local SNR | |
normalized root-mean-square error | |
circular variance | |
noise variance | |
instantaneous harmonic phase | |
objective function used in CWF | |
unwrapped harmonic phase | |
time–frequency smoothed harmonic phase | |
activity domain used in ISSIR approach | |
time index | |
sampling period | |
Kendall's tau | |
group delay | |
modified smoothed group delay function | |
fixed threshold used in PPR approach | |
smoothed group delay function | |
three-dimensional matrix for phase | |
unwrapped root mean square error | |
unwrapped harmonic phase SNR | |
unvoiced speech signal components | |
unvoiced speech signal spectrum | |
anti-symmetry function used in phase spectrum compensation | |
prediction error in adaptive numerical integration | |
vocal tract spectrum | |
activation matrix for the qth source in NMF | |
phase spectrum of the vocal tract (minimum phase) | |
von Mises distribution | |
noisy speech phase spectrum | |
window function along time | |
frequency response for the window w(n) | |
band importance function for the kth frequency band | |
clean speech signal in time domain | |
deterministic speech component in time domain | |
windowed deterministic speech spectrum | |
stochastic–deterministic (SD) speech signal in time domain | |
zero-phase signal | |
signal frame | |
DFT of a clean speech signal | |
qth source DFT spectrum in a mixture | |
sequence along time with applied window function | |
DTFT of a windowed speech frame | |
baseband representation | |
product spectrum | |
real part of the clean speech spectrum | |
imaginary part of the clean speech spectrum | |
a priori SNR | |
noisy speech in time domain | |
noisy speech spectrum | |
a posteriori mean for the stochastic–deterministic approach | |
signal's modified STFT | |
modified signal | |
a posteriori SNR | |
mth zero in the z-plane |