
Table of Contents

Preface

Chapter 1 Speech Analysis

1.1. Introduction

1.2. Linear prediction

1.3. Short-term Fourier transform

1.4. A few other representations

1.5. Conclusion

1.6. References

Chapter 2 Principles of Speech Coding

2.1. Introduction

2.2. Telephone-bandwidth speech coders

2.3. Wideband speech coding

2.4. Audiovisual speech coding

2.5. References

Chapter 3 Speech Synthesis

3.1. Introduction

3.2. Key goal: speaking for communicating

3.3. Synoptic presentation of the elementary modules in speech synthesis systems

3.4. Description of linguistic processing

3.5. Acoustic processing methodology

3.6. Speech signal modeling

3.7. Control of prosodic parameters: the PSOLA technique

3.8. Towards variable-size acoustic units

3.9. Applications and standardization

3.10. Evaluation of speech synthesis

3.11. Conclusions

3.12. References

Chapter 4 Facial Animation for Visual Speech

4.1. Introduction

4.2. Applications of facial animation for visual speech

4.3. Speech as a bimodal process

4.4. Synthesis of visual speech

4.5. Animation

4.6. Conclusion

4.7. References

Chapter 5 Computational Auditory Scene Analysis

5.1. Introduction

5.2. Principles of auditory scene analysis

5.3. CASA principles

5.4. Critique of the CASA approach

5.5. Perspectives

5.6. References

Chapter 6 Principles of Speech Recognition

6.1. Problem definition and approaches to the solution

6.2. Hidden Markov models for acoustic modeling

6.3. Observation probabilities

6.4. Composition of speech unit models

6.5. The Viterbi algorithm

6.6. Language models

6.7. Conclusion

6.8. References

Chapter 7 Speech Recognition Systems

7.1. Introduction

7.2. Linguistic model

7.3. Lexical representation

7.4. Acoustic modeling

7.5. Decoder

7.6. Applicative aspects

7.7. Systems

7.8. Perspectives

7.9. References

Chapter 8 Language Identification

8.1. Introduction

8.2. Language characteristics

8.3. Language identification by humans

8.4. Language identification by machines

8.5. LId resources

8.6. LId formulation

8.7. LId modeling

8.8. Discussion

8.9. References

Chapter 9 Automatic Speaker Recognition

9.1. Introduction

9.2. Typology and operation of speaker recognition systems

9.3. Fundamentals

9.4. Performance evaluation

9.5. Applications

9.6. Conclusions

9.7. Further reading

Chapter 10 Robust Recognition Methods

10.1. Introduction

10.2. Signal pre-processing methods

10.3. Robust parameters and distance measures

10.4. Adaptation methods

10.5. Compensation of the Lombard effect

10.6. Missing data scheme

10.7. Conclusion

10.8. References

Chapter 11 Multimodal Speech: Two or Three Senses are Better than One

11.1. Introduction

11.2. Speech is a multimodal process

11.3. Architectures for audiovisual fusion in speech perception

11.4. Audiovisual speech recognition systems

11.5. Conclusions

11.6. References

Chapter 12 Speech and Human-Computer Communication

12.1. Introduction

12.2. Context

12.3. Specificities of speech

12.4. Application domains with voice-only interaction

12.5. Application domains with multimodal interaction

12.6. Conclusions

12.7. References

Chapter 13 Voice Services in the Telecom Sector

13.1. Introduction

13.2. Automatic speech processing and telecommunications

13.3. Speech coding in the telecommunication sector

13.4. Voice command in telecom services

13.5. Speaker verification in telecom services

13.6. Text-to-speech synthesis in telecommunication systems

13.7. Conclusions

13.8. References

List of Authors

Index


First published in France in 2002 by Hermes Science/Lavoisier entitled Traitement automatique du langage parlé 1 et 2 © LAVOISIER, 2002

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27–37 St George's Road
London SW19 4EU
UK

www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA

www.wiley.com

© ISTE Ltd, 2009

The rights of Joseph Mariani to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.


Traitement automatique du langage parlé 1 et 2. English

  Language and speech processing / edited by Joseph Mariani.

       p. cm

Includes bibliographical references and index.

ISBN 978-1-84821-031-8

1. Automatic speech recognition. 2. Speech processing systems. I. Mariani, Joseph. II. Title.

TK7895.S65T7213 2008


006.4'54--dc22

2008036758


British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN: 978-1-84821-031-8


Preface

This book, entitled Language and Speech Processing, addresses all aspects of the automatic processing of spoken language: how to automate its production and perception, how to synthesize it and understand it. It draws on existing know-how in signal processing, pattern recognition, stochastic modeling, computational linguistics and human factors, but also relies on knowledge specific to spoken language.

The automatic processing of spoken language covers activities related to the analysis of speech, including variable-rate coding for storage or transmission; to its synthesis, especially from text; and to its recognition and understanding, whether for transcription, possibly followed by automatic indexing, or for human-machine dialog or machine-assisted human-human interaction. It also includes speaker and spoken language recognition. These tasks may take place in noisy environments, which makes the problems even more difficult.

Activities in the field of automatic language and speech processing started around the time of the Second World War with the work of Dudley and colleagues at Bell Labs on the Vocoder and the Voder, and were made possible by the availability of electronic devices. Initial research on basic recognition systems was carried out with very limited computing resources in the 1950s. The computer facilities that became available to researchers in the 1970s made it possible to achieve initial progress within laboratories, and microprocessors then led to the early commercialization of the first voice recognition and speech synthesis systems at an affordable price. The steady progress in computer speed and storage capacity accompanied the scientific advances in the field.

Research investigations in the 1970s, including those carried out in the large DARPA “Speech Understanding Systems” (SUS) program in the USA, suffered from a lack of availability of speech data and of means and methods for evaluating the performance of different approaches and systems. The establishment by DARPA, as part of its following program launched in 1984, of a national language resources center, the Linguistic Data Consortium (LDC), and of a system assessment center, within the National Institute of Standards and Technology (NIST, formerly NBS), brought this area of research to maturity. The evaluation campaigns in the area of speech recognition, launched in 1987, made it possible to compare the different approaches that had coexisted up to then, based on “Artificial Intelligence” methods or on stochastic modeling methods using large amounts of data for training, with a clear advantage to the latter. This led progressively to a quasi-generalization of stochastic approaches in most laboratories in the world. The progress made by researchers constantly kept pace with the increasing difficulty of the tasks being handled, starting from the recognition of sentences read aloud, with a limited vocabulary of 1,000 words, either speaker-dependent or speaker-independent, to the dictation of newspaper articles with vocabularies of 5,000, 20,000 and 64,000 words, and then to the transcription of radio or television broadcast news, with vocabularies of unlimited size. These evaluations were opened to the international community in 1992. They first focused on American English, but early initiatives were also carried out on French, German or British English in a French or European context. Other campaigns were subsequently held on speaker recognition, language identification or speech synthesis in various contexts, allowing for a better understanding of the pros and cons of an approach, and for measuring the status of the technology and the progress achieved or still to be achieved. They led to the conclusion that a sufficient level of maturity had been reached to put the technology on the market, in the field of voice dictation systems for example. However, they also highlighted the difficulty of other, more challenging problems, such as the recognition of conversational speech, justifying the need to keep supporting fundamental research in this area.

This book consists of two parts: the first part discusses the analysis and synthesis of speech and the second part speech recognition and understanding. The first part starts with a brief introduction to the principles of speech production, followed by a broad overview of the methods for analyzing speech: linear prediction, short-term Fourier transform, time representations, wavelets, cepstrum, etc. The main methods for speech coding are then developed, for the telephone bandwidth, such as the CELP coder, or for wideband communication, such as “transform coding” and quantization methods. The audiovisual coding of speech is also introduced. The various operations to be carried out in a text-to-speech synthesis system are then presented, regarding the linguistic processes (grapheme-to-phoneme transcription, syntactic and prosodic analysis) and the acoustic processes, using rule-based approaches or approaches based on the concatenation of variable-length acoustic units. The different types of speech signal modeling – articulatory, formant-based, autoregressive, harmonic-noise or PSOLA-like – are then described. The evaluation of speech synthesis systems is a topic of specific attention in this chapter. The extension of speech synthesis to the animation of talking faces is the subject of the next chapter, with a presentation of the application fields, of the interest of a bimodal approach and of the models used to synthesize and animate the face. Finally, computational auditory scene analysis opens up prospects for speech signal processing, especially in noisy environments.

The second part of the book focuses on speech recognition. The principles of speech recognition are first presented. Hidden Markov models are introduced, as well as their use for the acoustic modeling of speech. The Viterbi algorithm is described, before language modeling and the way to estimate probabilities are introduced. This is followed by a presentation of recognition systems, based on those principles and on the integration of those methodologies and of lexical and acoustic-phonetic knowledge. The applicative aspects are highlighted, such as efficiency, portability and confidence measures, before three types of recognition systems are described: for text dictation, for audio document indexing and for oral dialog. Research in language identification aims at recognizing which language is spoken, using acoustic, phonetic, phonotactic or prosodic information. The characteristics of languages are introduced and the way humans or machines can achieve this task is described, with a broad presentation of the current performance of such systems. Speaker recognition addresses the recognition and verification of the identity of a person based on their voice. After an introduction on what characterizes a voice, the different types and designs of systems are presented, as well as their theoretical background. The way to evaluate the performance of speaker recognition systems and the applications of this technology are a specific topic of interest. The use of speech or speaker recognition systems in noisy environments raises especially difficult problems, which must nevertheless be taken into account in any operational use of such systems. Various methods are available, acting on the pre-processing of the signal, during the parameterization phase, through specific distance measures or through adaptation methods. The Lombard effect, which causes a change in the production of the voice signal itself due to the noisy environment surrounding the speaker, receives special attention. Along with recognition based solely on the acoustic signal, bimodal recognition combines two acquisition channels: auditory and visual. The value added by bimodal processing in a noisy environment is emphasized and architectures for the audiovisual fusion of audio and visual speech information are presented. Finally, applications of automatic language and speech processing systems, generally for human-machine communication and particularly in telecommunications, are described. Applications of speech coding, recognition and synthesis exist in many fields, and the market is growing rapidly. However, there are still technological and psychological barriers that require more work on modeling human factors and ergonomics, in order to make those systems widely accepted.

The reader, whether undergraduate or graduate student, engineer or researcher, will find in this book many contributions from leading French experts of international renown who share the same enthusiasm for this exciting field: the processing by machines of a capacity that used to be specific to humans, language.

Finally, as editor, I would like to warmly thank Anna and Frédéric Bimbot for the excellent work they achieved in translating the book Traitement automatique du langage parlé, on which this book is based.

Joseph Mariani
November 2008

Chapter 1

Speech Analysis

1.1. Introduction

1.1.1. Source-filter model

Speech, the acoustic manifestation of language, is probably the main means of communication between human beings. The invention of telecommunications and the development of digital information processing have therefore entailed vast amounts of research aimed at understanding the mechanisms of speech communication.

Speech can be approached from different angles. In this chapter, we will consider speech as a signal, a one-dimensional function, which depends on the time variable (as in [BOI 87, OPP 89, PAR 86, RAB 75, RAB 77]). The acoustic speech signal is obtained at a given point in space by a sensor (microphone) and converted into electrical values. These values are denoted s(t) and they represent a real-valued function of real variable t, analogous to the variation of the acoustic pressure. Even if the acoustic form of the speech signal is the most widespread (it is the only signal transmitted over the telephone), other types of analysis also exist, based on alternative physiological signals (for instance, the electroglottographic signal, the palatographic signal, the airflow), or related to other modalities (for example, the image of the face or the gestures of the articulators). The field of speech analysis covers the set of methods aiming at the extraction of information on and from this signal, in various applications, such as:

– speech coding: the compression of information carried by the acoustic signal, in order to save data storage or to reduce transmission rate;

– speech recognition and understanding, speaker and spoken language recognition;

– speech synthesis or automatic speech generation, from an arbitrary text;

– speech signal processing, which covers many applications, such as hearing aids, denoising, speech encryption, echo cancellation and post-processing for audiovisual applications;

– phonetic and linguistic analysis, speech therapy, voice monitoring in professional situations (for instance, singers, speakers, teachers, managers, etc.).

Two ways of approaching signal analysis can be distinguished: the model-based approach and the representation-based approach. When a voice signal model (or a voice production model or a voice perception model) is assumed, the goal of the analysis step is to identify the parameters of that model. Thus, many analysis methods, referred to as parametric methods, are based on the source-filter model of speech production; for example, the linear prediction method. On the other hand, when no particular hypothesis is made on the signal, mathematical representations equivalent to its time representation can be defined, so that new information can be drawn from the coefficients of the representation. An example of a non-parametric method is the short-term Fourier transform (STFT). Finally, there are some hybrid methods (sometimes referred to as semi-parametric). These consist of estimating some parameters from non-parametric representations. The sinusoidal and cepstral representations are examples of semi-parametric representations.

This chapter is centered on the linear acoustic source-filter speech production model. It presents the most common speech signal analysis techniques, together with a few illustrations. The reader is assumed to be familiar with the fundamentals of digital signal processing, such as discrete-time signals, Fourier transform, Laplace transform, Z-transforms and digital filters.

1.1.2. Speech sounds

The human speech apparatus can be broken down into three functional parts [HAR 76]: 1) the lungs and trachea, 2) the larynx and 3) the vocal tract. The abdomen and thorax muscles are the engine of the breathing process. Compressed by the muscular system, the lungs act as bellows and supply some air under pressure which travels through the trachea (subglottic pressure). The airflow thus expired is then modulated by the movements of the larynx and those of the vocal tract.

The larynx is composed of the set of muscles, articulated cartilage, ligaments and mucous membranes located between the trachea on one side, and the pharyngeal cavity on the other side. The cartilage, ligaments and muscles in the larynx can set the vocal cords in motion, the opening of which is called the glottis. When the vocal cords lie apart from each other, the air can circulate freely through the glottis and no sound is produced. When both membranes are close to each other, they can join and modulate the subglottic airflow and pressure, thus generating isolated pulses or vibrations. The fundamental frequency of these vibrations governs the pitch of the voice signal (F0).

The vocal tract can be subdivided into three cavities: the pharynx (from the larynx to the velum and the back of the tongue), the oral tract (from the pharynx to the lips) and the nasal cavity. When it is open, the velum is able to divert some air from the pharynx to the nasal cavity. The geometrical configuration of the vocal tract depends on the organs responsible for the articulation: jaws, lips, tongue.

Each language uses a certain subset of sounds, among those that the speech apparatus can produce [MAL 74]. The smallest distinctive sound units used in a given language are called phonemes. The phoneme is the smallest spoken unit which, when substituted with another one, changes the linguistic content of an utterance. For instance, changing the initial /p/ sound of “pig” (/pIg/) into /b/ yields a different word: “big” (/bIg/). Therefore, the phonemes /p/ and /b/ can be distinguished from each other.

A set of phonemes, which can be used for the description of various languages [WEL 97], is given in Table 1.1 (described both by the International Phonetic Alphabet, IPA, and the computer readable Speech Assessment Methodologies Phonetic Alphabet, SAMPA). The first subdivision that is observed relates to the excitation mode and to the vocal tract stability: the distinction between vowels and consonants. Vowels correspond to a periodic vibration of the vocal cords and to a stable configuration of the vocal tract. Depending on whether the nasal branch is open or not (as a result of the lowering of the velum), vowels have either a nasal or an oral character. Semivowels are produced when the periodic glottal excitation occurs simultaneously with a fast movement of the vocal tract, between two vocalic positions.

Consonants correspond to fast constriction movements of the articulatory organs, i.e. generally to rather unstable sounds, which evolve over time. For fricatives, a strong constriction of the vocal tract causes a friction noise. If the vocal cords vibrate at the same time, the fricative consonant is then voiced. Otherwise, if the vocal folds let the air pass through without producing any sound, the fricative is unvoiced. Plosives are obtained by a complete obstruction of the vocal tract, followed by a release phase. If produced together with the vibration of the vocal cords, the plosive is voiced, otherwise it is unvoiced. If the nasal branch is opened during the mouth closure, the produced sound is a nasal consonant. Semivowels are considered voiced consonants, resulting from a fast movement which briefly passes through the articulatory position of a vowel. Finally, liquid consonants are produced as the combination of a voiced excitation and fast articulatory movements, mainly from the tongue.

Table 1.1. Computer-readable Speech Assessment Methodologies Phonetic Alphabet, SAMPA, and its correspondence in the International Phonetic Alphabet, IPA, with examples in 6 different languages [WEL 97]


In speech production, sound sources appear to be relatively localized; they excite the acoustic cavities in which the resulting air disturbances propagate and then radiate to the outer acoustic field. This relative independence of the sources from the transformations that they undergo is the basis for the acoustic theory of speech production [FAN 60, FLA 72, STE 99]. This theory considers source terms, on the one hand, which are generally assumed to be non-linear, and a linear filter on the other hand, which acts upon and transforms the source signal. This source-filter decomposition reflects the terminology commonly used in phonetics, which describes speech sounds in terms of “phonation” (source) and “articulation” (filter). The source and filter acoustic contributions can be studied separately, as they can be considered to be decoupled from each other, to a first approximation. From the point of view of physics, this model is an approximation, the main advantage of which is its simplicity. It can be considered as valid at frequencies below 4 or 5 kHz, i.e. those frequencies for which the propagation in the vocal tract consists of one-dimensional plane waves. For signal processing purposes, the acoustic model can be described as a linear system, by neglecting the source-filter interaction:

[1.1]images

[1.2]images

[1.3]images

[1.4]images

where s(t) is the speech signal, v(t) the impulse response of the vocal tract, e(t) the vocal excitation source, l(t) the impulse response of the lip radiation component, p(t) the periodic part of the excitation, r(t) the non-periodic (noise) part of the excitation, ug(t) the glottal airflow wave, T0 the fundamental period, δ the Dirac distribution, and where S(ω), V(ω), E(ω), L(ω), P(ω), R(ω), Ug(ω) denote the Fourier transforms of s(t), v(t), e(t), l(t), p(t), r(t), ug(t) respectively. F0 = 1/T0 is the voicing fundamental frequency. The various terms of the source-filter model are now studied in more detail.
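To make the role of each term concrete, the following minimal Python sketch builds a crude voiced sound by passing a pulse train p plus a small noise term r through a glottal-shaping low-pass filter, an all-pole vocal-tract filter and a differencing lip-radiation filter, mirroring the source-filter cascade described above. All numerical values and filter choices are illustrative assumptions, not values taken from the text.

```python
import numpy as np
from scipy import signal

# Illustrative constants (not from the text): sampling rate, pitch, duration.
Fs, F0, dur = 16000, 120, 0.5
N = int(dur * Fs)

# p(t): periodic part of the excitation, a pulse train at period T0 = 1/F0.
p = np.zeros(N)
p[::Fs // F0] = 1.0

# r(t): non-periodic (noise) part of the excitation.
r = 0.01 * np.random.randn(N)

# ug(t): glottal shaping, modeled here as a low-pass filter with a double real pole.
K = 0.97
e = signal.lfilter([1.0], np.convolve([1, -K], [1, -K]), p) + r

# v(t): vocal tract as an all-pole filter with two resonances standing in for formants.
def resonator(fc, bw):
    rad = np.exp(-np.pi * bw / Fs)
    return [1.0, -2.0 * rad * np.cos(2 * np.pi * fc / Fs), rad * rad]

a_vt = np.convolve(resonator(700, 100), resonator(1200, 120))
vt_out = signal.lfilter([1.0], a_vt, e)

# l(t): lip radiation approximated by a first-order differencing filter.
s = signal.lfilter([1.0, -0.98], [1.0], vt_out)
```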

1.1.3. Sources

The source component e(t), E(ω) is a signal composed of a periodic part (vibrations of the vocal cords, characterized by F0 and the glottal airflow waveform) and a noise part. The various phonemes use both types of source excitation either separately or simultaneously.

1.1.3.1. Glottal airflow wave

The study of glottal activity (phonation) is particularly important in speech science. Physical models of the functioning of the glottis, in terms of mass-spring systems, have been investigated [FLA 72]. Several types of physiological signals can be used to conduct studies on the glottal activity (for example, electroglottography, fast photography, see [TIT 94]). From the acoustic point of view, the glottal airflow wave, which represents the airflow traveling through the glottis as a function of time, is preferred to the pressure wave. It is indeed easier to measure the glottal airflow than the glottal pressure from physiological data. Moreover, the pseudo-periodic voicing source p(t) can be broken down into two parts: a pulse train, which represents the periodic part of the excitation, and a low-pass filter, with an impulse response ug, which corresponds to the (frequency-domain and time-domain) shape of the glottal airflow wave.

The time-domain shape of the glottal airflow wave (or, more precisely, of its derivative) generally governs the behavior of the time-domain signal for vowels and voiced signals [ROS 71]. Time-domain models of the glottal airflow have several properties in common: they are periodic, always non-negative (no incoming airflow), and they are continuous functions of the time variable, differentiable everywhere except, in some cases, at the closing instant. An example of such a time-domain model is the Klatt model [KLA 90], which calls for 4 parameters (the fundamental frequency F0, the voicing amplitude AV, the opening ratio Oq and the attenuation TL of a spectral tilt filter). When there is no attenuation, the KGLOTT88 model is written:

[1.5]images

When TL ≠ 0, Ug(t) is filtered by an additional low-pass filter, with an attenuation at 3,000 Hz equal to TL dB.

The LF model [FAN 85] represents the derivative of the glottal airflow with 5 parameters (fundamental period T0, amplitude at the minimum of the derivative or at the maximum of the wave Ee, instant of maximum excitation Te, instant of maximum airflow wave Tp, time constant for the return phase Ta):

[1.6]images

In this equation, parameter e is defined by an implicit equation:

[1.7]images

All time-domain models (see Figure 1.1) have at least three main parameters: the voicing amplitude, which governs the time-domain amplitude of the wave, the voicing period, and the opening duration, i.e. the fraction of the period during which the wave is non-zero. In fact, the glottal wave represents the airflow traveling through the glottis. This flow is zero when the vocal cords are closed. It is positive when they are open. A fourth parameter is introduced in some models to account for the speed at which the glottis closes. This closing speed is related to the high frequency part of the speech spectrum.
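By way of illustration, the following sketch generates one period of a KGLOTT88-like glottal airflow pulse from the three main parameters just listed (voicing amplitude AV, fundamental period T0 = 1/F0 and opening ratio Oq). The quadratic-minus-cubic open-phase shape and the constants used below follow the commonly published form of the Klatt model; as equation [1.5] is not reproduced here, they should be read as assumptions of the example.

```python
import numpy as np

def kglott88_pulse(AV=1.0, F0=120.0, Oq=0.6, Fs=16000):
    """One period of a KGLOTT88-like glottal airflow pulse (no spectral tilt).

    Open phase: Ug(t) = a*t**2 - b*t**3 for 0 <= t <= Oq*T0, zero afterwards.
    The constants a and b are chosen so that the flow returns to zero at
    t = Oq*T0 and its maximum equals AV (assumed convention).
    """
    T0 = 1.0 / F0
    t = np.arange(int(Fs * T0)) / Fs
    a = 27.0 * AV / (4.0 * Oq**2 * T0**2)
    b = 27.0 * AV / (4.0 * Oq**3 * T0**3)
    # Non-negative during the open phase, zero during the closed phase.
    return np.where(t <= Oq * T0, a * t**2 - b * t**3, 0.0)

pulse = kglott88_pulse()
```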

Figure 1.1. Models of the glottal airflow waveform in the time domain: triangular model, Rosenberg model, KGLOT88, LF and the corresponding spectra


The general shape of the glottal airflow spectrum is one of a low-pass filter. Fant [FAN 60] uses four poles on the negative real axis:

[1.8]images

with sr1 = sr2 = 2π × 100 Hz, sr3 = 2π × 2,000 Hz and sr4 = 2π × 4,000 Hz. This is a spectral model with six parameters (F0, Ug0 and four poles), among which two are fixed (sr3 and sr4). This simple form is used in [MAR 76] in the digital domain, as a second-order low-pass filter, with a double real pole at K:

[1.9]images

Two poles are sufficient in this case, as the numerical model is only valid up to approximately 4,000 Hz. Such a filter depends on three parameters: gain Ug0, which corresponds to the voicing amplitude, fundamental frequency F0 and a frequency parameter K, which replaces both sr1 and sr2. The spectrum shows an asymptotic slope of -12 dB/octave when the frequency increases. Parameter K controls the filter’s cut-off frequency. When the frequency tends towards zero, |Ug(0)| ~ Ug0. Therefore, the spectral slope is zero in the neighborhood of zero, and -12 dB/octave, for frequencies above a given bound (determined by K). When the focus is put on the derivative of the glottal airflow, the two asymptotes have slopes of +6 dB/octave and -6 dB/octave respectively. This explains the existence of a maximum in the speech spectrum at low frequencies, stemming from the glottal source.
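The −12 dB/octave behavior of this two-pole model is easy to check numerically. The short sketch below evaluates the frequency response of a filter with a double real pole at K (gain Ug0) and measures the slope over one octave; K, Ug0 and the chosen frequencies are arbitrary illustration values.

```python
import numpy as np
from scipy import signal

# Two-pole digital glottal model: gain Ug0 and a double real pole at K.
Fs, K, Ug0 = 16000, 0.98, 1.0
num = [Ug0]
den = np.convolve([1, -K], [1, -K])        # (1 - K z^-1)^2

w, h = signal.freqz(num, den, worN=4096, fs=Fs)
mag_db = 20 * np.log10(np.abs(h) + 1e-12)

# Slope over one octave (500 Hz to 1 kHz), expected to be close to -12 dB.
i1, i2 = np.searchsorted(w, [500, 1000])
print("slope per octave: %.1f dB" % (mag_db[i2] - mag_db[i1]))
```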

Another way to calculate the glottal airflow spectrum is to start with time-domain models. For the Klatt model, for example, the following expression is obtained for the Laplace transform L, when there is no additional spectral attenuation:

[1.10]images

Figure 1.2. Schematic spectral representation of the glottal airflow waveform. Solid line: abrupt closure of the vocal cords (minimum spectral slope). Dashed line: damped closure. The cut-off frequency due to this damping is equal to 4 times the spectral maximum Fg


It can be shown that this is a low-pass spectrum. The derivative of the glottal airflow shows a spectral maximum located at:

[1.11]images

This sheds light on the links between time-domain and frequency-domain parameters: the opening ratio (i.e. the ratio between the opening duration of the glottis and the overall glottal period) governs the spectral peak frequency. The time-domain amplitude rules the frequency-domain amplitude. The closing speed of the vocal cords relates directly to the spectral attenuation in the high frequencies, which shows a minimum slope of –12 dB/octave.

1.1.3.2. Noise sources

The periodic vibration of the vocal cords is not the only sound source in speech. Noise sources are involved in the production of several phonemes. Two types of noise can be observed: transient noise and continuous noise. When a plosive is produced, the holding phase (total obstruction of the vocal tract) is followed by a release phase. A transient noise is then produced by the pressure and airflow impulse generated by the opening of the obstruction. The source is located in the vocal tract, at the point where the obstruction and release take place. The impulse is a wide-band noise which varies slightly with the plosive.

For continuous noise (fricatives), the sound originates from turbulences in the fast airflow at the level of the constriction. Shadle [SHA 90] distinguishes noise generated at the walls from noise generated by obstacles, depending on the angle of incidence of the air stream on the constriction. In both cases, the turbulences produce a source of random acoustic pressure downstream of the constriction. The power spectrum of this signal is approximately flat in the range of 0 – 4,000 Hz, and then decreases with frequency.

When the constriction is located at the glottis, the resulting noise (aspiration noise) shows a wide-band spectral maximum around 2,000 Hz. When the constriction is in the vocal tract, the resulting noise (frication noise) also shows a roughly flat spectrum, either slowly decreasing or with a wide maximum somewhere between 4 kHz and 9 kHz. The position of this maximum depends on the fricative. The excitation source for continuous noise can thus be considered as a white Gaussian noise filtered by a low-pass filter or by a wide band-pass filter (several kHz wide).

In continuous speech, it is interesting to separate the periodic and non-periodic contributions of the excitation. For this purpose, either the sinusoidal representation [SER 90] or the short-term Fourier spectrum [DAL 98, YEG 98] can be used. The principle is to subtract from the source signal its harmonic component, in order to obtain the non-periodic component. Such a separation process is illustrated in Figure 1.3.
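A very crude version of this harmonic subtraction can be written in a few lines: if the fundamental period is known, averaging the signal over successive periods gives an estimate of the periodic component, and the remainder approximates the non-periodic part. This is only a toy stand-in for the sinusoidal or short-term Fourier approaches cited above; the function and parameter names are invented for the example.

```python
import numpy as np

def split_periodic_aperiodic(x, T0):
    """Crude periodic/aperiodic decomposition over an integer number of
    periods of length T0 (in samples): the period-averaged waveform is taken
    as the periodic component and subtracted from the signal."""
    n_periods = len(x) // T0
    x = x[:n_periods * T0]
    template = x.reshape(n_periods, T0).mean(axis=0)
    periodic = np.tile(template, n_periods)
    return periodic, x - periodic

# Example on a synthetic "vowel-like" signal with known period.
Fs, F0 = 16000, 100
T0 = Fs // F0
t = np.arange(10 * T0)
x = np.sin(2 * np.pi * F0 * t / Fs) + 0.05 * np.random.randn(len(t))
periodic, aperiodic = split_periodic_aperiodic(x, T0)
```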

Figure 1.3. Spectrum of the excitation source for a vowel. (A) the complete spectrum; (B) the non-periodic part; (C) the periodic part


1.1.4. Vocal tract

The vocal tract is an acoustic cavity. In the source-filter model, it plays the role of a filter, i.e. a passive system which is independent from the source. Its function consists of transforming the source signal, by means of resonances and anti-resonances. The maxima of the vocal tract’s spectral gain are called spectral formants, or more simply formants. Formants can generally be identified with the spectral maxima observed in the speech spectrum, as the source spectrum is globally monotonic for voiced speech. However, depending on the source spectrum, formants and resonances may turn out to be shifted. Furthermore, in some cases, a source formant can be present. Formants are also observed in unvoiced speech segments, at least those that correspond to cavities located in front of the constriction, and thus excited by the noise source.

1.1.4.1. Multi-tube model

The vocal tract is an acoustic duct with a complex shape. At a first level of approximation, its acoustic behavior may be understood to be one of an acoustic tube. Hypotheses must be made to calculate the propagation of an acoustic wave through this tube:

– the tube is cylindrical, with a constant area section A;

– the tube walls are rigid (i.e. no vibration terms at the walls);

– the propagation mode is (mono-dimensional) plane waves. This assumption is satisfied if the transverse dimension of the tube is small, compared to the considered wavelengths, which correspond in practice to frequencies below 4,000 Hz for a typical vocal tract (i.e. a length of 17.6 cm and a section of 8 cm2 for the neutral vowel);

– the process is adiabatic (i.e. no loss by thermal conduction);

– the hypothesis of small movements is made (i.e. second-order terms can be neglected).

Let A denote the (constant) section of the tube, x the abscissa along the tube, t the time, p(x, t) the pressure, u(x, t) the speed of the air particles, U(x, t) the volume velocity, ρ the density, L the tube length and C the speed of sound in the air (approximately 340 m/s). The equations governing the propagation of a plane wave in a tube (Webster equations) are:

[1.12]images

This result is obtained by studying an infinitesimal variation of the pressure, the air particle speed and the density, images, in conjunction with two fundamental laws of physics:

1) the conservation of mass entering a slice of the tube comprised between x and x + dx: images. By neglecting the second-order term images, by using the ideal gas law and the fact that the process is adiabatic (p/ρ = C²), this equation can be rewritten images

2) Newton’s second law applied to the air in the slice of tube yields: images

The solutions of these equations are formed by any linear combination of functions f(t) and g(t) of a single variable, twice continuously derivable, written as a forward wave and a backward wave which propagate at the speed of sound:

[1.13]images

and thus the pressure in the tube can be written:

[1.14]images

It is easy to verify that function p satisfies equation [1.12]. Moreover, functions f and g satisfy:

[1.15]images

which, when combined for example with Newton’s second law, yields the following expression for the volume velocity (the tube having a constant section A):

[1.16]images

It must be noted that if the pressure is the sum of a forward function and a backward function, the volume velocity is the difference between these two functions. The expression Zc = ρC/A is the ratio between the pressure and the volume velocity, which is called the characteristic acoustic impedance of the tube. In general, the acoustic impedance is defined in the frequency domain. Here, the term “impedance” is used in the time domain, as the ratio between the forward and backward parts of the pressure and the volume velocity. The following electroacoustical analogies are often used: “acoustic pressure” for “voltage”; “acoustic volume velocity” for “intensity”.

The vocal tract can be considered as the concatenation of cylindrical tubes, each of them having a constant area section A, and all tubes being of the same length. Let Δ denote the length of each tube. The vocal tract is considered as being composed of p sections, numbered from 1 to p, starting from the lips and going towards the glottis. For each section n, the forward and backward waves (respectively from the glottis to the lips and from the lips to the glottis) are denoted fn and bn. These waves are defined at the section input, from n+1 to n (on the left of the section, if the glottis is on the left). Let Rn =ρC/An denote the acoustic impedance of the section, which depends only on its area section.

Each section can then be considered as a quadripole with two inputs fn+1 and bn+1, two outputs fn and bn and a transfer matrix Tn+1:

[1.17]images

For a given section, the transfer matrix can be broken down into two terms. Both the interface with the previous section (1) and the behavior of the waves within the section (2) must be taken into account:

1) At the level of the discontinuity between sections n and n+1, the following relations hold, on the left and on the right, for the pressure and the volume velocity:

[1.18]images

As the pressure and the volume velocity are both continuous at the junction, we have Rn+1 (fn+1 + bn+1) = Rn (fn + bn) and fn+1 − bn+1 = fn − bn, which enables the transfer matrix at the interface to be calculated as:

[1.19]images

After defining acoustic reflection coefficient k, the transfer matrix images at the interface is:

[1.20]images

2) Within the tube of section n+1, the waves are simply submitted to propagation delays, thus:

[1.21]images

The phase delays and advances of the wave all depend on the same quantity Δ/C. The signal can thus be sampled with a sampling frequency Fs = C/(2Δ), whose period corresponds to a wave traveling back and forth in a section. Therefore, in the z-transform domain, equations [1.21] can be considered as a delay (respectively an advance) of Δ/C corresponding to a factor z-1/2 (respectively z1/2).

[1.22]images

from which the transfer matrix images corresponding to the propagation in section n + 1 can be deduced.

In the z-transform domain, the total transfer matrix Tn+1 for section n+1 is the product of images and images :

[1.23]images

The overall volume velocity transfer matrix for the p tubes (from the glottis to the lips) is finally obtained as the product of the matrices for each tube:

[1.24]images

The properties of the volume velocity transfer function for the tube (from the glottis to the lips), defined as Au = (f0 − b0)/(fp − bp), can be derived from this result. For this purpose, the lip termination has to be calculated, i.e. the interface between the last tube and the outside of the mouth. Let (fl, bl) denote the volume velocity waves at the level of the outer interface and (f0, b0) the waves at the inner interface. Outside of the mouth, the backward wave bl is zero. Therefore, b0 and f0 are linearly dependent and a reflection coefficient at the lips can be defined as kl = b0/f0. Then, the transfer function Au can be calculated by inverting T, according to the coefficients of matrix T and the reflection coefficient at the lips kl:

[1.25]images

It can be verified that the determinant of T does not depend on z, since the determinant of each elementary tube matrix does not depend on z either. As the coefficients of the transfer matrix are the product of a polynomial expression in z and a constant multiplied by z-1/2 for each section, the transfer function of the vocal tract is an all-pole function with a zero at z = 0 (which accounts for the propagation delay in the vocal tract).
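The multi-tube model lends itself to simple numerical experiments. In the sketch below, an illustrative area function is converted into junction reflection coefficients (using the convention ki = (Ai+1 − Ai)/(Ai+1 + Ai), which may differ in sign from equation [1.20]), and the coefficients are turned into the denominator polynomial of the equivalent all-pole filter with the standard step-up recursion; the peaks of its frequency response then play the role of formants. The area values, tract length and sampling choices are assumptions of the example.

```python
import numpy as np
from scipy import signal

# Illustrative area function (cm2), from the glottis to the lips, p = 10 sections.
areas = np.array([2.6, 2.0, 1.6, 1.3, 1.0, 1.6, 2.6, 4.0, 6.5, 8.0])

# Reflection coefficients at each junction (one common sign convention).
k = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])

# Step-up recursion: reflection coefficients -> denominator polynomial A(z);
# for a lossless tube this is equivalent to chaining the section matrices.
a = np.array([1.0])
for ki in k:
    a = np.append(a, 0.0) + ki * np.append(0.0, a[::-1])

# Sampling frequency tied to the section length: Fs = C/(2*Delta), Delta = L/p.
C, L = 34000.0, 17.6                       # speed of sound (cm/s), tract length (cm)
Fs = len(areas) * C / (2.0 * L)

# Frequency response of the equivalent all-pole filter 1/A(z); peaks ~ formants.
w, h = signal.freqz([1.0], a, worN=1024, fs=Fs)
mag_db = 20 * np.log10(np.abs(h) + 1e-12)
peaks, _ = signal.find_peaks(mag_db)
print("resonance peaks (Hz):", np.round(w[peaks]))
```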

1.1.4.2. All-pole filter model

During the production of oral vowels, the vocal tract can be viewed as an acoustic tube of complex shape. Its transfer function is composed of poles only, thus behaving as an acoustic filter with resonances only. These resonances correspond to the formants of the spectrum, which, for a sampled signal with limited bandwidth, are of a finite number N. On average, for a uniform tube, there is one formant per kHz; as a consequence, a signal sampled at F = 1/T kHz (i.e. with a bandwidth of F/2 kHz) will contain approximately F/2 formants, and N = F poles will compose the transfer function of the vocal tract from which the signal originates:

[1.26]images

Developing the expression for the conjugate complex poles images yields:

[1.27]images

where Bi denotes the formant’s bandwidth at -6 dB on each side of its maximum and fi its center frequency.

To take into account the coupling with the nasal cavities (for nasal vowels and consonants) or with the cavities at the back of the excitation source (the subglottic cavity during the open-glottis part of the vocalic cycle, or the cavities upstream of the constriction for plosives and fricatives), it is necessary to incorporate into the transfer function a finite number of zeros images (for a band-limited signal).

[1.28]images

Any zero in the transfer function can be approximated by a set of poles, since 1 − az⁻¹ = 1/(1 + az⁻¹ + a²z⁻² + …) when |a| < 1. Therefore, an all-pole model with a sufficiently large number of poles is often preferred in practice to a full pole-zero model.
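In practice, the correspondence between a conjugate pole pair and a formant of center frequency fi and bandwidth Bi is often used in the other direction, to read formants off an estimated all-pole model. The sketch below assumes the usual mapping zi = exp(−πBiT)·exp(j2πfiT) with T = 1/Fs, which is consistent with the description above but should be checked against equation [1.27].

```python
import numpy as np

def poles_to_formants(poles, Fs):
    """Convert complex poles of an all-pole vocal-tract model into formant
    center frequencies and bandwidths, assuming z_i = exp(-pi*B_i*T + j*2*pi*f_i*T)."""
    T = 1.0 / Fs
    formants = []
    for z in poles:
        if np.imag(z) > 0:                          # keep one pole per conjugate pair
            f = np.angle(z) / (2 * np.pi * T)       # center frequency (Hz)
            bw = -np.log(np.abs(z)) / (np.pi * T)   # bandwidth (Hz)
            formants.append((f, bw))
    return sorted(formants)

# Example: a pole pair placed at 500 Hz with an 80 Hz bandwidth is recovered exactly.
Fs = 16000
z = np.exp(-np.pi * 80 / Fs) * np.exp(2j * np.pi * 500 / Fs)
print(poles_to_formants([z, np.conj(z)], Fs))
```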

1.1.5. Lip-radiation

The last term in the linear model corresponds to the conversion of the airflow wave at the lips into a pressure wave radiated at a given distance from the head. At a first level of approximation, the radiation effect can be approximated by a differentiation: at the lips, the radiated pressure is the derivative of the airflow. The pressure recorded by the microphone is analogous to the one radiated at the lips, except for an attenuation factor depending on its distance to the lips. The time-domain differentiation corresponds to a spectral emphasis, i.e. a first-order high-pass filtering. The fact that the production model is linear can be exploited to condense the radiation term at the very level of the source. For this purpose, the derivative of the source is considered rather than the source itself. In the spectral domain, the consequence is to increase the slope of the spectrum by approximately +6 dB/octave, which corresponds to a time-domain differentiation and, in the sampled domain, to the following transfer function:

[1.29] L(z) = 1 − Kd z⁻¹

with Kd≈1.
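In discrete time this radiation model is simply a first-order difference, which is also why LPC front-ends routinely apply a “pre-emphasis” filter of this form; a minimal sketch, with the value of Kd chosen arbitrarily close to 1:

```python
import numpy as np

def lip_radiation(airflow, Kd=0.98):
    """First-order differencing model of lip radiation: p[n] = u[n] - Kd*u[n-1].
    The same filter is commonly used as pre-emphasis before LPC analysis."""
    return np.append(airflow[0], airflow[1:] - Kd * airflow[:-1])

emphasized = lip_radiation(np.random.randn(160))
```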

1.2. Linear prediction

Linear prediction (or LPC for Linear Predictive Coding) is a parametric model of the speech signal [ATA 71, MAR 76]. Based on the source-filter model, an analysis scheme can be defined, relying on a small number of parameters and techniques for estimating these parameters.

1.2.1. Source-filter model and linear prediction

The source-filter model of equation [1.4] can be further simplified by grouping in a single filter the contributions of the glottis, the vocal tract and the lip-radiation term, while keeping a flat-spectrum term for the excitation. For voiced speech, P(z) is a periodic train of pulses and for unvoiced speech, N(z) is a white noise.

[1.30]images

[1.31]images

Considering the lip-radiation spectral model in equation [1.29] and the glottal airflow model in equation [1.9], both terms can be grouped into the flat spectrum source E, with unit gain (the gain factor G is introduced to take into account the amplitude of the signal). Filter H is referred to as the synthesis filter. An additional simplification consists of considering the filter H as an all-pole filter. The acoustic theory indicates that the filter V, associated with the vocal tract, is an all-pole filter only for non-nasal sounds whereas it contains both poles and zeros for nasal sounds. However, it is possible to approximate a pole/zero transfer function with an all-pole filter, by increasing the number of poles, which means that, in practice, an all-pole approximation of the transfer function is acceptable. The inverse filter of the synthesis filter is an all-zero filter, referred to as the analysis filter and denoted A. This filter has a transfer function that is written as an Mth-order polynomial, where M is the number of poles in the transfer function of the synthesis filter H:

[1.32]images

[1.33]images

Linear prediction is based on the correlation between successive samples in the speech signal. The knowledge of p samples up to instant n–1 allows the prediction of the upcoming sample, denoted ŝn, with the help of a prediction filter, the transfer function of which is denoted F(z):

[1.34]images

[1.35]images

[1.36]images

The prediction error εn between the predicted and actual signals is thus written:

[1.37] εn = sn − ŝn = sn − Σk=1…p ak sn-k

[1.38]images

Linear prediction of speech thus closely relates to the linear acoustic production model: the source-filter production model and the linear prediction model can be identified with each other. The residual error εn can then be interpreted as the source of excitation e and the inverse filter A is associated with the prediction filter (by setting M = p).

[1.39]images

The identification of filter A assumes a flat-spectrum residual, which corresponds to a white noise or a single pulse excitation. The modeling of the excitation source in the framework of linear prediction can therefore be achieved by a pulse generator and a white noise generator, controlled by a voiced/unvoiced decision. The estimation of the prediction coefficients is obtained by minimizing the prediction error. Let εn² denote the squared prediction error and E the total squared error over a given time interval, between n0 and n1:

[1.40] E = Σ εn² (sum over n = n0, …, n1)

The expression of the coefficients ak that minimize the prediction error E over a frame is obtained by zeroing the partial derivatives of E with respect to the ak coefficients, i.e., for k = 1, 2, …, p:

[1.41] ∂E/∂ak = 0

Finally, this leads to the following system of equations:

[1.42]images

and, if new coefficients cki are defined, the system becomes:

[1.43]images

Several fast methods for computing the prediction coefficients have been proposed. The two main approaches are the autocorrelation method and the covariance method. They differ in the choice of the interval [n0, n1] over which the total square error E is calculated. In the case of the covariance method, it is assumed that the signal is known only over a given interval of exactly N samples. No hypothesis is made concerning the behavior of the signal outside this interval. On the other hand, the autocorrelation method considers the whole range -∞, +∞ for calculating the total error. The coefficients are thus written:

[1.44]images

[1.45]images

The covariance method is generally employed for the analysis of rather short signals (for instance, one voicing period, or one closed-glottis phase). In the case of the covariance method, matrix [cki] is symmetric. The prediction coefficients are calculated with a fast algorithm [MAR 76], which will not be detailed here.

1.2.2. Autocorrelation method: algorithm

For this method, signal s is considered as stationary. The limits for calculating the total error are -∞, +∞. However, only a finite number of samples are taken into account in practice, by zeroing the signal outside an interval [0, N-1], i.e. by applying a time window to the signal. Total quadratic error E and coefficients cki become:

[1.46]images

Those are the autocorrelation coefficients of the signal, hence the name of the method. The roles of k and i are symmetric and the correlation coefficients only depend on the difference between k and i.

The samples of the signal sn (resp. sn+|k-i|) are non-zero only for n ∈ [0, N–1] (respectively n+|k-i| ∈ [0, N–1]). Therefore, by rearranging the terms in the sum, it can be written for k = 0, …, p:

[1.47]images

The p equation system to be solved is thus (see [1.43]):

[1.48]images

Moreover, one equation follows from the definition of the error E:

[1.49]images

as a consequence of the above set of equations [1.48]. An efficient method to solve this system is the recursive method used in the Levinson algorithm.

Under its matrix form, this system is written:

[1.50]images

The matrix is symmetric and Toeplitz. In order to solve this system, a recursive solution in the prediction order n is sought. At each step n, a set of n+1 prediction coefficients is calculated: images. The process is repeated up to the desired prediction order p, at which stage: images. If we assume that the system has been solved at step n–1, the coefficients and the error at step n of the recursion are obtained as:

[1.51]images

[1.52]images

i.e. images where it can be easily shown from equations [1.50], [1.51] and [1.52] that:

[1.53]images

Overall, the algorithm for calculating the prediction coefficients is as follows (the coefficients ki are called reflection coefficients):

images

These equations are solved recursively, until the solution for order p is reached.
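The recursion maps directly onto code. The following sketch computes the autocorrelation coefficients of a windowed frame and runs the Levinson recursion, returning the prediction coefficients, the reflection coefficients ki and the final prediction error. It follows the sign convention in which the sample is predicted as a weighted sum of the p previous samples; variable names and the test signal are illustrative.

```python
import numpy as np

def lpc_autocorrelation(frame, p):
    """LPC by the autocorrelation method (Levinson recursion).

    Returns (a, k, E): prediction coefficients a[1..p] such that
    s_hat[n] = sum_k a[k] * s[n-k], the reflection coefficients k,
    and the final squared prediction error E.
    """
    frame = frame * np.hamming(len(frame))          # window the analysis frame
    R = np.array([np.dot(frame[:len(frame) - i], frame[i:]) for i in range(p + 1)])

    a = np.zeros(p + 1)          # a[0] is unused (implicit 1 in the error filter)
    k = np.zeros(p)
    E = R[0]
    for n in range(1, p + 1):
        # Reflection coefficient at order n.
        acc = R[n] - np.dot(a[1:n], R[n - 1:0:-1])
        k[n - 1] = acc / E
        # Update the order-n coefficients from the order-(n-1) coefficients.
        a_prev = a.copy()
        a[n] = k[n - 1]
        a[1:n] = a_prev[1:n] - k[n - 1] * a_prev[n - 1:0:-1]
        # Update the prediction error.
        E *= (1.0 - k[n - 1] ** 2)
    return a[1:], k, E

# Example: 30 ms of a synthetic 16 kHz signal analyzed with 16 coefficients.
rng = np.random.default_rng(0)
s = np.cos(2 * np.pi * 500 / 16000 * np.arange(480)) + 0.01 * rng.standard_normal(480)
a, k, E = lpc_autocorrelation(s, 16)
```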

In many applications, one of the goals is to identify the filter associated with the vocal tract, for instance to extract the formants [MCC 74]. Let us consider vowel signals, the spectra of which are shown in Figures 1.4, 1.5 and 1.6 (these spectra were calculated with a short-term Fourier transform (STFT) and are represented on a logarithmic scale). The linear prediction analysis of these vowels yields filters which correspond to the prediction model which could have produced them. Therefore, the magnitude of the transfer function of these filters can be viewed as the spectral envelope of the corresponding vowels.

Linear prediction thus estimates the filter part of the source-filter model. To estimate the source, the speech signal can be filtered by the analysis filter, i.e. the inverse of the synthesis filter. The residual signal thus obtained represents the derivative of the source signal, as the lip-radiation term is included in the filter (according to equation [1.30]). The residual signal must thus be integrated in order to obtain an estimate of the actual source, which is represented in Figure 1.7, both in the frequency and time domains.
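As an illustration of this inverse-filtering procedure, the sketch below estimates the prediction coefficients by solving the normal equations directly (with scipy's Toeplitz solver, as an alternative to the Levinson routine sketched earlier), applies the analysis filter to obtain the residual, and then integrates it with a leaky integrator to approximate the source. The prediction order, the leak factor 0.99 and the synthetic test signal are arbitrary choices for the example.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def source_estimate(s, p=16):
    """Inverse filtering: apply the LPC analysis filter A(z) to the signal to
    obtain the residual (derivative of the source), then integrate (leaky)
    to undo the lip-radiation differentiation."""
    w = s * np.hamming(len(s))
    R = np.array([np.dot(w[:len(w) - i], w[i:]) for i in range(p + 1)])
    a = solve_toeplitz(R[:-1], R[1:])                            # normal equations (cf. [1.48])
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], s)    # residual = A(z) applied to s
    return lfilter([1.0], [1.0, -0.99], residual)                # leaky integration

# Example on a synthetic voiced-like signal.
Fs = 16000
t = np.arange(0, 0.03, 1 / Fs)
s = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
src = source_estimate(s)
```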

Figure 1.4. Vowel /a/. Hamming windowed signal (Fe = 16 kHz). Magnitude spectrum on a logarithmic scale and gain of the LPC model transfer function (autocorrelation method). Complex poles of the LPC model (16 coefficients)


Figure 1.5. Vowel /u/. Hamming windowed signal (Fe
