## Chapter 1 Basic Acoustics and Acoustic Filters

2
7 The pass bands and reject bands of this double pass filter are shown in figure A1.

## Chapter 2 The Acoustic Theory of Speech Production

1 Amplitude at
The spectrum seems to be tilted so that the amplitude goes down by 3 dB over this frequency interval.
3 Using formula:

## Chapter 3 Digital Signal Processing

2 Nyquist frequencies are:
8,000 Hz, 5501.25 Hz, 10 Hz
5 Because the ideal lag duration in autocorrelation pitch tracking is the actual pitch period of the signal (which is what we are trying to find in pitch tracking), the best lag durations are 1/F0:
1/100 = 0.01 sec, 1/200 = 0.005 sec, 1/204 = 0.0049 sec
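The arithmetic in this answer can be checked with a short sketch. The function name `best_lag_seconds` is only an illustrative choice, not something defined in the text; the sketch assumes nothing more than the relation lag = 1/F0 stated above.

```python
def best_lag_seconds(f0_hz):
    """Return the pitch period in seconds, i.e. the ideal autocorrelation lag."""
    if f0_hz <= 0:
        raise ValueError("F0 must be a positive frequency in Hz")
    return 1.0 / f0_hz


# The three fundamental frequencies used in the answer above.
for f0 in (100, 200, 204):
    print(f"F0 = {f0} Hz -> best lag = {best_lag_seconds(f0):.4f} s")
```

Rounded to four decimal places, the three lags agree with the values listed in the answer.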
8 The window is: 512/22,000 (0.02327) seconds. Actually, the terms “22 k” or “22 kHz” sampling almost always refer to a 22,050 Hz sampling rate, which would alter these answers a bit.
9 22,000 samples per second multiplied by 0.02 seconds equals 440 samples.
10 The interval between points in the FFT spectrum is 21.53 Hz. This is derived by dividing the Nyquist frequency, i.e. one-half of the sampling rate (we’ll assume a sampling rate of 11,025 Hz, so a Nyquist frequency of 5,512.5 Hz), by 256 (½ of the FFT window; the other half codes the “imaginary” part of the spectrum).
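The window and FFT answers for this chapter can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the last call assumes that the 11,025 Hz figure is the sampling rate and that the FFT window is 512 points long, which is the reading under which the 21.53 Hz result comes out.

```python
def window_duration_s(n_samples, sample_rate_hz):
    """Duration in seconds of an analysis window of n_samples samples."""
    return n_samples / sample_rate_hz


def samples_in_window(duration_s, sample_rate_hz):
    """Number of samples (rounded) that fit in a window of duration_s seconds."""
    return round(duration_s * sample_rate_hz)


def fft_bin_spacing_hz(sample_rate_hz, n_fft):
    """Spacing between FFT spectrum points.

    Dividing the Nyquist frequency (sample_rate / 2) by half the FFT window
    (n_fft / 2) is the same as dividing the sampling rate by n_fft.
    """
    nyquist_hz = sample_rate_hz / 2
    return nyquist_hz / (n_fft / 2)


print(f"{window_duration_s(512, 22_000):.5f} s")   # answer 8
print(samples_in_window(0.02, 22_000), "samples")  # answer 9
print(f"{fft_bin_spacing_hz(11_025, 512):.2f} Hz")  # answer 10, assumed inputs
```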

## Chapter 4 Basic Audition

1 Six sones = 50,000 µPa. Subjectively double is 12 sones, which equals 137,000 µPa. This sound, which is subjectively twice as loud, has a sound pressure level that is almost three times greater.
3

## Chapter 5 Speech Perception

4 With the submatrix for [d] and [ð]

|   | d | ð |
|---|---|---|
| d | 0.727 | 0 |
| ð | 0.015 | 0.515 |

we can calculate similarity (S),
S = (0.0 + 0.015)/(0.727 + 0.515) = 0.0124
and then distance (d),
which is greater than the distance between [z] and [ð].
5 The data on how auditory and visual perceptual spaces combine to let us predict the audio/visual McGurk effect are clearly compatible with the view that listeners perceive speech gestures – conveyed by both audition and vision. Of course, “compatibility” doesn’t equal proof, but scientists do tend to accept the theory that is compatible with the widest range of available data.

## Chapter 6 Vowels

2 Using equation (6.2) with $A_b = 3$, $l_b = 4$, and $l_c = 2$ we have $A_b l_b l_c = 24$ and $c/2\pi = 5{,}570.4$. Thus, for values of $A_c$ between 0.05 and 2 we have:
When $A_c$ equals 0, F1 also equals 0.
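The calculation in this answer can be sketched in code. The sketch rests on several assumptions: that equation (6.2) is the Helmholtz resonator formula F1 = (c/2π)·√(Ac/(Ab·lb·lc)), that the speed of sound is 35,000 cm/s (which gives the quoted c/2π of 5,570.4), and that the dimensions are in cm and cm². The intermediate Ac values in the loop are illustrative choices, not values taken from the text.

```python
import math

# Assumed speed of sound in cm/s; 35,000 / (2 * pi) is about 5,570.4.
SPEED_OF_SOUND_CM_S = 35_000.0


def helmholtz_f1(a_c, a_b=3.0, l_b=4.0, l_c=2.0, c=SPEED_OF_SOUND_CM_S):
    """Resonance of a Helmholtz resonator (assumed form of equation 6.2).

    a_c: cross-sectional area of the constriction
    a_b, l_b: area and length of the back cavity
    l_c: length of the constriction
    """
    return (c / (2 * math.pi)) * math.sqrt(a_c / (a_b * l_b * l_c))


# Illustrative constriction areas spanning the 0.05 to 2 range in the answer.
for a_c in (0.0, 0.05, 0.5, 1.0, 2.0):
    print(f"Ac = {a_c:4.2f} -> F1 = {helmholtz_f1(a_c):7.1f} Hz")
```

Under these assumptions F1 rises with the square root of the constriction area and falls to zero when the constriction is closed, which matches the last line of the answer.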
6 For the mid and low vowels in Mazatec we can see a peak for the second harmonic between the F0 (first harmonic) and F1 peaks. Unlike the LPC spectra in figure 6.10a, in which the lowest peak is always the F1, in the auditory spectra of figure 6.10a the lowest vocal tract resonance (F1) may be the second or third peak in the spectrum. With a high-pitched voice we might even find that F0 and F1 form a single peak in the auditory spectrum. So it would seem that if F1 can be the first, second, or third peak it would be difficult to devise an automatic formant tracker for F1 in auditory spectra.

## Chapter 7 Fricatives

2 As the articulators come together at the beginning of a fricative, and as the constriction is released at the end of a fricative, the assumption that the front and back cavities are uncoupled is likely to be wrong. This is because the constriction is less narrow at these times. Thus, at fricative onset and offset we might expect to see spectral peaks at the resonant frequencies of the back cavity.
4 This question is mysterious to some students. Acoustic coupling is the key. When the front and back cavities are strongly coupled (as in vowels) their resonances merge to form stable regions. But when the front and back cavities are not strongly coupled (as in fricatives) the front cavity resonances “skip” the back cavity resonances (figure 7.6). So, if quantal regions define distinctive features, and if, as the figures imply, quantal regions for fricatives and vowel place of articulation differ, then vowel and fricative place features must be different. This falls under a general observation that acoustic considerations tend to lead us to conclude that vowels and consonants have different features, while articulatory considerations lead us to conclude that vowels and consonants can be described with the same features.

## Chapter 8 Stops and Affricates

2 Figure A2 is a picture of the formant transitions measured from figure 8.7. The vertical line-up point is the onset of the vowel steady-state for each of the dV syllables in that figure. The heavy lines show the formant transitions and the light lines show these extended back so that they intersect with each other. I chose the locus frequency giving greater weight to the longer, easier-to-measure transitions of [u] and [i].
4 Yes, affricates do have stop release bursts. Because they have a stop component, it stands to reason that the same pressure buildup and release that we see in nonaffricated stops would also be present in affricates. However, affricate release bursts might be auditorily masked by the sudden onset of frication that occurs soon after stop release.

## Chapter 9 Nasals and Laterals

1 The light damping peak in figure 9.2 has a steeper “skirt.” Energy falls off more gradually from the heavy damping peak than from the light damping peak.
3 Assume that the mouth cavity is 4 cm long in a palatal nasal. Then the lowest anti-formant frequency is:
5 What breathy vowels and nasalized vowels have in common is that they both have increased energy at low frequencies, as compared with the spectra of non-nasalized modal vowels. The voice spectrum in breathy vowels has a steeper spectral slope, and thus relatively more low-frequency energy. In nasalized vowels there are two low-frequency resonances (the “nasal” and “oral” first formants), and the anti-formants in nasalized vowels attenuate higher-frequency energy in the spectrum.
So “spontaneous nasalization” may result from a perceptual confusion in which listeners hearing breathy vowels may falsely think that the speaker is producing a nasalized vowel. If nasalized vowels occur only in the context of a nasal consonant, the listener may then parse the mistakenly identified nasalized vowel as a vowel–nasal sequence comparable to other nasalized vowels in the language.

## References

Asher, R. E. and Kumari, T. C. (1997) Malayalam, London: Routledge.

Best, C. T. (1995) A direct realist perspective on cross-language speech perception. In W. Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167–200.

Bladon, A. and Lindblom, B. (1981) Modeling the judgment of vowel quality differences. Journal of the Acoustical Society of America, 69, 1414–22.

Bless, D. M. and Abbs, J. H. (1983) Vocal Fold Physiology: Contemporary research and clinical issues, San Diego: College Hill Press.

Bond, Z. S. (1999) Slips of the Ear: Errors in the perception of casual conversation, San Diego: Academic Press.

Bregman, A. S. (1990) Auditory Scene Analysis: The Perceptual Organization of Sound, Cambridge, MA: MIT Press.

Brödel, M. (1946) Three Unpublished Drawings of the Anatomy of the Human Ear, Philadelphia: Saunders.

Campbell, R. (1994) Audiovisual speech: Where, what, when, how? Current Psychology of Cognition, 13, 76–80.

Catford, J. C. (1977) Fundamental Problems in Phonetics, Bloomington: Indiana University Press.

Chiba, T. and Kajiyama, M. (1941) The Vowel: Its nature and structure, Tokyo: Kaiseikan.

Cole, R. A. (1973) Listening for mispronunciations: A measure of what we hear during speech. Perception & Psychophysics, 13, 153–6.

Cooley, J. W., Lewis, P. A. W., and Welch, P. D. (1969) The fast Fourier transform and its applications. IEEE Transactions on Education, 12, 27–34.

Cooper, F. S., Liberman, A. M., and Borst, J. M. (1951) The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Science, 37, 318–25.

Davis, S. and Mermelstein, P. (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP 28, 357–66.

Delattre, P. C., Liberman, A. M., and Cooper, F. S. (1955) Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 27, 769–73.

Egan, J. P. and Hake, H. W. (1950) On the masking pattern of a simple auditory stimulus. Journal of the Acoustical Society of America, 22, 622–30.

Elman, J. L. and McClelland, J. L. (1988) Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143–65.

Fant, G. (1960) Acoustic Theory of Speech Production, The Hague: Mouton.

Flanagan, J. L. (1965) Speech Analysis Synthesis and Perception, Berlin: Springer-Verlag.

Flege, J. E. (1995) Second language speech learning: Theory, findings, and problems. In W. Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167–200.

Forrest, K., Weismer, G., Milenkovic, P., and Dougall, R. N. (1988) Statistical analysis of word-initial voiceless obstruents: Preliminary data. Journal of the Acoustical Society of America, 84, 115–23.

Fry, D. B. (1979) The Physics of Speech, Cambridge: Cambridge University Press.

Fujimura, O. (1962) Analysis of nasal consonants. Journal of the Acoustical Society of America, 32, 1865–75.

Ganong, W. F. (1980) Phonetic categorization in auditory word recognition. Journal of Experimental Psychology: Human Perception and Performance, 6, 110–25.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., and Stevens, E. B. (1991) Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524–36.

Guion, S. G. (1998) The role of perception in the sound change of velar palatalization. Phonetica, 55, 18–52.

Hagiwara, R. (1995) Acoustic realizations of American /r/ as produced by women and men. UCLA Working Papers in Phonetics, 90, 1–187.

Halle, M. and Stevens, K. N. (1969) On the feature “advanced tongue root.” Quarterly Progress Report, 94, 209–15. Research Laboratory of Electronics, MIT.

Harnsberger, J. D. (2001) The perception of Malayalam nasal consonants by Marathi, Punjabi, Tamil, Oriya, Bengali, and American English listeners: A multidimensional scaling analysis. Journal of Phonetics, 29, 303–27.

Heinz, J. M. and Stevens, K. N. (1961) On the properties of voiceless fricative consonants. Journal of the Acoustical Society of America, 33, 589–96.

Jakobson, R., Fant, G., and Halle, M. (1952) Preliminaries to Speech Analysis, Cambridge, MA: MIT Press.

Jassem, W. (1979) Classification of fricative spectra using statistical discriminant functions. In B. Lindblom and S. Öhman (eds.), Frontiers of Speech Communication Research, New York: Academic Press, 77–91.

Johnson, K. (1989) Contrast and normalization in vowel perception. Journal of Phonetics, 18, 229–54.

Johnson, K. (1992) Acoustic and auditory analysis of Xhosa clicks and pulmonics. UCLA Working Papers in Phonetics, 83, 33–47.

Johnson, K. (2008) Quantitative Methods in Linguistics, Oxford: Wiley-Blackwell.

Johnson, K. and Ralston, J. V. (1994) Automaticity in speech perception: Some speech/nonspeech comparisons. Phonetica, 51(4), 195–209.

Johnson, K., Ladefoged, P., and Lindau, M. (1993) Individual differences in vowel production. Journal of the Acoustical Society of America, 94, 701–14.

Joos, M. (1948) Acoustic phonetics. Language, 23, suppl. 1.

Klatt, D. H. and Klatt, L. (1990) Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87, 820–57.

Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992) Linguistic experiences alter phonetic perception in infants by 6 months of age. Science, 255, 606–8.

Ladefoged, P. (1996) Elements of Acoustic Phonetics, 2nd edn., Chicago: University of Chicago Press.

Ladefoged, P. and Maddieson, I. (1996) The Sounds of the World’s Languages, Oxford: Blackwell.

Ladefoged, P., DeClerk, J., Lindau, M., and Papcun, G. (1972) An auditory-motor theory of speech production. UCLA Working Papers in Phonetics, 22, 48–75.

Lambacher, S., Martens, W., Nelson, B., and Berman, J. (2001) Identification of English voiceless fricatives by Japanese listeners: The influence of vowel context on sensitivity and response bias. Acoustic Science & Technology, 22, 334–43.

Laver, J. (1980) The Phonetic Description of Voice Quality, Cambridge: Cambridge University Press.

Liberman, A. M., Harris, K. S., Hoffman, H. S., and Griffith, B. C. (1957) The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–68.

Liljencrants, J. and Lindblom, B. (1972) Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48, 839–62.

Lindau, M. (1978) Vowel features. Language, 54, 541–63.

Lindau, M. (1979) The feature “expanded.” Journal of Phonetics, 7, 163–76.

Lindau, M. (1984) Phonetic differences in glottalic consonants. Journal of Phonetics, 12, 147–55.

Lindau, M. (1985) The story of /r/. In V. Fromkin (ed.), Phonetic Linguistics: Essays in honor of Peter Ladefoged, Orlando, FL: Academic Press.

Lindblom, B. (1990) Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle and A. Marchal (eds.), Speech Production and Speech Modeling, Dordrecht: Kluwer, 403–39.

Lindqvist-Gauffin, J. and Sundberg, J. (1976) Acoustic properties of the nasal tract. Phonetica, 33, 161–8.

Lotto, A. J. and Kluender, K. R. (1998) General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60, 602–19.

Lubker, J. (1968) An EMG-cinefluorographic investigation of velar function during normal speech production. Cleft Palate Journal, 5, 1–18.

Lyons, R. F. (1982) A computational model of filtering, detection and compression in the cochlea. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1282–5.

Maddieson, I. (1984) Patterns of Sounds, Cambridge: Cambridge University Press.

Maeda, S. (1993) Acoustics of vowel nasalization and articulatory shifts in French nasal vowels. In M. K. Huffman and R. A. Krakow (eds.), Phonetics and Phonology, vol. 5: Nasals, nasalization, and the velum, New York: Academic Press, 147–67.

Mann, V. A. (1980) Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28, 407–12.

Marple, L. (1987) Digital Spectral Analysis with Applications, Englewood Cliffs, NJ: Prentice Hall.

McGurk, H. and MacDonald, J. (1976) Hearing lips and seeing voices. Nature, 264, 746–8.

McDonough, J. (1993) The phonological representation of laterals. UCLA Working Papers in Phonetics, 83, 19–32.

McDonough, J. and Ladefoged, P. (1993) Navajo stops. UCLA Working Papers in Phonetics, 84, 151–64.

Miller, G. A. and Nicely, P. E. (1955) An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338–52.

Miller, J. D. (1989) Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–34.

Moll, K. L. (1962) Velopharyngeal closure in vowels. Journal of Speech and Hearing Research, 5, 30–7.

Moore, B. C. J. (1982) An Introduction to the Psychology of Hearing, 2nd edn., New York: Academic Press.

Moore, B. C. J. and Glasberg, B. R. (1983) Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74, 750–3.

Mrayati, M., Carré, R., and Guérin, B. (1988) Distinctive regions and modes: A new theory of speech production. Speech Communication, 7, 257–86.

Parzen, E. (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–76.

Pastore, R. E. and Farrington, S. M. (1996) Measuring the difference limen for identification of order of onset for complex auditory stimuli. Perception & Psychophysics, 58(4), 510–26.

Patterson, R. D. (1976) Auditory filter shapes derived from noise stimuli. Journal of the Acoustical Society of America, 59, 640–54.

Perkell, J. (1971) Physiology of speech production: A preliminary study of two suggested revisions of the features specifying vowels. Quarterly Progress Report, 102, 123–39. Research Institute of Electronics, MIT.

Petursson, M. (1973) Quelques remarques sur l’aspect articulatoire et acoustique des constrictives intrabuccales Islandaises. Travaux de l’Institut de Phonétique de Strasbourg, 5, 79–99.

Pickles, J. O. (1988) An Introduction to the Physiology of Hearing, 2nd edn., New York: Academic Press.

Pisoni, D. B. (1977) Identification and discrimination of the relative onset time of two-component tones: Implications for voicing perception in stops. Journal of the Acoustical Society of America, 61, 1352–61.

Potter, R. K., Kopp, G. A., and Green, H. (1947) Visible Speech, Dordrecht: Van Nostrand.

Qi, Y. (1989) Acoustic features of nasal consonants. Unpublished Ph.D. diss., Ohio State University.

Rand, T. C. (1974) Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55(3), 678–80.

Raphael, L. J. and Bell-Berti, F. (1975) Tongue musculature and the feature of tension in English vowels. Phonetica, 32, 61–73.

Rayleigh, J. W. S. (1896) The Theory of Sound, London: Macmillan; repr. 1945, New York: Dover.

Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981) Speech perception without traditional speech cues. Science, 212, 947–50.

Repp, B. (1986) Perception of the [m]–[n] distinction in CV syllables. Journal of the Acoustical Society of America, 79, 1987–99.

Rosenblum, L. D., Schmuckler, M. A., and Johnson, J. A. (1997) The McGurk effect in infants. Perception & Psychophysics, 59, 347–57.

Samuel, A. G. (1991) A further examination of the role of attention in the phonemic restoration illusion. Quarterly Journal of Experimental Psychology, 43A, 679–99.

Schroeder, M. R., Atal, B. S., and Hall, J. L. (1979) Objective measure of certain speech signal degradations based on masking properties of human auditory perception. In B. Lindblom and S. Öhman (eds.), Frontiers of Speech Communication Research, London: Academic Press, 217–29.

Sekiyama, K. and Tohkura, Y. (1993) Inter-language differences in the influence of visual cues in speech perception. Journal of Phonetics, 21, 427–44.

Seneff, S. (1988) A joint synchrony/mean-rate model of auditory speech processing. Journal of Phonetics, 16, 55–76.

Shadle, C. (1985) The acoustics of fricative consonants. RLE Technical Report, 506, MIT.

Shadle, C. (1991) The effect of geometry on source mechanisms of fricative consonants. Journal of Phonetics, 19, 409–24.

Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication, Urbana: University of Illinois.

Shepard, R. N. (1972) Psychological representation of speech sounds. In E. E. David and P. B. Denes (eds.), Human Communication: A unified view. New York: McGraw-Hill, 67–113.

Slaney, M. (1988) Lyons’ cochlear model. Apple Technical Report, 13. Apple Corporate Library, Cupertino, CA.

Stevens, K. N. (1972) The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David, Jr. and P. B. Denes (eds.), Human Communication: A unified view, New York: McGraw-Hill, 51–66.

Stevens, K. N. (1987) Interaction between acoustic sources and vocal-tract configurations for consonants. Proceedings of the Eleventh International Conference on Phonetic Sciences, 3, 385–9.

Stevens, K. N. (1989) On the quantal nature of speech. Journal of Phonetics, 17, 3–45.

Stevens, K. N. (1999) Acoustic Phonetics, Cambridge, MA: MIT Press.

Stevens, S. S. (1957) Concerning the form of the loudness function. Journal of the Acoustical Society of America, 29, 603–6.

Stockwell, R. P. (1973) Problems in the interpretation of the Great English Vowel Shift. In M. E. Smith (ed.), Studies in Linguistics in Honor of George L. Trager, The Hague: Mouton, 344–62.

Stone, M. (1991) Toward a model of three-dimensional tongue movement. Journal of Phonetics, 19, 309–20.

Straka, G. (1965) Album phonétique, Laval: Les Presses de l’Université Laval.

Syrdal, A. K. and Gopal, H. S. (1986) A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086–1100.

Terbeek, D. (1977) A cross-language multidimensional scaling study of vowel perception. UCLA Working Papers in Phonetics, 37, 1–271.

Traunmüller, H. (1981) Perceptual dimension of openness in vowels. Journal of the Acoustical Society of America, 69, 1465–75.

Walker, S., Bruce, V., and O’Malley, C. (1995) Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124–33.

Warren, R. M. (1970) Perceptual restoration of missing speech sounds. Science, 167, 392–3.

Wright, J. T. (1986) The behavior of nasalized vowels in the perceptual vowel space. In J. J. Ohala and J. J. Jaeger (eds.), Experimental Phonology, New York: Academic Press, 45–67.

Zwicker, E. (1961) Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Journal of the Acoustical Society of America, 33, 248.

Zwicker, E. (1975) Scaling. In W. D. Keidel and W. D. Neff (eds.), Auditory System: Physiology (CNS), behavioral studies, psychoacoustics, Berlin: Springer-Verlag.

# 1.1 The Sensation of Sound

Several types of events in the world produce the sensation of sound. Examples include doors slamming, violin strings being plucked, wind whistling around a corner, and human speech. All these examples, and any others we could think of, involve movement of some sort. These movements cause pressure fluctuations in the surrounding air (or some other acoustic medium). When pressure fluctuations reach the eardrum, they cause it to move, and the auditory system translates these movements into neural impulses which we experience as sound. Thus, sound is produced when pressure fluctuations impinge upon the eardrum. An acoustic waveform is a record of sound-producing pressure fluctuations over time. (Ladefoged, 1996; Fry, 1979; and Stevens, 1999, provide more detailed discussions of the topics covered in this chapter.)

Acoustic medium
Normally the pressure fluctuations that are heard as sound are produced in air, but it is also possible for sound to travel through other acoustic media. So, for instance, when you are swimming under water, it is possible to hear muffled shouts of the people above the water, and to hear noise as you blow bubbles in the water. Similarly, gases other than air can transmit pressure fluctuations that cause sound. For example, when you speak after inhaling helium from a balloon, the sound of your voice travels through the helium, making it sound different from normal. These examples illustrate that sound properties depend to a certain extent on the acoustic medium, on how quickly pressure fluctuations travel through the medium, and how resistant the medium is to such fluctuations.

# 1.2 The Propagation of Sound

Pressure fluctuations impinging on the eardrum produce the sensation of sound, but sound can travel across relatively long distances. This is because a sound produced at one place sets up a sound wave that travels through the acoustic medium. A sound wave is a traveling pressure fluctuation that propagates through any medium that is elastic enough to allow molecules to crowd together and move apart. The wave in a lake after you throw in a stone is an example. The impact of the stone is transmitted over a relatively large distance. The water particles don’t travel; the pressure fluctuation does.

A line of people waiting to get into a movie is a useful analogy for a sound wave. When the person at the front of the line moves, a “vacuum” is created between the first person and the next person in the line (the gap between them is increased), so the second person steps forward. Now there is a vacuum between person two and person three, so person three steps forward. Eventually, the last person in the line gets to move; the last person is affected by a movement that occurred at the front of the line, because the pressure fluctuation (the gap in the line) traveled, even though each person in the line moved very little. The analogy is flawed, because in most lines you get to move to the front eventually. For this to be a proper analogy for sound propagation, we would have to imagine that the first person is shoved back into the second person and that this crowding or increase of pressure (like the vacuum) is transmitted down the line.

Figure 1.2 shows a pressure waveform at the location indicated by the asterisk in figure 1.1. The horizontal axis shows the passage of time, the vertical axis the degree of crowdedness (which in a sound wave corresponds to air pressure). At time 3 there is a sudden drop in crowdedness because person two stepped up and left a gap in the line. At time 4 normal crowdedness is restored when person 3 steps up to fill the gap left by person 2. At time 10 there is a sudden increase in crowdedness as person 2 steps back and bumps into person 3. The graph in figure 1.2 is a way of representing the traveling rarefaction and compression waves shown in figure 1.1. Given a uniform acoustic medium, we could reconstruct figure 1.1 from figure 1.2 (though note the discussion in the next paragraph on sound energy dissipation). Graphs like the one shown in figure 1.2 are more typical in acoustic phonetics, because this is the type of view of a sound wave that is produced by a microphone – it shows amplitude fluctuations as they travel past a particular point in space.

An analogy for sound propagation
Figure 1.1 shows seven people (represented by numbers) standing in line to see a show. At time 2 the first person steps forward and leaves a gap in the line. So person two steps forward at time 3, leaving a gap between the second and third persons in the line. The gap travels back through the line until time 8, when everyone in the line has moved forward one step. At time 9 the first person in the line is shoved back into place in the line, bumping into person two (this is symbolized by an X). Naturally enough, person two moves out of person one’s way at time 10, and bumps into person three. Just as the gap traveled back through the line, now the collision travels back through the line, until at time 15 everyone is back at their starting points.
We can translate the terms of the analogy to sound propagation. The people standing in line correspond to air molecules, the group of them corresponding to an acoustic medium. The excess gap between successive people is negative air pressure, or rarefaction, and collisions correspond to positive air pressure, or compression. Zero air pressure (which in sound propagation is the atmospheric pressure) is the normal, or preferred, distance between the people standing in line. The initial movement of person one corresponds to the movement of air particles adjacent to one of the tines of a tuning fork (for example) as the tine moves away from the particle. The movement of the first person at time 9 corresponds to the opposite movement of the tuning fork’s tine.

Sound waves lose energy as they travel through air (or any other acoustic medium), because it takes energy to move the molecules. Perhaps you have noticed a similar phenomenon when you stand in a long line. If the first person steps forward, and then back, only a few people at the front of the line may be affected, because people further down the line have inertia; they will tolerate some change in pressure (distance between people) before they actually move in response to the change. Thus the disturbance at the front of the line may not have any effect on the people at the end of a long line. Also, people tend to fidget, so the difference between movement propagated down the line and inherent fidgeting (the signal-to-noise ratio) may be difficult to detect if the movement is small. The rate of sound dissipation in air is different from the dissipation of a movement in a line, because sound radiates in three dimensions from the sound source (in a sphere). This means that the number of air molecules being moved by the sound wave greatly increases as the wave radiates from the sound source. Thus the amount of energy available to move each molecule on the surface of the sphere decreases as the wave expands out from the sound source. Because the surface area of the sphere grows with the square of its radius, the energy available per unit area falls off with the square of the distance from the sound source (the inverse-square law), and the amount of particle movement decreases accordingly. That is why singers in heavy metal bands put the microphone right up to their lips. They would be drowned out by the general din otherwise. It is also why you should position the microphone close to the speaker’s mouth when you record a sample of speech (although it is important to keep the microphone to the side of the speaker’s lips, to avoid the blowing noises in [p]’s, etc.).

# 1.3 Types of Sounds

There are two types of sounds: periodic and aperiodic. Periodic sounds have a pattern that repeats at regular intervals. They come in two types: simple and complex.

## 1.3.1 Simple periodic waves

Simple periodic waves are also called sine waves: they result from simple harmonic motion, such as the swing of a pendulum. The only time we humans get close to producing simple periodic waves in speech is when we’re very young. Children’s vocal cord vibration comes close to being sinusoidal, and usually women’s vocal cord vibration is more sinusoidal than men’s. Despite the fact that simple periodic waves rarely occur in speech, they are important, because more complex sounds can be described as combinations of sine waves. In order to define a sine wave, one needs to know just three properties. These are illustrated in figures 1.3–1.4.

The first is frequency: the number of times the sinusoidal pattern repeats per unit time (on the horizontal axis). Each repetition of the pattern is called a cycle, and the duration of a cycle is its period. Frequency can be expressed as cycles per second, which, by convention, is called hertz (and abbreviated Hz). So to get the frequency of a sine wave in Hz (cycles per second), you divide one second by the period (the duration of one cycle). That is, frequency in Hz equals 1/T, where T is the period in seconds. For example, the sine wave in figure 1.3 completes one cycle in 0.01 seconds. The number of cycles this wave could complete in one second is 100 (that is, one second divided by the amount of time each cycle takes in seconds, or 1/0.01 = 100). So, this waveform has a frequency of 100 cycles per second (100 Hz).
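The relation between period and frequency described above can be written as a pair of small functions. The names `frequency_hz` and `period_s` are illustrative assumptions; the only thing taken from the text is f = 1/T.

```python
def frequency_hz(period_seconds):
    """Frequency in Hz (cycles per second) of a wave with the given period."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    return 1.0 / period_seconds


def period_s(freq_hz):
    """Period in seconds of a wave with the given frequency in Hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / freq_hz


# The sine wave in figure 1.3 completes one cycle in 0.01 seconds.
print(f"{frequency_hz(0.01):.1f} Hz")
print(f"{period_s(100) * 1000:.1f} ms")
```

The two functions are inverses of each other, so a 0.01 second period corresponds to 100 Hz and a 100 Hz wave has a 10 ms period.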

The second property of a simple periodic wave is its amplitude: the peak deviation of a pressure fluctuation from normal, atmospheric pressure. In a sound pressure waveform the amplitude of the wave is represented on the vertical axis.

The third property of sine waves is their phase: the timing of the waveform relative to some reference point. You can draw a sine wave by taking amplitude values from a set of right triangles that fit inside a circle (see exercise 4 at the end of this chapter). One time around the circle equals one sine wave on the paper. Thus we can identify locations in a sine wave by degrees of rotation around a circle. This is illustrated in figure 1.4. Both sine waves shown in this figure start at 0° in the sinusoidal cycle. In both, the peak amplitude occurs at 90°, the downward-going (negative-going) zero-crossing at 180°, the negative peak at 270°, and the cycle ends at 360°. But these two sine waves with exactly the same amplitude and frequency may still differ in terms of their relative timing, or phase. In this case they are 90° out of phase.
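The three properties (frequency, amplitude, and phase) are enough to compute any sample of a sine wave. The following is a minimal sketch; the function name `sine_sample` and its parameter names are hypothetical, and amplitude is given in arbitrary units rather than pressure.

```python
import math


def sine_sample(t, freq_hz, amplitude=1.0, phase_deg=0.0):
    """Value at time t (seconds) of a sine wave with the given properties."""
    angle = 2 * math.pi * freq_hz * t + math.radians(phase_deg)
    return amplitude * math.sin(angle)


# Two 100 Hz waves with the same amplitude, 90 degrees out of phase.
for t_ms in (0.0, 2.5, 5.0, 7.5):
    t = t_ms / 1000
    a = sine_sample(t, 100)
    b = sine_sample(t, 100, phase_deg=90)
    print(f"t = {t_ms:3.1f} ms  wave A = {a:+.2f}  wave B = {b:+.2f}")
```

For a 100 Hz wave, a 90° phase difference is the same as a time shift of one quarter of the 10 ms period, so wave B is at its positive peak when wave A is just crossing zero.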

## 1.3.2 Complex periodic waves

Complex periodic waves are like simple periodic waves in that they involve a repeating waveform pattern and thus have cycles. However, complex periodic waves are composed of at least two sine waves. Consider the wave shown in figure 1.5, for example. Like the simple sine waves shown in figures 1.3 and 1.4, this waveform completes one cycle in 0.01 seconds (i.e. 10 milliseconds). However, it has an additional component that completes ten cycles in this same amount of time. Notice the “ripples” in the waveform. You can count ten small positive peaks in one cycle of the waveform, one for each cycle of the additional frequency component in the complex wave. I produced this example by adding a 100 Hz sine wave and a (lower-amplitude) 1,000 Hz sine wave. So the 1,000 Hz wave combined with the 100 Hz wave produces a complex periodic wave. The rate at which the complex pattern repeats is called the fundamental frequency (abbreviated F0).
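A wave of this kind can be sketched by adding the two sine waves sample by sample. The text says only that the 1,000 Hz component has a lower amplitude, so the value 0.2 used below is an assumption, as is the sampling rate of 100,000 samples per second.

```python
import math


def complex_wave(t, ripple_amplitude=0.2):
    """Sum of a 100 Hz sine wave and a lower-amplitude 1,000 Hz sine wave.

    ripple_amplitude = 0.2 is an assumed value, not taken from figure 1.5.
    """
    return math.sin(2 * math.pi * 100 * t) + ripple_amplitude * math.sin(2 * math.pi * 1000 * t)


sample_rate = 100_000                    # samples per second (assumed)
samples_per_cycle = sample_rate // 100   # one 10 ms cycle of the 100 Hz pattern
cycle = [complex_wave(i / sample_rate) for i in range(samples_per_cycle + 1)]

# Count the positive peaks (local maxima) within one cycle of the complex wave.
peak_count = sum(
    1 for i in range(1, len(cycle) - 1)
    if cycle[i - 1] < cycle[i] >= cycle[i + 1]
)
print("peaks in one cycle:", peak_count)
```

With a much smaller ripple amplitude the 1,000 Hz component would still be present, but it would no longer produce a separate peak for each of its cycles.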

Fundamental frequency and the GCD
The wave shown in figure 1.5 has a fundamental frequency of 100 Hz and also a 100 Hz component sine wave. It turns out that the fundamental frequency of a complex wave is the greatest common divisor (GCD) of the frequencies of the component sine waves, that is, the highest frequency that divides evenly into all of them. For example, the fundamental frequency (F0) of a complex wave with 400 Hz and 500 Hz components is 100 Hz. You can see this for yourself if you draw the complex periodic wave that results from adding a 400 Hz sine wave and a 500 Hz sine wave. We will use the sine wave in figure 1.3 as the starting point for this graph. The procedure is as follows:
1 Take some graph paper.
2 Calculate the period of a 400 Hz sine wave. Because frequency is equal to one divided by the period (in math that’s f = 1/T), we know that the period is equal to one divided by the frequency (T = 1/f). So the period of a 400 Hz sine wave is 0.0025 seconds. In milliseconds (1/1,000ths of a second) that’s 2.5 ms (0.0025 times 1,000).
3 Calculate the period of a 500 Hz sine wave.
4 Now we are going to derive two tables of numbers that constitute instructions for drawing 400 Hz and 500 Hz sine waves. To do this, add some new labels to the time axis on figure 1.3, once for the 400 Hz sine wave and once for the 500 Hz sine wave. The 400 Hz time axis will have 2.5 ms in place of 0.01 sec, because the 400 Hz sine wave completes one cycle in 2.5 ms. In place of 0.005 sec the 400 Hz time axis will have 1.25 ms. The peak of the 400 Hz sine wave occurs at 0.625 ms, and the valley at 1.875 ms. This gives us a table of times and amplitude values for the 400 Hz wave (where we assume that the amplitude of the peak is 1 and the amplitude of the valley is -1, and the amplitude value given for time 3.125 is the peak in the second cycle):
400 Hz time (ms): 0, 0.625, 1.25, 1.875, 2.5, 3.125
400 Hz amplitude: 0, 1, 0, -1, 0, 1
The interval between successive points in the waveform (with 90° between each point) is 0.625 ms. In the 500 Hz sine wave the interval between comparable points is 0.5 ms, which gives the corresponding table:
500 Hz time (ms): 0, 0.5, 1.0, 1.5, 2.0, 2.5
500 Hz amplitude: 0, 1, 0, -1, 0, 1
5 Now on your graph paper mark out 20 ms with 1 ms intervals. Also mark an amplitude scale from 1 to -1, allowing about an inch.
6 Draw the 400 Hz and 500 Hz sine waves by marking dots on the graph paper for the intersections indicated in the tables. For instance, the first dot in the 400 Hz sine wave will be at time 0 ms and amplitude 0, the second at time 0.625 ms and amplitude 1, and so on. Note that you may want to extend the table above to 20 ms (I stopped at 3.125 to keep the times right for the 400 Hz wave). When you have marked all the dots for the 400 Hz wave, connect the dots with a freehand sine wave. Then draw the 500 Hz sine wave in the same way, using the same time and amplitude axes. You should have a figure with overlapping sine waves something like figure 1.6.
7 Now add the two waves together. At each 0.5 ms point, take the sum of the amplitudes in the two sine waves to get the amplitude value of the new complex periodic wave, and then draw the smooth waveform by eye.
Take a look at the complex periodic wave that results from adding a 400 Hz sine wave and a 500 Hz sine wave. Does it have a fundamental frequency of 100 Hz? If it does, you should see two complete cycles in your 20 ms long complex wave; the waveform pattern from 10 ms to 20 ms should be an exact copy of the pattern that you see in the 0 ms to 10 ms interval.

Figure 1.6 shows another complex wave (and four of the sine waves that were added together to produce it). This wave shape approximates a sawtooth pattern. Unlike in the previous example, it is not possible to identify the component sine waves by looking at the complex wave pattern. Notice how all four of the component sine waves have positive peaks early in the complex wave’s cycle and negative peaks toward the end of the cycle. These peaks add together to produce a sharp peak early in the cycle and a sharp valley at the end of the cycle, and tend to cancel each other over the rest of the cycle. We can’t see individual peaks corresponding to the cycles of the component waves. Nonetheless, the complex wave was produced by adding together simple components.

Now let’s look at how to represent the frequency components that make up a complex periodic wave. What we’re looking for is a way to show the component sine waves of the complex wave when they are not easily visible in the waveform itself. One way to do this is to list the frequencies and amplitudes of the component sine waves like this:

In this discussion I am skipping over a complicated matter. We can describe the amplitudes of sine waves on a number of different measurement scales, relating to the magnitude of the wave, its intensity, or its perceived loudness (see chapter 4 for more discussion of this). In this chapter, I am representing the magnitude of the sound wave in relative terms, so that I don’t have to introduce units of measure for amplitude (instead I have to add this long apology!). So, the 200 Hz component has an amplitude that is one half the magnitude of the 100 Hz component, and so on.

Figure 1.7 shows a graph of these values with frequency on the horizontal axis and amplitude on the vertical axis. The graphical display of component frequencies is the best method for showing the simple periodic components of a complex periodic wave, because complex waves are often composed of so many frequency components that a table is impractical. An amplitude versus frequency plot of the simple sine wave components of a complex wave is called a power spectrum.
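A listing of component frequencies and relative amplitudes can also be held as a small table in code, and the complex wave rebuilt from it. The sketch below assumes four components in which the nth harmonic has 1/n times the amplitude of the 100 Hz component. Only the one-half value for 200 Hz is stated in the text, so the 300 Hz and 400 Hz amplitudes are my assumption, as are the variable names.

```python
import math

# Frequency (Hz) -> relative amplitude. The 1/n pattern beyond 200 Hz is assumed.
spectrum = {100: 1.0, 200: 1 / 2, 300: 1 / 3, 400: 1 / 4}


def wave_from_spectrum(t_ms, components):
    """Add up the sine wave components at a time given in milliseconds."""
    return sum(a * math.sin(2 * math.pi * f * t_ms / 1000.0) for f, a in components.items())


# A crude text version of an amplitude-versus-frequency plot.
for freq, amp in spectrum.items():
    print(f"{freq:>4} Hz  {'#' * round(amp * 40)}")

# One 10 ms cycle of the complex wave, sampled every 0.1 ms.
times_ms = [0.1 * i for i in range(100)]
values = [wave_from_spectrum(t, spectrum) for t in times_ms]
peak_time_ms = times_ms[values.index(max(values))]
valley_time_ms = times_ms[values.index(min(values))]
print("peak near", round(peak_time_ms, 1), "ms; valley near", round(valley_time_ms, 1), "ms")
```

The peak falls early in the cycle and the valley late, as in the sawtooth-like wave described above, even though no single component has that shape.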

Here’s why it is so important that complex periodic waves can be constructed by adding together sine waves. It is possible to produce an infinite variety of complex wave shapes by combining sine waves that have different frequencies, amplitudes, and phases. A related property of sound waves is that any complex acoustic wave can be analyzed in terms of the sine wave components that could have been used to produce that wave. That is, any complex waveform can be decomposed into a set of sine waves having particular frequencies, amplitudes, and phase relations. This property of sound waves is called Fourier’s theorem, after the French mathematician Joseph Fourier (1768–1830), who discovered it.

In Fourier analysis we take a complex periodic wave having an arbitrary number of components and derive the frequencies, amplitudes, and phases of those components. The result of Fourier analysis is a power spectrum similar to the one shown in figure 1.7. (We ignore the phases of the component waves, because these have only a minor impact on the perception of sound.)
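To make the idea concrete, here is a minimal sketch of Fourier analysis using a plain discrete Fourier transform written out in Python (slow, but short). It is not the method used to make figure 1.7. The sampling rate, the amplitude of 0.5 for the 500 Hz component, and its phase offset are all values assumed for this illustration.

```python
import cmath
import math


def dft_amplitudes(samples):
    """Amplitude of each frequency component below half the number of samples."""
    n = len(samples)
    amplitudes = []
    for k in range(n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        scale = 1 if k == 0 else 2   # each sine wave's energy is split over two DFT bins
        amplitudes.append(scale * abs(total) / n)
    return amplitudes


sample_rate = 8000   # samples per second (assumed)
n_samples = 80       # 10 ms: exactly one cycle of a 100 Hz fundamental

# A complex wave with a 400 Hz component and a weaker, phase-shifted 500 Hz component.
samples = [
    math.sin(2 * math.pi * 400 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 500 * i / sample_rate + 1.0)
    for i in range(n_samples)
]

spacing_hz = sample_rate / n_samples   # frequency interval between points in the spectrum
components = {
    k * spacing_hz: round(amp, 3)
    for k, amp in enumerate(dft_amplitudes(samples))
    if amp > 0.01
}
print(components)
```

The analysis recovers the frequencies and amplitudes of the two components. The phase offset of the 500 Hz component does not show up in the result, because only the magnitude of each component is kept.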

## 1.3.3 Aperiodic waves

Aperiodic sounds, unlike simple or complex periodic sounds, do not have a regularly repeating pattern; they have either a random waveform or a pattern that doesn’t repeat. Sound characterized by random pressure fluctuation is called white noise. It sounds something like radio static or wind blowing through trees. Even though white noise is not periodic, it is possible to perform a Fourier analysis on it; however, unlike Fourier analyses of periodic signals composed of only a few sine waves, the spectrum of white noise is not characterized by sharp peaks, but, rather, has equal amplitude for all possible frequency components (the spectrum is flat). Like sine waves, white noise is an abstraction, although many naturally occurring sounds are similar to white noise; for instance, the sound of the wind or fricative speech sounds like [s] or [f].

Figures 1.8 and 1.9 show the acoustic waveform and the power spectrum, respectively, of a sample of white noise. Note that the waveform shown in figure 1.8 is irregular, with no discernible repeating pattern. Note too that the spectrum shown in figure 1.9 is flat across the top. As we will see in chapter 3 (on digital signal processing), a Fourier analysis of a short chunk (called an “analysis window”) of a waveform leads to inaccuracies in the resultant spectrum. That’s why this spectrum has some peaks and valleys even though, according to theory, white noise should have a flat spectrum.
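The effect of a short analysis window can be sketched with simulated noise. This illustration is my own and is not how figures 1.8 and 1.9 were made. It assumes Gaussian random samples as a stand-in for white noise, a 64-sample window, and a fixed random seed. It compares the spectrum of one window with the average over 50 windows, using the standard deviation divided by the mean as a rough measure of how bumpy each spectrum is.

```python
import cmath
import math
import random


def power_spectrum(samples):
    """Power at each frequency point of a plain DFT (skipping 0 Hz and the top point)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))) ** 2 / n
        for k in range(1, n // 2)
    ]


def relative_spread(values):
    """Standard deviation divided by the mean: 0 would be a perfectly flat spectrum."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


rng = random.Random(0)   # fixed seed so the sketch is repeatable
window_length = 64       # samples per analysis window (assumed)
n_windows = 50

spectra = [
    power_spectrum([rng.gauss(0, 1) for _ in range(window_length)])
    for _ in range(n_windows)
]
single = spectra[0]
averaged = [sum(s[k] for s in spectra) / n_windows for k in range(len(single))]

spread_single = relative_spread(single)
spread_averaged = relative_spread(averaged)
print("one window:", round(spread_single, 2), " average of 50:", round(spread_averaged, 2))
```

A single short window gives a spectrum with peaks and valleys, while the average over many windows comes much closer to the flat spectrum that theory predicts.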
